Automatic keyword extraction for a partial search engine index

School of Science (Perustieteiden korkeakoulu) | Master's thesis
Date
2023-10-09
Major/Subject
Data Science
Mcode
SCI3115
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
79+5
Abstract
Full-text search engines play a critical role in many enterprise applications, where the quantity and complexity of the information are overwhelming. Promptly finding documents that contain relevant information for pressing questions is a necessity for efficient operation. This is especially the case for financial and legal teams executing Mergers and Acquisitions deals. The goal of the thesis is to provide search services for such teams without storing the sensitive documents involved, minimising the risk of potential data leaks. A literature review of related methods and concepts is presented. As search engine technologies that use encrypted indices for commercial applications are still in their early stages, the solution proposed in the thesis is the use of partial indexing by keyword extraction. A cosine similarity-based evaluation was used to measure the performance difference between the keyword-based partial index and the complete index. The partial indices were constructed using unsupervised keyword extraction methods based on term frequency, document graphs, and topic modelling. The frequency-based methods were term frequency, TF-IDF, and YAKE!. The graph-based method was TextRank. The topic modelling-based methods were NMF, LDA, and LSI. The methods were evaluated by running 51 reference queries on the LEDGAR data set, which contains 60,540 contracts. The results show that using only five keywords per document from the TF-IDF or YAKE! methods, the best matching documents in the result lists have a cosine similarity of 0.7 on average. This value is reasonably high, particularly considering the small number of keywords. The topic modelling-based methods were found to perform poorly due to being too general. The term frequency and TextRank methods were mediocre.
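The partial-indexing idea in the abstract can be illustrated with a minimal sketch, assuming a plain TF-IDF weighting with smoothed IDF, a tiny hand-picked stop-word list, and three made-up contract snippets. None of these details come from the thesis: the function names, the stop words, the sample texts, and the way cosine similarity is applied here are illustrative assumptions, not the evaluation protocol used on LEDGAR.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this toy example.
STOPWORDS = {"the", "of", "at", "by", "be", "not", "shall", "under"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (relative tf * smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def top_k_keywords(vector, k=5):
    """Keep only the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, index):
    """Rank document ids by cosine similarity to the query, best match first."""
    q = Counter(tokenize(query))
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in enumerate(index)]
    return sorted(scored, key=lambda s: -s[1])


# Made-up sample documents (not taken from LEDGAR).
docs = [
    "The buyer shall pay the purchase price at closing of the merger.",
    "The tenant shall pay rent monthly under the lease agreement.",
    "Confidential information shall not be disclosed by the receiving party.",
]

full_index = tfidf_vectors(docs)                                # complete index
partial_index = [top_k_keywords(v, k=5) for v in full_index]    # keywords only

query = "merger purchase price"
print("full:   ", search(query, full_index)[0])
print("partial:", search(query, partial_index)[0])
print("doc 0 full vs partial:", round(cosine(full_index[0], partial_index[0]), 3))
```

In this sketch only the five retained keywords per document would need to be stored, which is the property the abstract relies on to avoid keeping the sensitive documents themselves. Comparing a document's full vector with its keyword-only vector is just one simple way to see how much information the truncation discards.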
Supervisor
Hämäläinen, Wilhelmiina
Thesis advisor
Jones, Matthew
Keywords
full-text search engine, partial indexing, similarity-based evaluation, keyword extraction, financial documents