Finding semantically similar documents to a given document

School of Science | Master's thesis
Date
2010
Major/Subject
Software Technology (Ohjelmistotekniikka)
Mcode
T-106
Language
en
Pages
63 + [10]
Abstract
Contemporary research on information retrieval is dominated by statistical methods. Finding documents related to a given document is essentially an information retrieval problem in which the document itself plays the role of the search query. To be compared, documents must be represented in a form suitable for comparison. Information lost when a document is transformed into its representative form cannot be recovered afterwards, so the representative form has to capture all the main topics and concepts of the document to be used successfully in information retrieval applications. This thesis investigates statistical techniques for representing documents and explores ways to compare them. The environment in which related documents are searched is an idea collaboration tool named Sproodle. In Sproodle, users write short ideas, and there is a need to find the ideas that are semantically closest to a given idea. Ideas in Sproodle are represented as keywords assigned by their authors, and related ideas are then computed by simple keyword comparison. A prototype solution for generating keywords was built into the existing Sproodle system. Its purpose is to discover whether extracted keywords are sufficient for describing an idea. The prototype is based on Term Frequency - Inverse Document Frequency (TF x IDF). In addition to the author's own judgment, a coarse study was made to evaluate the effectiveness of the prototype.
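As a rough illustration of the approach the abstract describes, the sketch below scores terms with TF x IDF, keeps the highest-scoring terms of each idea as its keywords, and compares ideas by keyword overlap. It is not the thesis prototype: the function names, the tokenizer, the `top_k` cutoff, and the use of Jaccard similarity for the "simple keyword comparison" are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF x IDF terms for each document.

    Hypothetical helper: the cutoff and weighting variant are assumptions.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(set(ranked[:top_k]))
    return keywords


def keyword_overlap(a, b):
    """Jaccard similarity of two keyword sets (an assumed comparison)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    ideas = [
        "solar panels on the office roof",
        "solar panels for the parking garage",
        "free coffee in the office kitchen",
    ]
    kws = tfidf_keywords(ideas)
    # Compare the first idea against the others by keyword overlap.
    for i in range(1, len(ideas)):
        print(ideas[i], keyword_overlap(kws[0], kws[i]))
```

A term that occurs in every idea gets an IDF of zero under this weighting, so it can only be chosen as a keyword when an idea has too few distinguishing terms to fill the cutoff.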
Supervisor
Tarhio, Jorma
Thesis advisor
Löfgren, Peter
Keywords
document representation, information retrieval, keywords, related documents, term frequency