Finding semantically similar documents to a given document

School of Science | Master's thesis

Mcode

T-106

Language

en

Pages

63 + [10]

Abstract

Contemporary research on information retrieval is dominated by statistical methods. Finding documents related to a given document is essentially an information retrieval problem in which the given document plays the role of the search query. To be compared, documents must first be transformed into a representation suitable for comparison. Information lost in this transformation cannot be recovered afterwards, so the representation has to capture all the main topics and concepts of the document to be used successfully in information retrieval applications. This thesis investigates statistical techniques for representing documents and explores ways to compare the resulting representations. The environment in which related documents are searched is an idea collaboration tool named Sproodle. In Sproodle, users write short ideas, and there is a need to find the ideas that are semantically closest to a given idea. Ideas in Sproodle are represented by keywords assigned by their authors, and related ideas are computed by simple keyword comparison. A prototype solution for generating keywords was built into the existing Sproodle system in order to discover whether extracted keywords are sufficient for describing an idea. The prototype is based on Term Frequency - Inverse Document Frequency (TF x IDF). A coarse study was made to evaluate the effectiveness of the prototype, in addition to the author's own judgment.
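
The keyword-based approach described in the abstract can be illustrated with a minimal sketch. It is not the thesis prototype: the abstract does not give the exact TF x IDF weighting or the comparison measure, so the sketch assumes relative term frequency multiplied by log(N / document frequency), and it assumes Jaccard overlap as the "simple keyword comparison". The function names `tokenize`, `tfidf_keywords`, and `keyword_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF x IDF terms of each document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(set(ranked[:top_k]))
    return keywords


def keyword_similarity(a, b):
    """Jaccard overlap of two keyword sets (an assumed comparison measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up ideas, for illustration only.
    ideas = ["solar solar panels roof", "solar bike sheds", "coffee lobby"]
    kw = tfidf_keywords(ideas, top_k=3)
    for i in range(1, len(ideas)):
        print(ideas[i], keyword_similarity(kw[0], kw[i]))
```

With this weighting, a term that occurs in every document scores zero, so only terms that distinguish an idea from the rest of the collection are likely to be selected as keywords.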

Supervisor

Tarhio, Jorma

Thesis advisor

Löfgren, Peter
