Finding semantically similar documents to a given document

dc.contributor: Aalto University [en]
dc.contributor.advisor: Löfgren, Peter
dc.contributor.author: Ihrcke, Erik
dc.contributor.department: Informaatio- ja luonnontieteiden tiedekunta [fi]
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.school: School of Science [en]
dc.contributor.supervisor: Tarhio, Jorma
dc.description.abstract: Contemporary research on information retrieval is dominated by statistical methods. Finding documents related to a given document is essentially an information retrieval problem in which the document itself plays the role of the search query. To compare documents, they must first be represented in a form suitable for comparison. Information lost in transforming a document into this representation cannot be recovered afterwards, so the representation has to capture all the main topics and concepts of the document to be used successfully in information retrieval applications. This thesis investigates statistical techniques for representing documents and explores ways to compare the representations. The environment for searching related documents is an idea collaboration tool named Sproodle. In Sproodle, users write short ideas, and there is a need to find the ideas that are semantically closest to a given idea. Ideas are represented by keywords assigned by their authors, and related ideas are computed by simple keyword comparison. A prototype that generates keywords was built into the existing Sproodle system to discover whether extracted keywords are sufficient for describing an idea. The prototype is based on Term Frequency - Inverse Document Frequency (TF x IDF). A coarse study was carried out to evaluate the effectiveness of the prototype, in addition to the author's own judgment. [en]
dc.format.extent: 63 + [10]
dc.subject.keyword: document representation [en]
dc.subject.keyword: dokument representation [sv]
dc.subject.keyword: information retrieval [en]
dc.subject.keyword: related documents [en]
dc.subject.keyword: term frekvens [sv]
dc.title: Finding semantically similar documents to a given document [en]
dc.title: Hitta semantiskt relaterade dokument till ett givet dokument [sv]
dc.type.okm: G2 Pro gradu, diplomityö
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Pro gradu -tutkielma [fi]