A text-based approach to industry classification

School of Business | Master's thesis

Date

2018

Degree programme

Information and Service Management (ISM)

Language

en

Pages

52 + 2

Abstract

Industry classification schemes are a critical topic in academic research because they are used to group companies into smaller sets that share similar characteristics. Although many studies in economics, accounting and finance depend heavily on these schemes, existing ones have significant limitations, mainly because they are static and therefore cannot adapt to constant innovation and technological development. The objective of this thesis is to propose an automated, text-based industry classification scheme that can reflect constant changes in industry scope. The thesis approaches the research problem through two research questions. First, it studies whether an industry classification scheme can be built from word-embedding vectors extracted from news articles. Second, it identifies the benefits of a text-based industry classification scheme in comparison with existing classification schemes. To identify these benefits, both qualitative and quantitative assessments are conducted to measure performance. The scheme is constructed from word-embedding vectors generated from news articles using the Word2Vec algorithm. Word2Vec is a recently developed text-mining tool that captures the relationships between words and expresses them in a quantifiable format. The key findings of this thesis are twofold. First, it is technically possible to build an automated, text-based industry classification scheme by using word-embedding vectors. Two methods of building the scheme are proposed. Second, the proposed text-based scheme performs well in classifying companies into relevant business categories. In addition, the cluster-based scheme performs better at grouping companies into financially homogeneous groups when its parameters are optimized.
The results suggest that a text-based industry classification scheme can serve as an alternative to existing industry classification schemes if its parameters are optimized for the intended use. The usefulness of the scheme is expected to increase as innovation and technological development accelerate.
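
The idea of grouping companies by the similarity of their embedding vectors can be illustrated with a minimal sketch. It is not the method used in the thesis, whose two scheme-building methods are not described in this record. The company names, the three-dimensional vectors, the similarity threshold and the function names below are all hypothetical; in practice the vectors would come from a Word2Vec model trained on news articles and would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_by_similarity(vectors, threshold):
    """Group names whose vectors are linked by similarity >= threshold.

    This is single-linkage grouping: two companies share a group if a chain
    of pairwise similarities at or above the threshold connects them.
    """
    names = sorted(vectors)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative (union-find).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical toy vectors standing in for Word2Vec company embeddings.
company_vectors = {
    "AlphaBank": [0.9, 0.1, 0.0],
    "BetaInsurance": [0.8, 0.2, 0.1],
    "GammaSoft": [0.1, 0.9, 0.2],
    "DeltaCloud": [0.0, 0.8, 0.3],
}

# The threshold is an assumed tuning parameter, not a value from the thesis.
industries = group_by_similarity(company_vectors, threshold=0.9)
print(industries)
```

Raising the threshold splits the companies into more, narrower groups and lowering it merges them. This is one simple way to see why a scheme of this kind depends on parameters that have to be tuned to its purpose.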

Thesis advisor

Malo, Pekka
Vilkkumaa, Eeva

Keywords

industry classification, cluster analysis, text mining, Word2Vec
