Machine Learning Methods for Classification of Unstructured Data

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Eirola, Emil, Dr., SILO AI, Finland
dc.contributor.author: Sayfullina, Luiza
dc.contributor.department: Tietotekniikan laitos [fi]
dc.contributor.department: Department of Computer Science [en]
dc.contributor.lab: Computer Vision Group [en]
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.school: School of Science [en]
dc.contributor.supervisor: Kannala, Juho, Prof., Aalto University, Department of Computer Science, Finland
dc.contributor.supervisor: Karhunen, Juha, Prof., Aalto University, Department of Computer Science, Finland
dc.date.accessioned: 2019-09-04T09:01:21Z
dc.date.available: 2019-09-04T09:01:21Z
dc.date.defence: 2019-09-27
dc.date.issued: 2019
dc.description.abstract: Natural language processing is a field that studies the automatic computational processing of human languages. Although natural language is symbolic and full of rules and ontologies, state-of-the-art approaches are typically based on statistical machine learning. With the invention of word embeddings, researchers have managed to circumvent the problem of sparse feature spaces and to take into account word semantics learned from large corpora. For artificial strings, e.g. in source code, the use of embeddings is restricted by the extremely large vocabulary. This dissertation covers two applications that use both embedding-based and bag-of-words approaches: one concerns industrial-scale Android malware classification, and the other concerns the extraction of soft skills and their impact on occupational gender segregation. The data in both applications is unstructured: Android applications consist of a set of files that are mainly unstructured or semi-structured, while the job postings used for soft skill analysis are free text with no clearly defined structure. The first part of the dissertation is dedicated to industrial-scale Android malware classification, covering the full pipeline from feature extraction to deployment. Various groups of features are extracted from Android installation package files, resulting in a large, high-dimensional, sparse feature space. We investigated ways to reduce the feature space efficiently from millions to thousands of features and managed to improve the decision boundary. Finally, we addressed the problem of fair model assessment by separating training and test samples in time, and evaluated the proposed ensemble-based methods accordingly. The second part of the dissertation is dedicated to statistical and machine-learning-based analysis of soft skills and their impact on occupational gender segregation. Soft skills are personality traits that facilitate human interaction. Our work is pioneering with respect to the large-scale analysis of soft skill requirements and their impact on salary. We show not only that soft skills are useful in predicting the estimated gender ratio of the corresponding job category, but also that most of them comply with gender stereotypes. Besides curating a soft skill list using job postings, we also propose various input representations that increase the precision of soft skill extraction by using the context in which a soft skill occurs. [en]
dc.format.extent: 80 + app. 85
dc.format.mimetype: application/pdf [en]
dc.identifier.isbn: 978-952-60-8675-0 (electronic)
dc.identifier.isbn: 978-952-60-8674-3 (printed)
dc.identifier.issn: 1799-4942 (electronic)
dc.identifier.issn: 1799-4934 (printed)
dc.identifier.issn: 1799-4934 (ISSN-L)
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/40155
dc.identifier.urn: URN:ISBN:978-952-60-8675-0
dc.language.iso: en [en]
dc.opn: Augenstein, Isabelle, Prof., University of Copenhagen, Denmark
dc.publisher: Aalto University [en]
dc.publisher: Aalto-yliopisto [fi]
dc.relation.haspart: [Publication 1]: Luiza Sayfullina, Emil Eirola, Dmitry Komashinsky, Paolo Palumbo, Yoan Miche, Amaury Lendasse, Juha Karhunen. Efficient Detection of Zero-day Android Malware Using Normalized Bernoulli Naive Bayes. In International Conference on Trust, Security and Privacy in Computing and Communications, 198–205, August 2015. DOI: 10.1109/Trustcom.2015.375
dc.relation.haspart: [Publication 2]: Luiza Sayfullina, Emil Eirola, Dmitry Komashinsky, Paolo Palumbo. Android malware detection: Building Useful Representations. In IEEE 15th International Conference on Machine Learning and Applications, 201–206, December 2016. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201612165986. DOI: 10.1109/ICMLA.2016.0041
dc.relation.haspart: [Publication 3]: Paolo Palumbo, Luiza Sayfullina, Dmitry Komashinsky, Emil Eirola, Juha Karhunen. Pragmatic Android Malware Detection. Computers and Security, 689–701, July 2017. DOI: 10.1016/j.cose.2017.07.013
dc.relation.haspart: [Publication 4]: Luiza Sayfullina, Eric Malmi, YiPing Liao, Alex Jung. Domain Adaptation for Resume Classification Using Convolutional Neural Networks. In The 6th International Conference on Analysis of Images, Social Networks, and Texts, 82–93, December 2017. DOI: 10.1007/978-3-319-73013-4_8
dc.relation.haspart: [Publication 5]: Federica Calanca, Luiza Sayfullina, Lara Minkus, Claudia Wagner, Eric Malmi. Responsible team players wanted: An analysis of soft skill requirements in job advertisements. EPJ Data Science, p. 13, April 2019. Full text in Acris/Aaltodoc: http://urn.fi/URN:NBN:fi:aalto-201906033403. DOI: 10.1140/epjds/s13688-019-0190-z
dc.relation.haspart: [Publication 6]: Luiza Sayfullina, Eric Malmi, Juho Kannala. Learning Representations for Soft Skills Matching. In The 7th International Conference on Analysis of Images, Social Networks, and Texts, p. 12, December 2018.
dc.relation.ispartofseries: Aalto University publication series DOCTORAL DISSERTATIONS [en]
dc.relation.ispartofseries: 146/2019
dc.rev: Firat, Orhan, Dr., Google, USA
dc.rev: Stakhanova, Natalia, Prof., University of Saskatchewan, Canada
dc.subject.keyword: machine learning [en]
dc.subject.keyword: natural language processing [en]
dc.subject.keyword: neural networks [en]
dc.subject.keyword: android malware [en]
dc.subject.keyword: soft skills [en]
dc.subject.keyword: job recommender systems [en]
dc.subject.keyword: text classification [en]
dc.subject.keyword: occupational segregation [en]
dc.subject.other: Computer science [en]
dc.title: Machine Learning Methods for Classification of Unstructured Data [en]
dc.type: G5 Artikkeliväitöskirja [fi]
dc.type.dcmitype: text [en]
dc.type.ontasot: Doctoral dissertation (article-based) [en]
dc.type.ontasot: Väitöskirja (artikkeli) [fi]
local.aalto.acrisexportstatus: checked 2019-10-29_1517
local.aalto.archive: yes
local.aalto.formfolder: 2019_09_03_klo_12_26
local.aalto.infra: Science-IT

Files

Original bundle

Name: isbn9789526086750.pdf
Size: 1.83 MB
Format: Adobe Portable Document Format