Browsing by Author "Sayfullina, Luiza"
Now showing 1 - 6 of 6
Item Android Malware Detection (2017) Sayfullina, Luiza; Eirola, Emil; Komashinskiy, Dmitri; Palumbo, Paolo; Karhunen, Juha; Department of Computer Science; Arcada University of Applied Sciences; F-Secure
The problem of proactively detecting Android malware has proven to be a challenging one. The challenges stem from a variety of issues, but recent literature has shown that this task is hard to solve with high accuracy when only a restricted set of features, such as permissions or similar fixed sets of features, is used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond a reasonable size. In this paper we focus on finding an efficient way to select a representative feature space while preserving its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too heavy for large-scale problems with millions of features. In particular, we show that many feature groups that can be extracted from Android application packages, such as features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds and, finally, random projections. With the size of the feature space growing exponentially as a function of the training set's size, our approach decreases the size of the feature space by several orders of magnitude, which in turn makes accurate classification possible in a real-world scenario. After reducing the dimensionality we use the feature groups in a light-weight ensemble of logistic classifiers.
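The two reduction steps named in this abstract, feature selection via log-odds followed by random projections, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the add-one smoothing, the ranking by absolute log-odds ratio and the ±1 projection matrix are all hypothetical choices, and the toy permission strings are made up.

```python
import math
import random


def log_odds_scores(samples, labels, smoothing=1.0):
    """Score each binary feature by its log-odds ratio (malware vs. benign).

    samples: list of feature-string lists; labels: 1 = malware, 0 = benign.
    The smoothing constant is an assumption to avoid log(0).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = {}, {}
    for features, label in zip(samples, labels):
        counts = pos_counts if label == 1 else neg_counts
        for f in set(features):
            counts[f] = counts.get(f, 0) + 1
    scores = {}
    for f in set(pos_counts) | set(neg_counts):
        p_pos = (pos_counts.get(f, 0) + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_counts.get(f, 0) + smoothing) / (n_neg + 2 * smoothing)
        scores[f] = math.log(p_pos / (1 - p_pos)) - math.log(p_neg / (1 - p_neg))
    return scores


def select_features(scores, k):
    """Keep the k features with the largest absolute log-odds ratio."""
    ranked = sorted(scores, key=lambda f: (-abs(scores[f]), f))
    return ranked[:k]


def random_projection(samples, selected, dim, seed=0):
    """Project binary indicator vectors over `selected` onto `dim` dimensions
    using a random +/-1 matrix scaled by 1/sqrt(dim)."""
    rng = random.Random(seed)
    signs = {f: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for f in selected}
    scale = 1.0 / math.sqrt(dim)
    projected = []
    for features in samples:
        vec = [0.0] * dim
        for f in set(features):
            row = signs.get(f)
            if row is None:
                continue  # feature was filtered out by the selection step
            for j in range(dim):
                vec[j] += scale * row[j]
        projected.append(vec)
    return projected


# Toy data (hypothetical): one feature group, e.g. manifest permissions.
samples = [["SEND_SMS", "INTERNET"], ["SEND_SMS"], ["INTERNET"], ["INTERNET", "CAMERA"]]
labels = [1, 1, 0, 0]
scores = log_odds_scores(samples, labels)
kept = select_features(scores, 2)
reduced = random_projection(samples, kept, dim=3)
```

In the scheme the abstract describes, a reduction of this kind would be applied to each feature group separately before the per-group classifiers are combined.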
We evaluated the proposed classification scheme on real malware data provided by the antivirus vendor and achieved a state-of-the-art 88.24% true positive rate and a reasonably low 0.04% false positive rate with a significantly compressed feature space on a balanced test set of 10,000 samples.

Item Domain Adaptation for Resume Classification Using Convolutional Neural Networks (Springer, 2018) Sayfullina, Luiza; Malmi, Eric; Liao, Yiping; Jung, Alexander; Department of Computer Science; Adj. Prof. Gionis Aris group; Professorship Kaski Samuel; van der Aalst, Wil M.P.; Ignatov, Dmitry I.; Khachay, Michael; Kuznetsov, Sergei O.; Lempitsky, Victor; Lomazova, Irina A.; Loukachevitch, Natalia; Napoli, Amedeo; Panchenko, Alexander; Pardalos, Panos M.; Savchenko, Andrey V.; Wasserman, Stanley
We propose a novel method for classifying resume data of job applicants into 27 different job categories using convolutional neural networks. Since resume data is costly and hard to obtain due to its sensitive nature, we use domain adaptation. In particular, we train a classifier on a large number of freely available job description snippets and then use it to classify resume data. We empirically verify a reasonable classification performance of our approach despite having only a small amount of labeled resume data available.

Item Learning representations for soft skill matching (2018-01-01) Sayfullina, Luiza; Malmi, Eric; Kannala, Juho; Department of Computer Science; School services,SCI; Panchenko, Alexander; van der Aalst, Wil M.; Khachay, Michael; Pardalos, Panos M.; Batagelj, Vladimir; Loukachevitch, Natalia; Glavaš, Goran; Ignatov, Dmitry I.; Kuznetsov, Sergei O.; Koltsova, Olessia; Lomazova, Irina A.; Savchenko, Andrey V.; Napoli, Amedeo; Pelillo, Marcello; Professorship Kannala Juho; Adj. Prof. Gionis Aris group; Helsinki Institute for Information Technology (HIIT)
Employers actively look for talent with not only specific hard skills but also various soft skills.
To analyze the soft skill demands on the job market, it is important to be able to detect soft skill phrases from job advertisements automatically. However, a naive matching of soft skill phrases can lead to false positive matches when a soft skill phrase, such as friendly, is used to describe a company, a team, or another entity, rather than a desired candidate. In this paper, we propose a phrase-matching-based approach which differentiates between soft skill phrases referring to a candidate vs. something else. The disambiguation is formulated as a binary text classification problem where the prediction is made for the potential soft skill based on the context where it occurs. To inform the model about the soft skill for which the prediction is made, we develop several approaches, including soft skill masking and soft skill tagging. We compare several neural network based approaches, including CNN, LSTM and Hierarchical Attention Model. The proposed tagging-based input representation using LSTM achieved the highest recall of 83.92% on the job dataset at a fixed precision of 95%.

Item Machine Learning Methods for Classification of Unstructured Data (Aalto University, 2019) Sayfullina, Luiza; Eirola, Emil, Dr., SILO AI, Finland; Tietotekniikan laitos; Department of Computer Science; Computer Vision Group; Perustieteiden korkeakoulu; School of Science; Kannala, Juho, Prof., Aalto University, Department of Computer Science, Finland; Karhunen Juha, Prof., Aalto University, Department of Computer Science, Finland
Natural language processing is a field that studies automatic computational processing of human languages. Although natural language is symbolic and full of rules and ontologies, the state-of-the-art approaches are typically based on statistical machine learning. With the invention of word embeddings, researchers have managed to circumvent the problem of a sparse feature space and to take into account word semantics learned from large corpora.
When it comes to artificial strings, e.g. in source code, the usage of embeddings is restricted due to the extremely large vocabulary. This dissertation covers two applications using both embedding-based and bag-of-words approaches: one related to industrial-scale Android malware classification and another to the extraction of soft skills and their impact on occupational gender segregation. Data coming from both applications is unstructured: Android applications consist of a set of files that are mainly unstructured or semi-structured, while the job postings used for soft skill analysis are free text with no clearly defined structure. The first part of the dissertation is dedicated to industrial-scale Android malware classification, covering a full pipeline from feature extraction to deployment. Various groups of features are extracted from Android installation package files, resulting in a large, high-dimensional, sparse feature space. We investigated ways to reduce the feature space from millions to thousands of features efficiently and managed to improve the decision boundary. Finally, we addressed the problem of fair model assessment by separating training and test samples in time and evaluated the proposed ensemble-based methods accordingly. The second part of the dissertation is dedicated to statistical and machine learning based analysis of soft skills and their impact on occupational gender segregation. Soft skills are personality traits facilitating human interaction. Our work is pioneering with respect to large-scale analysis of soft skill requirements and their impact on salary. We show not only that soft skills are useful in predicting the gender ratio of the corresponding job category, but also that most of them comply with gender stereotypes.
Besides curating a soft skill list using job postings, we also propose various input representations to increase the precision of soft skill extraction using the context in which the soft skill occurs.

Item Reducing Sparsity in Sentiment Analysis Data using Novel Dimensionality Reduction Approaches (2014-11-03) Sayfullina, Luiza; Miche, Yoan; Perustieteiden korkeakoulu; Karhunen, Juha
No aspect of our mental life is more important to the quality and meaning of our existence than emotions and sentiments. Recently researchers have introduced many Machine Learning approaches to analyse sentiment from public blogs, social networks, etc. Because textual datasets are sparse and high-dimensional, one needs Feature Selection before applying classifiers. The scope of my thesis is Dimensionality Reduction techniques for predicting one of two opposite sentiments, specifically for Polarity Classification. The greatest challenge for Text Classification problems in general is data sparsity. This is especially true for the Bag-of-words model, where the document is represented by the number of occurrences of each term in the vocabulary. Hence it can be hard for a classifier to understand the relationships between all the words in the initial vocabulary when the training set is not large enough. In this thesis I investigate possible steps required to decrease the sparsity: setting the vocabulary, using sentiment dictionaries, choosing the data representation and Dimensionality Reduction methods and their underlying strategies. I describe fast and intuitive unsupervised and supervised tf-idf scores for Feature Ranking. In addition, a Word Clustering algorithm for merging words with very close semantic meaning is introduced. By clustering semantically close words we decrease the feature space with minimum loss of information compared to Feature Selection, where we simply omit the features.
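As a rough illustration of ranking features by a tf-idf score, the sketch below computes a plain unsupervised tf-idf weight per term and sorts the vocabulary by it. The scoring formula, the function name and the toy documents are assumptions made for this example; the thesis's own unsupervised and supervised scores are not specified in this abstract and may differ.

```python
import math
from collections import Counter


def tfidf_feature_ranking(documents):
    """Rank vocabulary terms by summed tf-idf over all documents.

    documents: list of token lists. Returns (term, score) pairs,
    highest score first; ties are broken alphabetically.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    scores = Counter()
    for doc in documents:
        term_counts = Counter(doc)
        for term, count in term_counts.items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += (count / len(doc)) * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical); keeping only the top-ranked terms shrinks the vocabulary.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
ranking = tfidf_feature_ranking(docs)
top_terms = [term for term, _ in ranking[:2]]
```

Feature Selection then amounts to keeping the highest-ranked terms, whereas the Word Clustering described above would merge related terms instead of discarding them.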
The Polarity Classification problem is investigated on two datasets, SemEval 2013 Twitter Sentiment Analysis and KDD Project Excitement Prediction, using Extreme Learning Machine. The best performance on both datasets was achieved by using the proposed Word Clustering and supervised tf-idf score with 20 times fewer features than the original vocabulary size.

Item Responsible team players wanted (Springer Science + Business Media, 2019-04-27) Calanca, Federica; Sayfullina, Luiza; Minkus, Lara; Wagner, Claudia; Malmi, Eric; Sapienza University of Rome; Professorship Kannala Juho; University of Bremen; Leibniz Institute for the Social Sciences; Department of Computer Science
During the past decades the importance of soft skills for labour market outcomes has grown substantially. This carries implications for labour market inequality, since previous research shows that soft skills are not valued equally across race and gender. This work explores the role of soft skills in job advertisements by drawing on methods from computational science as well as on theoretical and empirical insights from economics, sociology and psychology. We present a semi-automatic approach based on crowdsourcing and text mining for extracting a list of soft skills. We find that soft skills are a crucial component of job ads, especially of low-paid jobs and jobs in female-dominated professions. Our work shows that soft skills can serve as partial predictors of the gender composition in job categories and that not all soft skills receive equal wage returns in the labour market. In particular, “female” skills are frequently associated with wage penalties. Our results expand the growing literature on the association of soft skills with wage inequality and highlight their importance for occupational gender segregation in labour markets.