Android Malware Detection: Building Useful Representations

Conference article in proceedings
This publication is imported from Aalto University research portal.
Proc. of The IEEE 15th Int. Conf. on Machine Learning and Applications (ICMLA 2016)
The problem of proactively detecting Android malware has proven to be a challenging one. The challenges stem from a variety of issues, but recent literature has shown that the task is hard to solve with high accuracy when only a restricted, fixed set of features, such as permissions, is used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond a reasonable size. In this paper we focus on finding an efficient way to select a representative feature space that preserves its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too computationally heavy for large-scale problems with millions of features. In particular, we show that many feature groups that can be extracted from Android application packages, such as features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds, and finally random projections. With the size of the feature space growing exponentially as a function of the training set's size, our approach decreases the size of the feature space by several orders of magnitude, which in turn makes accurate classification possible in a real-world scenario. After reducing the dimensionality, we use the feature groups in a lightweight ensemble of logistic classifiers. We evaluated the proposed classification scheme on real malware data provided by an antivirus vendor and achieved a state-of-the-art true positive rate of 88.24% and a reasonably low false positive rate of 0.04% with a significantly compressed feature space on a balanced test set of 10,000 samples.
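As a rough illustration of the per-group reduction steps named in the abstract (log-odds feature selection followed by a random projection), the following minimal sketch may help. It is not the authors' implementation: the abstract does not give the exact log-odds definition, the smoothing, or the projection matrix, so the smoothed per-class odds, the `alpha` parameter, the random sign matrix, and all function names and toy features here are assumptions chosen for illustration. The logistic ensemble is omitted.

```python
import math
import random

def log_odds_scores(samples, labels, alpha=1.0):
    """Smoothed log-odds of each feature occurring in malware (1) vs. benign (0).

    The smoothing scheme (alpha) is an assumption, not taken from the paper.
    """
    n_mal = sum(labels)
    n_ben = len(labels) - n_mal
    mal_counts, ben_counts = {}, {}
    for feats, y in zip(samples, labels):
        counts = mal_counts if y == 1 else ben_counts
        for f in set(feats):
            counts[f] = counts.get(f, 0) + 1
    scores = {}
    for f in set(mal_counts) | set(ben_counts):
        p_mal = (mal_counts.get(f, 0) + alpha) / (n_mal + 2 * alpha)
        p_ben = (ben_counts.get(f, 0) + alpha) / (n_ben + 2 * alpha)
        scores[f] = math.log(p_mal / (1 - p_mal)) - math.log(p_ben / (1 - p_ben))
    return scores

def select_features(scores, k):
    """Keep the k features with the largest absolute log-odds (ties by name)."""
    return sorted(scores, key=lambda f: (-abs(scores[f]), f))[:k]

def random_projection(samples, vocab, out_dim, seed=0):
    """Project binary indicator vectors over `vocab` to `out_dim` dimensions
    using random +/-1 signs (a hypothetical choice of projection matrix)."""
    rng = random.Random(seed)
    signs = {f: [rng.choice((-1.0, 1.0)) for _ in range(out_dim)] for f in vocab}
    scale = 1.0 / math.sqrt(out_dim)
    projected = []
    for feats in samples:
        row = [0.0] * out_dim
        for f in set(feats):
            if f in signs:
                for j in range(out_dim):
                    row[j] += scale * signs[f][j]
        projected.append(row)
    return projected

# Toy "manifest" feature group: each sample is a set of permission strings.
samples = [
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS", "READ_CONTACTS", "INTERNET"},
    {"INTERNET", "CAMERA"},
    {"INTERNET"},
]
labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign (made-up labels)

scores = log_odds_scores(samples, labels)
selected = select_features(scores, k=2)
projected = random_projection(samples, selected, out_dim=3)
```

In this toy data `INTERNET` appears equally often in both classes, so its log-odds is zero and it is dropped, while `SEND_SMS` ranks first. In a full pipeline each feature group would be reduced this way on its own and then passed to its own logistic classifier.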
Android, Dimensionality reduction, Feature selection, Logistic regression, Malware classification, Random projection
Other note
Sayfullina, L., Eirola, E., Komashinskiy, D., Palumbo, P. & Karhunen, J. 2017, Android Malware Detection: Building Useful Representations. In 2016 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016, Proceedings: Anaheim, California, USA, December 18-20, 2016. IEEE, pp. 201-206, IEEE International Conference on Machine Learning and Applications, Anaheim, California, United States, 18/12/2016.