Machine Learning for Internet Security: Malware Detection and Web Image Classification

School of Science | Master's thesis
Date
2013
Major/Subject
Information Technology (Informaatiotekniikka)
Mcode
T-115
Language
en
Pages
70
Abstract
In today's fast-moving, Internet-driven world, new opportunities are emerging to take advantage of the latest technologies. However, this empowerment is available not only for good purposes but also for various questionable and criminal activities. The first part of the thesis addresses the problem of automatic malware detection. An unusual restriction applied to malware classification is a strict zero False Positives rate. To satisfy this restriction, a two-stage methodology is proposed. Because the features are nominal, the first stage uses an adaptation of the MinHash algorithm, balanced between accuracy and running time. The second-stage classifier uses two ELMs (Extreme Learning Machines), each with a hyper-parameter that adjusts the trade-off between coverage and the number of False Positives/Negatives. The final outputs include a third, "unknown" class, sacrificing some coverage to achieve a near-zero False Positives rate (2 out of 38,000 on the test set). The second part of the thesis explores web image classification for web content filtering. The training dataset inherits the properties of real web images: high variability, often weak clues to the website class, and a large amount of semantic noise. For the classification, a suitable image representation and a two-stage methodology are proposed. Images are represented by their local features, with the local feature descriptor being the smallest processing unit. In the first stage, the class probability density in the descriptor space is estimated with a random Vector Quantization. In the second stage, the classes of images are derived from their classified descriptors, in an image-to-class fashion. The approach achieves an average accuracy of 35% in a 10-class setting, with accuracy for the "Adult" class exceeding 70%.
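As an illustration of why MinHash suits nominal features, the following is a minimal sketch of the standard MinHash similarity estimate, not the adaptation developed in the thesis. It treats each sample as a set of nominal tokens and estimates the Jaccard similarity between two samples from short signatures. The function names, the choice of hash function, and the example feature strings are all assumptions made for this sketch.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    # Seeded 64-bit hash of a nominal token (the seed selects the hash function).
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_hashes=256):
    # Signature = minimum hash value over the set, for each seeded hash function.
    # `features` must be a non-empty set of strings.
    return [min(_hash(seed, f) for f in features) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    # Exact Jaccard similarity, used here only to check the estimate.
    return len(a & b) / len(a | b)


# Hypothetical nominal features of two samples (made up for illustration).
sample_a = {"imports:kernel32.dll", "section:.text", "packer:none", "api:CreateFile"}
sample_b = {"imports:kernel32.dll", "section:.text", "packer:upx", "api:CreateFile"}

sig_a = minhash_signature(sample_a)
sig_b = minhash_signature(sample_b)
print(exact_jaccard(sample_a, sample_b), estimate_jaccard(sig_a, sig_b))
```

The signature length trades accuracy for running time: more hash functions give a tighter similarity estimate at a proportionally higher cost, which is the kind of balance the abstract refers to.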
Description
Supervisor
Simula, Olli
Thesis advisor
Lendasse, Amaury
Miche, Yoan
Keywords
classification, nominal data, image processing, local features, ELM