Machine Learning for Internet Security: Malware Detection and Web Image Classification

dc.contributor: Aalto-yliopisto (fi)
dc.contributor: Aalto University (en)
dc.contributor.advisor: Lendasse, Amaury
dc.contributor.advisor: Miche, Yoan
dc.contributor.author: Akusok, Anton
dc.contributor.department: Perustieteiden korkeakoulu (fi)
dc.contributor.school: Perustieteiden korkeakoulu (fi)
dc.contributor.school: School of Science (en)
dc.contributor.supervisor: Simula, Olli
dc.date.accessioned: 2020-12-28T15:01:28Z
dc.date.available: 2020-12-28T15:01:28Z
dc.date.issued: 2013
dc.description.abstract: In today's fast-moving, Internet-driven world, new opportunities are emerging to take advantage of the latest technologies. However, this empowerment is available not only for good purposes but also for various questionable and criminal activities. The first part of the thesis addresses the problem of automatic malware detection. An unusual restriction applied to malware classification is a strict zero False Positive rate. To satisfy this restriction, a two-stage methodology is proposed. Because the features are nominal, the first stage uses an adaptation of the MinHash algorithm that balances accuracy and running time. The second-stage classifier uses two ELMs, each with a hyper-parameter that adjusts the trade-off between coverage and the number of False Positives or False Negatives. The final output includes a third, "unknown" class, which sacrifices some coverage to achieve a near-zero False Positive rate (2 out of 38,000 on the test set). The second half of the thesis explores web image classification for web content filtering. The training dataset inherits the properties of real web images: high variability, often weak clues to the website class, and a large amount of semantic noise. For the classification, a suitable image representation and a two-stage methodology are proposed. Images are represented by their local features, with the local feature descriptor as the smallest processing unit. In the first stage, the class probability density in the descriptor space is estimated with a random Vector Quantization. In the second stage, the classes of images are derived from their classified descriptors, in an image-to-class fashion. The approach achieves an average accuracy of 35% in a 10-class setting, with accuracy for the "Adult" class above 70%. (en)
dc.format.extent: 70
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/100678
dc.identifier.urn: URN:NBN:fi:aalto-2020122859509
dc.language.iso: en (en)
dc.programme.major: Informaatiotekniikka (fi)
dc.programme.mcode: T-115 (fi)
dc.rights.accesslevel: closedAccess
dc.subject.keyword: classification (en)
dc.subject.keyword: nominal data (en)
dc.subject.keyword: image processing (en)
dc.subject.keyword: local features (en)
dc.subject.keyword: ELM (en)
dc.title: Machine Learning for Internet Security: Malware Detection and Web Image Classification (en)
dc.type.okm: G2 Pro gradu, diplomityö
dc.type.ontasot: Master's thesis (en)
dc.type.ontasot: Pro gradu -tutkielma (fi)
dc.type.publication: masterThesis
local.aalto.digiauth: ask
local.aalto.digifolder: Aalto_10409
local.aalto.idinssi: 46037
local.aalto.openaccess: no
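
The abstract states that the first malware-detection stage applies an adaptation of the MinHash algorithm to nominal features. This record does not describe that adaptation, so the following is only a minimal sketch of standard MinHash over sets of nominal feature strings. The function names, the example feature strings, the number of hash slots, and the use of BLAKE2b as the underlying hash are all illustrative assumptions, not details taken from the thesis.

```python
import hashlib


def minhash_signature(features, num_hashes=64):
    """Return a MinHash signature for a non-empty collection of nominal features.

    Hypothetical helper: each feature is treated as an opaque string, and each
    signature slot uses a differently salted hash to stand in for an
    independent hash function.
    """
    signature = []
    for seed in range(num_hashes):
        slot_min = min(
            int.from_bytes(
                hashlib.blake2b(
                    f"{seed}:{feature}".encode("utf-8"), digest_size=8
                ).digest(),
                "big",
            )
            for feature in features
        )
        signature.append(slot_min)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Made-up nominal features for two samples (assumed, for illustration only).
sample_a = {"imports:kernel32.dll", "section:.text", "packer:upx"}
sample_b = {"imports:kernel32.dll", "section:.text", "packer:none"}

sig_a = minhash_signature(sample_a)
sig_b = minhash_signature(sample_b)
print(estimated_jaccard(sig_a, sig_b))
```

The exact Jaccard similarity of the two example sets is 2/4 = 0.5; the printed value is an estimate of it whose error shrinks as `num_hashes` grows. Comparing fixed-length signatures instead of full feature sets is what allows accuracy to be traded against running time.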
