Browsing by Author "Tatti, Nikolaj"
Now showing 1 - 12 of 12
- Adapting to Concept Drift in Malware Detection
School of Science | Master's thesis (2020-03-16) Kumpulainen, Iiro
This Master's thesis studies methods for detecting malware while adapting to how malware changes over time. Malware detection methods are necessary to defend against cyber attacks, which could otherwise compromise information systems that are integral to daily life in modern societies. Machine learning models have been shown to be effective in classifying files as either malicious or benign. However, malware and other software are constantly evolving, which prevents classifiers from learning what future malware will look like. Therefore, methods are required for adapting to this concept drift in malware detection. In this thesis, we compare three different methods, called retrain, threshold, and hybrid, that update the classifier to adapt to changes in the data over time. First, we create a classifier for detecting whether a Windows portable executable file is malicious or benign. We use a dataset of 61466 files consisting of 19648 known malware and 41818 clean files. The classifier uses features that are extracted using static analysis, without executing the files. To choose which classifier to use in malware detection, we compare random forests, gradient boosted decision trees, neural networks, and support vector machines. Gradient boosted decision trees achieve the highest accuracy and AUC, and using a random forest to select the most important features reduces the training time of the model. This classifier achieves an accuracy of 98.7% when using cross-validation to evaluate the model. In contrast, when using temporally ordered files, the classifier has an accuracy of 96.7%, showing the presence of concept drift in malware detection. Each of the proposed retrain, threshold, and hybrid methods improves the accuracy of the classifier by adapting to concept drift. The best-performing method is hybrid, a combination of the two other methods. It correctly classifies 98.5% of previously unseen files, yielding a significant improvement in the malware detection capabilities of the classifier.
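As an illustration of the retrain strategy described above, the following sketch evaluates a gradient boosted classifier on temporally ordered chunks, once kept static and once refit on all data seen so far. The synthetic chunks, drift pattern, and chunk size are hypothetical stand-ins for the thesis's PE-file features; the threshold and hybrid variants are not shown.

```python
# Hedged sketch: periodic retraining on temporally ordered data (synthetic
# stand-in for static PE-file features). Chunk size and drift are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_chunk(t, n=500, d=20):
    """Synthetic monthly chunk whose class boundary drifts with time t."""
    X = rng.normal(size=(n, d))
    w = np.ones(d) + 0.3 * t          # drifting decision boundary
    y = (X @ w + rng.normal(scale=2.0, size=n) > 0).astype(int)
    return X, y

chunks = [make_chunk(t) for t in range(12)]          # temporally ordered data
static_clf = GradientBoostingClassifier().fit(*chunks[0])

static_acc, retrain_acc = [], []
for t in range(1, 12):
    X_t, y_t = chunks[t]
    static_acc.append(accuracy_score(y_t, static_clf.predict(X_t)))
    # "retrain" strategy: refit on all data seen so far before scoring chunk t
    X_hist = np.vstack([c[0] for c in chunks[:t]])
    y_hist = np.concatenate([c[1] for c in chunks[:t]])
    retrained = GradientBoostingClassifier().fit(X_hist, y_hist)
    retrain_acc.append(accuracy_score(y_t, retrained.predict(X_t)))

print(f"static model:     {np.mean(static_acc):.3f}")
print(f"retrain strategy: {np.mean(retrain_acc):.3f}")
```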
- Advances in Mining Binary Data: Itemsets as Summaries
School of Science | Doctoral dissertation (monograph) (2008) Tatti, Nikolaj
Mining frequent itemsets is one of the most popular topics in data mining. Itemsets are local patterns, representing frequently co-occurring sets of variables. This thesis studies the use of itemsets to give information about the whole dataset. We show how to use itemsets to answer queries, that is, to find the number of transactions satisfying some given formula. While this is a simple procedure given the original data, the task becomes computationally infeasible if we seek the solution using the itemsets. By making some assumptions about the structure of the itemsets and applying techniques from the theory of Markov Random Fields, we are able to reduce the computational burden of query answering. We can also use the known itemsets to predict the unknown itemsets. The difference between the prediction and the actual value can be used for ranking itemsets. In fact, this method can be seen as a generalisation of ranking itemsets based on their deviation from the independence model, an approach commonly used in the data mining literature. The next contribution is to use itemsets to define a distance between datasets. We achieve this by computing the difference between the frequencies of the itemsets. We take into account the fact that the itemset frequencies may be correlated, and by removing the correlation we show that our distance transforms into the Euclidean distance between the frequencies of parity formulae. The last contribution concerns calculating the effective dimension of binary data. We apply the fractal dimension, a known concept that works well with real-valued data. Applying the fractal dimension directly is problematic because of the unique nature of binary data. We propose a solution to this problem by introducing a new concept called the normalised correlation dimension. We study our approach theoretically and empirically by comparing it against other methods.
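The ranking-by-deviation idea mentioned above can be illustrated with a small sketch that scores itemsets by how far their observed support is from the prediction of the independence model; the toy data and the choice of itemset sizes are illustrative assumptions, not the thesis's full Markov-Random-Field machinery.

```python
# Hedged sketch: rank itemsets by deviation of observed support from the
# independence-model prediction (toy data only).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
data = (rng.random((1000, 5)) < [0.6, 0.5, 0.4, 0.3, 0.2]).astype(int)
data[:, 1] = data[:, 0]               # make items 0 and 1 strongly correlated

def support(itemset):
    return data[:, list(itemset)].all(axis=1).mean()

margins = data.mean(axis=0)           # single-item frequencies

scores = []
for k in (2, 3):
    for itemset in combinations(range(data.shape[1]), k):
        expected = np.prod(margins[list(itemset)])   # independence prediction
        scores.append((abs(support(itemset) - expected), itemset))

# itemsets containing both 0 and 1 should rank highest
for deviation, itemset in sorted(scores, reverse=True)[:5]:
    print(itemset, round(deviation, 3))
```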
- Balancing information exposure in social networks
A4 Article in conference proceedings (2017) Garimella, Kiran; Gionis, Aristides; Parotsidis, Nikos; Tatti, Nikolaj
Social media has revolutionised how people consume news. Beyond the undoubtedly large number of advantages brought by social-media platforms, a point of criticism has been the creation of echo chambers and filter bubbles, caused by social homophily and algorithmic personalization. In this paper we address the problem of balancing the information exposure in a social network. We assume that two opposing campaigns (or viewpoints) are present in the network, and that network nodes have different preferences towards these campaigns. Our goal is to find two sets of nodes to employ in the respective campaigns, so that the overall information exposure for the two campaigns is balanced. We formally define the problem, characterize its hardness, develop approximation algorithms, and present experimental evaluation results. Our model is inspired by the literature on influence maximization, but we offer significant novelties. First, balance of information exposure is modeled by a symmetric difference function, which is neither monotone nor submodular, and thus not amenable to existing approaches. Second, while previous papers consider a setting with selfish agents and provide bounds on best-response strategies (i.e., the move of the last player), we consider a setting with a centralized agent and provide bounds for a global objective function.
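The symmetric-difference objective can be illustrated with a deterministic toy: given the sets of nodes reached by each campaign, the nodes exposed to exactly one campaign form the symmetric difference, and the remaining nodes are balanced. This is only a sketch of the objective under a fixed exposure outcome, not the paper's probabilistic influence model or its approximation algorithms.

```python
# Hedged sketch: evaluate the "balanced exposure" objective for two campaigns,
# given the sets of nodes each campaign reaches (a deterministic toy stand-in).
def balance_score(all_nodes, exposed_1, exposed_2):
    """Count nodes exposed to both campaigns or to neither; the remaining
    nodes form the symmetric difference, i.e. the one-sided exposures."""
    symmetric_diff = exposed_1 ^ exposed_2
    return len(all_nodes) - len(symmetric_diff)

nodes = set(range(10))
reach_1 = {0, 1, 2, 3, 4}       # nodes reached by campaign 1
reach_2 = {3, 4, 5, 6}          # nodes reached by campaign 2
print(balance_score(nodes, reach_1, reach_2))  # 10 - |{0,1,2,5,6}| = 5
```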
- A combinatorial approach to role discovery
School of Science | Master's thesis (2017-01-16) Arockiasamy, Albert
We provide a new formulation for the problem of role discovery in graphs. Our definition is structural and recursive: two vertices should be assigned to the same role if the roles of their neighbors, when viewed as multi-sets, are similar enough. An attractive characteristic of our approach is that it is based on optimizing a well-defined objective function, and thus, contrary to previous approaches, the role-discovery task can be studied with the tools of combinatorial optimization. We demonstrate that, when fixing the number of roles to be used, the proposed role-discovery problem is NP-hard, while another (seemingly easier) version of the problem is NP-hard to approximate. On the positive side, despite the recursive nature of our objective function, we show that a perfect (zero-cost) role assignment with the minimum number of roles can be found in polynomial time. We do this by connecting the zero-cost role assignment to the notion of equitable partition. For the more practical version of the problem with a fixed number of roles we present two natural heuristic methods, and discuss how to make them scale to large graphs.
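The connection to equitable partitions can be made concrete with colour refinement, which computes the coarsest equitable partition of a graph; the sketch below assumes a plain adjacency-list input and does not implement the thesis's fixed-k heuristics.

```python
# Hedged sketch: iterative colour refinement, which computes the coarsest
# equitable partition of a graph, the structure underlying zero-cost role
# assignments. Toy adjacency list only.
def color_refinement(adj):
    colors = {v: 0 for v in adj}                     # start with a single role
    while True:
        # new colour = old colour plus the multiset of neighbour colours
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in adj}
        if new_colors == colors:
            return colors                            # partition is stable
        colors = new_colors

# on a 6-cycle every vertex gets the same role; a path would split into classes
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(color_refinement(cycle))
```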
- Decomposable families of itemsets
Faculty of Information and Natural Sciences | D4 Published development or research report (2008) Tatti, Nikolaj; Heikinheimo, Hannes
The problem of selecting a small yet high-quality subset of patterns from a larger collection of itemsets has recently attracted a lot of research. Here we discuss an approach to this problem using the notion of decomposable families of itemsets. Such itemset families define a probabilistic model for the data from which the original collection of itemsets was derived. Furthermore, they induce a special tree structure, called a junction tree, familiar from the theory of Markov Random Fields. The method has several advantages. The junction trees provide an intuitive representation of the mining results. From the computational point of view, the model provides leverage for problems that would be intractable using the entire collection of itemsets. We provide an efficient algorithm to build decomposable itemset families, and give an application example with frequency-bound querying using the model. An empirical study shows that our algorithm yields high-quality results.
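A decomposable family induces a junction tree, and the corresponding model factorises as the product of clique marginals divided by the product of separator marginals. The sketch below evaluates such a model on a toy tree with cliques {A, B} and {B, C} and separator {B}; the marginals are made-up numbers chosen to be mutually consistent.

```python
# Hedged sketch: evaluate a decomposable model from its junction tree as the
# product of clique marginals divided by separator marginals. The tree and
# marginals are a toy example, not output of the paper's algorithm.
from itertools import product

p_ab = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}   # P(A, B)
p_bc = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # P(B, C)
p_b = {0: 0.5, 1: 0.5}                                         # separator P(B)

def model_prob(a, b, c):
    # junction-tree factorisation: P(A,B) * P(B,C) / P(B)
    return p_ab[(a, b)] * p_bc[(b, c)] / p_b[b]

total = sum(model_prob(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(f"probabilities sum to {total:.3f}")   # sanity check: 1.000
```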
- Discovering dynamic communities in interaction networks
School of Science | Master's thesis (2014-08-21) Rozenshtein, Polina
Very often online social networks are defined by aggregating information regarding the interactions between the nodes of the network. For example, a call graph is defined by considering an edge for each pair of individuals who have called each other at least once, or at least k times. Similarly, an implicit social network in a social-media site is defined by considering an edge for each pair of users who have interacted in some way, e.g., have had a conversation, commented on each other's content, etc. Although this type of definition has been used to obtain many insights into the structure of social networks, it suffers from a severe limitation: it neglects the precise time at which the interactions between network nodes occur. In this thesis we propose to study interaction networks, where one considers not only the underlying topology of the social network, but also the exact time instances at which nodes interact. In an interaction network an edge is associated with a time stamp, and multiple edges may occur for the same pair of nodes. Consequently, interaction networks offer a more fine-grained representation that can be used to reveal otherwise hidden dynamic phenomena in the network. In the context of interaction networks, we study the problem of discovering communities that are dense in terms of the underlying network structure and whose edges occur in short time intervals. Such communities represent groups of individuals who interact with each other in specific time instances, for example, a group of employees who work on a project and whose interaction intensifies before certain project milestones. We prove that the problem we define is NP-hard, and we provide effective algorithms by adapting techniques used to find dense subgraphs. We perform an extensive evaluation of the proposed methods on synthetic and real datasets, which demonstrates the validity of our concepts and the good performance of our algorithms.
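One building block, finding a dense subgraph among the interactions that fall inside a candidate time interval, can be sketched with Charikar-style greedy peeling; the interaction list and window below are toy values, and the thesis's actual algorithms combine this idea with a search over time intervals.

```python
# Hedged sketch: greedy peeling (Charikar's 2-approximation for densest
# subgraph) applied to the edges that fall inside a candidate time interval.
from collections import defaultdict

def densest_subgraph(edges):
    """Return the vertex set maximising |E(S)| / |S| found by greedy peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes, m = set(adj), len(edges)
    best, best_density = set(nodes), m / max(len(nodes), 1)
    while nodes:
        v = min(nodes, key=lambda x: len(adj[x]))    # peel the min-degree vertex
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        nodes.remove(v)
        del adj[v]
        if nodes and m / len(nodes) > best_density:
            best, best_density = set(nodes), m / len(nodes)
    return best, best_density

interactions = [(1, 2, 10), (2, 3, 11), (1, 3, 12), (3, 4, 50), (4, 5, 51)]
window = (9, 15)
in_window = list({tuple(sorted((u, v)))
                  for u, v, t in interactions if window[0] <= t <= window[1]})
print(densest_subgraph(in_window))   # the triangle {1, 2, 3}, density 1.0
```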
- Dissimilarity measures between binary data sets
Helsinki University of Technology | Master's thesis (2004) Tatti, Nikolaj
Dissimilarity measures between two abstract objects play an important role in data mining. Traditionally, such measures have been defined between two data points. This thesis studies measures between two binary data sets. By binary data we mean a set consisting of 0-1 vectors. For example, such data could be sales data in which each vector represents a single purchase transaction. If such sales data were collected over different months, one could then study how purchasing behaviour differs over time. The thesis assumes that the binary data are generated from some unknown distribution, and the measure is defined indirectly through these distributions. To estimate a distribution from the data, frequent itemsets and well-known information-theoretic tools are used: the estimate is the distribution with the highest entropy among those satisfying the constraints imposed by the itemsets. The measure between two data sets is then defined as the Kullback-Leibler divergence between the estimated distributions of the two data sets. The main problem with this approach is that, in the general case, the measure cannot be computed in polynomial time. The thesis examines two models, namely the independence model and the Chow-Liu tree model, with which the measure can be computed efficiently. For the more general case, the Kullback-Leibler divergence is replaced with a second-order estimate, and the thesis shows how such an estimate can be computed efficiently. To test the measures, the thesis uses a well-known data set consisting of 20,000 articles collected from 20 different newsgroups. Bag-of-words representations are built from the newsgroups and used to test the measures. The results are examined and found to be reasonable.
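Under the independence model mentioned above, the Kullback-Leibler divergence between the two estimated distributions decomposes into a sum of per-attribute Bernoulli divergences, which makes it cheap to compute. The sketch below assumes binary data matrices as input; the smoothing constant is an illustrative choice.

```python
# Hedged sketch: distance between two binary data sets under the independence
# model, where the KL divergence between the two product distributions is a
# sum of per-attribute Bernoulli divergences. Toy data only.
import numpy as np

def independence_kl(data_p, data_q, eps=1e-3):
    p = np.clip(data_p.mean(axis=0), eps, 1 - eps)   # per-attribute frequencies
    q = np.clip(data_q.mean(axis=0), eps, 1 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

rng = np.random.default_rng(2)
january = (rng.random((500, 8)) < 0.3).astype(int)
june = (rng.random((500, 8)) < 0.6).astype(int)
print(independence_kl(january, june))     # clearly larger than ...
print(independence_kl(january, january))  # ... the distance of a set to itself
```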
- Finding events in temporal networks: segmentation meets densest subgraph discovery
A1 Original article in a scientific journal (2019-01-01) Rozenshtein, Polina; Bonchi, Francesco; Gionis, Aristides; Sozio, Mauro; Tatti, Nikolaj
In this paper, we study the problem of discovering a timeline of events in a temporal network. We model events as dense subgraphs that occur within intervals of network activity. We formulate the event discovery task as an optimization problem, where we search for a partition of the network timeline into k non-overlapping intervals, such that the intervals span subgraphs with maximum total density. The output is a sequence of dense subgraphs along with corresponding time intervals, capturing the most interesting events during the network lifetime. A naïve solution to our optimization problem has polynomial but prohibitively high running time. We adapt recent work on dynamic densest-subgraph discovery and approximate dynamic programming to design a fast approximation algorithm. Next, to ensure richer structure, we adjust the problem formulation to encourage coverage of a larger set of nodes. This problem is NP-hard; however, we show that on static graphs a simple greedy algorithm yields an approximate solution, thanks to submodularity. We extend this greedy approach to temporal networks, but lose the approximation guarantee in the process. Finally, we demonstrate empirically that our algorithms recover high-quality solutions.
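The segmentation part of the problem can be illustrated with the textbook O(n²k) dynamic program, shown below with a stand-in interval-scoring oracle; in the paper the score of an interval would be the density of the densest subgraph spanned by the edges inside it, and the fast approximation replaces this exact DP.

```python
# Hedged sketch: the classic dynamic program that splits a timeline of n
# snapshots into k intervals maximising the total interval score. The
# `interval_score` oracle is a hypothetical stand-in.
def segment_timeline(n, k, interval_score):
    NEG = float("-inf")
    best = [[NEG] * (k + 1) for _ in range(n + 1)]   # best[i][j]: first i points, j intervals
    back = [[0] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            for s in range(j - 1, i):                # last interval covers points s..i-1
                cand = best[s][j - 1] + interval_score(s, i - 1)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, s
    bounds, i = [], n                                # recover the interval boundaries
    for j in range(k, 0, -1):
        s = back[i][j]
        bounds.append((s, i - 1))
        i = s
    return best[n][k], bounds[::-1]

# toy oracle: reward intervals whose endpoints lie in the same decade
score = lambda a, b: 1.0 if a // 10 == b // 10 else -1.0
print(segment_timeline(30, 3, score))   # expect the three decades as intervals
```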
- Inferring the strength of social ties: A community-driven approach
A4 Article in conference proceedings (2017-08-13) Rozenshtein, Polina; Tatti, Nikolaj; Gionis, Aristides
Online social networks are growing and becoming denser. The social connections of a given person may have very high variability: from close friends and relatives, to acquaintances, to people they hardly know. Inferring the strength of social ties is an important ingredient for modeling the interaction of users in a network and understanding their behavior. Furthermore, the problem has applications in computational social science, viral marketing, and people recommendation. In this paper we study the problem of inferring the strength of social ties in a given network. Our work is motivated by a recent approach [27], which leverages the strong triadic closure (STC) principle, a hypothesis rooted in social psychology [13]. To guide our inference process, in addition to the network structure, we also consider as input a collection of tight communities. These are sets of vertices that we expect to be connected via strong ties. Such communities appear in different situations, e.g., when being part of a community implies a strong connection to one of the existing members. We consider two related problem formalizations that reflect the assumptions of our setting: a small number of STC violations and strong-tie connectivity within the input communities. We show that both problem formulations are NP-hard. We also show that one problem formulation is hard to approximate, while for the second we develop an algorithm with an approximation guarantee. We validate the proposed method on real-world datasets by comparing against baselines that optimize STC violations and community connectivity separately.
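Counting STC violations for a given strong/weak labelling is straightforward and is sketched below: any two strong edges sharing a vertex whose other endpoints are not connected form one violation. The toy graph and labelling are illustrative; the paper's optimization over labellings is not shown.

```python
# Hedged sketch: count strong-triadic-closure (STC) violations for a given
# strong/weak edge labelling on a toy graph.
from itertools import combinations

def stc_violations(edges, strong):
    edge_set = {frozenset(e) for e in edges}
    strong_nbrs = {}
    for u, v in strong:
        strong_nbrs.setdefault(u, set()).add(v)
        strong_nbrs.setdefault(v, set()).add(u)
    violations = 0
    for w, nbrs in strong_nbrs.items():
        for u, v in combinations(nbrs, 2):
            if frozenset((u, v)) not in edge_set:   # open wedge over two strong ties
                violations += 1
    return violations

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(stc_violations(edges, strong=[(1, 2), (2, 3), (3, 4)]))   # 1: wedge 2-3-4
```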
- The network-untangling problem: from interactions to activity timelines
A1 Original article in a scientific journal (2021-01) Rozenshtein, Polina; Tatti, Nikolaj; Gionis, Aristides
In this paper we study the problem of determining when entities are active based on their interactions with each other. We consider a set of entities V and a sequence of time-stamped edges E among the entities. Each edge (u, v, t) ∈ E denotes an interaction between entities u and v at time t. We assume an activity model where each entity is active during at most k time intervals. An interaction (u, v, t) can be explained if at least one of u or v is active at time t. Our goal is to reconstruct the activity intervals for all entities in the network, so as to explain the observed interactions. This problem, the network-untangling problem, can be applied to discover event timelines from complex entity interactions. We provide two formulations of the network-untangling problem: (i) minimizing the total interval length over all entities (sum version), and (ii) minimizing the maximum interval length (max version). We study the two problems separately for k = 1 and k > 1 activity intervals per entity. For k = 1, we show that the sum problem is NP-hard, while the max problem can be solved optimally in linear time. For the sum problem we provide efficient algorithms motivated by realistic assumptions. For k > 1, we show that both formulations are inapproximable. However, we propose efficient algorithms based on alternating optimization. We complement our study with an evaluation on synthetic and real-world datasets, which demonstrates the validity of our concepts and the good performance of our algorithms.
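A small verification routine makes the activity model concrete: an interaction (u, v, t) is explained if u or v is active at time t. The sketch below checks a candidate solution for the k = 1 case with one interval per entity; the instance is a toy and none of the paper's algorithms are implemented here.

```python
# Hedged sketch: verify a candidate set of activity intervals against the
# interactions. An interaction (u, v, t) is explained if u or v is active at
# time t. Toy instance with k = 1 interval per entity.
def unexplained(interactions, intervals):
    def active(v, t):
        lo, hi = intervals.get(v, (1, 0))   # default: never active
        return lo <= t <= hi
    return [(u, v, t) for u, v, t in interactions
            if not (active(u, t) or active(v, t))]

interactions = [("a", "b", 3), ("b", "c", 5), ("a", "c", 9)]
intervals = {"a": (2, 4), "b": (4, 6)}       # candidate activity intervals
print(unexplained(interactions, intervals))  # [('a', 'c', 9)] is unexplained
```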
- The Network-Untangling Problem: From Interactions to Activity Timelines
A3 Part of a book or other compilation (2017) Rozenshtein, Polina; Tatti, Nikolaj; Gionis, Aristides
In this paper we study a problem of determining when entities are active based on their interactions with each other. More formally, we consider a set of entities V and a sequence of time-stamped edges E among the entities. Each edge (u, v, t) ∈ E denotes an interaction between entities u and v that takes place at time t. We view this input as a temporal network. We then assume a simple activity model in which each entity is active during a short time interval. An interaction (u, v, t) can be explained if at least one of u or v is active at time t. Our goal is to reconstruct the activity intervals, for all entities in the network, so as to explain the observed interactions. This problem, which we refer to as the network-untangling problem, can be applied to discover timelines of events from complex interactions among entities. We provide two formulations for the network-untangling problem: (i) minimizing the total interval length over all entities, and (ii) minimizing the maximum interval length. We show that the sum problem is NP-hard, while, surprisingly, the max problem can be solved optimally in linear time, using a mapping to 2-SAT. For the sum problem we provide efficient and effective algorithms based on realistic assumptions. Furthermore, we complement our study with an evaluation on synthetic and real-world datasets, which demonstrates the validity of our concepts and the good performance of our algorithms.
- Reconstructing a cascade from temporal observations
A4 Article in conference proceedings (2018) Xiao, Han; Rozenshtein, Polina; Tatti, Nikolaj; Gionis, Aristides
Given a subset of active nodes in a network, can we reconstruct the cascade that generated these observations? This problem has been studied in the literature, but here we focus on the case in which temporal information about the active nodes is available. In particular, we assume that in addition to the subset of active nodes we also know their activation times. We formulate this cascade-reconstruction problem as a variant of a Steiner-tree problem: we ask for a tree that spans all reported active nodes while satisfying temporal-consistency constraints. For the proposed problem we present three approximation algorithms. The best algorithm in terms of quality achieves an O(√k)-approximation guarantee, where k is the number of active nodes, while the most efficient algorithm has linearithmic running time, making it scalable to very large graphs. We evaluate our algorithms on real-world networks with both simulated and real cascades. Our results indicate that utilizing the available temporal information allows for more accurate cascade reconstruction. Furthermore, our objective leads to finding the "backbone" of the cascade and yields solutions of very high precision.
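A deliberately naive baseline illustrates the temporal-consistency constraint: each reported active node is attached to a neighbour that activated earlier, if one exists. This toy heuristic cannot route through unreported intermediate nodes and is not one of the paper's approximation algorithms; the graph and activation times below are made up.

```python
# Hedged sketch: a naive temporally consistent reconstruction, where each
# reported active node is attached to an already-activated neighbour.
def naive_cascade(graph, activation_time):
    order = sorted(activation_time, key=activation_time.get)
    parent = {order[0]: None}                      # earliest node is the root
    for v in order[1:]:
        candidates = [u for u in graph.get(v, ())
                      if u in activation_time and activation_time[u] < activation_time[v]]
        # attach to the most recently activated earlier neighbour, if any
        parent[v] = max(candidates, key=activation_time.get, default=None)
    return parent

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
times = {"a": 0, "b": 1, "c": 2, "d": 5}
print(naive_cascade(graph, times))   # {'a': None, 'b': 'a', 'c': 'b', 'd': 'c'}
```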