### Browsing by Author "Gionis, Aristides, Associate Prof., Aalto University, Department of Computer Science, Finland"

Item Advances in Analysing Temporal Data (Aalto University, 2017)

Kostakis, Orestis; Tietotekniikan laitos; Department of Computer Science; Data Mining Group; Perustieteiden korkeakoulu; School of Science; Gionis, Aristides, Associate Prof., Aalto University, Department of Computer Science, Finland

Modern technical capabilities and systems have resulted in an abundance of data. A significant portion of that data is of a temporal nature. Hence, it becomes imperative to design effective and efficient algorithms and solutions that enable searching and analysing large databases of temporal data. This thesis contains several contributions related to the broad scientific field of temporal-data analysis. First, we present a distance function for pairs of event-interval sequences, together with proofs of important properties, such as that the function is a metric, and a lower-bounding function. An embedding-based indexing method is proposed for searching through large databases of event-interval sequences under this distance function. Second, we study the problem of subsequence search for event-interval sequences. This includes hardness results, an exact worst-case exponential-time algorithm, two upper bounds and a scheme for approximation algorithms. In addition, an equivalence is established between graphs and event-interval sequences. This equivalence allows hardness results to be derived for several problems on event-interval sequences. Most importantly, it raises the question of which techniques, results, and methods from each of the fields of graph mining and temporal data mining can be applied to the other to advance the current state of the art. Third, for the problem of subsequence search, we propose an indexing method based on decomposing event-interval sequences into 2-interval patterns. The proposed indexing method is benchmarked against other approaches.
In addition, we examine different variations of the problem and propose exact algorithms for solving them. Fourth, we describe a complete system that enables the clustering of a stream of graphs. The graphs are clustered into groups based on their distances, via approximating the graph edit distance. The proposed clustering algorithm achieves a good clustering with few graph comparisons. The effectiveness and usefulness of the system are demonstrated by clustering function call-graphs of binary executable files for the purpose of malware detection. Finally, we solve the problem of summarising temporal networks. We assume that networks operate in certain modes and that the total sequence of interactions can be modelled as a series of transitions between these modes. We prove hardness results and provide heuristic procedures for finding approximate solutions. We demonstrate the quality of our methods via benchmarking and performing case studies on datasets taken from sports and social networks.

Item Sampling from scarcely defined distributions: Methods and applications in data mining (Aalto University, 2016)

Kallio, Aleksi; Mannila, Heikki, Prof., Aalto University, Department of Computer Science, Finland; Puolamäki, Kai, Docent, Aalto University, Department of Computer Science, Finland; Tietotekniikan laitos; Department of Computer Science; Perustieteiden korkeakoulu; School of Science; Gionis, Aristides, Associate Prof., Aalto University, Department of Computer Science, Finland

The importance of data is widely acknowledged in modern society. Increasing volumes of information and growing interest in data-driven decision making are creating new demands for analytical methods. In data mining applications, users are often required to operate with limited background knowledge. Specifically, one needs to analyze data and derived statistics without exact information on underlying statistical distributions.
This work introduces the term scarcely defined distributions to describe such statistical distributions. In traditional statistical testing one often makes assumptions about the source of data, such as those related to the normal distribution. If data are produced by a controlled experiment and originate from a well-known source, these assumptions can be justified. In data mining, strong presuppositions about the data source typically cannot be made, as the data source is not under the control of the analyst, is not well known or is too complex to understand. The present research discusses methods and applications of data mining in which scarcely defined distributions emerge. Several strategies are put forth that allow the dataset to be analyzed even when distributions are not well known, both in frequentist and information-theoretic statistical frameworks. A recurring theme is how to employ controls at the analysis phase if the data were not produced in a controlled experiment. In most cases presented, control is achieved by adopting randomization and other empirical sampling methods that rely on large data sizes and computational power. Data mining applications reviewed in this work are from several fields. Biomedical measurement data are explored in multiple cases, involving both microarray and high-throughput sequencing data types. In ecological and paleontological domains the analysis of presence-absence data of taxa is discussed. A common factor for all of the application areas is the complexity of the underlying processes and the biased error sources of the measurement process. Finally, the study discusses the future trend of growing data volumes and the relevance of the proposed methods and solutions in that context. It is noted that the growing complexity and the need for quickly adaptable methods favor the general approach taken in the thesis, while increasing data volumes and computational power make it practically feasible.