Computational methods for comparison and exploration of event sequences

Loading...
Thumbnail Image

URL

Journal Title

Journal ISSN

Volume Title

School of Science | Doctoral thesis (monograph) | Defence date: 2013-12-16
Checking the digitized thesis and permission for publishing
Instructions for the author

Date

2013

Major/Subject

Mcode

Degree programme

Language

en

Pages

116

Series

Aalto University publication series DOCTORAL DISSERTATIONS, 205/2013

Abstract

Many types of data, e.g., natural language texts, biological sequences, or time series of sensor data, contain sequential structure. Analysis of such sequential structure is interesting for various reasons, for example, to detect that data consists of several homogeneous parts, that data contains certain recurring patterns, or to find parts that are different or surprising compared to the rest of the data. The main question studied in this thesis is how to identify global and local patterns in event sequences. Within this broad topic, we study several subproblems. The first problem that we address is how to compare event frequencies across event sequences and databases of event sequences. Such comparisons are relevant, for example, to linguists who are interested in comparing word counts between two corpora to identify linguistic differences, e.g., between groups of speakers, or language change over time. The second problem that we address is how to find areas in an event sequence where an event has a surprisingly high or low frequency. More specifically, we study how to take into account the multiple testing problem when looking for local frequency deviations in event sequences. Many algorithms for finding local patterns in event sequences require that the person applying the algorithm chooses the level of granularity at which the algorithm operates, and it is often not clear how to choose that level. The third problem that we address is which granularities to use when looking for local patterns in an event sequence. The main contributions of this thesis are computational methods that can be used to compare and explore (databases of) event sequences with high computational efficiency, increased accuracy, and that offer new perspectives on the sequential structure of data. Furthermore, we illustrate how the proposed methods can be applied to solve practical data analysis tasks, and describe several experiments and case studies where the methods are applied on various types of data. The primary focus is on natural language texts, but we also study DNA sequences and sensor data. We find that the methods work well in practice and that they can efficiently uncover various types of interesting patterns in the data.

Description

Supervising professor

Rousu, Juho, Prof., Aalto University, Department of Information and Computer Science, Finland

Thesis advisor

Mannila, Heikki, Prof., Aalto University, Department of Information and Computer Science, Finland

Keywords

pattern mining, event sequence, statistical significance, multiple testing, sliding window, window length

Other note

Citation