The smallest set of constraints that explains the data: a randomization approach

Faculty of Information and Natural Sciences
Date
2010
Department
Department of Information and Computer Science
Language
en
Pages
22
Series
TKK reports in information and computer science, 31
Abstract
Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler, which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than its values on the randomized data sets, we can reject the null hypothesis. It is often not immediately clear, however, why the null hypothesis is rejected. For example, the cost of a clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle, any type of pattern can be used, as long as an appropriate randomization method exists for it. We show that the problem is NP-hard in its general form, but that in a special case an exact solution can be computed quickly, and we propose a greedy algorithm for solving the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and it is validated both theoretically and experimentally.
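To make the setup concrete, the following Python sketch illustrates the general shape of such a procedure. It is not the report's algorithm: the sampler, statistic, and candidates arguments are hypothetical placeholders supplied by the caller, and the greedy criterion used here (add the constraint that raises the empirical p-value the most) is only one plausible reading of "explains the data".

    def empirical_p_value(data, sampler, statistic, constraints, n_samples=999):
        """Empirical p-value of the observed statistic under a null
        distribution that preserves the given constraints."""
        t_orig = statistic(data)
        # Assume larger values of the statistic are "more extreme".
        extreme = sum(
            statistic(sampler(data, constraints)) >= t_orig
            for _ in range(n_samples)
        )
        # The +1 terms count the original data set among the samples,
        # keeping the estimate strictly positive.
        return (extreme + 1) / (n_samples + 1)

    def smallest_explaining_set(data, sampler, statistic, candidates, alpha=0.05):
        """Greedily grow a constraint set until the original data is no
        longer significant under the constrained null distribution."""
        chosen = []
        while empirical_p_value(data, sampler, statistic, chosen) < alpha:
            remaining = [c for c in candidates if c not in chosen]
            if not remaining:
                break  # candidates exhausted; the data stays unexplained
            # Add the constraint that best explains the data, taken here
            # to be the one that raises the p-value the most.
            best = max(
                remaining,
                key=lambda c: empirical_p_value(
                    data, sampler, statistic, chosen + [c]
                ),
            )
            chosen.append(best)
        return chosen

The add-one correction in the p-value estimate is the standard conservative choice for randomization tests, and the loop terminates because each iteration either adds a constraint or exhausts the candidate set.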
Keywords
hypothesis testing, randomization, significant patterns, time series, frequent patterns
Permanent link to this item
https://urn.fi/urn:nbn:fi:tkk-013141