The smallest set of constraints that explains the data: a randomization approach

Faculty of Information and Natural Sciences

Language

en

Pages

22

Series

TKK reports in information and computer science, 31

Abstract

Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler, which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than its values on the randomized data sets, we can reject the null hypothesis. It is often not immediately clear, however, why the null hypothesis is rejected. For example, the cost of clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is low. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle, any type of pattern can be used, as long as an appropriate randomization method exists. We show that the problem is NP-hard in its general form, but that in a special case an exact solution can be computed fast, and we propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and is validated both theoretically and experimentally.
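The testing scheme the abstract describes — draw randomized data sets from a null distribution and compare a test statistic against its value on the original data — can be sketched in a few lines. The code below is a minimal illustration, not the report's implementation: the column-permuting null model (which preserves only column margins of a 0-1 matrix) and the itemset-support statistic are assumptions chosen for the example.

```python
import numpy as np

def empirical_p_value(data, statistic, sampler, n_samples=1000, seed=0):
    """Generic randomization test: the empirical p-value is the fraction of
    randomized data sets whose statistic is at least as extreme as the
    original's (here, 'extreme' means 'as large or larger')."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(data)
    hits = sum(statistic(sampler(data, rng)) >= t_obs for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)  # add-one correction keeps p > 0

def permute_columns(m, rng):
    """Illustrative null model for a 0-1 matrix: permute each column
    independently, preserving column margins only (a simpler stand-in for
    null models that also preserve row margins)."""
    return np.column_stack([rng.permutation(col) for col in m.T])

def support_01(m):
    """Illustrative statistic: support of the itemset {0, 1}, i.e. the
    number of rows containing both items."""
    return int(np.sum(m[:, 0] & m[:, 1]))

# A 0-1 matrix where items 0 and 1 co-occur far more often than the
# column margins alone would predict:
m = np.zeros((20, 2), dtype=int)
m[:15, 0] = 1
m[:15, 1] = 1
p = empirical_p_value(m, support_01, permute_columns, n_samples=200, seed=42)
```

With the co-occurrence this strong, the observed support is essentially never matched by the randomized matrices, so `p` is small and the null hypothesis is rejected. The report's contribution then enters at the next step: explaining such a rejection by a smallest set of patterns under which the data is no longer surprising.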

Permanent link to this item

https://urn.fi/urn:nbn:fi:tkk-013141