Randomization algorithms for assessing the significance of data mining results

School of Science (Perustieteiden korkeakoulu) | Doctoral thesis (article-based)
Online book (1100 KB, 74 pp.)
Aalto University publication series DOCTORAL DISSERTATIONS, 99/2011
Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge in large collections of data. This thesis addresses, from a computational point of view, the problem of assessing whether data mining results are merely random artefacts of the data or something more interesting. In randomization-based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties. The next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data. In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification. The properties of the new randomization methods are analyzed theoretically, and extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications. The methods work well on different types of data, are easy to use, and provide meaningful information for further improving and understanding data mining results.
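The general idea of randomization-based significance testing described in the abstract can be illustrated with a minimal sketch. This is not one of the thesis's algorithms: it assumes a toy setting with two numeric columns, where the preserved property is the set of values in each column and the tested result is the absolute covariance between them. The function names (`empirical_p_value`, `covariance`, `randomization_test`) and the parameters `n_samples` and `seed` are hypothetical and chosen only for illustration.

```python
import random


def empirical_p_value(original_stat, randomized_stats):
    """Share of results (randomized plus the original itself) at least as large as the original."""
    at_least = sum(1 for s in randomized_stats if s >= original_stat)
    # The +1 terms count the original result as one sample, so p is never 0.
    return (at_least + 1) / (len(randomized_stats) + 1)


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def randomization_test(xs, ys, n_samples=999, seed=0):
    """Compare |cov(xs, ys)| with its value on data where ys is shuffled.

    Shuffling keeps the values of each column intact (the assumed
    'basic property') but destroys any dependency between the columns.
    """
    rng = random.Random(seed)
    observed = abs(covariance(xs, ys))
    shuffled = list(ys)
    randomized = []
    for _ in range(n_samples):
        rng.shuffle(shuffled)  # one randomized dataset
        randomized.append(abs(covariance(xs, shuffled)))
    return empirical_p_value(observed, randomized)


if __name__ == "__main__":
    xs = list(range(20))
    ys = [2 * x + 1 for x in xs]  # strongly dependent toy data
    print(randomization_test(xs, ys))
```

A small p-value here indicates that the observed dependency is not explained by the column value distributions alone. The scenarios studied in the thesis (matrices, iterative mining, multi-relational databases, classification) require more elaborate randomization schemes than a plain shuffle.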
Supervising professor
Mannila, Heikki, Prof.
Thesis advisor
Mannila, Heikki, Prof.
Keywords
data mining, randomization, significance testing, MCMC, matrix, relational database, clustering, classification, iterative analysis
Other note
  • [Publication 1]: Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. 2009. Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, volume 2, number 4, pages 209-230. © 2009 Wiley Periodicals. By permission.
  • [Publication 2]: Markus Ojala. 2010. Assessing data mining results on matrices with randomization. In: Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu (editors). Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010). Sydney, Australia. 14-17 December 2010. IEEE. Pages 959-964. ISBN 978-1-4244-9131-5. © 2010 Institute of Electrical and Electronics Engineers (IEEE). By permission.
  • [Publication 3]: Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, and Heikki Mannila. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In: John Elder, Françoise Soulié Fogelman, Peter Flach, and Mohammed Zaki (editors). Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009). Paris, France. 28 June - 1 July 2009. New York, NY, USA. ACM. Pages 379-388. ISBN 978-1-60558-495-9. © 2009 Association for Computing Machinery (ACM). By permission.
  • [Publication 4]: Markus Ojala, Gemma C. Garriga, Aristides Gionis, and Heikki Mannila. 2010. Evaluating query result significance in databases via randomizations. In: Proceedings of the 10th SIAM International Conference on Data Mining (SDM 2010). Columbus, Ohio, USA. 29 April - 1 May 2010. Society for Industrial and Applied Mathematics. Pages 906-917. © 2010 Society for Industrial and Applied Mathematics (SIAM). By permission.
  • [Publication 5]: Markus Ojala and Gemma C. Garriga. 2010. Permutation tests for studying classifier performance. Journal of Machine Learning Research, volume 11, pages 1833-1863. © 2010 by authors.