Mutual dependency-based modeling of relevance in co-occurrence data

Aalto-yliopiston teknillinen korkeakoulu | Doctoral thesis (article-based)
Online book (1596 KB, 80 pp.)
TKK dissertations in information and computer science, 17
In the analysis of large data sets it is increasingly important to distinguish the relevant information from the irrelevant. This thesis outlines how to find what is relevant in so-called co-occurrence data, where there are two or more representations for each data sample. The modeling task sets the limits to what we are interested in and, for its part, defines relevance. In this work, the problem of finding what is relevant in data is formalized via dependence: variation that is found in both (or all) co-occurring data sets is deemed more relevant than variation that is present in only one (or some) of the data sets. In other words, relevance is defined through dependencies between the data sets.
The method development contributions of this thesis are related to latent topic models and methods of dependency exploration. The dependency-seeking models were extended to nonparametric models, and computational algorithms were developed for the models. The methods are applicable to mutual dependency modeling and co-occurrence data in general, without restriction to the applications presented in the publications of this work. The application areas of the publications included modeling of user interest, relevance prediction of text based on eye movements, analysis of brain imaging with fMRI, and modeling of gene regulation in bioinformatics. Additionally, frameworks for different application areas were suggested.
Until recently it has been a prevalent convention to assume the data to be normally distributed when modeling dependencies between different data sets. Here, a distribution-free nonparametric extension of Canonical Correlation Analysis (CCA) was suggested, together with a computationally more efficient semi-parametric variant. Furthermore, an alternative view of CCA was derived, which allows a new kind of interpretation of the results and the use of CCA in feature selection that regards dependency as the criterion of relevance.
Traditionally, latent topic models are one-way clustering models, that is, one of the variables is clustered by the latent variable. We proposed a latent topic model that generalizes in two ways and showed that when only a small amount of data has been gathered, two-way generalization becomes necessary.
In the field of brain imaging, natural stimuli in fMRI studies imitate real-life situations and challenge the analysis methods used. A novel two-step framework was proposed for analyzing brain imaging measurements from fMRI. This framework seems promising for the analysis of brain signal data measured under natural stimulation, once such measurements are more widely available.
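The dependency-based notion of relevance described in the abstract can be illustrated with classical linear CCA. The sketch below is not the nonparametric or semi-parametric extension proposed in the thesis; it is the standard textbook formulation (whitening followed by a singular value decomposition), and the function name `cca`, the regularization parameter `reg`, and the synthetic data are assumptions made for illustration only. Each of the two views contains one feature driven by a shared signal and one feature that is private to that view.

```python
import numpy as np


def cca(X, Y, reg=1e-8):
    """Classical linear CCA via whitening and SVD (illustrative sketch).

    Returns the canonical correlations and the projection matrices
    for X and Y. `reg` is a small ridge term added for numerical
    stability; it is an assumption of this sketch.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance and cross-covariance matrices.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)

    # Whitened cross-covariance K = Lx^{-1} Cxy Ly^{-T}.
    K = np.linalg.solve(Lx, Cxy)
    K = np.linalg.solve(Ly, K.T).T

    # Singular values of K are the canonical correlations.
    U, s, Vt = np.linalg.svd(K, full_matrices=False)

    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U)
    Wy = np.linalg.solve(Ly.T, Vt.T)
    return s, Wx, Wy


# Synthetic co-occurrence data: z is shared by both views,
# the second feature of each view is view-specific variation.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
X = np.column_stack([z + 0.3 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([z + 0.3 * rng.normal(size=n), 3.0 * rng.normal(size=n)])

s, Wx, Wy = cca(X, Y)
print("canonical correlations:", s)
print("first X projection:", Wx[:, 0])
```

In this toy setting the first canonical correlation should be high and the second close to zero, and the first projection should load mainly on the shared feature even though the private feature of `Y` has much larger variance. This is the sense in which dependency, rather than variance, acts as the criterion of relevance.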
Supervising professor
Kaski, Samuel, Prof.
canonical correlation analysis, collaborative filtering, co-occurrence data, dependency modeling, eye movements, fMRI, gene regulation, latent topic models, natural stimulation, two-way grouping
Other note
  • [Publication 1]: Kai Puolamäki, Jarkko Salojärvi, Eerika Savia, Jaana Simola, and Samuel Kaski. 2005. Combining eye movements and collaborative filtering for proactive information retrieval. In: Gary Marchionini, Alistair Moffat, John Tait, Ricardo Baeza-Yates, and Nivio Ziviani (editors). Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005). Salvador, Brazil. 15-19 August 2005. New York, USA. ACM Press. Pages 146-153.
  • [Publication 2]: Eerika Savia, Samuel Kaski, Ville Tuulos, and Petri Myllymäki. 2004. On text-based estimation of document relevance. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IJCNN 2004). Budapest, Hungary. 25-29 July 2004. Piscataway, NJ, USA. IEEE. Volume 4, pages 3275-3280. © 2004 Institute of Electrical and Electronics Engineers (IEEE). By permission.
  • [Publication 3]: Eerika Savia, Kai Puolamäki, Janne Sinkkonen, and Samuel Kaski. 2005. Two-way latent grouping model for user preference prediction. In: Fahiem Bacchus and Tommi Jaakkola (editors). Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005). Edinburgh, Scotland. 26-29 July 2005. Corvallis, OH, USA. AUAI Press. Pages 518-525. ISBN 0-9749039-1-4. © 2005 by authors.
  • [Publication 4]: Eerika Savia, Kai Puolamäki, and Samuel Kaski. 2009. Latent grouping models for user preference prediction. Machine Learning, volume 74, number 1, pages 75-109.
  • [Publication 5]: Eerika Savia, Kai Puolamäki, and Samuel Kaski. 2009. Two-way grouping by one-way topic models. In: Niall M. Adams, Céline Robardet, Arno Siebes, and Jean-François Boulicaut (editors). Proceedings of the 8th International Symposium on Intelligent Data Analysis (IDA 2009). Lyon, France. 31 August - 2 September 2009. Berlin, Heidelberg, Germany. Springer. Lecture Notes in Computer Science, volume 5772, pages 178-189. ISBN 978-3-642-03914-0.
  • [Publication 6]: Janne Nikkilä, Christophe Roos, Eerika Savia, and Samuel Kaski. 2005. Exploratory modeling of yeast stress response and its regulation with gCCA and associative clustering. International Journal of Neural Systems, volume 15, number 4, pages 237-246. © 2005 World Scientific Publishing Company. By permission.
  • [Publication 7]: Jarkko Ylipaavalniemi, Eerika Savia, Sanna Malinen, Riitta Hari, Ricardo Vigário, and Samuel Kaski. 2009. Dependencies between stimuli and spatially independent fMRI sources: Towards brain correlates of natural stimuli. NeuroImage, volume 48, number 1, pages 176-185.
  • [Publication 8]: Jarkko Ylipaavalniemi, Eerika Savia, Ricardo Vigário, and Samuel Kaski. 2007. Functional elements and networks in fMRI. In: Wei Zhang and Ilya Shmulevich (editors). Proceedings of the 15th European Symposium on Artificial Neural Networks: Advances in Computational Intelligence and Learning (ESANN 2007). Bruges, Belgium. 25-27 April 2007. Bruxelles, Belgium. d-side publications. Pages 561-566.
  • [Publication 9]: Eerika Savia, Arto Klami, and Samuel Kaski. 2009. Fast dependent components for fMRI analysis. In: Proceedings of the 34th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009). Taipei, Taiwan. 19-24 April 2009. Piscataway, NJ, USA. IEEE. Pages 1737-1740. ISBN 978-1-4244-2354-5. © 2009 Institute of Electrical and Electronics Engineers (IEEE). By permission.