Exploratory cluster analysis of genomic high-throughput data sets and their dependencies

Doctoral thesis (article-based)
79, [126]
Dissertations in computer and information science. Report D, 11
This thesis studies exploratory cluster analysis of genomic high-throughput data sets and their interdependencies. In modern biology, new high-throughput measurements generate numerical data simultaneously from thousands of molecules in the cell. This enables a new perspective on biology, called systems biology. The discipline that develops methods for analyzing systems biology data is called bioinformatics. The work in this thesis contributes mainly to bioinformatics, but the approaches presented are general-purpose machine learning methods and can be applied in many problem areas. A main problem in analyzing genomic high-throughput data is that the potentially useful new findings are hidden in a huge mass of data. They need to be extracted and visualized for the analyst as overviews. This thesis introduces new exploratory cluster analysis methods for extracting and visualizing findings from high-throughput data. Three kinds of methods are presented to solve progressively better-focused problems. First, visualizations and clusterings using the self-organizing map are applied to genomic data sets. Second, recently developed methods for improving the visualization and clustering of a data set with auxiliary data are applied. Third, new methods for exploring the dependency between data sets are developed and applied. The new methods are based on maximizing the Bayes factor between the model of independence and the model of dependence for finite data. The methods outperform their alternatives in numerical comparisons. In applications, they confirmed known biological findings, which validates the methods, and also generated new hypotheses. The applications included exploration of yeast gene expression data, yeast gene expression data in a new metric learned with auxiliary data, the regulation of yeast gene expression by transcription factors, and the dependencies between human and mouse gene expression.
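As a rough illustration of the quantity mentioned in the abstract, the sketch below computes a log Bayes factor between a model of dependence and a model of independence for a contingency table of co-occurrence counts. It is a textbook Dirichlet-multinomial formulation, not the exact objective of the thesis: the symmetric prior value, the function names, and the example tables are all assumptions made for illustration. In a clustering setting, the table would cross-tabulate the cluster memberships of paired samples from two data sets, and the clusters would be adjusted to make this value large.

```python
from math import lgamma


def log_multivariate_beta(alpha):
    """Log of the multivariate beta function, the Dirichlet normalizer."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_marginal(counts, prior):
    """Log marginal likelihood of counts under a symmetric Dirichlet prior.

    The multinomial coefficient is omitted because it is identical in both
    models and cancels in the Bayes factor.
    """
    posterior = [c + prior for c in counts]
    return log_multivariate_beta(posterior) - log_multivariate_beta(
        [prior] * len(counts)
    )


def log_bayes_factor(table, prior=1.0):
    """Log Bayes factor of dependence versus independence for a count table.

    `prior` is an assumed symmetric Dirichlet pseudo-count (hypothetical
    default, not taken from the thesis).
    """
    cells = [c for row in table for c in row]
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # Dependence: one Dirichlet prior over all cells of the table.
    log_dependent = log_marginal(cells, prior)
    # Independence: separate Dirichlet priors over row and column margins.
    log_independent = log_marginal(row_sums, prior) + log_marginal(col_sums, prior)
    return log_dependent - log_independent


if __name__ == "__main__":
    # Hypothetical 2x2 tables: strongly associated versus evenly spread counts.
    print(log_bayes_factor([[30, 0], [0, 30]]))
    print(log_bayes_factor([[15, 15], [15, 15]]))
```

A positive value favors dependence between the two partitions and a negative value favors independence; because the priors keep the score finite for small counts, it remains usable on finite data.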
bioinformatics, clustering, data integration, dependency modeling, exploratory data analysis, gene expression, human, learning metrics, mouse, self-organizing map, systems biology, transcription, yeast
Other note
  • Samuel Kaski, Janne Nikkilä, and Teuvo Kohonen. Methods for Exploratory Cluster Analysis. In: Szczepaniak, Segovia, Kacprzyk, Zadeh (Eds.): Intelligent Exploration of the Web, pp. 136-151, Springer, Berlin, 2003.
  • Janne Nikkilä, Petri Törönen, Samuel Kaski, Jarkko Venna, Eero Castrén, and Garry Wong. Analysis and Visualization of Gene Expression Data using Self-Organizing Maps. Neural Networks, Special Issue on New Developments on Self-Organizing Maps, vol. 15, issue 8-9, pages 953-966, 2002.
  • Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä. Clustering Gene Expression Data by Mutual Information with Gene Function. In: Dorffner, Bischof, Hornik (Eds.): Proceedings of the International Conference on Artificial Neural Networks (ICANN 2001), pages 81-86, Springer-Verlag, Berlin, Germany, 2001.
  • Merja Oja, Janne Nikkilä, Petri Törönen, Garry Wong, Eero Castrén, and Samuel Kaski. Exploratory Clustering of Gene Expression Profiles of Mutated Yeast Strains. In: Zhang and Shmulevich (Eds.): Computational And Statistical Approaches To Genomics, pages 65-78, Kluwer Academic Publishers, 2002.
  • Janne Sinkkonen, Samuel Kaski, and Janne Nikkilä. Discriminative Clustering: Optimal Contingency Tables by Learning Metrics. In: Elomaa, Mannila, Toivonen (Eds.): Proceedings of the 13th European Conference on Machine Learning (ECML 2002), Lecture Notes in Artificial Intelligence 2430, pages 418-430, Springer, Berlin, 2002.
  • Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and Metrics in Visualizing Similarity of Gene Expression. BMC Bioinformatics, 4: 48, 2003. [article6.pdf] © 2003 by authors.
  • Samuel Kaski, Janne Nikkilä, Eerika Savia, and Christophe Roos. Discriminative Clustering of Yeast Stress Response. In: Seiffert, Jain, Schweizer (Eds.): Bioinformatics using Computational Intelligence Paradigms, pages 75-92, Springer, Berlin, 2005.
  • Samuel Kaski, Janne Nikkilä, Janne Sinkkonen, Leo Lahti, Juha Knuuttila, and Christophe Roos. Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Special Issue on Machine Learning for Bioinformatics - Part 2, vol. 2, no. 3, pages 203-216, July-September 2005. [article8.pdf] © 2005 IEEE. By permission.
  • Janne Nikkilä, Christophe Roos, Eerika Savia, and Samuel Kaski. Exploratory Modeling of Yeast Stress Response and its Regulation with gCCA and Associative Clustering. International Journal of Neural Systems, Special Issue on Bioinformatics, vol. 15, no. 4, pages 237-246, 2005. [article9.pdf] © 2005 World Scientific Publishing Company. By permission.