Graphical models for biclustering and information retrieval in gene expression data

School of Science | Doctoral thesis (article-based) | Defence date: 2012-04-20
Aalto University publication series DOCTORAL DISSERTATIONS, 33/2012
The cell coordinates its biological response to the environment partly via the selective synthesis of thousands of unique RNA and protein molecules. Understanding the molecular biology of the cell is thus essential to the advancement of areas such as health care, agriculture, and energy production, but requires the ability to simultaneously acquire information about thousands of molecules in a sample. Recent high-throughput measurement technologies address this need. Although useful, they generate a high volume of data and introduce methodological challenges, effectively shifting the bottleneck in molecular biology research from data acquisition to data analysis. In particular, an important challenge is the genome-wide analysis of gene expression, that is, of how RNA is transcribed under different conditions and in different organisms and tissues. For computational methods that address biological data analysis tasks, probabilistic frameworks are promising because of their flexibility, soundness, and ability to handle noisy data. This thesis contributes probabilistic methods for two relevant tasks in genome-wide gene expression analysis, namely biclustering and information retrieval. Biclustering concerns the simultaneous grouping of objects (e.g., genes) and conditions. The first contribution is the development of a Bayesian extension to an existing biclustering model. The second contribution is a novel probabilistic method that derives a hierarchical organization of the microarrays in a gene expression data set and at the same time indicates the genes that characterize the hierarchy. Finally, the third contribution is a general probabilistic biclustering framework that easily lends itself to different data types and model assumptions.
Information retrieval in gene expression data is needed because of the increasing amount of available data stored in public databases. Two probabilistic methods for information retrieval are proposed. The models are used in a series of biological case studies that show how the proposed approaches have the potential to accelerate biological research by jointly analyzing data from different studies. In particular, several connections between biological conditions found by the models either correspond to existing biological knowledge or were used in a confirmatory follow-up study to obtain novel biological findings.
Supervising professor
Kaski, Samuel, D.Sc. (Tech.)
Thesis advisor
Lahti, Leo, D.Sc. (Tech.)
Keywords
probabilistic modelling, Bayesian network, biclustering, information retrieval, transcriptomics
Other note
  • [Publication 1]: José Caldas and Samuel Kaski. Bayesian biclustering with the plaid model. In Proceedings of the 2008 IEEE International Workshop on Machine Learning for Signal Processing XVIII, José Príncipe, Deniz Erdogmus, and Tulay Adali (editors), pages 291-296, IEEE, Piscataway, N.J., October 2008.
  • [Publication 2]: José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12):i145-i153 (ISMB/ECCB 2009 Conference Proceedings), June 2009.
  • [Publication 3]: José Caldas and Samuel Kaski. Hierarchical generative biclustering for microRNA expression analysis. Journal of Computational Biology, 18(3):251-261 (RECOMB 2010 Special Issue), March 2011.
  • [Publication 4]: José Caldas and Samuel Kaski. A mixture-of-experts approach to biclustering. Submitted to a journal, 10 pages, 2011.
  • [Publication 5]: José Caldas, Nils Gehlenborg, Eeva Kettunen, Ali Faisal, Mikko Rönty, Andrew G. Nicholson, Sakari Knuutila, Alvis Brazma, and Samuel Kaski. Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics, 28(2):246-253, January 2012.