Retrieval of Gene Expression Measurements with Probabilistic Models

School of Science | Doctoral thesis (article-based) | Defence date: 2014-08-15
Pages
99 + app. 156
Aalto University publication series DOCTORAL DISSERTATIONS, 108/2014
A crucial problem in current biological and medical research is how to utilize the diverse body of existing biological knowledge and heterogeneous measurement data to gain insights into new data. As datasets continue to be deposited in public repositories, it is becoming important to develop search engines that can efficiently integrate existing data and search for relevant earlier studies given a new study. This search task arises in several biological applications, including cancer genomics, pharmacokinetics, personalized medicine and meta-analysis of functional genomics.
Most existing search engines rely on classical keyword- or annotation-based retrieval, which is limited to discovering known information and requires careful downstream annotation of the data. Data-driven, model-based methods that retrieve studies based on similarities in the actual measurement data have greater potential for uncovering novel biological insights. In particular, probabilistic modeling provides promising model-based tools due to its ability to encode prior knowledge, represent uncertainty in model parameters and handle noise in the data. By introducing latent variables, it is further possible to capture relationships among data features in the form of meaningful biological components underlying the data.
This thesis adapts existing probabilistic models and develops new ones for retrieval of relevant measurement data in three different cases of background repositories. The first case is a background collection of data samples where each sample is represented by a single data type. The second case is a collection of multimodal data samples where each sample is represented by more than one data type. The third case is a background collection of datasets where each dataset, in turn, is a collection of multiple samples. In all three setups the proposed models are evaluated quantitatively, and case studies demonstrate that the models facilitate interpretable retrieval of relevant data, rigorous integration of diverse information sources and learning of latent components from partly related dataset collections.
Supervising professor
Kaski, Samuel, Prof., Aalto University, Department of Information and Computer Science, Finland
Thesis advisor
Peltonen, Jaakko, Dr., Aalto University, Department of Information and Computer Science, Finland
Keywords
machine learning, bioinformatics, probabilistic modeling, information retrieval, Bayesian generative models
Other note
  • [Publication 1]: José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12):i145–i153, 2009. doi:10.1093/bioinformatics/btp215.
  • [Publication 2]: Ali Faisal, Frank Dondelinger, Dirk Husmeier and Colin M. Beale. Inferring species interaction networks from species abundance data: A comparative evaluation of various statistical and machine learning methods. Ecological Informatics, 5(6):451–464, 2010. doi:10.1016/j.ecoinf.2010.06.005.
  • [Publication 3]: José Caldas, Nils Gehlenborg, Eeva Kettunen, Ali Faisal, Mikko Rönty, Andrew G. Nicholson, Sakari Knuutila, Alvis Brazma and Samuel Kaski. Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics, 28(2):246–253, 2012. doi:10.1093/bioinformatics/btr634.
  • [Publication 4]: Suleiman A. Khan, Ali Faisal, John P. Mpindi, Juuso A. Parkkinen, Tuomo Kalliokoski, Antti Poso, Olli P. Kallioniemi, Krister Wennerberg and Samuel Kaski. Comprehensive data-driven analysis of the impact of chemoinformatic structure on the genome-wide biological response profiles of cancer cells to 1159 drugs. BMC Bioinformatics, 13:112, 2012. doi:10.1186/1471-2105-13-112.
  • [Publication 5]: Riku Louhimo, Viljami Aittomaki*, Ali Faisal*, Marko Laakso*, Ping Chen, Kristian Ovaska, Erkka Valo, Leo Lahti, Vladimir Rogojin, Samuel Kaski and Sampsa Hautaniemi. Systematic use of computational methods allows stratification of treatment responders in glioblastoma multiforme. Systems Biomedicine, 1(2):130–136, 2013. doi:10.4161/sysb.28904.
  • [Publication 6]: Ali Faisal, Jussi Gillberg, Gayle Leen and Jaakko Peltonen. Transfer Learning using a Nonparametric Sparse Topic Model. Neurocomputing, 112:124–137, 2013. doi:10.1016/j.neucom.2012.12.038.
  • [Publication 7]: Ali Faisal, Jaakko Peltonen, Elisabeth Georgii, Johan Rung and Samuel Kaski. Toward computational cumulative biology by combining models of biological datasets. Submitted to a journal, 6 pages, 2013.