Browsing by Author "Kaski, Samuel, Prof."
- Multivariate multi-way modelling of multiple high-dimensional data sources
School of Science | Doctoral dissertation (article-based) (2012) Huopaniemi, Ilkka

A widely employed strategy in current biomedical research is to study samples from patients using high-throughput measurement techniques such as transcriptomics, proteomics, and metabolomics. In contrast to the static information obtained from the DNA sequence, these techniques deliver a "dynamic fingerprint" describing the phenotypic status of the patient in the form of absolute or relative concentrations of hundreds, or even tens of thousands, of molecules: mRNA, proteins, metabolites, and lipids. The huge number of measured variables opens up new possibilities for biomedical research, but harnessing the information contained in such 'omics' data requires advanced data analysis methods. The standard setup in biomedical research is to compare case (diseased) and control (healthy) samples and determine differentially expressed molecules, which are then considered potential biomarkers for the disease. In modern biomedical experiments, more complicated research questions are common. For instance, diet or drug treatments, gender, and age play central roles in many case-control experiments, and the measurements often take the form of a time series. These additional covariates turn the experiment into a multi-way experimental design, yet few tools exist for proper analysis of high-dimensional data with such a design. Moreover, the task of integrating multiple data sources with different variables is nowadays often encountered in two classes of biomedical experiments: (i) multiple omics types, or samples from several tissues, are measured from each patient (paired samples); (ii) biomarkers are translated between human studies and model organisms (no paired samples). These data integration tasks usually involve a multi-way experimental design as well.
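To make the small-sample, high-dimensional setting above concrete: with hundreds of correlated molecular variables and only a handful of samples, testing every molecule separately quickly becomes unreliable. A crude, purely illustrative workaround (not the dissertation's Bayesian model) is to collapse correlated variables into groups and test one summary per group; all variable names and parameters below are invented for the sketch.

```python
# Purely illustrative sketch (NOT the dissertation's Bayesian model):
# collapse correlated variables into groups, then test one summary per group
# instead of hundreds of individual molecules.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_case, n_ctrl, p = 10, 10, 300            # few samples, many variables
latent = rng.normal(size=(n_case + n_ctrl, 3))
latent[:n_case, 0] += 2.0                  # cases differ along one latent process
X = latent @ rng.normal(size=(3, p)) + 0.3 * rng.normal(size=(n_case + n_ctrl, p))

# Group variables by correlation (1 - r as a dissimilarity), then summarize.
D = 1.0 - np.corrcoef(X.T)
Z = linkage(D[np.triu_indices(p, k=1)], method="average")
groups = fcluster(Z, t=3, criterion="maxclust")

pvals = []
for g in np.unique(groups):
    score = X[:, groups == g].mean(axis=1)     # one summary value per sample
    pvals.append(ttest_ind(score[:n_case], score[n_case:]).pvalue)
print(pvals)                                   # 3 tests instead of 300
```

The dissertation's model replaces this ad hoc two-stage pipeline with a single Bayesian model that learns the variable groups and the covariate effects jointly.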
In this dissertation, a novel Bayesian machine learning model for multi-way modelling of data from such multi-way, single-source or multi-source setups is presented, covering the majority of situations commonly encountered in the statistical analysis of omics data from current biomedical research. The problem of high dimensionality is solved by assuming that the data can be described as highly correlated groups of variables. The Bayesian modelling approach involves training a single, unified, interpretable model to explain all the data. This approach can overcome the main difficulties in omics analysis: small sample size, high dimensionality, multicollinearity of the data, and the problem of multiple testing. It also enables rigorous uncertainty estimation, dimensionality reduction, and easy interpretation of results from a complex setup involving multiple covariates and multiple data sources.
- Mutual dependency-based modeling of relevance in co-occurrence data
Aalto-yliopiston teknillinen korkeakoulu | Doctoral dissertation (article-based) (2010) Savia, Eerika

In the analysis of large data sets it is increasingly important to distinguish relevant information from irrelevant. This thesis outlines how to find what is relevant in so-called co-occurrence data, where there are two or more representations of each data sample. The modeling task sets the limits on what we are interested in and thereby partly defines relevance. In this work, the problem of finding what is relevant in data is formalized via dependence: variation found in both (or all) co-occurring data sets was deemed more relevant than variation present in only one (or some) of them. In other words, relevance is defined through dependencies between the data sets. The method development contributions of this thesis relate to latent topic models and methods of dependency exploration. The dependency-seeking models were extended to nonparametric models, and computational algorithms were developed for them. The methods are applicable to mutual dependency modeling and co-occurrence data in general, without restriction to the applications presented in the publications of this work. The application areas of the publications include modeling of user interest, relevance prediction of text based on eye movements, analysis of brain imaging with fMRI, and modeling of gene regulation in bioinformatics. Additionally, frameworks for different application areas were suggested. Until recently it was a prevalent convention to assume the data to be normally distributed when modeling dependencies between data sets. Here, a distribution-free nonparametric extension of Canonical Correlation Analysis (CCA) was suggested, together with a computationally more efficient semi-parametric variant.
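For reference, classical linear CCA, the baseline that the nonparametric extension generalizes, can be sketched in plain NumPy: whiten each view, then take the SVD of the cross-covariance, whose singular values are the sample canonical correlations. The synthetic two-view data and all names here are illustrative only.

```python
# Sketch of classical linear CCA (the baseline generalized above); the
# two synthetic "views" X and Y share a 2-dimensional latent signal.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))                       # shared latent signal
X = z @ rng.normal(size=(2, 10)) + 0.5 * rng.normal(size=(n, 10))
Y = z @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(n, 8))

def canonical_correlations(X, Y, k=2):
    # Center and whiten each view, then SVD the whitened cross-covariance:
    # the singular values are the sample canonical correlations.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def whiten(A):
        C = A.T @ A / (len(A) - 1)
        w, V = np.linalg.eigh(C)
        return A @ V @ np.diag(w ** -0.5) @ V.T
    s = np.linalg.svd(whiten(X).T @ whiten(Y) / (n - 1), compute_uv=False)
    return s[:k]

corrs = canonical_correlations(X, Y)
print(corrs)   # both near 1: the views share a strong 2-D dependency
```

The dissertation's contribution lifts exactly this construction beyond the linear, normality-bound setting.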
Furthermore, an alternative view of CCA was derived, which allows a new kind of interpretation of the results and the use of CCA in feature selection with dependency as the criterion of relevance. Traditionally, latent topic models are one-way clustering models; that is, one of the variables is clustered by the latent variable. We proposed a latent topic model that generalizes in two ways and showed that when only a small amount of data has been gathered, the two-way generalization becomes necessary. In the field of brain imaging, natural stimuli in fMRI studies imitate real-life situations and challenge the analysis methods used. A novel two-step framework was proposed for analyzing brain imaging measurements from fMRI. This framework seems promising for the analysis of brain signal data measured under natural stimulation, once such measurements become more widely available.
- Probabilistic analysis of the human transcriptome with side information
Aalto-yliopiston teknillinen korkeakoulu | Doctoral dissertation (article-based) (2010) Lahti, Leo

Recent advances in high-throughput measurement technologies and efficient sharing of biomedical data through community databases have made it possible to investigate the complete collection of genetic material, the genome, which encodes the heritable genetic program of an organism. This has opened up new views into the study of living organisms, with a profound impact on biological research. Functional genomics is a subdiscipline of molecular biology that investigates the functional organization of genetic information. This thesis develops computational strategies to investigate a key functional layer of the genome, the transcriptome. The time- and context-specific transcriptional activity of the genes regulates the function of living cells through protein synthesis. Efficient computational techniques are needed to extract useful information from high-dimensional genomic observations associated with high levels of complex variation. Statistical learning and probabilistic models provide the theoretical framework for combining statistical evidence across multiple observations with the wealth of background information in genomic data repositories. This thesis addresses three key challenges in transcriptome analysis. First, new preprocessing techniques that utilize side information in genomic sequence databases and microarray collections are developed to improve the accuracy of high-throughput microarray measurements. Second, a novel exploratory approach is proposed to construct a global view of cell-biological network activation patterns and functional relatedness between tissues across the normal human body. Information in genomic interaction databases is used to derive constraints that focus the modeling on those parts of the data supported by known or potential interactions between the genes, and to scale up the analysis.
The third contribution develops novel approaches for modeling dependencies between co-occurring measurement sources. The methods are used to study cancer mechanisms and transcriptome evolution; integrative analysis of the human transcriptome together with other layers of genomic information allows the identification of functional mechanisms and interactions that could not be detected from the individual measurement sources alone. Open source implementations of the key methodological contributions have been released to facilitate their adoption by the research community.