Browsing by Author "Kaski, Samuel, Prof., Aalto University, Department of Information and Computer Science, Finland"
- Bayesian latent variable models for learning dependencies between multiple data sources
School of Science | Doctoral dissertation (article-based) (2014) Virtanen, Seppo
Machine learning focuses on automated large-scale data analysis that extracts useful information from data collections. The data are frequently high-dimensional and may correspond, for example, to images, text documents, or measurements of neural responses. In many applications data can be collected from multiple data sources, that is, views. This thesis presents novel machine learning methods for analyzing multiple data sources, especially for understanding relationships between them. The analysis provides a comprehensive summary of the data-generating process, which may be used for exploring the relationships and for predicting observations of one or more sources. The methods are based on two assumptions: each view provides complementary information about the data-generating process, and each view is corrupted by noise. The methods aim to utilize all available information (views), accumulating partly overlapping information and reducing view-specific noise. In particular, this thesis presents several Bayesian latent variable models that learn a decomposition of latent variables; some of the variables capture information shared by multiple sources, whereas the remaining variables explain noise in each view. The latent variables may be efficiently inferred from the observed data by using sparsity assumptions and Bayesian inference. The models are applied to analyzing neural responses to natural stimulation as well as to jointly modeling images and text documents.
- Bayesian Multi-Way Models for Data Translation in Computational Biology
School of Science | Doctoral dissertation (article-based) (2014) Suvitaival, Tommi
The inference of differences between samples is a fundamental problem in computational biology and many other sciences. Hypotheses about a complex system can be studied via a controlled experiment. The design of the controlled experiment sets the conditions, or covariates, for the system in such a way that their effect on the system can be studied through independent measurements. When the number of measured variables is high and the variables are correlated, the assumptions of standard statistical methods are no longer valid. In this thesis, computational methods are presented for this problem and its follow-up problems. A similar experiment done on different systems, such as multiple biological species, leads to multiple "views" of the experiment outcome, observed in different data spaces or domains. However, cross-domain experimentation brings uncertainty about the similarity of the systems and their outcomes. Thus, a new question emerges: which of the covariate effects generalize across the domains? In this thesis, novel computational methods are presented for the integration of data views, in order to detect weaker covariate effects and to generalize covariate effects to views with unobserved data. Five main contributions to the inference of covariate effects are presented: (1) When the data are high-dimensional and collinear, the problem of false discovery is curbed by assuming a cluster structure on the observed variables and by handling the uncertainty with Bayesian methods. (2) Prior information about the measurement process can be used to further improve the inference of covariate effects for metabolomic experiments by modeling the multiple layers of uncertainty in the mass spectral data.
(3-4) When the data come from multiple measurement sources on the same subjects, that is, from data views with co-occurring samples, it is unknown whether the covariate effects generalize across the views and whether the outcome of a new intervention can be generalized to a view with no observed data on that intervention. These problems are shown to be solvable by assuming a shared generative process for the multiple data views. (5) When the data come from different domains with no co-occurring samples, the inference of between-domain dependencies is not possible in the same way as with co-occurring samples. It is shown that even in this situation, it is possible to identify covariate effects that generalize across the domains when the experimental design at least weakly binds the domains together. Effects that generalize are then identified by assuming a shared generative process for the covariate effects.
- Dimensionality reduction methods for fMRI analysis and visualization
School of Science | Licentiate thesis (2015) Nybo, Kristian
The need to model and understand high-dimensional, noisy data sets is common in many domains these days, among them neuroimaging and fMRI analysis. Dimensionality reduction and variable selection are two common strategies for dealing with high-dimensional data, either as a pre-processing step prior to further analysis or as an analysis step in itself. This thesis discusses both dimensionality reduction and variable selection, with a focus on fMRI analysis, visualization, and applications of visualization in fMRI analysis. Three new algorithms are introduced. The first algorithm uses a sparse Canonical Correlation Analysis model and a high-dimensional stimulus representation to find relevant voxels (variables) in fMRI experiments with complex natural stimuli. Experiments on a data set involving music show that the algorithm successfully retrieves voxels relevant to the experimental condition. The second algorithm, NeRV, is a dimensionality reduction method for visualizing high-dimensional data using scatterplots. A simple abstract model of the way a human studies a scatterplot is formulated, and NeRV is derived as an algorithm for producing optimal visualizations in terms of this model. Experiments show that NeRV is superior to conventional dimensionality reduction methods in terms of this model. NeRV is also used to perform a novel form of exploratory data analysis on the fMRI voxels selected by the first algorithm; the analysis simultaneously demonstrates the usefulness of NeRV in practice and offers further insights into the performance of the voxel selection algorithm. The third algorithm, LDA-NeRV, combines a Bayesian latent-variable model for graphs with NeRV to produce one of the first principled graph drawing methods. Experiments show that LDA-NeRV is capable of visualizing structure that conventional graph drawing methods fail to reveal.
- Probabilistic components of molecular interactions and drug responses
School of Science | Doctoral dissertation (article-based) (2014) Parkkinen, Juuso
A fundamental question in medicine is how cancer and other complex diseases operate on the molecular level. Identifying the detailed mechanisms and interactions of how diseases progress and respond to drug treatments is essential for developing effective therapies. High-throughput molecular profiling technologies have provided vast amounts of measurement data on these phenomena. However, making sense of these masses of data is far from straightforward and requires advanced computational analysis methods. Probabilistic component models have proven to be an effective tool for analysing and integrating high-dimensional and noisy molecular profiling data sources, such as gene expression. Such models can identify coherent components from the data, and interpreting these components provides insights into the underlying biological processes, such as disease progression and drug responses. In this thesis, probabilistic component models are applied and extended to identify and analyse molecular interaction and drug response patterns. Identifying functionally coherent gene modules from high-throughput measurements is a central task in many biomedical applications. In this thesis, an earlier component model for network data is extended to capture functional modules from combinations of gene expression and protein interaction data. The identified modules provide hypotheses for novel molecular pathways and protein functions. High-throughput drug treatment measurements have made possible the detailed analysis of molecular drug responses and toxicity. In this thesis, probabilistic component models are applied to detect coherent drug response patterns from gene expression data. These patterns provide detailed insights into drug mechanisms of action and are highly applicable in cancer therapy development.
Moreover, by associating the identified drug response components with toxicological outcomes, the first comprehensive view of molecular toxicogenomic responses is constructed, with high performance in drug toxicity prediction.
- Probabilistic Modelling of Multiresolution Biological Data
School of Science | Doctoral dissertation (article-based) (2014) Adhikari, Prem Raj
When the measurements from ever-improving measurement technology are accumulated over a period of time, the result is a collection of data in different representations. However, most machine learning and data mining algorithms, in their standard form, are designed to operate on data in a single representation. This thesis proposes machine learning and data mining algorithms to analyze data in different representations with respect to resolution within a single analysis. The novel algorithms proposed to analyze multiresolution data are in the fields of probabilistic modelling and semantic data mining. First, three different deterministic data transformation methods are proposed to transform data across different resolutions. After the data transformation, the resulting data in the same resolution are integrated and modeled using mixture models. Second, similar mixture components in a mixture model are repeatedly merged one by one to generate a chain of mixture models. A new fast approximation of the KL-divergence is derived to determine the similarity of the mixture components. The chain of generated mixture models is useful for comparison, for example, in model selection. Third, mixture components in different resolutions are iteratively merged to model multiresolution data, generating models in each modeled resolution that incorporate information from data in other resolutions. Fourth, a single multiresolution mixture model with multiresolution mixture components is proposed, whose mixture components independently have the capabilities of a Bayesian network. Finally, a three-part methodology consisting of clustering using mixture models, rule learning using semantic subgroup discovery, and pattern visualization using banded matrices is developed for comprehensive analysis of multiresolution data.
The multiresolution data analysis methods presented in this thesis improve performance in comparison with their single-resolution counterparts. Furthermore, the developed methods aim to make the results understandable to domain experts. Therefore, the developed methods are a useful addition to the analysis of chromosomal aberration patterns and to cancer research in general.
- Retrieval of Gene Expression Measurements with Probabilistic Models
School of Science | Doctoral dissertation (article-based) (2014) Faisal, Ali
A crucial problem in current biological and medical research is how to utilize the diverse set of existing biological knowledge and heterogeneous measurement data in order to gain insights into new data. As datasets continue to be deposited in public repositories, it is becoming important to develop search engines that can efficiently integrate existing data and search for relevant earlier studies given a new study. The search task is encountered in several biological applications, including cancer genomics, pharmacokinetics, personalized medicine and meta-analysis of functional genomics. Most existing search engines rely on classical keyword- or annotation-based retrieval, which is limited to discovering known information and requires careful downstream annotation of the data. Data-driven, model-based methods that retrieve studies based on similarities in the actual measurement data have a greater potential for uncovering novel biological insights. In particular, probabilistic modeling provides promising model-based tools due to its ability to encode prior knowledge, represent uncertainty in model parameters and handle noise associated with the data. By introducing latent variables it is further possible to capture relationships in data features in the form of meaningful biological components underlying the data. This thesis adapts existing probabilistic models and develops new ones for retrieval of relevant measurement data in three different cases of background repositories. The first case is a background collection of data samples where each sample is represented by a single data type. The second case is a collection of multimodal data samples where each sample is represented by more than one data type. The third case is a background collection of datasets where each dataset, in turn, is a collection of multiple samples.
In all three setups, the proposed models are evaluated quantitatively, and case studies demonstrate that the models facilitate interpretable retrieval of relevant data, rigorous integration of diverse information sources and learning of latent components from partly related dataset collections.