Gaussian Process Modelling of Genome-wide High-throughput Sequencing Time Series
School of Science | Doctoral thesis (article-based) | Defence date: 2018-12-14
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Date
2018
Language
en
Pages
113 + app. 79
Series
Aalto University publication series DOCTORAL DISSERTATIONS, 245/2018
Abstract
During the last decade, high-throughput sequencing (HTS) has become the mainstream technique for simultaneously studying an enormous number of genetic features present in the genome, transcriptome, or epigenome of an organism. Besides static experiments, which compare genetic features between two or more distinct biological conditions, time series experiments, which monitor genetic features over time, provide valuable information about the dynamics of complex mechanisms in various biological processes. However, analysis of the currently available HTS time series data sets involves challenges, as these data sets often consist of short and irregularly sampled time series which lack sufficient biological replication. In addition, quantification of the genetic features from HTS data is inherently subject to uncertainty due to the limitations of HTS platforms, such as short read lengths and varying sequencing depths. This thesis presents a Gaussian process (GP)-based approach for modelling and ranking HTS time series by taking into account the characteristics of the data sets. GPs are one of the most suitable tools for modelling sparse and irregularly sampled time series, and they can capture the temporal correlations between observations at different time points via suitable covariance functions. On the other hand, naive application of GP modelling may suffer from over-fitting, leading to an increased number of false positives if the characteristics of the data are not taken into account. In this thesis, this problem has been mitigated by regularizing the models through bounds on the hyperparameter values of the GP prior. Firstly, the range of the length-scale parameters has been restricted to values compatible with the spacing of the sampled time points.
Secondly, application-dependent variance models have been developed to infer the uncertainty levels on the observations, which have then been incorporated into the GP models as lower bounds for the noise variance. Regularizing the GP models by setting realistic bounds on their hyperparameters makes the GP models more robust against the uncertainty in the data without increasing the complexity of the models, and thus makes the method applicable to large genome-wide studies. The publications included in this thesis suggest a number of techniques for modelling the variance in RNA-seq and Pool-seq applications, which are the HTS techniques specifically designed to sequence RNA transcripts and pooled DNA sequences, respectively. The variance models utilize information obtained through the pre-processing stages of the data, depending on, for example, the number of replicates or varying sequencing depth levels. Performance evaluation of the GP models under different experiment settings indicates that incorporating the variance into the GP models can yield a higher average precision than the naive application of GP modelling. Motivated by these results, an open-source software package, GPrank, has been implemented in R in order to enable researchers to easily apply the proposed GP-based method to their own HTS time series data sets for detecting the temporally most active genetic features.
Description
Supervising professor
Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland
Thesis advisor
Honkela, Antti, Asst. Prof., University of Helsinki, Finland
Keywords
gaussian process, high-throughput sequencing, time series, probabilistic modelling
Parts
- [Publication 1]: Hande Topa, Antti Honkela. Gaussian process modelling of multiple short time series. In ESANN 2015 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges (Belgium), i6doc.com publ., pp. 83-88, April 2015.
- [Publication 2]: Hande Topa, Ágnes Jónás, Robert Kofler, Carolin Kosiol, Antti Honkela. Gaussian process test for high-throughput sequencing time series: application to experimental evolution. Bioinformatics, 31(11):1762-1770, 2015. DOI: 10.1093/bioinformatics/btv014
- [Publication 3]: Hande Topa, Antti Honkela. Analysis of differential splicing suggests different modes of short-term splicing regulation. Bioinformatics, 32(12):i147-i155, 2016. Full Text in Aalto/Acris: http://urn.fi/URN:NBN:fi:aalto-201703283214. DOI: 10.1093/bioinformatics/btw283
- [Publication 4]: Hande Topa, Antti Honkela. GPrank: an R package for detecting dynamic elements from genome-wide time series. BMC Bioinformatics, 19:367, 2018. Full Text in Aaltodoc/Acris: http://urn.fi/URN:NBN:fi:aalto-201810245514. DOI: 10.1186/s12859-018-2370-4
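
The regularization described in the abstract (a bounded length-scale and a lower bound on the noise variance) can be illustrated with a minimal sketch. This is not the GPrank implementation, which is an R package. The function names, the squared-exponential kernel, the grid search, the choice of the smallest sampling gap and the total time span as length-scale bounds, and all example numbers are assumptions made here for illustration only.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_marginal_likelihood(t, y, length_scale, signal_var, noise_vars):
    """Zero-mean GP log marginal likelihood with a squared-exponential kernel."""
    n = len(t)
    k = [[signal_var * math.exp(-0.5 * ((t[i] - t[j]) / length_scale) ** 2)
          + (noise_vars[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    low = cholesky(k)
    # Forward substitution: solve low * z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(low[i][m] * z[m] for m in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return (-0.5 * sum(v * v for v in z) - 0.5 * log_det
            - 0.5 * n * math.log(2.0 * math.pi))


def log_grid(lo, hi, n):
    """n logarithmically spaced values from lo to hi."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]


def fit_bounded(t, y, noise_floor, n_grid=12):
    """Grid-search GP hyperparameters inside the assumed bounds.

    Length-scale: between the smallest gap and the full span of the time points.
    Noise variance: per-observation floor plus a non-negative shared term,
    so it can never fall below the floor supplied by a variance model.
    """
    mean = sum(y) / len(y)
    yc = [v - mean for v in y]  # centre the data for the zero-mean GP
    data_var = sum(v * v for v in yc) / len(yc)
    gaps = [b - a for a, b in zip(t, t[1:])]
    l_min, l_max = min(gaps), t[-1] - t[0]
    extras = [0.0] + log_grid(1e-3 * data_var, data_var, n_grid)
    best = None
    for ell in log_grid(l_min, l_max, n_grid):
        for sf2 in log_grid(0.1 * data_var, 10.0 * data_var, n_grid):
            for extra in extras:
                noise = [f + extra for f in noise_floor]
                ll = log_marginal_likelihood(t, yc, ell, sf2, noise)
                if best is None or ll > best["log_lik"]:
                    best = {"length_scale": ell, "signal_var": sf2,
                            "noise_vars": noise, "log_lik": ll}
    best["bounds"] = (l_min, l_max)
    return best


# Hypothetical example: six irregularly sampled time points for one feature,
# with made-up per-observation variances standing in for a variance model.
times = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]
values = [0.10, 0.18, 0.30, 0.52, 0.75, 0.90]
noise_floor = [0.004, 0.003, 0.003, 0.002, 0.002, 0.004]

fit = fit_bounded(times, values, noise_floor)
print(fit["length_scale"], fit["signal_var"], fit["log_lik"])
```

A grid search is used here only to make the bounds explicit; a gradient-based optimizer with box constraints would serve the same purpose. Ranking features, as the abstract describes, would additionally require comparing this fit against a time-independent model, which the sketch omits.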