Gaussian Process Modelling of Genome-wide High-throughput Sequencing Time Series

School of Science | Doctoral thesis (article-based) | Defence date: 2018-12-14

Date

2018

Language

en

Pages

113 + app. 79

Series

Aalto University publication series DOCTORAL DISSERTATIONS, 245/2018

Abstract

During the last decade, high-throughput sequencing (HTS) has become the mainstream technique for simultaneously studying an enormous number of genetic features present in the genome, transcriptome, or epigenome of an organism. Besides static experiments, which compare genetic features between two or more distinct biological conditions, time series experiments, which monitor genetic features over time, provide valuable information about the dynamics of complex mechanisms in various biological processes. However, analysis of the currently available HTS time series data sets is challenging because these data sets often consist of short and irregularly sampled time series that lack sufficient biological replication. In addition, quantification of genetic features from HTS data is inherently subject to uncertainty due to limitations of HTS platforms such as short read lengths and varying sequencing depths.

This thesis presents a Gaussian process (GP)-based approach for modelling and ranking HTS time series that takes the characteristics of the data sets into account. GPs are one of the most suitable tools for modelling sparse and irregularly sampled time series, and they can capture the temporal correlations between observations at different time points via suitable covariance functions. On the other hand, naive application of GP modelling may suffer from over-fitting, leading to an increased number of false positives, if the characteristics of the data are not taken into account. In this thesis, this problem has been mitigated by regularizing the models through bounds on the hyperparameter values of the GP prior. Firstly, the length-scale parameters have been restricted to values compatible with the spacing of the sampled time points. Secondly, application-dependent variance models have been developed to infer the uncertainty levels of the observations, which have then been incorporated into the GP models as lower bounds for the noise variance. Regularizing the GP models by setting realistic bounds on their hyperparameters makes them more robust against the uncertainty in the data without increasing the complexity of the models, and thus makes the method applicable to large genome-wide studies.

The publications included in this thesis suggest a number of techniques for modelling the variance in RNA-seq and Pool-seq applications, which are HTS techniques designed to sequence RNA transcripts and pooled DNA sequences, respectively. The variance models utilize information obtained through the pre-processing stages of the data, depending on, for example, the number of replicates or varying sequencing depth levels. Performance evaluation of the GP models under different experiment settings indicates that incorporating the variance into the GP models can yield a higher average precision than naive application of GP modelling. Motivated by these results, an open-source software package, GPrank, has been implemented in R to enable researchers to easily apply the proposed GP-based method to their own HTS time series data sets for detecting the most temporally active genetic features.
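The two regularization steps described in the abstract can be illustrated with a minimal sketch. This is not the GPrank interface (GPrank is an R package) and not the thesis's implementation; it is a pure-Python illustration under several assumptions: a squared-exponential covariance, a coarse grid search instead of gradient-based optimization, mean-centred observations, and a ranking score defined as the difference between the best log marginal likelihoods of a time-dependent and a time-independent model. All function names (`rbf_kernel`, `best_fit`, `rank_score`) and the example data are hypothetical.

```python
import math

def rbf_kernel(times, signal_var, length_scale):
    """Squared-exponential covariance matrix over the given time points."""
    return [[signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)
             for b in times] for a in times]

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower

def log_marginal_likelihood(y, cov):
    """log N(y | 0, cov), computed through the Cholesky factor."""
    n = len(y)
    lower = cholesky(cov)
    z = []  # forward substitution: solve lower @ z = y
    for i in range(n):
        z.append((y[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * (quad + log_det + n * math.log(2.0 * math.pi))

def log_grid(lo, hi, steps):
    """Log-spaced grid from lo to hi."""
    return [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]

def best_fit(times, y, obs_var, time_dependent):
    """Best log marginal likelihood over a bounded hyperparameter grid.

    Assumes `times` is sorted. `obs_var` holds per-observation variances,
    standing in for the output of an application-dependent variance model.
    """
    mean = sum(y) / len(y)
    centred = [v - mean for v in y]
    if time_dependent:
        # Bound 1: length-scale restricted to the range spanned by the
        # smallest gap between time points and the total observed duration.
        gaps = [b - a for a, b in zip(times, times[1:]) if b > a]
        length_scales = log_grid(min(gaps), times[-1] - times[0], 8)
        signal_vars = log_grid(1e-3, 10.0, 8)
    else:
        length_scales, signal_vars = [1.0], [0.0]  # noise-only model
    # Bound 2: total noise variance is obs_var plus a non-negative extra
    # term, so it can never fall below the supplied observation variance.
    extra_noise = [0.0] + log_grid(1e-4, 1.0, 8)
    best = -math.inf
    for ls in length_scales:
        for sv in signal_vars:
            for extra in extra_noise:
                cov = rbf_kernel(times, sv, ls)
                for i, v in enumerate(obs_var):
                    cov[i][i] += v + extra
                best = max(best, log_marginal_likelihood(centred, cov))
    return best

def rank_score(times, y, obs_var):
    """Assumed ranking score: time-dependent minus time-independent fit."""
    return (best_fit(times, y, obs_var, True)
            - best_fit(times, y, obs_var, False))

# Hypothetical example: irregular sampling, one trending and one flat series.
times = [0.0, 1.0, 2.0, 4.0, 8.0, 12.0]
obs_var = [0.01] * len(times)
trend = [0.0, 0.5, 1.0, 1.8, 2.5, 3.0]
flat = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0]
score_trend = rank_score(times, trend, obs_var)
score_flat = rank_score(times, flat, obs_var)
print(score_trend, score_flat)
```

In this sketch, a larger score suggests stronger evidence of temporal change. Because the noise variance cannot drop below `obs_var`, fluctuations that are within the stated measurement uncertainty cannot be explained away as signal, which is the intuition behind using variance estimates as lower bounds.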

Description

Supervising professor

Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland

Thesis advisor

Honkela, Antti, Asst. Prof., University of Helsinki, Finland

Keywords

gaussian process, high-throughput sequencing, time series, probabilistic modelling

Parts

  • [Publication 1]: Hande Topa, Antti Honkela. Gaussian process modelling of multiple short time series. In ESANN 2015 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges (Belgium), i6doc.com publ., pp. 83-88, April 2015.
  • [Publication 2]: Hande Topa, Ágnes Jónás, Robert Kofler, Carolin Kosiol, Antti Honkela. Gaussian process test for high-throughput sequencing time series: application to experimental evolution. Bioinformatics, 31(11):1762-1770, 2015.
    DOI: 10.1093/bioinformatics/btv014
  • [Publication 3]: Hande Topa, Antti Honkela. Analysis of differential splicing suggests different modes of short-term splicing regulation. Bioinformatics, 32(12):i147-i155, 2016. Full Text in Aalto/Acris: http://urn.fi/URN:NBN:fi:aalto-201703283214.
    DOI: 10.1093/bioinformatics/btw283
  • [Publication 4]: Hande Topa, Antti Honkela. GPrank: an R package for detecting dynamic elements from genome-wide time series. BMC Bioinformatics, 19:367, 2018. Full Text in Aaltodoc/Acris: http://urn.fi/URN:NBN:fi:aalto-201810245514.
    DOI: 10.1186/s12859-018-2370-4
