Machine learning methods for structural elucidation in untargeted metabolomics

School of Science | Doctoral thesis (article-based) | Defence date: 2023-01-13
108 + app. 76
Aalto University publication series DOCTORAL THESES, 177/2022
The structural elucidation of small molecules remains a bottleneck in untargeted metabolomics and hence limits many research fields, such as drug discovery, biotechnology or environmental science. The chemical space of small molecules is vast and highly complex, making structural elucidation a challenging task. Liquid chromatography (LC) coupled with tandem mass spectrometry (MS²) is one of the leading analysis platforms in untargeted metabolomics. This platform, called LC-MS², allows for high-throughput analysis and can detect thousands of molecules simultaneously. However, only a small fraction of the detected molecules can be elucidated using reference databases. For the remaining "dark matter", automated computational tools that use large structure databases for sample annotation are indispensable. This thesis introduces different machine learning frameworks for the prediction of molecular structure annotations from LC-MS² data. Publication I presents a novel kernel-based method for molecular structure prediction given an MS² spectrum. It integrates structure databases into the model training instead of using them only in the prediction phase. This is achieved by so-called Magnitude-Preserving Input Output Kernel Regression, which can significantly improve the structure annotation accuracy compared to state-of-the-art methods. LC retention times (RTs) are a valuable information source and readily available in LC-MS². However, RTs remain underutilized in automated structure annotation tools. One reason for this is that RTs are specific to the LC method and hence generally not directly transferable between analysis platforms. Publication II introduces a novel framework for retention order (RO) prediction using a Ranking Support Vector Machine. Retention orders are better preserved across LC methods than RTs. We demonstrate that our model, integrating multiple RT datasets, predicts ROs with high accuracy.
Publication III presents a Markov Random Field model integrating RO and MS² information for structure annotation. It jointly annotates the molecules in an LC-MS² dataset, thereby exploiting pairwise RO dependencies between the molecules. We demonstrate that the integration of ROs can significantly improve the structure annotations. Publication IV introduces a framework for the joint prediction of structure annotations using a Structured Support Vector Machine model called LC-MS²Struct. The novel LC-MS²Struct model is trained using ground-truth annotated full LC-MS² datasets and learns to optimally combine the RO and MS² information. LC-MS²Struct outperforms alternative approaches by a large margin and annotates stereoisomers with high accuracy. The methods presented in this thesis are of significance for the metabolomics community as they improve the structure annotations in LC-MS² analyses and demonstrate how LC RTs can be integrated into automated workflows.
Supervising professor
Rousu, Juho, Prof., Aalto University, Department of Computer Science, Finland
machine learning, computational metabolomics, kernel methods
Other note
  • [Publication 1]: Céline Brouard, Eric Bach, Sebastian Böcker and Juho Rousu. Magnitude-Preserving Ranking for Structured Outputs. In Proceedings of the Ninth Asian Conference on Machine Learning (ACML 2017), Seoul, Korea, 2017. Proceedings of Machine Learning Research (PMLR) Volume 77, Pages 407–422, November 2017.
  • [Publication 2]: Eric Bach, Sandor Szedmak, Céline Brouard, Sebastian Böcker and Juho Rousu. Liquid-chromatography retention order prediction for metabolite identification. In Proceedings of the 17th European Conference on Computational Biology (ECCB 2018), Athens, Greece, 2018. Bioinformatics Volume 34, Issue 17, Pages i875–i883, September 2018.
    DOI: 10.1093/bioinformatics/bty590
  • [Publication 3]: Eric Bach, Simon Rogers, John Williamson and Juho Rousu. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics, Volume 37, Issue 12, Pages 1724–1731, June 2021.
    DOI: 10.1093/bioinformatics/btaa998
  • [Publication 4]: Eric Bach, Emma L. Schymanski and Juho Rousu. Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data. bioRxiv, Accepted for publication, October 2022.
    DOI: 10.1101/2022.02.11.480137