Browsing by Author "Bach, Eric"
Now showing 1 - 6 of 6
Item Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data (SPRINGER, 2022-12) Bach, Eric; Schymanski, Emma L.; Rousu, Juho; Department of Computer Science; Helsinki Institute for Information Technology (HIIT); Professorship Rousu Juho; Computer Science Professors; Computer Science - Computational Life Sciences (CSLife); Computer Science - Artificial Intelligence and Machine Learning (AIML); Computer Science - Large-scale Computing and Data Analysis (LSCA); University of Luxembourg
Structural annotation of small molecules in biological samples remains a key bottleneck in untargeted metabolomics, despite rapid progress in predictive methods and tools during the past decade. Liquid chromatography–tandem mass spectrometry, one of the most widely used analysis platforms, can detect thousands of molecules in a sample, the vast majority of which remain unidentified even with best-of-class methods. Here we present LC-MS2Struct, a machine learning framework for structural annotation of small-molecule data arising from liquid chromatography–tandem mass spectrometry (LC-MS2) measurements. LC-MS2Struct jointly predicts the annotations for a set of mass spectrometry features in a sample, using a novel structured prediction model trained to optimally combine the output of state-of-the-art MS2 scorers and observed retention orders. We evaluate our method on a dataset covering all publicly available reversed-phase LC-MS2 data in the MassBank reference database, including 4,327 molecules measured using 18 different LC conditions from 16 contributors, greatly expanding the chemical analytical space covered in previous multi-MS2 scorer evaluations. LC-MS2Struct obtains significantly higher annotation accuracy than earlier methods and improves the annotation accuracy of state-of-the-art MS2 scorers by up to 106%.
The use of stereochemistry-aware molecular fingerprints improves prediction performance, which highlights limitations in existing approaches and has strong implications for future computational LC-MS2 developments.

Item Liquid-chromatography retention order prediction for metabolite identification (2018-09-01) Bach, Eric; Szedmak, Sandor; Brouard, Celine; Boecker, Sebastian; Rousu, Juho; Department of Computer Science; Professorship Rousu Juho; Helsinki Institute for Information Technology (HIIT); Friedrich Schiller University Jena
Motivation: Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. Results: We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: We show that retention order is much better conserved between instruments than retention time. To this end, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn retention behaviour of molecules from heterogeneous retention time data.
Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run.

Item Machine learning methods for structural elucidation in untargeted metabolomics (Aalto University, 2022) Bach, Eric; Tietotekniikan laitos; Department of Computer Science; Kernel Methods, Pattern Analysis and Computational Biology (KEPACO); Perustieteiden korkeakoulu; School of Science; Rousu, Juho, Prof., Aalto University, Department of Computer Science, Finland
The structural elucidation of small molecules remains a bottleneck in untargeted metabolomics and hence is a limitation in many research fields, such as drug discovery, biotechnology, and environmental science. The chemical space of small molecules is vast and highly complex, making structural elucidation a challenging task. Liquid chromatography (LC) coupled with tandem mass spectrometry (MS²) is one of the leading analysis platforms in untargeted metabolomics. This platform, called LC-MS², allows for high-throughput analysis and can detect thousands of molecules simultaneously. However, only a small fraction of the detected molecules can be elucidated using reference databases. For the remaining "dark matter", automated computational tools that use large structure databases for sample annotation are indispensable. This thesis introduces different machine learning frameworks for the prediction of molecular structure annotations from LC-MS². Publication I presents a novel kernel-based method for molecular structure prediction given an MS² spectrum. It integrates structure databases into the model training instead of using them only in the prediction phase. This is achieved by so-called Magnitude-Preserving Input Output Kernel Regression, which can significantly improve the structure annotation accuracy compared to state-of-the-art methods. LC retention times (RT) are a valuable information source and readily available in LC-MS².
However, RTs remain underutilized in automated structure annotation tools. One reason for this is that RTs are LC-specific and hence generally not directly transferable between analysis platforms. Publication II introduces a novel framework for retention order (RO) prediction using a Ranking Support Vector Machine. Retention orders are better preserved across LC methods. We demonstrate that our model, integrating multiple RT datasets, predicts ROs with high accuracy. Publication III presents a Markov Random Field model integrating RO and MS² information for structure annotation. It jointly annotates the molecules in an LC-MS² dataset, thereby exploiting pairwise RO dependencies between the molecules. We demonstrate that the integration of ROs can significantly improve the structure annotations. Publication IV introduces a framework for the joint prediction of structure annotations using a Structured Support Vector Machine model called LC-MS²Struct. The novel LC-MS²Struct model is trained using ground-truth annotated full LC-MS² datasets and learns to optimally combine the RO and MS² information. LC-MS²Struct outperforms alternative approaches by a large margin and annotates stereoisomers with high accuracy. The methods presented in this thesis are of significance for the metabolomics community as they improve the structure annotations in LC-MS² analyses and demonstrate how LC RTs can be integrated into automated workflows.

Item Magnitude-Preserving Ranking for Structured Outputs (PMLR, 2017-11-03) Brouard, Celine; Bach, Eric; Böcker, Sebastian; Rousu, Juho; Department of Computer Science; Professorship Rousu Juho; Friedrich Schiller University Jena; Zhang, Min-Ling; Noh, Yung-Kyun
In this paper, we present a novel method for solving structured prediction problems, based on combining Input Output Kernel Regression (IOKR) with an extension of magnitude-preserving ranking to structured output spaces.
In particular, we concentrate on the case where a set of candidate outputs has been given, and the associated pre-image problem calls for ranking the set of candidate outputs. Our method, called magnitude-preserving IOKR, both aims to produce a good approximation of the output feature vectors, and to preserve the magnitude differences of the output features in the candidate sets. For the case where the candidate set does not contain corresponding 'correct' inputs, we propose a method for approximating the inputs through application of IOKR in the reverse direction. We apply our method to two learning problems: cross-lingual document retrieval and metabolite identification. Experiments show that the proposed approach improves performance over IOKR, and in the latter application obtains the current state-of-the-art accuracy.

Item Predicting Drug Bioactivities from Tandem Mass Spectra (2019-06-17) Jägerroos, Vilma; Bach, Eric; Perustieteiden korkeakoulu; Rousu, Juho
Natural products have been the single most productive source of lead compounds for modern drug development. In traditional drug discovery from natural products, concentrated extracts prepared from, e.g., plant samples were screened to determine their bioactivity. These extracts are complicated mixtures. Thus, a signal from the screening assay may be confounded, e.g., by synergistic effects of several compounds. However, isolating each compound from the extract prior to the screening would be inefficient when a large number of samples are screened. The structures of compounds in a natural product sample are unknown in advance. Analytical methods, such as tandem mass spectrometry (MS/MS), are used to identify the constituents of the samples in almost every stage of the drug discovery process from natural products. We argue that predicting bioactivities based on MS/MS spectra could be used to prioritize the most promising samples for further experimental testing.
We introduce two machine learning pipelines to predict bioactivities from MS/MS spectra. First, we predict bioactivities directly from MS/MS spectra. Second, we train a model to identify an unknown compound based on its MS/MS spectrum and another model to predict bioactivities given a compound with known structure. In the testing phase, the structure predicted from an MS/MS spectrum is used to predict bioactivities. In the first pipeline, only drugs that have both an MS/MS spectrum and bioactivities available can be used in training. However, the overlap of MS/MS and bioactivity datasets is limited. An advantage of the second approach is its ability to use drugs that have either an MS/MS spectrum or bioactivities available in training. We show that the second approach results in more accurate predictions compared to the first approach. Additionally, we show that we can build a predictive model even when there is no overlap between the drugs in the MS/MS and bioactivity datasets, which is not possible with the first approach.

Item Probabilistic Framework for Integration of Mass Spectrum and Retention Time Information in Small Molecule Identification (OXFORD UNIV PRESS INC, 2020-11-27) Bach, Eric; Rogers, Simon; Williamson, John; Rousu, Juho; Professorship Rousu Juho; University of Glasgow; Helsinki Institute for Information Technology (HIIT); Department of Computer Science
Motivation: Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications solely based on MS information, such as precursor mass-per-charge and tandem mass spectra (MS2).
Results: We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features have MS2 measurements available besides MS1. Availability and implementation: Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.
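
As a rough illustration of the idea in the abstract above (combining per-feature MS2 scores with pairwise retention order information), the following Python sketch scores joint candidate assignments for a toy LC-MS run. It is not the authors' implementation: the function names `joint_score` and `best_assignment`, the `weight` parameter, the dictionary layout, and all numbers are hypothetical, and exhaustive enumeration stands in for the approximate inference on the Markov random field, so it only works for very small examples.

```python
from itertools import product
from math import log


def joint_score(assignment, features, weight=1.0):
    """Sum of log MS2 scores plus `weight` for every pair of features whose
    predicted retention order agrees with the observed retention times."""
    score = sum(log(cand["ms2"]) for cand in assignment)
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            observed = features[i]["rt"] < features[j]["rt"]
            predicted = assignment[i]["pred"] < assignment[j]["pred"]
            if observed == predicted:
                score += weight
    return score


def best_assignment(features, weight=1.0):
    """Brute-force search over one candidate per feature (toy sizes only)."""
    candidate_sets = [f["candidates"] for f in features]
    return max(product(*candidate_sets),
               key=lambda a: joint_score(a, features, weight))


# Hypothetical data: "rt" is the observed retention time, "ms2" a
# normalized MS2 score, "pred" a predicted retention-order value.
features = [
    {"rt": 2.1, "candidates": [{"name": "A1", "ms2": 0.6, "pred": 5.0},
                               {"name": "A2", "ms2": 0.4, "pred": 1.0}]},
    {"rt": 4.8, "candidates": [{"name": "B1", "ms2": 0.7, "pred": 3.0},
                               {"name": "B2", "ms2": 0.3, "pred": 0.5}]},
]

print([c["name"] for c in best_assignment(features)])              # ['A2', 'B1']
print([c["name"] for c in best_assignment(features, weight=0.0)])  # ['A1', 'B1']
```

With `weight=0.0` the choice falls back to the top MS2 candidate for each feature; with the order term switched on, the assignment whose predicted elution order matches the observed retention times is ranked first.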