Machine Learning for Small Molecule Identification

School of Science | Doctoral thesis (article-based) | Defence date: 2017-03-30

Date

2017

Language

en

Pages

61 + app. 99

Series

Aalto University publication series DOCTORAL DISSERTATIONS, 25/2017

Abstract

Metabolites are small molecules involved in the biological processes of organisms. For example, ethylene serves as a plant hormone that stimulates or regulates the opening of flowers, the ripening of fruit and the shedding of leaves. Metabolite identification is the task of determining the molecular structure of a metabolite contained in a biological sample, and it is considered a major bottleneck in metabolomics. The backbone analytical technology for metabolite identification is tandem mass spectrometry. It consists of two rounds of mass spectrometry: in the first round, all the metabolites in a sample are measured, and one particular metabolite of interest is selected and fragmented by a process of dissociation. In the second round, the fragments and their abundances are measured. The resulting tandem mass spectra contain information on the structure and composition of the molecules. This thesis aims to solve the problem of identifying the molecular structures that produce the tandem mass spectra observed from a biological sample. Traditional methods are mostly based on matching the observed tandem mass spectra to reference spectra in a database. However, these methods fail if there are no reference spectra for the molecules in the sample, which is not uncommon: according to a recent study, only 220,000 spectra representing 20,000 molecules have been measured and annotated, while the compound database PubChem records more than 60 million molecules. To alleviate this problem, many recent works have focused on so-called in silico fragmentation, where fragmentation is first simulated computationally for the molecules in a molecular database and the simulated fragments are then compared to the measured tandem mass spectra.
The main contribution of this thesis is to open a novel direction that bridges the gap between the limited spectral databases and the vast molecular databases with the help of molecular fingerprints. Molecular fingerprints are binary representations that encode the structure or properties of a molecule. Kernel-based machine learning methods are used to predict molecular fingerprints from tandem mass spectra. The predicted fingerprints are then matched against the fingerprints of molecules in a molecular database to derive an identification. Multiple kernel learning is also proposed to combine different views of tandem mass spectra. Finally, a one-step approach based on input output kernel regression is applied to the problem; it becomes the new state of the art, as demonstrated in several benchmarks including the recent Critical Assessment of Small Molecule Identification (CASMI) 2016 challenge.
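The matching step described above, in which a predicted fingerprint is compared against the fingerprints of database molecules, can be illustrated with a minimal sketch. It assumes fingerprints are plain binary lists and uses Tanimoto similarity as the scoring function; this scoring choice, the function names and the toy data are illustrative assumptions, not the scoring used in the thesis publications.

```python
from typing import Dict, List, Tuple


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto (Jaccard) similarity between two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Two all-zero fingerprints are treated as identical (assumption).
    return both / either if either else 1.0


def rank_candidates(
    predicted: List[int], candidates: Dict[str, List[int]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the predicted fingerprint, best first."""
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: a fingerprint predicted from a spectrum and
# three candidate molecules retrieved from a molecular database.
predicted = [1, 0, 1, 1, 0, 0]
candidates = {
    "candidate_A": [1, 0, 1, 1, 0, 0],
    "candidate_B": [1, 1, 0, 1, 0, 0],
    "candidate_C": [0, 1, 0, 0, 1, 1],
}

ranking = rank_candidates(predicted, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setting the top-ranked candidate would be reported as the identification; in practice the predicted fingerprint bits are uncertain, so a scoring function that accounts for prediction reliability may be preferable to plain Tanimoto similarity.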

Description

Supervising professor

Rousu, Juho, Prof., Aalto University, Department of Computer Science, Finland

Keywords

machine learning, metabolite identification, kernels, multiple kernel learning, structured prediction, tandem mass spectrometry

Parts

  • [Publication 1]: Markus Heinonen, Huibin Shen, Nicola Zamboni, Juho Rousu. Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics, 28, 18, 2333-2341, Sep. 2012.
    DOI: 10.1093/bioinformatics/bts437
  • [Publication 2]: Huibin Shen, Nicola Zamboni, Markus Heinonen, Juho Rousu. Metabolite identification through machine learning–tackling CASMI challenge using FingerID. Metabolites, 3, 2, 484-505, Jun. 2013.
    DOI: 10.3390/metabo3020484
  • [Publication 3]: Huibin Shen, Kai Dührkop, Sebastian Böcker, Juho Rousu. Metabolite identification through multiple kernel learning on fragmentation trees. Bioinformatics, 30, 12, i157-i164, Jun. 2014.
    DOI: 10.1093/bioinformatics/btu275
  • [Publication 4]: Kai Dührkop, Huibin Shen, Marvin Meusel, Juho Rousu, Sebastian Böcker. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112, 41, 12580-12585, Oct. 2015.
    DOI: 10.1073/pnas.1509788112
  • [Publication 5]: Céline Brouard, Huibin Shen, Kai Dührkop, Florence d’Alché-Buc, Sebastian Böcker, Juho Rousu. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics, 32, 12, i28-i36, Jun. 2016.
    DOI: 10.1093/bioinformatics/btw246
  • [Publication 6]: Huibin Shen, Sandor Szedmak, Céline Brouard, Juho Rousu. Soft Kernel Target Alignment for Two-stage Multiple Kernel Learning. In 19th International Conference on Discovery Science, Bari, Italy, 427-441, Oct. 2016.
    DOI: 10.1007/978-3-319-46307-0_27
