StrainMiner - Data mining and discrete optimizations for strains separation in metagenomes using long reads

School of Science | Master's thesis

Date

2023-10-09

Major/Subject

Data Science

Mcode

SCI3115

Degree programme

Master's Programme in ICT Innovation

Language

en

Pages

58+10

Abstract

This study presents StrainMiner, an approach that combines data mining and discrete optimization to separate strains in microbial communities. Characterizing the genetic diversity and functional potential of microbial populations in metagenomic samples depends on accurate strain separation. StrainMiner uses biclustering, which clusters rows (DNA sequences) and columns (DNA positions) simultaneously, to identify cohesive genetic features that distinguish strains. The algorithm employs hierarchical clustering and k-nearest neighbors imputation for data preparation. An integer linear programming model then searches for maximum quasi-bicliques in order to obtain optimal bipartitions. Experimental evaluations on simulated and real-world metagenomic data demonstrate StrainMiner's ability to separate strains accurately, even in datasets with high noise levels and a large number of strains. StrainMiner is an early version, and future integration with HairSplitter, an end-to-end tool for strain separation currently in development at IRISA, is planned.
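
To make the quasi-biclique idea in the abstract concrete, the following is a minimal sketch and not the thesis implementation. It assumes a binary reads-by-positions matrix (1 meaning the read carries the variant at that position), defines a quasi-biclique as a set of rows and columns whose submatrix contains at least a fraction 1 - eps of ones, and replaces the integer linear programming model with an exhaustive search that is only practical for toy inputs. The names READS, max_quasi_biclique, bipartition, and eps are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Toy reads-by-positions matrix (assumed encoding: 1 = variant, 0 = reference).
# Reads 0-2 look like one strain (read 2 has one noisy entry), reads 3-5 like another.
READS = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]


def max_quasi_biclique(matrix, eps):
    """Return (rows, cols) of a largest-area submatrix with density of ones >= 1 - eps.

    Exhaustive search over all row and column subsets; a stand-in for an ILP
    solver, usable only on very small matrices.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best_area, best_rows, best_cols = 0, (), ()
    for r in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), r):
            for c in range(1, n_cols + 1):
                area = r * c
                if area <= best_area:
                    continue  # cannot improve on the current best
                for cols in combinations(range(n_cols), c):
                    ones = sum(matrix[i][j] for i in rows for j in cols)
                    if ones >= (1 - eps) * area:
                        best_area, best_rows, best_cols = area, rows, cols
                        break  # same area for every column set of this size
    return best_rows, best_cols


def bipartition(matrix, eps):
    """Split reads into those inside the quasi-biclique and the remaining reads."""
    rows, cols = max_quasi_biclique(matrix, eps)
    inside = list(rows)
    outside = [i for i in range(len(matrix)) if i not in rows]
    return inside, outside, list(cols)


if __name__ == "__main__":
    inside, outside, positions = bipartition(READS, eps=0.2)
    print("reads in quasi-biclique:", inside)
    print("remaining reads:", outside)
    print("discriminating positions:", positions)
```

With eps = 0.2 the noisy read 2 is still grouped with reads 0 and 1, because the 3 by 3 submatrix on positions 0 to 2 has 8 ones out of 9 entries. With eps = 0 the search would only accept submatrices made entirely of ones, which is why a tolerance is useful on noisy long reads.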

Supervisor

Lähdesmäki, Harri

Thesis advisor

Andonov, Rumen

Keywords

quasi-biclique, K-nearest neighbor imputation, hierarchical clustering, integer linear programming, metagenomics, strains separation
