Pool-seq analysis for the identification of polymorphisms in bacterial strains and utilization of the variants for protein database creation

Perustieteiden korkeakoulu (School of Science) | Master's thesis
Date
2016-10-27
Major/Subject
Bioinformatics
Mcode
MBI
Degree programme
Master's Programme in Bioinformatics (MBI)
Language
en
Pages
58 + 10
Abstract
Pooled sequencing (Pool-seq) is the sequencing of a single library that contains DNA pooled from different samples. It is a cost-effective alternative to individual whole-genome sequencing. In this study, we used Pool-seq to sequence 100 Streptococcus pyogenes strains in two pools in order to identify polymorphisms and create variant protein databases for shotgun proteomics analysis. We evaluated the efficacy of the pooling strategy and of the four variant calling tools by using individual sequence data from six of the pooled strains as well as 3407 publicly available strains from the European Nucleotide Archive. In addition to the raw sequence data from the public repository, we extracted polymorphisms from 19 publicly available complete S. pyogenes genomes and compared these variations against our pools. In total, 78955 variants (76981 SNPs and 1725 INDELs) were identified from the two pools. Of these, ~60.5% were also found in the complete genomes and 95.7% in the European Nucleotide Archive data. Collectively, the four variant calling tools recovered the majority (~96.5%) of the variants found in the six individual strains, suggesting that Pool-seq is a robust approach for variation discovery. Variants from the pools that fell in coding regions and had nonsynonymous effects constituted 24% of the total and were used to create variant protein databases for shotgun proteomics analysis. These variant databases improved protein identification in mass spectrometry analysis.
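The final step described in the abstract, turning nonsynonymous coding variants into variant protein database entries, can be illustrated with a minimal sketch. This is not the thesis's pipeline: the function names (`translate`, `variant_entry`), the FASTA header format, the 0-based position convention, and the toy sequence are all assumptions made for illustration. It handles a single SNP on the forward strand only and uses the standard codon-to-amino-acid mapping.

```python
from itertools import product

# Standard codon-to-amino-acid mapping, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def variant_entry(gene_id, cds, pos, ref, alt):
    """Return a FASTA entry for the variant protein, or None if the SNP is synonymous.

    pos is a 0-based offset within the coding sequence (an assumed convention).
    """
    if cds[pos] != ref:
        raise ValueError("reference base does not match the coding sequence")
    mutated = cds[:pos] + alt + cds[pos + 1:]
    alt_protein = translate(mutated)
    if alt_protein == translate(cds):
        return None  # synonymous change: no new database entry needed
    # Hypothetical header format: gene id plus ref base, 1-based position, alt base.
    return f">{gene_id}_{ref}{pos + 1}{alt}\n{alt_protein}"


if __name__ == "__main__":
    toy_cds = "ATGGCTAAATAA"  # toy sequence, not real S. pyogenes data
    print(variant_entry("geneX", toy_cds, 4, "C", "A"))
    print(variant_entry("geneX", toy_cds, 5, "T", "C"))
```

In this sketch only variants that change the translated protein produce an entry, which mirrors the abstract's restriction to nonsynonymous variants in coding regions; INDELs and reverse-strand genes would need additional handling.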
Description
Supervisor
Lähdesmäki, Harri
Thesis advisor
Jokiranta, Sakari T.
Keywords
pooled sequencing, variant protein database, variant calling, shotgun proteomics