[dipl] Perustieteiden korkeakoulu / SCI
Permanent URI for this collection: https://aaltodoc.aalto.fi/handle/123456789/21
Browsing [dipl] Perustieteiden korkeakoulu / SCI by Author "Aakko, Juhani"
- Comparison of normalization and statistical testing methods of 16S rRNA gene sequencing data
Perustieteiden korkeakoulu | Master's thesis (2018-12-10) Lehtinen, Ilona
The decreasing cost and increasing speed of next-generation sequencing techniques now enable more affordable and time-efficient access to human microbiomes. The aim of many 16S ribosomal RNA (rRNA) gene sequencing experiments is to identify the taxa that differ significantly in abundance between two or more conditions. However, increasing awareness of the compositional nature of 16S rRNA gene sequencing data has raised concerns about the validity of conclusions drawn from this type of data. Many early differential abundance testing methods completely ignore compositionality or uneven library sizes. Recently, new methods that take compositionality into account have been developed with the aim of ensuring scale invariance and sub-compositional coherence. However, the inherent problem that compositional data do not contain the information needed for differential abundance testing remains a major challenge. The aim of this thesis was to evaluate different methods for differential abundance testing of 16S rRNA gene sequencing data using both simulated and real data. Overall, we found that the simulation results depend strongly on the simulation design and data characteristics. We confirm that detection performance improved with larger effect sizes and with more samples. The experiment on real data revealed that large differences between the methods remain. Centered log-ratio (CLR) transformation prior to statistical testing produced the highest detection accuracy in our simulation experiments. CLR transformation combined with either the Reproducibility-Optimized Test Statistic (ROTS) or the Wilcoxon rank sum test produced nearly equal results with larger sample sizes. However, with small sample sizes ROTS outperformed the Wilcoxon rank sum test. Thus, based on our results, the use of CLR transformation combined with the ROTS statistical test can be encouraged for differential abundance testing of 16S rRNA gene sequencing data.
- Machine Learning-Based Classification of Clinical Notes to Extract Smoking Status from Electronic Health Records
Perustieteiden korkeakoulu | Master's thesis (2022-01-17) Hölsä, Olivia
Smoking is a significant factor affecting human health and the development of various diseases, but smoking status is usually documented in an unstructured format in electronic health records. The information is therefore difficult to extract for purposes such as analysing the health effects of smoking based on real-world data. This thesis was made as part of a study that assessed the effects of smoking on postoperative surgical complications. To that end, a text classifier was built to identify a patient's smoking status from clinical notes. Smoking-related sentences were selected by searching the clinical notes with smoking-related regular expressions. In total, 809,958 sentences were classified with a machine learning-based fastText classifier, trained on 19,999 sentences, into the classes ex-smoker, nonsmoker, smoker and unknown smoking status. The results were improved by estimating the uncertainty of the classifications: predictions in the classes ex-smoker, nonsmoker and smoker that were considered uncertain were reassigned to the class unknown. The final classifier achieved precisions of 0.958, 0.974 and 0.95 for the classes ex-smoker, nonsmoker and smoker, respectively, and its accuracy on the sentences classified into these three classes was 0.959. Additionally, a rule-based classifier was introduced that assigns a smoking status to each surgery patient based on the smoking statuses of the classified sentences. Taking into account differences in study settings, the classifier outperformed prior approaches to identifying smoking status from clinical notes.
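As a rough illustration of two steps described in this abstract, the sketch below shows regular-expression selection of smoking-related sentences and reassignment of uncertain predictions to the class unknown. It is not the thesis implementation: the English pattern, the function names and the probability threshold of 0.9 are all assumptions made for the example, and the abstract does not state how uncertainty was estimated. The classifier itself is left out; the second function only takes a predicted label and a probability as input.

```python
import re
from typing import List

# Hypothetical English pattern; the thesis does not list its expressions.
SMOKING_PATTERN = re.compile(
    r"\b(smok\w*|cigarette\w*|tobacco|nicotine)\b", re.IGNORECASE
)


def select_smoking_sentences(note: str) -> List[str]:
    """Split a note into sentences and keep those matching the pattern."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return [s for s in sentences if SMOKING_PATTERN.search(s)]


def reassign_uncertain(label: str, probability: float,
                       threshold: float = 0.9) -> str:
    """Map a low-confidence prediction to 'unknown'.

    Assumption: uncertainty is approximated by the predicted class
    probability, and 0.9 is an arbitrary illustrative cut-off.
    """
    if label != "unknown" and probability < threshold:
        return "unknown"
    return label
```

Raising the threshold trades coverage for precision: fewer sentences keep a definite label, but those that do are more likely to be correct.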