[dipl] Perustieteiden korkeakoulu / SCI
Permanent URI for this collection: https://aaltodoc.aalto.fi/handle/123456789/21
Browsing [dipl] Perustieteiden korkeakoulu / SCI by Degree programme/Major subject "Bioinformatics"
- Active learning and interactive training for retinal image classification
Perustieteiden korkeakoulu | Master's thesis (2018-06-18) Sahlsten, Jaakko
The goal of this study is to investigate the application of deep learning and human-computer interaction to diagnosing diabetic retinopathy from colour fundus images. We apply deep learning and study the effects of network pretraining, active learning and personalised annotation on a private dataset. Diabetic retinopathy is a global issue, with the number of patients and screening cases increasing each year. As more and more fundus images are scanned, each requiring a diagnosis of diabetic retinopathy and other eye diseases, a major share of an ophthalmologist's time is consumed. To aid and speed up these diagnosis and annotation tasks, a machine learning solution is proposed for automatic diagnosis of diabetic retinopathy from colour fundus images. A state-of-the-art deep neural network has been demonstrated to match ophthalmologists in diagnosing referable diabetic retinopathy when trained on tens of thousands of colour fundus images and associated labels. In this work the state-of-the-art model was deployed using a smaller dataset. The model was trained both from random initialisation and from weights pretrained on the ImageNet dataset, and fine-tuning the pretrained network was compared to training from scratch on two test sets: the fine-tuned model reached an area under the receiver operating characteristic curve (ROC AUC) of 0.965 and 0.921, while the model trained from random initialisation reached 0.962 and 0.879. Active learning is a well-studied subfield of machine learning and has been applied successfully in many domains; however, there is limited literature on applying it to high-dimensional data with deep neural networks. In this work, recent active learning methods were applied to diabetic retinopathy classification in order to reduce the dataset size required to reach ophthalmologist-level performance in classifying referable diabetic retinopathy in a screening setting. The solution reached the threshold with 8700 images, compared to 10500 images with random sampling. Finally, a model was developed that attempts to learn the user's annotation preferences with the help of a pretrained network. The trained model was compared to a reference model without human feedback and evaluated on subjective and objective performance. Anecdotal testing showed that the tool was able to learn subjective gradability to some extent; however, it did not provide additional benefit for subjective classification of retinopathy.
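A minimal sketch of the fine-tuning comparison described above, using a generic ImageNet-pretrained backbone and ROC AUC evaluation; the backbone choice, data loaders and hyperparameters here are illustrative assumptions, not the thesis's actual configuration.

```python
# Sketch: fine-tune an ImageNet-pretrained CNN for referable-DR classification
# and evaluate with ROC AUC. Backbone, loaders and hyperparameters are
# placeholders, not the configuration used in the thesis.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

def build_model(pretrained: bool) -> nn.Module:
    weights = models.ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    model = models.resnet50(weights=weights)
    model.fc = nn.Linear(model.fc.in_features, 1)  # binary: referable DR vs not
    return model

def train(model, loader, epochs=5, lr=1e-4, device="cuda"):
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for images, labels in loader:  # loader yields (N,3,H,W) tensors and 0/1 labels
            images, labels = images.to(device), labels.float().to(device)
            opt.zero_grad()
            loss = loss_fn(model(images).squeeze(1), labels)
            loss.backward()
            opt.step()

@torch.no_grad()
def evaluate_auc(model, loader, device="cuda"):
    model.to(device).eval()
    scores, targets = [], []
    for images, labels in loader:
        probs = torch.sigmoid(model(images.to(device)).squeeze(1))
        scores.extend(probs.cpu().tolist())
        targets.extend(labels.tolist())
    return roc_auc_score(targets, scores)  # ROC AUC as reported in the abstract
```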
- Adaptive real-time anomaly detection for multi-dimensional streaming data
Perustieteiden korkeakoulu | Master's thesis (2017-04-03) Saarinen, Inka
Data volumes are growing at a high speed as data emerges from millions of devices. This brings an increasing need for streaming analytics: processing and analysing the data in a record-by-record manner. In this work a comprehensive literature review on streaming analytics is presented, focusing on detecting anomalous behaviour. Challenges and approaches for streaming analytics are discussed, different ways of defining and identifying anomalies are shown, and a large number of anomaly detection methods for streaming data are presented, together with existing software platforms and solutions for streaming analytics. Based on the literature survey I chose one method for further investigation, the Lightweight On-line Detector of Anomalies (LODA). LODA is designed to detect anomalies in real time even from high-dimensional data. In addition, it is adaptive and updates its model on-line. LODA was tested on both synthetic and real data sets. This work shows how to set the parameters used with LODA. I also present several improvement ideas for LODA and show that three of them bring important benefits. First, I show a simple addition for handling special cases, so that an anomaly score can be computed for all data points. Second, I show cases where LODA fails due to a lack of data preprocessing; I suggest preprocessing schemes for streaming data and show that they improve the results significantly while requiring only a small subset of the data for determining the preprocessing parameters. Third, since LODA only gives anomaly scores, I suggest thresholding techniques for deciding which points are anomalies. The suggested techniques work fairly well compared to the theoretical best performance, which makes it possible to use LODA in real streaming analytics settings.
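As a rough illustration of how LODA produces anomaly scores (sparse random projections plus one-dimensional histograms, with the score being the negative mean log density), here is a minimal batch-mode sketch; the on-line histogram updates and the thesis's preprocessing and thresholding additions are not shown, and the parameter choices are arbitrary.

```python
# Minimal batch sketch of LODA-style scoring: k sparse random projections,
# a 1-D histogram per projection, anomaly score = negative mean log density.
# Bin counts and the number of projections are illustrative choices.
import numpy as np

def fit_loda(X, n_projections=100, n_bins=30, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    nonzero = max(1, int(round(np.sqrt(d))))   # each projection uses ~sqrt(d) features
    W = np.zeros((n_projections, d))
    for i in range(n_projections):
        idx = rng.choice(d, size=nonzero, replace=False)
        W[i, idx] = rng.standard_normal(nonzero)
    histograms = []
    for w in W:
        z = X @ w
        counts, edges = np.histogram(z, bins=n_bins)
        density = (counts + 1) / (counts.sum() + n_bins)   # Laplace smoothing
        histograms.append((edges, density))
    return W, histograms

def loda_score(X, W, histograms):
    """Higher score = more anomalous."""
    logp = np.zeros(X.shape[0])
    for w, (edges, density) in zip(W, histograms):
        z = X @ w
        bins = np.clip(np.digitize(z, edges) - 1, 0, len(density) - 1)
        logp += np.log(density[bins])
    return -logp / len(histograms)

# Usage: points scoring above a chosen threshold (e.g. a high quantile of
# training scores) are flagged as anomalies.
X_train = np.random.default_rng(0).normal(size=(1000, 20))
W, hists = fit_loda(X_train, rng=0)
scores = loda_score(X_train, W, hists)
threshold = np.quantile(scores, 0.99)
```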
- Analysis of LC-MS data in untargeted nutritional metabolomics
Perustieteiden korkeakoulu | Master's thesis (2019-08-19) Mattsson, Anton
Liquid chromatography-mass spectrometry (LC-MS) based untargeted metabolomics is a technique that can measure the levels of thousands of compounds from virtually any biological sample. This thesis was done for the research group of nutritional metabolomics at the University of Eastern Finland. While there exists software for analyzing raw LC-MS data, the output of such software often requires additional preprocessing and quality control procedures that are integral to the workflow of the research group. This thesis covers many of these steps in detail, while also providing a broad overview of metabolomics and LC-MS instrumentation. The most important steps in curating the output of LC-MS data collection software are drift correction, removal of low-quality features and imputation of missing values. We use cubic spline regression to model and correct for the systematic drift of signal intensity during an LC-MS run. Next, low-quality features are identified using several quality metrics that measure the relative magnitude of analytical variation. Finally, missing values are imputed by predicting them with a random forest fit on the observed part of the dataset. The main outcome of the thesis is an R package that automates data analysis of LC-MS experiments. The package provides a simple interface for the common preprocessing steps and several statistical analysis techniques for finding the most interesting features of the data, along with an arsenal of visualizations for quality control, exploratory visualization and assessment of study results. The package is licensed under the open-source MIT license and is available for anyone to use. In addition, this thesis presents a new algorithm for finding molecular features originating from the same compound.
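A toy sketch of the two numerical steps mentioned above: drift correction via a cubic smoothing spline fitted to QC samples, and random-forest imputation of missing values. It is written in Python purely for illustration (the thesis delivers an R package), and the data layout and function names are assumptions.

```python
# Toy sketch: per-feature drift correction with a cubic smoothing spline fitted
# to QC injections, followed by random-forest imputation of missing values.
# The data layout is assumed; this is not the R package's implementation.
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.ensemble import RandomForestRegressor

def drift_correct(intensity, injection_order, is_qc):
    """Correct one feature's intensities by the drift trend seen in QC samples.

    Assumes injection_order is increasing and intensities are positive.
    """
    spline = UnivariateSpline(injection_order[is_qc], intensity[is_qc], k=3)
    trend = spline(injection_order)
    return intensity / trend * np.median(intensity[is_qc])

def impute_feature(X, j):
    """Impute missing values in feature column j from the other features."""
    missing = np.isnan(X[:, j])
    if not missing.any():
        return X
    other = np.nan_to_num(np.delete(X, j, axis=1))   # crude handling of other gaps
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(other[~missing], X[~missing, j])
    X = X.copy()
    X[missing, j] = rf.predict(other[missing])
    return X
```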
- Application of variations of non-linear CCA for feature selection in drug sensitivity prediction
Perustieteiden korkeakoulu | Master's thesis (2019-06-17) Shadbahr, Tolou
Cancer arises due to genetic alterations in patient DNA. Many studies indicate that these alterations vary among patients and can dramatically affect the therapeutic effect of cancer treatments. Therefore, extensive studies focus on understanding these alterations and their effects. Pre-clinical models play an important role in cancer drug discovery, and cancer cell lines are one of the main ingredients of these pre-clinical studies, capturing many different aspects of the multi-omics properties of cancer cells. However, assessing cancer cell line responses to different drugs is error-prone and laborious. Therefore, in silico models that accurately predict drug sensitivity values can enhance cancer drug discovery. In the past decade, many computational methods have achieved high performance by exploiting similarities between cancer cell lines and between drug compounds to obtain accurate predictive models for unseen instances. In this thesis, we study the effect of non-linear feature selection, through two variations of canonical correlation analysis, KCCA and HSIC-SCCA, on the prediction of drug sensitivity. To estimate the performance of these features we use pairwise kernel ridge regression to predict drug sensitivity, measured as IC50 values. The dataset under study is a subset of the Genomics of Drug Sensitivity in Cancer comprising 124 cell lines and 124 drug compounds. The high diversity between cell line and drug compound samples and the high dimensionality of the data matrices reduce the accuracy of the model obtained by pairwise kernel ridge regression. The accuracy was reduced further when HSIC-SCCA was employed as a dimension reduction step, since the method increased the differences among samples by using different projection vectors for samples in different folds of cross-validation. Therefore, the obtained variables were rotated to provide more homogeneous samples, which slightly improved the accuracy of the model.
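A compact sketch of pairwise (Kronecker-style) kernel ridge regression as used for the IC50 prediction step above: the kernel between two (cell line, drug) pairs is taken as the product of a cell-line kernel and a drug kernel. The feature matrices, kernel choices and regularisation value are assumptions for illustration.

```python
# Sketch of pairwise kernel ridge regression for (cell line, drug) -> IC50.
# The pair kernel is the product of an RBF kernel on cell-line features and an
# RBF kernel on drug features; features and lambda are illustrative only.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def pairwise_krr(X_cell, X_drug, train_pairs, y_train, test_pairs, lam=1.0):
    """Kernel ridge regression on (cell line, drug) pairs.

    train_pairs/test_pairs: integer arrays of shape (n, 2) indexing rows of
    X_cell and X_drug; y_train: measured IC50 values for the training pairs.
    """
    K_cell = rbf_kernel(X_cell)          # kernel between all cell lines
    K_drug = rbf_kernel(X_drug)          # kernel between all drugs
    tc, td = train_pairs[:, 0], train_pairs[:, 1]
    sc, sd = test_pairs[:, 0], test_pairs[:, 1]
    # Pair kernel = product of the two base kernels (Kronecker kernel).
    K_train = K_cell[np.ix_(tc, tc)] * K_drug[np.ix_(td, td)]
    K_cross = K_cell[np.ix_(sc, tc)] * K_drug[np.ix_(sd, td)]
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y_train)), y_train)
    return K_cross @ alpha               # predicted IC50 for the test pairs
```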
- Bioinformatics approach to unearthing bimolecular hammerhead ribozymes
Perustieteiden korkeakoulu | Master's thesis (2019-03-11) Zareie, Ashkan
Background. Motif finding in large genomic datasets is one of the most powerful bioinformatics techniques and an important gateway to answering a wide variety of biological questions. However, it also remains one of the most challenging and computationally expensive aspects of bioinformatics. Motivation. In this work, we focus on small catalytic RNA molecules, hammerhead ribozymes, which over the past decades have been found across the entire tree of life. Hammerhead ribozymes have been observed in many different forms and unusual configurations across biological samples and organisms of different origins. Nonetheless, bioinformatics searches have only concentrated on hammerhead ribozymes that are contiguous and wholly found on a single chromosome. Here we concentrate on hammerheads composed of two split RNA molecules, each found on a different chromosome. Such hammerheads have widely been the focus of laboratory design and experiments because of their importance in medicine and therapeutics, but they have not yet been reported as naturally occurring motifs. Objective. To develop and design a bioinformatics tool capable of locating bimolecular motifs that span two chromosomes; to use hammerhead ribozymes and their consensus sequence structure to evaluate the tool on a model organism; and to discover whether there is a theoretical probability that such motifs could occur in nature or whether they are truly restricted to laboratory design. Results. We have developed a hybrid pipeline in R, Perl and Bash that uses regular expressions and BLAST for exact pattern matching of small motifs. We evaluated the tool on five chromosomes of Drosophila melanogaster and obtained thousands of results that match the structural requirements of a bimolecular, discontiguous hammerhead ribozyme. These results indicate a high chance of occurrence for such motifs in nature. Moreover, many of the hits show characteristics of trans-acting, multiple-turnover hammerheads.
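To illustrate the kind of regular-expression matching such a pipeline relies on, here is a small Python sketch that scans two sequences for the two halves of a split motif and pairs the hits; the toy patterns below are placeholders, not the actual hammerhead consensus used in the thesis.

```python
# Sketch: find candidate "half-motif" hits on two different chromosomes and
# pair them. The patterns here are toy placeholders, not the hammerhead
# ribozyme consensus used in the thesis.
import re
from itertools import product

HALF_A = re.compile(r"CUGAUGA[ACGU]{2,6}GAA")   # placeholder pattern, chromosome 1
HALF_B = re.compile(r"GUC[ACGU]{4,8}CUGA")      # placeholder pattern, chromosome 2

def find_hits(pattern, sequence):
    """Return (start, end, matched substring) for every non-overlapping hit."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]

def pair_bimolecular_hits(seq_chr1, seq_chr2):
    hits_a = find_hits(HALF_A, seq_chr1)
    hits_b = find_hits(HALF_B, seq_chr2)
    # Every (hit on chr1, hit on chr2) combination is a candidate bimolecular
    # motif; a real pipeline would additionally check base-pairing compatibility.
    return list(product(hits_a, hits_b))

chr1 = "AAACUGAUGACGGAAUUU"
chr2 = "GGGUCACGUACGCUGAGG"
print(pair_bimolecular_hits(chr1, chr2))
```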
- Comparison of normalization and statistical testing methods of 16S rRNA gene sequencing data
Perustieteiden korkeakoulu | Master's thesis (2018-12-10) Lehtinen, Ilona
The decreasing cost and increasing speed of next-generation sequencing techniques now enable more affordable and time-effective access to human microbiomes. The aim of many 16S ribosomal RNA (rRNA) gene sequencing experiments is to identify the taxa whose abundance differs significantly between two or more conditions. However, increasing awareness of the compositional nature of 16S rRNA gene sequencing data has raised concerns about the validity of conclusions drawn from this type of data. Many early differential abundance testing methods completely ignore the compositionality or uneven library sizes. Recently, new methods that take compositionality into account have been developed with the aim of ensuring scale invariance and sub-compositional coherence. However, the fundamental problem that compositional data do not contain all the information needed for differential abundance testing remains a major challenge. The aim of this thesis was to evaluate methods used for differential abundance testing of 16S rRNA gene sequencing data using both simulated and real data. Overall, we found that the simulation results depend strongly on the simulation design and data characteristics. We confirm that better detection performance was achieved with larger effect sizes and when more samples were available. The experiment on real data revealed that large differences between the methods remain. Centered log-ratio (CLR) transformation prior to statistical testing produced the highest detection accuracy in our simulation experiments. CLR transformation combined with the Reproducibility-Optimized Test Statistic (ROTS) or the Wilcoxon rank sum test produced nearly equal results for larger sample sizes, whereas for small sample sizes ROTS outperformed the Wilcoxon rank sum test. Thus, based on our results, CLR transformation combined with the ROTS statistical test can be recommended for differential abundance testing of 16S rRNA gene sequencing data.
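A small sketch of the CLR-plus-test recipe favoured above: counts are CLR-transformed per sample (with a pseudocount, an assumption here) and each taxon is then compared between groups with a Wilcoxon rank sum test.

```python
# Sketch: centered log-ratio (CLR) transformation of a count table followed by
# a per-taxon Wilcoxon rank sum test between two groups. The pseudocount of 1
# is an illustrative choice.
import numpy as np
from scipy.stats import mannwhitneyu

def clr_transform(counts, pseudocount=1.0):
    """counts: samples x taxa matrix of non-negative integers."""
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)   # subtract per-sample mean log (log geometric mean)

def differential_abundance(counts, group):
    """group: boolean vector, True for condition A samples."""
    clr = clr_transform(counts)
    pvals = []
    for j in range(clr.shape[1]):
        _, p = mannwhitneyu(clr[group, j], clr[~group, j], alternative="two-sided")
        pvals.append(p)
    return np.array(pvals)   # multiple-testing correction (e.g. BH) would follow

rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(12, 50))
group = np.array([True] * 6 + [False] * 6)
p_values = differential_abundance(counts, group)
```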
- Coronary heart disease prediction in Finnish cohort utilizing genomic and other health data
Perustieteiden korkeakoulu | Master's thesis (2021-06-14) Ala-Pietilä, Emil
- Data simulation of tumor phylogenetic trees and evaluation of phylogenetic reconstructing tools
Perustieteiden korkeakoulu | Master's thesis (2017-12-11) Li, Xinyue
Tumor heterogeneity refers to the fact that a tumor usually contains more than one type of cell, called clones. Clones in a tumor have distinct morphological and physiological features such as genetic variations. Different clones display different sensitivities to cytotoxic drugs, and tumor heterogeneity adds complexity to understanding tumor composition and poses challenges for the development of successful therapies. Thus, studying tumor heterogeneity can guide therapy for individual patients and enhance our understanding of inter-clonal functional relationships during treatment, which could benefit personalized and efficient treatments. The development of a heterogeneous tumor is an evolutionary process: there exists an evolutionary relationship among the clones of a heterogeneous tumor, and this relationship can be described by a phylogenetic tree. Computational tools have become increasingly important for studying tumor heterogeneity because of their time and economic efficiency. Such tools usually take as input the genetic variability data produced by high-throughput sequencing technologies, then output the clonal composition of a tumor and reconstruct its phylogenetic tree. In this thesis, we simulated a large number of datasets consisting of tumor phylogenetic trees with varying properties and used them to evaluate five recent and popular computational tools for tumor phylogeny reconstruction. We found relatively large differences in performance among the tools, as well as their respective strengths and shortcomings. We leave improving the data simulation methods and exploring tool parameters for possibly better results as future work.
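A toy sketch of the kind of simulation described above: a random clone tree is generated, mutations are assigned along its edges, and variant allele frequencies are derived from random clone proportions. The tree size, mutation counts and frequency model are arbitrary illustrative choices, not the thesis's simulation design.

```python
# Toy simulation of a tumor clone tree: random parent assignments, mutations
# accumulated along the tree, and variant allele frequencies (VAFs) computed
# from random clone proportions. All sizes and assumptions are arbitrary.
import numpy as np

def simulate_clone_tree(n_clones=5, muts_per_clone=10, rng=None):
    rng = np.random.default_rng(rng)
    parent = [-1] + [int(rng.integers(0, k)) for k in range(1, n_clones)]  # clone k's parent is an earlier clone
    # Each clone inherits its parent's mutations and adds new private ones.
    mutations, next_mut = [], 0
    for k in range(n_clones):
        inherited = list(mutations[parent[k]]) if parent[k] >= 0 else []
        private = list(range(next_mut, next_mut + muts_per_clone))
        next_mut += muts_per_clone
        mutations.append(inherited + private)
    # Random clone proportions; per-mutation VAF = half the summed proportion
    # of clones carrying the mutation (diploid, heterozygous assumption).
    props = rng.dirichlet(np.ones(n_clones))
    vaf = np.zeros(next_mut)
    for k, muts in enumerate(mutations):
        vaf[muts] += props[k] / 2.0
    return parent, mutations, props, vaf

parent, mutations, props, vaf = simulate_clone_tree(rng=0)
print("parents:", parent)
print("clone proportions:", np.round(props, 3))
print("first 10 VAFs:", np.round(vaf[:10], 3))
```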
- Drug combination synergy prediction with minimal set of experiments for high-throughput combinatorial screening
Perustieteiden korkeakoulu | Master's thesis (2018-08-20) Ianevski, Aleksandr
- Drug Set Enrichment Analysis (DSEA): A Computational Approach to Identify Functional Drug Sets from High-Throughput Drug Testing
Perustieteiden korkeakoulu | Master's thesis (2014-12-01) Bychkov, Dmitrii
- Drug side-effect prediction using machine learning methods
Perustieteiden korkeakoulu | Master's thesis (2017-12-11) Khan, Muhammad
Drug toxicity (adverse side-effects) is a pressing health problem and an impediment to the development of therapeutically effective drugs. Despite many ongoing efforts to determine toxicity beforehand, computational prediction of drug side-effects remains a challenging task. This thesis presents an approach to predicting side-effects by utilizing side-information sources for the drugs, while comparing state-of-the-art machine learning methods to improve accuracy. Specifically, the thesis implements a data-analysis pipeline for obtaining side-information that is useful for the prediction task. The thesis then formulates drug side-effect prediction as a machine learning problem: given disease indications and structural features (as side-information sources) of drugs for which some side-effect measurements exist, predict the side-effects of a new drug. As case studies, prediction accuracies are compared for ten different side-effects using linear as well as non-linear machine learning methods. The thesis summarizes three key findings. First, the drug side-information sources are predictive of the side-effects. Second, non-linear methods show improved prediction accuracy compared to their linear analogues. Third, integrating disease indications and structural features with a principled machine learning approach further improves drug side-effect predictions. However, the current study limits the analysis by assuming that side-effects are independent. In the future, modeling the joint relationships of several side-effects could yield stronger predictions and better help to understand the underlying biological mechanisms.
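A hedged sketch of the linear-versus-non-linear comparison described above: drug indication and structural feature blocks are concatenated and a logistic regression is compared with an RBF-kernel SVM via cross-validated AUC for one side-effect label. The feature matrices, label rule and model settings are invented placeholders.

```python
# Sketch: compare a linear and a non-linear classifier for predicting one drug
# side-effect from concatenated side-information (disease indications +
# structural fingerprints). Data and hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_drugs = 200
X_indications = rng.integers(0, 2, size=(n_drugs, 50))     # drug-disease indication matrix
X_structure = rng.integers(0, 2, size=(n_drugs, 166))      # e.g. fingerprint-like bits
X = np.hstack([X_indications, X_structure]).astype(float)  # simple feature integration
# Synthetic label that depends weakly on two features, just to exercise the pipeline.
y = (X[:, 0] + X[:, 60] + rng.normal(0, 0.5, n_drugs) > 1).astype(int)

linear_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
nonlinear_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

for name, model in [("logistic regression", linear_model), ("RBF SVM", nonlinear_model)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```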
- Engineering a Multi-Electrode Patch Clamp System: A novel tool to quantify retinal circuits
Perustieteiden korkeakoulu | Master's thesis (2018-02-12) Narayanan, Sathish
The human brain contains almost 100 billion neurons. They form distinct neural circuits that underlie the computational power of the brain. To understand how these neural networks function, high-detail physiological recordings from multiple identified neurons within a circuit are required, but the technical possibilities for achieving this have been limited. Simultaneous patch clamp recordings from multiple well-defined neurons would give an excellent opportunity to obtain a deeper mechanistic understanding of neural circuit function. Thus, the goal of this master's project was to build the first state-of-the-art multi-electrode patch clamp system, along with acquisition and analysis software, for retinal studies. This multi-electrode patch clamp system makes it possible for the first time to study at high physiological resolution how identified neurons in the vertebrate retina contribute to processing in small networks. The system is flexible enough to study other areas of the brain and can be extended to eight electrodes with only a few changes. The custom-written software ensures protocol standardization for rig calibration, data acquisition and analysis. All toolboxes are freely available as open source code, which enables seamless collaboration between researchers and laboratories.
- Evaluation of robustness of modeling-based experiment retrieval method to differences in measurement and preprocessing techniques
Perustieteiden korkeakoulu | Master's thesis (2017-02-13) Eranti, Pradeep
- Forecasting the Demand of Retail Stock Keeping Units Using a Negative Binomial State Space Model
Perustieteiden korkeakoulu | Master's thesis (2019-12-16) Kaijala, Saara
Demand forecasting is one of the core challenges in retail business and successful supply chain planning. However, many endogenous and exogenous factors make the task very challenging. Simple linear and univariate models are unable to capture many of the complex patterns present in demand time series; hence, probabilistic Bayesian models have gained prominence in the field. The objective of this thesis is to determine whether the probabilistic model specification by Chapados (2014) is sufficient for industrial-scale demand forecasting. The model is a state space model with negative binomial observations and a latent autoregressive (AR) process of order one. Bayesian inference over the unknown parameters and latent states is carried out with integrated nested Laplace approximation, an emerging method suitable for latent Gaussian models (Rue et al., 2009). The results are illustrated on real-world retail data consisting of 2460 sales time series from a large European retailer. The performance of the model is compared against the forecasting accuracy of a Holt-Winters exponential smoothing model and a simple naïve model. Our results regarding the forecasting performance of the model are mixed. In general, no notable accuracy gains could be obtained with the negative binomial state space model in comparison with the benchmark models. In our setup, especially for slow-moving items with intermittent sales, the model generated systematically upward-biased forecasts. However, for products with high sales volumes as well as for frequently promoted products, the forecasts of the negative binomial state space model were competitive. Given the complexity of the framework and the slowness of the inference calculations, the exact reasons for the poor performance remain unclear. We suspect that the latent AR(1) process may not be enough for capturing some correlations in the data. We also note that formulating a strategy for setting the model priors suitably, with a reasonable amount of effort, can be very challenging. For future research, we suggest experimenting with the order of the AR process and revisiting the strategy for setting the model priors. It could also be investigated whether faster convergence of the inference could be obtained with another inference method.
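As a quick illustration of the benchmark side of the comparison above, here is a sketch that fits a Holt-Winters exponential smoothing model and a naïve last-value forecast to one sales series and compares their errors; the series, seasonal period and horizon are invented for illustration and are not the thesis's data or settings.

```python
# Sketch: Holt-Winters exponential smoothing vs. a naive last-value forecast
# for a single weekly sales series. Series, seasonal period and horizon are
# invented for illustration.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
weeks = 156
season = 10 + 5 * np.sin(2 * np.pi * np.arange(weeks) / 52)   # yearly seasonality
sales = rng.poisson(season)                                    # weekly unit sales

horizon = 12
train, test = sales[:-horizon], sales[-horizon:]

hw = ExponentialSmoothing(
    train.astype(float), trend="add", seasonal="add", seasonal_periods=52
).fit()
hw_forecast = hw.forecast(horizon)

naive_forecast = np.repeat(train[-1], horizon)                 # naive benchmark

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

print("Holt-Winters MAE:", round(mae(test, hw_forecast), 2))
print("Naive MAE:      ", round(mae(test, naive_forecast), 2))
```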
- From SNPs to Signals: Automatic Result Filtering and Novelty identification for Genome-Wide Association Studies
Perustieteiden korkeakoulu | Master's thesis (2019-12-16) Lehisto, Arto
In recent years, genome-wide association studies (GWAS) have grown both in size and scope, with sample sizes reaching hundreds of thousands and the focus of the efforts shifting to amassing phenome-wide, population-level data resources. These studies have brought with them an unprecedented number of associations between genomic regions and phenotypic traits. Recently, the FinnGen project was started to create a population-level, phenome-wide GWAS resource of the Finnish population. The large amount of result data created by the FinnGen project calls for an automatic process for extracting significant results. This thesis describes the automatic reporting tool created for the needs of the FinnGen project. The tool extracts and annotates significant results from GWAS summary statistics and compares them to previously identified associations. The tool's motivation and function are described. A data analysis pipeline was created for the tool and tested using a set of GWAS summary statistics. The results come in the form of identified signals per phenotype, together with information about the novelty of each signal. The results of the experiment show that the tool scales to the sizes necessary for the FinnGen project.
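A simplified sketch of the kind of filtering and signal grouping such a reporting tool performs: genome-wide significant variants are selected from summary statistics, grouped into distance-based signals, and flagged as novel if no known association lies nearby. The column names, significance threshold and window size are assumptions, not the tool's actual settings.

```python
# Sketch: extract genome-wide significant hits from GWAS summary statistics,
# group them into distance-based signals, and mark signals with no nearby
# known association as novel. Column names, the 5e-8 threshold and the 500 kb
# window are illustrative assumptions.
import pandas as pd

P_THRESHOLD = 5e-8
WINDOW = 500_000  # bp

def extract_signals(sumstats: pd.DataFrame) -> pd.DataFrame:
    """sumstats columns assumed: chrom, pos, pval."""
    hits = sumstats[sumstats["pval"] < P_THRESHOLD].sort_values(["chrom", "pos"])
    signal_ids, current = [], -1
    last_chrom, last_pos = None, None
    for chrom, pos in zip(hits["chrom"], hits["pos"]):
        if chrom != last_chrom or pos - last_pos > WINDOW:
            current += 1                       # start a new signal
        signal_ids.append(current)
        last_chrom, last_pos = chrom, pos
    hits = hits.assign(signal=signal_ids)
    # One lead variant per signal: the smallest p-value.
    return hits.loc[hits.groupby("signal")["pval"].idxmin()]

def flag_novel(leads: pd.DataFrame, known: pd.DataFrame) -> pd.DataFrame:
    """known columns assumed: chrom, pos (previously reported associations)."""
    def is_novel(row):
        near = known[(known["chrom"] == row["chrom"])
                     & ((known["pos"] - row["pos"]).abs() <= WINDOW)]
        return near.empty
    return leads.assign(novel=leads.apply(is_novel, axis=1))
```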
- Identification of metabolic fluxes leading to the production of industrially relevant products
Perustieteiden korkeakoulu | Master's thesis (2016-06-02) Ilievska, Maja
In metabolic pathway analysis the focus is on identifying the complete range of paths within a biochemical network. However, most current methods for characterizing all potential paths between selected substrates and a product are based on enumerating either all elementary flux modes or all extreme pathways, which becomes computationally infeasible for large reaction matrices. In this work, we propose an alternative approach that identifies a set of potential paths while avoiding exhaustive enumeration. More specifically, we identify a set of (minimal) flux vectors that produce the desired product and do not accumulate any intermediates while consuming at least one of the specified substrates. Our k-best approach uses linear programming to identify the first k solutions according to a predefined objective function. Furthermore, in order to determine biologically more meaningful flux vectors, we define an augmented solution space in which, in addition to the flux distribution, we incorporate the net consumption/production of external metabolites and the contribution of the null space basis vectors to the given flux distribution. One of the main aims of this research was to computationally determine the best substrate-path-product combination for industrial-scale production. In particular, we were interested in identifying the best carbon source (or the best combination of carbon sources) leading to the highest productivity for a specific product, as well as the best metabolic pathway from the identified sources to the product. A special focus of this work was the identification of an objective function for the enumerated paths that would return a good set of candidate paths. The results demonstrate that our k-best method is able to identify a set of candidate pathways for genome-scale metabolic models, where elementary mode and extreme pathway analysis fail to provide a resulting set of pathways. Among the pathways proposed by our enumeration approach there are novel ones with the potential to improve the production processes of the specific product in terms of energetic efficiency.
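To make the linear-programming formulation concrete, here is a minimal flux-balance sketch on a tiny invented network: maximise flux through the product-forming reactions subject to the steady-state constraint S v = 0 and flux bounds. The network, bounds and objective are toy assumptions; the thesis's k-best enumeration and augmented solution space are not reproduced.

```python
# Minimal flux-balance LP on a toy network: maximize flux to the product
# subject to steady state (S v = 0) and flux bounds. The stoichiometric matrix
# and bounds are invented; the thesis's k-best enumeration over an augmented
# solution space is not shown.
import numpy as np
from scipy.optimize import linprog

# Toy network with 3 internal metabolites (rows) and 5 reactions (columns):
# R0: substrate uptake -> A, R1: A -> B, R2: A -> C,
# R3: B -> product (export), R4: C -> product (export).
S = np.array([
    [ 1, -1, -1,  0,  0],   # metabolite A
    [ 0,  1,  0, -1,  0],   # metabolite B
    [ 0,  0,  1,  0, -1],   # metabolite C
])

bounds = [(0, 10)] * S.shape[1]   # irreversible reactions with an uptake limit
c = np.zeros(S.shape[1])
c[3] = c[4] = -1.0                # linprog minimizes, so negate to maximize product export

res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
print("optimal flux distribution:", np.round(res.x, 3))
print("product flux:", round(res.x[3] + res.x[4], 3))
# A k-best scheme would re-solve with added constraints excluding previously
# found flux patterns, collecting k distinct candidate pathways.
```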
- Integrated data analysis pipeline for whole human genome transcription factor binding sites prediction
Perustieteiden korkeakoulu | Master's thesis (2015-06-11) Khakipoor, Banafsheh
Transcription factors (TFs) have a central role in regulating gene expression by binding to regulatory regions in DNA. The position weight matrix (PWM) is the most commonly used model for representing and predicting TF binding sites. Consequently, several studies have been done on predicting TF binding sites using PWMs, and many databases have been created containing large numbers of PWMs. However, these studies require the user to search for binding sites for each PWM separately, making it difficult to get a general view of binding predictions for many PWMs simultaneously. In response to this need, this thesis project evaluates both individual PWMs and groups of PWMs and creates an effortless method to analyze and visualize the desired set of PWMs together, making it easier for biologists to analyze large amounts of data in a short period of time. For this purpose, we used bioinformatics methods to detect putative TF binding sites in the human genome and made them available online via the UCSC genome browser. Still, the sheer amount of data in PWM databases required a more efficient method for summarizing TF binding predictions. Hence, we used PWM similarity measures and clustering algorithms to group PWMs together and to create one integrated database from four popular PWM databases: SELEX, TRANSFAC, UniPROBE and JASPAR. All results are publicly available to the research community via the UCSC genome browser.
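A small sketch of the basic operation behind such predictions: scoring every window of a DNA sequence against a PWM (here as a log-odds matrix against a uniform background) and reporting windows above a cutoff. The example count matrix and the cutoff are invented.

```python
# Sketch: scan a DNA sequence with a position weight matrix (log-odds against a
# uniform background) and report windows scoring above a cutoff. The example
# counts and the cutoff are invented.
import numpy as np

BASES = "ACGT"

def counts_to_logodds(counts, pseudo=1.0, background=0.25):
    """counts: 4 x L matrix of observed base counts per motif position."""
    probs = (counts + pseudo) / (counts + pseudo).sum(axis=0, keepdims=True)
    return np.log2(probs / background)

def scan(sequence, logodds, cutoff):
    L = logodds.shape[1]
    hits = []
    for start in range(len(sequence) - L + 1):
        window = sequence[start:start + L]
        score = sum(logodds[BASES.index(b), i] for i, b in enumerate(window))
        if score >= cutoff:
            hits.append((start, window, round(score, 2)))
    return hits

# Invented 4-position motif strongly preferring "TGCA".
counts = np.array([
    [ 1,  1,  1, 20],   # A
    [ 1,  1, 20,  1],   # C
    [ 1, 20,  1,  1],   # G
    [20,  1,  1,  1],   # T
])
logodds = counts_to_logodds(counts)
print(scan("AATGCAGGTGCATT", logodds, cutoff=4.0))
```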
- Interactive learning in personalized medicine
Perustieteiden korkeakoulu | Master's thesis (2016-12-08) Kaurila, Karel
In personalized medicine, the goal is to tailor treatments to a particular patient. To do this, one needs to be able to accurately predict treatment outcomes based on previous treatments of other patients, while still taking into account the particularities of the patient being treated. In order to make these kinds of predictions, one first needs to solve a number of statistical problems, and this thesis studies one of them: the application of interactive machine learning methods to the problem of predicting local effects in a high-dimensional setting with few samples, a setting often found in personalized medicine. For this task, the thesis proposes eliciting additional information about the similarities of the samples from an expert and using this information for learning local models. Specifically, the proposed approach is to use an interactive metric learning method together with a recent sparse and local regression method. The method is empirically evaluated in a synthetic proof-of-concept setting, where the response to be predicted has strong local effects. The results in this setting suggest that integrating similarity information learned from expert feedback can be an effective way to approach prediction in "small n, large p" settings.
- Modeling protein-DNA binding specificities with random forest
Perustieteiden korkeakoulu | Master's thesis (2018-01-22) Antikainen, Anni
Protein-DNA binding specificities are modeled with random forest in this Master's thesis. Specific proteins called transcription factors are essential for gene expression regulation, since their binding to DNA can alter the transcription initiation probability of target genes. Furthermore, transcription factors can bind DNA as dimers even when, as individuals, they would lack the required affinity for the binding site. Thus, models that predict both individual protein and protein dimer binding sites would be beneficial for deducing gene regulatory networks. In this Master's thesis, HT-SELEX and CAP-SELEX datasets measured by Jolma et al. are utilized for modeling binding specificities. SELEX measurements yield large sets of DNA sequences that are known to comprise a binding site; HT-SELEX measures individual transcription factor binding sites, while CAP-SELEX measures binding sites of transcription factor dimers. Currently, position weight matrices (PWMs) are most often used for modeling protein-DNA binding specificities, even though they may be too simple and inflexible for accurate modeling. For instance, a neural network model, DeepBind, has been shown to outperform PWM modeling significantly. In this Master's thesis, random forest, which is known to be well suited to high-dimensional and correlated data, is combined with PWMs to yield models for protein-DNA binding specificities. For individual transcription factor binding sites, random forest performs almost equally to DeepBind and outperforms PWM modeling significantly. In addition, random forest predicts protein dimer binding sites significantly more accurately than position weight matrices, and the difference between random forest and PWM modeling is greater for protein pairs than for individual proteins; furthermore, DeepBind is not currently provided for transcription factor pairs. Thus, according to the results presented in this Master's thesis, modeling protein-DNA binding specificities with random forest is beneficial in comparison to position weight matrices, especially for protein dimers.
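A toy sketch of the modeling setup described above: SELEX-like sequences are one-hot encoded and augmented with a PWM score feature, and a random forest classifier is trained to separate bound from unbound sequences. The simulated data, encoding and hyperparameters are illustrative assumptions, not the thesis's feature construction.

```python
# Toy sketch: classify SELEX-like sequences as bound vs. unbound with a random
# forest on one-hot encoded bases plus a PWM score feature. Data, encoding and
# hyperparameters are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

BASES = "ACGT"
rng = np.random.default_rng(0)

def one_hot(seq):
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES.index(b)] = 1.0
    return x.ravel()

def pwm_score(seq, logodds):
    L = logodds.shape[1]
    return max(sum(logodds[BASES.index(b), i] for i, b in enumerate(seq[s:s + L]))
               for s in range(len(seq) - L + 1))

# Simulated data: "bound" sequences contain the motif TGCA, "unbound" are random.
def random_seq(n):
    return "".join(rng.choice(list(BASES), size=n))

bound = [random_seq(8) + "TGCA" + random_seq(8) for _ in range(300)]
unbound = [random_seq(20) for _ in range(300)]
seqs, labels = bound + unbound, np.array([1] * 300 + [0] * 300)

# Crude log-odds PWM for the motif TGCA (rows A,C,G,T; columns positions 1-4).
pwm_probs = np.full((4, 4), 0.05)
for pos, base in enumerate("TGCA"):
    pwm_probs[BASES.index(base), pos] = 0.85
logodds = np.log2(pwm_probs / 0.25)

X = np.array([np.concatenate([one_hot(s), [pwm_score(s, logodds)]]) for s in seqs])
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("cross-validated AUC:", cross_val_score(rf, X, labels, cv=5, scoring="roc_auc").mean())
```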
- Pool-seq analysis for the identification of polymorphisms in bacterial strains and utilization of the variants for protein database creation
Perustieteiden korkeakoulu | Master's thesis (2016-10-27) Weldatsadik, Rigbe
Pooled sequencing (Pool-seq) is the sequencing of a single library that contains DNA pooled from different samples. It is a cost-effective alternative to individual whole genome sequencing. In this study, we utilized Pool-seq to sequence 100 Streptococcus pyogenes strains in two pools in order to identify polymorphisms and create variant protein databases for shotgun proteomics analysis. We investigated the efficacy of the pooling strategy and of the four tools used for variant calling by using individual sequence data from six of the strains in the pools as well as 3407 publicly available strains from the European Nucleotide Archive. Besides the raw sequence data from the public repository, we also extracted polymorphisms from 19 publicly available complete S. pyogenes genomes and compared the variations against our pools. In total 78955 variants (76981 SNPs and 1725 INDELs) were identified from the two pools. Of these, approximately 60.5% and 95.7% were also discovered in the complete genomes and the European Nucleotide Archive data, respectively. Collectively, the four variant calling tools were able to recover the majority of the variants, approximately 96.5%, found in the six individual strains, suggesting that Pool-seq is a robust approach for variant discovery. Variants from the pools that fell in coding regions and had non-synonymous effects constituted 24% of the total and were used to create variant protein databases for shotgun proteomics analysis. These variant databases improved protein identification in mass spectrometry analysis.
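A small sketch of the kind of cross-checking reported above: variant calls from the pools are represented as (chrom, pos, ref, alt) sets and compared against calls from individually sequenced strains to compute the recovered fraction. The file handling and field layout are deliberately simplified assumptions, not the actual pipeline.

```python
# Sketch: compare pooled variant calls against variants found in individually
# sequenced strains and report the fraction recovered. Variants are reduced to
# (chrom, pos, ref, alt) tuples; the VCF handling is deliberately simplified.
def load_variants(vcf_path):
    """Read a plain-text VCF and return a set of (chrom, pos, ref, alt)."""
    variants = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):        # handle multi-allelic sites
                variants.add((chrom, int(pos), ref, allele))
    return variants

def recovered_fraction(pool_variants, individual_variants):
    """Fraction of individual-strain variants that the pools also contain."""
    if not individual_variants:
        return 0.0
    return len(pool_variants & individual_variants) / len(individual_variants)

# Hypothetical usage (file names are placeholders):
# pool = load_variants("pool1.vcf") | load_variants("pool2.vcf")
# individuals = set().union(*(load_variants(p) for p in strain_vcf_paths))
# print(f"recovered: {100 * recovered_fraction(pool, individuals):.1f}%")
```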