Predicting Protein Producibility: Binary classification of recombinant proteins produced in filamentous fungi
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Machine Learning and Data Mining
Master’s Programme in Machine Learning and Data Mining (Macadamia)
Abstract: Recombinant protein synthesis aims to produce specific proteins of interest in living cells. Production frequently fails, however, so a computational tool that predicts whether a given protein sequence can be produced successfully, before any laboratory experimentation, would save time and resources. We demonstrate that an SVM trained on amino acid composition can predict successful protein production in a dataset of sequences tested in the host species Trichoderma reesei. We found that predictive models generalize well between two species of filamentous fungi, and that 50 training sequences are sufficient to train a model that yields an AUC above 0.7. We introduced novel predictive features based on protein domains detected with the InterProScan tool. These features were modestly successful on their own, but adding them did not improve on amino acid composition alone. Experiments applying semi-supervised SVM formulations did not yield significant improvement, most likely because the spatial distribution of data points under the chosen numeric representations did not conform to the assumptions of the semi-supervised models. Finally, we examined the species of origin and enzyme function of sequences from the UniProt SwissProt database that the trained SVM models predicted to be successful, and showed that models trained with an RBF kernel were the most conservative in terms of the number of predicted successes.
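As a rough illustration of the amino acid composition representation mentioned in the abstract, the sketch below turns a protein sequence into a 20-dimensional vector of residue fractions, which is the kind of input a classifier such as an SVM could be trained on. The function name, the residue ordering, and the example sequence are assumptions made for illustration only; the thesis's actual feature extraction may differ, for example in how non-standard residues are handled.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid as a 20-element list.

    Hypothetical helper: characters outside the 20 standard residues
    (e.g. 'X' or '*') are ignored, which is an assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up example sequence, not taken from the thesis dataset.
features = amino_acid_composition("MKTAYIAKQR")
```

Each sequence maps to a fixed-length vector regardless of its length, so vectors from many sequences can be stacked into a feature matrix and paired with success/failure labels for training.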
Thesis advisor: Arvas, Mikko
Keywords: binary classification, SVM, protein, filamentous fungi, semi-supervised