Browsing by Author "Jokinen, Emmi"
- Auttaja-T-solujen erilaistumisen dynaaminen mallinnus differentiaaliyhtälöillä [Dynamic modelling of helper T cell differentiation with differential equations]
School of Electrical Engineering | Bachelor's thesis (2013-12-30) Jokinen, Emmi
- A deep learning method for predicting T cell receptor binding to unseen epitopes
School of Science | Master's thesis (2022-12-12) Korpela, Dani
T cells are a vital part of the immune system, defending us against invading pathogens and cancer. However, T cells can also target the individual's own healthy, non-infected cells, causing autoimmune diseases. The recognition of a target cell, whether disease-causing or healthy, is mediated by the T cell receptor (TCR). More specifically, the TCR recognizes a peptide fragment, an epitope, presented by the major histocompatibility complex (MHC), by binding to it. Understanding this recognition would be valuable and could be used in many medical applications. In this thesis, a deep learning model for the prediction of TCR-peptide-MHC binding is presented. Most current models treat the epitope as a categorical variable and therefore cannot predict for epitopes outside the training distribution. Our model uses the epitope amino acid sequence and is able to predict for previously unseen epitopes. In addition to the epitope, our model uses the MHC allele and the complementarity determining region 3 (CDR3) V and J genes of both chains or either chain of the TCR. The amino acid information of the epitope and TCR are combined using self-attention. We show that different learning rates in the optimization scheme work well for the seen and for the unseen task and how different input features are important for different tasks. Unseen epitope prediction remains a very hard task, and performance is significantly worse than in the seen epitope case. Finally, we show that our model outperforms or is comparable to state-of-the-art methods that are able to predict for unseen epitopes.
- EPIC-TRACE: predicting TCR binding to unseen epitopes using attention and contextualized embeddings
A1 Original article in a scientific journal (2023-12-09) Korpela, Dani; Jokinen, Emmi; Dumitrescu, Alexandru; Huuhtanen, Jani; Mustjoki, Satu; Lähdesmäki, Harri
Motivation: T cells play an essential role in the adaptive immune system to fight pathogens and cancer but may also give rise to autoimmune diseases. The recognition of a peptide-MHC (pMHC) complex by a T cell receptor (TCR) is required to elicit an immune response. Many machine learning models have been developed to predict the binding, but generalizing predictions to pMHCs outside the training data remains challenging. Results: We have developed a new machine learning model that utilizes information about the TCR from both α and β chains, epitope sequence, and MHC. Our method uses ProtBERT embeddings for the amino acid sequences of both chains and the epitope, as well as convolution and multi-head attention architectures. We show the importance of each input feature as well as the benefit of including epitopes with only a few TCRs in the training data. We evaluate our model on existing databases and show that it compares favorably against other state-of-the-art models.
- Evolution and modulation of antigen-specific T cell responses in melanoma patients
A1 Original article in a scientific journal (2022-10-11) Huuhtanen, Jani; Chen, Liang; Jokinen, Emmi; Kasanen, Henna; Lönnberg, Tapio; Kreutzman, Anna; Peltola, Katriina; Hernberg, Micaela; Wang, Chunlin; Yee, Cassian; Lähdesmäki, Harri; Davis, Mark M.; Mustjoki, Satu
Analyzing antigen-specific T cell responses at scale has been challenging. Here, we analyze three types of T cell receptor (TCR) repertoire data (antigen-specific TCRs, TCR-repertoire, and single-cell RNA + TCRαβ-sequencing data) from 515 patients with primary or metastatic melanoma and compare it to 783 healthy controls. Although melanoma-associated antigen (MAA)-specific TCRs are restricted to individuals, they share sequence similarities that allow us to build classifiers for predicting anti-MAA T cells. The frequency of anti-MAA T cells distinguishes melanoma patients from healthy controls and predicts metastatic recurrence from primary melanoma. Anti-MAA T cells have stem-like properties and frequent interactions with regulatory T cells and tumor cells via the Galectin9-TIM3 and PVR-TIGIT axes, respectively. In the responding patients, the number of expanded anti-MAA clones is higher after the anti-PD1(+anti-CTLA4) therapy and the exhaustion phenotype is rescued. Our systems immunology approach paves the way for understanding antigen-specific responses in human disorders.
- Identifying Phenotypes Based on TCR Repertoire Using Machine Learning Methods
School of Electrical Engineering | Master's thesis (2020-06-15) Qin, Qianqian
The adaptive immune system can prevent human beings from being infected by pathogens. T cells, a kind of lymphocyte in adaptive immunity, recognise antigens with T cell receptors (TCRs) and then generate cell-mediated immune responses. After primary immune responses, adaptive immunity can generate corresponding immunological memory. TCRs are generated by a process of somatic gene rearrangement and therefore have high diversity. An individual's TCR repertoire can reveal their pathogen exposure history, which can assist in biological studies such as disease diagnosis. This master's thesis aims to predict phenotype status from high-throughput TCR sequencing data using machine learning approaches, to see how accurate phenotype identification based on the TCR repertoire can be. The raw TCR data is preprocessed in three different ways, and each version then proceeds through the subsequent steps separately. Several feature selection approaches are applied to obtain the most important TCRs. Machine learning algorithms, including a beta-binomial model (baseline), logistic regression, random forest, and the boosting algorithm LightGBM, are trained and evaluated. Two datasets, cytomegalovirus (CMV) and rheumatoid arthritis (RA), are explored. For the CMV dataset, random forest performs best, although only slightly better than the baseline model. However, classification results on the RA dataset are poor regardless of the model used; the best classifier there is LightGBM. The results imply that the TCR data needs to be large enough to make powerful predictions. With a sufficiently large dataset, the baseline model predicts well, and certain algorithms, such as random forest, may outperform it.
- mGPfusion: Predicting protein stability changes with Gaussian process kernel learning and data fusion
A1 Original article in a scientific journal (2018-07-01) Jokinen, Emmi; Heinonen, Markus; Lähdesmäki, Harri
Motivation: Proteins are commonly used by the biochemical industry for numerous processes. Refining these proteins' properties via mutations causes stability effects as well. An accurate computational method to predict how mutations affect protein stability is necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data. Results: We have developed mGPfusion, a novel Gaussian process (GP) method for predicting a protein's stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only regarding the protein of interest and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves the model learning and prediction accuracy.
- Modeling protein stability with Gaussian processes
School of Electrical Engineering | Master's thesis (2016-08-24) Jokinen, Emmi
Proteins are used in various applications by different industries. In order to refine the processes they are used in or to create new applications, protein engineering is applied to alter the properties of proteins by introducing mutations to them. It is often desirable to improve the stability of proteins as they should be stable in the conditions of industrial processes. Protein stability predictors provide a way to estimate how mutations affect the stability. When a novel protein is being designed, the predictors can thus be used to reduce the number of proteins to be tested experimentally. This master's thesis introduces two machine learning approaches for predicting stability changes of proteins upon mutations. They both utilise Gaussian processes and a graph representation of proteins, but by using different kernels and different notions of similarity, they adapt to different situations. The first approach uses experimental stability measurements only from the protein of interest. When enough data is available it can reach excellent results. For example, when we trained this model using a stability data set of 349 measurements for bacteriophage T4 lysozyme and leave-one-out cross validation, we achieved a correlation of 0.90 and root mean squared error of 0.76 kcal/mol and outperformed the current state-of-the-art prediction methods. This method can predict the effects of single and multiple simultaneous mutations and can also incorporate information from predictors relying on energy functions to further improve stability predictions. The second approach exploits data from multiple proteins and can be applied even when only little or no experimental data is available from the protein of interest. We trained this model using a previously published data set of 2648 mutations from 131 proteins. When a set of 350 mutations of this data set was excluded for testing and the rest of the data was used for training, we achieved reasonable results, a correlation of 0.54 and a mean squared error of 1.32 kcal/mol.
- Predicting recognition between T cell receptors and epitopes with TCRGP
A1 Original article in a scientific journal (2021-03-25) Jokinen, Emmi; Huuhtanen, Jani; Mustjoki, Satu; Heinonen, Markus; Lähdesmäki, Harri
The adaptive immune system uses T cell receptors (TCRs) to recognize pathogens and to consequently initiate immune responses. TCRs can be sequenced from individuals and methods analyzing the specificity of the TCRs can help us better understand individuals' immune status in different disorders. For this task, we have developed TCRGP, a novel Gaussian process method that predicts if TCRs recognize specified epitopes. TCRGP can utilize the amino acid sequences of the complementarity determining regions (CDRs) from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes. Our comprehensive evaluation with epitope-specific TCR sequencing data shows that TCRGP achieves on average higher prediction accuracy in terms of AUROC score than existing state-of-the-art methods in epitope-specificity predictions. We also propose a novel analysis approach for combined single-cell RNA and TCRαβ (scRNA+TCRαβ) sequencing data by quantifying epitope-specific TCRs with TCRGP and identify HBV-epitope-specific T cells and their transcriptomic states in hepatocellular carcinoma patients.
- Single-cell characterization of anti-LAG-3 and anti-PD-1 combination treatment in patients with melanoma
A1 Original article in a scientific journal (2023-03-15) Huuhtanen, Jani; Kasanen, Henna; Peltola, Katriina; Lönnberg, Tapio; Glumoff, Virpi; Brück, Oscar; Dufva, Olli; Peltonen, Karita; Vikkula, Johanna; Jokinen, Emmi; Ilander, Mette; Lee, Moon Hee; Mäkelä, Siru; Nyakas, Marta; Li, Bin; Hernberg, Micaela; Bono, Petri; Lähdesmäki, Harri; Kreutzman, Anna; Mustjoki, Satu
BACKGROUND. Relatlimab plus nivolumab (anti-lymphocyte-activation gene 3 plus anti-programmed death 1 [anti-LAG-3+anti-PD-1]) has been approved by the FDA as a first-line therapy for stage III/IV melanoma, but its detailed effect on the immune system is unknown. METHODS. We evaluated blood samples from 40 immunotherapy-naive or prior immunotherapy-refractory patients with metastatic melanoma treated with anti-LAG-3+anti-PD-1 in a phase I trial using single-cell RNA and T cell receptor sequencing (scRNA+TCRαβ-Seq) combined with other multiomics profiling. RESULTS. The highest LAG3 expression was noted in NK cells, Tregs, and CD8+ T cells, and these cell populations underwent the most significant changes during the treatment. Adaptive NK cells were enriched in responders and underwent profound transcriptomic changes during the therapy, resulting in an active phenotype. LAG3+ Tregs expanded, but based on the transcriptome profile, became metabolically silent during the treatment. Last, higher baseline TCR clonality was observed in responding patients, and their expanding CD8+ T cell clones gained a more cytotoxic and NK-like phenotype. CONCLUSION. Anti-LAG-3+anti-PD-1 therapy has profound effects on NK cells and Tregs in addition to CD8+ T cells. TRIAL REGISTRATION. ClinicalTrials.gov (NCT01968109). FUNDING. Cancer Foundation Finland, Sigrid Juselius Foundation, Signe and Ane Gyllenberg Foundation, Relander Foundation, State funding for university-level health research in Finland, a Helsinki Institute of Life Sciences Fellow grant, Academy of Finland (grant numbers 314442, 311081, 335432, and 335436), and an investigator-initiated research grant from BMS.
- Substrate specificity of 2-deoxy-D-ribose 5-phosphate aldolase (DERA) assessed by different protein engineering and machine learning methods
A1 Original article in a scientific journal (2020-12) Voutilainen, Sanni; Heinonen, Markus; Andberg, Martina; Jokinen, Emmi; Maaheimo, Hannu; Pääkkönen, Johan; Hakulinen, Nina; Rouvinen, Juha; Lähdesmäki, Harri; Kaski, Samuel; Rousu, Juho; Penttilä, Merja; Koivula, Anu
Abstract: In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in the close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied for the third mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability with the concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as the non-phosphorylated equivalent (C3). The Ec DERA variants were also tested in an aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the variants with improved acetaldehyde addition also showed improved activity on formaldehyde. Key points: • DERA aldolases are promiscuous enzymes. • Synthetic utility of DERA aldolase was improved by protein engineering approaches. • Machine learning methods aid the protein engineering of DERA.
- TCR Sequence Representations Using Deep, Contextualized Language Models
School of Science | Master's thesis (2021-03-15) Dumitrescu, Alexandru
The recent advent of deep, contextual language models has brought significant improvements to various complex tasks such as neural machine translation or document generation. Models similar to those used in natural language have also started to grow in popularity in the bioinformatics field. The sequence information of proteins can be represented as strings of characters, each denoting one unique amino acid. This fact has led researchers to successfully experiment with amino acid vector representations that are learned and computed with models similar to those used in the natural language field. T cell receptors (TCRs) are sequences of proteins that form through the (random) recombination of the so-called variable (V), diversity (D), and joining (J) gene segments. These sequences are responsible for determining the epitope specificities of T cells and, in turn, their ability to recognize foreign pathogens. The physicochemical properties of each amino acid in a TCR and how the TCR protein folds determine what pathogens the T cell recognizes. This thesis presents and compares various ways of extracting contextual embeddings from T cell receptor proteins, using only their sequence information. We implement and test adaptations of character-level Embeddings from Language Models (ELMo) and fine-tune Bidirectional Encoder Representations from Transformers (BERT) models using only sequences of amino acids coming from human TCR proteins. We then test the language models we train using only TCRs on an additional task that classifies a TCR based on its epitope specificity. We show how much the language model's task performance affects the TCR epitope classifier. Finally, we compare our approach to other state-of-the-art methods for TCR epitope classification.
- TCRconv: predicting recognition between T cell receptors and epitopes using contextualized motifs
A1 Original article in a scientific journal (2023-01-01) Jokinen, Emmi; Dumitrescu, Alexandru; Huuhtanen, Jani; Gligorijevic, Vladimir; Mustjoki, Satu; Bonneau, Richard; Heinonen, Markus; Lähdesmäki, Harri
Motivation: T cells use T cell receptors (TCRs) to recognize small parts of antigens, called epitopes, presented by major histocompatibility complexes. Once an epitope is recognized, an immune response is initiated and T cell activation and proliferation by clonal expansion begin. Clonal populations of T cells with identical TCRs can remain in the body for years, thus forming immunological memory and potentially mappable immunological signatures, which could have implications in clinical applications including infectious diseases, autoimmunity and tumor immunology. Results: We introduce TCRconv, a deep learning model for predicting recognition between TCRs and epitopes. TCRconv uses a deep protein language model and convolutions to extract contextualized motifs and provides state-of-the-art TCR-epitope prediction accuracy. Using TCR repertoires from COVID-19 patients, we demonstrate that TCRconv can provide insight into T cell dynamics and phenotypes during the disease.
- TSignal: a transformer model for signal peptide prediction
A1 Original article in a scientific journal (2023-06-01) Dumitrescu, Alexandru; Jokinen, Emmi; Paatero, Anja; Kellosalo, Juho; Paavilainen, Ville O.; Lähdesmäki, Harri
Motivation: Signal peptides (SPs) are short amino acid segments present at the N-terminus of newly synthesized proteins that facilitate protein translocation into the lumen of the endoplasmic reticulum, after which they are cleaved off. Specific regions of SPs influence the efficiency of protein translocation, and small changes in their primary structure can abolish protein secretion altogether. The lack of conserved motifs across SPs, sensitivity to mutations, and variability in the length of the peptides make SP prediction a challenging task that has been extensively pursued over the years. Results: We introduce TSignal, a deep transformer-based neural network architecture that utilizes BERT language models and dot-product attention techniques. TSignal predicts the presence of SPs and the cleavage site between the SP and the translocated mature protein. We use common benchmark datasets and show competitive accuracy in terms of SP presence prediction and state-of-the-art accuracy in terms of cleavage site prediction for most of the SP types and organism groups. We further illustrate that our fully data-driven trained model identifies useful biological information on heterogeneous test sequences.