dc.contributor | Aalto-yliopisto | fi |
dc.contributor | Aalto University | en |
dc.contributor.advisor | Jokinen, Emmi | |
dc.contributor.author | Dumitrescu, Alexandru | |
dc.date.accessioned | 2021-03-21T18:05:52Z | |
dc.date.available | 2021-03-21T18:05:52Z | |
dc.date.issued | 2021-03-15 | |
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/103090 | |
dc.description.abstract | The recent advent of deep, contextual language models has brought significant improvements to various complex tasks such as neural machine translation and document generation. Models similar to those used in natural language processing have also started to grow in popularity in the bioinformatics field. The sequence information of proteins can be represented as strings of characters, each denoting one unique amino acid. This fact has led researchers to successfully experiment with amino acid vector representations that are learned and computed with models similar to those used in the natural language field. T cell receptors (TCRs) are protein sequences that form through the (random) recombination of the so-called variable (V), diversity (D), and joining (J) gene segments. These sequences are responsible for determining the epitope specificities of T cells and, in turn, their ability to recognize foreign pathogens. The physicochemical properties of each amino acid in a TCR and how the TCR protein folds determine which pathogens the T cell recognizes. This thesis presents and compares various ways of extracting contextual embeddings from T cell receptor proteins, using only their sequence information. We implement and test adaptations of character-level Embeddings from Language Models (ELMO) and fine-tune Bidirectional Encoder Representations from Transformers (BERT) models using only sequences of amino acids coming from human TCR proteins. We then evaluate these language models, trained only on TCRs, on an additional task that classifies a TCR based on its epitope specificity. We show how much the language model's task performance affects the TCR epitope classifier. Finally, we compare our approach to other state-of-the-art methods for TCR epitope classification. | en |
dc.format.extent | 70 + 12 | |
dc.format.mimetype | application/pdf | en |
dc.language.iso | en | en |
dc.title | TCR Sequence Representations Using Deep, Contextualized Language Models | en |
dc.type | G2 Pro gradu, diplomityö | fi |
dc.contributor.school | Perustieteiden korkeakoulu | fi |
dc.subject.keyword | Deep Learning | en |
dc.subject.keyword | ELMO (Embeddings from Language Models) | en |
dc.subject.keyword | BERT (Bidirectional Encoder Representations from Transformers) | en |
dc.subject.keyword | T-cell receptor | en |
dc.subject.keyword | complementarity determining region | en |
dc.subject.keyword | epitope | en |
dc.identifier.urn | URN:NBN:fi:aalto-202103212369 | |
dc.programme.major | Alexandru Dumitrescu | fi |
dc.programme.mcode | SCI3044 | fi |
dc.type.ontasot | Master's thesis | en |
dc.type.ontasot | Diplomityö | fi |
dc.contributor.supervisor | Lähdesmäki, Harri | |
dc.programme | Master’s Programme in Computer, Communication and Information Sciences | fi |
local.aalto.electroniconly | yes | |
local.aalto.openaccess | yes | |
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.