MiniLingua: Training of multilingual small large language model

Field | Value | Language
dc.contributor | Aalto-yliopisto | fi
dc.contributor | Aalto University | en
dc.contributor.advisor | Dainese, Nicola |
dc.contributor.advisor | Nikitin, Alexander |
dc.contributor.author | Zverkov, Boris |
dc.contributor.school | Perustieteiden korkeakoulu | fi
dc.contributor.school | School of Science | en
dc.contributor.supervisor | Marttinen, Pekka |
dc.date.accessioned | 2025-08-19T17:14:38Z |
dc.date.available | 2025-08-19T17:14:38Z |
dc.date.issued | 2025-07-30 |
dc.description.abstract | This thesis presents the development of a multilingual small language model (SLM) designed to perform a wide range of language tasks across diverse languages, including low-resource ones. The work explores recent trends in SLM and multilingual large language model (MLLM) design and addresses key challenges such as architecture optimization, data efficiency, and multilingual alignment. The model was trained in two stages, pre-training and supervised fine-tuning (SFT), using carefully curated datasets spanning 13 languages and multiple domains. A comprehensive preprocessing pipeline was implemented to ensure data quality, including deduplication, filtering, and privacy safeguards. A custom Byte-Pair Encoding (BPE) tokenizer was trained to support strong multilingual performance. Hyperparameter optimization and scaling law experiments guided the model configuration. The 1B-parameter MiniLingua model demonstrates competitive performance across various benchmarks. | en
dc.format.extent | 66 |
dc.format.mimetype | application/pdf | en
dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/138139 |
dc.identifier.urn | URN:NBN:fi:aalto-202508196368 |
dc.language.iso | en | en
dc.programme | Master's Programme in Computer, Communication and Information Sciences | en
dc.programme.major | Machine Learning, Data Science and Artificial Intelligence | en
dc.subject.keyword | large language model | en
dc.subject.keyword | scaling law | en
dc.subject.keyword | multilingual language model | en
dc.subject.keyword | small language model | en
dc.subject.keyword | byte-pair encoding | en
dc.subject.keyword | supervised fine-tuning | en
dc.title | MiniLingua: Training of multilingual small large language model | en
dc.type | G2 Pro gradu, diplomityö | fi
dc.type.ontasot | Master's thesis | en
dc.type.ontasot | Diplomityö | fi
local.aalto.electroniconly | yes |
local.aalto.openaccess | no |
