MiniLingua: Training of multilingual small large language model
| dc.contributor | Aalto-yliopisto | fi |
| dc.contributor | Aalto University | en |
| dc.contributor.advisor | Dainese, Nicola | |
| dc.contributor.advisor | Nikitin, Alexander | |
| dc.contributor.author | Zverkov, Boris | |
| dc.contributor.school | Perustieteiden korkeakoulu | fi |
| dc.contributor.school | School of Science | en |
| dc.contributor.supervisor | Marttinen, Pekka | |
| dc.date.accessioned | 2025-08-19T17:14:38Z | |
| dc.date.available | 2025-08-19T17:14:38Z | |
| dc.date.issued | 2025-07-30 | |
| dc.description.abstract | This thesis presents the development of a multilingual small language model (SLM) designed to perform a wide range of language tasks across diverse languages, including low-resource ones. The work explores recent trends in SLM and multilingual large language model (MLLM) design and addresses key challenges such as architecture optimization, data efficiency, and multilingual alignment. The model was trained in two stages, pre-training and supervised fine-tuning (SFT), using carefully curated datasets spanning 13 languages and multiple domains. A comprehensive preprocessing pipeline was implemented to ensure data quality, including deduplication, filtering, and privacy safeguards. A custom Byte-Pair Encoding (BPE) tokenizer was trained to ensure optimal multilingual performance. Hyperparameter optimization and scaling-law experiments guided the model configuration. The resulting 1B-parameter MiniLingua model demonstrates competitive performance across various benchmarks. | en |
| dc.format.extent | 66 | |
| dc.format.mimetype | application/pdf | en |
| dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/138139 | |
| dc.identifier.urn | URN:NBN:fi:aalto-202508196368 | |
| dc.language.iso | en | en |
| dc.programme | Master's Programme in Computer, Communication and Information Sciences | en |
| dc.programme.major | Machine Learning, Data Science and Artificial Intelligence | en |
| dc.subject.keyword | large language model | en |
| dc.subject.keyword | scaling law | en |
| dc.subject.keyword | multilingual language model | en |
| dc.subject.keyword | small language model | en |
| dc.subject.keyword | byte-pair encoding | en |
| dc.subject.keyword | supervised fine-tuning | en |
| dc.title | MiniLingua: Training of multilingual small large language model | en |
| dc.type | G2 Pro gradu, diplomityö | fi |
| dc.type.ontasot | Master's thesis | en |
| dc.type.ontasot | Diplomityö | fi |
| local.aalto.electroniconly | yes | |
| local.aalto.openaccess | no | |