MiniLingua: Training of multilingual small large language model
School of Science | Master's thesis
Language
en
Pages
66
Abstract
This thesis presents the development of a multilingual small language model (SLM) designed to perform a wide range of language tasks across diverse languages, including low-resource ones. The work explores recent trends in SLM and multilingual large language model (MLLM) design and addresses key challenges such as architecture optimization, data efficiency, and multilingual alignment. The model was trained in two stages, pre-training and supervised fine-tuning (SFT), using carefully curated datasets spanning 13 languages and multiple domains. A comprehensive preprocessing pipeline was implemented to ensure data quality, including deduplication, filtering, and privacy safeguards. A custom Byte-Pair Encoding (BPE) tokenizer was trained to ensure optimal multilingual performance. Hyperparameter optimization and scaling-law experiments guided the model configuration. The resulting 1B-parameter MiniLingua model demonstrates competitive performance across various benchmarks.
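The tokenizer training mentioned in the abstract rests on the standard BPE merge procedure: repeatedly count the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. The sketch below is illustrative only; the thesis does not include its training code, and the toy corpus, merge count, and function names here are assumptions, not the author's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a frequency dict of space-joined words."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its concatenation."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        syms = word.split()
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn a list of BPE merges from a whitespace-tokenized corpus string."""
    words = Counter(" ".join(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

On the toy corpus `"low low low lower lowest"`, the first two learned merges are `("l", "o")` and `("lo", "w")`, since those pairs occur in all five words. Production tokenizers add byte-level fallback and special tokens on top of this core loop.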
Supervisor
Marttinen, Pekka
Thesis advisors
Dainese, Nicola
Nikitin, Alexander