MiniLingua: Training of multilingual small large language model

School of Science | Master's thesis

Language

en

Pages

66

Abstract

This thesis presents the development of a multilingual small language model (SLM) designed to perform a wide range of language tasks across diverse languages, including low-resource ones. The work explores recent trends in SLM and multilingual large language model (MLLM) design, and addresses key challenges such as architecture optimization, data efficiency, and multilingual alignment. The model was trained in two stages: pre-training and supervised fine-tuning (SFT), using carefully curated datasets spanning 13 languages and multiple domains. A comprehensive preprocessing pipeline was implemented to ensure data quality, including deduplication, filtering, and privacy safeguards. A custom Byte-Pair Encoding (BPE) tokenizer was trained to ensure optimal multilingual performance. Hyperparameter optimization and scaling law experiments guided the model configuration. The 1B-parameter MiniLingua model demonstrates competitive performance across various benchmarks.
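
The abstract mentions training a custom BPE tokenizer for multilingual coverage. As an illustrative sketch only (the thesis does not reproduce its tokenizer code here; the vocabulary size, special tokens, language subset, and file layout below are assumptions, not values from the work), such a tokenizer could be trained with the Hugging Face tokenizers library:

    # Illustrative sketch: trains a byte-level BPE tokenizer on per-language
    # corpus files. Paths, vocab size, and special tokens are assumptions,
    # not values taken from the thesis.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

    trainer = trainers.BpeTrainer(
        vocab_size=64_000,  # assumed; chosen to balance coverage across languages
        special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
    )

    # One text file per language keeps the multilingual training mix explicit.
    languages = ["en", "fi", "sv", "de", "fr"]  # hypothetical subset of the 13 languages
    files = [f"corpus/{lang}.txt" for lang in languages]
    tokenizer.train(files, trainer)
    tokenizer.save("minilingua-bpe.json")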

Supervisor

Marttinen, Pekka

Thesis advisor

Dainese, Nicola
Nikitin, Alexander
