The impact of quantization on the inference performance and quality of large language models: A comparative study on code generation task

School of Science | Master's thesis

Language

en

Pages

63

Abstract

Large language models (LLMs) have recently achieved strong performance on code generation tasks, but their deployment is often limited by high inference costs in terms of memory, latency, and energy consumption. Quantization, the reduction of numerical precision in weights and activations, offers a promising avenue for making LLMs more efficient, yet its impact on functional correctness and serving performance in realistic code-generation settings remains underexplored. This thesis investigates how post-training quantization affects both the quality and efficiency of LLMs on competitive-programming-style code generation. Concretely, the study focuses on two members of the Llama 3 family: Llama 3.1 8B Instruct and Llama 3.2 3B Instruct. Both models are evaluated on the CodeContests dataset, which provides natural-language problem descriptions paired with executable test suites. Several quantization schemes are considered, including 8-bit and 4-bit weight-only quantization via bitsandbytes, GPTQ-based 4-bit quantization, and Activation-aware Weight Quantization (AWQ). The models are assessed in a few-shot setting using pass@1 and pass@5, and inference performance is measured in terms of memory footprint, latency, and throughput on a single NVIDIA V100 32 GB GPU. Overall, the thesis contributes a systematic comparison of post-training quantization methods for code-focused LLMs, quantifies their impact on both accuracy and serving efficiency, and offers practical guidance for deploying Llama 3.x models under realistic hardware constraints.
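The abstract does not state which estimator is used for pass@1 and pass@5; the standard choice in code-generation evaluation is the unbiased estimator of Chen et al. (2021), which, given n generated samples of which c pass all tests, computes the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: samples that pass all tests
    k: evaluation budget (e.g. 1 or 5)
    """
    if n - c < k:
        # Every size-k draw must contain at least one correct sample.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 samples, 1 correct, budget 1 -> probability 0.5
print(pass_at_k(2, 1, 1))
```

The per-problem values are then averaged over the benchmark to obtain the reported pass@1 and pass@5 figures.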

Supervisor

Hellas, Arto

Thesis advisor

Leinonen, Juho
