Chunking strategies in retrieval-augmented generation

School of Electrical Engineering | Master's thesis

Language

en

Pages

56

Abstract

Retrieval-Augmented Generation (RAG) mitigates the hallucination and knowledge cut-off problems of Large Language Models by retrieving external knowledge. However, existing research overlooks the evaluation of text chunking, which is a critical link between the retrieval and generation phases of a RAG system. To address this gap, this study constructs an evaluation framework based on a full factorial design to investigate the impact of three strategies (Character-level, Sentence-aware, and Structure-aware chunking) on the retrieval efficiency, generation quality, and computational cost of RAG systems across varying retrieval depths ($k=3, 5, 8$). The experiments are conducted on the HotpotQA multi-hop reasoning dataset using the DeepSeek-R1:1.5b model for end-to-end inference. The results reveal a significant performance inversion. Character-level chunking achieves the highest Recall and Evidence Hit Rate in the retrieval phase, yet it yields the worst F1 Score and Exact Match Rate in the generation phase. In contrast, Structure-aware chunking, despite a disadvantage in retrieval ranking, achieves the highest F1 Score and Exact Match Rate in the generation phase by preserving complete paragraph logic. Furthermore, qualitative analysis indicates that both Character-level and Sentence-aware chunking, lacking macro-context, tend to induce hallucinations, while Structure-aware chunking effectively supports correct rejection. This study confirms that the Semantic Integrity of the context is more critical than mere Physical Coverage in RAG systems. By providing logical units with high signal-to-noise ratios, Structure-aware chunking enhances the reliability of complex reasoning. These findings provide theoretical evidence and engineering guidance for optimizing RAG systems.
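To make the three strategies and the factorial design concrete, the following is a minimal Python sketch, not the thesis implementation. The abstract does not state chunk sizes, overlap, or splitting rules, so the function names, the default of 200 characters, the sentence-boundary regex, and the use of blank lines as paragraph boundaries are all illustrative assumptions.

```python
import re
from itertools import product


def chunk_by_characters(text: str, size: int = 200, overlap: int = 0) -> list[str]:
    """Character-level: fixed-width windows that may cut sentences mid-way.

    The size and overlap defaults are assumptions, not values from the thesis.
    """
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_by_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Sentence-aware: greedily pack whole sentences up to max_chars.

    Sentence boundaries are approximated with a simple regex (assumption).
    A single sentence longer than max_chars becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def chunk_by_structure(text: str) -> list[str]:
    """Structure-aware: keep each paragraph whole.

    Paragraphs are assumed to be separated by blank lines.
    """
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Full factorial design: every strategy is paired with every retrieval depth k.
STRATEGIES = {
    "character": chunk_by_characters,
    "sentence": chunk_by_sentences,
    "structure": chunk_by_structure,
}
RETRIEVAL_DEPTHS = (3, 5, 8)
conditions = list(product(STRATEGIES, RETRIEVAL_DEPTHS))
```

With three strategies and three retrieval depths, `conditions` holds nine (strategy, k) pairs, one per experimental cell. In this sketch only the structure-aware chunker guarantees that a paragraph is never split, which is the property the abstract credits for its generation-phase advantage.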

Description

Supervisor

Zhou, Quan

Thesis advisor

Ma, Teng
