How do you chunk your data in RAG — All about chunking strategies
Chunking is the process of splitting documents into smaller units called chunks, each of which is individually indexed, embedded, and retrieved. Because LLMs have limited context windows, their output quality depends heavily on what is retrieved from the vector database. The goal of chunking is to improve retrieval accuracy and reduce computational overhead.
Before digging a little deeper into chunking, let us recall the RAG pipeline:
- Indexing — convert documents into vector embeddings and store them in a vector database.
- Retrieval — Query the database for most relevant chunks
- Augmentation — Inject the retrieved chunks into the LLM prompt
- Generation — Prompt the LLM to produce a final, context-informed response.
Why do we need a strong chunking strategy?
- Embedding models and LLMs have strict context-size limits, so making good use of the context window depends on well-sized chunks.
- Precise and smaller chunks often mean faster lookups and better recall.
- Appropriately sized chunks also reduce processing and optimize compute capacity.
- Semantic integrity ensures more accurate matches and better output.
Chunking is a preprocessing step that directly affects the retrieval phase. There are multiple chunking strategies, and the right one is chosen based on the use case. Let us look at each in detail.
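As a rough end-to-end illustration of the four stages, here is a toy sketch in plain Python. Word-overlap scoring stands in for real vector similarity, and prompt assembly stands in for the LLM call; a real system would use a trained embedding model, a vector database, and an actual model invocation.

```python
# Toy RAG pipeline sketch: word overlap stands in for embedding similarity.

def index_chunks(chunks):
    # Indexing: here the "embedding" is just a bag of words per chunk.
    return [(chunk, set(chunk.lower().split())) for chunk in chunks]

def retrieve(index, query, k=2):
    # Retrieval: rank chunks by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def augment(query, retrieved):
    # Augmentation: inject the retrieved chunks into the prompt.
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Chunking splits documents into smaller units.",
    "Vector databases store embeddings for retrieval.",
    "LLMs generate answers from the augmented prompt.",
]
index = index_chunks(chunks)
query = "how does chunking split documents"
prompt = augment(query, retrieve(index, query, k=1))
# Generation would pass `prompt` to the LLM.
```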

Fixed-Size Chunking
This is a simple and straightforward strategy in which the input text is segmented into equal-sized pieces based on character, token, or word counts. An overlap parameter is often used to reduce data loss at the boundaries. LangChain provides splitters that support this:
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document

def perform_fixed_size_chunking(document, chunk_size=1000, chunk_overlap=200):
    text_splitter = CharacterTextSplitter(
        separator="\n\n",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    # Split the text into chunks
    chunks = text_splitter.split_text(document)
    # Convert to Document objects with metadata
    documents = []
    for i, chunk in enumerate(chunks):
        doc = Document(
            page_content=chunk,
            metadata={
                "chunk_id": i,
                "total_chunks": len(chunks),
                "chunk_size": len(chunk),
                "chunk_type": "fixed-size"
            }
        )
        documents.append(doc)
    return documents
This strategy is easy to implement, and because the chunks are uniform, batch operations are simple. It is suitable when the application does not rely heavily on semantic context. However, it can cut off sentences or paragraphs abruptly and ignores semantic breaks, so sometimes the relevant information ends up scattered across chunks.
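Stripped of the LangChain wrapper, the core idea is just slicing with a stride: step forward by `chunk_size - chunk_overlap` so that consecutive chunks share `chunk_overlap` characters. A minimal pure-Python sketch over characters:

```python
def fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200):
    # Consecutive chunks overlap by chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters with step 800 -> chunks start at 0, 800, 1600, 2400
text = "".join(str(i % 10) for i in range(2500))
chunks = fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200)
```

The tail of each chunk equals the head of the next, which is exactly the boundary redundancy the overlap parameter buys you.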
Recursive Character Text Splitting
With CharacterTextSplitter we simply split by a fixed number of characters, but the recursive splitter is parameterized by a list of separators. It tries to split on them in order until the chunks are small enough. This keeps paragraphs together as long as possible, as those tend to be the most semantically related pieces of text.
from langchain_text_splitters import RecursiveCharacterTextSplitter

rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=10
)
rec_text_splits = rec_text_splitter.split_documents(documents)
Semantic Chunking
This divides the text into meaningful, complete chunks based on semantic similarity as calculated by an embedding model. In most use cases this improves retrieval quality compared with blind, syntactic chunking.
from langchain_experimental.text_splitter import SemanticChunker
semantic_text_splitter = SemanticChunker(embeddings_model)
semantic_text_splits = semantic_text_splitter.split_documents(documents)
It determines where to break by measuring the difference between the embeddings of consecutive sentences; if the difference exceeds a threshold, a split is made. The threshold is determined by one of:
- Percentile — If difference is greater than the X percentile then it is split
- Standard Deviation — If difference is greater than X standard deviations then it is split
- Interquartile — The interquartile distance is used to determine split points.
- Gradient — The gradient of distance is used to split chunks along with percentile method. This is useful when chunks are highly correlated with each other or specific to a domain. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data.
semantic_text_splitter = SemanticChunker(
    embeddings_model,
    breakpoint_threshold_type="percentile"  # or "standard_deviation", "interquartile", "gradient"
)
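The percentile rule can be sketched without any external library by plugging in a sentence-vector function. Here a toy bag-of-words vector over a tiny fixed vocabulary stands in for a real embedding model, and a simple nearest-rank percentile stands in for the library's threshold computation:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb) if na and nb else 1.0

def percentile(values, pct):
    # Nearest-rank percentile, enough for a sketch.
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]

def semantic_split(sentences, embed, pct=95):
    # Distance between each pair of consecutive sentence embeddings.
    vecs = [embed(s) for s in sentences]
    dists = [cosine_distance(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(dists):
        if d >= threshold:          # breakpoint: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Toy embedding: word counts over a fixed vocabulary (illustration only).
VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def toy_embed(sentence):
    words = sentence.lower().split()
    return [float(words.count(w)) for w in VOCAB]

sentences = [
    "the cat is a pet",
    "the dog is a pet",
    "the stock market price rose",
    "the market price fell",
]
chunks = semantic_split(sentences, toy_embed)
```

With these inputs the largest embedding distance falls between the pet sentences and the finance sentences, so that is where the split lands.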
Recursive Chunking
It relies on a hierarchy of separators: the algorithm attempts to split on high-level separators first, then moves to increasingly finer separators if chunks remain too large. The method recursively splits the text until the chunks meet the specified size while preserving the logical structure.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len
)
recursive_chunks = text_splitter.split_text(document)
This kind of splitting creates more context-aware chunks than the simple fixed-size approach and is powerful for structured text or code where block-based splitting is crucial. The drawbacks are that it is more complicated to configure and requires domain-specific separators for best results.
Hierarchical (Parent-Child) Chunking
It is a two-level approach: large parent chunks are created and indexed, and smaller child chunks are stored under each parent and optionally retrieved or bundled with it. This supports retrieving both context and details.
# Pseudocode: split_into_sections, split_into_paragraphs, index_parent,
# and link_children stand in for your splitter and vector-store calls
parents = split_into_sections(document_text, max_tokens=2000)
for parent in parents:
    children = split_into_paragraphs(parent, max_tokens=500)
    parent_id = index_parent(parent)
    link_children(parent_id, children)
This can be used for structured documents like academic papers or legal texts where maintaining hierarchy is essential. It preserves document structure and maintains context at multiple levels of granularity, but it is more complex to implement and may lead to uneven chunks.
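A minimal pure-Python sketch of the two levels, using paragraphs as parents and sentences as children; the `index` dict here stands in for the vector store, with each child carrying a `parent_id` back-reference so retrieval can return the small child for precision and the large parent for context:

```python
def parent_child_chunks(text):
    # Parents: paragraphs (split on blank lines).
    # Children: sentences within each parent, linked via parent_id.
    index = {"parents": {}, "children": []}
    paragraphs = (p for p in text.split("\n\n") if p.strip())
    for parent_id, paragraph in enumerate(paragraphs):
        index["parents"][parent_id] = paragraph
        for sentence in paragraph.split(". "):
            if sentence.strip():
                index["children"].append(
                    {"parent_id": parent_id, "text": sentence.strip()}
                )
    return index

doc = "First topic intro. More detail here.\n\nSecond topic intro. Extra detail."
index = parent_child_chunks(doc)
```

At query time you would match against the children, then follow `parent_id` to fetch the enclosing parent for the prompt.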
Context-Aware Chunking
This method attaches additional metadata or summaries to each chunk, so at retrieval time the model has more background for each chunk, leading to improved understanding during generation.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Create a mock summarization function (a real pipeline would call an LLM)
def mock_summarize(text):
    first_sentence = text.split('.')[0]
    return f"Summary: {first_sentence[:100]}..."

def perform_context_aware_chunking(document, chunk_size=1000,
                                   chunk_overlap=200, window_size=1):
    # Create text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    # Split the document into base chunks
    base_chunks = splitter.split_text(document)
    print(f"Document split into {len(base_chunks)} base chunks")
    # Process chunks with contextual windows
    enriched_documents = []
    for i, chunk in enumerate(base_chunks):
        # Define window around current chunk
        window_start = max(0, i - window_size)
        window_end = min(len(base_chunks), i + window_size + 1)
        window = base_chunks[window_start:window_end]
        # Extract context (excluding the current chunk)
        context_chunks = [c for j, c in enumerate(window) if j != i - window_start]
        context_text = " ".join(context_chunks)
        # Generate mock summary for context
        if context_chunks:
            context_summary = mock_summarize(context_text)
            metadata = {
                "chunk_id": i,
                "total_chunks": len(base_chunks),
                "context": context_summary,
                "context_type": "summary"
            }
            enriched_text = f"Context: {context_summary}\n\nContent: {chunk}"
        else:
            metadata = {
                "chunk_id": i,
                "total_chunks": len(base_chunks),
                "context": "",
                "context_type": "none"
            }
            enriched_text = chunk
        # Create Document object
        doc = Document(
            page_content=enriched_text,
            metadata=metadata
        )
        enriched_documents.append(doc)
    return enriched_documents
This helps maintain coherence across different parts of the document and can boost retrieval performance for queries that span multiple segments. The drawbacks are increased storage and memory requirements, an additional preprocessing layer that adds complexity, and the risk of introducing repetitive information if not managed carefully.
Agentic/LLM based chunking
This approach utilizes an LLM, or multiple agents, to determine how to chunk text, either by prompting for logical chunk boundaries or by dynamic chunk creation based on query context.
prompt = "Divide the following document into semantically isolated chunks each containing a complete thought:"
chunks = call_llm(model, prompt + document_text)
The LLM analyses the document and creates chunks tailored to meaning, often producing more natural chunking than rules. This method is useful for highly variable documents or when manual chunking logic is brittle. It aligns closely with semantic units and offers potentially the best retrieval performance, but it is costly because an LLM is invoked during ingestion, hard to reproduce, and harder to batch at scale; the chunking logic may vary across runs.
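One way to make the LLM's output deterministic to parse is to ask it to insert an explicit marker at each boundary and split on the marker afterwards. In the sketch below, `call_llm` is a stand-in for whatever model client you use; the stub simulates a model that marks the boundary between two sentences, since real LLM output varies from run to run:

```python
MARKER = "<<<SPLIT>>>"

def agentic_chunk(document_text, call_llm):
    # Ask the LLM to mark chunk boundaries, then split deterministically.
    prompt = (
        "Insert the marker " + MARKER + " between semantically "
        "isolated chunks, each containing one complete thought:\n\n"
    )
    marked = call_llm(prompt + document_text)
    return [part.strip() for part in marked.split(MARKER) if part.strip()]

def stub_llm(prompt):
    # Stand-in for a real model call: pretends the LLM placed a marker
    # between the two sentences of the input.
    text = prompt.split("\n\n", 1)[1]
    first, rest = text.split(". ", 1)
    return first + ". " + MARKER + " " + rest

chunks = agentic_chunk("Cats are pets. Stocks are investments.", stub_llm)
```

Injecting the model call as a parameter also makes the chunker testable without spending ingestion-time tokens.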
Document Structure Based Chunking
This chunking uses the inherent structure of the document, like headings, sections, and page breaks, to define chunk boundaries. For instance, each section or sub-section becomes a chunk, optionally further split if too large:
# Split on '## ' headings as chunk boundaries
def split_by_headings(text, max_tokens=500, overlap=50):
    parts = text.split("## ")
    chunks = []
    for part in parts:
        # Further split if part still exceeds max_tokens by word count
        while len(part.split()) > max_tokens:
            # Simple: take the first max_tokens words
            chunk = " ".join(part.split()[:max_tokens])
            chunks.append(chunk)
            part = " ".join(part.split()[max_tokens - overlap:])
        if part:
            chunks.append(part)
    return chunks
We split here on ## headings, but other markers can be handled with custom logic. This method works well when documents have explicit headings; it respects the structure and produces more semantically coherent chunks. The drawback is that the document needs to be structured, otherwise the logic may fail or produce uneven chunks.
Sliding Window (Overlap) Chunking
Chunks are created with a fixed size, but each includes an overlap with the preceding chunk to preserve context across boundaries. This is often implemented with fixed or recursive splitting plus a windowed overlap.
chunk_size = 500
chunk_overlap = 100
chunks = []
words = document_text.split()
for i in range(0, len(words), chunk_size - chunk_overlap):
    chunk = " ".join(words[i: i + chunk_size])
    chunks.append(chunk)
It ensures that every piece of information near a boundary is included in at least one chunk. This is useful for documents with context that spans boundaries, like conversational logs and meeting transcripts. It improves recall by reducing boundary loss and handles queries that span boundaries better, but the drawback is duplicate content across chunks, which requires more storage and computation; retrieval may also return overlapping chunks, increasing redundancy.
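Wrapped into a function, the overlap guarantee is easy to verify: with a step of `chunk_size - chunk_overlap`, any word near a boundary appears in at least two chunks (toy sizes below for readability):

```python
def sliding_window_chunks(document_text, chunk_size=500, chunk_overlap=100):
    # Word-based windows that advance by (chunk_size - chunk_overlap).
    words = document_text.split()
    step = chunk_size - chunk_overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

text = " ".join(f"w{i}" for i in range(12))
chunks = sliding_window_chunks(text, chunk_size=5, chunk_overlap=2)
# step of 3 -> windows cover words 0-4, 3-7, 6-10, 9-11
```

Words 3 and 4 sit in both the first and second windows, which is exactly the boundary redundancy this strategy trades storage for.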
Chunking is foundational to a performant RAG system. It directly influences what your retrieval engine sees, and thus what information enters your LLM prompt. The right chunking strategy enables efficient indexing, accurate retrieval, and high-quality generation. Whether you use fixed-size, recursive, semantic, hierarchical, sliding-window, or agentic chunking, or a hybrid of them, what matters is aligning the strategy with your document types, query demands, and system constraints.
LongRAG Implementation
I have used a hybrid chunking strategy combining hierarchical chunking and semantic chunking to create a RAG framework that supports long documents. Here is the repo link for the code
References
- https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089
- https://aws-samples.github.io/amazon-bedrock-samples/rag/open-source/chunking/rag_chunking_strategies_langchain_bedrock
- https://www.pinecone.io/learn/chunking-strategies
- https://milvus.io/ai-quick-reference/what-chunking-strategies-work-best-for-document-indexing
- https://www.f22labs.com/blogs/7-chunking-strategies-in-rag-you-need-to-know
