How do you chunk your data in RAG — All about chunking strategies
Chunking is the process of splitting documents into smaller units called chunks, each of which is individually indexed, embedded, and retrieved. Because LLMs have limited context windows, their output quality depends heavily on what is retrieved from the vector database. The goal of chunking is to improve retrieval accuracy and reduce computational overhead.
Before digging a little deeper into chunking, let us recall the RAG pipeline:
- Indexing — convert documents into vector embeddings and store them in a vector database.
- Retrieval — Query the database for most relevant chunks
- Augmentation — Inject the retrieved chunks into the LLM prompt
- Generation — Prompt the LLM to produce a final, context-informed response.
Why do we need a strong chunking strategy?
- Embedding models and LLMs have strict context-size limits, so making good use of the context window depends on well-sized chunks.
- Precise and smaller chunks often mean faster lookups and better recall.
- Appropriately sized chunks also reduce processing and optimize compute capacity.
- Semantic integrity ensures more accurate matches and better output.
Chunking is a preprocessing step that directly affects the retrieval phase. There are multiple chunking strategies, and the right one is chosen based on the use case. Let us look at each in detail.
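As a rough end-to-end illustration of the four stages, here is a toy sketch in plain Python. Word-overlap scoring stands in for real vector similarity, and prompt assembly stands in for the LLM call; a real system would use a trained embedding model, a vector database, and an actual model invocation.

```python
# Toy RAG pipeline sketch: word overlap stands in for embedding similarity.

def index_chunks(chunks):
    # Indexing: here the "embedding" is just a bag of words per chunk.
    return [(chunk, set(chunk.lower().split())) for chunk in chunks]

def retrieve(index, query, k=2):
    # Retrieval: rank chunks by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def augment(query, retrieved):
    # Augmentation: inject the retrieved chunks into the prompt.
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Chunking splits documents into smaller units.",
    "Vector databases store embeddings for retrieval.",
    "LLMs generate answers from the augmented prompt.",
]
index = index_chunks(chunks)
query = "how does chunking split documents"
prompt = augment(query, retrieve(index, query, k=1))
# Generation would pass `prompt` to the LLM.
```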

Fixed-Size Chunking
This is a simple and straightforward strategy in which the input text is segmented into equal-sized pieces based on character, token, or word counts. An overlap parameter is often used to reduce data loss at the boundaries. LangChain provides splitters that support this:
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document

def perform_fixed_size_chunking(document, chunk_size=1000, chunk_overlap=200):
    text_splitter = CharacterTextSplitter(
        separator="\n\n",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    # Split the text into chunks
    chunks = text_splitter.split_text(document)
    # Convert to Document objects with metadata
    documents = []
    for i, chunk in enumerate(chunks):
        doc = Document(
            page_content=chunk,
            metadata={
                "chunk_id": i,
                "total_chunks": len(chunks),
                "chunk_size": len(chunk),
                "chunk_type": "fixed-size"
            }
        )
        documents.append(doc)
    return documents
This strategy is easy to implement, and because the chunks are uniform, batch operations are simple. It is suitable when the application does not rely heavily on semantic context. However, it can cut off sentences or paragraphs abruptly and ignores semantic breaks, so sometimes the relevant information ends up scattered across chunks.
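Stripped of the LangChain wrapper, the core idea is just slicing with a stride: step forward by `chunk_size - chunk_overlap` so that consecutive chunks share `chunk_overlap` characters. A minimal pure-Python sketch over characters:

```python
def fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200):
    # Consecutive chunks overlap by chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters with step 800 -> chunks start at 0, 800, 1600, 2400
text = "".join(str(i % 10) for i in range(2500))
chunks = fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200)
```

The tail of each chunk equals the head of the next, which is exactly the boundary redundancy the overlap parameter buys you.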
Recursive Character Text Splitting
With CharacterTextSplitter we simply split by a fixed number of characters, but the recursive splitter is parameterized by a list of separators. It tries to split on them in order until the chunks are small enough. This keeps paragraphs together as long as possible, as those tend to be the most semantically related pieces of text.
from langchain_text_splitters import RecursiveCharacterTextSplitter

rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=10
)
rec_text_splits = rec_text_splitter.split_documents(documents)
Semantic Chunking
This divides the text into meaningful, complete chunks based on semantic similarity as calculated by an embedding model. In most use cases this improves retrieval quality compared with blind, syntactic chunking.
from langchain_experimental.text_splitter import SemanticChunker
semantic_text_splitter = SemanticChunker(embeddings_model)
semantic_text_splits = semantic_text_splitter.split_documents(documents)
It determines where to break by measuring the difference between the embeddings of consecutive sentences; if the difference exceeds a threshold, a split is made. The threshold is determined by one of:
- Percentile — If difference is greater than the X percentile then it is split
- Standard Deviation — If difference is greater than X standard deviations then it is split
- Interquartile — The interquartile distance is used to determine split points.
- Gradient — The gradient of distance is used to split chunks along with percentile method. This is useful when chunks are highly correlated with each other or specific to a domain. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data.
semantic_text_splitter = SemanticChunker(
    embeddings_model,
    breakpoint_threshold_type="percentile"  # or "standard_deviation", "interquartile", "gradient"
)
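The percentile rule can be sketched without any external library by plugging in a sentence-vector function. Here a toy bag-of-words vector over a tiny fixed vocabulary stands in for a real embedding model, and a simple nearest-rank percentile stands in for the library's threshold computation:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb) if na and nb else 1.0

def percentile(values, pct):
    # Nearest-rank percentile, enough for a sketch.
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]

def semantic_split(sentences, embed, pct=95):
    # Distance between each pair of consecutive sentence embeddings.
    vecs = [embed(s) for s in sentences]
    dists = [cosine_distance(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(dists):
        if d >= threshold:          # breakpoint: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Toy embedding: word counts over a fixed vocabulary (illustration only).
VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def toy_embed(sentence):
    words = sentence.lower().split()
    return [float(words.count(w)) for w in VOCAB]

sentences = [
    "the cat is a pet",
    "the dog is a pet",
    "the stock market price rose",
    "the market price fell",
]
chunks = semantic_split(sentences, toy_embed)
```

With these inputs the largest embedding distance falls between the pet sentences and the finance sentences, so that is where the split lands.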
Recursive Chunking
It relies on a hierarchy of separators: the algorithm attempts to split on high-level separators first, then moves to increasingly finer separators if chunks remain too large. The method recursively splits the text until the chunks meet the specified size while preserving the logical structure.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len
)
recursive_chunks = text_splitter.split_text(document)
This kind of splitting creates more context-aware chunks than the simple fixed-size approach and is powerful for structured text or code where block-based splitting is crucial. The drawbacks are that it is more complicated to configure and requires domain-specific separators for best results.
Hierarchical (Parent-Child) Chunking
It is a two-level approach: large parent chunks are created and indexed, and smaller child chunks are stored under each parent and optionally retrieved or bundled with it. This supports retrieving both context and details.
# Pseudocode: split_into_sections, split_into_paragraphs, index_parent,
# and link_children stand in for your splitter and vector-store calls
parents = split_into_sections(document_text, max_tokens=2000)
for parent in parents:
    children = split_into_paragraphs(parent, max_tokens=500)
    parent_id = index_parent(parent)
    link_children(parent_id, children)
This can be used for structured documents like academic papers or legal texts where maintaining hierarchy is essential. It preserves document structure and maintains context at multiple levels of granularity, but it is more complex to implement and may lead to uneven chunks.
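A minimal pure-Python sketch of the two levels, using paragraphs as parents and sentences as children; the `index` dict here stands in for the vector store, with each child carrying a `parent_id` back-reference so retrieval can return the small child for precision and the large parent for context:

```python
def parent_child_chunks(text):
    # Parents: paragraphs (split on blank lines).
    # Children: sentences within each parent, linked via parent_id.
    index = {"parents": {}, "children": []}
    paragraphs = (p for p in text.split("\n\n") if p.strip())
    for parent_id, paragraph in enumerate(paragraphs):
        index["parents"][parent_id] = paragraph
        for sentence in paragraph.split(". "):
            if sentence.strip():
                index["children"].append(
                    {"parent_id": parent_id, "text": sentence.strip()}
                )
    return index

doc = "First topic intro. More detail here.\n\nSecond topic intro. Extra detail."
index = parent_child_chunks(doc)
```

At query time you would match against the children, then follow `parent_id` to fetch the enclosing parent for the prompt.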
Context-Aware Chunking
This method attaches additional metadata or summaries to each chunk, so at retrieval time the model has more background for each chunk, leading to improved understanding during generation.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Create a mock summarization function (a real pipeline would call an LLM)
def mock_summarize(text):
    first_sentence = text.split('.')[0]
    return f"Summary: {first_sentence[:100]}..."

def perform_context_aware_chunking(document, chunk_size=1000,
                                   chunk_overlap=200, window_size=1):
    # Create text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    # Split the document into base chunks
    base_chunks = splitter.split_text(document)
    print(f"Document split into {len(base_chunks)} base chunks")
    # Process chunks with contextual windows
    enriched_documents = []
    for i, chunk in enumerate(base_chunks):
        # Define window around current chunk
        window_start = max(0, i - window_size)
        window_end = min(len(base_chunks), i + window_size + 1)
        window = base_chunks[window_start:window_end]
        # Extract context (excluding the current chunk)
        context_chunks = [c for j, c in enumerate(window) if j != i - window_start]
        context_text = " ".join(context_chunks)
        # Generate mock summary for context
        if context_chunks:
            context_summary = mock_summarize(context_text)
            metadata = {
                "chunk_id": i,
                "total_chunks": len(base_chunks),
                "context": context_summary,
                "context_type": "summary"
            }
            enriched_text = f"Context: {context_summary}\n\nContent: {chunk}"
        else:
            metadata = {
                "chunk_id": i,
                "total_chunks": len(base_chunks),
                "context": "",
                "context_type": "none"
            }
            enriched_text = chunk
        # Create Document object
        doc = Document(
            page_content=enriched_text,
            metadata=metadata
        )
        enriched_documents.append(doc)
    return enriched_documents
This helps maintain coherence across different parts of the document and can boost retrieval performance for queries that span multiple segments. The drawbacks are increased storage and memory requirements, an additional preprocessing layer that adds complexity, and the risk of introducing repetitive information if not managed carefully.
Agentic/LLM based chunking
This approach utilizes an LLM, or multiple agents, to determine how to chunk text, either by prompting for logical chunk boundaries or by dynamic chunk creation based on query context.
prompt = "Divide the following document into semantically isolated chunks each containing a complete thought:"
chunks = call_llm(model, prompt + document_text)
The LLM analyses the document and creates chunks tailored to meaning, often producing more natural chunking than rules. This method is useful for highly variable documents or when manual chunking logic is brittle. It aligns closely with semantic units and offers potentially the best retrieval performance, but it is costly because an LLM is invoked during ingestion, hard to reproduce, and harder to batch at scale; the chunking logic may vary across runs.
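One way to make the LLM's output deterministic to parse is to ask it to insert an explicit marker at each boundary and split on the marker afterwards. In the sketch below, `call_llm` is a stand-in for whatever model client you use; the stub simulates a model that marks the boundary between two sentences, since real LLM output varies from run to run:

```python
MARKER = "<<<SPLIT>>>"

def agentic_chunk(document_text, call_llm):
    # Ask the LLM to mark chunk boundaries, then split deterministically.
    prompt = (
        "Insert the marker " + MARKER + " between semantically "
        "isolated chunks, each containing one complete thought:\n\n"
    )
    marked = call_llm(prompt + document_text)
    return [part.strip() for part in marked.split(MARKER) if part.strip()]

def stub_llm(prompt):
    # Stand-in for a real model call: pretends the LLM placed a marker
    # between the two sentences of the input.
    text = prompt.split("\n\n", 1)[1]
    first, rest = text.split(". ", 1)
    return first + ". " + MARKER + " " + rest

chunks = agentic_chunk("Cats are pets. Stocks are investments.", stub_llm)
```

Injecting the model call as a parameter also makes the chunker testable without spending ingestion-time tokens.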
Document Structure Based Chunking
This chunking uses the inherent structure of the document, like headings, sections, and page breaks, to define chunk boundaries. For instance, each section or sub-section becomes a chunk, optionally further split if too large:
# Split on '## ' headings as chunk boundaries
def split_by_headings(text, max_tokens=500, overlap=50):
    parts = text.split("## ")
    chunks = []
    for part in parts:
        # Further split if part still exceeds max_tokens by word count
        while len(part.split()) > max_tokens:
            # Simple: take the first max_tokens words
            chunk = " ".join(part.split()[:max_tokens])
            chunks.append(chunk)
            part = " ".join(part.split()[max_tokens - overlap:])
        if part:
            chunks.append(part)
    return chunks
We split here on ## headings, but other markers can be handled with custom logic. This method works well when documents have explicit headings; it respects the structure and produces more semantically coherent chunks. The drawback is that the document needs to be structured, otherwise the logic may fail or produce uneven chunks.
Sliding Window (Overlap) Chunking
Chunks are created with a fixed size, but each includes an overlap with the preceding chunk to preserve context across boundaries. This is often implemented with fixed or recursive splitting plus a windowed overlap.
chunk_size = 500
chunk_overlap = 100
chunks = []
words = document_text.split()
for i in range(0, len(words), chunk_size - chunk_overlap):
    chunk = " ".join(words[i: i + chunk_size])
    chunks.append(chunk)
It ensures that every piece of information near a boundary is included in at least one chunk. This is useful for documents with context that spans boundaries, like conversational logs and meeting transcripts. It improves recall by reducing boundary loss and handles queries that span boundaries better, but the drawback is duplicate content across chunks, which requires more storage and computation; retrieval may also return overlapping chunks, increasing redundancy.
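Wrapped into a function, the overlap guarantee is easy to verify: with a step of `chunk_size - chunk_overlap`, any word near a boundary appears in at least two chunks (toy sizes below for readability):

```python
def sliding_window_chunks(document_text, chunk_size=500, chunk_overlap=100):
    # Word-based windows that advance by (chunk_size - chunk_overlap).
    words = document_text.split()
    step = chunk_size - chunk_overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

text = " ".join(f"w{i}" for i in range(12))
chunks = sliding_window_chunks(text, chunk_size=5, chunk_overlap=2)
# step of 3 -> windows cover words 0-4, 3-7, 6-10, 9-11
```

Words 3 and 4 sit in both the first and second windows, which is exactly the boundary redundancy this strategy trades storage for.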
Chunking is foundational to a performant RAG system. It directly influences what your retrieval engine sees, and thus what information enters your LLM prompt. The right chunking strategy enables efficient indexing, accurate retrieval, and high-quality generation. Whether you use fixed-size, recursive, semantic, hierarchical, sliding-window, or agentic chunking, or a hybrid of them, what matters is aligning the strategy with your document types, query demands, and system constraints.
LongRAG Implementation
I have used a hybrid chunking strategy combining hierarchical chunking and semantic chunking to create a RAG framework that supports long documents. Here is the repo link for the code
References
- https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089
- https://aws-samples.github.io/amazon-bedrock-samples/rag/open-source/chunking/rag_chunking_strategies_langchain_bedrock
- https://www.pinecone.io/learn/chunking-strategies
- https://milvus.io/ai-quick-reference/what-chunking-strategies-work-best-for-document-indexing
- https://www.f22labs.com/blogs/7-chunking-strategies-in-rag-you-need-to-know
