When I first started building RAG (Retrieval-Augmented Generation) systems, I thought the biggest problem would be the language model. But I was wrong. The real bottleneck is retrieval. If your pipeline brings back the wrong chunks, even GPT-5 can’t save you.
Over time, I’ve broken down the optimization process into a few clear steps. In this blog, I’ll share what worked for me, the mistakes I made, and the practical tricks that helped me get my RAG pipelines running smoothly.
Step 1: Fix the Basics (The Hard Way I Learned It)
In the beginning, I just dumped raw Word docs into my system and split them arbitrarily. The result was a mess: hallucinations, broken answers, and a lot of frustration.
Later I realized: data quality is everything.
- I started cleaning documents and adding metadata like date, section, and author (see the sketch after this list).
- I updated data regularly instead of letting old stuff sit forever.
- I let subject matter experts validate tricky docs before indexing them.
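To make the metadata step concrete, here is a minimal sketch of the shape my cleaned, tagged records took before indexing; the field names, values, and helper are illustrative, not a fixed schema:
# Illustrative preprocessing sketch: one record per document, with cleaned text
# plus the metadata that later enables filtering. Field names are examples only.
from datetime import date

def prepare_record(raw_text: str, source_path: str) -> dict:
    cleaned = " ".join(raw_text.split())  # collapse stray whitespace and artifacts
    return {
        "text": cleaned,
        "metadata": {
            "source": source_path,                 # hypothetical path
            "date": date.today().isoformat(),      # when it was (re)indexed
            "section": "unknown",                  # filled in by the parser
            "author": "unknown",                   # filled in by the parser or an SME
        },
    }

record = prepare_record("  Payment terms are Net 30.  ", "agreements/acme_2025.docx")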
Lesson learned: A bad foundation will always break your pipeline.
Step 2: Smarter Chunking
One of my early mistakes in RAG was using fixed-size chunks, for example 500 tokens per chunk.
The Problem with Fixed-Size Chunking
Fixed-size chunking doesn’t care about semantics. It only cares about token count.
That leads to problems like:
- A definition split in half
- A table broken across chunks
- A sentence cut mid-way, losing its conclusion
When this happens, the retriever may fetch:
- Half a definition
- A table without headers
- Context without the actual answer
The LLM then tries to “fill the gaps” and ends up hallucinating or giving partial answers, more so at higher temperature settings.
Result: Broken context → Confused LLM → Poor answers
What Worked Better: Semantic-Aware Chunking
Instead of slicing text mechanically, I switched to meaning-preserving chunking.
Each chunk should be understandable on its own.
That means:
- Don’t cut in the middle of sentences
- Prefer paragraph or section boundaries
- Keep related ideas together
Here’s the approach that consistently improved retrieval quality:
1. Sentence-level / Paragraph-aware splitting
Split text using natural separators:
- Paragraphs (\n\n)
- New lines (\n)
- Sentences (.)
This keeps definitions, explanations, and tables intact.
2. Dynamic chunk sizes based on use case
Not all questions need the same context size.
- Small chunks (128–256 tokens). Best for:
  - Very specific questions
  - Fact lookups
  - Definitions
- Medium chunks (512 tokens). Best for:
  - Most documentation
  - API explanations
  - How-to guides
- Large chunks (768–1024 tokens). Best for:
  - Concept-heavy explanations
  - Reasoning across multiple paragraphs
  - Design discussions
The key insight:
Chunk size is not a constant. It’s a trade-off between precision and context.
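To make that trade-off concrete, here is a minimal sketch of picking a splitter per use case; the mapping and numbers are illustrative starting points, not rules:
# Illustrative only: map a use case to a chunk size, then build a splitter for it.
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZES = {            # rough starting points, tune for your corpus
    "fact_lookup": 256,
    "documentation": 512,
    "concept_heavy": 1024,
}

def make_splitter(use_case: str) -> RecursiveCharacterTextSplitter:
    size = CHUNK_SIZES.get(use_case, 512)
    return RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 10,              # roughly 10% overlap
        separators=["\n\n", "\n", ".", " "],
    )

splitter = make_splitter("documentation")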
Example: the base recursive splitter I settled on:
from langchain.text_splitter import RecursiveCharacterTextSplitter

document = open("docs/agreement.txt").read()  # placeholder: your raw document text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # max size per chunk (characters by default; pass a token-based length_function for tokens)
    chunk_overlap=50,     # amount repeated between consecutive chunks
    separators=["\n\n", "\n", ".", " "],  # tried in order, highest-level first
)

chunks = splitter.split_text(document)
print(len(chunks), chunks[0])
In the above example, chunks are created hierarchically based on the separators, while respecting the chunk_size limit.
Here’s what actually happens:
- The splitter tries to fit as much text as possible within chunk_size (500 here; measured in characters by default, or tokens if you supply a token-based length function).
- While doing that, it first attempts to split using the highest-level separator:
- Paragraphs (\n\n)
- If a paragraph is too large to fit, it falls back to:
- New lines (\n)
- If that still doesn’t work, it tries:
- Sentence boundaries (.)
- As a last resort, it splits on:
- Spaces ( )
This recursive fallback ensures that:
- Meaningful boundaries are preferred
- Chunks stay semantically coherent
What chunk_overlap Does
chunk_overlap=50 means:
- The last 50 characters (or tokens, depending on your length function) of one chunk are repeated at the beginning of the next chunk
- This prevents loss of important context at chunk boundaries
- Especially useful when:
  - A key idea spans two chunks
  - A sentence or explanation flows across boundaries
This one change improved retrieval more than any model tuning I tried before.
Step 3: Metadata Saved My Pipeline
At first, my RAG would return irrelevant sections because everything looked “semantically similar.”
Then I started filtering with metadata — like category, year, or document type. For example, if the user asks about “latest agreements,” I don’t want a 2018 doc. Metadata filtering solved that.
Example:
# 'index' is a vector index (Pinecone-style API) and 'embedding' is the query vector
results = index.query(
    vector=embedding,
    filter={"category": "agreements", "year": 2025},  # metadata filter applied at query time
    top_k=5,
)
That small filter improved precision massively.
Step 4: Advanced Retrieval (When Basics Aren’t Enough)
Even with clean data and chunking, some queries were still failing. That’s when I explored more advanced tricks.
Hybrid Search
At first, I only used vector search. But it missed exact keywords like names, dates, and IDs.
Now, I combine vector search + keyword search.
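Here is a minimal sketch of that combination, assuming a BM25 index from the rank_bm25 package next to the vector store and reciprocal rank fusion to merge the two rankings; vector_search() is a placeholder for your vector DB call:
# Hybrid search sketch: fuse keyword (BM25) and vector rankings with reciprocal
# rank fusion (RRF). The corpus and vector_search() stub are placeholders.
from rank_bm25 import BM25Okapi

corpus = ["...chunk 1 text...", "...chunk 2 text...", "...chunk 3 text..."]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def vector_search(query: str, top_k: int = 5) -> list[int]:
    # Placeholder: in the real pipeline this queries the vector DB and
    # returns chunk indices ordered by similarity.
    return [0, 2]

def hybrid_search(query: str, k: int = 5, rrf_k: int = 60) -> list[str]:
    # Keyword ranking: chunk indices sorted by BM25 score, best first
    bm25_scores = bm25.get_scores(query.split())
    keyword_rank = sorted(range(len(corpus)), key=lambda i: bm25_scores[i], reverse=True)

    # Reciprocal rank fusion: reward chunks ranked highly by either method
    fused: dict[int, float] = {}
    for rank_list in (keyword_rank[:k], vector_search(query, top_k=k)):
        for pos, idx in enumerate(rank_list):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_k + pos + 1)

    best = sorted(fused, key=fused.get, reverse=True)[:k]
    return [corpus[i] for i in best]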
Query Rewriting
Users often ask vague questions. I use the LLM itself to rewrite queries into multiple variations. Sometimes I even generate a hypothetical answer and embed it for better retrieval.
Example:
query = "Impact of rate increase on Bill" multi_queries = llm.generate_rewrites(query, n=3)
Re-ranking
This was another game-changer. Instead of blindly trusting vector DB results, I pass them through a cross-encoder model that re-scores based on the query.
Example:
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly, which is slower than
# comparing pre-computed embeddings but much better at judging relevance.
model = CrossEncoder("BAAI/bge-reranker-v2-m3")
scores = model.predict([(query, doc) for doc in retrieved_docs])
ranked = [doc for _, doc in sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)]
I also used the Amazon Rerank 1.0 model via the AWS SDK to rerank retrieved results instead of hosting an open-source reranker myself, which made the pipeline more accurate and easier to run in production.
Suddenly, the top 3 results actually made sense.
Step 5: Making It Fast
Once I got accuracy right, latency was not an issue in my case. But it might be in yours, and nobody likes waiting.
What can work for you:
- Using Redis caching for repeated queries.
- Batching embedding calls instead of embedding texts one by one.
- Running FAISS with optimized indexes (a sketch of the last two follows this list).
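Here is a minimal sketch of the batching and FAISS parts, assuming sentence-transformers for embeddings; the model name, nlist, and nprobe values are illustrative starting points, not recommendations:
# Batch-embed chunks and index them in FAISS with an IVF (approximate) index.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["chunk one ...", "chunk two ...", "chunk three ...", "chunk four ..."]

# One batched call instead of one embedding request per chunk
embeddings = model.encode(chunks, batch_size=64, convert_to_numpy=True).astype("float32")

dim = embeddings.shape[1]
nlist = 2                                    # number of clusters; scale with corpus size
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(embeddings)                      # IVF indexes must be trained before adding vectors
index.add(embeddings)
index.nprobe = 1                             # clusters searched per query: speed vs. recall

query_vec = model.encode(["what is in chunk two?"], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query_vec, 2)
print(ids[0])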
Caching trick I used in another project:
import redis, hashlib, json

r = redis.Redis()

def cache_query(query, func):
    # Key the cache on a hash of the query text
    key = hashlib.md5(query.encode()).hexdigest()
    if r.exists(key):
        return json.loads(r.get(key))
    result = func(query)
    r.set(key, json.dumps(result))  # consider an expiry (ex=...) so stale answers age out
    return result

# Usage: answer = cache_query(user_question, run_rag_pipeline)
This alone reduced response time by ~30% in one of my projects.
Step 6: Don’t Forget Monitoring
At first, I thought once the system works, I’m done. But in production, things change fast.
- Sometimes the model starts hallucinating.
- Sometimes new data breaks chunking.
- Sometimes latency spikes.
So, I started tracking a few things (a minimal hit-rate check is sketched after this list):
- Retrieval precision, recall, hit rate
- Answer relevancy and faithfulness
- Latency per stage
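Even a tiny evaluation script goes a long way. Here is a minimal sketch of a retrieval hit-rate check over a few hand-labelled questions; retrieve() is a placeholder for your retriever and the labels are made-up examples:
# Minimal retrieval hit-rate check: for each labelled question, did any of the
# top-k retrieved chunks come from the expected source document?
labelled_queries = [
    {"question": "What are the payment terms?", "expected_source": "agreements/acme_2025.docx"},
    {"question": "How do I rotate API keys?", "expected_source": "docs/security.md"},
]

def retrieve(question: str, top_k: int = 5) -> list[dict]:
    raise NotImplementedError  # placeholder: should return chunks like {"text": ..., "source": ...}

def hit_rate(queries: list[dict], top_k: int = 5) -> float:
    hits = 0
    for item in queries:
        chunks = retrieve(item["question"], top_k=top_k)
        if any(chunk["source"] == item["expected_source"] for chunk in chunks):
            hits += 1
    return hits / len(queries)

# print(f"hit rate @5: {hit_rate(labelled_queries):.2f}")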
Lesson learned: Monitoring is not optional in RAG.
Final Thoughts
If I had to summarize my experience:
- Start with data, not the LLM. Clean it, chunk it right, and add metadata.
- Mix techniques. Hybrid search + re-ranking beats any single method.
- Speed matters. Use caching and batching.
- Measure everything. Otherwise, you won’t know what’s broken.
When I first built RAG, I wasted weeks tweaking prompts and models. Now I know: the real secret to a good RAG pipeline is optimizing retrieval, step by step.
And honestly? Once retrieval was fixed, my LLM felt 10x smarter — without me touching its weights.