Optimizing RAG Pipelines


When I first started building RAG (Retrieval-Augmented Generation) systems, I thought the biggest problem would be the language model. But I was wrong. The real bottleneck is retrieval. If your pipeline brings back the wrong chunks, even GPT-5 can’t save you.

Over time, I’ve broken down the optimization process into a few clear steps. In this blog, I’ll share what worked for me, the mistakes I made, and the practical tricks that helped me get my RAG pipelines running smoothly.

Step 1: Fix the Basics (The Hard Way I Learned It)


In the beginning, I just dumped raw Word docs into my system and split them randomly. The results were messy: hallucinations, broken answers, and a lot of frustration.

Later I realized: data quality is everything.

  • I started cleaning documents and adding metadata like date, section, and author.
  • I updated data regularly instead of letting old stuff sit forever.
  • I let subject matter experts validate tricky docs before indexing them.
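For the metadata piece, here is a minimal sketch of how that can look at indexing time, assuming LangChain-style Document objects (the field values and the cleaned_text variable are placeholders of mine, not from my actual pipeline):

# A minimal sketch: attach metadata to each cleaned document before indexing.
# Assumes LangChain-style Document objects; field values are placeholders.
from langchain.schema import Document

doc = Document(
    page_content=cleaned_text,  # text after cleanup -- placeholder variable
    metadata={
        "date": "2025-03-01",
        "section": "pricing",
        "author": "legal-team",
    },
)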

Lesson learned: A bad foundation will always break your pipeline.

Step 2: Smarter Chunking

One of my early mistakes in RAG was using fixed-size chunks -- for example, 500 tokens per chunk.

The Problem with Fixed-Size Chunking

Fixed-size chunking doesn’t care about semantics. It only cares about token count.

That leads to problems like:

  • A definition split in half
  • A table broken across chunks
  • A sentence cut mid-way, losing its conclusion

When this happens, the retriever may fetch:

  • Half a definition
  • A table without headers
  • Context without the actual answer

The LLM then tries to “fill the gaps” and ends up hallucinating or giving partial answers, depending on the temperature we set.

Result: Broken context → Confused LLM → Poor answers

What Worked Better: Semantic-Aware Chunking

Instead of slicing text mechanically, I switched to meaning-preserving chunking.

Each chunk should be understandable on its own.

That means:

  • Don’t cut in the middle of sentences
  • Prefer paragraph or section boundaries
  • Keep related ideas together

Here’s the approach that consistently improved retrieval quality:

1. Sentence-level / Paragraph-aware splitting

Split text using natural separators:

  • Paragraphs (\n\n)
  • New lines (\n)
  • Sentences (.)

This keeps definitions, explanations, and tables intact.

2. Dynamic chunk sizes based on use case

Not all questions need the same context size.

  • Small chunks (128–256 tokens)
    Best for:
    • Very specific questions
    • Fact lookups
    • Definitions
  • Medium chunks (512 tokens)
    Best for:
    • Most documentation
    • API explanations
    • How-to guides
  • Large chunks (768–1024 tokens)
    Best for:
    • Concept-heavy explanations
    • Reasoning across multiple paragraphs
    • Design discussions

The key insight:
Chunk size is not a constant. It’s a trade-off between precision and context.


Example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_text(document)
print(len(chunks), chunks[0])

In the example above, chunks are created hierarchically based on the separators while respecting the chunk_size limit.

Here’s what actually happens:

  1. The splitter tries to fit as much text as possible within chunk_size (500 here; it counts characters by default, not tokens).
  2. While doing that, it first attempts to split on the highest-level separator: paragraphs (\n\n).
  3. If a paragraph is too large to fit, it falls back to new lines (\n).
  4. If that still doesn’t work, it tries sentence boundaries (.).
  5. As a last resort, it splits on spaces ( ).

This recursive fallback ensures that:

  • Meaningful boundaries are preferred
  • Chunks stay semantically coherent

What chunk_overlap Does

chunk_overlap=50 means:

  • Roughly the last 50 characters of one chunk are repeated at the beginning of the next chunk
  • This prevents loss of important context at chunk boundaries
  • Especially useful when:
    • A key idea spans two chunks
    • A sentence or explanation flows across boundaries
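A quick way to see the overlap with the splitter above (just a sanity check, not part of the pipeline):

# Peek at consecutive chunks: roughly the last chunk_overlap characters of one
# chunk reappear at the start of the next (subject to separator boundaries).
for prev, nxt in zip(chunks, chunks[1:]):
    print(repr(prev[-50:]))
    print(repr(nxt[:50]))
    print("---")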

This one change improved retrieval more than any model tuning I tried before.

Step 3: Metadata Saved My Pipeline

At first, my RAG would return irrelevant sections because everything looked “semantically similar.”

Then I started filtering with metadata — like category, year, or document type. For example, if the user asks about “latest agreements,” I don’t want a 2018 doc. Metadata filtering solved that.

image

Example:

results = index.query(
    vector=embedding,
    filter={"category": "agreements", "year": 2025},
    top_k=5
)

That small filter improved precision massively.

Step 4: Advanced Retrieval (When Basics Aren’t Enough)

Even with clean data and chunking, some queries were still failing. That’s when I explored more advanced tricks.

Hybrid Search

At first, I only used vector search. But it missed exact keyword matches like names, dates, and IDs.
Now, I combine vector search + keyword search, roughly as sketched below.

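Here is a minimal sketch of one way to fuse the two, assuming a rank_bm25 keyword index and a vector_search helper that returns document indices (both docs and vector_search are stand-ins for whatever your stack provides):

# Hybrid retrieval sketch: BM25 keyword ranks fused with vector-search ranks
# via reciprocal rank fusion (RRF). `docs` and `vector_search` are placeholders.
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query, k=5, rrf_k=60):
    # Keyword side: rank every doc by its BM25 score for this query
    scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Vector side: assumed to return doc indices ordered by similarity
    vec_rank = vector_search(query, k=len(docs))
    # RRF: each list contributes 1 / (rrf_k + rank) per document
    fused = {}
    for rank_list in (bm25_rank, vec_rank):
        for rank, idx in enumerate(rank_list):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_k + rank)
    return [docs[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]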

Query Rewriting

Users often ask vague questions. I use the LLM itself to rewrite queries into multiple variations. Sometimes I even generate a hypothetical answer and embed it for better retrieval.

Example:

query = "Impact of rate increase on Bill"
multi_queries = llm.generate_rewrites(query, n=3)  # pseudocode: any "rewrite this query N ways" call works here
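The generate_rewrites call above is shorthand. A rough sketch of such a helper, assuming an OpenAI-style chat client (the model name and prompt are my own choices, not from the original pipeline), could look like this:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_rewrites(query, n=3):
    # Ask the LLM for n alternative phrasings of the same question
    prompt = (
        f"Rewrite the following search query in {n} different ways, "
        f"one per line, keeping the original intent:\n\n{query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip("-• 0123456789.").strip() for line in lines if line.strip()][:n]

Each rewrite gets embedded and retrieved separately, and the result sets are merged before re-ranking.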

Re-ranking

This was another game-changer. Instead of blindly trusting vector DB results, I pass them through a cross-encoder model that re-scores based on the query.

Example:

from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-v2-m3")

# Score each (query, document) pair, then sort documents by score
scores = model.predict([(query, doc) for doc in retrieved_docs])
ranked = [doc for _, doc in sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)]

I also used the Amazon Rerank 1.0 model via the AWS SDK to rerank retrieved results instead of hosting the open-source reranker models, making the RAG pipeline more accurate and production-ready.

Suddenly, the top 3 results actually made sense.

Step 5: Making It Fast

Once I got accuracy right, latency wasn’t an issue in my case. But in yours it might be, and nobody likes waiting!

What can work for you:

  • Using Redis caching for repeated queries.
  • Batch embedding instead of embedding chunks one by one.
  • Running FAISS with optimized indexes (see the sketch after this list).
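To make the last two concrete, here is a rough sketch assuming sentence-transformers embeddings and a FAISS IVF index (the model name and cluster count are illustrative, not from my setup):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Batch embedding: encode all chunks in one call instead of one request per chunk
embeddings = np.asarray(model.encode(chunks, batch_size=64), dtype="float32")

# Optimized index: IVF partitions vectors into clusters so a query only scans a few
dim = embeddings.shape[1]
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100)  # 100 clusters -- tune to corpus size
index.train(embeddings)
index.add(embeddings)
index.nprobe = 8  # clusters scanned per query; higher = more accurate but slower

query_vec = np.asarray(model.encode(["latest agreement terms"]), dtype="float32")
distances, ids = index.search(query_vec, 5)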

Caching trick I used in another project:

import redis, hashlib, json

r = redis.Redis()

def cache_query(query, func):
    # Key the cache on a hash of the raw query text
    key = hashlib.md5(query.encode()).hexdigest()
    if r.exists(key):
        return json.loads(r.get(key))
    result = func(query)
    r.set(key, json.dumps(result))
    return result
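Usage is then just cache_query(user_question, run_rag), where run_rag is whatever function does retrieval plus generation (those names are mine, for illustration only).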

This alone reduced response time by ~30% in one of my projects.

Step 6: Don’t Forget Monitoring

At first, I thought that once the system worked, I was done. But in production, things change fast.

  • Sometimes the model starts hallucinating.
  • Sometimes new data breaks chunking.
  • Sometimes latency spikes.

So, I started tracking:

  • Retrieval precision, recall, hit rate
  • Answer relevancy and faithfulness
  • Latency per stage
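For the retrieval metrics, a tiny hit-rate check over a labeled query set is enough to start with. Here is a sketch; the labeled_queries structure and the retrieve helper are placeholders of my own:

def hit_rate_at_k(labeled_queries, retrieve, k=5):
    # Fraction of queries where at least one known-relevant chunk appears in the top-k
    hits = 0
    for query, relevant_ids in labeled_queries:
        retrieved_ids = {doc["id"] for doc in retrieve(query, top_k=k)}
        if retrieved_ids & set(relevant_ids):
            hits += 1
    return hits / len(labeled_queries)

# labeled_queries = [("Impact of rate increase on Bill", {"chunk_042", "chunk_107"}), ...]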

Lesson learned: Monitoring is not optional in RAG.

Final Thoughts

If I had to summarize my experience:

  1. Start with data, not the LLM. Clean it, chunk it right, and add metadata.
  2. Mix techniques. Hybrid search + re-ranking beats any single method.
  3. Speed matters. Use caching and batching.
  4. Measure everything. Otherwise, you won’t know what’s broken.

When I first built RAG, I wasted weeks tweaking prompts and models. Now I know: the real secret to a good RAG pipeline is optimizing retrieval, step by step.

And honestly? Once retrieval was fixed, my LLM felt 10x smarter — without me touching its weights.
