RAG Application: 7 Retrieval Tricks to Boost Answer Accuracy
Upgrade your RAG pipeline with proven chunking, MMR, metadata filters, and BM25/TF‑IDF choices to deliver sharper, trustworthy answers consistently.
Let me tell you something I've learned the hard way. Most RAG pipelines don't fail because the LLM can't generate good answers. They fail because they can't find the right information in the first place. After building dozens of these systems, I've discovered that retrieval is where the real battle happens.
In this guide, I'm going to walk you through implementing seven retrieval techniques that actually work. We'll build chunking, semantic search, MMR, metadata filters, BM25, and TF-IDF with FAISS, then tie everything together with a hybrid router. The goal here is simple: get sharper, more trustworthy answers from large document sets.
Now, I should mention upfront that this guide focuses purely on the retrieval layer. You'll build a complete context-fetching system that returns top-k chunks with reduced redundancy, normalized scores, and measured latency. By the time we're done, you'll have a working hybrid retrieval function ready to plug into any LLM generation step. And if you're curious about retrieval pipelines that leverage structured relationships, our guide to building a Knowledge Graph RAG pipeline with Neo4j and embeddings offers a completely different approach that's worth exploring.

Why This Approach Works
Here's the thing about single-method retrieval. It misses too much. I learned this lesson while working on a customer support system that needed to handle both technical documentation and conversational queries. Semantic search alone kept missing exact error codes that customers typed in. BM25 alone couldn't understand when someone described a problem in their own words instead of using our technical terminology.
The solution? Combine multiple strategies, each one tuned for different query types. This way you capture the right evidence regardless of how someone asks their question.
For this build, I'm using LangChain for retriever adapters and splitters, Chroma for fast local vector search, FAISS for scalable similarity, rank-bm25 for exact matches, and sklearn TF-IDF for keyword baselines. Sure, you could use alternatives like Elasticsearch, Qdrant, or Weaviate if you need production-grade scale and persistence. But honestly, this stack prioritizes speed, simplicity, and Colab compatibility. Perfect for rapid prototyping and experimentation.
How It Works (High-Level Overview)
The approach is straightforward once you break it down. You'll split documents into chunks using three different strategies, embed them with SentenceTransformers, then build indexes for semantic retrieval with Chroma, keyword retrieval with BM25, and TF-IDF+FAISS retrieval.
But that's just the foundation. Next, you'll add MMR for diversity, metadata filters for precision, and a hybrid router that's smart enough to select the best retrieval path based on query heuristics.
Each technique I'm showing you addresses a specific failure mode I've encountered. Semantic drift when the model wanders off topic. Keyword blindness when it can't find exact terms. Redundancy when it keeps returning the same information. Or just plain irrelevant context. The final router merges results, deduplicates everything, and returns top-k chunks ready for your LLM to consume.
Project Overview
What we're building: A retrieval layer that accepts a query and returns the top-k most relevant document chunks from a corpus. It uses a hybrid of semantic, keyword, and metadata-driven strategies to get the job done.
The real problem we're solving: Single-method retrieval fails on diverse queries. I've seen this happen countless times. Semantic search misses exact codes, BM25 misses paraphrases, and both return redundant or completely off-topic chunks.
The core challenge: How do you combine multiple retrieval methods without score confusion, redundancy, or misrouting? And how do you keep latency low while making results interpretable? That's what we're tackling here.
Setup and Dependencies
First things first. This cell securely loads API keys from Colab userdata. You'll want OpenAI and Anthropic keys on hand for downstream LLM integration. To be honest, they're not used in this retrieval-only build, but you'll need them once you wire in a full RAG pipeline later.
import os
from google.colab import userdata
from google.colab.userdata import SecretNotFoundError

keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = []
for k in keys:
    value = None
    try:
        value = userdata.get(k)
    except SecretNotFoundError:
        pass
    os.environ[k] = value if value is not None else ""
    if not os.environ[k]:
        missing.append(k)
if missing:
    raise EnvironmentError(f"Missing keys: {', '.join(missing)}. Add them in Colab → Settings → Secrets.")
print("All keys loaded.")
Now let's install all the dependencies. This includes LangChain core and text splitters, Chroma for vector storage, FAISS for similarity search, BM25 for keyword retrieval, and sklearn for TF-IDF. In my experience, this takes about 30 seconds in Colab. The memory footprint for TF-IDF dense conversion stays manageable for small to medium corpora, up to around 10,000 chunks. If you're working with larger datasets, you might want to consider sparse backends like Elasticsearch.
!pip -q install -U langchain langchain-core langchain-community langchain-huggingface langchain-text-splitters chromadb sentence-transformers faiss-cpu rank-bm25 nltk tiktoken scikit-learn requests==2.32.4 opentelemetry-exporter-otlp-proto-common==1.37.0 opentelemetry-proto==1.37.0 opentelemetry-sdk==1.37.0 opentelemetry-api==1.37.0 google-adk==1.17.0
import os
import time
import math
import numpy as np
from pprint import pprint
import logging
import nltk

# Newer NLTK releases ship the sentence tokenizer data as "punkt_tab"; grab both so word_tokenize works either way.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    SentenceTransformersTokenTextSplitter,
)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
import faiss
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
Let's verify the embedding model and environment setup. This health check is something I always do. It ensures the model loads correctly and returns the expected 384-dimensional vectors. Trust me, catching this early saves headaches later.
print("Embeddings model:", EMBED_MODEL)
vec = embeddings.embed_query("health check")
print("Embedding dim:", len(vec))
print("FAISS available:", faiss is not None)
print("Chroma available:", Chroma is not None)
Stage 1: Chunking Strategy First
Alright, here's where the real work begins. We're going to chunk documents using three strategies to match different content types. Natural boundaries for prose, token-based for LLM limits, and embedding-compatible splits for SentenceTransformers.
I cannot stress this enough. Good chunks are the single biggest lever for retrieval quality. I've seen perfectly good RAG systems fail simply because the chunking was wrong.
Start with 256 to 400 tokens and 10 to 20% overlap. If you're seeing duplicates, reduce the overlap. And if you want to avoid common issues that can completely derail your chunking and retrieval, check out our breakdown of tokenization pitfalls and invisible characters that break prompts and RAG.
raw_docs = [
    Document(page_content="""Product OrionX v2.3 Release Notes:
- Added support for TLS 1.3
- Deprecated config flag net.legacy_mode
- Fixed CVE-2024-12345 affecting auth handshake
For migrations, see Section 4.2.""",
             metadata={"source": "release_notes", "version": "2.3", "date": "2024-06-01", "section": "overview"}),
    Document(page_content="""OrionX Admin Manual – Networking:
To enable TLS 1.3, set security.tls_version=1.3 in orionx.conf.
For FIPS mode, enable crypto.fips=true and restart.
Avoid using net.legacy_mode in production.""",
             metadata={"source": "manual", "section": "networking", "date": "2024-05-20"}),
    Document(page_content="""Troubleshooting:
Handshake failures with error code OX-AUTH-902 typically indicate clock skew.
Verify NTP and ensure the client presents ECDSA certificates.
See Appendix A for certificate chains and sample configs.""",
             metadata={"source": "manual", "section": "troubleshooting", "date": "2024-05-20"}),
    Document(page_content="""Support KB #KB-7782:
How to resolve OX-AUTH-902 during federated login with Azure AD.
Root cause: invalid audience in JWT.
Mitigation: set auth.saml.audience=orionx-prod in IdP config.""",
             metadata={"source": "kb", "id": "KB-7782", "date": "2024-07-15"}),
]

rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""],
)
rc_chunks = rc_splitter.split_documents(raw_docs)

tok_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
tok_chunks = tok_splitter.split_documents(raw_docs)

st_splitter = SentenceTransformersTokenTextSplitter(chunk_size=256, chunk_overlap=32)
st_chunks = st_splitter.split_documents(raw_docs)
print("Recursive chunks:", len(rc_chunks))
print("Token-based chunks:", len(tok_chunks))
print("ST token chunks:", len(st_chunks))
print("--- Sample chunk ---\n", rc_chunks[0].page_content, "\n", rc_chunks[0].metadata)
Stage 2: Semantic Search with Chroma
Time to build a Chroma vector store for semantic retrieval. This indexes all your chunks with SentenceTransformers embeddings and returns the top-k most similar chunks for any given query. Pretty straightforward, but incredibly powerful when done right.
chunks = st_chunks
chroma_store = Chroma.from_documents(documents=chunks, embedding=embeddings, collection_name="orionx_demo")

def semantic_search(query, k=4):
    """
    Perform semantic similarity search using Chroma vector store.

    Args:
        query (str): The user query.
        k (int): Number of top results to return.

    Returns:
        List[Tuple[Document, float]]: List of (Document, relevance_score) tuples.
    """
    return chroma_store.similarity_search_with_relevance_scores(query, k=k)

results = semantic_search("How do I enable TLS 1.3?")
for doc, score in results:
    print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:80]}")
Stage 3: Max Marginal Relevance (MMR)
Now we need to control redundancy with MMR. This balances relevance and diversity in your results. The lambda parameter controls the trade-off. Set it to 1.0 for pure relevance, 0.0 for pure diversity.
In practice, I use 0.3 to 0.7 for most cases. Go lower when you want exploration, higher when you need precision. It's one of those things you'll get a feel for after experimenting with your specific data.
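To make the lambda trade-off concrete, here's a toy sketch of the MMR selection rule itself. This is my own from-scratch illustration, not what Chroma runs internally: each step picks the candidate that maximizes lambda times its query similarity minus (1 minus lambda) times its similarity to anything already selected.

import numpy as np

def mmr_select(query_vec, cand_vecs, k=4, lambda_mult=0.5):
    """Toy MMR: greedily pick k candidates, trading query relevance against redundancy.
    Assumes all vectors are L2-normalized so dot products behave like cosine similarity."""
    query_sim = cand_vecs @ query_vec  # relevance of each candidate to the query
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            # Redundancy penalty: similarity to the closest already-selected candidate
            redundancy = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * float(query_sim[i]) - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected  # indices of chosen candidates, in selection order

In the pipeline itself, we just lean on Chroma's built-in MMR search: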
def mmr_search(query, k=4, fetch_k=20, lambda_mult=0.5):
    """
    Perform Max Marginal Relevance (MMR) search to balance relevance and diversity.

    Args:
        query (str): The user query.
        k (int): Number of results to return.
        fetch_k (int): Number of candidates to fetch before MMR selection.
        lambda_mult (float): Trade-off between relevance (1.0) and diversity (0.0).

    Returns:
        List[Tuple[Document, None]]: List of (Document, None) tuples for compatibility.
    """
    docs = chroma_store.max_marginal_relevance_search(
        query=query, k=k, fetch_k=fetch_k, lambda_mult=lambda_mult
    )
    return [(d, None) for d in docs]

q = "Troubleshooting ORIONX handshake failures"

print("\n-- Similarity Search --")
for doc, score in semantic_search(q, k=4):
    print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:80]}")

print("\n-- MMR (lambda=0.5) --")
for doc, _ in mmr_search(q, k=4, lambda_mult=0.5):
    print(f" | {doc.metadata} | {doc.page_content.splitlines()[0][:80]}")
Stage 4: Metadata Filtering
Here's where we add precision with metadata filtering. This restricts retrieval to chunks matching specific metadata keys. Could be source type, section, date range, whatever makes sense for your use case.
One thing I learned the hard way: normalize your metadata keys at ingestion. Otherwise you'll get silent filter misses that are incredibly frustrating to debug.
def filtered_search(query, meta_filter: dict, k=4):
    """
    Perform semantic search with metadata filtering.

    Args:
        query (str): The user query.
        meta_filter (dict): Metadata key-value pairs to filter on.
        k (int): Number of results to return.

    Returns:
        List[Tuple[Document, float]]: List of (Document, relevance_score) tuples.
    """
    return chroma_store.similarity_search_with_relevance_scores(query, k=k, filter=meta_filter)

print("\n-- Filter: source=manual, section=networking --")
for doc, score in filtered_search("Enable TLS 1.3", {"$and": [{"source": "manual"}, {"section": "networking"}]}, k=3):
    print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:80]}")
Stage 5: BM25 for Exact Matches
Let's add BM25 for exact-match and jargon-heavy queries. BM25 really shines with keyword precision. I'm talking error codes, product names, technical identifiers. The stuff that semantic search sometimes glosses over.
import nltk
nltk.download('punkt_tab', quiet=True)  # Ensure punkt_tab is downloaded

def tokenize(text):
    """
    Tokenize text for BM25 using NLTK's word_tokenize.

    Args:
        text (str): Input text.

    Returns:
        List[str]: List of lowercased tokens.
    """
    return nltk.word_tokenize(text.lower())

bm25_corpus = [c.page_content for c in chunks]
bm25_tokens = [tokenize(t) for t in bm25_corpus]
bm25 = BM25Okapi(bm25_tokens)

def bm25_search(query, k=5):
    """
    Perform BM25 keyword search.

    Args:
        query (str): The user query.
        k (int): Number of top results to return.

    Returns:
        List[Tuple[Document, float]]: List of (Document, BM25_score) tuples.
    """
    tokens = tokenize(query)
    scores = bm25.get_scores(tokens)
    idxs = np.argsort(scores)[::-1][:k]
    out = []
    for i in idxs:
        out.append((chunks[i], float(scores[i])))
    return out

print("\n-- BM25: technical query with specific token --")
for doc, score in bm25_search("Resolve OX-AUTH-902 handshake failures", k=4):
    print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")
Stage 6: TF-IDF + FAISS for Keyword Retrieval
Now we're adding TF-IDF with FAISS for scalable keyword retrieval with cosine similarity. This approach is fast for small to medium corpora, up to about 10,000 chunks.
Actually, wait. If you're dealing with larger datasets, you might want to consider sparse backends like Elasticsearch or Pyserini. The dense conversion overhead can become a real bottleneck otherwise.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
tfidf_matrix = vectorizer.fit_transform(bm25_corpus)
tfidf_dense = tfidf_matrix.astype(np.float32).toarray()

def l2_normalize(mat):
    """
    L2-normalize a matrix for cosine similarity via inner product.

    Args:
        mat (np.ndarray): Input matrix.

    Returns:
        np.ndarray: L2-normalized matrix.
    """
    norms = np.linalg.norm(mat, axis=1, keepdims=True) + 1e-12
    return mat / norms

tfidf_norm = l2_normalize(tfidf_dense)
dim = tfidf_norm.shape[1]
faiss_index = faiss.IndexFlatIP(dim)
faiss_index.add(tfidf_norm)

def tfidf_faiss_search(query, k=5):
    """
    Perform TF-IDF + FAISS search for keyword/phrase queries.

    Args:
        query (str): The user query.
        k (int): Number of top results to return.

    Returns:
        List[Tuple[Document, float]]: List of (Document, cosine_score) tuples.
    """
    k = min(k, faiss_index.ntotal)  # avoid -1 indices when k exceeds the corpus size
    q_vec = vectorizer.transform([query]).astype(np.float32).toarray()
    q_vec = l2_normalize(q_vec)
    D, I = faiss_index.search(q_vec, k)
    out = []
    for score, idx in zip(D[0], I[0]):
        out.append((chunks[int(idx)], float(score)))
    return out

print("\n-- TF-IDF+FAISS: phrase query --")
for doc, score in tfidf_faiss_search("enable TLS 1.3 in config", k=4):
    print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")
Stage 7: Hybrid Retrieval Router
This is where everything comes together. We're building a hybrid retrieval router that combines all our retrieval strategies. The router uses query heuristics to select the best path.
Semantic for conceptual queries. MMR for long queries where you need diversity. BM25 or TF-IDF for keyword-heavy queries. And metadata filters for time-bounded or source-specific queries. Results get merged, deduplicated, and normalized to ensure consistent ranking.
Honestly, this router is what makes the whole system work. It's like having multiple specialists on your team, each one called in for their specific expertise.
import re

def is_keyword_heavy(query):
    """
    Heuristic to detect if a query is keyword-heavy (IDs, codes, symbols).

    Args:
        query (str): The user query.

    Returns:
        bool: True if keyword-heavy, else False.
    """
    version_pattern = r'\bv?\d+(\.\d+)*\b'  # escape the dot so "2.3" matches but "2 3" doesn't
    error_code_pattern = r'\b[A-Z]{2,}-[A-Z0-9-]+\b'
    date_pattern = r'\b20\d{2}\b'
    versions = len(re.findall(version_pattern, query))
    codes = len(re.findall(error_code_pattern, query))
    dates = len(re.findall(date_pattern, query))
    return (versions + codes + dates) >= 1
def is_time_bounded(query):
    """
    Heuristic to detect if a query is time-bounded.

    Args:
        query (str): The user query.

    Returns:
        bool: True if time-bounded, else False.
    """
    triggers = ["before", "after", "since", "between", "version", r"\bv\d+"]
    pattern = "|".join(triggers)
    return bool(re.search(pattern, query.lower()))

def pick_semantic_mode(query):
    """
    Choose between MMR and similarity search based on query length.

    Args:
        query (str): The user query.

    Returns:
        str: "mmr" or "similarity"
    """
    return "mmr" if len(query.split()) >= 6 else "similarity"
def normalize_scores(results, method_name):
    """
    Normalize scores within a method using min-max scaling.

    Args:
        results (List[Tuple[Document, float]]): List of (Document, score) tuples.
        method_name (str): Name of the retrieval method (for logging).

    Returns:
        List[Tuple[Document, float]]: Normalized results.
    """
    if not results or all(s is None for _, s in results):
        return results
    scores = [s for _, s in results if s is not None]
    if not scores:
        return results
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [(doc, 1.0) for doc, _ in results]
    normalized = []
    for doc, score in results:
        if score is not None:
            norm_score = (score - min_s) / (max_s - min_s)
            normalized.append((doc, norm_score))
        else:
            normalized.append((doc, 0.0))
    return normalized
def merge_results(*lists, max_k=6):
    """
    Merge and deduplicate results from multiple retrieval strategies.
    Uses content hash for deduplication and sorts by normalized score.

    Args:
        *lists: Lists of (Document, score) tuples.
        max_k (int): Max number of results to return.

    Returns:
        List[Tuple[Document, float]]: Merged, deduplicated, sorted results.
    """
    seen = set()
    merged = []
    for lst in lists:
        for doc, score in lst:
            key = hash(doc.page_content.strip())
            if key not in seen:
                merged.append((doc, score if score is not None else 0.0))
                seen.add(key)
    merged.sort(key=lambda x: x[1], reverse=True)
    return merged[:max_k]
def hybrid_retrieve(query, k=6, meta_filter=None):
    """
    Hybrid retrieval combining semantic, MMR, metadata, BM25, and TF-IDF+FAISS.
    Normalizes scores per method before merging to ensure fair ranking.

    Args:
        query (str): The user query.
        k (int): Number of results to return.
        meta_filter (dict, optional): Metadata filter.

    Returns:
        List[Tuple[Document, float]]: Top-k merged results.
    """
    candidates = []
    if meta_filter:
        filt_results = filtered_search(query, meta_filter, k=min(4, k))
        candidates += normalize_scores(filt_results, "filtered")
    mode = pick_semantic_mode(query)
    if mode == "mmr":
        mmr_results = mmr_search(query, k=min(4, k), lambda_mult=0.5)
        candidates += [(doc, 0.5) for doc, _ in mmr_results]  # MMR returns no scores; assign a neutral 0.5
    else:
        sem_results = semantic_search(query, k=min(4, k))
        candidates += normalize_scores(sem_results, "semantic")
    if is_keyword_heavy(query):
        bm_results = bm25_search(query, k=min(4, k))
        candidates += normalize_scores(bm_results, "bm25")
        tf_results = tfidf_faiss_search(query, k=min(4, k))
        candidates += normalize_scores(tf_results, "tfidf")
    if is_time_bounded(query):
        time_results = filtered_search(query, {"source": "release_notes"}, k=min(3, k))
        candidates += normalize_scores(time_results, "time_bounded")
    merged = merge_results(candidates, max_k=k)
    if len(merged) < k:
        fallback = bm25_search(query, k=k)
        merged = merge_results(merged, normalize_scores(fallback, "fallback"), max_k=k)
    return merged
queries = [
    "How do I enable TLS 1.3?",
    "Resolve OX-AUTH-902 handshake failures",
    "What changed in version 2.3 regarding legacy mode?",
    "Troubleshooting steps for federated login with Azure AD",
    "Only show networking config details from the manual",
]

for q in queries:
    print(f"\n=== {q} ===")
    for doc, score in hybrid_retrieve(q, k=5, meta_filter={"section": "networking"} if "networking" in q.lower() else None):
        print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")
Run and Validate
Time to run the complete pipeline across all retrieval techniques. This cell times each method and prints results for a set of test queries. You can compare performance and quality side by side.
test_queries = [
    "Enable TLS 1.3 in configuration",
    "What is KB-7782 about?",
    "Fix CVE-2024-12345 handshake issue",
    "Why does OX-AUTH-902 occur?",
    "Show only manual networking guidance",
]

def time_call(fn, *args, **kwargs):
    """
    Time a function call and return its output and elapsed time in ms.

    Args:
        fn (callable): Function to call.
        *args: Positional arguments.
        **kwargs: Keyword arguments.

    Returns:
        Tuple[Any, float]: (Function output, elapsed time in ms)
    """
    t0 = time.time()
    out = fn(*args, **kwargs)
    return out, (time.time() - t0) * 1000
for q in test_queries:
    print(f"\n\n### Query: {q}")

    ss_out, ss_ms = time_call(semantic_search, q, 4)
    print(f"\n-- Similarity Search ({ss_ms:.1f} ms) --")
    for doc, score in ss_out:
        print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")

    mmr_out, mmr_ms = time_call(mmr_search, q, 4, 20, 0.5)
    print(f"\n-- MMR ({mmr_ms:.1f} ms) --")
    for doc, _ in mmr_out:
        print(f" | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")

    filt = {"source": "manual"} if "manual" in q.lower() or "networking" in q.lower() else None
    if filt:
        fs_out, fs_ms = time_call(filtered_search, q, filt, 4)
        print(f"\n-- Metadata-Filtered ({fs_ms:.1f} ms), filter={filt} --")
        for doc, score in fs_out:
            print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")

    bm_out, bm_ms = time_call(bm25_search, q, 4)
    print(f"\n-- BM25 ({bm_ms:.1f} ms) --")
    for doc, score in bm_out:
        print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")

    tf_out, tf_ms = time_call(tfidf_faiss_search, q, 4)
    print(f"\n-- TF-IDF+FAISS ({tf_ms:.1f} ms) --")
    for doc, score in tf_out:
        print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")

    hy_out, hy_ms = time_call(hybrid_retrieve, q, 6, meta_filter={"section": "networking"} if "networking" in q.lower() else None)
    print(f"\n-- Hybrid Router ({hy_ms:.1f} ms) --")
    for doc, score in hy_out:
        print(f"{score:.3f} | {doc.metadata} | {doc.page_content.splitlines()[0][:100]}")
Practical Tuning Guidelines
After building dozens of these systems, here's what I've learned about tuning:
Chunking: Start with 256 to 400 tokens, 10 to 20% overlap. If you see duplicates, reduce the overlap. Simple as that.
MMR: Lambda between 0.3 and 0.7 works for most cases. Lower for exploration, higher for precision. You'll know when you've got it right.
k value: 3 to 6 is the sweet spot for most prompts. More than that and you're just raising token costs without much benefit.
BM25/TF-IDF: Keep both. Seriously. BM25 for short keywords, TF-IDF+FAISS for fast cosine search on small to medium corpora.
Metadata: Normalize keys at ingestion. Avoid free-form fields that don't filter cleanly. This will save you so much debugging time.
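If you'd rather keep these defaults in one place instead of scattered across function signatures, a small config object works well. The RetrievalConfig class below is just a hypothetical helper of my own, with the numbers from the guidelines above baked in as defaults.

from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    """Hypothetical knob container for the tuning defaults discussed above."""
    chunk_tokens: int = 320      # target 256-400 tokens per chunk
    chunk_overlap: int = 48      # roughly 10-20% of chunk_tokens
    mmr_fetch_k: int = 20        # candidate pool before MMR selection
    mmr_lambda: float = 0.5      # 0.3-0.7; lower = more diversity
    top_k: int = 5               # 3-6 chunks per prompt is the sweet spot

cfg = RetrievalConfig()
results = hybrid_retrieve("How do I enable TLS 1.3?", k=cfg.top_k)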
For more on crafting robust prompts and ensuring reliable outputs from LLMs in production, our guide to prompt engineering with LLM APIs covers the generation side of things.
Integration and Next Steps
Let's expose a clean retrieval function for downstream use:
def retrieve_context(query: str, k: int = 6, filters: dict = None) -> list:
    """
    Retrieve top-k context chunks for a given query.

    Args:
        query (str): User query.
        k (int): Number of chunks to return.
        filters (dict, optional): Metadata filters.

    Returns:
        list: List of Document objects.
    """
    results = hybrid_retrieve(query, k=k, meta_filter=filters)
    return [doc for doc, _ in results]

context = retrieve_context("Enable TLS 1.3", k=5)
for doc in context:
    print(doc.page_content[:100])
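From here, plugging into generation is mostly prompt assembly. Here's one way to format the retrieved chunks into a grounded prompt with source citations; the template itself is just an illustration, so shape it however your LLM call expects.

def build_prompt(query, docs):
    """Assemble retrieved chunks into a grounded prompt, citing sources from metadata."""
    context_blocks = []
    for i, doc in enumerate(docs, 1):
        src = doc.metadata.get("source", "unknown")
        context_blocks.append(f"[{i}] (source: {src})\n{doc.page_content}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_prompt("Enable TLS 1.3", retrieve_context("Enable TLS 1.3", k=5))
print(prompt[:400])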
If you want to persist Chroma indexes across sessions, use persist_directory. This is crucial for production systems:
chroma_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="orionx_demo",
    persist_directory="./chroma_db",
)
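On your next session, you can reopen that same collection instead of re-embedding everything. A sketch, assuming the same embedding model and persist directory:

chroma_store = Chroma(
    collection_name="orionx_demo",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)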
And here's a tip. To cache embeddings and avoid recomputation, store the embedding model and vectors locally. Or better yet, use a persistent vector store like Qdrant or Weaviate. Your future self will thank you when you don't have to re-embed everything after a restart.
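One way to do that with the stack we already have is LangChain's CacheBackedEmbeddings wrapped around a LocalFileStore. Treat this as a sketch of the idea rather than the only option; the cache directory and namespace here are my own choices.

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    embeddings, store, namespace=EMBED_MODEL  # key the cache to the model so stale vectors never match
)
# Re-indexing with the cached embedder reuses stored vectors for any text it has already seen.
chroma_store = Chroma.from_documents(chunks, cached_embedder, collection_name="orionx_demo_cached")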