Multi-Document Agent with LlamaIndex: The Ultimate Guide
Build a production-ready multi-document agent with LlamaIndex, turning PDFs into retrieval and summarization tools using semantic selection for accurate answers.
So here's what we're going to build: a multi-document research assistant that can actually answer questions across multiple PDFs and tell you exactly where it found the information. I've been working on this kind of system for a while now, and the key breakthrough was combining semantic vector search for those specific "what did they say about X" questions with hierarchical summarization for the bigger picture stuff. The whole thing uses function calling to figure out which tool to use when.
By the time you're done with this tutorial, you'll have a working notebook that handles cross-document Q&A, gives you proper citations in [file_name p.page_label] format (which honestly took me way too long to get right the first time), and includes some basic tests to make sure everything's working.

Prerequisites:
Python 3.10+
OpenAI API key
2 to 3 sample PDFs (research papers, reports, technical documents, whatever you have)
Expected cost: about $0.10 to $0.50 per summary-heavy query, depending on how big your documents are
Why This Approach Works
Per-Document Tool Isolation
Each PDF gets its own vector and summary tool. Sounds simple, but this is actually crucial. It prevents the system from mixing up information between documents, makes citations precise, and lets the agent figure out which document to look at for any given question. Plus, when something goes wrong (and it will), debugging is so much easier when everything's cleanly separated.
Semantic Tool Retrieval
Here's something I learned the hard way: when you have more than a handful of documents, you can't just throw all the tools at the agent and hope for the best. Instead, we use an object index that embeds tool descriptions and retrieves the top-k relevant ones for each query. Works beautifully even with dozens of documents.
Dual Retrieval Strategy
Vector tools handle the narrow stuff. "What dataset did the authors use?" That kind of thing. Summary tools handle the broad synthesis questions. "Compare the main contributions across papers." The agent picks which one based on what you're asking. I've found this combination covers about 95% of research queries.
Citation Enforcement
Every tool attaches file name and page metadata to its results. Then the system prompt basically hammers into the agent that it needs to cite sources after each claim. You can also post-process responses to clean up the citation format if needed. No more vague "according to the document" nonsense that drives everyone crazy.
How It Works (High-Level Overview)
Load and chunk PDFs – Extract text, split into chunks that respect sentence boundaries, normalize all the metadata for citations.
Build per-document tools – Create both vector and summary tools for each PDF, give them clear descriptions so the agent knows what they do.
Index tools semantically – Embed those tool descriptions in an object index so we can retrieve them dynamically.
Assemble the agent – Use function calling with a pretty strict system prompt to route queries and make sure citations happen.
Validate and iterate – Run test queries, see which tools get selected, adjust retrieval thresholds and temperature until it works right.
Setup & Installation
First things first, let's install everything with pinned versions so we don't run into compatibility issues later:
%pip -q install llama-index llama-index-llms-openai llama-index-embeddings-openai pypdf nest_asyncio python-dotenv numpy pandas "jedi>=0.16"
Now for the OpenAI API key. If you're in Colab, go to Settings, then Secrets, and add OPENAI_API_KEY there. Otherwise, just make a .env file with OPENAI_API_KEY=your_key. Pretty standard stuff.
import os
from dotenv import load_dotenv
load_dotenv()
# Fail early if key is missing
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env or Colab Secrets"
print("API key loaded.")
Let's set up logging and suppress those annoying warnings. We also apply nest_asyncio so the async query engines and agent runs play nicely inside a notebook's event loop:
import logging
import warnings
import nest_asyncio
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
nest_asyncio.apply()
Configure the LLM and embedding model globally. This way we don't have to specify them everywhere:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Use GPT-4o for reliable function calling; fallback to gpt-4o-mini if needed
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
Create a data directory and grab some sample PDFs so you can actually run this thing end-to-end:
import urllib.request
DATA_DIR = "data"
os.makedirs(DATA_DIR, exist_ok=True)
# Example: download public arXiv papers (replace with your own PDFs)
sample_urls = [
    ("https://arxiv.org/pdf/2005.11401.pdf", "paper1.pdf"),  # RAG paper (Lewis et al., 2020)
    ("https://arxiv.org/pdf/2303.08774.pdf", "paper2.pdf"),  # GPT-4 technical report
]
for url, fname in sample_urls:
    fpath = os.path.join(DATA_DIR, fname)
    if not os.path.exists(fpath):
        print(f"Downloading {fname}...")
        urllib.request.urlretrieve(url, fpath)
pdf_files = [f for f in os.listdir(DATA_DIR) if f.lower().endswith(".pdf")]
print(f"Found {len(pdf_files)} PDFs:", pdf_files)
Step-by-Step Implementation
Step 1: Load and Chunk PDFs
Load documents from the data directory. The PDF reader is nice enough to attach page metadata automatically:
from llama_index.core import SimpleDirectoryReader, Document
docs = SimpleDirectoryReader(DATA_DIR, recursive=False).load_data()
print(f"Loaded {len(docs)} documents")
Now we split documents into chunks. But here's the thing about chunking: you want to respect sentence boundaries. I learned this after watching my early systems split sentences in half and completely lose the meaning. Sentence-aware splitting gives the vector index much better semantic units to work with. Makes a huge difference with dense technical writing. Actually, if you want to dive deeper into retrieval optimization, I wrote up some retrieval tricks to boost answer accuracy that might help.
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs, show_progress=True)
print(f"Total chunks: {len(nodes)}")
Normalize metadata for citations. Every node needs file_name and page_label, otherwise your citations will be a mess:
for n in nodes:
    meta = n.metadata or {}
    if "file_name" not in meta:
        file_path = meta.get("file_path", meta.get("source", "unknown"))
        meta["file_name"] = os.path.basename(file_path) if isinstance(file_path, str) else "unknown"
    if "page_label" not in meta:
        meta["page_label"] = str(meta.get("page_number", "N/A"))
    n.metadata = meta
print("Sample chunk metadata:", nodes[0].metadata)
print("Sample chunk text:", nodes[0].text[:300], "...")
Group nodes by document so we can create per-document tools:
from collections import defaultdict
nodes_by_file = defaultdict(list)
for n in nodes:
    nodes_by_file[n.metadata["file_name"]].append(n)
print({k: len(v) for k, v in nodes_by_file.items()})
Step 2: Build Per-Document Vector Tools
Create a vector index for each document. This handles precise passage retrieval:
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
vector_tools = {}
for fname, doc_nodes in nodes_by_file.items():
    v_index = VectorStoreIndex(doc_nodes, show_progress=True)
    v_engine = v_index.as_query_engine(similarity_top_k=5)
    v_tool = QueryEngineTool.from_defaults(
        name=f"vector_{fname.replace('.', '_')}",
        query_engine=v_engine,
        description=(
            f"Semantic vector search for {fname}. "
            "Use for targeted, specific questions that require exact passages and citations."
        ),
    )
    vector_tools[fname] = v_tool
print(f"Vector tools created: {len(vector_tools)}")
Always test your tools. Seriously, always:
sample_file = next(iter(vector_tools.keys()))
resp = vector_tools[sample_file].query_engine.query("What problem does this paper address?")
print(resp)
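While you're at it, peek at the source nodes behind that answer; they carry the file_name and page_label metadata the citations depend on. Here's a quick sketch (it assumes the response exposes source_nodes, which LlamaIndex query engine responses normally do):
# Show where each retrieved chunk came from, in the same [file_name p.page_label] shape we want in answers
for sn in resp.source_nodes:
    fname = sn.node.metadata.get("file_name", "unknown")
    page = sn.node.metadata.get("page_label", "N/A")
    print(f"[{fname} p.{page}] score={sn.score}")
    print(sn.node.text[:150].replace("\n", " "), "...")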
Step 3: Build Per-Document Summary Tools
Create a summary index for each document. This is what handles the high-level synthesis questions:
from llama_index.core import SummaryIndex
summary_tools = {}
for fname, doc_nodes in nodes_by_file.items():
    s_index = SummaryIndex(doc_nodes)
    s_engine = s_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )
    s_tool = QueryEngineTool.from_defaults(
        name=f"summary_{fname.replace('.', '_')}",
        query_engine=s_engine,
        description=(
            f"Hierarchical summarization for {fname}. "
            "Use for overviews, key contributions, limitations, and document-wide synthesis."
        ),
    )
    summary_tools[fname] = s_tool
print(f"Summary tools created: {len(summary_tools)}")
Test the summary tool too:
sample_file = next(iter(summary_tools.keys()))
resp = summary_tools[sample_file].query_engine.query("Provide a 5-bullet executive summary.")
print(resp)
Step 4: Index Tools Semantically
Build an object index over all the tools. This is the clever bit. It embeds tool descriptions and retrieves the most relevant ones for each query:
from llama_index.core.objects import ObjectIndex
all_tools = list(vector_tools.values()) + list(summary_tools.values())
obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
    show_progress=True,
)
tool_retriever = obj_index.as_retriever(similarity_top_k=3)
Check which tools get retrieved for different queries. This debugging step has saved me hours of head-scratching:
def inspect_tools(query: str):
    retrieved_results = tool_retriever.retrieve(query)
    print(f"Query: {query}")
    for i, res in enumerate(retrieved_results, 1):
        # The object retriever may return NodeWithScore wrappers or the tools themselves
        if hasattr(res, "node") and hasattr(res.node, "obj"):
            tool_obj = res.node.obj
        else:
            tool_obj = res
        tool_name = getattr(getattr(tool_obj, "metadata", None), "name", None)
        if tool_name:
            name_parts = tool_name.split("_", 1)
            tool_type = name_parts[0]
            file_name = name_parts[1] if len(name_parts) > 1 else "unknown"
            print(f"#{i} -> {tool_type} | {file_name} | {tool_name}")
        else:
            print(f"#{i} -> could not determine tool name for {tool_obj!r}")
    print("-" * 30)
# Test calls for inspect_tools
inspect_tools("Provide an executive summary across all documents.")
inspect_tools("Which sections discuss model architecture details?")
Step 5: Assemble the Agent
Time to create the actual agent with function calling and a system prompt that's pretty insistent about citations. I've tried a bunch of frameworks for this. LangChain and CrewAI are solid, but LlamaIndex just clicks for document workflows. It has all the indexing, retrieval, and summarization stuff built in. If you're curious about the fundamentals of how agents work, I put together a tutorial on building an LLM agent from scratch with GPT-4 ReAct that breaks it all down.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core import Settings
SYSTEM_PROMPT = """You are a multi-document research assistant.
- Use only the provided tools.
- Prefer vector tools for specific, narrow questions.
- Prefer summary tools for high-level synthesis.
- Always cite sources as [file_name p.page_label] after each relevant sentence.
- If you cannot find relevant evidence, say so explicitly."""
# Passing the full tool list works fine for two or three PDFs; for larger corpora, see the tool-retriever sketch after this block.
agent = FunctionAgent(
    tools=all_tools,          # the vector + summary tools built above
    llm=Settings.llm,         # the globally configured LLM
    system_prompt=SYSTEM_PROMPT,
    verbose=True,
)
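With just two or three PDFs, handing the agent every tool is fine. Once you're at dozens of documents, the tool_retriever from Step 4 earns its keep: narrow the candidate tools per query, then build the agent over that subset. Here's a minimal sketch of that idea; run_with_retrieved_tools is my own helper, not a LlamaIndex API, and it unwraps defensively because the object retriever can hand back either the tools themselves or NodeWithScore wrappers:
import asyncio

async def run_with_retrieved_tools(query: str):
    # Semantic match on tool descriptions, same retriever as in Step 4
    retrieved = tool_retriever.retrieve(query)
    tools = [r.node.obj if hasattr(r, "node") and hasattr(r.node, "obj") else r for r in retrieved]
    # Build a per-query agent over the narrowed tool set
    scoped_agent = FunctionAgent(
        tools=tools,
        llm=Settings.llm,
        system_prompt=SYSTEM_PROMPT,
    )
    return await scoped_agent.run(query)

# Usage: response = asyncio.run(run_with_retrieved_tools("Summarize the limitations discussed in paper2.pdf"))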
Run and Validate
Let's run a cross-document query and see if the agent actually synthesizes answers with proper citations:
import asyncio

async def run_cross_doc_query():
    return await agent.run(
        "Compare the main challenges and proposed collaboration mechanisms across the papers."
    )

response = asyncio.run(run_cross_doc_query())
print(str(response))
Run a bunch of test queries to really put the agent through its paces. This is where you find out if everything's actually working together:
import asyncio
tests = [
    "List the datasets used by each paper and compare evaluation metrics.",
    "Provide a high-level summary of the main contributions across documents.",
    "According to the authors, what are the primary limitations?",
]

async def main():
    for q in tests:
        print("\nQ:", q)
        resp = await agent.run(q)
        print("A:", str(resp))

asyncio.run(main())
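Since the intro promised some basic tests, here's a tiny one: check that each answer actually contains at least one citation in the [file_name p.page_label] shape. The regex is my assumption about how the model formats citations, so treat a miss as a prompt-tuning signal rather than a hard failure:
import re

# Matches citations like [paper1.pdf p.3]; adjust if your files aren't PDFs
CITATION_PATTERN = re.compile(r"\[[^\[\]]+\.pdf\s+p\.[^\[\]]+\]")

def has_citation(answer: str) -> bool:
    # True if the answer contains at least one [file_name p.page_label] citation
    return bool(CITATION_PATTERN.search(answer))
Drop has_citation(str(resp)) into the test loop above and print or assert on the result.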
Conclusion
And there you have it. You've built a multi-document research assistant that routes queries to the right tool, pulls out precise passages, and actually tells you where it got the information from.
The key decisions that make this work: keeping tools separate per document for clean attribution, using semantic retrieval to scale beyond a handful of documents, and having two retrieval modes. Vector for the specific stuff, summary for the big picture.
If you want to take this to production, here's what I'd do next:
Persist indices – Save those vector and summary indices to disk or something like pgvector or Pinecone (there's a quick sketch right after this list). Re-embedding everything on every run gets expensive fast. Learned that one the hard way.
Add retries and rate limits – Wrap your LLM calls with exponential backoff and timeouts. Things fail. Better to handle it gracefully than crash.
Implement structured logging – Use LlamaIndex callbacks or a proper logging framework to track tool calls, latency, token usage. Future you will thank present you when something weird happens.
Cache answers – Use an LRU cache or Redis for repeated queries. Actually, I wrote up how to implement semantic cache with Redis Vector if you want to get fancy with it and save some serious API costs.
Post-process citations – Extract source_nodes from responses and format citations programmatically. Relying purely on prompts for formatting only gets you so far. Sometimes you need to just fix it in post.
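For the persistence point, here's a minimal sketch using LlamaIndex's local storage context. The build_or_load_vector_index helper is mine, not a library API, and the storage directory is a placeholder; at larger scale you'd swap in a vector store integration like pgvector or Pinecone:
import os
from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

PERSIST_ROOT = "storage"  # placeholder directory

def build_or_load_vector_index(fname, doc_nodes):
    # Reuse a persisted per-document index if it exists, otherwise build and persist it
    persist_dir = os.path.join(PERSIST_ROOT, fname)
    if os.path.isdir(persist_dir):
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)
    index = VectorStoreIndex(doc_nodes)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
Swap this in for the direct VectorStoreIndex(doc_nodes, ...) call inside the Step 2 loop and re-runs stop re-embedding everything.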
That's pretty much it. You now have a working multi-document research assistant that knows where its information comes from. Pretty handy for any serious document work where you need to back up your claims.