Agentic Search for Dummies
Claude Code is a testament to the unreasonable effectiveness of just giving models tools, and letting them figure out the rest. In this post, I'll lay out how I approach search in a world where models are smart enough to translate a task or user request into a series of queries to a search index. I claim that full-text search with offline augmentation is a simple and effective way to let a model search a corpus of documents. I do not claim this is state-of-the-art, and you won't find any evaluation numbers in this post. Rather, I'm demonstrating one way to set up agentic search that I like, and that works well enough to be a decent baseline.
I am assuming you already know a few things about search, and what an "agent" is. If you're confused, I recommend checking out Thorsten Ball's post, How to Build an Agent which shows you how to start with a language model and turn it into a simple coding agent in a few hundred lines of code.
Overview
A "search agent" (as used here) takes a request or query from the user or application, and returns an answer or result for the request, using search as necessary to gather context to complete the task. It is composed of a few basic parts, which I will cover one by one.
- A corpus of documents, used to construct a search index
- A search API (no AI) which takes queries in, hits the index, and returns a results page
- A set of "tools" that allow a language model to use the search API
- The AI model which is given the task and tools and instructed to provide an answer
Part 1 – The Corpus
Search is useful because it allows an AI to dynamically gather information based on the task, without you needing to know in advance exactly what information is needed to complete the task. Instead of providing exactly the information that is needed, you provide a much larger set of information (the corpus), and part of the AI's job is to figure out which parts of the corpus to look at to complete the task. So, a corpus is usually a really big set of documents (hundreds, thousands, or millions).
Preparing a clean corpus is very important, because it is both a) the text that will be searched with your search engine, and b) the text that will be read by the model. You will want to be quite sure you've removed random SVGs, base64-encoded data, navigation bars, and the like, leaving a set of nicely-formatted documents for your model to read.
Multimodal LLMs like Gemini, Claude, GPT-4.1 are good enough at OCR now that you can send them PDFs and get reasonable results, but for doing OCR on PDFs at scale, I like DataLab's tool, Marker, which you can run yourself on an H100 for free (a license is required if you are using it for commercial purposes and your startup has raised a lot of money). DOCX files can be ingested with python-docx in a manner that preserves numbering, which is helpful. HTML can be converted to Markdown, and I typically post-process webpages with an LLM to remove boilerplate.
Finally, a key trick that I think is important (again, no data to back this up, sorry!) is offline document augmentation. This means creating a set of keywords and a summary for each document using a language model, and it serves two purposes: a) It makes it easier for the AI to search for the document, as a query may miss the document text, but hit one of the keywords we generated offline; b) It provides more information to show in a "preview" of the document on the search results page (more on that later). Augmentation is performed with a prompt like this:
prompt = (
    f"Given a section from the {document_name}, provide the following metadata as JSON, with "
    "`keywords` and `description` keys as follows:\n\n"
    " - `keywords` list[str]: A list of as many keywords/keyphrases as you can think of that someone "
    "might search for where the section would be relevant. The keywords/phrases do NOT have to occur "
    "in the section, they can be semantic matches, synonyms, etc. "
    "However, they should be specific to the section, not "
    f"keywords that would apply to literally any section of the {document_name}.\n"
    " - `description` str: A summary/overview of what the section says. "
    "Be mindful of your tendency to make overlong summaries, and remember that the goal is to provide a SHORT "
    "overview of the section. A long summary is pointless because you may as well just read the original.\n\n"
    f"Here is the {document_name} section:\n\n```\n{section}\n```"
    "\n\nNow provide your JSON response, no prelude or commentary needed."
)
The documents, along with resulting keywords and summaries, all get saved into a JSONL which can be loaded into a search index on demand.
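A minimal sketch of that augmentation pass, assuming a call_llm helper that stands in for whatever model API you use, a build_augmentation_prompt function wrapping the prompt above, and sections given as (document_name, text) pairs (the JSONL field names here are my choice, not a fixed schema):

import json

with open("corpus.jsonl", "w") as f:
    for doc_id, (document_name, section) in enumerate(sections):
        prompt = build_augmentation_prompt(document_name, section)  # the prompt shown above
        metadata = json.loads(call_llm(prompt))  # expects {"keywords": [...], "description": "..."}
        f.write(json.dumps({
            "id": doc_id,
            "text": section,
            "keywords": " ".join(metadata["keywords"]),  # flattened into one searchable string
            "summary": metadata["description"],
        }) + "\n")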
Part 2 – The Search Index
Once documents are prepared, they need to be indexed. A search index is just a special way of storing documents that makes it fast to search for them. For example, an inverted index stores a mapping from keyword to document. That way, when you search for a keyword, you don't have to scan every document for the keyword, which would be linear in the number of documents. Instead, you can hop straight to that keyword in the mapping, and grab all the associated documents, which takes constant time.
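As a toy illustration of the idea (not how any real engine stores things internally):

# A toy inverted index: keyword -> set of document ids.
inverted_index = {
    "pizza":   {0, 7, 42},
    "probate": {3, 15},
    "trust":   {3, 15, 42},
}

# Looking up a keyword is a dict access (roughly constant time),
# instead of scanning every document for the word.
matches = inverted_index.get("pizza", set())  # {0, 7, 42}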
For our search index, we use Tantivy. It's fast, open source, implements good algorithms, and has Python bindings that are easy to install. We use the text, keywords, and summary generated in the previous part as the searchable columns to build a Tantivy index, which takes just seconds for hundreds to thousands of documents. We wrap the Tantivy index with a simple API that implements two operations: search and read.
def search(queries: list[str], limit: int = 10) -> list[SearchResult]:
    pass

def read(document_ids: list[int]) -> list[Document]:
    pass
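The SearchResult and Document types aren't spelled out here; one plausible shape, where a result carries the offline-generated summary and keywords as its "preview", looks like this (an assumption, not a fixed schema):

from dataclasses import dataclass

@dataclass
class SearchResult:
    document_id: int
    title: str
    summary: str          # the offline-generated summary, shown as a preview
    keywords: list[str]   # the offline-generated keywords
    score: float          # fused relevance score across queries

@dataclass
class Document:
    document_id: int
    title: str
    text: str             # the full, cleaned document text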
This API separates searching from reading. This is important because if you forced an AI to read every document it searched for, the context window would rapidly fill up with irrelevant or repeated documents. Instead, for each search, we get back a list of search results (like a Google Search page), and provide those to the AI. The AI can then decide which, if any, of those documents it wants to actually "click on" and read.
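For completeness, here is roughly what building the underlying index looks like with the tantivy Python bindings (a sketch based on tantivy-py's documented API; the exact field names, schema options, and example query are my assumptions):

import json
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_integer_field("doc_id", stored=True)
schema_builder.add_text_field("text", stored=True)
schema_builder.add_text_field("keywords", stored=True)
schema_builder.add_text_field("summary", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
writer = index.writer()
with open("corpus.jsonl") as f:
    for line in f:
        record = json.loads(line)
        writer.add_document(tantivy.Document(
            doc_id=record["id"],
            text=record["text"],
            keywords=record["keywords"],
            summary=record["summary"],
        ))
writer.commit()
index.reload()

# A single-query search against all three searchable columns:
searcher = index.searcher()
query = index.parse_query("community property", ["text", "keywords", "summary"])
hits = searcher.search(query, 10).hits  # list of (score, doc_address) pairs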
You will also notice that queries is a list. Our search index allows for multiple queries, and fuses the results. This means that you can provide 5, 10, or 100 queries, but you still only get back one search results page. Each query is run separately, and then an algorithm like Reciprocal Rank Fusion is used to "fuse" all 5, 10, or 100 pages into 1 search results page, based on how highly each document ranks across all the queries. Allowing the model to "spray and pray" queries like this can improve recall, as it increases the chances that at least 1 query will include the keyword needed to hit the desired document.
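Reciprocal Rank Fusion itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every query where it appears, with k commonly set to 60. A sketch, where run_single_query is a stand-in for a per-query search against the index:

from collections import defaultdict

def fused_search(queries: list[str], limit: int = 10, k: int = 60) -> list[int]:
    """Run each query separately, then fuse the per-query rankings with RRF."""
    scores: dict[int, float] = defaultdict(float)
    for query in queries:
        ranked_doc_ids = run_single_query(query, limit=50)  # best-first doc ids for one query
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # One fused results page, regardless of how many queries were sent.
    return sorted(scores, key=scores.get, reverse=True)[:limit]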
Part 3 – The Tools
AI models can write Python, so they could already use this search index, but executing arbitrary code generated by a language model is considered harmful, so typically, the abstraction of "tools" is used, which allows the AI to call very specific functions, and nothing else. In our case, the search and read functions above are each translated into a form the AI understands, and passed along with the request to the model.
One way to do this is using tools, an abstraction where each function has a JSON schema explaining how to use it. This list of tools is passed directly to the model. Another way is to use an MCP server, which seems to be all the rage now. An MCP (Model Context Protocol) server is like an API server, but specifically designed for AIs to use tools. OpenAI, Anthropic, and Google all support MCP integrations to some degree or another. We use FastMCP to turn the tools into a server.
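With FastMCP, exposing the two functions as tools is mostly decorators. A minimal sketch, where the server name and docstrings are mine (the docstrings double as the tool descriptions the model sees), and search_index and read_documents stand in for the search and read functions from Part 2:

from fastmcp import FastMCP

mcp = FastMCP("document-search")

@mcp.tool()
def search(queries: list[str], limit: int = 10) -> list[dict]:
    """Run one or more full-text queries and return a fused page of results
    (id, summary, keywords) without loading full document text."""
    return search_index(queries, limit)  # wraps the Tantivy-backed search() above

@mcp.tool()
def read(document_ids: list[int]) -> list[dict]:
    """Fetch the full text of specific documents chosen from the search results."""
    return read_documents(document_ids)

if __name__ == "__main__":
    mcp.run()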
When using MCP, you just have to provide information about the server. I am still not completely convinced that this extra abstraction is worth it. MCP deployment is a bit fragile and unintuitive, and the provider integrations can be flaky (missing a trailing slash can break the server!).
One note on multiple indexes: If you need your search agent to be able to consult a lot of sources, I find it helpful to serve each source as a separate index. For example, an estate-planning assistant should be able to search both the probate code and the Internal Revenue Code. You could combine the two into one corpus, but that leads to undesirable results: probate documents appearing in searches for IRC documents, and vice versa. Instead, you can keep the search indexes separate, and give the model separate tools for each. Let the AI decide which corpus it wants to search—it's usually smart enough to figure it out, if your tool descriptions are sufficiently detailed.
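For example, with plain JSON-schema tools (Anthropic-style format; the names and descriptions are illustrative, and the read tools are omitted), the two indexes might be exposed like this:

tools = [
    {
        "name": "search_probate_code",
        "description": "Full-text search over the probate code. Use for questions about wills, trusts, intestate succession, and probate procedure.",
        "input_schema": {
            "type": "object",
            "properties": {
                "queries": {"type": "array", "items": {"type": "string"}},
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["queries"],
        },
    },
    {
        "name": "search_irc",
        "description": "Full-text search over the Internal Revenue Code. Use for questions about estate, gift, and income tax.",
        "input_schema": {
            "type": "object",
            "properties": {
                "queries": {"type": "array", "items": {"type": "string"}},
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["queries"],
        },
    },
]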
Part 4 – The Search Agent
A corpus, search index, and tools are all you need to build a search agent. What remains is telling the AI what task it needs to complete, providing the tools to the model, and calling the model in a loop. Each search query yields a results page, and each request to read documents puts those documents directly into the model's context. Searching and reading are done in a loop until the model decides it has read enough, and produces a final answer without any tool uses.
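Concretely, if you're passing tools yourself, the loop looks roughly like this with the Anthropic SDK (a sketch; the model id is a placeholder, tools is a list of JSON-schema tool definitions like the ones sketched above, and run_tool is a hypothetical dispatcher that calls search() or read() and returns JSON-serializable results):

import json
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": task}]  # the user's request

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whatever model you like
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # no tool calls left: the final text blocks are the answer

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = run_tool(block.name, block.input)  # dispatch to search()/read()
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),
            })
    messages.append({"role": "user", "content": tool_results})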
When using an MCP server, the model provider will handle this loop for you. If you're passing tools, then you have to send each tool result back to the model yourself. Either way, this loop of searching and reading, finally resulting in an answer, is a search agent! We're done.
Why doesn't your search agent use embeddings?
There's nothing wrong with dense embeddings. A truly state-of-the-art search system probably should use them. But they have serious weaknesses, and introduce overhead. Embeddings grow worse with long documents, and many practical tasks (including those we are interested in) require long documents. This means using embeddings requires either eating a performance penalty, or splitting documents into very small pieces ("chunking"). Very long documents have to be chunked anyway, but limiting chunks to 500-1000 words is oppressive—it makes it harder to carve a document at the "joints".
Combining full-text search with embeddings also adds more overhead for ranking fusion. Now you're not just fusing across different queries, you're fusing across two search modalities (keyword and neural search), and the weight placed on each needs to be tuned. Using dense embeddings also requires LLM inference, which is CPU-intensive and probably shouldn't run on your webserver. Neither should Tantivy, honestly, but transformer inference is almost certainly worse. Overall, it requires a more complex architecture.
The argument for embeddings is that they allow documents to match even when the exact keywords don't, if the documents are semantically similar. I argue this is less important for agentic search. If a human is searching and wants to get a good result right away with a badly written query, full-text search alone is not going to cut it. But when an agent is searching:
- It can write lots of queries at once (we leverage this with multi-query searches)
- It can write longer, more exhaustive queries (covering more synonyms and keywords)
- It can try again when its initial query doesn't work
If anything, full-text search is a more powerful primitive for AI models than a black-box search engine, because the relationship between queries and results is predictable, rather than vibe-based. I know if I put the word "pizza" into my query, I will get documents with the word pizza in them. In contrast, a black-box search API is something the model will have to learn to use on the fly ("Do I put pepperoni or Domino's? Which is closest to what I want in the vibe-space?")
Conclusion
If you have comments or questions, find me on Twitter (@andersonbcdefg). Normally this part is where I'd tell you to join us if this kind of stuff interests you, or to use our product, but we aren't hiring and we aren't trying to sell our product to developers, so... thank you for reading!