Turn any document collection into a reasoning agent without vectors

Turn Your Docs into a Reasoning Agent, No Vectors Needed

We’ve all been there. You’ve got a pile of documentation, a knowledge base, or a collection of text files, and you want to build something that can intelligently reason over it. The usual playbook involves vector embeddings, setting up a vector database, and managing semantic search. But what if you could skip all that and still get an agent that can understand and answer questions based solely on your raw text?

That’s the idea behind PageIndex. It’s a project that turns any collection of documents into a reasoning agent without relying on vector embeddings or semantic search. Instead, it uses a clever combination of traditional search and large language models to find, process, and reason about your content.

What It Does

PageIndex is a Python-based tool that ingests your documents (like markdown files, PDFs, or plain text), builds a straightforward keyword-based index, and then uses an LLM to answer queries. The key twist? It doesn't use vector similarity to find relevant text. Instead, it performs a traditional keyword search to retrieve candidate sections, then hands those sections to an LLM (like GPT-4 or an open-source model) with instructions to reason strictly from the provided text.

The workflow is simple: you point it at a folder of documents and it builds an index. When you ask a question, it finds text snippets containing relevant keywords, passes them to the LLM as context, and the LLM formulates an answer based only on that context. This creates a kind of "reasoning agent" that's grounded in your documents.
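The retrieval half of that workflow can be sketched in a few lines of plain Python. This is an illustrative toy, not code from the PageIndex repo; the function names and the scoring (count of query-word hits per document) are assumptions made for the example:

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def retrieve(index, docs, query, top_k=2):
    """Rank documents by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in re.findall(r"\w+", query.lower()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[d] for d in ranked[:top_k]]

# Two tiny stand-in documents:
docs = {
    "install.md": "Install the package with pip and set your API key.",
    "query.md": "Run the query command to ask questions about your docs.",
}
index = build_index(docs)
print(retrieve(index, docs, "how do I install with pip"))
# → ['Install the package with pip and set your API key.']
```

The snippets this step returns are what get handed to the LLM as context; a real implementation would also split documents into sections and weight rarer terms more heavily.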

Why It’s Cool

The obvious win here is simplicity. You don’t need to set up a vector database or worry about embedding models. It works with flat files and a basic index, which means it’s fast to get running and easy to understand. The entire search-and-reason process is transparent—you can see exactly which document sections were retrieved and how the LLM used them.

But the more interesting aspect is the approach. By decoupling the retrieval step from the reasoning step, PageIndex makes a clear statement: sometimes, good old keyword search is enough to find the right context, and the LLM’s job is to synthesize that context intelligently. This can be more predictable and controllable than vector search, especially for domain-specific or structured documentation where exact terminology matters.
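The reasoning half of that decoupling usually comes down to how the prompt is framed: the retrieved snippets become the only material the model is allowed to use. A minimal sketch of such a grounding prompt — my own wording, not PageIndex's actual prompt template:

```python
def build_grounded_prompt(question, snippets):
    """Assemble a prompt that restricts the model to the retrieved snippets."""
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_grounded_prompt("How do I install it?", ["Install with pip.", "Set your API key."]))
```

Because the retrieved sections are numbered, the model can also be asked to cite which snippet supports each claim, which keeps the whole pipeline auditable.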

It’s also flexible. You can run it locally with an open-source LLM via Ollama or use an API like OpenAI. The code is straightforward Python, so you can adapt it to your own document formats or tweak the prompting logic.
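That backend flexibility works because Ollama exposes an OpenAI-compatible chat-completions endpoint, so switching between a hosted API and a local model can be as small as changing a base URL. A hypothetical stdlib-only sketch (the `chat` helper is mine, not from the repo):

```python
import json
import urllib.request

def chat(prompt, base_url="https://api.openai.com/v1",
         api_key=None, model="gpt-4o-mini"):
    """POST a chat completion request and return the model's reply text.

    For a local Ollama instance, pass base_url="http://localhost:11434/v1"
    and any model you have pulled (e.g. "llama3"); no real key is needed.
    """
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key or 'ollama'}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same function then serves both setups; only `base_url`, `model`, and the key change.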

How to Try It

Getting started is pretty simple. Clone the repo and install the dependencies:

git clone https://github.com/VectifyAI/PageIndex
cd PageIndex
pip install -r requirements.txt

You’ll need an OpenAI API key or a local Ollama instance. Set your key as an environment variable:

export OPENAI_API_KEY='your-key-here'

Then, add your documents to a folder (like docs/) and run the indexing:

python page_index.py index --input-dir ./docs

Once indexed, you can start querying:

python page_index.py query "Your question here"

The project README has more details on configuration and local model setup. Since it’s early-stage, be prepared to poke around the code a bit if you want to customize things.

Final Thoughts

PageIndex feels like a refreshing back-to-basics experiment. In a world where every retrieval project seems to default to vector search, it’s a good reminder that simpler methods can still work well, especially when paired with a capable LLM. It won’t replace vector-based systems for every use case—semantic understanding across varied phrasing is still a strength of embeddings—but for many documentation or knowledge-base scenarios, this approach is surprisingly effective.

If you’ve been wanting to add a Q&A layer to your docs without the overhead of a vector pipeline, give PageIndex a look. It’s a clean, hackable starting point that gets you 80% of the way there with 20% of the complexity.


Follow for more interesting projects: @githubprojects

Project ID: e64bcd99-cd39-4ab2-b926-d410bcdf9d01
Last updated: January 30, 2026 at 11:10 AM