FORetrieval

FORetrieval is a multimodal document retrieval library built on top of colpali-engine. It indexes document pages as images using late-interaction models (ColPali, ColQwen2, ColQwen2.5) and retrieves the most relevant pages for a given query. It is used by FORag as its retrieval backend.

Key features:

Four storage backends — local (legacy ColPali .pt files), qdrant (default, embedded), milvus (Milvus Lite), and remote (HTTP-delegated vector-DB server)
Remote embedding server — offload all embedding computation to a remote vLLM GPU server; the local machine needs no GPU
Metadata generation — filesystem metadata always; AI-generated tags, language detection, and short descriptions optionally
Metadata filtering — filter the retrieval pool by ext, mtime, language, tags, document_type, or arbitrary regex patterns before scoring
Docling ingestion — optional semantic PDF chunking using Docling, producing image chunks aligned with document structure
Heatmap and circle visualisation — relevance overlays for retrieved pages

Installation

uv sync

# Optional extras:
uv sync --extra qdrant          # Qdrant storage backend (recommended for large indexes)
uv sync --extra docling         # Docling-based PDF chunking
uv sync --extra embedding_server  # Remote vLLM embedding server (adds paramiko for auto-deploy)
uv sync --extra quantization    # 4-bit / 8-bit local model quantization (adds bitsandbytes)

Releases

FORetrieval uses CalVer (YYYY.MM.MICRO). Releases are published on GitHub only (no PyPI) and inherit the visibility of this private repository.
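
To illustrate the YYYY.MM.MICRO scheme, the sketch below parses a release tag such as v2026.5.0 into a tuple that can be compared. The helper parse_calver is hypothetical and not part of FORetrieval; it assumes tags carry an optional leading v and an unpadded month.

```python
import re
from typing import NamedTuple


class CalVer(NamedTuple):
    year: int
    month: int
    micro: int


# Hypothetical helper for illustration only; not a FORetrieval API.
_TAG = re.compile(r"^v?(\d{4})\.(\d{1,2})\.(\d+)$")


def parse_calver(tag: str) -> CalVer:
    """Parse a tag like 'v2026.5.0' into (year, month, micro)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a YYYY.MM.MICRO tag: {tag!r}")
    year, month, micro = (int(part) for part in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {tag!r}")
    return CalVer(year, month, micro)


# Tuples compare field by field, so the newest release is simply the maximum.
tags = ["v2026.5.0", "v2025.12.3", "v2026.5.1"]
latest = max(tags, key=parse_calver)
```

Comparing parsed tuples rather than raw strings avoids the usual pitfall where "2025.12" sorts before "2025.5" lexicographically.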

Install a specific release directly from a git tag (requires SSH access to the repo):

uv pip install "foretrieval @ git+ssh://git@github.com/FOR-sight-ai/FORetrieval.git@v2026.5.0"

# With extras:
uv pip install "foretrieval[qdrant,vector_db_server] @ git+ssh://git@github.com/FOR-sight-ai/FORetrieval.git@v2026.5.0"

Or download the .whl / .tar.gz attached to a release on the Releases page and install it with uv pip install <file>.

Pre-requisites

Poppler

Required by pdf2image for PDF-to-image conversion:

Debian / Ubuntu

sudo apt-get install -y poppler-utils

Flash-Attention (optional)

Speeds up ColQwen2 / Gemma-based models significantly:

uv pip install flash-attn

Hardware

ColPali-family models have several billion parameters, so a GPU is strongly recommended for indexing and search. Older or lower-end GPUs work as long as they offer compute capability 7.0 or newer (sm_70+); CPU is supported but slow.

Quick usage

from foretrieval import MultiModalRetrieverModel

# Index a folder of PDFs
model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    index_root="my_indexes",
    storage_backend="qdrant",   # Qdrant backend (default)
)
model.index(
    input_path="path/to/docs/",
    index_name="my_index",
    store_collection_with_index=True,
)
# Indexing is recursive: all files in subdirectories are also indexed.
# Use update_index_from_folder() to add only new files to an existing index,
# also recursing into subdirectories.

# Load an existing index and search
model = MultiModalRetrieverModel.from_index(
    index_path="my_index",
    index_root="my_indexes",
)
results = model.search("maximum output current", k=3)
for r in results:
    print(r.doc_id, r.page_num, r.score)

Storage backends

FORetrieval supports four backends for storing and searching embeddings:

Backend	`storage_backend` value	Dependency	Scoring	Typical use
Local	`"local"`	none	Exact MAX_SIM (in-RAM)	Development, small corpora
Qdrant (default)	`"qdrant"`	`foretrieval[qdrant]`	Exact MAX_SIM (native)	Large local indexes, best accuracy
Milvus	`"milvus"`	`foretrieval[milvus]`	Approximate (mean-pool ANN + late-interaction rerank)	Milvus ecosystem
Remote	`"remote"`	httpx (core dependency)	Delegated to server	GPU/network-separated deployments

The backend is fixed when an index is first created. It cannot be changed without recreating the index.
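
The Scoring column distinguishes exact MAX_SIM from an approximate "mean-pool, then rerank" strategy. The sketch below shows the general idea of both on toy 2-dimensional embeddings using plain Python lists. It illustrates the technique only and is not the code any of the backends run; the function names are hypothetical, and candidate_limit is borrowed from the Milvus option mentioned in the VectorDBServerConfig reference.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def max_sim(query: List[Vector], page: List[Vector]) -> float:
    """Late-interaction score: for each query token keep the best-matching
    page patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page) for q in query)


def mean_pool(vectors: List[Vector]) -> List[float]:
    """Collapse a multi-vector embedding into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def approximate_search(
    query: List[Vector], pages: Dict[str, List[Vector]], candidate_limit: int, k: int
) -> List[str]:
    """Shortlist pages by pooled similarity, then rerank the shortlist with max_sim."""
    pooled_query = mean_pool(query)
    shortlist = sorted(
        pages, key=lambda name: dot(pooled_query, mean_pool(pages[name])), reverse=True
    )[:candidate_limit]
    return sorted(shortlist, key=lambda name: max_sim(query, pages[name]), reverse=True)[:k]


# Toy data: two query tokens and three pages with one or two patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first token
    "c": [[0.4, 0.4]],              # weak match on both tokens
}
exact = sorted(pages, key=lambda name: max_sim(query, pages[name]), reverse=True)
approx = approximate_search(query, pages, candidate_limit=2, k=1)
```

The approximate variant only ever reranks candidate_limit pages, so a page that pools poorly can be missed even if its exact score would be high; that is the accuracy trade-off the table refers to.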

# Local (on-disk .pt files)
model = MultiModalRetrieverModel.from_pretrained(..., storage_backend="local")

# Qdrant (embedded, on-disk)
model = MultiModalRetrieverModel.from_pretrained(..., storage_backend="qdrant")

# Milvus Lite (file-based)
model = MultiModalRetrieverModel.from_pretrained(..., storage_backend="milvus")

# Remote server (server holds collections; local machine stays stateless)
from foretrieval.vector_db_server import VectorDBServerConfig

model = MultiModalRetrieverModel.from_pretrained(
    "athrael-soju/colqwen3.5-4.5B-v3",
    storage_backend="remote",
    storage_config={
        "url": "http://gpu-server:18000",
        "backend": "qdrant",   # server-side backend
    },
)

# Load existing index — backend and server URL auto-read from index_config.json.gz
# api_key must be re-supplied at load time (it is never persisted to disk)
model = MultiModalRetrieverModel.from_index(
    "my_index",
    index_root=".",
    storage_config={"api_key": "my-secret"},
)

Note: The deprecated storage_qdrant=True/False flag still works (maps to storage_backend="qdrant"/"local") but will be removed in a future release.

Metadata generation

Metadata can be attached to each document at indexing time. Two levels are available:

Filesystem metadata (no AI required): always populated from the file itself.

Field	Source
`stem`, `ext`, `mime`	filename and MIME type
`mtime`	file modification time (ISO-8601 UTC)
`page_count`	number of pages (PDFs only)
`author`, `title`	embedded PDF metadata (may be absent)
`image_width`, `image_height`	dimensions (images only)
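
As a rough illustration of where the first two rows come from, the standard library alone is enough to derive stem, ext, mime, and mtime. The function below is a hypothetical sketch, not FORetrieval's implementation; the exact timestamp formatting (for example a Z suffix versus +00:00) is an assumption.

```python
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional


def filesystem_metadata(path: Path) -> Dict[str, Optional[str]]:
    """Collect the fields that need only the file itself (no PDF or image parsing)."""
    mime, _encoding = mimetypes.guess_type(path.name)
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "stem": path.stem,
        "ext": path.suffix,
        "mime": mime,                # None when the extension is unknown
        "mtime": mtime.isoformat(),  # ISO-8601 in UTC
    }


# Try it on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    meta = filesystem_metadata(sample)
```

Fields such as page_count, author, and image dimensions need a PDF or image parser and are left out of this sketch.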

AI-generated metadata (requires an LLM provider): language, tags, document_type, short_description.

from pathlib import Path

from foretrieval.metadata import ai_metadata_provider_factory
from foretrieval.models_metadata import build_metadata_list_for_dir

# No-AI provider: filesystem fields only
provider = ai_metadata_provider_factory(None)

# AI provider: enriches with language, tags, document_type, short_description
provider = ai_metadata_provider_factory({
    "provider": "openrouter",
    "name": "mistralai/mistral-small-3.2-24b-instruct",
    "api_key": "...",
})

metadata_list = build_metadata_list_for_dir(Path("docs/"), provider)

model.index(
    input_path="docs/",
    index_name="my_index",
    metadata=metadata_list,
)

Metadata filtering

When an index was built with metadata, search() accepts a filter_metadata dict that restricts the scoring pool to matching documents only.

Declared filter fields

from foretrieval.models_metadata import MetadataFilter

# Only PDF files
results = model.search("max current", k=3, filter_metadata={"ext": ".pdf"})

# Files modified after a date
results = model.search("max current", k=3, filter_metadata={
    "mtime": {">=": "2025-01-01T00:00:00Z"}
})

# Multiple criteria (AND by default)
results = model.search("max current", k=3, filter_metadata={
    "ext": ".pdf",
    "language": "en",
})

# OR logic
results = model.search("max current", k=3, filter_metadata={
    "ext": [".pdf", ".docx"],
    "logic": "OR",
})

Filter field	Type	Description
`ext`	`str` or `list[str]`	File extension(s)
`mtime`	`dict`	Operators: `>=`, `<=`, `>`, `<`, `==` against ISO-8601 string
`language`	`str` or `list[str]`	Language code(s), e.g. `"en"`
`tags`	`str` or `list[str]`	Any tag in common (requires AI metadata)
`document_type`	`str` or `list[str]`	Document type (requires AI metadata)
`logic`	`"AND"` or `"OR"`	How to combine criteria (default: `"AND"`)

Any other key is matched by exact string equality against the stored metadata dict.
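
To make the semantics above concrete, here is a minimal pure-Python sketch of how such a filter could be evaluated against one document's metadata dict. It is an illustration under assumptions, not the library's code: the function matches is hypothetical, every mtime value is assumed to carry a timezone, and a field missing from the metadata is treated as a non-match.

```python
import operator
from datetime import datetime
from typing import Any, Dict

_OPS = {
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt, "==": operator.eq,
}


def _parse_iso(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def _as_set(value: Any) -> set:
    return set(value) if isinstance(value, (list, tuple, set)) else {value}


def matches(meta: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True when a document's metadata satisfies the filter."""
    checks = []
    for key, wanted in flt.items():
        if key == "logic":
            continue
        stored = meta.get(key)
        if stored is None:
            checks.append(False)
        elif key == "mtime":
            when = _parse_iso(stored)
            checks.append(all(_OPS[op](when, _parse_iso(bound)) for op, bound in wanted.items()))
        else:
            # Scalars and lists are both treated as sets: any overlap counts.
            checks.append(bool(_as_set(stored) & _as_set(wanted)))
    if not checks:
        return True  # an empty filter keeps every document
    combine = any if flt.get("logic", "AND") == "OR" else all
    return combine(checks)


docs = {
    "spec.pdf": {"ext": ".pdf", "language": "en",
                 "mtime": "2025-03-01T12:00:00Z", "tags": ["motor", "datasheet"]},
    "notes.docx": {"ext": ".docx", "language": "fr",
                   "mtime": "2024-06-15T08:30:00Z"},
}
pool = [name for name, meta in docs.items() if matches(meta, {"ext": ".pdf", "language": "en"})]
```

Only the documents left in pool would then be scored against the query.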

Regex pattern matching

Use the regex field for substring or pattern matching on any text field. Patterns use Python re.search and are always case-insensitive:

# Files whose name contains "general"
results = model.search("max current", k=3, filter_metadata={
    "regex": {"stem": "general"}
})

# Title contains "motor" or "pump"
results = model.search("specs", k=3, filter_metadata={
    "regex": {"title": "motor|pump"}
})

# Combine with ext filter
results = model.search("specs", k=3, filter_metadata={
    "ext": ".pdf",
    "regex": {"stem": "^report_2025"},
})

When the filter matches no documents, search() returns an empty list [] without raising.
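
The regex behaviour described in this section can be pictured with the short sketch below, which applies re.search with re.IGNORECASE to each named field. The helper regex_matches is hypothetical; in particular, requiring every pattern to match (AND) and treating a missing field as a non-match are assumptions of this sketch.

```python
import re
from typing import Any, Dict


def regex_matches(meta: Dict[str, Any], patterns: Dict[str, str]) -> bool:
    """True when every pattern is found (re.search, case-insensitive) in its field."""
    for field, pattern in patterns.items():
        value = meta.get(field)
        if value is None or re.search(pattern, str(value), flags=re.IGNORECASE) is None:
            return False
    return True


docs = {
    "Report_2025_Q1": {"stem": "Report_2025_Q1", "title": "Pump maintenance"},
    "general_specs": {"stem": "general_specs", "title": "Motor datasheet"},
    "annex_report_2025": {"stem": "annex_report_2025", "title": None},
}
anchored = [n for n, m in docs.items() if regex_matches(m, {"stem": "^report_2025"})]
either = [n for n, m in docs.items() if regex_matches(m, {"title": "motor|pump"})]
nothing = [n for n, m in docs.items() if regex_matches(m, {"stem": "invoice"})]
```

Because re.search scans the whole string, a pattern matches anywhere unless it is anchored with ^ or $, which is why "^report_2025" skips annex_report_2025.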

Docling ingestion

FORetrieval optionally uses Docling to convert PDFs into semantically meaningful image chunks rather than whole pages. Each chunk corresponds to a coherent region of text and associated figures.

model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    ingestion={"backend": "docling"},
    index_root="my_indexes",
)
model.index(input_path="docs/", index_name="chunked_index")

Results include a chunk_num field identifying the exact Docling chunk within the page.

Running the test suite

Install the dev dependencies first:

uv sync --extra dev

Unit tests

No API keys, no GPU required — runs in seconds:

pytest -m "not slow and not integration"

Metadata tests (no AI)

pytest tests/test_metadata_no_ai.py

Metadata tests (with AI)

Set at least one API key:

export OPENROUTER_API_KEY=...
export OPENAI_API_KEY=...
export MISTRAL_API_KEY=...
export OLLAMA_HOST=http://localhost:11434   # + optionally OLLAMA_MODEL (default: mistral-small-latest)

pytest tests/test_metadata_ai.py -v

All available LLM backends (those whose key or host is set above) are detected automatically, and the suite runs once per backend.

Vector-store backend tests

Unit tests for all backends (no GPU, backends mocked or run in-process):

# Local backend
pytest tests/test_vector_store_local.py

# Qdrant backend (unit: mock client; slow: embedded Qdrant round-trip)
pytest tests/test_vector_store_qdrant.py -m "not slow"
pytest tests/test_qdrant.py -m "not slow and not integration"

# Milvus backend (unit: mock client; slow: Milvus Lite round-trip)
pytest tests/test_vector_store_milvus.py -m "not slow"

# Remote backend (unit: HTTP calls mocked; server app: FastAPI TestClient)
pytest tests/test_vector_store_remote.py
pytest tests/test_vector_db_server_app.py
pytest tests/test_vector_db_server_config.py
pytest tests/test_vector_db_server_client.py
pytest tests/test_vector_db_server_manager.py

# Factory and backend dispatch
pytest tests/test_vector_store_factory.py
pytest tests/test_colpali_backend_dispatch.py

Integration tests require a live vector-DB server (see the Remote vector-DB server section):

# Point the tests at the server; without this variable the skipif guard skips them
export FORETRIEVAL_TEST_DB_SERVER_URL=http://localhost:18000
pytest tests/ -m "slow and integration" -v

Metadata filter tests

pytest tests/test_metadata_filter.py

Slow tests (GPU-dependent)

Full ColPali indexing and search:

pytest -m slow

Markers reference

Marker	Meaning
`slow`	GPU-dependent or computationally expensive
`integration`	Requires a live external service: an API key, an Ollama daemon, or a vector-DB server

Remote embedding server

FORetrieval can offload all embedding computation to a remote GPU server running vLLM. The local machine only loads the processor (tokenizer + image preprocessor) — no model weights, no GPU required locally.

Requirements:

vLLM ≥ 0.19.0 on the remote server
Only ColQwen3 / ColQwen3.5 models are supported by the vLLM /pooling endpoint. ColPali, ColQwen2, and ColQwen2.5 are not supported.
Recommended model: athrael-soju/colqwen3.5-4.5B-v3 (rank 3 on ViDoRe V3, 320-dim, Apache 2.0)

Quick start

from foretrieval import MultiModalRetrieverModel
from foretrieval.embedding_server import EmbeddingServerConfig

cfg = EmbeddingServerConfig(
    url="http://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
)

model = MultiModalRetrieverModel.from_pretrained(
    "athrael-soju/colqwen3.5-4.5B-v3",
    index_root="my_indexes",
    embedding_server=cfg,
)
model.index("path/to/docs/", index_name="my_index")
results = model.search("maximum altitude", k=3)

Auto-deploy

Set auto_deploy=True to have FORetrieval SSH to the GPU server and start the vLLM Docker container automatically if it is not already running. Requires foretrieval[embedding_server] (adds paramiko).

cfg = EmbeddingServerConfig(
    url="http://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
    auto_deploy=True,
    ssh_host="gpu-server",       # SSH target
    ssh_user="myuser",           # optional, defaults to $USER
    n_gpus=-1,                   # -1 = all available GPUs (auto-detected via nvidia-smi)
)

The manager pulls vllm/vllm-openai:latest, starts the container with --tensor-parallel-size N, and writes a metadata file at ~/.foretrieval/deployment.json on the remote. Subsequent calls detect the running container and skip redeployment.

Authentication and SSL

cfg = EmbeddingServerConfig(
    url="https://gpu-server:8000",
    model_name="athrael-soju/colqwen3.5-4.5B-v3",
    api_key="my-secret-token",   # Authorization: Bearer header
    verify_ssl=False,            # for self-signed certificates
)

Deploy vLLM with --api-key my-secret-token to require authentication.

SSH tunnel (firewalled servers)

If port 8000 is not directly reachable, open an SSH tunnel first:

ssh -fNL 8000:localhost:8000 gpu-server

Then use http://localhost:8000 as the URL.

EmbeddingServerConfig reference

Field	Default	Description
`url`	required	Base URL of the vLLM server
`model_name`	required	HuggingFace model ID (must contain `colqwen3`)
`auto_deploy`	`False`	SSH + Docker auto-deploy
`ssh_host`	`None`	SSH hostname (required when `auto_deploy=True`)
`ssh_user`	`None`	SSH username (defaults to `$USER`)
`ssh_key_path`	`None`	Path to SSH private key (defaults to SSH agent)
`n_gpus`	`-1`	Number of GPUs (`-1` = all available)
`port`	`8000`	Port exposed on the remote server
`hf_token`	`None`	HuggingFace token for gated models
`api_key`	`None`	Bearer token for server authentication
`verify_ssl`	`True`	Verify SSL certificates
`batch_size`	`4`	Images per request (auto-halved on OOM)
`request_timeout`	`120`	HTTP timeout in seconds
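
The batch_size row mentions that the batch is halved automatically on OOM. One way such a retry loop could look is sketched below; OutOfMemory, embed_all, and fake_embed are hypothetical stand-ins, and how FORetrieval actually detects an out-of-memory response from the server is not shown here.

```python
from typing import Callable, List, Sequence


class OutOfMemory(RuntimeError):
    """Stand-in for whatever error signals that a batch was too large."""


def embed_all(
    items: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Embed items in batches, halving the batch size whenever a batch hits OOM."""
    results: List[str] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(embed_batch(batch))
        except OutOfMemory:
            if batch_size == 1:
                raise  # a single item does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with the smaller batch
        start += len(batch)
    return results


# Fake backend that can only handle two items per request.
calls: List[int] = []


def fake_embed(batch: Sequence[str]) -> List[str]:
    calls.append(len(batch))
    if len(batch) > 2:
        raise OutOfMemory("batch too large")
    return [f"emb({item})" for item in batch]


embeddings = embed_all(["p1", "p2", "p3", "p4", "p5"], fake_embed, batch_size=4)
```

In this sketch the reduced batch size is kept for the rest of the run rather than being raised again, which trades some throughput for fewer failed requests.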

Remote vector-DB server

FORetrieval can offload all vector-store operations (indexing, search, fetch) to a remote HTTP server. The local machine only stores the processor and the shared sidecar files — collections live entirely on the server.

Requires: foretrieval[vector_db_server] (adds fastapi, uvicorn, paramiko).

Quick start

Start the server manually on a remote host:

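# The package is not on PyPI: install from a release tag (or a downloaded wheel) as described under Releases
pip install "foretrieval[qdrant,milvus,vector_db_server] @ git+ssh://git@github.com/FOR-sight-ai/FORetrieval.git@v2026.5.0"   # or the same with: uv pip install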
uvicorn foretrieval.vector_db_server.server:app --host 0.0.0.0 --port 18000
# or: foretrieval-db-server   (console script)

Then use it from the client:

from foretrieval import MultiModalRetrieverModel

model = MultiModalRetrieverModel.from_pretrained(
    "athrael-soju/colqwen3.5-4.5B-v3",
    storage_backend="remote",
    storage_config={
        "url": "http://gpu-server:18000",
        "backend": "qdrant",   # server-side storage backend: local | qdrant | milvus
    },
)
model.index("path/to/docs/", index_name="my_index")
results = model.search("maximum altitude", k=3)

Auto-deploy

Set auto_deploy=True to have FORetrieval SSH to the remote host, build the Docker image from the local foretrieval source, and start the container automatically. Requires foretrieval[vector_db_server] and Docker on the remote host.

model = MultiModalRetrieverModel.from_pretrained(
    "athrael-soju/colqwen3.5-4.5B-v3",
    storage_backend="remote",
    storage_config={
        "url": "http://gpu-server:18000",
        "backend": "qdrant",
        "auto_deploy": True,
        "ssh_host": "gpu-server",
        "data_dir": "/var/lib/foretrieval_db",  # bind-mounted into container
    },
)

The manager:

1. Uploads the foretrieval/ package source to ~/foretrieval_db_build/ via SSH.
2. Runs docker build -t foretrieval-vector-db:local on the remote.
3. Starts the container: docker run -p 18000:18000 -v <data_dir>:/data ….
4. Writes metadata to ~/.foretrieval/db_deployment.json on the remote. Subsequent calls detect the running container and skip re-deployment.

Authentication and SSL

storage_config={
    "url": "https://gpu-server:18000",
    "backend": "qdrant",
    "api_key": "my-secret-token",   # Authorization: Bearer header
    "verify_ssl": False,            # for self-signed certificates
}

Start the server with FOR_DB_API_KEY=my-secret-token to require authentication.

SSH tunnel (firewalled servers)

If port 18000 is not directly reachable, open an SSH tunnel first:

ssh -fNL 18000:localhost:18000 gpu-server

Then use http://localhost:18000 as the URL.

Server environment variables

Variable	Default	Description
`FOR_DB_DATA_DIR`	`/data`	Root directory where collections are persisted
`FOR_DB_API_KEY`	`""`	Bearer token (auth disabled if empty)
`FOR_DB_HOST`	`0.0.0.0`	Bind address
`FOR_DB_PORT`	`18000`	Bind port
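
The defaults in this table, and the rule that an empty FOR_DB_API_KEY disables authentication, can be expressed in a few lines. The sketch below is hypothetical (ServerSettings, load_settings, and is_authorized are not FORetrieval names) and only mirrors the behaviour the table describes.

```python
import hmac
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServerSettings:
    data_dir: str
    api_key: str
    host: str
    port: int

    @property
    def auth_enabled(self) -> bool:
        return bool(self.api_key)  # an empty key means authentication is off


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the FOR_DB_* variables, falling back to the documented defaults."""
    return ServerSettings(
        data_dir=env.get("FOR_DB_DATA_DIR", "/data"),
        api_key=env.get("FOR_DB_API_KEY", ""),
        host=env.get("FOR_DB_HOST", "0.0.0.0"),
        port=int(env.get("FOR_DB_PORT", "18000")),
    )


def is_authorized(settings: ServerSettings, authorization_header: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header value against the key."""
    if not settings.auth_enabled:
        return True
    expected = f"Bearer {settings.api_key}".encode()
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(authorization_header.encode(), expected)


defaults = load_settings({})
secured = load_settings({"FOR_DB_API_KEY": "my-secret-token", "FOR_DB_PORT": "19000"})
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.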

VectorDBServerConfig reference

Field	Default	Description
`url`	required	Base URL of the vector-DB server
`backend`	`"qdrant"`	Server-side storage backend (`local`, `qdrant`, or `milvus`)
`storage_config`	`None`	Extra backend-specific config forwarded to the server (e.g. `{"candidate_limit": 128}` for Milvus)
`auto_deploy`	`False`	SSH + Docker auto-deploy
`ssh_host`	`None`	SSH hostname (required when `auto_deploy=True`)
`ssh_user`	`None`	SSH username (defaults to `$USER`)
`ssh_key_path`	`None`	Path to SSH private key (defaults to SSH agent)
`port`	`18000`	Port exposed on the remote server
`api_key`	`None`	Bearer token for server authentication (never persisted to disk)
`verify_ssl`	`True`	Verify SSL certificates
`request_timeout`	`120`	HTTP timeout in seconds
`data_dir`	`/var/lib/foretrieval_db`	Data path on the remote host (bind-mounted into container)

Persistence and reload

When a remote index is exported (model._export_index()), the index_config.json.gz on the local filesystem stores storage_backend="remote" and the server URL. Sensitive fields (api_key) are never persisted to disk. To reload the index later:

model = MultiModalRetrieverModel.from_index(
    "my_index",
    index_root=".",
    storage_config={"api_key": "my-secret"},   # re-supply at load time
)
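
The pattern behind this, dropping secrets before writing the config and merging them back in at load time, can be sketched as follows. The JSON layout and the helper names are assumptions made for illustration; only the file name index_config.json.gz and the fact that api_key is never written come from the description above.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

SENSITIVE_KEYS = {"api_key"}


def save_index_config(path: Path, storage_backend: str, storage_config: Dict[str, Any]) -> None:
    """Write the config as gzip-compressed JSON, leaving out sensitive fields."""
    public = {k: v for k, v in storage_config.items() if k not in SENSITIVE_KEYS}
    payload = {"storage_backend": storage_backend, "storage_config": public}
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_index_config(path: Path, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Read the config back and merge in values supplied at load time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        payload = json.load(fh)
    payload["storage_config"] = {**payload["storage_config"], **overrides}
    return payload


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "index_config.json.gz"
    save_index_config(
        cfg_path, "remote", {"url": "http://gpu-server:18000", "api_key": "my-secret"}
    )
    on_disk = json.loads(gzip.decompress(cfg_path.read_bytes()))
    reloaded = load_index_config(cfg_path, {"api_key": "my-secret"})
```

After the round trip, on_disk holds the URL but no key, while reloaded has both.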

Local model quantization

For local (non-remote) inference, 4-bit and 8-bit quantization reduce VRAM usage via BitsAndBytes. Requires foretrieval[quantization] and a CUDA device.

model = MultiModalRetrieverModel.from_pretrained(
    "vidore/colqwen2.5-v0.2",
    load_in_4bit=True,                  # or load_in_8bit=True
    bnb_4bit_quant_type="nf4",          # "nf4" (default) or "fp4"
    bnb_4bit_compute_dtype="float16",   # compute dtype
)
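
For a rough sense of the savings, the memory taken by the weights alone scales with the number of bits per parameter. The estimate below uses a hypothetical 3-billion-parameter model; substitute the real parameter count. It ignores activations, layers that stay unquantized, and the small overhead of quantization constants, so actual VRAM usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
n_params = 3e9
estimates = {bits: round(weight_memory_gib(n_params, bits), 2) for bits in (16, 8, 4)}

for bits, gib in estimates.items():
    print(f"{bits:>2}-bit weights: about {gib} GiB")
```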

Acknowledgements

FORetrieval was originally forked from Byaldi, a wrapper around the ColPali repository. It has since diverged significantly to add metadata generation and filtering, Qdrant storage, Docling ingestion, and heatmap visualisation.
