Apache Arrow-based protocol for extending DuckDB using any language.
No C++/C/Zig/Rust or compilation/linking required (unless you want to).
Created by Query.Farm
📖 Documentation: vgi-python.query.farm
# my_worker.py
# /// script
# requires-python = ">=3.13"
# dependencies = ["vgi-python"]
# ///
from typing import Annotated
from vgi import ScalarFunction, Param, Returns, Worker
from vgi.catalog import Catalog, Schema
import pyarrow as pa
import pyarrow.compute as pc
class Greeting(ScalarFunction):
    """Generate a greeting for each name."""

    @classmethod
    def compute(
        cls,
        name: Annotated[pa.StringArray, Param(doc="Column containing names")],
    ) -> Annotated[pa.StringArray, Returns()]:
        # The final argument to binary_join_element_wise is the separator.
        return pc.binary_join_element_wise("Hello, ", name, "!", "")


class MyWorker(Worker):
    catalog = Catalog(
        name="my_worker",
        schemas=[Schema(name="main", functions=[Greeting])],
    )


if __name__ == "__main__":
    MyWorker().run()

The # /// script block is inline script metadata:
uv run my_worker.py reads it, provisions an isolated environment with
vgi-python, and runs the worker, so there is no virtualenv to create or activate.
-- First time only.
INSTALL vgi FROM community;
LOAD vgi;
-- LOCATION is the command that launches the worker. `uv run` resolves the
-- script's inline dependencies, so nothing needs to be installed first.
ATTACH 'my_worker' (TYPE vgi, LOCATION 'uv run my_worker.py');
SELECT my_worker.greeting(name) FROM users;
-- "Hello, Alice!"
-- "Hello, Bob!"

Or you can launch the Haybarn CLI and attach the worker in one step:
uvx haybarn-cli "vgi:my_worker?location=uv run my_worker.py"

This drops you into a session with the functions you just added, available as
my_worker.greeting(...).
That's it. No C++ compilation, no extension versioning, no complex build process. Just a Python script that Haybarn (or DuckDB) can call.
The Python package is published on PyPI as vgi-python (the vgi name was
taken), but you still import vgi in code. The examples above don't install it
explicitly — the worker script's inline # /// script metadata lets uv run
provision it on demand. To add it to a project or environment directly:
pip install vgi-python # or: uv add vgi-python

You also need a DuckDB-compatible SQL engine to load the vgi extension and
call your functions. These examples use Haybarn,
Query Farm's DuckDB distribution, which ships the vgi extension signed for its
own catalog and runs with no install via uvx:
uvx haybarn-cli # start an interactive SQL session

Stock duckdb works too: INSTALL vgi FROM community; LOAD vgi; resolves the
extension from the DuckDB community repository instead.
VGI lets you extend DuckDB with Python functions that run in separate processes, communicating via Apache Arrow IPC. This means:
| Traditional Extensions | VGI Workers |
|---|---|
| C/C++ compilation required | Any language with an Apache Arrow library |
| Tied to DuckDB version | Version independent |
| Complex build/release cycle | Ship a script or executable |
| Runs in-process | Process isolation |
| Single-threaded | Parallel workers |
Use cases:
- Call REST APIs or external services from SQL
- Run ML inference (PyTorch, scikit-learn, etc.)
- Process data with Python libraries (pandas, numpy)
- Build custom ETL transforms
- Create domain-specific functions for your team
- Expose external data sources as queryable tables and views
A worker is a Python script that defines one or more functions:
# my_worker.py
# /// script
# requires-python = ">=3.13"
# dependencies = ["vgi-python"]
# ///
from typing import Annotated
import pyarrow as pa
import pyarrow.compute as pc
from vgi import ScalarFunction, Param, Returns, Worker
from vgi.catalog import Catalog, Schema
class UpperCase(ScalarFunction):
    """Convert string values to uppercase."""

    @classmethod
    def compute(
        cls,
        value: Annotated[pa.StringArray, Param(doc="String value to uppercase")],
    ) -> Annotated[pa.StringArray, Returns()]:
        return pc.utf8_upper(value)


class MyWorker(Worker):
    catalog = Catalog(
        name="my_funcs",
        schemas=[Schema(name="main", functions=[UpperCase])],
    )


if __name__ == "__main__":
    MyWorker().run()

-- Attach the worker as a catalog (its catalog name is "my_funcs")
ATTACH 'my_funcs' (TYPE vgi, LOCATION 'uv run my_worker.py');
-- Call your function (qualify with the catalog name, or run `USE my_funcs;` first)
SELECT my_funcs.upper_case(name) FROM users;
-- Use in complex queries
SELECT id, my_funcs.upper_case(status) as status
FROM orders
WHERE created_at > '2024-01-01';

Your function is now available in any DuckDB-compatible engine. Ship the Python script to your team, and they can use it immediately.
For production use, you'll want type validation. Use Param with type_bound to ensure columns have the correct type:
from typing import Annotated
from vgi import ScalarFunction, Param, Returns, Worker
import pyarrow as pa
import pyarrow.compute as pc
class AddValues(ScalarFunction):
    """Add two integer values together."""

    @classmethod
    def compute(
        cls,
        left: Annotated[pa.Int64Array, Param(type_bound=pa.types.is_integer, doc="First integer value")],
        right: Annotated[pa.Int64Array, Param(type_bound=pa.types.is_integer, doc="Second integer value")],
    ) -> Annotated[pa.Int64Array, Returns()]:
        return pc.add(left, right)

SELECT add_values(price, tax) as total FROM orders;
-- This would fail at bind time with a clear error:
-- SELECT add_values(name, price) FROM orders;
-- Error: Column 'name' has type string, expected integer

Key features of the Param/Returns API:
- Types are inferred from PyArrow array annotations (pa.Int64Array -> pa.int64())
- type_bound validates the column's Arrow type at bind time
- ConstParam receives scalar values (not columns) from SQL arguments
- Returns declares the output type
VGI supports three function types:
| Type | Base Class | SQL Pattern | Use Case |
|---|---|---|---|
| Scalar | ScalarFunction | SELECT func(col) FROM t | Per-row transforms (1:1) |
| Table | TableFunctionGenerator | SELECT * FROM func(args) | Generate data |
| Table-In-Out | TableInOutFunction | SELECT * FROM func((SELECT ...)) | Aggregation, filtering |
Transform each row independently. Output has the same number of rows as input.
class Double(ScalarFunction):
    """Double an integer value."""

    @classmethod
    def compute(
        cls,
        value: Annotated[pa.Int64Array, Param(doc="Value to double")],
    ) -> Annotated[pa.Int64Array, Returns()]:
        return pc.multiply(value, 2)

Generate output data from arguments (no input table). Each call to process() emits
a batch via out.emit() or signals completion via out.finish().
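
The calling contract (repeated process() calls until finish() is signalled) can be previewed without VGI at all. In this sketch, ListCollector is a hypothetical stand-in for the real output collector, state is a plain dict, and batches are plain lists; only the emit-or-finish control flow is the point.

```python
from dataclasses import dataclass, field


@dataclass
class ListCollector:
    """Hypothetical stand-in for an output collector; records what process() does."""

    batches: list = field(default_factory=list)
    finished: bool = False

    def emit(self, batch):
        self.batches.append(batch)

    def finish(self):
        self.finished = True


def process(state, out, batch_size=1000):
    """One step: emit up to batch_size values, or finish when nothing is left."""
    if state["remaining"] <= 0:
        out.finish()
        return
    n = min(state["remaining"], batch_size)
    out.emit(list(range(state["current"], state["current"] + n)))
    state["current"] += n
    state["remaining"] -= n


state = {"remaining": 2500, "current": 0}
out = ListCollector()
while not out.finished:  # the caller keeps invoking process() until finish()
    process(state, out)

print([len(b) for b in out.batches])  # [1000, 1000, 500]
```

Keeping the progress in an explicit state object, rather than in a Python generator, is what lets each call stand alone.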
from dataclasses import dataclass
from typing import Annotated, ClassVar
import pyarrow as pa
from vgi import TableFunctionGenerator, Arg
from vgi.table_function import ProcessParams, OutputCollector
@dataclass
class CounterState:
    remaining: int
    current: int = 0


class Counter(TableFunctionGenerator):
    """Generate a sequence of integers."""

    count: Annotated[int, Arg(0, doc="Number of rows to generate")]
    FIXED_SCHEMA: ClassVar[pa.Schema] = pa.schema([("n", pa.int64())])

    @classmethod
    def initial_state(cls, params: ProcessParams) -> CounterState:
        return CounterState(remaining=params.args.count)

    @classmethod
    def process(cls, params: ProcessParams, state: CounterState, out: OutputCollector) -> None:
        if state.remaining <= 0:
            out.finish()
            return
        batch_size = min(state.remaining, 1000)
        values = list(range(state.current, state.current + batch_size))
        out.emit(pa.RecordBatch.from_pydict({"n": values}, schema=params.output_schema))
        state.current += batch_size
        state.remaining -= batch_size

Transform or aggregate input data. Override transform() for per-batch processing
and finish() for final output after all input is consumed.
import pyarrow as pa
import pyarrow.compute as pc
from vgi import TableInOutFunction
class FilterPositive(TableInOutFunction):
    """Keep only rows where all numeric columns are positive."""

    @property
    def output_schema(self) -> pa.Schema:
        return self.input_schema

    def transform(self, batch: pa.RecordBatch) -> pa.RecordBatch:
        mask = None
        for i, field in enumerate(batch.schema):
            if pa.types.is_integer(field.type) or pa.types.is_floating(field.type):
                col_mask = pc.greater(batch.column(i), 0)
                mask = col_mask if mask is None else pc.and_(mask, col_mask)
        if mask is not None:
            return pc.filter(batch, mask)
        return batch

VGI workers can expose more than just functions. A worker can provide a complete database catalog with:
- Schemas - Organize objects into namespaces
- Tables - Expose external data as queryable tables
- Views - Define SQL views over your data
- Functions - Scalar, table, and table-in-out functions
ATTACH 'external_db' (TYPE vgi, LOCATION 'uv run my_catalog_worker.py');
-- Query tables from the attached catalog
SELECT * FROM external_db.main.users;
-- Use views
SELECT * FROM external_db.analytics.daily_summary;
-- Call functions
SELECT external_db.main.transform(col) FROM my_table;

This enables VGI workers to act as bridges to external systems (databases, APIs, file systems), presenting them as native DuckDB catalogs.
See Catalog Interface for implementation details.
Functions can run across multiple worker processes. The client automatically distributes input batches round-robin across workers and collects results.
See Function API Reference for advanced patterns like distributed aggregation.
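
Round-robin distribution simply means batch i goes to worker i modulo the worker count. A minimal sketch, assuming batches are opaque items and workers are identified by index (the function name distribute_round_robin is made up for this example and is not part of the VGI client):

```python
from itertools import cycle


def distribute_round_robin(batches, num_workers):
    """Assign batch i to worker i % num_workers, preserving arrival order per worker."""
    assignments = {worker: [] for worker in range(num_workers)}
    for worker, batch in zip(cycle(range(num_workers)), batches):
        assignments[worker].append(batch)
    return assignments


plan = distribute_round_robin(["b0", "b1", "b2", "b3", "b4"], num_workers=2)
print(plan)  # {0: ['b0', 'b2', 'b4'], 1: ['b1', 'b3']}
```

Since any batch can land on any worker, a function that runs in parallel should not assume it sees neighbouring rows together.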
Errors in your functions propagate to DuckDB with clear messages:
@classmethod
def compute(cls, value: Annotated[pa.Int64Array, Param()]) -> Annotated[pa.Int64Array, Returns()]:
    raise ValueError("Something went wrong")

SELECT my_func(col) FROM my_table;
-- Error: Something went wrong

Type bound violations are caught at bind time (before processing starts):
SELECT add_values(name, price) FROM orders;
-- Error: Argument 'left': Column 'name' has type string,
-- but type bound requires: is_integer

When a worker fails, the Python traceback is written to stderr. By default, the client captures this stderr and includes it in the error message (last 50 lines), so you get the full context:
ClientError: Worker Exception: function 'my_func' raised ValueError
Worker stderr:
Traceback (most recent call last):
File "my_worker.py", line 42, in compute
...
ValueError: Something went wrong
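
The idea of capturing a child's stderr and keeping only its tail can be sketched with the standard library. This is an illustration of the technique, not the VGI client's implementation; run_and_capture_tail is a hypothetical helper, and the child process is a throwaway one-liner rather than a real worker.

```python
import subprocess
import sys


def run_and_capture_tail(argv, max_lines=50):
    """Run a command; on failure, return the last max_lines lines of its stderr."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode == 0:
        return None
    return "\n".join(proc.stderr.splitlines()[-max_lines:])


# A throwaway child process that fails the way a worker function might.
tail = run_and_capture_tail(
    [sys.executable, "-c", "raise ValueError('Something went wrong')"]
)
print(tail)
```

Trimming to the tail keeps the exception line, which Python prints last, while bounding the size of the error message.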
For real-time debugging, set VGI_WORKER_DEBUG=1 to stream worker logs directly to your terminal and enable DEBUG-level logging:
VGI_WORKER_DEBUG=1 python my_script.py

This is especially useful when integrating from C++ or other clients where stderr might otherwise be lost.
Use the VGI client for integration tests:
from vgi.client import Client
from vgi import Arguments
import pyarrow as pa
batch = pa.RecordBatch.from_pydict({"name": ["alice", "bob"]})
with Client("./my_worker.py") as client:
    results = list(client.scalar_function(
        function_name="upper_case",
        input=iter([batch]),
        arguments=Arguments(positional=[pa.scalar("name")]),
    ))
assert results[0]["result"].to_pylist() == ["ALICE", "BOB"]

VGI uses vgi_rpc, an Apache Arrow IPC-based RPC framework, for all
client-worker communication over stdin/stdout pipes:
Client Worker
│ │
│──── bind(request) ──────────────▶ │ Function name, args, input schema
│◀─── BindResponse ──────────────── │ Output schema, opaque data
│ │
│──── init(request) ──────────────▶ │ Start processing stream
│◀─── Stream header ─────────────── │ execution_id, max_workers
│ │
│──── exchange(batch1) ───────────▶ │
│◀─── output batch 1 ────────────── │ transform(batch)
│ ... │
│──── [stream close] ─────────────▶ │ Signal end of input
│ │
│──── init(phase=FINALIZE) ───────▶ │ Start finalize stream
│◀─── final output batches ──────── │ finish() results
└───────────────────────────────────┘
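
The call order in the diagram can be mimicked with a toy, in-process class. This is only a sketch: ToyWorker, its method bodies, the list-based batches, and the phase name "PROCESS" are hypothetical and are not the vgi_rpc API. Only the ordering (bind, init, exchange per batch, then a FINALIZE init followed by final output) is taken from the diagram.

```python
class ToyWorker:
    """Hypothetical in-process model of the call order shown in the diagram."""

    def __init__(self):
        self.rows_seen = 0
        self.calls = []

    def bind(self, function_name, input_schema):
        # bind: the worker answers with an output schema before any data flows.
        self.calls.append("bind")
        return {"output_schema": input_schema}

    def init(self, phase):
        # init: opens a stream for the given phase.
        self.calls.append(f"init:{phase}")

    def exchange(self, batch):
        # exchange: one input batch in, one output batch out.
        self.calls.append("exchange")
        self.rows_seen += len(batch)
        return [value * 2 for value in batch]

    def finish(self):
        # finish: output that is only available once all input was consumed.
        self.calls.append("finish")
        return [self.rows_seen]


worker = ToyWorker()
worker.bind("double", input_schema=["value: int64"])
worker.init(phase="PROCESS")
outputs = [worker.exchange(batch) for batch in ([1, 2], [3])]
worker.init(phase="FINALIZE")
final = worker.finish()

print(worker.calls)
```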
When record batches are too large for HTTP request/response bodies, VGI supports externalizing them to blob storage. The server replaces oversized batches with pointer batches containing a URL, and the client transparently fetches the data.
The example HTTP server includes a built-in demo blob store for testing this without S3 or any cloud infrastructure:
# Start with demo storage (4 KiB threshold for testing)
vgi-fixture-http --demo-storage --externalize-threshold-bytes 4096
# With zstd compression
vgi-fixture-http --demo-storage --externalize-threshold-bytes 4096 --externalize-compression zstd

When --demo-storage is enabled:
- Batches exceeding --externalize-threshold-bytes are stored in memory and served from /__blobs__/{id} endpoints on the same server
- Clients can request upload URLs for large inputs via the __upload_url__ endpoint
- The server advertises VGI-Max-Request-Bytes and rejects oversized requests with 413
For production use, implement the ExternalStorage protocol from vgi_rpc against
your cloud storage (S3, GCS, etc.). The example server also supports S3 via --s3-bucket.
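
The pointer-batch idea can be sketched in a few lines: payloads above a threshold are stored elsewhere and replaced by a small message carrying a URL, which the receiver resolves transparently. Everything in this sketch (InMemoryBlobStore, maybe_externalize, resolve, and the dict message shape) is hypothetical and does not reflect the ExternalStorage protocol's actual interface; only the /__blobs__/{id} path shape and the 4096-byte threshold are borrowed from the demo description above.

```python
import uuid


class InMemoryBlobStore:
    """Hypothetical stand-in for blob storage; keeps payloads in a dict."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self._blobs[blob_id] = data
        return f"/__blobs__/{blob_id}"

    def get(self, url: str) -> bytes:
        return self._blobs[url.rsplit("/", 1)[-1]]


def maybe_externalize(payload: bytes, store, threshold_bytes: int) -> dict:
    """Send small payloads inline; swap large ones for a pointer to the store."""
    if len(payload) > threshold_bytes:
        return {"kind": "pointer", "url": store.put(payload)}
    return {"kind": "inline", "data": payload}


def resolve(message: dict, store) -> bytes:
    """Receiver side: follow the pointer if there is one."""
    if message["kind"] == "pointer":
        return store.get(message["url"])
    return message["data"]


store = InMemoryBlobStore()
small = maybe_externalize(b"x" * 100, store, threshold_bytes=4096)
large = maybe_externalize(b"y" * 5000, store, threshold_bytes=4096)

print(small["kind"], large["kind"])  # inline pointer
```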
- Function Lifecycle - Bind, init, process, finalize
- Metadata API - Function introspection
- Function API Reference - Advanced function patterns
- Catalog Interface - DuckDB ATTACH integration
Workers support --debug, --log-level, --log-format, and --log-logger options:
# Enable debug logging
vgi-fixture-worker --debug
# JSON-formatted logs for structured pipelines
vgi-fixture-worker --log-format json
# Target a specific logger
vgi-fixture-worker --log-level DEBUG --log-logger vgi.worker

You can also use the VGI_WORKER_DEBUG=1 environment variable, which enables --debug on the worker and stderr passthrough on the client without changing any code or CLI flags:
VGI_WORKER_DEBUG=1 python my_script.py

See CLI Reference for the full list of loggers and options.
git clone https://github.com/query-farm/vgi-python
cd vgi-python
uv sync --all-extras # Install dependencies
uv run pytest -n auto # Run tests
uv run ruff check --fix . # Lint
uv run ruff format . # Format
uv run mypy vgi/ # Type check

- Python >= 3.13
- pyarrow
- A DuckDB-compatible engine for SQL integration: Haybarn (uvx haybarn-cli) or stock DuckDB
Copyright (c) 2025, 2026 Query Farm LLC.
Licensed under the Query Farm Source-Available License, Version 1.0 — see LICENSE for the binding terms. In summary (the LICENSE text governs):
- ✅ Use, copy, modify, and redistribute the code freely, including in production and for commercial purposes — your own internal use, and building products and services on top of VGI.
- 🚫 Not permitted without a separate commercial license: offering a competing VGI-equivalent product or service to third parties (hosted, embedded, or as-a-service), or operating a commercial marketplace for such services.
- ⏳ Each released version converts to the Apache License, Version 2.0, ten years after its public release.
For a commercial license or any licensing questions, contact hello@query.farm.
