Engineering

Arrow-Native Queries: Zero-Copy Data for AI Agents

HatiData Team · 7 min read

The Serialization Tax Nobody Talks About

Every AI agent that queries a database pays a hidden tax. The database stores your data as typed, columnar values — integers, floats, timestamps, booleans. But the moment those values leave the database, they are converted to text. The integer 42 becomes the string "42". The timestamp 2026-02-18T10:30:00Z becomes a 20-character string. The boolean true becomes 4 bytes of ASCII.

Then the agent's Python code parses every single value back into its native type. int("42"). datetime.fromisoformat("2026-02-18T10:30:00Z"). value == "true". For a query returning 10,000 rows with 20 columns, that is 200,000 individual parse operations — each one allocating memory, validating format, and converting bytes.

This is the serialization tax: the CPU cost of converting typed data to text and back again, repeated on every query, for every agent, every time. For a single small query, the cost is negligible. For an AI agent running hundreds of queries per hour across thousands of rows, the serialization tax compounds into a measurable fraction of total compute cost.
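A rough way to feel this cost on your own machine (timings will vary with hardware and Python version; the 10,000 × 20 shape mirrors the example above):

```python
import time
from datetime import datetime

# Simulate the per-value parse work for a 10,000-row x 20-column result
# delivered as text: every cell must be parsed back into a native type.
rows, cols = 10_000, 20
cells = [("42", "2026-02-18T10:30:00+00:00")] * (rows * cols // 2)

start = time.perf_counter()
parsed = [(int(i), datetime.fromisoformat(ts)) for i, ts in cells]
elapsed = time.perf_counter() - start

print(f"{rows * cols:,} parse operations took {elapsed * 1000:.1f} ms")
```

Each tuple contributes two parses, so the comprehension performs the full 200,000 conversions that a text protocol forces on the client.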

The irony is that the data never needed to be text in the first place. The engine stores data internally in columnar format. Pandas and polars operate on Arrow-compatible columnar arrays. The only reason the data becomes text is because the transport protocol — Postgres wire protocol, JSON REST — was designed for human-readable output, not machine-to-machine data transfer.

What Arrow-Native Queries Change

HatiData now supports returning query results as Apache Arrow IPC streams — the standard binary format for inter-process columnar data transfer. When an agent requests Arrow-native results, the data flows directly from the engine's internal columnar representation to the agent's DataFrame without ever becoming text.

The HTTP endpoint is straightforward:

```python
import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_..."},
)

# Arrow IPC bytes → DataFrame in one step
reader = ipc.open_stream(resp.content)
table = reader.read_all()
df = table.to_pandas()  # Zero parsing. Types preserved.
```

The response is raw Arrow IPC bytes with content-type: application/vnd.apache.arrow.stream. No JSON wrapping. No string encoding. The bytes are the data.

For polars users, the path is even more efficient because polars is built natively on Apache Arrow:

```python
import polars as pl
import pyarrow.ipc as ipc

reader = ipc.open_stream(resp.content)
table = reader.read_all()
df = pl.from_arrow(table)  # True zero-copy
```

Why This Matters for Agent Workloads

The serialization tax is not a theoretical concern. It manifests in three concrete ways that affect production agent deployments.

First, latency. An agent running a ReAct loop executes queries sequentially — each query must complete before the next reasoning step begins. When each query result requires 200,000 parse operations, the cumulative latency adds seconds to the agent's total reasoning time. Arrow-native results eliminate this parsing latency entirely.

Second, memory. Text representations are almost always larger than typed binary representations. A timestamp is 8 bytes in Arrow but 20+ bytes as an ISO 8601 string; a double-precision float is 8 bytes in Arrow but up to 17 significant digits of decimal text in JSON. For large result sets, the memory savings from Arrow are substantial: often 2x to 5x compared to text-based formats.

Third, type fidelity. JSON has no native representation for dates, timestamps, or decimal numbers, and its arrays are untyped. A column of integers with one null typically round-trips through JSON into a float or object column in pandas, because the null forces a coercion. Arrow preserves every type exactly as the database stores it — including nulls as first-class values in typed arrays. Agents consuming Arrow results never encounter type coercion surprises.

The MCP Tool Path

For agents connected via MCP (Model Context Protocol), the Arrow-native path is available as the run_sql_arrow tool. Since MCP tool results must be JSON text, the Arrow IPC bytes are base64-encoded:

```json
{
  "name": "run_sql_arrow",
  "arguments": {
    "sql": "SELECT region, SUM(revenue) as total FROM sales GROUP BY region"
  }
}
```

The response includes the base64-encoded Arrow IPC stream:

```json
{
  "format": "arrow_ipc_base64",
  "data": "QVJST1cxAAAAAAoAAAAAAAD...",
  "rows": 12,
  "execution_time_ms": 3
}
```

Decode in three lines:

```python
import base64, pyarrow.ipc as ipc, io

ipc_bytes = base64.b64decode(result["data"])
table = ipc.open_stream(io.BytesIO(ipc_bytes)).read_all()
df = table.to_pandas()
```

The base64 encoding adds approximately 33% overhead to the payload size, which makes the MCP tool path less efficient than the direct HTTP endpoint for very large result sets. For typical agent workloads (under 100K rows), the difference is negligible. For bulk data analysis, use the HTTP endpoint directly.
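The 33% figure follows directly from base64's 3-bytes-in, 4-characters-out encoding, which is easy to verify:

```python
import base64

raw = bytes(256) * 1000          # 256,000 raw bytes
encoded = base64.b64encode(raw)  # 4 output characters per 3 input bytes
print(len(encoded) / len(raw))   # ~1.33, i.e. ~33% payload overhead
```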

Security: Column Masking on Arrow Arrays

A natural question: how does column masking work with Arrow-native results? HatiData's column masking engine applies masking functions (full redaction, partial redaction, cryptographic hash, null) directly on Arrow RecordBatch arrays. When a column is masked, it is converted to a UTF-8 string array with the masking function applied to each value.

This means masked columns change their Arrow data type. An integer column masked with full redaction becomes a string column containing "<REDACTED>" values. The Arrow schema reflects this change, so the agent's code can detect which columns were masked by comparing the response schema against the expected table schema.

The security pipeline is identical for Arrow and text queries. Authentication, ABAC policy evaluation, row-level security, cost estimation, quota checks, transpilation, metering, and audit logging all execute the same way. The only difference is the final serialization step: Arrow IPC instead of Postgres text protocol or JSON.

When to Use Arrow vs. JSON

Arrow-native queries are not a universal replacement for JSON results. They are optimal for specific workload patterns:

Use Arrow when your agent processes results with pandas, polars, or any Arrow-compatible library. Use Arrow when result sets exceed a few hundred rows. Use Arrow when type fidelity matters — financial calculations, timestamp comparisons, nullable integer columns. Use Arrow when you are feeding query results into ML pipelines, statistical analysis, or visualization code.

Use JSON when your agent inspects individual values in a reasoning chain. Use JSON when result sets are small (under 100 rows) and the parsing overhead is negligible. Use JSON when you need human-readable output for debugging or logging.

Most production agents will use both: run_sql_arrow for data analysis tasks and query for lookup and validation tasks. The choice is per-query, not per-agent.
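A per-query heuristic following this guidance might look like the sketch below; the 100-row cutoff mirrors the post, everything else is a judgment call to tune for your workload:

```python
def choose_result_format(row_estimate: int, feeds_dataframe: bool) -> str:
    """Illustrative per-query dispatch: Arrow for DataFrame-bound or
    larger results, JSON for small lookups and reasoning-chain reads."""
    if feeds_dataframe or row_estimate > 100:
        return "arrow"
    return "json"
```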

Try It

Arrow-native queries are available in HatiData today. If you are running hati-local, update to the latest version and the /v1/query/arrow endpoint is enabled by default on port 5440. The run_sql_arrow MCP tool is available alongside the existing 23 tools.

```bash
# Update to latest
curl -fsSL https://hatidata.com/install.sh | sh

# Test Arrow endpoint
python3 -c "
import pyarrow.ipc as ipc, requests
resp = requests.post('http://localhost:5440/v1/query/arrow',
    json={'sql': 'SELECT 1 AS num, 3.14 AS pi, true AS flag'},
    headers={'Authorization': 'Bearer dev-token'})
table = ipc.open_stream(resp.content).read_all()
print(table.to_pandas())
print(table.to_pandas().dtypes)  # int64, float64, bool
"
```

The serialization tax has been the default for so long that most teams do not realize they are paying it. Arrow-native queries make the alternative visible: typed data, columnar layout, zero parsing. For AI agents that live in the pandas and polars ecosystem, this is how database results should always have worked.

