Arrow & Polars Guide

TTOON maintains two independent processing paths: the object path (general-purpose) and the Arrow path (high-performance tabular). This guide covers the Arrow path.

Why a Separate Arrow Path?

The Arrow path keeps tabular data in Arrow-native columnar form instead of language-native objects. Today, the strongest fast path is T-JSON → Arrow direct read; T-TOON tabular still interoperates through the compatibility Node route. For tabular data, this means:

  • No language-native row materialization on the Arrow side — data stays columnar instead of becoming dict / JS object rows
  • Lower conversion overhead where direct paths exist — especially for T-JSON → Arrow reads
  • Native type preservation: Decimal128, Date32, Timestamp, and FixedSizeBinary(16) (UUID) stay in their Arrow-native forms

Python: Polars & PyArrow

Serialize

import polars as pl
import pyarrow as pa
import ttoon

# Polars DataFrame
df = pl.DataFrame({"name": ["Alice", "Bob"], "score": [95, 87]})
text = ttoon.dumps(df)
# [2]{name,score}:
# "Alice", 95
# "Bob", 87

# PyArrow Table
table = pa.table({"name": ["Alice", "Bob"], "score": [95, 87]})
text = ttoon.dumps(table)

# Arrow → T-JSON
text = ttoon.stringify_arrow_tjson(df)
# [{"name": "Alice", "score": 95}, {"name": "Bob", "score": 87}]

dumps() auto-detects Polars DataFrame and PyArrow Table/RecordBatch inputs, routing them to the Arrow path. Polars DataFrames are converted to Arrow first (zero-copy in Polars).

Deserialize to Arrow

table = ttoon.read_arrow(text)  # returns pyarrow.Table

From the returned pyarrow.Table, you can convert to any downstream format:

df = pl.from_arrow(table)      # Polars DataFrame
pandas_df = table.to_pandas() # Pandas DataFrame

Delimiter Options

text = ttoon.dumps(df, delimiter="|")
# [2]{name,score}:
# "Alice"| 95
# "Bob"| 87

text = ttoon.dumps(df, delimiter="\t")

JavaScript: Apache Arrow

Requires the optional peer dependency apache-arrow.

Serialize

import { stringifyArrow, stringifyArrowTjson } from '@ttoon/shared';
import { tableFromArrays } from 'apache-arrow';

const table = tableFromArrays({
  name: ['Alice', 'Bob'],
  score: [95, 87],
});

// Arrow → T-TOON tabular
const ttoonText = await stringifyArrow(table);

// Arrow → T-JSON
const tjsonText = await stringifyArrowTjson(table);

Deserialize to Arrow

import { readArrow } from '@ttoon/shared';

const table = await readArrow(text);

Arrow APIs in JS are async because they dynamically import the apache-arrow module.

Rust

use ttoon_core::{read_arrow, arrow_to_ttoon, arrow_to_tjson};

let table = read_arrow(text)?;
let ttoon = arrow_to_ttoon(&table, None)?;
let tjson = arrow_to_tjson(&table, None)?;

Arrow Input Requirements

read_arrow() across all languages enforces these constraints:

  • Root must be a list: the Arrow bridge only handles tabular data
  • Each element must be an object: object keys become schema fields
  • Field types must be consistent: you cannot mix different scalar types in the same column
  • No structural fields: list/object values are not arrowable

Arrow Schema Mapping

  • int → Int64
  • float → Float64
  • decimal → Decimal128 or Decimal256 (by precision)
  • string → Utf8
  • bool → Boolean
  • date → Date32
  • time → Time64(Microsecond)
  • datetime → Timestamp(Microsecond[, tz])
  • uuid → FixedSizeBinary(16) + UUID metadata
  • hex/b64 → Binary
  • null → nullable column; all-null infers as Null

Arrow types are preserved at their native resolution — decimal is not downgraded to string, uuid uses FixedSizeBinary(16) with metadata.

Performance Notes

T-JSON Direct Path

The Rust core includes a two-pass direct path for T-JSON → Arrow (read_arrow_tjson_direct) that skips the Token/Node intermediate layer. This significantly reduces memory usage for large datasets and benefits all SDKs through the shared core.

Sparse Schema Support

T-JSON read_arrow() supports sparse rows — missing keys are treated as null. Schema field order is inferred from the first occurrence order within the batch.

T-TOON tabular uses the header field order and width as-is.

Datetime Timezone Consistency

The JS Arrow bridge does not allow mixing timezone-aware and naive datetimes within the same column. Mixing them causes a schema inference error.

Next Steps

  • Streaming Guide — Row-by-row Arrow streaming with ArrowStreamReader / ArrowStreamWriter
  • Type Mapping — Complete cross-language type table
  • Stream API — Streaming APIs and schema definitions