PyArrow Interop
Marrow and PyArrow implement the same Apache Arrow columnar format. This page covers how they relate, where Marrow is faster, and how to exchange data between them at the Python and Mojo levels via the Arrow C Data Interface.
Same data model
Both libraries represent data in the same Arrow memory layout. An int64 array has the same in-memory representation in Marrow as in PyArrow.
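To make the shared layout concrete, here is a minimal pure-Python sketch of the two buffers that the Arrow format defines for a nullable int64 array: a validity bitmap (one bit per element, least-significant bit first, 1 meaning valid) and a buffer of 8-byte little-endian values. The helper name `arrow_int64_buffers` is hypothetical and belongs to neither library. The sketch leaves out the buffer padding that real implementations usually add, and it writes zero into null slots, whose contents the format leaves unspecified.

```python
import struct


def arrow_int64_buffers(values):
    """Return (validity_bitmap, data_buffer) for a list of ints and None."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            # Arrow numbers bits LSB-first within each bitmap byte.
            bitmap[i // 8] |= 1 << (i % 8)
        # Null slots still occupy 8 bytes; zero is an arbitrary filler here.
        data += struct.pack("<q", 0 if v is None else v)
    return bytes(bitmap), bytes(data)


validity, data = arrow_int64_buffers([1, 2, 3, None, 5])
print(f"{validity[0]:08b}")  # 00010111 -> elements 0, 1, 2 and 4 are valid
print(len(data))             # 40 bytes: five 8-byte slots
```

Any library that follows the Arrow format can read these two buffers, which is what allows data to be handed over without conversion.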
import marrow as ma
import pyarrow as pa
# Marrow
ma_arr = ma.array([1, 2, 3, None, 5])
print("marrow:", ma_arr)
print("type: ", ma_arr.type())
# PyArrow
pa_arr = pa.array([1, 2, 3, None, 5])
print("pyarrow:", pa_arr)
print("type: ", pa_arr.type)
Construction performance
When converting Python data to Arrow arrays, Marrow is faster than PyArrow for numeric and string types when an explicit type is provided. Results below are from pixi run bench_python on Apple M-series hardware (n=100,000 elements, mean time):
| Array type | marrow | PyArrow | speedup |
|---|---|---|---|
| int64 (explicit type) | 0.37 ms | 0.89 ms | 2.4x |
| int64 + nulls (explicit) | 0.36 ms | 0.87 ms | 2.4x |
| float64 (explicit) | 0.34 ms | 0.48 ms | 1.4x |
| float64 + nulls | 0.34 ms | 0.51 ms | 1.5x |
| string (explicit) | 0.72 ms | 1.06 ms | 1.5x |
| string + nulls | 0.70 ms | 1.04 ms | 1.5x |
| struct, primitive fields (explicit) | 5.41 ms | 6.54 ms | 1.2x |
| int64 (inferred) | 1.40 ms | 1.27 ms | 1.1x slower |
| string (inferred) | 1.57 ms | 1.02 ms | 1.5x slower |
| nested list (inferred) | 3.92 ms | 2.36 ms | 1.7x slower |
The pattern: provide an explicit type when you know it. Marrow’s builder path is highly optimized for known types. When the type must be inferred by scanning Python objects, PyArrow is currently faster in every inferred case measured above, with the widest gap on nested lists.
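The speedup column is the ratio of the two mean times. When Marrow is the slower side, the ratio is inverted and marked "slower". A small sketch of that arithmetic follows, with three rows copied from the table. The helper name `speedup_label` is hypothetical and not part of the benchmark suite.

```python
def speedup_label(marrow_ms, pyarrow_ms):
    """Format the ratio of mean times the way the table's last column does."""
    if marrow_ms <= pyarrow_ms:
        return f"{pyarrow_ms / marrow_ms:.1f}x"
    return f"{marrow_ms / pyarrow_ms:.1f}x slower"


rows = [
    ("int64 (explicit type)", 0.37, 0.89),
    ("float64 (explicit)", 0.34, 0.48),
    ("nested list (inferred)", 3.92, 2.36),
]
for name, marrow_ms, pyarrow_ms in rows:
    print(f"{name}: {speedup_label(marrow_ms, pyarrow_ms)}")
# int64 (explicit type): 2.4x
# float64 (explicit): 1.4x
# nested list (inferred): 1.7x slower
```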
Reproduce with:
pixi run bench_python
Zero-copy C Data Interface
Marrow arrays implement the Arrow C Data Interface protocol (__arrow_c_array__ / __arrow_c_schema__), so they exchange data with PyArrow without copying at both the Python and Mojo levels.
Python level
import marrow as ma
import pyarrow as pa
# PyArrow -> Marrow (via __arrow_c_array__ protocol)
pa_arr = pa.array([1, 2, 3, None, 5])
ma_arr = ma.array(pa_arr)
print(ma_arr)
# Marrow -> PyArrow (via __arrow_c_array__ protocol)
ma_arr = ma.array([1, 2, 3, None, 5])
pa_arr = pa.array(ma_arr)
print(pa_arr)
Mojo level
At the Mojo level the underlying CArrowSchema and CArrowArray structs are directly accessible for low-level interop.
from std.python import Python
from marrow.c_data import CArrowArray, CArrowSchema
var pa = Python.import_module("pyarrow")
var pyarr = pa.array([1, 2, 3, 4, 5], mask=[False, False, False, False, True])
# Import from PyArrow via the Arrow C Data Interface capsule protocol
var capsules = pyarr.__arrow_c_array__()
var dtype = CArrowSchema.from_pycapsule(capsules[0]).to_dtype() # int64
var data = CArrowArray.from_pycapsule(capsules[1])^.to_array(dtype)
var typed = data.as_int64()
print(typed.is_valid(0)) # True
print(typed.is_valid(4)) # False — null
print(typed.unsafe_get(0)) # 1
print(typed.unsafe_get(3)) # 4
# Export from Mojo back to PyArrow — __arrow_c_array__ protocol is supported
var pa_result = pa.array(data)
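For readers who want to see what the two C structs carry, the following is a rough `ctypes` mirror of `ArrowSchema` and `ArrowArray` as laid out in the Arrow C Data Interface specification, filled in by hand for the array `[1, 2, 3, None, 5]`. It is an illustrative sketch and does not show Marrow's `CArrowSchema` / `CArrowArray` API. The `release` callback is typed as a plain void pointer for brevity, no real producer is involved, and `decode_int64` is a hypothetical helper.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),  # simplification: really a function pointer
    ("private_data", ctypes.c_void_p),
]


class ArrowArray(ctypes.Structure):
    pass


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # simplification: really a function pointer
    ("private_data", ctypes.c_void_p),
]

# Buffers for [1, 2, 3, None, 5]: validity bitmap (LSB-first) and int64 values.
validity = (ctypes.c_uint8 * 1)(0b00010111)
values = (ctypes.c_int64 * 5)(1, 2, 3, 0, 5)
buffers = (ctypes.c_void_p * 2)(ctypes.addressof(validity), ctypes.addressof(values))

# Format string "l" means int64; flag value 2 is ARROW_FLAG_NULLABLE.
schema = ArrowSchema(format=b"l", name=b"", flags=2, n_children=0)
array = ArrowArray(
    length=5, null_count=1, offset=0, n_buffers=2, n_children=0,
    buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
)


def decode_int64(arr):
    """Read an int64 ArrowArray back into a Python list, None for nulls."""
    n = arr.offset + arr.length
    vals = (ctypes.c_int64 * n).from_address(arr.buffers[1])
    bitmap_addr = arr.buffers[0]  # may be NULL when there are no nulls
    if bitmap_addr is None:
        return [vals[i] for i in range(arr.offset, n)]
    bitmap = (ctypes.c_uint8 * ((n + 7) // 8)).from_address(bitmap_addr)
    return [
        vals[i] if (bitmap[i // 8] >> (i % 8)) & 1 else None
        for i in range(arr.offset, n)
    ]


print(decode_int64(array))  # [1, 2, 3, None, 5]
```

A consumer receives only these two structs, which hold pointers to the producer's buffers. No element data is copied, and the consumer calls the `release` callback when it has finished with them.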