PyArrow Interop

Marrow and PyArrow implement the same Apache Arrow columnar format. This page covers how the two libraries relate, how their array-construction performance compares, and how to exchange data between them at the Python and Mojo levels via the Arrow C Data Interface.

Same data model

Both libraries implement the same Arrow memory layout. An int64 array is stored the same way in Marrow and in PyArrow: a validity bitmap plus a contiguous buffer of 64-bit values.

import marrow as ma
import pyarrow as pa

# Marrow
ma_arr = ma.array([1, 2, 3, None, 5])
print("marrow:", ma_arr)
print("type:  ", ma_arr.type())

# PyArrow
pa_arr = pa.array([1, 2, 3, None, 5])
print("pyarrow:", pa_arr)
print("type:   ", pa_arr.type)
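
To make the shared layout concrete, the sketch below builds the two buffers that the Arrow columnar format prescribes for a nullable int64 array, using only the Python standard library. It is an illustration of the format, not code from either library: arrow_int64_buffers is a hypothetical helper, buffer padding is omitted, and null slots are filled with zero even though the format leaves their contents unspecified.

```python
import struct


def arrow_int64_buffers(values):
    """Build (validity_bitmap, data_buffer, null_count) for a nullable int64 array."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per element
    data = bytearray()
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1
            data += struct.pack("<q", 0)  # placeholder; null slot contents are unspecified
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant-bit numbering
            data += struct.pack("<q", value)  # little-endian signed 64-bit
    return bytes(bitmap), bytes(data), null_count


validity, data, nulls = arrow_int64_buffers([1, 2, 3, None, 5])
print(f"{validity[0]:08b}")  # 00010111 (bit 3 is clear: element 3 is null)
print(len(data), nulls)      # 40 1
```

Because both libraries follow this layout, a consumer can read the other library's buffers in place instead of converting them.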

Construction performance

When converting Python data to Arrow arrays, Marrow is faster than PyArrow for numeric and string types when an explicit type is provided. Results below are from pixi run bench_python on Apple M-series hardware (n=100,000 elements, mean time):

Array type                            Marrow    PyArrow   Speedup
int64 (explicit type)                 0.37 ms   0.89 ms   2.4x
int64 + nulls (explicit)              0.36 ms   0.87 ms   2.4x
float64 (explicit)                    0.34 ms   0.48 ms   1.4x
float64 + nulls                       0.34 ms   0.51 ms   1.5x
string (explicit)                     0.72 ms   1.06 ms   1.5x
string + nulls                        0.70 ms   1.04 ms   1.5x
struct, primitive fields (explicit)   5.41 ms   6.54 ms   1.2x
int64 (inferred)                      1.40 ms   1.27 ms   1.1x slower
string (inferred)                     1.57 ms   1.02 ms   1.5x slower
nested list (inferred)                3.92 ms   2.36 ms   1.7x slower

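The pattern: provide an explicit type when you know it. Marrow’s builder path is fastest when the type is known up front, with speedups of 1.2x to 2.4x in the explicit-type rows above. When the type must be inferred from a scan of Python objects, PyArrow is currently faster in every case measured, and the gap widens from 1.1x for int64 to 1.7x for nested lists.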

Reproduce with:

pixi run bench_python
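
To check the explicit-versus-inferred gap on your own data without the full benchmark suite, a small timing harness is enough. The sketch below is an illustration, not the code behind pixi run bench_python: mean_ms and compare are hypothetical helpers, and only PyArrow calls are timed because this page does not show how Marrow accepts an explicit type. Adding a Marrow entry to the dictionary would give a side-by-side comparison.

```python
import timeit


def mean_ms(fn, repeats=20):
    """Return the mean wall-clock time of fn() in milliseconds."""
    total_seconds = timeit.timeit(fn, number=repeats)
    return total_seconds / repeats * 1000.0


def compare(builders, repeats=20):
    """Time each named zero-argument callable and return {name: mean ms}."""
    return {name: mean_ms(fn, repeats) for name, fn in builders.items()}


try:
    import pyarrow as pa
except ImportError:  # keep the harness usable without PyArrow installed
    pa = None

if pa is not None:
    data = list(range(100_000))  # same n as the benchmark results above
    results = compare({
        "int64 (explicit)": lambda: pa.array(data, type=pa.int64()),
        "int64 (inferred)": lambda: pa.array(data),
    })
    for name, ms in results.items():
        print(f"{name}: {ms:.2f} ms")
```

Absolute numbers will differ from the table depending on hardware and library versions; the ratio between the explicit and inferred rows is the figure worth comparing.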

Zero-copy C Data Interface

Marrow arrays implement the Arrow C Data Interface protocol (__arrow_c_array__ / __arrow_c_schema__), so they can exchange data with PyArrow without copying, at both the Python and Mojo levels.

Python level

import marrow as ma
import pyarrow as pa

# PyArrow -> Marrow (via __arrow_c_array__ protocol)
pa_arr = pa.array([1, 2, 3, None, 5])
ma_arr = ma.array(pa_arr)
print(ma_arr)

# Marrow -> PyArrow (via __arrow_c_array__ protocol)
ma_arr = ma.array([1, 2, 3, None, 5])
pa_arr = pa.array(ma_arr)
print(pa_arr)

Mojo level

At the Mojo level the underlying CArrowSchema and CArrowArray structs are directly accessible for low-level interop.

from std.python import Python
from marrow.c_data import CArrowArray, CArrowSchema

var pa = Python.import_module("pyarrow")
var pyarr = pa.array([1, 2, 3, 4, 5], mask=[False, False, False, False, True])

# Import from PyArrow via the Arrow C Data Interface capsule protocol
var capsules = pyarr.__arrow_c_array__()
var dtype = CArrowSchema.from_pycapsule(capsules[0]).to_dtype()  # int64
var data  = CArrowArray.from_pycapsule(capsules[1])^.to_array(dtype)
var typed = data.as_int64()

print(typed.is_valid(0))    # True
print(typed.is_valid(4))    # False — null
print(typed.unsafe_get(0))  # 1
print(typed.unsafe_get(3))  # 4

# Export from Mojo back to PyArrow — __arrow_c_array__ protocol is supported
var pa_result = pa.array(data)
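
The capsules returned by __arrow_c_array__ carry pointers to the two C structs that the Arrow C Data Interface specification defines, ArrowSchema and ArrowArray. For readers who want to see the fields involved, the sketch below declares both layouts with Python's ctypes, following the specification. It is an illustration only: that Marrow's CArrowSchema and CArrowArray mirror these fields one-to-one is an assumption, not something this page states.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Type description: struct ArrowSchema from the C Data Interface spec."""


class ArrowArray(ctypes.Structure):
    """Data description: struct ArrowArray from the C Data Interface spec."""


# Fields are assigned after the classes exist because both structs are self-referential.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),        # type format string, e.g. b"l" for int64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),  # validity bitmap, then data
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]

ARROW_FLAG_NULLABLE = 2  # flag value from the specification

# Describe a nullable int64 column with five elements, one of them null.
# release is left NULL, so this only illustrates the layout: a real producer
# must set a release callback before handing the structs to a consumer.
schema = ArrowSchema(format=b"l", name=b"", flags=ARROW_FLAG_NULLABLE)
array = ArrowArray(length=5, null_count=1, offset=0, n_buffers=2)

print(schema.format, schema.flags)
print(array.length, array.null_count, array.n_buffers)
print(ctypes.sizeof(ArrowSchema), ctypes.sizeof(ArrowArray))
```

The consumer reads the buffers through these pointers and calls release when it is done, which is what lets the exchange avoid copying the data.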