Saving custom metadata – “schema metadata” or “file metadata” – to Parquet can be really useful. You can put the version of an application’s data format, release notes, or many other things right into the data files. The documentation is somewhat lacking on how to accomplish this with PyArrow – but you totally can. Last time I reviewed the docs for Polars and DuckDB, they didn’t allow adding your own metadata to Parquet output at all.

In an earlier post I showed how to save custom file-level metadata to Parquet using the Apache Arrow Parquet library in C++, and read it back in C++, Python, or Rust. Here I’m demonstrating how to save custom file-level metadata in Python with PyArrow. This assumes you have data in memory in Arrow form and are using PyArrow.

Arrow Data

In the current Python Polars / DuckDB / PyArrow / Pandas landscape, you often exchange data between tools as Arrow tables. Arrow is an efficient in-memory columnar data format that organizes data in “tables” composed of columns, or “series.” Anyway, you probably already know this, or else this post won’t be much fun for you.
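If you haven’t worked with it directly, here’s a minimal sketch of an Arrow table built with PyArrow (the column names and values are made up for illustration):

import pyarrow as pa

# A tiny in-memory Arrow table: two named columns ("series") stored column-wise.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(table.schema)    # column names and types
print(table.num_rows)  # 3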

In my case, yesterday, a Python utility of mine had a large result set from a database query returned as Arrow data. So large, in fact, that I had to return it as Arrow RecordBatches. When I first built this utility I simply returned the data as a single Arrow table and saved it directly, applying some custom metadata along the way:

import duckdb
import pyarrow.parquet as pq

conn = duckdb.connect()
new_data = (
	conn.execute(query)
	.arrow()  # return the result set as an Arrow table
	.replace_schema_metadata({"version_data_format": "2"})
)
pq.write_table(new_data, output_parquet_file)
conn.close()

When you search for how to add metadata to a Parquet file, the replace_schema_metadata() method on Arrow tables is what you come up with. Most examples focus on the simplest cases, and unfortunately a lot of simple cases don’t scale up to the interesting larger ones.
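For reference, the simplest case – a single table that fits comfortably in memory – looks something like this (a minimal sketch with made-up data and file name):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# replace_schema_metadata() returns a new table with the metadata attached;
# the original table is left unchanged.
table = table.replace_schema_metadata({"version_data_format": "2"})
pq.write_table(table, "simple.parquet")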

I used this approach in my last post about converting complex Parquet data to simpler flat-structured Parquet. It worked, but it turned out to take a huge amount of memory on a few big inputs – I only succeeded because I had a machine with 1 TB of RAM – and it took quite a while to save those huge tables.

To make the next run faster, I restructured the code to operate on batches of records; you can get DuckDB to return Arrow RecordBatch objects instead of one big table. This ran quite fast and took much less memory.

In the process I introduced a bug: the custom schema metadata no longer got saved. The call to replace_schema_metadata() that added the metadata didn’t cause a runtime error, though – it simply had no effect. So if you don’t use replace_schema_metadata(), what do you use? It’s not easy to work out.

Documentation on this point is sparse, hence this post. Here’s the new working code:

# This queries a DuckDB connection, but it could be anything that yields Arrow
# record batches, such as a Polars data frame or PyArrow itself.
# fetch_record_batch() returns a RecordBatchReader that yields batches of up to 50,000 rows.
reader = conn.execute(query).fetch_record_batch(50000)

# Use with_metadata() to add custom metadata to the schema the reader got as a
# result of the DuckDB query. Any Arrow schema can be extended like this.
new_schema = reader.schema.with_metadata({"version_data_format": "2"})
writer = pq.ParquetWriter(
	output_parquet_file,
	new_schema
)

# Keep reading one RecordBatch at a time until the iterator ends.
while True:
	try:
		# This is what I had, but it failed silently; the program simply didn't
		# attach the new metadata to the output.
		#batch = reader.read_next_batch().replace_schema_metadata(
		#	{"version_data_format": "2"}
		#)
		#writer.write_batch(batch)

		# The correct solution when you have batches instead of a single table is
		# to update the schema you pass to the pyarrow.parquet.ParquetWriter()
		# initializer. Then you don't do anything to each batch; the schema is
		# already correct.
		writer.write_batch(reader.read_next_batch())
	except StopIteration:
		print("Completed save.")
		break
writer.close()

This approach runs many times faster on very large data (> 50 GB uncompressed) and requires orders of magnitude less memory too. It required adding the new schema with the extra metadata up-front to the incremental pyarrow.parquet.ParquetWriter class instead of to the data tables, as in the simpler approach. That makes complete sense, since the ParquetWriter() initializer takes a schema. Here’s the relevant documentation: with_metadata(). You might expect to get a runtime error if you change the schema after creating the writer, but no such luck.
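If you want to confirm the metadata actually landed in the output file, a quick check is to read just the schema back (a small sketch, assuming pyarrow.parquet is imported as pq and output_parquet_file is the file written above):

import pyarrow.parquet as pq

# Reads only the file metadata/schema, not the data itself.
schema = pq.read_schema(output_parquet_file)
# Keys and values come back as bytes, e.g. {b"version_data_format": b"2", ...}
print(schema.metadata)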

Closing Thoughts

I hope this post saved you some frustration and fruitless searching. From what I have picked up it seems like the various “Big Data, single machine” tools like Polars, DuckDB, PyArrow etc. are developing quickly and don’t have all their minor features well developed or documented. If we check back in three years I suspect many of these pain points will have gotten smoothed over nicely.

One trend I’ve noticed is that Python bindings and libraries get a lot of attention and make it all look easy (but some stuff turns out to be tricky nevertheless!). Other languages haven’t received the same attention. Polars in Rust vs. Python is night and day in documentation and conveniences, for instance. With so many users gravitating to Python, it makes sense.