Update 2018-10-19: The specific instructions in this post for building the Parquet and Arrow libraries are out of date as of the most recent major release of Arrow; see the Arrow homepage for current instructions. The rest of the post is still correct and useful. In essence, you now build everything from the Arrow project: you can optionally build the Parquet libraries along with Arrow, or build only the Arrow library.
For a number of reasons you may wish to read and write Parquet format data files from C++ code rather than using pre-built readers and writers found in Apache Spark, Drill, or other big data execution frameworks.
For one, you can achieve more efficient use of limited resources, and you may want to create a library for an existing C++ application or another language. If nothing else, it’s very convenient to be able to create stand-alone utilities that read and write the Parquet format, or to give your existing applications those capabilities.
In my case the unusual data format of IPUMS micro-data necessitated a custom conversion tool of some sort. Also, I knew I needed to convert large batches of data quickly, so C++ seemed the best approach for conserving memory and getting fast execution. With an understanding of columnar data and after studying the “parquet-cpp” API I knew this would be possible.
What follows are my notes on using and building the parquet-cpp C++ Parquet library. I’ll also show parts of two utilities: make-parquet and tabulate-parquet. The former creates Parquet formatted files out of CSV or fixed-width formatted data and the latter reads and tabulates data from Parquet files.
The Arrow and Parquet API
The parquet-cpp library has a low-level API, which is what I used to build “tabulate-pq” and “make-parquet”. There’s a higher level API that could be used to write a tool similar to “tabulate-pq”, and it includes support for the Arrow in-memory data storage library.
In the rest of this post I will show how to use both C++ APIs with some example code.
Among other things, Arrow makes it easier to deal with mixed column types dynamically in C++, and it allows passing data around without excessive copying, saving memory and time. Arrow defines a type of in-memory table built from an Arrow schema and columns of data.
The Arrow and Parquet low-level and high-level APIs are defined in the http://github.com/apache/parquet-cpp library. For documentation of the API see the /tools/ and /examples directories. There are two programs in the examples, one demonstrating use of Arrow and Parquet together and a similar example program implemented with the low-level Parquet API.
Low-level interface to Parquet
When I began writing C++ tools to handle Parquet formatted data the low-level API was the only interface to the library, so that’s what I used to make “make-parquet.”
The parquet-cpp library gives you these types:
- Parquet types to group together into a schema
- ColumnWriter::WriteBatch() to actually move data in and out of Parquet files; compression and buffering get handled by the library.
Detailed working code examples and build instructions follow.
Once you’ve extracted data from a data source, say a CSV or hierarchical fixed-width text file, the core of the “make-parquet” program looks like:
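A condensed sketch of that write path, with a made-up column name, values, and output file name; exact signatures vary between parquet-cpp releases:

```cpp
#include <arrow/io/file.h>
#include <parquet/api/writer.h>

#include <cstdint>
#include <memory>
#include <vector>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Group primitive column nodes into a Parquet schema.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("age", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT32,
                                       parquet::LogicalType::INT_32));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  // Open the output file and a file writer with compression enabled.
  std::shared_ptr<arrow::io::FileOutputStream> out;
  PARQUET_THROW_NOT_OK(
      arrow::io::FileOutputStream::Open("example.parquet", &out));
  parquet::WriterProperties::Builder props;
  props.compression(parquet::Compression::SNAPPY);
  auto writer = parquet::ParquetFileWriter::Open(out, schema, props.build());

  // WriteBatch() moves the data; buffering and compression are handled
  // by the library. Older releases want AppendRowGroup(num_rows).
  std::vector<int32_t> ages = {25, 31, 47};
  parquet::RowGroupWriter* rg = writer->AppendRowGroup();
  auto* col = static_cast<parquet::Int32Writer*>(rg->NextColumn());
  col->WriteBatch(ages.size(), nullptr, nullptr, ages.data());

  writer->Close();
  return 0;
}
```

A real converter would loop over all columns in the schema, calling NextColumn() and WriteBatch() for each, and append a new row group every N rows.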
First, a quick overview of important types from the Arrow library; some example code will follow.
Arrow provides types that make storing columnar data in memory, and moving it to and from Parquet format, more convenient and fast.
Arrow features data structures called Array that hold columns of same-type data, filled by a “builder” of a given type:
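For instance, a sketch of filling an int32 column; error handling is abbreviated, and each Append() really returns an arrow::Status you should check:

```cpp
// Sketch: building an Arrow array with a typed builder. The "age"
// values are made-up examples.
#include <arrow/api.h>

#include <memory>

std::shared_ptr<arrow::Array> MakeAgeArray() {
  arrow::Int32Builder builder;  // one builder type per column type
  builder.Append(25);
  builder.Append(31);
  builder.AppendNull();         // builders track nulls for you
  std::shared_ptr<arrow::Array> array;
  builder.Finish(&array);       // hands ownership of the data to `array`
  return array;
}
```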
You make Arrow “tables” by combining an Arrow schema object with Arrow data:
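A sketch, assuming the arrays were produced by builders as above; recent Arrow versions provide arrow::Table::Make, while older ones build a vector of arrow::Column objects by hand:

```cpp
// Sketch: combining a schema with column arrays into an arrow::Table.
// The field names and types are illustrative assumptions.
#include <arrow/api.h>

#include <memory>

std::shared_ptr<arrow::Table> MakeTable(
    std::shared_ptr<arrow::Array> age_array,
    std::shared_ptr<arrow::Array> name_array) {
  auto schema = arrow::schema({arrow::field("age", arrow::int32()),
                               arrow::field("name", arrow::utf8())});
  return arrow::Table::Make(schema, {age_array, name_array});
}
```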
You can save Arrow tables into Parquet files (see the example programs with parquet-cpp.)
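The high-level call is parquet::arrow::WriteTable; a sketch, with an illustrative file name and row group size:

```cpp
// Sketch: writing an arrow::Table to a Parquet file with the
// high-level API. File name and row group size are made-up examples.
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

#include <memory>

void SaveTable(const arrow::Table& table) {
  std::shared_ptr<arrow::io::FileOutputStream> out;
  PARQUET_THROW_NOT_OK(
      arrow::io::FileOutputStream::Open("table.parquet", &out));
  // The last argument is the chunk size: rows per Parquet row group.
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      table, arrow::default_memory_pool(), out, 10000));
}
```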
To read selected columns into Arrow Arrays, an ability crucial for a tool like “tabulate-pq”, you can use the Arrow wrapping of the Parquet API:
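A sketch; the column indices passed to ReadTable() are arbitrary examples:

```cpp
// Sketch: reading only selected columns of a Parquet file into Arrow
// memory, skipping I/O and decoding for everything else.
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

#include <memory>
#include <string>

std::shared_ptr<arrow::Table> ReadColumns(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  // Read only columns 0 and 2 (example indices) into an Arrow table.
  PARQUET_THROW_NOT_OK(reader->ReadTable({0, 2}, &table));
  return table;
}
```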
The concept of row groups is important; if you’re memory constrained you may need to read in one row group’s worth of a column at a time (these pieces are known as column chunks). This way you can read in part of a column, perform some reduce operation on the data, and dispose of the memory before moving on to the next row group.
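A sketch of that pattern; depending on the library version, Read() hands back an arrow::Array or an arrow::ChunkedArray:

```cpp
// Sketch: scanning one column, one row group (column chunk) at a
// time, so only a single chunk is resident in memory at once.
#include <parquet/arrow/reader.h>

#include <memory>

void ScanColumn(parquet::arrow::FileReader* reader, int column) {
  for (int g = 0; g < reader->num_row_groups(); ++g) {
    std::shared_ptr<arrow::Array> chunk;
    // The piece of `column` stored inside row group g.
    PARQUET_THROW_NOT_OK(reader->RowGroup(g)->Column(column)->Read(&chunk));
    // ... reduce over `chunk` here ...
    // `chunk` goes out of scope, releasing its memory before the next group.
  }
}
```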
Use the arrow::Table class to read in columns in one call (and optionally in parallel threads).
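A sketch; older releases expose set_num_threads(), newer ones set_use_threads():

```cpp
// Sketch: reading a whole Parquet file into an arrow::Table in one
// call, decoding columns on multiple threads.
#include <parquet/arrow/reader.h>

#include <memory>

std::shared_ptr<arrow::Table> ReadAll(parquet::arrow::FileReader* reader) {
  reader->set_num_threads(4);  // newer versions: set_use_threads(true)
  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  return table;
}
```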
The core of a very simple memory efficient tabulator:
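A condensed sketch of such a tabulator, assuming the column of interest holds int32 values; it streams one row group at a time as described above:

```cpp
// Sketch of a memory-efficient tabulator: count value frequencies in
// one column without ever holding the whole column in memory.
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <unordered_map>

int main(int argc, char** argv) {
  if (argc < 3) return 1;  // usage: tabulate-pq FILE COLUMN-INDEX
  const int column = std::atoi(argv[2]);

  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(argv[1], &infile));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::unordered_map<int32_t, int64_t> counts;
  for (int g = 0; g < reader->num_row_groups(); ++g) {
    std::shared_ptr<arrow::Array> chunk;
    PARQUET_THROW_NOT_OK(reader->RowGroup(g)->Column(column)->Read(&chunk));
    auto values = std::static_pointer_cast<arrow::Int32Array>(chunk);
    for (int64_t i = 0; i < values->length(); ++i)
      if (!values->IsNull(i)) ++counts[values->Value(i)];
    // Only one row group's chunk is resident at a time.
  }
  for (const auto& kv : counts)
    std::cout << kv.first << "\t" << kv.second << "\n";
  return 0;
}
```

A general-purpose tool would switch on the column's type rather than assuming int32.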
Building and Using the Parquet and Arrow Libraries
The Parquet libraries use CMake to build and in general follow standard practice. I’ll just briefly run through the easy path to building the Parquet libraries. Then I give a bit of a C++ build and link refresher for those who aren’t sure what to do next once they have built the libraries.
First you need to have access to some required libraries. Download the source:
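Presumably via git, from the Apache mirrors:

```shell
git clone https://github.com/apache/parquet-cpp.git
git clone https://github.com/apache/arrow.git
```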
Builds require CMake 3.2 or later; Ubuntu 14.04 ships with 2.8, so there you will need to build CMake from source. Additionally, a version of curl supporting HTTPS is needed, which you may also need to build from source on older Linux distributions before installing CMake. Finally, gcc 4.8 or later is required, which ships with Ubuntu 14.04 and higher.
I mention Ubuntu only because it’s what I use and is probably most common; builds should work on nearly any distribution. You can even build on macOS without difficulty with the assistance of Homebrew. My development has mostly taken place on an Ubuntu 16.04 distribution running on WSL (Windows 10).
Next install some Boost libraries and standard UNIX build tools if you don’t have them. On Ubuntu assuming you have the rights:
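Something along these lines; the exact Boost package list follows the parquet-cpp README of the time, so check it for your release:

```shell
sudo apt-get install build-essential
sudo apt-get install libboost-dev libboost-filesystem-dev \
    libboost-program-options-dev libboost-regex-dev \
    libboost-system-dev libboost-thread-dev
```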
On macOS you can simply install all of Boost, after ensuring you have Xcode 6 or later.
Build Parquet Locally
You simply create the makefile with CMake:
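From the top of the parquet-cpp source tree, something like:

```shell
cd parquet-cpp
cmake .
```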
Then make everything; libraries and example programs will get placed in parquet-cpp-dir/build/latest.
Of course, deviating from the happy path is where things get tricky; see the README and CMakeLists.txt for some help. This build will give you access to the binary libraries and let you play with the example code in parquet-cpp-dir/examples.
Among other things, the -DPARQUET_LINKAGE_TYPE=static build type has never worked for me even though the notes in CMakeLists.txt indicate it should.
In general, to statically link – which is not the default – you will need to build Boost from source with -fPIC on; the version you get from ‘apt-get’ on Ubuntu is not compiled with this flag and will not work with static builds.
Build Arrow and Parquet Separately and Install
To make life easier down the road you may want to install the libraries in the standard locations under /usr/local/…, or in some other location on your system to which you can point the PARQUET_HOME and ARROW_HOME environment variables. The parquet-cpp build scripts for the example programs will pick up on these and use them, and they will simplify your own builds that use the libraries.
Build Arrow on its own:
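Roughly, with ARROW_HOME pointing at your chosen install prefix (use `sudo make install` for system locations):

```shell
cd arrow/cpp
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .
make
make install
```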
Then build Parquet, having set the ARROW_HOME environment variable, so that build uses this version of Arrow:
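Roughly:

```shell
cd parquet-cpp
export ARROW_HOME=/usr/local        # wherever you installed Arrow
cmake -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME .
make && make install
```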
Using the Libraries with CMake Builds
Check the parquet-cpp-dir/examples/parquet-arrow/ directory for a sample CMake project that incorporates Arrow and Parquet libraries in a C++ application. To build the example:
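From inside that directory, roughly:

```shell
cd examples/parquet-arrow
cmake .
make
```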
If you have ARROW_HOME and PARQUET_HOME defined and pointing to compatible libraries, the build will go smoothly. If you don’t define the locations of installed libraries, you have to build Parquet locally first; the build will then use the local versions of both (parquet-cpp pulls down and builds a local copy of Arrow).
Remember that if you distribute your binary, and the libraries won’t be installed in standard locations on target systems, the Parquet and Arrow libraries will need to sit in the same locations as your PARQUET_HOME and ARROW_HOME paths.
Simple Builds with Make
In the easy case, where you installed the libraries somewhere on the standard search path, the build is a one-liner:
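For example, assuming a single source file and the program name used in this post:

```shell
g++ -std=c++11 -o tabulate-pq tabulate-pq.cc -lparquet -larrow
```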
If you’ve built libraries somewhere else, perhaps to test out the latest version available without disrupting the installed versions, you would do something like:
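With PARQUET_HOME and ARROW_HOME pointing at the alternate builds, something like:

```shell
g++ -std=c++11 -o tabulate-pq tabulate-pq.cc \
    -I$PARQUET_HOME/include -I$ARROW_HOME/include \
    -L$PARQUET_HOME/lib -L$ARROW_HOME/lib \
    -lparquet -larrow
```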
To distribute a non-statically linked binary, you could ship the libraries alongside your own binary and set LD_RUN_PATH (or RPATH on the gcc command line) to point at the relative path of the deployed libraries. In this example, files from $PARQUET_HOME/lib and $ARROW_HOME/lib are expected to be findable at a path relative to the binary ($ORIGIN/lib).
Set the run path then build:
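For example:

```shell
# Single quotes matter: $ORIGIN must reach the linker literally,
# to be expanded at run time by the dynamic loader.
export LD_RUN_PATH='$ORIGIN/lib'
make
```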
and the Makefile would look like:
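A hypothetical Makefile sketch, reusing the PARQUET_HOME and ARROW_HOME variables from above; with LD_RUN_PATH exported before the build, the linker records the run path, so no explicit -Wl,-rpath flag is needed:

```makefile
CXX      = g++
CXXFLAGS = -std=c++11 -O2 -I$(PARQUET_HOME)/include -I$(ARROW_HOME)/include
LDLIBS   = -L$(PARQUET_HOME)/lib -L$(ARROW_HOME)/lib -lparquet -larrow

tabulate-pq: tabulate-pq.cc
	$(CXX) $(CXXFLAGS) -o $@ $< $(LDLIBS)
```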
Before running, ensure wherever you place the binary you have a ./lib/ directory locally with all the needed libraries which do not otherwise reside on the standard library search path.
Finally don’t forget you can use ‘ldd’ to verify what libraries your own library or binary links to.