Apache Arrow aims to speed access to Big Data

18 February 2016

A new top-level project at the Apache Software Foundation seeks to provide a fast in-memory data layer for an array of open source projects, both under Hadoop’s umbrella and outside it.

The Apache Arrow project was developed chiefly by Dremio, a start-up founded by former employees of Hadoop company MapR who also led the development of Apache Drill. Arrow transforms data into a columnar, in-memory format – making it far faster to process on modern CPUs – and provides it to a variety of applications via a single, consistent interface.

Data, get in line
Columnar storage is used in Big Data applications to accelerate searching and sorting large amounts of data, but until now it has been up to individual Big Data framework components to implement it themselves. (The Apache Parquet project provides support for columnar storage in Hadoop.)

With Arrow, applications can access a columnar version of a dataset simply by asking Arrow for it. Data transformed by Arrow can theoretically be processed much faster, since the columnar layout lets applications exploit the Single Instruction, Multiple Data (SIMD) instruction sets of modern CPUs. Sets of data too big to fit in memory all at once are broken into batches, with the batches sized to fit the CPU’s various cache layers.
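As a rough sketch of what this looks like from application code, Arrow's Python binding (pyarrow) builds such columnar tables directly; the column names, values, and batch size below are invented purely for illustration:

    import pyarrow as pa

    # Each column becomes a contiguous, typed in-memory buffer (Arrow's columnar format)
    table = pa.table({
        "user_id": pa.array([101, 102, 103], type=pa.int64()),
        "score":   pa.array([0.72, 0.45, 0.98], type=pa.float64()),
    })

    # A large table can be split into record batches; the chunk size here is an
    # arbitrary example of keeping each batch small enough to be cache-friendly
    batches = table.to_batches(max_chunksize=64000)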

The big boon, its creators claim, is not just that Arrow makes any one Big Data project faster, but that multiple Arrow-compatible projects can use it as a common data interchange mechanism. Instead of serialising, moving, and de-serialising a dataset every time it passes between projects – with all the overhead and slowness that implies – applications that use Arrow can exchange data directly in Arrow’s format.

If two applications are on the same physical node, they can access Arrow data by way of shared memory. This speeds up data access, since the applications are no longer making redundant copies of the data.
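One common way to get that behaviour with the Python binding is to write a table in Arrow's IPC file format to a memory-mapped file, which a second process on the same machine can then map and read without copying the column buffers; the file name below is an assumption made for the example:

    import pyarrow as pa

    # Producer: write a table out in Arrow's IPC file format
    table = pa.table({"value": [1, 2, 3]})
    with pa.OSFile("shared.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Consumer (possibly another process on the same node): memory-map the file
    # and read the record batches without copying the underlying buffers
    with pa.memory_map("shared.arrow", "r") as source:
        shared_table = pa.ipc.open_file(source).read_all()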

Something for everyone
According to Julien Le Dem, architect for Arrow at Dremio and vice president of Apache Parquet, columnar optimisation has long been used in commercial products like Oracle’s databases and SAP HANA. But the open source big data space has, Le Dem claimed, done very little with this type of technology to date.

Many of the projects within the Hadoop ecosystem are already preparing some form of Arrow support. But Arrow aims not just to serve the Hadoop ecosystem; it wants to provide something that entire software ecosystems – the Python ecosystem, for instance – can connect with and leverage.

Arrow’s creators claim this is already well underway. Le Dem says the plan is not just to include specific projects (Spark, HBase, Cassandra, Pandas, Parquet, Drill, Impala, and Kudu, for openers), but to provide bindings for entire languages. C, C++, Python, and Java bindings are available now, and R, Julia, and JavaScript will follow. “This is not just about Hadoop,” Le Dem said. “[The participants] are engaged across a wide spectrum of different projects, so this is relevant to an extraordinarily large number of [them].”
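In the Python case, for example, the binding already interoperates with Pandas; a minimal sketch, using a made-up DataFrame, looks like this:

    import pandas as pd
    import pyarrow as pa

    # A small example DataFrame; the columns and values are invented for illustration
    df = pd.DataFrame({"city": ["Dublin", "Cork"], "count": [3, 7]})

    # Convert Pandas data into Arrow's columnar format and back again
    table = pa.Table.from_pandas(df)
    round_tripped = table.to_pandas()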

In that light, Arrow is an example of the way open source big data technologies are growing beyond Hadoop. They are starting with Hadoop, perhaps, but not necessarily staying confined to it.

That said, Arrow and Hadoop unquestionably have a future together; Le Dem noted that three of the leading commercial Hadoop distributions – Cloudera, Hortonworks, and MapR – have all committed to working on Arrow.

IDG News Service
