Apache Spark 3.0 adds Nvidia GPU support for machine learning

Next major release of the in-memory data processing framework will support GPU-accelerated functions courtesy of Nvidia RAPIDS

15 May 2020

Apache Spark, the in-memory big data processing framework, will become fully GPU-accelerated in its soon-to-be-released 3.0 incarnation. Best of all, today’s Spark applications can take advantage of the GPU acceleration without modification; all existing Spark APIs work as-is.
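To illustrate what "without modification" could mean in practice, here is a minimal sketch in which the same application file is launched with and without GPU settings, so that only the launch configuration differs. The article does not name any configuration keys; the ones below are assumptions taken from Spark 3.0's resource scheduling options and Nvidia's RAPIDS Accelerator plugin. The helper `build_submit_command` and the file name `etl_job.py` are hypothetical, and a real cluster would also need the plugin jars and a GPU discovery script, which are omitted here.

```python
# Sketch only: builds spark-submit command lines; it does not launch Spark.
# The configuration keys are assumptions, not taken from the article.
GPU_CONF = {
    # Spark 3.0 plugin hook, pointing at the RAPIDS Accelerator SQL plugin
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    # Spark 3.0 GPU-aware scheduling: GPUs per executor and per task
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
}


def build_submit_command(app_path, conf=None):
    """Return a spark-submit argument list for an unchanged application.

    Hypothetical helper: GPU acceleration is switched on purely through
    --conf flags, so app_path is the same in the CPU and GPU cases.
    """
    argv = ["spark-submit"]
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_path)
    return argv


cpu_cmd = build_submit_command("etl_job.py")
gpu_cmd = build_submit_command("etl_job.py", GPU_CONF)

print(" ".join(cpu_cmd))
print(" ".join(gpu_cmd))
```

Both commands end with the same application path; only the `--conf` flags change, which is the sense in which existing code would be left as-is.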

The GPU acceleration components, provided by Nvidia, are designed to complement all phases of Spark applications including ETL operations, machine learning training, and inference serving.

Nvidia’s Spark contributions draw on the RAPIDS suite of GPU-accelerated data science libraries. Many of RAPIDS’ internal data structures, like dataframes, complement Spark’s own, but getting Spark to use RAPIDS natively has taken nearly four years of work.

Spark 3.0’s speedups do not come solely from GPU acceleration. The release also reaps performance gains by minimising data movement to and from GPUs. When data does need to be moved across a cluster, the Unified Communication X (UCX) framework shuttles it directly from one block of GPU memory to another with minimal overhead.

According to Nvidia, a preview release of Spark 3.0 running on the Databricks platform yielded a seven-fold performance improvement when using GPU acceleration, though details about the workload and its dataset were not available.

No firm date has been given for general availability of Spark 3.0. You can download preview releases from the Apache Spark project website.

IDG News Service