Apache PredictionIO makes for easier machine learning with Spark

27 October 2017

The Apache Software Foundation has added a new machine learning project to its roster: Apache PredictionIO, an open source version of a project originally devised by a subsidiary of Salesforce.

Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using customisable templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.
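As a rough illustration of that send-then-query flow, here is a minimal Python sketch that builds the two HTTP requests an app would make. It assumes a default local install (event server on port 7070, deployed engine on port 8000) and the `/events.json` and `/queries.json` endpoints. The query shape (`user`, `num`) is the one a recommendation-style template would expect and will differ for other templates. The helper names are made up for this example, and nothing is sent over the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed defaults for a local install; adjust for a real deployment.
EVENT_SERVER = "http://localhost:7070"
ENGINE_SERVER = "http://localhost:8000"


def build_event_request(access_key, user_id, item_id, event="rate", rating=None):
    """Build (but do not send) a POST that records one user-to-item event."""
    payload = {
        "event": event,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
    }
    if rating is not None:
        payload["properties"] = {"rating": rating}
    query_string = urlencode({"accessKey": access_key})
    return Request(
        f"{EVENT_SERVER}/events.json?{query_string}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_request(user_id, num=4):
    """Build (but do not send) a POST asking the engine for `num` predictions.

    The {"user", "num"} shape is an assumption borrowed from
    recommendation-style templates; other templates define other queries.
    """
    payload = {"user": user_id, "num": num}
    return Request(
        f"{ENGINE_SERVER}/queries.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With both servers running, either request could be dispatched with `urllib.request.urlopen(request)`. Training happens between the two steps, on the PredictionIO side rather than in the app.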

Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and Apache offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC, Elasticsearch, HBase, HDFS, and local file systems are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.

PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They describe any third-party dependencies that might be needed for the job, such as the Apache Mahout machine-learning app framework.

Some existing templates include:

A universal recommendation engine
Text classification
Survival analysis (for time-between-failure predictions)
Labelling topics using Wikipedia as a knowledge base
Similarity analysis

Some templates also integrate with other machine learning products. For example, two of the prediction templates currently in PredictionIO’s gallery, for churn rate detection and general recommendations, use H2O.ai’s Sparkling Water enhancements for Spark.

PredictionIO can also automatically evaluate a prediction engine to determine the best hyperparameters to use with it. The developer still needs to choose the metrics used to score each candidate, but that generally takes less work than tuning hyperparameters by hand.
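The general idea behind this kind of evaluation can be sketched in a few lines of Python: try every combination of candidate hyperparameters, score each resulting model with the chosen metric, and keep the winner. This is a generic grid search written for illustration, not PredictionIO's own evaluation API. The function names and the toy metric are invented for the example.

```python
from itertools import product


def grid_search(train, score, grid):
    """Return (best_params, best_score) over every combination in `grid`.

    train: callable taking hyperparameters as keyword arguments, returning a model
    score: callable taking a model and returning a number (higher is better)
    grid:  dict mapping each hyperparameter name to a list of candidate values
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(train(**params))
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy stand-ins: the "model" is just its parameters, and the metric
# peaks at rank=10, reg=0.1. A real metric would use held-out data.
def toy_train(rank, reg):
    return {"rank": rank, "reg": reg}


def toy_score(model):
    return -((model["rank"] - 10) ** 2) - (model["reg"] - 0.1) ** 2


best_params, best_score = grid_search(
    toy_train, toy_score, {"rank": [5, 10, 20], "reg": [0.01, 0.1, 1.0]}
)
```

The developer's part of the job is the `score` function and the candidate values. The loop itself is the part a framework can automate.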

When running as a service, PredictionIO can accept prediction requests singly or as a batch. Batched predictions are automatically parallelised across a Spark cluster, as long as the algorithms used in a batch prediction job are all serialisable. (PredictionIO’s default algorithms are.)
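Serialisability matters because a model has to be converted to bytes before it can be shipped to worker nodes. The following Python sketch illustrates the idea by analogy only: it uses Python's `pickle` as a stand-in for the JVM serialisation that Spark relies on, and a simple round-robin split as a stand-in for how a batch might be divided among workers. Both helper names are hypothetical.

```python
import pickle


def is_serialisable(model):
    """Return True if `model` can be turned into bytes with pickle."""
    try:
        pickle.dumps(model)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def partition(queries, workers):
    """Split a batch of queries round-robin into `workers` chunks."""
    return [queries[i::workers] for i in range(workers)]


# A model made of plain data can be shipped to workers.
plain_model = {"weights": [0.1, 0.2, 0.7]}

# A model holding an anonymous function cannot be pickled,
# so it could not be distributed this way.
closure_model = {"scorer": lambda x: x * 2}
```

Here `is_serialisable(plain_model)` returns `True` and `is_serialisable(closure_model)` returns `False`. The second model would have to run on a single machine or be restructured before its batch jobs could be spread across a cluster.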

PredictionIO’s source code is available on GitHub. For convenience, various Docker images are available, as well as a Heroku build pack.

IDG News Service