Bald Eagle

Apache Eagle keeps an eye on Big Data usage

27 January 2017

Apache Eagle, originally developed at eBay and then donated to the Apache Software Foundation, fills a big data security niche that remains thinly populated, if not bare: it sniffs out possible security and performance issues in big data frameworks.

To do so, Eagle uses other Apache open source components, such as Kafka, Spark, and Storm, to generate and analyse machine learning models from the behavioural data of large data clusters.

Looking in from the inside
Data for Eagle can come from activity logs for various data sources (HDFS, Hive, MapR FS, Cassandra) or from performance metrics harvested directly from frameworks like Spark. That data can then be piped by the Kafka streaming framework into a real-time detection system built with Apache Storm, or into a model-training system built on Apache Spark. The former generates alerts and reports based on existing policies; the latter creates machine learning models to drive new policies.
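
To make that pipeline concrete, here is a minimal sketch of the ingestion step, assuming the kafka-python client, a local broker, and an invented topic name and log path. Eagle's own collectors handle this in production, so this illustrates the flow rather than Eagle's code:

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Each line of the HDFS audit log records one file-system operation
    # (user, command, path). Pipe the lines into a Kafka topic that a
    # Storm topology (alerting) or Spark job (model training) consumes.
    with open("/var/log/hadoop/hdfs-audit.log") as log:
        for line in log:
            producer.send("hdfs_audit_events", line.encode("utf-8"))

    producer.flush()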

This emphasis on real-time behaviour tops the list of “key qualities” in the documentation for Eagle. It is followed by scalability, being metadata-driven (meaning changes to policies are deployed automatically when their metadata changes), and extensibility, which means the data sources, alerting systems, and policy engines used by Eagle are supplied by plugins and aren’t limited to what comes in the box.
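
As a rough illustration of the metadata-driven idea (not Eagle's actual mechanism; the file name and policy fields below are invented), a detector can watch its policy store and reload the rules whenever the metadata changes, with no redeployment step:

    import json
    import os
    import time

    POLICY_FILE = "policies.json"  # hypothetical policy store

    def load_policies(path):
        # e.g. [{"name": "mass-delete", "max_deletes_per_min": 100}]
        with open(path) as f:
            return json.load(f)

    policies = load_policies(POLICY_FILE)
    last_seen = os.path.getmtime(POLICY_FILE)

    while True:
        mtime = os.path.getmtime(POLICY_FILE)
        if mtime != last_seen:  # metadata changed: pick up the new rules
            policies = load_policies(POLICY_FILE)
            last_seen = mtime
        # ... evaluate incoming events against `policies` here ...
        time.sleep(5)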

Because Eagle was put together from existing parts of the Hadoop world, it has two theoretical advantages. One, there is less reinvention of the wheel, and two, those who already have experience with the pieces in question will have a leg up.

Activity
Aside from the use cases above, such as analysing job performance and monitoring for anomalous behaviour, Eagle can also analyse user behaviour. This isn’t about, say, analysing data from a web application to learn about its public users, but rather about the users of the big data framework itself – the folks building and managing the Hadoop or Spark back end. An example of how to run such an analysis is included, and it can be deployed as-is or modified.
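
For a flavour of what such an analysis looks like, here is a crude stand-in (not Eagle's shipped model; the user names and counts are invented) that scores each platform user's daily operation count against a peer baseline:

    from statistics import mean, stdev

    # HDFS operations per platform user for one day (invented data)
    ops_per_user = {"alice": 420, "bob": 390, "carol": 405, "mallory": 9800}

    for user, count in ops_per_user.items():
        # leave-one-out baseline: compare each user against the others
        peers = [c for u, c in ops_per_user.items() if u != user]
        mu, sigma = mean(peers), stdev(peers)
        if sigma and abs(count - mu) > 3 * sigma:
            print(f"anomalous activity: {user} ({count} ops vs. peer mean {mu:.0f})")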

Eagle also allows application data access to be classified according to levels of sensitivity. Only HDFS, Hive, and HBase applications can make use of this feature right now, but its interaction with them provides a model for how other data sources could also be classified.
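The shape of that classification is easy to picture: tag resources with a sensitivity level and consult the tags when an access event arrives. The paths and levels below are invented for illustration, and in Eagle the tags live in its metadata store rather than in code:

    # sensitivity tags on HDFS paths (Hive tables and HBase column
    # families would be tagged analogously)
    SENSITIVITY = {
        "/data/pci/cardholder": "SECRET",
        "/data/hr/salaries": "CONFIDENTIAL",
        "/data/public/logs": "PUBLIC",
    }

    def classify(path):
        # longest-prefix match against the tagged paths
        for prefix in sorted(SENSITIVITY, key=len, reverse=True):
            if path.startswith(prefix):
                return SENSITIVITY[prefix]
        return "UNCLASSIFIED"

    print(classify("/data/pci/cardholder/2017-01.csv"))  # SECRET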

Under control
Because big data frameworks are fast-moving creations, it’s been tough to build reliable security around them. Eagle’s premise is that it can provide policy-based analysis and alerting as a possible complement to other projects like Apache Ranger. Ranger provides authentication and access control across Hadoop and its related technologies; Eagle gives you some idea of what people are doing once they’re allowed inside.

The biggest question hovering over Eagle’s future – even this early on – is to what degree Hadoop vendors will roll it into their existing distributions or favour their own security offerings. Data security and governance have long been among the missing pieces that commercial offerings could compete on.

IDG News Service
