Hadoop is evolving in three key ways

5 October 2015

The recent Strata+Hadoop World 2015 conference in New York was subtitled “Make Data Work,” but given how Hadoop’s world has evolved over the past year (even over the past six months), another apt subtitle might have been “See Hadoop Change.”

Here are three of the most significant recent trends in Hadoop, as reflected by the show’s roster of breakout sessions, vendors, and technologies.

Spark is leading Hadoop
Spark is so hot it had its own schedule track, labelled “Spark and Beyond,” with sessions on everything from using the R language with Spark to running Spark on Mesos.

Some of the enthusiasm is easy to chalk up to Cloudera, a big fan of Spark, being a key sponsor for the show. But Spark’s rising popularity is hard to ignore.

Spark’s importance stems from how it offers self-service data processing, by way of a common API, no matter where that data is stored. (At least half of the work done with Spark is not within Hadoop.) Arsalan Tavakoli-Shiraji, vice president of customer engagement for Databricks, Spark’s chief commercial proponent, spoke of how those tasked with getting business value out of data “eagerly want data, whether they’re using SQL, R, or Python, but hate calling IT.”

Rob Thomas, IBM’s vice president of product development for IBM Analytics, cited Spark as being key in the shift away from “a world of infrastructure to a world of insight.” Hadoop data lakes often become dumping grounds, he claimed, without much in the way of the business value that Spark can provide.

“Instead of hiring 500 IT ops people to handle [Hadoop] clusters, I’d rather hire 250 data scientists to get something meaningful out of the data,” Thomas said.

Target audience shifting
The pitch for Hadoop is no longer about it being a data repository; that is a given. It is now about having skilled people and powerful tools to plug into it in order to get something useful out.

Two years ago, the keynote speeches at Strata+Hadoop were all about creating a single repository for enterprise data. This time around, the words “data lake” were barely mentioned in the keynotes — and only then in a derogatory way. Talk of “citizen data scientists,” “using big data for good,” and smart decision making with data was offered instead.

What happened to the old message? It got elbowed aside by the growing realisation that the culture of self-service tools for data science on Hadoop is of more real value than the ability to aggregate data from multiple sources. If the old Hadoop world was about free-form data storage, the new Hadoop world is (ostensibly) about free-form data science.

The danger, though, is making terms like “data scientist” too generic, in the same way that “machine learning” got watered down through overly broad use.

New tech proving ground
Few would dispute that Hadoop remains important, least of all the big names behind the major Hadoop distributions. But attention and excitement seem less focused on Hadoop as a whole than on the individual pieces emerging from under Hadoop’s big tent — pieces that are being used to create entirely new products.

Spark is the obvious example, both for what it can do and how it goes about doing it. Spark’s latest incarnation features major workarounds for issues with the JVM’s garbage collection and memory management systems, technologies that have exciting implications outside of Spark.

But other new-tech-from-Hadoop examples are surfacing: Kafka, the Hadoop message-broker system for high-speed data streams, is at the heart of products like Mesosphere Infinity and Salesforce’s IoT Cloud. If something can survive being deployed at scale within Hadoop, so the conventional wisdom goes, it is probably good technology.

Unfortunately, one of the side effects of Hadoop being such a fertile breeding ground is fragmentation of Hadoop itself. Efforts to provide a firmer definition of what’s inside the Hadoop tent, like the Open Data Platform Initiative, have inspired as much dissent and division as agreement and consensus. And new additions to the Hadoop toolbox risk further complicating an already complex picture. Kudu, the new Hadoop file system touted by Cloudera as a way to combine the best of HDFS and HBase, is not compatible with HDFS’ protocols — at least, not yet.

There’s little sign that the mix of ingredients that make up Hadoop will become any less ad hoc or variegated with time, thanks to the slew of vendors vying to deliver their own spin on the platform. But whatever becomes of Hadoop, some of its pieces have already proven they can thrive on their own.

Serdar Yegulalp, IDG News Service
