Big Data is all about the cloud
8 April 2015
Big Data is not about real-time versus batch processing. It is not a question of either/or, as Ovum analyst Tony Baer and others stress. Given the broad range of options and workloads that make up a successful Big Data strategy, this is not surprising or controversial.
More controversial, though perhaps not surprising, is the nature of the infrastructure required to get the most from big data. For example, Amazon Web Services (AWS) data science chief Matt Wood warns that, while “analytics is addictive,” this positive addiction quickly turns sour if your infrastructure cannot keep up.
The key to big data success, Wood says, is more than Spark or Hadoop. It is running both on elastic infrastructure.
Hortonworks Vice President of Corporate Strategy Shaun Connolly agrees that the cloud has a big role to play in big data analytics. But Connolly believes the biggest factor in determining where big data processing is done is “data gravity,” not elasticity.
The main driver for big data deployments, Connolly says, is to extend and augment traditional on-premises systems, such as data warehouses. Eventually, this leads large organisations to deploy Hadoop and other analytics clusters in multiple locations — typically on site.
Nevertheless, Connolly acknowledges, the cloud is emerging as an increasingly popular option for the development and testing of new analytics applications, and for the processing of big data that is generated “outside the four walls” of the enterprise.
AWS big data customers range from nimble start-ups like Reddit to massive enterprises like Novartis and Merck, but Wood suggests the same three key components underpin any analytics system.
- A single source of truth. AWS provides multiple ways to store this single source of truth, from S3 object storage to databases such as DynamoDB, RDS and Aurora, to data warehousing solutions like Redshift.
- Real-time analytics. Wood says that companies often augment this single source of truth with streaming data, such as website clickstreams or financial transactions. AWS offers Kinesis for real-time data processing, though alternatives such as Apache Storm and Spark also exist.
- Dedicated task clusters. A task cluster is a group of instances running a distributed framework like Hadoop, but spun up specifically for a dedicated task such as data visualisation.
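The “dedicated task cluster” pattern can be sketched as a short script that assembles the request for a transient Amazon EMR cluster — one that runs a single job and then terminates. This is a minimal sketch, not AWS's or Wood's own code: the function name, bucket paths, instance types and release label are illustrative assumptions, while the overall request shape follows EMR's RunJobFlow API.

```python
# Sketch: build the parameters for a transient EMR "task cluster"
# that runs one Spark step and shuts itself down afterwards.
# Bucket paths, sizing and release label below are assumptions for
# illustration; in a real deployment you would pass the dict to
# boto3's emr_client.run_job_flow(**params).

def transient_cluster_params(job_name, script_s3_uri, log_s3_uri):
    """Build a RunJobFlow-style request for a single-purpose cluster."""
    return {
        "Name": job_name,
        "LogUri": log_s3_uri,
        "ReleaseLabel": "emr-5.30.0",           # assumed EMR release
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed sizing
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 4,
            # The key to the pattern: the cluster exists only for
            # its task and is torn down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-analysis",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_uri],
            },
        }],
    }

params = transient_cluster_params(
    "clickstream-visualisation",
    "s3://example-bucket/jobs/visualise.py",    # hypothetical paths
    "s3://example-bucket/logs/",
)
```

Because the cluster terminates with its job, you pay only for the minutes the task actually runs — the elasticity Wood argues is the real key to big data success.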
With these components in mind, Wood repeats that big data is not a question of batch versus real-time processing, but rather about a broad set of tools that allows you to handle data in multifaceted ways:
“It’s not Spark or Hadoop. It’s a question of ‘and,’ not ‘or.’ If you’re using Spark, that shouldn’t preclude you from using traditional MapReduce in other areas, or Mahout. You get to choose the right tool for the job, versus fitting a square peg into a round hole.”
As Wood sees it, “Real-time data processing absolutely has a role going forward, but it’s additive to the big data ecosystem.”