Big Data is changing the game in back-up and recovery
21 June 2016
It is a well-known fact in the IT world that changing one part of the software stack often means having to change another. For a shining example, look no further than Big Data.
First, Big Data shook up the database arena, ushering in a new class of ‘scale out’ technologies. That is the model exemplified by products such as Hadoop, MongoDB, and Cassandra, where data is distributed across multiple commodity servers rather than packed into a single massive one. The beauty there, of course, is the flexibility – to accommodate more petabytes, just add another inexpensive machine or two rather than scaling up and paying big bucks for a bigger mammoth.
That has all been great, but now there is a new sticking point: backup and recovery.
“Traditional back-up products have challenges with very large amounts of data,” said Dave Russell, a vice president with Gartner. “The scale-out nature of the architecture can also be difficult for traditional back-up applications to handle.”
Today’s horizontally scalable databases do include some capabilities for availability and recovery, but typically they are not as robust as those IT users have become accustomed to, Russell added.
It is a problem that can leave large enterprises vulnerable when outages strike. It is also where a new class of data protection products, such as Datos IO’s RecoverX, is beginning to enter the picture.
“If you have a traditional database like Oracle or MySQL, it’s scale-up, and there’s always the notion of a durable log,” said Tarun Thakur, Datos IO’s co-founder and CEO.
In such scenarios, a copy of that log is what constitutes a backup when problems arise. In the world of today’s next-generation databases, where data is distributed across small machines, it is not quite so simple.
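To make the durable-log idea concrete, here is a minimal Python sketch of a single-node store that records every write in one ordered log and rebuilds its state from a copy of that log. The names `apply_write` and `recover` are hypothetical and chosen for illustration; this is not how Oracle or MySQL implement their logs.

```python
# Toy single-node store: every write is appended to one ordered log.
# All names here are illustrative assumptions, not a real database API.
log = []  # append-only list of (sequence_number, key, value)


def apply_write(state, key, value):
    """Record the write in the log first, then apply it to the live state."""
    seq = len(log) + 1
    log.append((seq, key, value))
    state[key] = value
    return seq


def recover(log_copy, up_to_seq):
    """Rebuild state by replaying a copy of the log up to a chosen point."""
    state = {}
    for seq, key, value in log_copy:
        if seq > up_to_seq:
            break
        state[key] = value
    return state


live = {}
apply_write(live, "name", "alice")
checkpoint = apply_write(live, "plan", "basic")
apply_write(live, "plan", "deleted-by-mistake")

# Replaying the copied log up to the checkpoint undoes the bad write.
recovered = recover(list(log), checkpoint)
```

Because there is exactly one log with one ordering, replaying it up to a given sequence number is enough for point-in-time recovery in this toy model.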
“There is no concept of a durable log because there is no master – each node is working on its own stuff,” Thakur explained. “Different nodes could get different writes, and every node has a different view of an operation.”
That is in part because of a trade-off that has been required to accommodate what’s commonly referred to as the “three V’s” of Big Data – volume, velocity, and variety. Specifically, to offer scalability while accommodating the crazy amounts of diverse data flying at us at ever-more-alarming speeds, today’s distributed databases have departed from the ACID criteria generally promised by traditional relational databases. Instead, they have adopted what are known as BASE principles.
It is a critical distinction. Most pertinent is that where traditional databases promise strong consistency throughout – that’s the ‘C’ in ACID – distributed ones strive instead for what’s called ‘eventual consistency’. Updates will be reflected in all nodes of the database sooner or later, but there’s a time lag.
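The time lag can be illustrated with a toy Python model, offered as a sketch under simplifying assumptions: one node accepts a write immediately, the others queue it, and a separate `propagate` step delivers it later. The `Node` and `Cluster` classes are hypothetical and do not mirror the internals of Cassandra, MongoDB or any other product.

```python
class Node:
    """One replica holding its own copy of the data (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = []  # updates received but not yet applied


class Cluster:
    """Hypothetical cluster in which writes reach most nodes with a delay."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]

    def write(self, key, value, coordinator=0):
        # The coordinating node applies the write at once;
        # every other node only queues it for later.
        for i, node in enumerate(self.nodes):
            if i == coordinator:
                node.data[key] = value
            else:
                node.pending.append((key, value))

    def read(self, key, node_index):
        return self.nodes[node_index].data.get(key)

    def propagate(self):
        # Simulates the background process that brings nodes up to date.
        for node in self.nodes:
            for key, value in node.pending:
                node.data[key] = value
            node.pending.clear()


cluster = Cluster(["a", "b", "c"])
cluster.write("balance", 100)

# Before propagation, only the coordinating node sees the new value.
stale = [cluster.read("balance", i) for i in range(3)]

cluster.propagate()

# After propagation, every node agrees.
fresh = [cluster.read("balance", i) for i in range(3)]
```

A backup taken between the write and the propagation step would capture three nodes that disagree, which is the difficulty the distinction points to.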
“If you need scalability, you need to give up consistency – you have to give up one or the other,” Thakur said.
That makes it tough to get a reliable snapshot of the big picture for point-in-time recovery. It is more difficult to track which data might have moved where in a distributed database at any given moment. On top of that, the resiliency features that often come baked into newer distributed databases, such as replication, will not protect you if data gets corrupted, said Simon Robinson, a research vice president with 451 Research.
“You just replicate that corrupted data,” he said.
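Robinson's point can be shown in a few lines of Python. The sketch below assumes a toy primary and replica held in plain dictionaries, with hypothetical helpers `replicate`, `take_snapshot` and `restore`. It illustrates the general difference between replication and versioned backups, and is not a description of how RecoverX or any other product works.

```python
import copy

# Toy data set; keys and values are made up for illustration.
primary = {"order-1": "paid", "order-2": "shipped"}
replica = {}
snapshots = []  # each entry is an independent point-in-time copy


def replicate():
    """Mirror the primary's current contents onto the replica."""
    replica.clear()
    replica.update(primary)


def take_snapshot():
    """Store a deep copy of the primary and return its version number."""
    snapshots.append(copy.deepcopy(primary))
    return len(snapshots) - 1


def restore(version):
    """Roll the primary back to a previously stored version."""
    primary.clear()
    primary.update(copy.deepcopy(snapshots[version]))


replicate()
good_version = take_snapshot()

# Corruption hits the primary, and replication faithfully copies it.
primary["order-1"] = "###corrupt###"
replicate()
replica_is_corrupt = replica["order-1"] == "###corrupt###"

# Only the earlier snapshot still holds the good value.
restore(good_version)
restored_value = primary["order-1"]
```

The replica ends up with the same corrupted record as the primary, while the earlier snapshot still allows a rollback to the last known good state.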
Recently, Datos IO launched RecoverX to address those concerns through features including what it calls scalable versioning and semantic deduplication. The result is cluster-consistent backups that are both space-efficient and available in native formats, the company says.
Souvik Das, who until recently was CTO and managing vice president of engineering with CapitalOne Auto Finance, has felt the back-up crunch first hand.
After years of using traditional databases, CapitalOne underwent a “massive transformation” a few years back that included rolling out new distributed technologies such as Cassandra, said Das, who is now senior vice president of engineering at healthcare start-up Grand Rounds.
That meant looking for a new strategy for back-up and recovery.
“Most of the back-up vendors and software are typically tuned to the type of systems that they’re backing up,” he explained.
Using an older style back-up product with a newer distributed database could spell trouble, he said.
“Either that software would completely fail because it has no idea how to back up the new data stores, or it would work in a very suboptimal way,” Das said. “We knew going in that we would have to have different back-up solutions.”
CapitalOne has been evaluating Datos IO as well as Talena, another major player in the space, Das said.
Vendors of more traditional back-up products are gradually adjusting their own technologies for Big Data as well.
“It usually takes the incumbent back-up vendors some time to support the newer technologies,” 451 Research’s Robinson said.
“Rewind 10 years and it was very difficult initially to easily do back-ups for VMware virtual machines,” he added. “This opened the door for players like Veeam to enter and steal the VM back-up market from under the noses of the incumbents.”
IDG News Service