Disk-to-disk backup: why, how, where


12 August 2009

Backup technology is changing, and it is changing fast. Not so long ago, backing up meant copying your primary data from hard disk to tape – initially to the spools of half-inch tape beloved of film directors, and more recently to various types of tape cassettes and cartridges.

Now though, more and more organisations are using hard disks for their backups as well as their primary data, a process that has become known as disk-to-disk backup, or D2D for short.

There is a whole host of reasons for this shift. In particular, the cost of hard disks has fallen dramatically while their capacity has soared, and disk arrays have much better read/write performance than tape drives – this is particularly valuable if an application must be paused or taken offline to be backed up.

In addition, tape is quite simply a pain to work with, especially if a cartridge must be retrieved, loaded, and scanned in its entirety, just to recover one file. Tapes can be lost or stolen, too. While we put up with all that in the past because we had to, that is no longer the case.

Sure, a tape cartridge on a shelf – albeit in a climate-controlled storeroom – is still the cheapest and least energy-consuming way to store data for the long term, but that is increasingly the role of an archive not a backup. So while tape is unlikely to vanish altogether, its role in backup is declining fast.

Disk can be incorporated into the backup process in many ways, from virtual tape libraries (VTLs) through snapshots to continuous data protection (CDP), and each method may suit some applications or user requirements better than others. An organisation may even use more than one D2D backup scheme in parallel, in order to address different recovery needs.

Disk is also used in many other forms of data protection, such as data replication and mirroring, although it is important to understand that these are not backups. They protect against hardware failure or disasters, but they cannot protect against data loss or corruption as they offer no rollback capability.

Why backup?

It is an inescapable fact that computers and their disk storage go wrong – not often, but they do. Even when they do not, they can be stolen, destroyed by fires or other disasters, damaged by power failures, or suffer one of a host of other events that cause data to be lost, such as a user deleting a file by mistake or saving the wrong version. When that happens, you need a backup.

Replication and mirroring do fill some very important data protection needs, because if one whole system is lost, the other should be able to take over. However, a replica or mirror is not a backup, because if data on one half of a pair is corrupted or deleted, it will also be deleted or corrupted on the other half.
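The difference can be sketched in a few lines of Python. This is a toy model only: the class names `MirroredPair` and `BackedUpStore` are hypothetical, and files are represented as entries in a dictionary rather than anything on disk.

```python
class MirroredPair:
    """Toy mirror: every change is applied to both copies at once."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, name, data):
        self.primary[name] = data
        self.mirror[name] = data

    def delete(self, name):
        # The deletion is faithfully replicated, mistakes included.
        self.primary.pop(name, None)
        self.mirror.pop(name, None)


class BackedUpStore:
    """Toy store that keeps point-in-time copies (recovery points)."""

    def __init__(self):
        self.live = {}
        self.recovery_points = []

    def write(self, name, data):
        self.live[name] = data

    def delete(self, name):
        self.live.pop(name, None)

    def backup(self):
        # Save the state as it is right now and return its index.
        self.recovery_points.append(dict(self.live))
        return len(self.recovery_points) - 1

    def restore(self, name, point):
        # Bring one file back from the chosen recovery point.
        self.live[name] = self.recovery_points[point][name]
```

If a file is deleted by mistake, the mirror has lost it too, whereas the backed-up store can fetch it from an earlier recovery point.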

Backup is all about saving data at a specific time and in a consistent state. It allows you to go back to a specific recovery point, for example to recover an earlier version of a corrupted or deleted file or database, to restore a crashed system to its previous working state, or to take a copy off-site for safekeeping.

A backup traditionally has been stored on non-volatile and removable media, such as tape, due to that need to move copies off-site. Increasingly though, this role is being filled by non-removable media at a different site – either your own secondary site, or a facility managed by a service provider – with data networks replacing lorries as the transport mechanism.

When it comes to restoring data, disk’s big advantage over tape is that it is random-access rather than sequential access. That means that if you only need one file or a few files back, it will be faster and easier to find and recover from disk.

What backup and recovery methods you use will depend on two factors – the recovery point objective (RPO), i.e. how much data the organisation can afford to lose or re-create, and the recovery time objective (RTO), which is how long you have to recover the data before its absence causes business continuity problems.

For instance, if the RPO is 24 hours, daily backups to tape could be acceptable, and any data created or changed since the last backup must be manually recovered. An RTO of 24 hours similarly means the organisation can manage without the system for a day.

If the RPO and RTO were seconds rather than hours, the backup technology would not only have to track data changes as they happened, but it would also need to restore data almost immediately. Only disk-based continuous data protection (CDP) schemes could do that.
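As a rough illustration of how the two objectives narrow the choice, the check can be written as a simple comparison. This is a sketch, not a sizing tool: the `BackupScheme` class, the `meets_objectives` function and all of the figures below are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class BackupScheme:
    name: str
    backup_interval: timedelta  # worst-case gap between recovery points
    restore_time: timedelta     # estimated time to get the data back


def meets_objectives(scheme: BackupScheme, rpo: timedelta, rto: timedelta) -> bool:
    """A scheme fits if it loses no more than the RPO and restores within the RTO."""
    return scheme.backup_interval <= rpo and scheme.restore_time <= rto


# Illustrative figures only; real values depend on the environment.
nightly_tape = BackupScheme("nightly tape", timedelta(hours=24), timedelta(hours=8))
cdp = BackupScheme("disk-based CDP", timedelta(seconds=1), timedelta(seconds=30))
```

With these assumed figures, nightly tape satisfies an RPO and RTO of 24 hours each, but only the CDP scheme satisfies objectives measured in seconds.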

Ways to use disk

Most current disk-based backup technologies fall into one of four basic groups, and can be implemented either as an appliance, or as software which writes to a dedicated partition on a NAS system or other storage array.

Virtual tape library (VTL): One of the first backup applications for disk was to emulate a tape drive. This technique has been used in mainframe tape libraries for many years, with the emulated tape acting as a kind of cache – the backup application writes a tape volume to disk, and this is then copied or cloned to real tape in the background.

Using a VTL means there is no need to change your software or processes – they just run a lot faster. However, it is still largely oriented towards system recovery, and the restore options are pretty much the same as from real tape. Generally, the virtual tapes can still be cloned to real tapes in the background for longer-term storage; this process is known as D2D2T, or disk-to-disk-to-tape.

Simpler VTLs take a portion of the file space, create files sequentially and treat them as tape, so your save-set is the same as on real tape. That can waste space though, as the VTL allocates the full tape capacity on disk even if the tape volume is not full.

More advanced VTLs get around this problem by layering on storage virtualisation technologies. In particular this means thin provisioning, which allocates a logical volume of the desired capacity but does not consume physical disk space until there is actual data to write. The virtualisation layer can also take capacity from anywhere, e.g. from a Storage Area Network, from local disk, and even from Network Attached Storage.
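The contrast between the two allocation styles can be shown with a small model. It is a sketch under simplifying assumptions: `ThickVolume` and `ThinVolume` are hypothetical names, and a "block" is just a Python object rather than real disk space.

```python
class ThickVolume:
    """Simple VTL style: the full tape capacity is claimed up front."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.blocks = [b""] * capacity_blocks  # space reserved immediately

    def write(self, index, data):
        self.blocks[index] = data

    def allocated_blocks(self):
        return self.capacity_blocks


class ThinVolume:
    """Thin-provisioned style: space is consumed only when data is written."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks  # logical size promised
        self.blocks = {}                        # physical blocks in use

    def write(self, index, data):
        if not 0 <= index < self.capacity_blocks:
            raise IndexError("write beyond logical capacity")
        self.blocks[index] = data

    def allocated_blocks(self):
        return len(self.blocks)
```

Writing three blocks to a 1,000-block virtual tape consumes all 1,000 blocks in the first model, but only three in the second.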

Disk-to-disk (D2D): Typically this involves backing up to a dedicated disk-based appliance or a low-cost SATA array, but this time the disk is acting as disk, not as tape. Most backup applications now support this. It makes access to individual files easier, although system backups may be slower than streaming to a VTL.

An advantage of not emulating tape is that you are no longer bound by its limitations. D2D systems work as random-access storage, not sequential, which allows the device to send and receive multiple concurrent streams, for example, or to recover individual files without having to scan the entire backup volume.

D2D can also be as simple as using a removable disk cartridge instead of tape. The advantage here is backup and recovery speed, while the disk cartridge can be stored or moved offsite just as a tape cartridge would be.

Snapshot: This takes a point-in-time copy of your data at scheduled intervals, and is pretty much instant. However, unless it is differential (which is analogous to an incremental backup) or includes some form of compression, data reduction or de-duplication technology, each snapshot will require the same amount of disk storage as the original.

Differential snapshot technologies are good for roll-backs and file recovery, but may be dependent on the original copy, so are less useful for disaster recovery.
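One common way to build a differential snapshot is copy-on-write, sketched below. It is an assumption that any given product works this way, and the `Volume` and `Snapshot` classes are hypothetical. The sketch also shows why such a snapshot depends on the original: unchanged blocks are read straight from the live volume.

```python
class Snapshot:
    """Holds only the blocks that have changed since the snapshot was taken."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> content at snapshot time

    def preserve(self, index, old_data):
        # Keep the first (oldest) version only; later overwrites are ignored.
        self.saved.setdefault(index, old_data)

    def read(self, index):
        if index in self.saved:
            return self.saved[index]
        # Unchanged block: falls back to the original volume.
        return self.volume.blocks[index]


class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: save the old content before overwriting it.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data
```

Taking the snapshot copies nothing, which is why it is near-instant, and it grows only as blocks on the live volume are overwritten. If the live volume is lost, however, the unchanged blocks go with it.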

Many NAS (network attached storage) vendors offer tools which can snapshot data from a NAS server or application server on one site to a NAS server at a recovery location.

However, in recent years snapshot technology has become less dependent on the hardware – it used to be mainly an internal function of a disk array or NAS server, but more and more software now offers snapshot capabilities.

Continuous data protection (CDP): Sometimes called real-time data protection, this captures and replicates file-level changes as they happen, allowing you to wind the clock back on a file or system to almost any previous point in time.

The changes are stored at byte or block level with metadata that notes which blocks changed and when, so there is often no need to reconstruct the file for recovery – the CDP system simply gives you back the version that existed at your chosen time. Any changes made since then will need to be recovered some other way, for example via journaling within the application.
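A minimal sketch of the idea is a journal of timestamped block changes that can be replayed up to any chosen moment. The `CdpJournal` class is hypothetical, timestamps are plain numbers, and a real CDP product would add indexing, consistency points and space management on top.

```python
import bisect


class CdpJournal:
    """Toy CDP store: a base image plus a time-ordered log of block changes."""

    def __init__(self, base_blocks):
        self.base = list(base_blocks)
        self.times = []    # timestamps, kept in non-decreasing order
        self.changes = []  # (block_index, data), parallel to self.times

    def record(self, timestamp, index, data):
        # Assumption for this sketch: changes arrive in time order.
        if self.times and timestamp < self.times[-1]:
            raise ValueError("timestamps must not go backwards")
        self.times.append(timestamp)
        self.changes.append((index, data))

    def view_at(self, timestamp):
        """Return the blocks as they stood at the given time."""
        blocks = list(self.base)
        upto = bisect.bisect_right(self.times, timestamp)
        for index, data in self.changes[:upto]:
            blocks[index] = data
        return blocks
```

Calling `view_at` with a time just before a corruption returns the data as it stood at that moment; anything recorded after that time is simply left out of the view.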

CDP is only viable on disk, not tape, because it relies on having random access to its stored data. Depending on how the CDP process functions, one potential drawback is that the more granular you make your CDP system, the more it impacts performance of the system and application.
