Data De-duplication solutions are not created equal

I’ve been exploring data deduplication solutions for the SWWHEP project backup strategy.  It started because we began looking at HP’s Virtual Tape Library (VTL) solutions, which now feature the technology.  Our suppliers quoted some astronomic prices, but then they were quoting for 88TB solutions.  Now since most de-dup vendors are quoting 20:1 reductions and upwards (to some silly numbers), and we have about 40-50TB in total space (not used space), we should be looking to compress it down to say 2TB, so add a bit for long term retention and we’re still only in the 10TB range?
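As a rough sanity check on that arithmetic (a sketch using the figures above; the 20:1 ratio and the retention allowance are assumptions, not vendor guarantees):

```python
# Rough capacity sizing using the figures from the text; the dedup
# ratio and retention multiplier are assumed for illustration only.
total_space_tb = 50          # upper end of the 40-50TB estimate
dedup_ratio = 20             # conservative end of vendor claims

base_tb = total_space_tb / dedup_ratio   # ~2.5TB after de-duplication
retention_factor = 4         # assumed allowance for long-term retention
required_tb = base_tb * retention_factor

print(f"~{required_tb:.1f}TB of de-duplicated capacity needed")  # ~10.0TB
```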

Anyway, I started looking at alternatives from a number of players (Data Domain, Sepaton (used by HP), EMC, Diligent (IBM), NetApp, ExaGrid, and others), and obviously I’ve been looking for key differences to help decide what we need.  The key differentiators (so far) are:

  1. Inline or post-process.  Inline means the data is de-duplicated as it comes into the device.  Post-process means the data is initially written to disk and then de-duplicated in the background later.  So which one’s best?  Inline is obviously going to need more grunt in the device to do the work, but then only stores the de-duplicated data.  Post-process has to cache the data, which could easily be an entire night’s backups (5TB or more here), and then has to catch up before the next night’s backups.  My feeling is that inline is going to be better.
  2. The de-duplication algorithm.  Most of the players in this market use hashing of the data to determine whether there’s a duplicate in the system.  Diligent are unique in having their own proprietary technology to handle this.  Their argument is that there is a chance of two non-identical blocks of data having the same hash key, and therefore you could get corruption.  However, the odds are actually very, very long – longer than those of other factors affecting the system, e.g. disk failure, being hit by an asteroid, etc.  However, hashing doesn’t scale particularly well.  I’ll accept that hashing works, but would prefer an alternative – I recall a particularly slick solution for storing a Scrabble dictionary which managed to cram 65,000 words into less than 64K using a finite state machine – apply the same technique to disk blocks and it’s going to fly?
  3. WAN replication – some players have solutions which enable the de-duplicated data to be replicated over a WAN link.  Because of the high deduplication rates this is suitable even for large amounts of data over very slow links.  Many of the players have this coming soon!  For the SWWHEP project this functionality is critical.
  4. VTL and NAS emulation.  Most solutions are only Virtual Tape Libraries, so they hook into your traditional backup software.  This is fine, but in some cases (and in our case) the backups are a little different.  Specifically, we’re backing up a lot of VMware images via Vizioncore vRanger.  This product just writes files to disk, so a de-duplication system which looks like a filestore is a better solution; otherwise we have to back up to disk (without de-duplication) and then back up to VTL.
  5. Agent vs agent-less.  Some solutions (well, one I think – EMC) use an agent on the server being backed up, which de-dups before the data even leaves the machine – so presumably it generates the hash and checks this against the backup device, rather than passing the block over the LAN.  But it’s messy having more agents to manage, and then there are support issues for anything other than Windows/RedHat/SUSE?
  6. Built in disk solutions.  The majority of the solutions are appliances with built in disks.  This is fine, but solutions which allow a choice of back end disk subsystems could be a better option.
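To make point 2 concrete, here’s a minimal sketch of hash-based block de-duplication – my own illustration, not any vendor’s implementation: each incoming block is hashed, and only blocks with unseen hashes are actually stored.

```python
import hashlib

def dedup_store(blocks, store=None):
    """Store only unique blocks, keyed by their SHA-256 digest.
    Returns (store, index): the de-duplicated block store and the
    list of digests needed to reconstruct the original stream."""
    store = {} if store is None else store
    index = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:        # inline: decide before writing
            store[digest] = block
        index.append(digest)
    return store, index

# Three blocks, two identical: only two get stored.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
store, index = dedup_store(blocks)
print(len(store))   # 2 unique blocks kept
print(len(index))   # 3 references, enough to rebuild the stream
```

The inline vs post-process split in point 1 is essentially about when the `digest not in store` check happens: before the block hits disk, or in a background pass over a cached copy later.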
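And to put the hash-collision worry in point 2 into perspective, a quick birthday-bound estimate (my own back-of-envelope; the 256-bit hash and 4KB block size are assumptions for illustration):

```python
# Birthday-bound approximation: P(collision) ~= n^2 / (2 * 2^bits).
# Hash width and block size are assumed purely for illustration.
hash_bits = 256
n_blocks = 50 * 2**40 // 4096      # ~50TB carved into 4KB blocks

p_collision = n_blocks**2 / (2 * 2**hash_bits)
print(p_collision)                  # vanishingly small
```

Disk failure or asteroid strike really are the bigger risks.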
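The agent-side idea in point 5 can be sketched as a simple exchange – a hypothetical protocol of my own, just to illustrate the bandwidth saving: the agent sends the hash first, and ships the block itself only if the device hasn’t seen it.

```python
import hashlib

class BackupDevice:
    """Toy stand-in for the appliance's 'have you seen this?' API."""
    def __init__(self):
        self.store = {}
    def has(self, digest):
        return digest in self.store
    def put(self, digest, block):
        self.store[digest] = block

def agent_send(device, block):
    """Send the hash first; ship the full block only if it's new.
    Returns the number of bytes/chars that crossed the LAN."""
    digest = hashlib.sha256(block).hexdigest()   # 64-char hex digest
    if device.has(digest):
        return len(digest)                 # duplicate: hash only
    device.put(digest, block)
    return len(digest) + len(block)        # new: hash + full block

device = BackupDevice()
block = b"X" * 4096
print(agent_send(device, block))   # first copy: full block crosses the wire
print(agent_send(device, block))   # duplicate: only the digest
```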

So that, I think, covers most of it.  My preference, I think, is a solution which uses inline processing, something cleverer than hashing, has WAN replication, offers both VTL and NAS, is agent-less, and allows the use of alternative backend disk solutions.  Which of course leaves me with a choice of zero products!  So dropping some of the requirements – accepting hashing and appliances with their own disks – seems to give me a choice of one solution: Data Domain.   Of the others the most interesting is Diligent, as it uses its own “HyperFactor” data matching solution and supports clusters of appliances.  But its WAN replication is only coming soon; it can use other disk backends – but for how long, given it’s merging with IBM’s product line and hence their disk systems – and it’s a VTL, not a NAS device.  (Diligent at the moment also resells via Hitachi, Sun (with replication) and FalconStor – but again, for how long?)

HP’s solution buys in technology from Sepaton, which is a post-process solution (don’t like that), whose WAN replication is coming soon (Sepaton already have it), and it’s VTL-only too.  So I’m going to talk to Data Domain…
