Erasure Coding For Fun and Profit

As part of Data Field Day 1, we spoke to a company called HGST. HGST traces its lineage to the team that invented the hard drive back in the 1950s, not far from their current office in Silicon Valley. They've been bought and sold a few times, most recently to Western Digital, and they're still very much in the business of making disk drives.

There are plenty of companies banging on about flash being the latest great thing — flash is wonderful, don't get me wrong — but flash mostly solves a speed problem. It doesn't help with volume. To store a lot of data (and there is a lot of data being produced), flash is nowhere near as cost-effective as hard disk (or tape).

We’re going to have spinning media around for a while yet, so it’s nice to hear about all the progress being made on that front, particularly when we’re talking about data analysis, which can involve very large datasets indeed.

Direct Object Storage

We spoke to the Elastic Storage Platforms Group, who are responsible for building what HGST call their Active Archive system. It’s a 4.7 PB (yes, petabytes) object store in a single rack, requiring just 6.5 kVA to run. That’s quite impressive density for so little power.

The system is designed to serve as an object store, not a traditional SAN or NAS. You talk to it using S3 primitives, so it’s well suited to large datasets being analysed with things like Hadoop, or to media storage such as streaming video. 4K cameras produce a lot of data, and you never know when you might want to re-release a Super-Definitive-Special-Remastered-Director’s-Uber-Cut.
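
Since the box speaks S3, an ordinary S3 client should be able to talk to it. Here’s a minimal sketch using boto3; the endpoint URL, bucket name, object key, and credentials are all placeholders I’ve made up, not anything HGST documents.

```python
import boto3

# Point a standard S3 client at the appliance instead of AWS.
# The endpoint and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://archive.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store and retrieve an object using ordinary S3 primitives.
s3.put_object(Bucket="media-archive", Key="films/cut-47.mxf", Body=b"...")
obj = s3.get_object(Bucket="media-archive", Key="films/cut-47.mxf")
print(obj["Body"].read())
```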

We spent a bit of time talking about the hardware and I am, once again, reminded how much hardware matters, just as with X-IO Storage in the past, and with Pure Storage more recently. The drives themselves are filled with helium, which is less dense than the nitrogen that makes up about 78% of regular air, and this provides a number of benefits over standard hard drives, particularly in terms of lifespan. The drive enclosures are mounted on special floating mounts to reduce vibration and noise, as are the enclosure sleds themselves. Again, this helps to extend the lifespan of the drives, and to keep performance steady even if you decide to yell at the disk array.

Hardware matters.

But it’s not just the hardware that matters, because hardware is used to run software. That’s why we have all this computer gear in the first place.

Everything happening with data lately emphasises that you’re going to have to learn some statistics if you don’t know any already. Data analysis is really just a fancy term for statistics, and it does statistics a disservice given all the neat things you can do with the various branches of it. The science is already cool enough, but Marketing need to slap a buzzword on it to feel like they contributed something.

WTF is Erasure Coding?

We shall now take a detour into erasure coding and fault tolerance, so if you don’t want to geek out with me, skip ahead to the section titled Online Archive.

You’re probably familiar with RAID: a Redundant Array of Inexpensive Disks. RAID 0, or striping, means you combine a bunch of disks into one bigger logical volume, and it provides essentially no fault tolerance at all. If a disk dies, you can’t use any of the data (pretty much). It’s like losing the middle 40 seconds of a song, or tape 3 of 27. RAID 1, or mirroring, is a bit better: you keep a full copy of everything on one drive on another drive, so if you lose one, you have the other to fall back on.
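
If it helps, here’s a purely illustrative sketch of the difference, where the “disks” are just byte strings:

```python
data = b"ABCDEFGH"

# RAID 0 (striping): alternate chunks across two "disks". Lose either disk
# and half of every file is gone.
disk0, disk1 = data[0::2], data[1::2]   # b'ACEG', b'BDFH'

# RAID 1 (mirroring): a full copy on each disk. Lose either disk and the
# other still holds everything, at the cost of twice the raw capacity.
mirror0, mirror1 = data, data
```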

Having a full copy of the data somewhere else is great for fault tolerance, but it’s expensive. You need at least twice as much storage to keep your data safe. What if you could provide the same level of protection without needing to use 2x the storage?

Enter erasure coding.

Erasure coding essentially uses maths to add a little bit of extra data to the end of the actual data so that if you lose part of this new, bigger amount of data, you can still get all of the original data back. A simple version is a checksum: sum all the ones and zeros and put that total at the end. If you lose any one of the bits, you can figure out what it was by re-calculating the checksum over the bits you still have and comparing it to the stored checksum. The difference is the missing bit. This is a vast over-simplification, but that’s basically it.
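
To make that concrete, here’s a toy version of the sum-the-bits idea; the numbers are made up purely for illustration.

```python
bits = [1, 0, 1, 1, 0, 1, 0, 1]
checksum = sum(bits)                  # stored alongside the data: 5

# Pretend bit 3 is lost: sum the survivors and compare with the stored total.
survivors = bits[:3] + bits[4:]
lost_bit = checksum - sum(survivors)  # 5 - 4 = 1, the missing bit
print(lost_bit)
```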

RAID 4 (and NetApp’s RAID-DP), RAID 5, and RAID 6 are all special forms of this kind of checksum, called parity, and the differences are mostly about how many parity blocks are kept and where they’re stored.
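
Here’s a sketch of single-parity recovery, the idea underneath RAID 4 and RAID 5; the three-disk layout is invented for the example.

```python
from functools import reduce

# Three data blocks and one parity block: the parity is the XOR of the data.
data_blocks = [b"\x10\x20\x30", b"\x0a\x0b\x0c", b"\xff\x00\x7f"]

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

parity = xor_blocks(data_blocks)

# Disk 1 dies: XOR the surviving data blocks with the parity to rebuild it.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert rebuilt == data_blocks[1]
```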

There’s a downside (there’s always a downside). If you lose a disk, you have to rebuild all the data from the parity blocks scattered around the place, which reduces the performance of the array because some of the time is spent on the rebuild instead of serving up the data. Plus, the more data you have on a single disk, the more data you have to rebuild, and the longer it takes. If you have a lot of data, the rebuild can take longer than the time before another disk breaks. If you lose too many disks (often as few as 2 per RAID set), you lose the ability to recover at all, and now you have to get it from a mirror (like a backup). Faster CPUs help, but it’s still a losing battle.
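
Some rough arithmetic, with figures I’ve assumed rather than measured, shows why this gets worse as drives grow:

```python
# Back-of-the-envelope rebuild time: an 8 TB drive rebuilt at a sustained
# 100 MB/s while the array keeps serving I/O (both figures are assumptions).
capacity_bytes = 8 * 10**12
rebuild_rate_bytes_per_sec = 100 * 10**6
hours = capacity_bytes / rebuild_rate_bytes_per_sec / 3600
print(f"~{hours:.0f} hours to rebuild one drive")  # roughly 22 hours
```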

But there are other, fancier, techniques.

The 4-state barcodes on Australia Post and UK Royal Mail envelopes are an example of an erasure code called a Reed-Solomon code. It’s also (apparently) used in QR codes, CDs, DVDs, Blu-ray discs, and a host of other applications. You can lose multiple chunks of data and still rebuild it, depending on how long your check code is.
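
To give a feel for the “any k of n” property, here’s a teaching sketch of a Reed-Solomon-style erasure code built from polynomial evaluation over the prime field GF(257). It isn’t the variant used in barcodes, optical discs, or HGST’s system (real codes work over GF(256), are systematic, and are heavily optimised), and the encode/decode names are mine.

```python
P = 257  # prime modulus; every byte value 0..255 fits as a field element

def encode(data_bytes, n):
    """Treat k data bytes as the coefficients of a polynomial and evaluate
    it at n distinct points. Any k of the n shares can recover the data."""
    shares = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(data_bytes):   # Horner's rule
            y = (y * x + coeff) % P
        shares.append((x, y))
    return shares

def decode(shares, k):
    """Rebuild the k original bytes from any k shares by Lagrange
    interpolation of the polynomial's coefficients."""
    shares = shares[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(shares):
        # Basis polynomial l_i(x) = prod_{j != i} (x - xj) / (xi - xj)
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(shares):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)     # multiply basis by (x - xj)
            for d, c in enumerate(basis):
                new[d] = (new[d] - c * xj) % P
                new[d + 1] = (new[d + 1] + c) % P
            basis = new
            denom = (denom * (xi - xj)) % P
        scale = (yi * pow(denom, P - 2, P)) % P   # divide by denom in GF(257)
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % P
    return bytes(coeffs)

data = b"erasure"                 # 7 data symbols
shares = encode(data, n=12)       # 12 shares: survive losing any 5 of them
survivors = shares[2:9]           # pretend shares 1, 2 and 10..12 were lost
print(decode(survivors, k=len(data)))   # b'erasure'
```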

The downside is that Reed-Solomon codes are slower than simple parity to calculate, so writing data is slower, which isn’t so great for the performance of a disk array. But there are even fancier techniques, and HGST apparently uses one of the more complex ones, called a Tornado code. Tornado codes require a bit more check data than Reed-Solomon, but they are much, much faster to compute.

Pretty much all of the modern approaches to large data storage use these more complex erasure codes rather than simple parity-RAID. If you’re at all involved with data storage and don’t know much about them, start learning about them now.

Online Archive

The fancy software parts of the HGST solution appear to come from the Himalaya object storage software created by Amplidata (now owned by HGST). The Active Archive product was developed in partnership with Amplidata, so it looks like HGST liked the software so much they bought the company.

The goal of the product is to provide large-scale storage of data that can just keep growing as you need to store more and more of it, in a similar way to ‘cloud’ providers, but inside your own company. It’s a similar goal to that of Spectra Logic’s Black Pearl, but with all of the data on disk rather than in a cache in front of tape, so you get access to the data much faster.

This is where my questioning of HGST’s cloud positioning comes in. Right now, it’s an object storage system with an S3 interface, so the link to it being ‘cloud’ is that you can talk to the storage using the S3 API. The nebulous nature of what ‘cloud’ even means makes this a bit confusing.

It sounds like this is being positioned against things like AWS Glacier for long-term archiving of data, only it lives on your site so it’s easier (and cheaper) to get the data back. The marketing message here isn’t as strong as that of a big-data storage platform, because the main reason to have something on disk and accessible is to be able to read it back, not just store it for a long time. If I’m just throwing some data onto some sort of system for as long as possible, I want to go with something really cheap, and disk still isn’t cheaper than tape.

But if you focus on the Active part of Active Archive, then I think HGST might have something fairly compelling here. I certainly hope that’s the direction they take, because reading lots and lots of data across a WAN link from my cloud provider isn’t something I want to do. That means I either move all my data and analysis capability to a cloud provider (and keep it there), or I do all of it on my own site.

And now I’m much more interested in looking at something that makes online bulk storage of data easy to use, like an Active Archive.
