XtremIO Offers Xtrem Tradeoffs

EMC XtremIO

Getting a clear picture of what XtremIO was about from EMC’s presentation was a significant challenge. It’s taken me a few rewatches of the video, and triangulation from other sources (including SolidFire’s presentation, ironically), to figure out how it works.

XtremIO ‘X-bricks’ are dual-active controllers connected to 25 eMLC flash SSDs in a separate drive shelf. The logical path to the SSDs involves two lookups: first the metadata (such as the ultimate location of the data block), and then the data itself. Data (and metadata) is stored on the SSDs using a form of wide-striped 23+2 parity-RAID, which EMC call XDP because they don’t want to call it RAID for some reason. Differentiation from competitors, I assume. It looks, walks, and quacks like parity-RAID to me.
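EMC haven’t published the internals of XDP, so treat the following as a rough sketch of what a rotated, wide-striped 23+2 parity layout could look like with fixed 4k blocks. The rotation scheme and function names are my own illustration, not EMC’s:

```python
# Hypothetical sketch of a wide-striped 23+2 parity layout. Drive counts
# come from the 25-SSD shelf described above; the rotation scheme is an
# assumption, chosen so no single drive becomes a dedicated parity drive.

NUM_DRIVES = 25        # 23 data + 2 parity per stripe
DATA_PER_STRIPE = 23
BLOCK_SIZE = 4 * 1024  # fixed 4k blocks

def stripe_layout(stripe_no: int) -> dict:
    """Return which drives hold data and which hold parity for a stripe."""
    p = stripe_no % NUM_DRIVES            # first parity drive, rotated per stripe
    q = (stripe_no + 1) % NUM_DRIVES      # second parity drive
    data = [d for d in range(NUM_DRIVES) if d not in (p, q)]
    return {"parity": (p, q), "data": data}

def block_location(logical_block: int) -> tuple:
    """Map a logical 4k block to (stripe, drive) under this layout."""
    stripe_no = logical_block // DATA_PER_STRIPE
    layout = stripe_layout(stripe_no)
    drive = layout["data"][logical_block % DATA_PER_STRIPE]
    return stripe_no, drive

if __name__ == "__main__":
    for blk in (0, 22, 23, 1000):
        print(blk, block_location(blk))
```

However the real layout works, the net effect is the same: every write is spread across the whole shelf, with two drives’ worth of parity overhead per stripe.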

The controllers are connected to each other over a fully-meshed RDMA network running over InfiniBand, which is nicely speedy. It does limit the size of the cluster to the number of ports on the largest InfiniBand switch on the market, but that’s not a big deal, because the maximum cluster size is currently 4 X-bricks (8 controllers).

The full-mesh architecture is a big part of where the consistent latency story comes from: every data block is the same number of hops away no matter which controller you come in through, which provides the same latency (assuming all hops perform the same, which they should) regardless of the path taken.
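To see why, a throwaway sketch helps: in a full mesh every controller is exactly one hop from every other controller, so a remote metadata or data lookup costs the same wherever it lands. The controller count below comes from the 4 X-brick maximum; everything else is just illustration.

```python
# Toy illustration: in a full mesh, every controller pair is exactly one
# hop apart, so a remote lookup costs the same regardless of which
# controller owns the metadata or data block.

CONTROLLERS = 8  # 4 X-bricks x 2 controllers, the current maximum

def hops(src: int, dst: int) -> int:
    """Hop count between two controllers in a fully-meshed fabric."""
    return 0 if src == dst else 1

distinct_hop_counts = {hops(a, b)
                       for a in range(CONTROLLERS)
                       for b in range(CONTROLLERS)
                       if a != b}
print(distinct_hop_counts)  # {1}: uniform path length, hence uniform fabric latency
```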

Disruptive Upgrades

Of concern, though, is that adding bricks to the cluster is disruptive. Having to take the entire storage system offline for an upgrade is far from ideal, particularly in this day and age, and particularly for something that bills itself as a “scale-out” solution. It means you need to be pretty sure what you’ll need before you put things into production, or be able to budget for an outage if you want to upgrade later.

In fact, because of its tightly-coupled architecture, XtremIO is far more like a scale-up solution than scale-out. It’s the elasticity of being able to add, and remove, components that makes a solution ‘scale-out’ more than any other feature, in my opinion. This is generally achieved through some sort of shared-nothing style architecture, while XtremIO is more a ‘shared everything’ architecture because of the RDMA fabric, and the fact that data is only stored in one place.

In my opinion, XtremIO is a lot more like a Symmetrix [PDF] than a scale-out system, but let’s not get too hung up on that point.

No Compression

XtremIO doesn’t do compression yet. As per SolidFire’s presentation, this is tricky for XtremIO to do because of their choice of fixed 4k blocks for their RAID stripes, but one assumes they’re working on it. It’s not actually clear to me how much of an advantage compression is compared to deduplication, as I’ve not dug into the details of how other vendors do this, and how it performs in real-world situations. There’s marketing material out there, but vendors will always put a positive spin on their own thing.
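For what it’s worth, the fixed-block argument is easy to see in a sketch. Content-addressed deduplication over fixed 4k blocks is straightforward, whereas compression produces variable-length output that no longer maps neatly onto fixed stripe slots. This is a hypothetical illustration of the general technique, not XtremIO’s (or SolidFire’s) actual code path:

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024  # fixed 4k blocks

# Content-addressed dedup over fixed-size blocks: the block's hash is its
# identity, so identical 4k blocks are only stored once.
store = {}

def write_block(data: bytes) -> str:
    assert len(data) == BLOCK_SIZE
    fingerprint = hashlib.sha256(data).hexdigest()
    store.setdefault(fingerprint, data)   # duplicate blocks cost nothing extra
    return fingerprint

# Compression is awkward in this scheme: the output is variable-length,
# so it no longer fits 1:1 into fixed 4k RAID stripe slots.
block = b"A" * BLOCK_SIZE
write_block(block)
print(len(zlib.compress(block)))  # far smaller than 4k -- doesn't fill a fixed slot
```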

Snapshots

Snapshots (and clones) are brand new for the platform (they were due to be announced at EMC World this week), which is a little odd, given that snapshots and clones are a core storage array feature these days. The implementation of snapshots on XtremIO also seems a bit odd, and sounds a lot like the way NetApp snapshots work in ONTAP 7-mode. Snapshots are fast, because they’re just a pointer to a metadata list, and they also don’t take up much space, because there are no duplicate pointers in the lists. Snaps of snaps are also fast, because the pointers are partial lists that indirectly point to the parent block, not full lists.
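As a rough mental model (my interpretation of the presentation, not EMC’s design docs), you can think of each snapshot as a partial block map whose misses fall through to its parent. The class and names below are hypothetical:

```python
# Rough mental model of pointer-based snapshots: each snapshot holds only
# the block pointers that differ from its parent; reads that miss fall
# through to the parent, recursively.

class Snapshot:
    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}          # logical block -> metadata pointer (partial list)
        self.children = []
        if parent:
            parent.children.append(self)

    def write(self, lba, pointer):
        self.blocks[lba] = pointer

    def read(self, lba):
        """Walk up the parent chain until a pointer is found."""
        snap = self
        while snap is not None:
            if lba in snap.blocks:
                return snap.blocks[lba]
            snap = snap.parent
        return None               # never written

# Taking a snapshot is just creating an empty child: no data is copied,
# which is why it is fast and takes almost no space.
vol = Snapshot()
vol.write(0, "ptr-A")
snap1 = Snapshot(parent=vol)      # instant: empty partial list
print(snap1.read(0))              # "ptr-A", resolved via the parent
```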

But what happens if you delete a snapshot in the ‘middle’ of a tree of snaps? That removes pointers from the middle of your indirected pointer list, so you have to go through all the child snaps of the snap you want to delete, and update the pointers for all unchanged blocks to point at the snap’s parent blocks. It’s like cutting a branch at the fork, removing one branch of the fork, and then gluing the rest of the branch back on. Doing this merge is a background process, and is apparently not strictly necessary, plus the metadata is all held in DRAM, so it’s pretty fast.
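Continuing the same hypothetical model, deleting a snapshot in the middle of the chain means folding its pointers down into each child before re-linking the chain, something like:

```python
# Hypothetical sketch of deleting a 'middle' snapshot: its pointers are
# merged into each child first, so reads that used to fall through to it
# still resolve correctly once the chain is re-linked around it.

class Snapshot:
    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}          # logical block -> metadata pointer (partial list)
        self.children = []
        if parent:
            parent.children.append(self)

def delete_middle(snap):
    """Remove `snap` from the chain, pushing its pointers into its children."""
    for child in snap.children:
        for lba, ptr in snap.blocks.items():
            # Only blocks the child hasn't overwritten need the merged pointer.
            child.blocks.setdefault(lba, ptr)
        child.parent = snap.parent            # glue the branch back on
        if snap.parent:
            snap.parent.children.append(child)
    if snap.parent:
        snap.parent.children.remove(snap)

# base -> middle -> leaf; delete the middle one.
base = Snapshot()
base.blocks[0] = "ptr-A"
middle = Snapshot(parent=base)
middle.blocks[1] = "ptr-B"
leaf = Snapshot(parent=middle)
delete_middle(middle)
print(leaf.parent is base, leaf.blocks)  # True {1: 'ptr-B'}
```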

Replication

Array-native replication is also not available on XtremIO. If you want replication, you need to use VPLEX in front of XtremIO. The XtremIO team are apparently working with the RecoverPoint people to figure out how to put RecoverPoint techniques into the XtremIO software so you can do it natively, but it’s vapourware today.

Optimisation Means Choices

I actually think Dave Wright’s summation of the choices made by the different vendors is a great way to look at how the different systems he walked through work, and why. XtremIO’s architecture looks the way it does because of the choices they made to optimise in a certain direction. That makes it different from other solutions, but not necessarily better. It might suit certain use cases better than an alternative solution, but it won’t be a good fit for all use cases.

Unfortunately, during the presentation the marketing message of “we’re the best!” eclipsed the fact that you can’t optimise for everything simultaneously. Too much of the message from EMC was that other choices are inherently, and objectively, bad rather than different choices made to optimise in a different way.

XtremIO have optimised for predictable low-latency block storage performance. The maximum capacity of the system isn’t as large as some other offerings, but that’s not what it’s for. (Tape is much better, but the latency sucks.) The choice of a full-mesh RDMA fabric and fixed 4k block parity-RAID provides consistent latency. The fast InfiniBand interconnect, metadata held in DRAM, and wide striping all contribute to fast performance. Time to market was also important for XtremIO, so some of their choices (UPS-backed DRAM, shared disk) have been made to get a product out there and selling quickly.

Who Is It For?

XtremIO look reasonable if you want predictable, low-latency performance for block storage in the 10-80 TB capacity range, and you’re pretty sure how much performance and capacity your application(s) will need, and you have the money to spend (or know how to drive a hard bargain). I suspect that it might also turn into a good choice if you’re already an EMC shop and want something that will work well with all your existing EMC products and tools. While that tight integration isn’t there today, it’ll probably arrive in relatively short order. If the integration is important to you, I’d be asking for a roadmap briefing, possibly under NDA, to understand if EMC’s roadmap aligns with yours.
