TFD9 Pre-work: Nutanix

Nutanix are old hands at Tech Field Day, so I expect them to be giving a polished performance. Again, Nutanix are privately held, so no spreadsheets.

Nutanix Products

Nutanix sell a Virtual Computing Platform. I’ve not used their gear, so this post is all based on the research I’ve done in the past week.

It’s basically x86 servers running proprietary software. It’s “converged”, meaning they put storage into the servers and don’t use SANs. The servers are joined together into a distributed cluster.

Again, let’s ignore marketing hype and look at what this kind of architecture is suited to: workloads that are inherently distributed, where individual sessions share very little data. And that is indeed what they talk about on their use-cases page: virtual desktops, Hadoop, private cloud.

These are simple, well-defined workloads that behave like a (possibly large) group of individual servers. It’s not a data warehouse, or a vertically scaled system. It doesn’t have the special data management capabilities available from dedicated arrays. It’s a generalist platform.

Nutanix have commercialised the kind of custom setup that Google (and later Amazon, Facebook and others) developed: a distributed, cluster filesystem on commodity x86 hardware. It’s designed to not care if the hardware breaks, because it’s expected to. Compute is fragmented into small chunks that execute on specific servers, with results collapsed together at the end (MapReduce or similar). The whole thing is coordinated by smart software.
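To make the “fragment the compute, then collapse the results” idea concrete, here’s a minimal MapReduce-style sketch in Python. It is not Nutanix’s code or anyone’s production implementation. The function names and the three-node default are made up for illustration, and everything runs in one process where a real cluster would run each map step on a different server.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_count):
    """Deal items round-robin into chunk_count chunks, one per hypothetical node."""
    return [items[i::chunk_count] for i in range(chunk_count)]


def map_chunk(lines):
    """Map step: count words using only the data in this node's own chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, node_count=3):
    """Split the input, map each chunk independently, then collapse the results."""
    chunks = split_into_chunks(lines, node_count)
    # On a real cluster these map calls would run in parallel on separate servers.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(["a b", "b c", "c c"]))
```

The point to notice is that map_chunk never looks at another node’s data, which is why this style suits workloads that share very little between sessions.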

Coordination

The chief issue with this kind of architecture is coordinating all the moving parts. When you have a single, large server with a standby node connected to shared disk (the simplest SAN possible), you don’t have to deal with the kinds of coordination problems you get with a distributed cluster.

It’s a bit like organising lunch when there are only two of you, and one person is the CEO and the other is the junior hire. Odds are, the CEO will pick the lunch venue, and off you’ll go. Compare and contrast that with the process of getting an engineering team of 15 to go to lunch today. Where do you go? Is everyone coming? Two people are ready now, a couple decide to check email “just quickly”, and the whole process can take 15 minutes just to get into the lift.

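It’s the classic n(n-1) communications problem: n nodes have n(n-1)/2 pairwise links between them, or n(n-1) if you count each direction separately, so coordination gets harder and harder as you add nodes.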
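A few numbers show how quickly that grows. This is a rough sketch: it counts n(n-1) when each direction is treated as its own path, and half of that when a link between two nodes is counted once. The function names are just illustrative.

```python
def directed_paths(node_count):
    """Communication paths when A->B and B->A are counted separately."""
    return node_count * (node_count - 1)


def pairwise_links(node_count):
    """Distinct pairs of nodes, counting each link once."""
    return node_count * (node_count - 1) // 2


if __name__ == "__main__":
    for n in (2, 4, 15, 64):
        print(n, "nodes:", pairwise_links(n), "links,", directed_paths(n), "directed paths")
```

The CEO and the junior hire have one link between them. The engineering team of 15 has 105.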

There are benefits to this approach: as you have more nodes, you care less about any individual node, provided they are all roughly equal and data replication is robust. But doing clustered data replication with appropriate integrity guards is quite tricky (which replica is correct if they get out of sync, split-brain issues, etc.) so the software has to be much smarter.
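One common guard against split-brain is a majority rule: a group of nodes only carries on if it can see more than half the cluster, and a value only counts as correct if more than half the replicas agree on it. Here’s a minimal sketch of that idea. I don’t know that this is how Nutanix does it, the function names are hypothetical, and real systems generally use full consensus protocols such as Paxos or Raft rather than a simple vote count.

```python
from collections import Counter


def has_quorum(reachable_nodes, cluster_size):
    """True if this partition can see a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def pick_majority_value(replica_values):
    """Return the value held by a strict majority of replicas, or None if there is none."""
    if not replica_values:
        return None
    value, votes = Counter(replica_values).most_common(1)[0]
    return value if votes > len(replica_values) // 2 else None


if __name__ == "__main__":
    # A 4-node cluster split evenly: neither half has a majority, so neither should write.
    print(has_quorum(2, 4))
    # Three replicas, one of them stale: the majority value wins.
    print(pick_majority_value(["v2", "v2", "v1"]))
```

Note that an even split leaves nobody with a quorum, which is one reason clusters are often built with an odd number of voting nodes.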

Like all things, it’s a trade-off, and this architecture is not equally good at all kinds of workloads.

I’m sure Nutanix has its place. I look forward to learning more about what the company believes that place is.
