Resilient Apps Or Hardware? A DevOps Conundrum


Much is being made of DevOps of late, and I’ve been doing a lot of work with clients in this area. What amuses me greatly is how little things have changed in the past several decades. This post by eBay cloud architect Subbu Allamaraju highlights a central struggle of IT: how to keep apps online.

This is a hard problem to solve, and there are two major approaches: infrastructure or application. Each approach makes certain techniques available. Neither makes the problem ‘easier’ per se, but each makes parts of the situation easier.

Resilient Infrastructure

This is the way we tried to do things for a very long time. Instead of having to make software deal with infrastructure failures (mostly hardware in the early days, but also software infrastructure, particularly as virtualisation became more popular) we tried to make infrastructure that doesn’t fail. This lets the software running on the infrastructure pretend that its infrastructure is perfect, which makes writing software much, much easier. I don’t have to check if my persistent storage is still there, because I assume it always is. I don’t have to check if the network is up, because I assume it always is.

As an application developer, I push all these issues of reliability off to someone else to deal with so I can concentrate on things like correctness: making sure my app does the right things when transforming inputs into outputs.

And so we have things like Tandem/HP NonStop hardware, RAID, active/passive failover, and so on. A variety of techniques at the infrastructure layer all so applications can pretend that infrastructure is perfect.

Of course, infrastructure isn’t perfect, so we relax our reliability criteria from 100% to 99%. Or 99.99%. Or however many nines you want. And then we gasp at how expensive infrastructure is.
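To put some numbers on those nines, here is a minimal sketch in Python that converts an availability target into the downtime it permits per year. The function name and the list of targets are mine, chosen for illustration, and the year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; pick whatever number of nines you like.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% allows {hours:.2f} hours of downtime per year")
```

Each extra nine cuts the permitted downtime by a factor of ten, which goes some way to explaining the gasping.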

Resilient Applications

Then one day someone had a thought. “What if,” they said, “Just what if, for the sake of argument,” they continued, their excitement palpable, “we didn’t assume the infrastructure is perfect? What if we wrote applications that assumed infrastructure would fail and worked around it?”


Because now, instead of spending a lot on infrastructure that tries really hard to be perfect, you just buy any old cheap gear that does the job well enough, but you buy five of them (for half the price of before) and if one of them dies you just throw it out and buy a new one! By Grabthar’s Hammer, what a savings!
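The arithmetic behind the excitement can be sketched in a few lines of Python. This assumes two things that rarely hold exactly in practice: that the cheap boxes fail independently of each other, and that any single surviving box can carry the whole load. The 90% figure is a made-up number for illustration, not a measurement.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` independent units is up.

    `single` is the availability of one unit as a fraction (0.9 = 90%).
    Assumes independent failures, which is an idealisation.
    """
    return 1 - (1 - single) ** copies


# Hypothetical cheap gear that is only up 90% of the time.
for copies in (1, 2, 5):
    print(f"{copies} unit(s): {combined_availability(0.9, copies):.5f}")
```

Under those assumptions, five mediocre units together look better on paper than one very expensive one.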

Ah, but now your software has to deal with failing infrastructure. Consider all the different kinds of infrastructure you need to use, and the ways in which it can fail. Now you have to deal with all of them, or your app will die, because no one else is taking care of that for you any more. That’s a lot of work. And, it turns out, distributed systems are really hard. Managing a few thousand fragile servers, and all the storage and networking between them, isn’t all that trivial.
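To give a flavour of what that work looks like, here is a minimal sketch of one small piece of it: retrying a request with backoff and failing over to another replica. Everything here is hypothetical (the function names, the idea that a replica is just a callable, the choice of `ConnectionError` as the failure signal), and a real system has many more failure modes to handle than this.

```python
import random
import time


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas, request, attempts_per_replica=2,
                       base_delay=0.1, sleep=time.sleep):
    """Try each replica in turn, retrying with jittered exponential backoff.

    `replicas` is a list of callables standing in for servers (an
    assumption made for this sketch). `sleep` is injectable so the
    example can run without actually waiting.
    """
    last_error = None
    for replica in replicas:
        for attempt in range(attempts_per_replica):
            try:
                return replica(request)
            except ConnectionError as err:
                last_error = err
                # Back off a little longer after each failed attempt.
                sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise AllReplicasFailed(f"no replica could serve {request!r}") from last_error


def dead_replica(request):
    raise ConnectionError("replica down")


def healthy_replica(request):
    return f"ok: {request}"


# The first replica always fails, so the call falls through to the second.
print(call_with_failover([dead_replica, healthy_replica], "GET /status",
                         sleep=lambda seconds: None))
```

Even this toy version has to decide how long to wait, how many times to try, and what to do when everything is down. Multiply that by every kind of infrastructure the app touches.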

What we’ve done is move the resilience problem up the stack. This doesn’t make it easier to solve; it just means we can use different techniques to solve it. And some companies do very well at this: the ones with the scale to make it work well (because any one failure is such a small percentage of the total) and the money to hire the smartypants developers who can solve the hard problems of distributed computing.

What Is Best In Life?

I’ve yet to see a detailed analysis of one approach over the other (meta-analysis really, because we need a broad sample of different people attempting both methods) to see which one actually works better/costs less. Because you might save a bunch on hardware, but then you have to pay expensive labour.

And the march of technology means that, for example, disks don’t fail as much now as they used to, because the manufacturers bake a lot of smart software to deal with hardware failure issues into the firmware that runs the hardware. Flash is basically magic because of the complex statistics required on-chip to figure out how many electrons have quantum-tunneled their way out of the cells today.

My intuition is that we’re getting better overall, but neither approach is clearly superior. The underlying principles of good design work with either: modular components, loosely coupled, permissive inbound but strict outbound, fault tolerance.

In short, it’s a systems problem, and needs to be managed as one.

It’s a hard problem, and should be treated with respect.
