TFD18 Prep: VMware

This is one of my traditional preparatory posts for Tech Field Day 18.


According to this blog by VMware, they’ll be presenting on vSAN and vSphere Health. It seems the focus will be on vSphere Health.

I don’t have any experience with this product, but I do have a long and fairly tedious history with monitoring and automation systems. I’m an ops guy at heart, really, since that’s where most stuff spends most of its useful life. I started my career in systems administration and helpdesk, so keeping things alive is baked into my worldview of IT.

I love the idea of vSphere Health. For decades, enterprises have been trying to build knowledge bases and configuration databases and expert systems and all kinds of ways of helping to fix stuff faster. If something breaks a lot, then we should write down the fix and maybe try to do preventative maintenance to stop it from breaking in the first place.

None of this is a new idea. If you want to see places that do it well, look for capital-intensive industries that rely on plant uptime to make money. Printing presses. Airlines. Factories. Power generation. Mines. Stuff where downtime on the big, expensive machine costs you millions of dollars a minute. Owners of these things tend to invest in regular maintenance and careful monitoring to try to detect problems before they take systems offline.

They’re not perfect, because we still have breakdowns (Fukushima, anyone?), and I’m sure the operations people at various plants have lots of complaints about how things could be done better, but they’re generally much better at it than a lot of IT shops. This is mostly down to practice and experience. They’ve had longer to build up a body of knowledge (and tools and processes to go with it) that mostly works. These systems get better by accumulation as you get more familiar with what good looks like.

IT Changes Rapidly

IT makes learning harder because of how fast things change relative to, for example, printing presses and gas turbines. Software can get replaced much more quickly than a billion-dollar power station, so you have to relearn where all the corner cases are. Operating the same plant for a decade or two gives you more time to figure out how stuff breaks, how to fix it when it does, and how to notice that it’s about to break.

But IT also provides advantages. Sensors collect data very quickly, it gets stored digitally on disk that keeps getting cheaper, and compute that keeps getting faster and cheaper lets software do more with the vast amount of data available. Computer-assisted analysis of failure conditions can uncover causes that are not obvious to human operators without decades of experience.

vSphere Health can share information about everyone’s environments, so you don’t have to personally experience a corner case before you can learn about what causes the problem, and how to fix it. We can all learn from one another’s misfortune, which improves the health of all of us.

That’s the theory, of course.

The reality is that computers are so unbelievably complex that the range of ways they can break increases constantly. The compatibility matrix keeps growing, and every time a new item is added, it’s another variable in an already astoundingly complicated polynomial to analyse.

Just Automate It

A lot of the outcomes from vSphere Health appear to be focussed on advising, and not on automatically fixing things. This blog from late last year, for example, shows that issues get detected but the outcome is to advise a human operator to take action.

I understand the reluctance to just apply fixes without an approval step where a human can veto the change. Updates can break things, particularly firmware updates that might be hard to undo. There are just so many moving parts that it’s virtually impossible to test things well enough to ensure that nothing will break.

And yet we’ve moved on from manual patching for most operating systems, particularly for security patches. I’m firmly on the side of aggressive automation of this kind of routine operational task. Yes, some systems are critical enough that manual oversight of changes is required, but humans tend to overestimate both their own ability to perform that oversight and how many systems really are critical. That leads to maintenance getting delayed, then delayed again, and then you wake up one day with out-of-maintenance Windows 2003 servers running critical services.

If you default to automation, then you free up resources to deal with the infrequent issues caused by bad patches. If a particular vendor is prone to releasing untested patches, then widespread backlash from customers is more likely to get them to clean up their act than lots of manual workarounds performed in isolation. Just look at Microsoft’s turnaround with respect to security.
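
To make the "default to automation" stance concrete, here is a minimal sketch of the policy in Python. It is not how vSphere Health works, and none of these names come from VMware: `Finding`, `triage`, `critical_hosts` and the host names are all hypothetical, invented for illustration. The only point is the shape of the rule: fixes get applied automatically unless a host has been explicitly marked as needing human approval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One detected issue plus its known fix (all fields are illustrative)."""
    host: str
    issue: str
    fix: str


def triage(findings, critical_hosts):
    """Split findings into fixes to apply automatically and fixes held for approval.

    Automation is the default; only hosts explicitly listed in
    critical_hosts are routed to a human.
    """
    auto_apply, needs_approval = [], []
    for finding in findings:
        if finding.host in critical_hosts:
            needs_approval.append(finding)
        else:
            auto_apply.append(finding)
    return auto_apply, needs_approval


# Hypothetical findings; the host names and issues are made up.
findings = [
    Finding("esx-01", "outdated driver", "update driver"),
    Finding("esx-02", "outdated driver", "update driver"),
    Finding("esx-payroll", "firmware behind baseline", "update firmware"),
]
auto_apply, needs_approval = triage(findings, critical_hosts={"esx-payroll"})

for finding in auto_apply:
    print(f"auto: {finding.host}: {finding.fix}")
for finding in needs_approval:
    print(f"hold for approval: {finding.host}: {finding.fix}")
```

The design choice that matters is which list is the exception. Here the critical list is the one you have to maintain by hand, so a host nobody has thought about gets patched rather than forgotten.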

This is about treating IT as a system with many supply chain components, most of which exist outside of your organisation. This system is far too complex for individual humans to manage without automated assistance, so it’s long past time we made full use of the tools available to us.

Disclosure: VMware has been a client of PivotNine in the past.
