Automation is all the rage in IT, and it’s something I’ve been reflecting on quite a lot lately. And by lately I mean 15 years or so.
The auto- prefix is Greek, and means self. Automatos means ‘acting of itself’. An automated system performs actions without external intervention: by itself. Automation is relatively simple to implement: Do this. When you’re done, there will be some sort of result that I can see or detect. Consider an automatic garage door opener, for example. I press a button, the door gets opened, but I don’t need to take any further action to make it happen.
The word autonomy is slightly different. Nomos means ‘law’, so autonomous means ‘self-governing’ and implies a feedback mechanism. The autonomic nervous system, for example, keeps things like your body temperature within certain limits to keep you alive without you having to think about it. An air-conditioner is a pretty simple form of autonomy: I tell my A/C to keep my house at 23°C and it will make that happen, turning on and off depending on whether it senses that the air in my house is too hot or too cold.
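To make the difference concrete, here is a minimal sketch of the two ideas in Python. Everything in it is made up for illustration: the Thermostat class, the 0.5 degree deadband, and the cooling-only behaviour are my assumptions, not how any real A/C works.

```python
from dataclasses import dataclass


def open_garage_door() -> str:
    """Automatic: acts once when asked and never checks the result."""
    return "door opening"


@dataclass
class Thermostat:
    """Hypothetical cooling-only controller; all numbers are illustrative."""
    setpoint_c: float = 23.0   # the target temperature from the text
    deadband_c: float = 0.5    # assumed tolerance, avoids rapid on/off switching
    cooling_on: bool = False

    def step(self, measured_c: float) -> bool:
        """Take one reading (the feedback) and decide whether to cool."""
        if measured_c > self.setpoint_c + self.deadband_c:
            self.cooling_on = True
        elif measured_c < self.setpoint_c - self.deadband_c:
            self.cooling_on = False
        # Inside the deadband, keep doing whatever we were already doing.
        return self.cooling_on


thermostat = Thermostat()
readings = [22.0, 23.2, 23.8, 23.3, 22.9, 22.4]
decisions = [thermostat.step(t) for t in readings]
print(decisions)
```

The garage door function takes no measurement at all, while the thermostat's decision depends entirely on what it senses. That dependence on a reading is the feedback loop.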
The two concepts go together, but the addition of feedback makes things vastly more complex. There is a rich literature on automatic control systems, and we studied them back when I was at uni doing electrical engineering in the mid-’90s. The basic concepts remain the same today, even though the methods for collecting information and providing feedback have advanced quite a lot.
What Good Looks Like
A key concept here is understanding whether or not a system is operating under control. When a system goes ‘out of control’ it can be damaged, get broken, or die.
Knowing if something is under control or not requires knowing what good looks like. Without a reference point to compare the current state of reality to, you can’t know if things are normal or not. Is wearing a coat normal? What about a hat? Flared trousers?
In most organisations, the system administrators form a large part of the feedback and control loop. They hold in their heads an idea of what good looks like. If reality starts to look different to what they think good is, then the system is deemed to be moving outside of control and some action is needed to put it back under control.
This could be CPU utilisation being ‘too high’ or storage space getting filled up. If the storage fills all the way up, the database runs out of space to store transaction logs so it stops, which means the e-commerce app it supports stops, and the business can’t take orders any more. In this—admittedly simplistic—example, you can see that the system can be more than just the IT components, and in a modern organisation you should be thinking in these terms.
But what is ‘too high’? Ideally all resources would be 100% utilised, because that’s the most efficient use of resources (which cost money to buy or rent). But then you have no slack to respond to changes: if everyone is busy all the time, there’s no room to deal with the unexpected. So, most people maintain some headroom for growth and change. But what is the right amount? Is 80% CPU utilisation too high? What about 50%?
Figuring out the answer to these questions is an optimisation problem. But, because things also change, you need some flexibility in the system to be able to cope with variance around the ideal, average level.
Hold on, an average level and variance… that sounds a lot like statistics. That’s right!
Statistical Process Control
Statistical process control comes from manufacturing, and if you think of modern IT as a factory, then you can port a lot of the same concepts across with surprisingly few changes. Your infrastructure (the servers, storage, networks, etc.) is the machinery in your factory. If it is operating within control, then metrics like CPU utilisation, storage used, etc. will not vary ‘too far’ from ‘normal’.
Normal means the average, but that requires understanding how you calculate the average. Averaged over what timeframe? Minute? Hour? Day? Too far also needs to be defined. How far is too far?
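The timeframe question is easy to see with a few numbers. This is a rough sketch using made-up CPU samples, and window_means is a hypothetical helper name of mine rather than anything from a monitoring product.

```python
from statistics import mean

# Made-up CPU utilisation samples (%), one per minute, with a short burst.
cpu = [30, 32, 31, 29, 90, 95, 92, 30, 31, 28, 30, 32]


def window_means(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        mean(samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]


print(window_means(cpu, 3))   # three-minute averages
print(window_means(cpu, 12))  # a single average over the whole period
```

The three-minute averages show the burst clearly, while the single twelve-minute average smooths it into a number that never actually occurred. Neither is wrong, but they give quite different pictures of ‘normal’.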
Generally you start with whatever the system is doing already, and measure the mean and standard deviation. If the metric is roughly normally distributed, it should stay within two standard deviations of the mean about 95% of the time. If you see a metric go outside that range more often than that, it implies that something has changed, and you need to look into it.
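As a minimal sketch of that idea, assuming you already have a baseline of samples taken while the system was behaving itself: the function names and the data are invented for illustration, and the two standard deviation threshold simply follows the rule of thumb above.

```python
from statistics import mean, stdev


def control_limits(baseline, k=2.0):
    """Return (lower, upper) limits at k sample standard deviations from the mean."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def out_of_limits(samples, limits):
    """Return the samples that fall outside the (lower, upper) limits."""
    lower, upper = limits
    return [x for x in samples if x < lower or x > upper]


# Made-up CPU utilisation (%) observed while things were 'normal'.
baseline = [40, 42, 38, 41, 39, 43, 40, 37, 41, 39]
limits = control_limits(baseline)

new_samples = [41, 39, 43, 58, 40]
print(limits)
print(out_of_limits(new_samples, limits))
```

A single excursion is not necessarily a problem, since some readings are expected to land outside the limits by chance. What matters is how often it happens.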
The challenge comes when the system needs to adapt to changing conditions without breaking. Once things start to move around, everything is more complex.
Consistency and Change
All of this statistical control work is designed to create a stable system that does the same thing over and over again, consistently. It might be “keeping body temperature at 37 degrees Celsius ±0.5 degrees” or it could be “there must be at least 5% free space on the storage”. But what if the external environment changes? If you go for a run, your heart rate increases, but once you stop running, your body adjusts and brings your heart rate back down. Well it should. If it doesn’t, see a doctor.
As alluded to above, normal depends a lot on context. What if payroll runs once a month on the 15th, spiking the CPU to 95% for three hours before settling back to 30% utilisation? That’s normal, but it could look like the system going out of control if your control system is too simple. Seasonality, as this sort of behaviour is called, can make controlling a system challenging.
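One simple way to cope with this is to compare each reading with what is normal for that point in the cycle, rather than with one overall average. The sketch below is deliberately crude: it works at day-of-month granularity (ignoring that payroll only runs for three hours), the data is invented, and the 15 point tolerance is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import mean


def baseline_by_day(history):
    """Build a per-day average from (day_of_month, cpu_percent) pairs."""
    by_day = defaultdict(list)
    for day, cpu in history:
        by_day[day].append(cpu)
    return {day: mean(values) for day, values in by_day.items()}


def is_unusual(day, cpu, baseline, tolerance=15.0):
    """Compare a reading with what is normal for that day of the month."""
    return abs(cpu - baseline[day]) > tolerance


# Three made-up months: payroll on the 15th, quiet otherwise.
history = [(d, 95 if d == 15 else 30) for _ in range(3) for d in range(1, 29)]
baseline = baseline_by_day(history)

print(is_unusual(15, 95, baseline))  # payroll day at 95%
print(is_unusual(16, 95, baseline))  # the same load a day later
```

The same 95% reading is unremarkable on the 15th and worth a look on the 16th, because the reference point changes with the calendar.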
If your system is autonomous rather than merely automatic then your control system will need to be able to deal with a certain amount of this complexity. The more complexity it can deal with, the more autonomous it can be.
But there are also longer-term trends that require adjustments. Maybe your factory is churning out lots of widgets that customers no longer want to buy. It no longer matters how well in control your factory is. If you look at the system in a broader context, it has moved out of control, because sales are dropping. Your factory, however complex its control system or however autonomous it might be, cannot adjust to this situation, because the information about whether or not it’s in control exists outside of the system.
Instead, you need another, larger system that can detect and adjust.
Are You Automated, Or Autonomous?
Have a think about the organisation you work for, and your place in it. How much of your work is automatic, versus autonomous? Are you self-governing, or merely self-acting? How far does your ability to self-govern extend? What feedback mechanisms do you use? Which feedback mechanisms are you a part of, and who do they serve?