This is a sponsored post, paid for by SolarWinds. The opinions expressed in this post are mine, and editorial control over the post remains mine.
SolarWinds asked me to take a look at the new Recommendations feature in the latest version of their Virtualization Manager (VMAN) product, and I must say it looks pretty good.
The overall look-and-feel of the Orion web console is clean and uncluttered, and the aesthetic has been carried through into the design of the Recommendations feature. You can have a look at a demo online and follow along with this review if you like.
Hopefully you’ve had a look at my post on Automation and Autonomy. It started as an aside to this one to provide some larger context, so do give it a read as it might explain where I’m coming from in this review.
Unlike some other tools, this one doesn’t overwhelm me with detail. I have a simple list of things that VMAN thinks are broken and need fixing, sorted by severity of the issue. VMAN has modelled what good looks like under the hood, and that’s what it’s using as the basis for its recommendations. This is gap analysis, basically, between the current state of the world and what VMAN thinks it should be. Each recommendation is the action VMAN suggests you take to move from the current state to the desired state.
I like that the description of what’s wrong is simple and easy to understand, as is the recommendation beside it. I’d like to see some sort of category icon added here as well so I could tell at a glance if it’s a CPU issue or RAM or storage or something else. Iconography is quite useful for that sort of thing.
I also like that you can click on the recommendation to get more detail about what’s going on. I’m a fan of systems that manage the tension between not enough detail (dumbed down) and too much detail (expert level only) and help you manage your learning curve. When you start using a system, you can’t deal with all the complex nuances of the full data model it uses, but as you become more familiar with the tool and the problem domain, you start wanting to know more about the details and can better appreciate subtleties.
Advisory, Not Autonomy
If you click on a recommendation, you can see more details about it. I really like the level of detail provided here, and the way it’s explained. You can see what the problem is, what steps should be taken to fix it, and what alerts will be resolved if the recommended action is followed.
You can also drill into the statistical detail of how this recommendation came to be. This helps you to understand how the internal model of VMAN is working, so you can see why it thinks the recommended action is a good idea. That’s important, because if you can’t inspect the internals of a system, you have no way of figuring out why it’s doing the wrong thing if it’s broken. This is one of the problems of recent attempts at machine learning systems: some people treat whatever comes out of the computer model as gospel, as if a computer is somehow magic. Computers are dumber than a bag of hammers, and if the model is flawed, they just make lots of mistakes really, really quickly.
I also really like the explanation of what the recommended action will do. If the action is moving a VM from one system to another, the UI displays the effect on both the source system and the target system. I’d like to see a visual representation of this effect on the chart, as it would make the impact more obvious than just a number, but this is still neat information to have.
You don’t have to agree with the recommendations VMAN provides. You can choose to ignore one for now (via the More Actions dropdown), and have it pop up again later only if the conditions that led to the recommendation are still valid. That might be a good idea if you happen to know something that VMAN doesn’t, like some temporary maintenance that’s going on right now. Some other alerting systems have a ‘pause’ type button you can turn on while scheduled maintenance is occurring, but they can be dangerous if you forget to turn alerting back on again once maintenance has completed. Ideally you want the system to start alerting again all by itself, so this temporary ignore mechanic is a good idea.
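To make the difference from a manual ‘pause’ button concrete, here’s a rough Python sketch of an ignore that expires by itself. This is not how VMAN implements it; `Snooze` and `should_show` are names I’ve made up purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Snooze:
    """Hypothetical record of a temporary ignore with a fixed expiry."""
    until: datetime

    def is_active(self, now: datetime) -> bool:
        return now < self.until


def should_show(condition_still_true: bool,
                snooze: Optional[Snooze],
                now: datetime) -> bool:
    """Show a recommendation unless it is snoozed or no longer applies."""
    if snooze is not None and snooze.is_active(now):
        return False  # still inside the ignore window
    # The snooze has lapsed (or never existed): nobody has to remember
    # to switch anything back on, the condition alone decides.
    return condition_still_true


start = datetime(2024, 1, 1, 12, 0)
maintenance = Snooze(until=start + timedelta(hours=4))
```

The point of the design is that the expiry is part of the ignore itself, so forgetting about it is harmless.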
If you do agree, though, you can apply the action now, or at a later, scheduled time. That’s quite handy if you have change control and change windows and that kind of thing. You could queue up all the actions to be automatically applied during the change window and not waste a bunch of human clicking time, which will help keep change windows small.
A subtle aspect here is that you, the human, are making decisions but leaving the details of how the action is performed up to the computer. Computers are very good at doing exactly the same thing over and over again, but humans are pretty bad at it. They forget things, and make mistakes. Putting something like VMAN in charge of dealing with those details is a good idea.
A Matter Of Trust
If we trusted VMAN enough, we would want to be able to cede control of taking these actions to VMAN, which would convert it from an advisory system into a more autonomous one. VMAN doesn’t have that ability yet, but it’s not complex to remove the human confirmation step from the feedback loop here.
However, the human processes that surround such a system are much more complex. You would need to trust that the decisions VMAN is making are ones that you consider to be appropriate. Being able to look into the system and see why it made the recommendation helps a lot here.
The history feature is also great for this. Right now, you can use the history to see who approved which action, why it was run, and whether or not it was performed successfully. If someone approved an action that shouldn’t have been run, you can see it and learn from it. If VMAN is giving bad recommendations, you can also trace them through the history, try to understand why it’s happening, and tune the system.
In future, if you have a highly tuned system that you trust will work well, you will be more likely to give the system greater autonomy and let it look after itself.
More Work To Be Done
This is the first incarnation of the recommendation engine in VMAN, and I expect that it’ll become more sophisticated over time. I hope that SolarWinds build it so that you can turn on greater levels of sophistication as you get used to using it. Jumping straight into a Formula 1 car if you’ve never driven before isn’t a great plan for anyone involved.
One nit is that there doesn’t seem to be a difference between ‘active’ and ‘predicted’ recommendations in the left filter mechanism. Apparently this is a known bug, so expect it to be fixed pretty quickly.
The statistical charts look a lot like SPC control charts, but they’re not quite right. There are warning and error thresholds, but they’re upper limits, not statistical boundaries both above and below a mean value. The documentation says that you can configure dynamic baseline thresholds, and VMAN will automatically set thresholds 2 and 3 standard deviations above the mean, but that’s not enough. You really want to know if the metric goes above or below the mean value by a lot, because either case could indicate the system is moving out of control. A sudden drop in CPU or memory utilisation could indicate that an important VM is offline, for example.
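Here’s a minimal Python sketch of the two-sided check I have in mind. It is my own illustration, not VMAN’s algorithm: the function names and the sample CPU figures are invented, and it simply places warning and error limits 2 and 3 standard deviations either side of the baseline mean.

```python
from statistics import mean, stdev


def control_limits(baseline, warn_sigma=2.0, error_sigma=3.0):
    """Return two-sided warning and error limits around the baseline mean."""
    centre = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    return {
        "mean": centre,
        "warn": (centre - warn_sigma * sigma, centre + warn_sigma * sigma),
        "error": (centre - error_sigma * sigma, centre + error_sigma * sigma),
    }


def classify(value, limits):
    """Flag a sample that strays too far from the mean in EITHER direction."""
    low_error, high_error = limits["error"]
    low_warn, high_warn = limits["warn"]
    if value < low_error or value > high_error:
        return "error"
    if value < low_warn or value > high_warn:
        return "warning"
    return "ok"


# Invented baseline: CPU utilisation (%) hovering around 50.
baseline_cpu = [52, 48, 50, 51, 49, 50, 53, 47, 50, 50]
limits = control_limits(baseline_cpu)

for sample in (50, 54, 90, 5):
    print(sample, classify(sample, limits))
```

With upper-only thresholds, the drop to 5% would pass silently. With limits on both sides, it is flagged just like the spike to 90%.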
I’d also like to see multiple options offered as recommendations. Often there’s more than one way to solve a problem, and it might not be obvious to the computer which is the ‘best’ one. However, if there are multiple options and I make the system fully autonomous, how will it decide between those options to pick the ‘best’ one? If I can’t describe how to make that decision as an algorithm, then I can’t tell the computer (or anyone else) how to make the same decision I would.
And if greater autonomy is added, I’d like for the area of autonomy to be granular, or modular, depending on how much I trust that I’ve tuned the system well. Maybe I think the system will work nicely for keeping CPU and RAM allocated well, but storage use is highly variable and difficult to predict, so I wouldn’t want those recommendations applied automatically.
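As a sketch of what granular autonomy could look like, here’s a hypothetical per-category policy in Python. None of this exists in VMAN today as far as I know; the category names, the `Recommendation` class and the `handling_for` function are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Hypothetical recommendation: a resource category and a proposed action."""
    category: str
    action: str


# Hypothetical policy: apply CPU and RAM fixes automatically,
# but keep a human in the loop for storage.
AUTONOMY_POLICY = {
    "cpu": "auto",
    "memory": "auto",
    "storage": "advise",
}


def handling_for(rec, policy=AUTONOMY_POLICY):
    """Return 'auto' or 'advise'; unknown categories fall back to 'advise'."""
    return policy.get(rec.category, "advise")


print(handling_for(Recommendation("cpu", "move VM to a less busy host")))
print(handling_for(Recommendation("storage", "move VM to another datastore")))
```

Defaulting anything unlisted to ‘advise’ means a new kind of recommendation never becomes autonomous until I’ve explicitly decided to trust it.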
Being able to gradually add more and more autonomy to the system in stages means it could keep pace with me and how well I understand my environment. Ideally using a tool like VMAN will help me and my team to understand what good looks like and how to keep the system operating smoothly. Eventually, it could be dealing with most of the fine-tuning required to keep everything humming along and most of my work could be spent on building higher-order systems that themselves gradually become autonomous.
That’s the dream, anyway.