NetApp Active IQ Adds Machine Learning to Autosupport

This is part of a series of posts related to Tech Field Day 18.

NetApp has added some machine learning magic it calls Active IQ, which uses its large Autosupport dataset as a starting point.

NetApp has been gathering data about its fleet of storage systems for many years, “for two decades at least,” according to Shankar Pasupathy, Technical Director, AI and Data Engineering, Active IQ. I’m reasonably familiar with an older version of Autosupport from my days as a storage admin and system architect.

About ten years ago I wrote a tool for analysing Autosupport data to check for deviations from a known-good configuration, warn of potential capacity growth issues, that sort of thing. It was a fairly simplistic tool, in my view, but it was amazing how much detecting simple configuration drift helped us to prevent problems.
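My old tool is long gone, but the core of a drift check like that fits in a few lines. Here's a minimal sketch in Python; the option names and the "golden" baseline values are invented for illustration, not anything from a real Autosupport payload:

```python
# Minimal sketch of a configuration drift check, in the spirit of my old
# Autosupport audit tool. Option names and golden values are hypothetical.
GOLDEN_CONFIG = {
    "autosupport.enable": "on",
    "raid.scrub.enable": "on",
    "snapmirror.checksum": "on",
}

def find_drift(system_config: dict) -> list:
    """Return (option, expected, actual) tuples where a system
    deviates from the known-good baseline."""
    drift = []
    for option, expected in GOLDEN_CONFIG.items():
        actual = system_config.get(option, "<unset>")
        if actual != expected:
            drift.append((option, expected, actual))
    return drift

if __name__ == "__main__":
    observed = {"autosupport.enable": "on", "raid.scrub.enable": "off"}
    for option, expected, actual in find_drift(observed):
        print(f"DRIFT: {option} expected={expected} actual={actual}")
```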

NetApp is now collecting 400 terabytes of telemetry data each month, and has 15 petabytes of data to draw on (with about 4 petabytes in a “hot data lake”). This is the kind of sizable dataset that lends itself to the more data-hungry statistical techniques collectively referred to as machine learning. NetApp has long performed predictive analysis of disk failures, shipping replacement disks to customers before the disks actually fail, which helps to keep customer storage arrays online and smooths out NetApp’s supply chain. Now it’s added some more sophisticated analyses to help administrators.
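NetApp hasn't said how its disk failure prediction works under the hood, but the general shape of that kind of analysis is a classifier trained on per-disk telemetry. A hedged sketch of the idea, with entirely invented features and synthetic data, not NetApp's actual model:

```python
# Hedged sketch of predictive disk-failure analysis: train a classifier on
# per-disk telemetry counters (the features here are invented) and flag
# disks whose predicted failure probability crosses a threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic telemetry: [reallocated_sectors, media_errors, age_months]
healthy = rng.normal([5, 1, 24], [3, 1, 12], size=(500, 3))
failing = rng.normal([40, 15, 48], [10, 5, 12], size=(50, 3))
X = np.vstack([healthy, failing])
y = np.array([0] * 500 + [1] * 50)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

candidate = np.array([[35.0, 12.0, 50.0]])  # a disk that looks unhealthy
p_fail = model.predict_proba(candidate)[0, 1]
if p_fail > 0.5:
    print(f"Pre-emptively ship a replacement (p_fail={p_fail:.2f})")
```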

NetApp is using natural language processing (NLP) to parse existing documentation and essentially reverse engineer an expert system from the support knowledge base. This saves humans from having to go through the docs and manually create audit rules like “if option A is turned on, option B should be off”. Manually coding these rules is what I was building into my audit system ten years ago, so I am somewhat bemused that NetApp is finally catching on to the idea, but hey, progress is progress.
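Real NLP over a knowledge base is a lot more involved than this, but you can get a feel for the idea with crude pattern matching. A toy sketch, where the sentence shapes and knowledge-base text are illustrative rather than NetApp's actual pipeline:

```python
# Crude sketch of mining audit rules from support docs. Real NLP would be
# far more robust; this just pattern-matches one sentence shape from
# made-up knowledge-base text to produce machine-checkable rules.
import re

KB_TEXT = """
If option flexscale.enable is set to on, option flexscale.lopri_blocks
should be set to off. If option wafl.optimize_write_once is set to off,
option wafl.group_cifs is set to on.
"""

RULE_PATTERN = re.compile(
    r"[Ii]f option (\S+) is set to (\w+),\s*option (\S+)\s*"
    r"(?:should be|is) set to (\w+)"
)

rules = []
for a_opt, a_val, b_opt, b_val in RULE_PATTERN.findall(KB_TEXT):
    rules.append({"when": (a_opt, a_val), "expect": (b_opt, b_val)})

for rule in rules:
    print(f"when {rule['when']} expect {rule['expect']}")
```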

NetApp is also using a technique called association rule learning (though they call it associative rule mining) to predict the likelihood of a disruption if certain risks exist. The goal is to help guide human operators on where to focus their efforts. Generally you want to focus limited resources on the most important or urgent problems. If a filer is likely to go down in the next 24 hours, I probably want to look at that problem first, rather than the potential for running out of disk in three months. The benefit of machine learning here is that statistics can notice obscure correlations that humans might miss. This isn’t always good, as sometimes the correlations are meaningless, but sometimes they can bring up something important that the humans have overlooked.
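To make the technique concrete: association rule learning boils down to computing support, confidence, and lift for candidate rules over a pile of “transactions”. A toy sketch with invented risk flags and data, showing the standard arithmetic rather than NetApp's implementation:

```python
# Toy association rule mining over fleet telemetry. Each "transaction" is
# the set of risk flags seen on a system, plus whether it suffered a
# disruption. Risk names and data are invented; the support/confidence/lift
# arithmetic is the standard technique.
from itertools import combinations

transactions = [
    {"old_firmware", "high_cpu", "disruption"},
    {"old_firmware", "disruption"},
    {"high_cpu"},
    {"old_firmware", "high_cpu", "disruption"},
    {"full_aggregate"},
    {"old_firmware", "high_cpu"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Score rules of the form {risk(s)} -> disruption
risks = {"old_firmware", "high_cpu", "full_aggregate"}
for r in (1, 2):
    for antecedent in combinations(sorted(risks), r):
        a = set(antecedent)
        if support(a) == 0:
            continue  # antecedent never occurs; rule is meaningless
        confidence = support(a | {"disruption"}) / support(a)
        lift = confidence / support({"disruption"})
        print(f"{antecedent} -> disruption  conf={confidence:.2f} lift={lift:.2f}")
```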

NetApp is also using a (quite simple) technique called k-means clustering to figure out if a given upgrade will fix certain problems (or remove certain risks). This is based on looking at what happened to other systems in a similar situation to yours before and after an upgrade. It’s a bit like watching a lot of other people eat willow bark. If a lot of them say their headache went away, it’s reasonably likely that your headache will also go away if you chew on some primitive aspirin. It’s also a bit like watching other people eat the mushrooms you just picked, and if they don’t die, you can assume it’s likely safe for you to eat the mushrooms, too.
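Here's my guess at the shape of that analysis, purely as illustration: cluster systems on telemetry features, then check outcomes among the peers in your cluster. The features and numbers are synthetic, and this is not NetApp's code:

```python
# Hedged sketch of the k-means idea: cluster systems on telemetry features,
# then look at how often systems in *your* cluster had a given risk cleared
# by the upgrade. All features and outcomes here are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic fleet: [iops_k, capacity_used_pct, cpu_busy_pct]
fleet = rng.normal([50, 70, 40], [20, 15, 15], size=(300, 3))
risk_fixed_by_upgrade = rng.random(300) < 0.7  # did the upgrade clear the risk?

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(fleet)

my_system = np.array([[55.0, 75.0, 45.0]])
my_cluster = km.predict(my_system)[0]

peers = km.labels_ == my_cluster
fix_rate = risk_fixed_by_upgrade[peers].mean()
print(f"Among {peers.sum()} similar systems, the upgrade cleared "
      f"the risk for {fix_rate:.0%} of them")
```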

Isn’t statistics fun?

NetApp is working to extend this approach to do some more sophisticated analysis. They’re working on creating recommendations for optimal configuration settings for different workloads, based on the experiences of users across the installed fleet. This is useful because it’ll be based on actual data and not the theoretical lab conditions used in benchmarks, conditions that largely don’t exist out in the real world. This isn’t to suggest that benchmarks aren’t useful—they are—but this real-world data can work in concert with the basic theory from the lab to find practical and workable solutions. It’s a bit like taking willow bark, isolating the active ingredient, and then checking to see what the safe dose is for people who are also taking a variety of other medications at the same time.
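One plausible way to build such a recommender, again purely as illustration: find your nearest workload peers across the fleet, keep the best performers among them, and surface the setting they most commonly use. Every feature, setting name, and number here is invented:

```python
# Hedged sketch of fleet-driven configuration recommendations: find systems
# whose workload profile resembles yours, keep the best performers among
# them, and recommend the config setting they most commonly use.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Workload profile: [read_pct, random_pct, avg_io_kb]
profiles = rng.uniform([0, 0, 4], [100, 100, 256], size=(200, 3))
latency_ms = rng.gamma(2.0, 2.0, size=200)                     # lower is better
settings = rng.choice(["small", "medium", "large"], size=200)  # e.g. cache size

nn = NearestNeighbors(n_neighbors=20).fit(profiles)

mine = np.array([[80.0, 60.0, 8.0]])
_, idx = nn.kneighbors(mine)
peers = idx[0]

# Among my 20 nearest workload peers, keep the 5 with the best latency
best = peers[np.argsort(latency_ms[peers])[:5]]
recommended, votes = Counter(settings[best]).most_common(1)[0]
print(f"Recommended cache setting: {recommended} ({votes}/5 top peers use it)")
```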

I think this is an excellent, if modest, beginning to help humans manage quite complex systems. There are a bunch of things I’d like to see on the platform, such as industry benchmarking so you can see how well utilised your systems are compared to peers of similar size and complexity in the same or similar industries, or perhaps different industries. Too many customers are isolated from what’s really happening out there in the world and are hungry for this kind of information. A trusted partner like NetApp that has this kind of information could provide a valuable service by helping below-average customers to improve, and thus lift the performance of entire industries. Customers are also a little too keen to view themselves as special snowflakes who don’t need to follow the rules, so some hard data from an external and (relatively) impartial source (assuming the source and the data can be trusted), showing that the people who follow the rules are doing better than you, might help customers to stop hitting themselves quite so much.
