For some time now, Riverbed has been repositioning itself from the WAN acceleration company we know to more of a network awareness and insight company.
The latest presentation, at Tech Field Day 23, was largely about how Riverbed can see all the data flows, collect information about a great many things, and then analyse and present that data in ways that would be useful.
The rise of machine learning and AI has supplanted the rhetoric about Big Data that you may recall from a few years ago. The general Big Data pitch went something like this:
- Collect data.
- ???
- Profit!
The Underpants Gnomes business model was pretty obvious, but then people figured out how to take massive pools of data and build genuinely useful things with AI/ML techniques that require massive amounts of data to work, such as artificial neural networks (ANNs). Having massive piles of otherwise useless data lying around became useful in a few specific areas, like image classifiers.
AI/ML came along just in time, because the shine was starting to wear off the Big Data hype and we needed something to replace it in order to keep the speculative investment money flowing.
And it was supported by the massive explosion in profitable exploitation of security flaws. The one-two punch of ransomware and cryptocurrency has made extortion very profitable and relatively low risk. A substantial problem with a decent price tag means there’s a large addressable market for vendors to go after, and collecting lots of data is held up as a major requirement for success.
What we have now is people creating huge piles of data in the hope that it'll somehow be useful, either because you can stir the pile with the big stick of AI/ML until insight starts to congeal, or because you can sift through it after you get pwned and discover, long after it's too late, which simple security measure would have prevented the breach.
Sarcasm aside, there are genuinely useful reasons to collect data, but figuring them out takes work and people generally don’t like to do that. It’s much easier to just try to collect everything so you don’t miss some vital source of data. This is an entirely rational response to a complex and volatile world. You can’t go back in time to collect the data you need but don’t have.
But this data hoarding isn’t without risks and costs. If you’re trying to find needles, building bigger haystacks makes the challenge worse. More data means more noise. More data means more of it is badly labelled, or biased, or in the wrong format, or in the wrong place. More data means more to leak or steal.
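The needles-and-haystacks point is really just base-rate arithmetic. A minimal sketch (with entirely hypothetical numbers) of why collecting more events makes alerts less trustworthy, even with a detector that's right 99% of the time:

```python
# Illustrative sketch, hypothetical numbers: at a fixed false-positive
# rate, every extra event collected adds alert noise, so the fraction
# of alerts pointing at real incidents shrinks as the haystack grows.

def alert_precision(events, true_incidents, false_positive_rate):
    """Fraction of raised alerts that correspond to real incidents."""
    false_alarms = (events - true_incidents) * false_positive_rate
    return true_incidents / (true_incidents + false_alarms)

# The same 10 real incidents, buried in ever-larger piles of collected
# events, scanned by a respectable 1% false-positive detector.
for events in (1_000, 10_000, 100_000):
    p = alert_precision(events, true_incidents=10, false_positive_rate=0.01)
    print(f"{events:>7} events -> alert precision {p:.1%}")
```

Growing the pile a hundredfold doesn't change the number of needles; it just means the overwhelming majority of alerts are now hay.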
A simplistic “just keep everything” approach wasn’t a good idea in the Big Data era and it’s not a good idea now. Every day we see yet another naive team build an automated phrenology machine because they mistook correlation for insight.
As statistician George E. P. Box said, "Essentially, all models are wrong, but some are useful." While it's true that generating more models means you might find some more useful ones, you will also find many, many more that are wrong. And now you have a new problem: you have to build a system for figuring out which models are wrong, and which are useful.
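The "automated phrenology machine" failure mode is easy to reproduce. A small stdlib-only sketch of why generating lots of models guarantees some will look impressive by chance alone, which is exactly why you then need a vetting step such as a held-out sample:

```python
# Illustrative sketch: fit many random "models" to pure noise and the
# best of them still looks impressive in-sample. Correlation is not
# insight; with enough tries, chance produces "strong" correlations.
import random

random.seed(42)

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 20 noisy observations of something, and 1,000 candidate "models"
# that are nothing but random numbers -- all of them wrong by design.
target = [random.gauss(0, 1) for _ in range(20)]
models = [[random.gauss(0, 1) for _ in range(20)] for _ in range(1_000)]

best = max(abs(correlation(m, target)) for m in models)
print(f"best of 1,000 useless models correlates at |r| = {best:.2f}")
```

The best of a thousand meaningless models will typically show a correlation most dashboards would happily call a finding, which is the whole argument for spending effort on figuring out which models are wrong.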
What Is Useful?
This is the area I'd like vendors to concentrate on a little more: less about the speeds and feeds of how many data points they collect or how many models they can create, and more about which problems they solve, and how.
You are selling to human beings, so how does your thing make their lives better? Generic claims of "you will get greater insight" aren't helpful. Insight into what? How, exactly, is this insight valuable, and how is it obtained? What effort is removed? What value is provided, and to whom?
These are the answers we need, and the machines can’t tell us until we learn to ask them the right questions.
After all, Deep Thought could compute that the answer to life, the universe, and everything was 42; the hard part was knowing the question.