In recent days, I’ve noticed that a couple of monitoring products on the market are still doing something dumb. Cacti is one, and NetApp’s Operations Manager (née DFM) is another. They share the same flaw: they both poll everything in their database all at once.
Cacti polls every single thing in its database every 5 minutes. A quick look at the Cacti forums suggests this is because Cacti uses cron as its scheduler, which runs cactid, which polls everything. A client of mine has an environment with 35,000+ elements, so every 5 minutes Cacti dutifully floods the network with SNMP polls and then spends the next 4+ minutes processing all the data into RRD. They just upgraded the box to a bigger one they had sitting idle, but it can’t cope either, so now they’re going to shell out five figures for new hardware.
Ops Manager is slightly less dumb, in that it partitions its polling into different areas: disk information every 4 hours, vFiler information every 8 hours, CPU stats every 5 minutes. This means that at certain times the polling intervals line up, and Ops Manager polls for everything it gathers all at once, for every Filer it monitors. We’ve seen this in our environment as a significant spike in CPU on all our Filers (and in SNMP traffic, which is what tipped me off) every 4 hours. My guess is that the snmpd process in ONTAP is single threaded.
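To see why the intervals line up, here’s a minimal sketch (the interval values are the ones described above; the function and names are illustrative, not anything from Ops Manager itself). With no offset, every schedule fires at multiples of its interval from the same starting point, so the 5-minute and 4-hour polls coincide every 4 hours, and all three coincide every 8:

```python
# Hypothetical polling intervals in minutes, as described for Ops Manager:
# CPU stats every 5 min, disk info every 4 h (240 min), vFiler info every 8 h.
INTERVALS = {"cpu": 5, "disk": 240, "vfiler": 480}

def polls_due(minute, intervals=INTERVALS):
    """Return which poll types fire at a given minute when every
    schedule starts together at t=0 with no offset."""
    return [name for name, ivl in intervals.items() if minute % ivl == 0]

# At t=240 the CPU and disk polls coincide; at t=480 all three do,
# which is exactly the periodic spike described above.
```

A random per-schedule offset would break up these coincidences, which is the fix described below.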
There are other products, commercial and otherwise, that do the same thing. Some of them cost high six and sometimes seven figures in licensing costs alone. All of these products have a design flaw that amounts to a Denial of Service attack on your own network, and this is standard operating procedure.
seafelt doesn’t do this. In fact, seafelt hasn’t done this since version 1: this kind of ‘bulk poll everything and flood the network/server’ behaviour is something we specifically decided against from the very beginning. Instead, seafelt uses ‘random offset polling’ when scheduling its polls. Whenever you start a seafelt poller, it loads the configuration database of all enabled elements it wants to poll, then staggers each element’s polls from a random point somewhere within its configured polling interval.
Here’s an example. Say you have 3 kinds of elements, called element types in seafelt-speak. Each element type has a default polling interval: elemtype 1 polls every 5 minutes, elemtype 2 every 15 minutes, and elemtype 3 every hour. When building the polling schedule, seafelt picks a random point within each element’s interval for its first poll. All the elements of type 1 thus get polled for the first time somewhere in the first 5 minutes, type 2 within 15 minutes, and type 3 within an hour. Each subsequent poll is then made 5, 15 or 60 minutes after the previous one.
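The scheme above can be sketched in a few lines. This is a hedged illustration of the idea, not seafelt’s actual code; the element-type names and function signatures are made up for the example:

```python
import math
import random

# Default polling intervals in seconds for three hypothetical element
# types (mirroring the 5-minute / 15-minute / 1-hour example above).
INTERVALS = {"elemtype1": 300, "elemtype2": 900, "elemtype3": 3600}

def first_poll_times(elements, now=0.0, rng=random):
    """Stagger each element's first poll at a random offset inside its
    own interval, so the aggregate load spreads across the period
    instead of landing all at once."""
    return {name: now + rng.uniform(0, INTERVALS[etype])
            for name, etype in elements.items()}

def next_poll(first, interval, now):
    """Next poll time at or after `now`, repeating every `interval`
    seconds from the element's randomly-offset first poll."""
    if now <= first:
        return first
    return first + math.ceil((now - first) / interval) * interval
```

Because each element keeps its own random offset forever, the polls never re-synchronise: the load stays flat instead of spiking whenever intervals happen to align.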
This spreads the polling load more or less evenly over the entire day, so you won’t get massive peaks in polling traffic, or the consequent database load from having to process all the data at the same time (like Cacti). You won’t get big spikes when all the polling intervals line up every 4 hours (like Operations Manager). So you won’t flood your network (and have to buy more network kit to handle the frequent peaks), you won’t flood the server (data arrives at a roughly constant rate), and you won’t flood the endpoint gear (which should be spending its time serving data to customers, not responding to a ping-flood).
So if you’d rather buy hardware to service your business, not your monitoring software, upgrade to seafelt today.