Troubleshooting Information Overload
Let’s face it: most of us in IT spend entirely too much time in firefighting mode. We talk a great game about being proactive and keeping ahead of issues, monitoring our systems for utilization and capacity so we can schedule upgrades before things bog down. But then, with our phones beeping almost non-stop and email demanding a check every 30 seconds for critical alerts, we end each day further behind and spend half our free time checking our smartphones to make sure nothing has gone down. It is all too easy to fall into the trap of monitoring so much that you drown in the information. That leads to the even more dangerous condition of missing the alerts that really are critical because they get lost in a sea of noise, or, worse still, creating rules to sort the alerts and then never bothering to check them on a regular basis. It is all too common for me to find clients who say they have great monitoring and alerting, but when I ask them to show me what they are doing, they show me their Outlook client with a massive tree of folders, dozens of rules, and thousands of unread messages.
When your monitoring and reporting solution overwhelms you with noise, it is human nature to simply ignore it; you’d never get anything done otherwise. And while you may have the best intentions of reviewing all those folders, far too often the only time you actually dig through them is to find the one alert email you should have seen before something became a critical failure. Your monitoring has put you into information overload, and it’s time to troubleshoot your way out of that mess.
How much is too much?
Real-time alerts, meaning those the monitoring system sends you as soon as it detects a condition that requires immediate attention, should be kept to a minimum. Service failures, low disk space warnings, failed backups (that don’t automatically reschedule themselves to try again), virus detections, privileged account lockouts – these are the things you really should be looking at quickly. Anything else is noise. The first thing to do is sit down as a team with a representative day’s worth of alert messages and identify the ones that don’t need immediate attention. Anything that can be safely ignored or looked at later shouldn’t be an immediate alert. Informational messages are the same way; if you really need to know about every success, then you need a NOC or a dedicated monitoring team. The idea here is to weed out all the noise so that when your phone buzzes in the middle of a meeting, it is only because there’s something you really need to look at.
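To make that triage concrete, here is a minimal sketch in Python of the kind of routing rule that exercise produces: a short list of conditions that page someone immediately, with everything else deferred to a daily summary (more on those below). The category names are hypothetical examples, not anything your monitoring tool will use out of the box.

```python
# Minimal sketch: only a short, agreed-upon list of conditions pages someone
# right away; everything else is deferred to the morning digest.
# The category names below are hypothetical examples.

IMMEDIATE_CATEGORIES = {
    "service_failure",
    "low_disk_space",
    "backup_failed_no_retry",
    "virus_detected",
    "privileged_account_lockout",
}

def route_alert(alert: dict) -> str:
    """Return 'page' for alerts that need attention now, 'summary' otherwise."""
    if alert.get("category") in IMMEDIATE_CATEGORIES:
        return "page"      # send to the on-call phone right away
    return "summary"       # queue for the next morning's digest

# Example:
print(route_alert({"category": "virus_detected"}))   # -> page
print(route_alert({"category": "cpu_above_80pct"}))  # -> summary
```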
Who’s on deck?
Another common problem I see is alerts that go to a distribution list, where everyone assumes someone else has it covered. D/Ls are the right thing to use for alerts, but you need to set up a rotation for who is the first responder and who is the backup, and when an alert comes in, whoever is actually going to respond needs to reply-all to say they have it. That way everyone knows it is being taken care of, and you don’t have two (or more) people trying to do the same thing.
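If your tooling doesn’t handle rotations for you, even something as simple as this Python sketch can answer “who’s on deck this week?” The names and the rotation start date are placeholders.

```python
# Minimal sketch of a weekly first-responder rotation.
# Names and the start date are hypothetical placeholders.
from datetime import date

ROTATION = ["alice", "bob", "carol", "dave"]   # order of the rotation
ROTATION_START = date(2024, 1, 1)              # week 0 of the rotation

def on_call(today=None):
    """Return (primary, backup) for the week containing `today`."""
    today = today or date.today()
    week = (today - ROTATION_START).days // 7
    primary = ROTATION[week % len(ROTATION)]
    backup = ROTATION[(week + 1) % len(ROTATION)]
    return primary, backup

print(on_call(date(2024, 1, 15)))   # third week -> ('carol', 'dave')
```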
But it’s during scheduled maintenance
If you have maintenance windows, make sure your monitoring system is configured to stop alerting during that window. Whether you are doing system upgrades, patching, recabling, or anything else, you don’t want alerts waking people up over reboots everyone knew were coming.
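Most monitoring products have scheduled downtime built in, but if yours doesn’t, the logic is as simple as this sketch, which assumes a hypothetical Saturday 02:00–06:00 window and a placeholder paging function.

```python
# Minimal sketch of suppressing alerts during a maintenance window.
# The window (Saturday 02:00-06:00) and send_page() are hypothetical.
from datetime import datetime

def in_maintenance_window(now=None):
    now = now or datetime.now()
    return now.weekday() == 5 and 2 <= now.hour < 6   # Saturday, 02:00-06:00

def send_page(message):
    print("PAGE:", message)   # stand-in for however your system pages people

def maybe_alert(message, now=None):
    if in_maintenance_window(now):
        return                # expected reboots and patching: stay quiet
    send_page(message)

maybe_alert("host db01 not responding", datetime(2024, 6, 8, 3, 15))  # suppressed
```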
Oh yeah, you can ignore that, I rebooted
Look for monitoring systems that have a really simple pause button, and make sure you press it before doing something that would trigger an alert, like restarting a service or rebooting a server. You don’t want others responding to a perceived service failure when you are actively working on the box; it’s that sort of “boy who cried wolf” alert that teaches people to start ignoring alerts altogether.
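If your tool lacks that button, you can fake one. This sketch uses a hypothetical flag file per host; real products usually call the same idea “downtime” or “acknowledgement.”

```python
# Minimal sketch of a manual pause flag: drop a marker before planned work
# on a host and have the alerting path check it first.
# The directory and host names are hypothetical.
import os

PAUSE_DIR = "/tmp/alert-pause"

def pause(host):
    os.makedirs(PAUSE_DIR, exist_ok=True)
    open(os.path.join(PAUSE_DIR, host), "w").close()

def resume(host):
    try:
        os.remove(os.path.join(PAUSE_DIR, host))
    except FileNotFoundError:
        pass

def is_paused(host):
    return os.path.exists(os.path.join(PAUSE_DIR, host))

pause("web01")               # before rebooting web01
print(is_paused("web01"))    # True -> alerts for web01 get swallowed
resume("web01")              # don't forget this step when you're done
```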
PING doesn’t mean all is well
Pinging a box to make sure it is online and reachable is great, but that doesn’t tell you anything about the running services. Implement monitors that actually exercise the services: run a query, submit an HTTP GET, log on, check mail, and so on. I’ve seen hard-crashed servers whose NICs still responded to pings, so don’t rely on ping alone to be sure everything is up.
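Here is a minimal sketch of what “actually test the service” can look like using only the Python standard library: a TCP connect for a mail server and an HTTP GET against a health page. The host names and URL are hypothetical.

```python
# Minimal sketch: check the service, not just the host.
# Host names and the health-check URL are hypothetical examples.
import socket
import urllib.request

def tcp_check(host, port, timeout=5):
    """True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_check(url, timeout=5):
    """True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

print(tcp_check("mail01.example.com", 25))                  # is SMTP even listening?
print(http_check("https://intranet.example.com/health"))    # does the app answer?
```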
Daily summaries are your friend
Remember all those extra alerts we weeded out in the first step? Those should be moved to daily summaries that hit the team’s inboxes first thing in the morning. Once you are logged on and down to business, team members should take turns reviewing the summary so that the things that could wait until the next day still get the attention they need.
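Building that digest doesn’t have to be fancy. This sketch just counts the deferred alerts by category and formats one message; how you actually deliver it, and to which distribution list, depends on your environment.

```python
# Minimal sketch of a daily digest of the alerts that were routed to
# 'summary' instead of paging anyone. Category names are hypothetical.
from collections import Counter

deferred_alerts = []   # 'summary'-routed items collect here during the day

def add_to_summary(alert):
    deferred_alerts.append(alert)

def build_daily_summary():
    counts = Counter(a["category"] for a in deferred_alerts)
    lines = [f"{count:4d}  {category}" for category, count in counts.most_common()]
    return "Overnight alert summary:\n" + "\n".join(lines)

add_to_summary({"category": "cpu_above_80pct"})
add_to_summary({"category": "cpu_above_80pct"})
add_to_summary({"category": "disk_growth_trend"})
print(build_daily_summary())   # paste into the morning email to the team D/L
```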
Automate your responses
If the appropriate response to an alert in the middle of the night is to restart the service, run a script, or bounce the box, let your monitoring solution do that for you. Only if the service doesn’t come back up after the automated action should the on-call admin have to remote in for further investigation.
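As a rough illustration, here is a sketch of that restart-then-escalate flow, assuming a Linux box managed by systemd; the service name is a placeholder, and your monitoring tool would normally run this for you.

```python
# Minimal sketch of an automated first response: restart the service,
# wait, re-check, and only page a human if it still isn't healthy.
# Assumes a systemd host; the service name is a hypothetical placeholder.
import subprocess
import time

def restart_service(name):
    subprocess.run(["systemctl", "restart", name], check=False)

def service_is_healthy(name):
    result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    return result.returncode == 0

def handle_service_down(name):
    restart_service(name)
    time.sleep(30)                 # give it a moment to come back
    if service_is_healthy(name):
        return "recovered"         # note it in the daily summary, let people sleep
    return "page_on_call"          # automation failed: wake a human

# status = handle_service_down("myapp.service")   # 'recovered' or 'page_on_call'
```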
Use SMS to get people’s attention
Ideally, you should use SMS to send text alerts to admins’ phones instead of email. We all get far too much email around the clock, and the on-call admin shouldn’t have to lose sleep unless something really goes wrong. Silencing your email alerts while keeping SMS alerts audible lets you sleep through the night, yet will still wake you up if something critical does occur.
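One low-cost way to do that, if your carrier offers an email-to-SMS gateway (many do), is sketched below. The SMTP host, sender, and gateway address are all placeholders; a commercial SMS API would follow the same pattern of reserving the phone for critical alerts only.

```python
# Minimal sketch of paging by SMS via a carrier's email-to-SMS gateway.
# The SMTP host, sender, and gateway address are hypothetical placeholders.
import smtplib
from email.message import EmailMessage

def send_sms(body, to="5551234567@sms.example-carrier.com"):
    msg = EmailMessage()
    msg["From"] = "alerts@example.com"
    msg["To"] = to
    msg["Subject"] = ""
    msg.set_content(body[:160])           # keep it within one SMS segment
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

# send_sms("CRITICAL: db01 service failure, automation did not recover it")
```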
By reducing the noise to manageable levels, automating responses, and moving informational alerts to daily summaries, you can get a better handle on your monitoring and alerting, actually provide appropriate and timely responses to the alerts that need you, and start moving away from that daily firefighting mode.

This is a very sound way to set up an alert system, and something anyone should use as a guideline for their own. Too often I have been stuck in the cycle of way too many needless alerts. Another thing to look at is who really needs to see the alerts that are generated. Maybe your operations team wants to be copied on them, but sit down and talk over exactly what they need to see that’s critical versus what you could just as easily cover in a personal, one-line email or in the daily summary.
Here’s what I did to handle information overload (checking email). I migrated from Microsoft Outlook to Gmail’s POP3 system. The reason behind the move is that the latter is far less immediate: your mail arrives 15–50 minutes later, whereas Outlook notifies you about new messages instantly.
Over time, using Gmail over POP3 teaches you to be patient about receiving your email. Try it. It worked for me.
First of all, there is a major difference between being totally obsessed (or, as it is more commonly known, having obsessive–compulsive disorder – OCD) and being an effective employee. If you constantly check your email even when you are not expecting anything, you are being paranoid at work. To be an effective worker, make the most of the time you are given.
While waiting for email or other assigned tasks, do something else productive – reorganize your schedule, write a report, clean your desk, have lunch with a co-worker, submit proposals, and so on.
There is no such thing as information overload. If there were, we would have been overwhelmed when the Internet became mainstream in the 1990s and popular in the early 2000s. Information technology (more specifically the World Wide Web) made our work easier and more effective. It also made globalization a reality for many third-world countries.
I remember the technical manager of a company I worked for many years ago, who had explicitly asked us not to send him any information that was not vital. He claimed every new message alert was a source of stress for him. Maybe the fact that he often had to deal with not-very-happy customers of the company had something to do with the stress each new email brought, but the result was that he simply hated receiving email.