On paper, October looked like a fine month. Almost all of our clients’ systems ran at 100% uptime every day of the month. Company-wide, our customers experienced 99.93% uptime across all IT systems.
We are also very pleased to have seen no malware on any of our customers’ systems for the entire month. The feeling of everything being secure and trouble-free used to be normal. In May of this year, the peace and quiet we had enjoyed for the previous 24 months was shattered. It feels good to have at least 30 days of clear sailing.
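To put the 99.93% figure in perspective, here is a rough sketch of the arithmetic in Python. It assumes the percentage can be read as the average uptime of a single system over the 31 days of October; the function name is our own and is only for illustration.

```python
def downtime_minutes(uptime_percent: float, days: int) -> float:
    """Convert an uptime percentage over a period into minutes of downtime."""
    minutes_in_period = days * 24 * 60  # total minutes in the period
    return (100.0 - uptime_percent) / 100.0 * minutes_in_period


# Assumption: 99.93% is treated as one system's average over October (31 days).
october_downtime = downtime_minutes(99.93, 31)
print(f"{october_downtime:.1f} minutes of downtime")
```

Under that assumption, the number works out to roughly half an hour of downtime per system for the month.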
Lessons learned in October
Great IT operations teams learn from the past and from their peers. In October, it appears we did neither when it came to preventing downtime. After updating our monitoring software, we did not realize that a critical condition was no longer being escalated to the Customer Care department’s attention the way it had been before the upgrade.
Because we did not notice that this alert was no longer working as expected, issues went undetected and customers experienced downtime. Of course, when an alert we are counting on fails, it is never just one client that is impacted. We had three in two days! After getting all three client systems up and running (all three in less than 90 minutes), it was time to repair the early-warning configuration so this does not recur. It was also necessary to better document what needs to be checked after a software upgrade, so we have a double-check in the future.
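A post-upgrade double-check of this kind can be sketched as a small script. This is only an illustration, not our actual tooling: the alert names, the `customer_care` label, and the `route_after_upgrade` function are hypothetical stand-ins for whatever the monitoring software really exposes.

```python
from typing import Callable, Dict, List

# Hypothetical list: alert name -> team that must be notified when it fires.
EXPECTED_ESCALATIONS: Dict[str, str] = {
    "server_offline": "customer_care",
    "disk_nearly_full": "customer_care",
}


def find_broken_escalations(
    route_alert: Callable[[str], List[str]],
    expected: Dict[str, str],
) -> List[str]:
    """Return the alerts whose test notification did not reach the expected team."""
    broken = []
    for alert_name, team in expected.items():
        recipients = route_alert(alert_name)  # who would be notified right now
        if team not in recipients:
            broken.append(alert_name)
    return broken


def route_after_upgrade(alert_name: str) -> List[str]:
    """Stand-in for the monitoring tool after an upgrade that silently dropped a route."""
    routes = {"server_offline": ["customer_care"], "disk_nearly_full": []}
    return routes.get(alert_name, [])


print(find_broken_escalations(route_after_upgrade, EXPECTED_ESCALATIONS))
# prints ['disk_nearly_full']
```

The idea is simply to keep a written list of the escalations we depend on and to test every one of them right after an upgrade, rather than assuming they survived it.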
In fact, the lessons we learned in October led us to improve how we handle three recurring IT issues. Two of the three changes involve human processes. We hope every client will benefit from these three new processes going forward, and that we don’t repeat October’s results.