What a great feeling to look at every single client and every single system that connects their technology together and see that it was 100% available.
On Friday we reviewed our company dashboard, and when we got to these results it was time to celebrate! Not one internet failure that affected our clients, not one application that broke when being updated by the vendor, not one problem with any of what we call sub-systems at our clients' sites. Since we started carefully collecting these data points a couple of months ago, there has usually been some sort of problem somewhere, so hitting 100% uptime was a great way to end the week.
Our process involves three different ways of analyzing this data. First there is uptime percent. If everything worked perfectly, uptime is 100%. If there was any downtime on any system, we deduct that time from 100%, measured against a 40-hour business week. So if a cloud service was down for two hours, we would express that as 95% uptime for that client that week.
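As a minimal sketch, the uptime-percent calculation above might look like this. The 40-hour business week is an assumption inferred from the example (2 hours down yielding 95%), and the function name is ours for illustration:

```python
BUSINESS_HOURS_PER_WEEK = 40  # assumed denominator, implied by 2 hours -> 95%

def uptime_percent(hours_down: float) -> float:
    """Uptime as a percentage of the business week."""
    return 100.0 * (1 - hours_down / BUSINESS_HOURS_PER_WEEK)

print(uptime_percent(2))  # cloud service down 2 hours -> 95.0
print(uptime_percent(0))  # a perfect week -> 100.0
```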
The next data point we collect is client downtime. Here we measure, in hours, all of the sub-systems that were down. In the example above, it would be 2 hours of downtime if the cloud service was down for 2 hours. If more than one system is down, we multiply the hours by the number of systems down. This really helps us look at downtime from our clients' perspective. They typically don't hand out congratulations for 100% uptime achievements. Rather, we get phone calls like "Look, our internet service has been down for 'x' hours, can we get an update?" Even if we keep in touch during these downtime situations, customers will still remember the DOWN time. We want to see and be reminded of this from their perspective.
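The client-downtime tally described above can be sketched as follows. The outage representation (a list of system-count/hours pairs) and the function name are our own assumptions for illustration:

```python
def client_downtime_hours(outages: list[tuple[int, float]]) -> float:
    """Total downtime from the client's perspective.

    Each outage is (systems_down, hours_down); when more than one
    system is down, hours are multiplied by the number of systems.
    """
    return sum(systems * hours for systems, hours in outages)

# One cloud service down for 2 hours counts as 2 hours.
print(client_downtime_hours([(1, 2.0)]))  # 2.0
# Two systems down together for 3 hours counts as 6 hours.
print(client_downtime_hours([(2, 3.0)]))  # 6.0
```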
Both of these approaches seemed to miss a key component: some things are simply out of our hands. PCIT can work with a cloud vendor, we can implement automatic internet failover services and so on, but at the end of the day we still need other vendors to deliver a great product. If they fail, there is DOWN time and less than 100% UP time. Taking the example already used, if that cloud service is with Amazon or Azure, we have no control over the uptime (other than trying to recommend the best vendors to start with). Also, to say there was 95% uptime unfairly ignores that there were a lot of other business functions the client could still perform, things like email, internet usage, and other cloud applications, while that one cloud application was down. Without doing a ton of research, we came up with our own methodology.
We counted key components of our clients' networks and then averaged the results to come up with a number of 7 sub-systems per client. We then count how many of them PCIT is directly responsible for. So back to the case of the failed cloud service: PCIT's sub-system downtime would be 0%. If instead there was a problem we were responsible for, like a firewall, switch, or cloud server (not cloud service), and it had 2 hours of downtime, our sub-system downtime percent would be the downtime hours for the affected sub-systems divided by the total sub-system hours in a 40-hour week (7 sub-systems × 40 hours = 280 hours). In this case, 2 hours for one sub-system works out to 0.7% sub-system downtime (2 ÷ 280, expressed as a percent).
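Putting the sub-system formula above into a short sketch, with the 7-sub-system average and 40-hour week taken from the text (the function name is ours for illustration):

```python
SUBSYSTEMS_PER_CLIENT = 7      # averaged across client networks, per the text
BUSINESS_HOURS_PER_WEEK = 40

def subsystem_downtime_percent(subsystems_down: int, hours_down: float) -> float:
    """Downtime charged only against sub-systems PCIT is responsible for,
    as a share of total sub-system hours in the business week (7 * 40 = 280)."""
    total_subsystem_hours = SUBSYSTEMS_PER_CLIENT * BUSINESS_HOURS_PER_WEEK
    return 100.0 * (subsystems_down * hours_down) / total_subsystem_hours

# A vendor's cloud service failing is not PCIT's sub-system: 0%.
print(subsystem_downtime_percent(0, 2.0))             # 0.0
# A PCIT-managed firewall down for 2 hours: 2 / 280 ~ 0.7%.
print(round(subsystem_downtime_percent(1, 2.0), 1))   # 0.7
```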
Perhaps no one but PCIT will ever care about our little sub-system downtime percentage methodology. However, it gives us some consolation to report 0.13% downtime across our client base instead of the tragic-looking 4 hours of downtime. Downtime, in whatever form it arrives, usually translates into pain for someone. With the sub-system approach, it is helpful to picture some situations as a mosquito bite on the pain meter versus a giant kick in the shins.