Demystifying Risk Part 1 – Uptime Levels

A big part of my job at Emineo revolves around risk: how to measure it, how to mitigate it and how to speak to clients about the risks they face. This series of blog posts is my opportunity to demystify some topics that come up time and time again. This first post is about uptimes.

Uptimes

I am sure you will have seen the uptime of various systems described in terms such as 99% or 99.99%. What do these figures really mean, though?

The table below lists the yearly downtime implied by each availability level:

Availability    Hours Downtime/Year
95.000%         438
98.000%         175
98.500%         131
99.000%         88
99.500%         44
99.900%         8.8
99.990%         0.88
99.999%         0.088

As humans, we’re not really designed to work in tiny fractions of numbers. 99% uptime may sound pretty good at first glance, but it equates to roughly 3.65 days of downtime over a year. Now that might be fine for a single user’s PC, but how about company email or your phone system?
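If you want to check a level that isn't in the table, the conversion is simply the fraction of time the system is down multiplied by the hours in a year. Here is a minimal Python sketch of that arithmetic. The function name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Reproduce the availability levels from the table above.
for availability in (95.0, 98.0, 98.5, 99.0, 99.5, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(availability)
    print(f"{availability:7.3f}%  {hours:8.3f} hours  ({hours / 24:.2f} days)")
```

The table rounds these results, so 87.6 hours at 99% appears there as 88.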

Combining Uptimes

As a business, you may have a number of systems that you absolutely need in order to operate. For example, you have an email system to receive orders, a CRM system to process those orders, and an ERP system to book the stock and invoice customers. These three systems each have a 99% uptime, but all of them need to be available for the business to function. If any one of them goes down, you are dead in the water. We have to combine the uptimes, and the results may surprise you:

Combined Uptime = 99% * 99% * 99% = ~97% Uptime

Our three systems, each of which has a 99% uptime, result in a combined uptime of only about 97% (assuming they fail independently of one another). Now the real world may not work out exactly as the statistics show, but our combined system has the potential to be down for almost 11 days of the year, three times the downtime of any one system on its own. Suddenly 99% uptime doesn’t sound quite so good.
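The multiplication is easy to check for yourself. The sketch below multiplies the uptimes of every system in the chain. It assumes the systems fail independently of each other, and the function name is hypothetical.

```python
import math


def combined_uptime(uptimes):
    """Uptime of a chain of systems that must ALL be up at the same time.

    Assumes failures are independent, so the probabilities simply multiply.
    """
    return math.prod(uptimes)


# Email, CRM and ERP, each with 99% uptime.
chain = [0.99, 0.99, 0.99]
uptime = combined_uptime(chain)
downtime_days = (1 - uptime) * 365

print(f"{uptime:.2%} uptime, about {downtime_days:.1f} days of downtime a year")
# prints: 97.03% uptime, about 10.8 days of downtime a year
```

Adding a fourth or fifth system to the list shows how quickly the combined figure drops as the chain gets longer.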

Redundant Systems

In IT we reduce the chance of downtime by making systems redundant or even multiply-redundant. For example, suppose we have a business-critical web server. We decide that 99% uptime is just not good enough for this server, so we purchase a second server that can take over if the first stops working.

Redundant Uptime = 99% + (1-99%)*99% = 99.99% Uptime

By adding a second web-server we’ve gone from 88 hours of downtime a year to less than an hour. Seems like a no-brainer, right?
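The same calculation can be written for any number of redundant servers: the service is only down when every server is down at once. The sketch below assumes the servers fail independently and that failover is instant, which real systems only approximate. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def redundant_uptime(uptime: float, copies: int) -> float:
    """Uptime of a service that stays up as long as ANY one copy is up.

    The service fails only if all copies fail together, which
    (for independent failures) has probability (1 - uptime) ** copies.
    """
    return 1 - (1 - uptime) ** copies


pair = redundant_uptime(0.99, 2)
print(f"{pair:.4%} uptime, {(1 - pair) * HOURS_PER_YEAR:.2f} hours of downtime a year")
# prints: 99.9900% uptime, 0.88 hours of downtime a year
```

For two servers this gives the same answer as 99% + (1-99%)*99%, just written in a form that extends to three or more.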

Cost / Benefit Analysis

The problem is that even an hour of downtime a year may still not be good enough. We often get asked what we need to do to ensure 100% uptime. The simple fact is that 100% is impossible. You can tend towards 100% but you’ll never actually achieve it.

The major obstacle to increasing redundancy is cost. As you get closer to 100% uptime, each additional investment buys you less and less benefit. The table below works this through for web servers that cost £3,000 each and have an uptime of 80% each.

Web Servers        1          2          3          4          5          6
Additional Cost    £3,000     £3,000     £3,000     £3,000     £3,000     £3,000
Combined Uptime    80.00000%  96.00000%  99.20000%  99.84000%  99.96800%  99.99360%
Combined Cost      £3,000     £6,000     £9,000     £12,000    £15,000    £18,000
Benefit Increase   -          16.00000%  3.20000%   0.64000%   0.12800%   0.02560%

Too many numbers – How about a graph?

Each additional server costs the same as the last (bulk ordering discounts aside). However, the additional benefit gained gets exponentially smaller: in the table above, each extra server delivers only a fifth of the gain of the one before it. Each additional server does gain us something, but not as much as the last one. We therefore end up with a cost / benefit analysis. When does chasing 100% uptime become financially impractical?
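The figures in the table can be regenerated with a short loop, which also makes it easy to try your own server cost or uptime. This is only a sketch. The 80% per-server uptime is inferred from the first column of the table, and the names used are hypothetical.

```python
SERVER_UPTIME = 0.80  # per-server uptime, inferred from the table's first column
SERVER_COST = 3000    # pounds per server, as in the table


def cost_benefit_rows(max_servers: int):
    """Return (servers, combined cost, combined uptime, benefit increase) rows."""
    rows = []
    previous_uptime = None
    for n in range(1, max_servers + 1):
        # The service is down only when all n servers are down together.
        uptime = 1 - (1 - SERVER_UPTIME) ** n
        gain = None if previous_uptime is None else uptime - previous_uptime
        rows.append((n, n * SERVER_COST, uptime, gain))
        previous_uptime = uptime
    return rows


for n, cost, uptime, gain in cost_benefit_rows(6):
    gain_text = "-" if gain is None else f"{gain:.5%}"
    print(f"{n}  GBP {cost:>6,}  {uptime:.5%}  {gain_text}")
```

The cost column grows by the same amount on every row while the benefit column shrinks by a factor of five, which is the trade-off the cost / benefit analysis has to weigh.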

R**k is not a dirty word

Risk is just a way to quantify the threat of something happening. Far from being a bad thing, risk can ensure that you are spending your valuable budgets wisely. Why spend money if there is no risk? Without a thorough appraisal of the risks facing a business there is always the chance that the controls put in place are either not suitable or address the wrong risks entirely.

Perhaps it’s time to embrace risk in your organisation?

Tim Goldfield – BSc, MBA, CISSP
