Increasing your uptime

Strategies for high availability to boost customer confidence

Globally, the average uptime for a website is 99.41%. This may sound high, but it equates to more than two whole days of downtime annually, during which customers may lose confidence and will be unable to make online purchases. Here, we look at ways to ensure minimal downtime for your site, servers, and other systems, and to close the uptime gap on your competitors.

What are uptime and downtime?

Before we explore ways to improve uptime for your business, it is useful to establish what is meant by uptime and downtime. This might seem obvious, but there are some details that are worth considering.

Uptime is the time an application or service is operational and either able to produce output, or facilitate revenue-generating activities. Downtime is the opposite of uptime, being the time when applications or services cease output, or stop facilitating revenue-generating activities. We can deduce from this that to increase uptime we must decrease downtime.

It is important to understand that not all downtime is bad, and in fact 100% uptime might not be desirable for a business, as there would be no time to take your site or service offline for essential maintenance or improvements. Most organisations will therefore decide how much downtime is tolerable and necessary for their business each year, and schedule this in advance, with plenty of prior warning for customers to help manage expectations. This is referred to as 'planned downtime'.

Uptime is measured in 'nines'. One 'nine' is equal to 90% uptime, or 36.5 days of downtime annually. Five nines is equal to 99.999% uptime, or just 5.26 minutes of downtime per year. For emergency services and other organisations critical to public health and safety, five nines is considered the standard acceptable availability across all systems. The same is true of FinTech and trading applications, where even seconds of downtime can result in significant financial losses. However, other businesses may not need availability quite so high to maintain an optimal service.

Ultimately, the decision is down to the business owners, and commitments to a certain level of uptime will usually be captured in Service Level Agreements (SLAs), so it is important that whatever level is decided upon is achievable and maintainable. Whatever your industry, higher availability is almost certain to mean higher levels of customer trust and loyalty, so achieving the best uptime that is feasible for your business should be a priority. Meeting your SLAs on service availability means knowing how much planned downtime your business needs to operate optimally, and mitigating unplanned downtime effectively.
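To make the arithmetic of 'nines' concrete, here is a minimal Python sketch (purely illustrative) that converts an uptime percentage into the downtime it permits each year:

```python
# Convert an availability percentage into the annual downtime it allows.
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (90.0, 99.0, 99.41, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:,.2f} minutes of downtime per year")
```

Running this confirms the figures above: five nines allows roughly 5.26 minutes per year, while the 99.41% global average allows around 3,100 minutes, or a little over two days.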

How to improve uptime

If your nines are low but easily maintained, it might be quite easy to jump to the next nine. There are various ways to achieve this, including the introduction of partitions to keep your applications running in separate virtual machines (VMs). This architecture makes it easier to isolate issues quickly, and mitigates against a system-wide failure if one component goes down. Expanding on this strategy, you might decide to introduce redundant servers. These are servers that are on 'standby', ready to take on up to 100% of the load if your primary server fails. Alternative setups might see you sharing the load among multiple servers simultaneously to keep them all running well below their individual maximum capacities at all times, with each having the ability to take on more load if any of the other servers in the architecture fail. This is referred to as 'load balancing', which reduces failure risk, and could also extend server working life.
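As a simplified illustration of the load-balancing idea, the Python sketch below (a toy with hypothetical server names, not a production balancer) distributes requests round-robin and fails over past any server whose health check fails:

```python
from itertools import cycle

class LoadBalancer:
    """Toy round-robin balancer that skips unhealthy servers."""

    def __init__(self, servers):
        self.servers = servers
        self._rotation = cycle(servers)

    def route(self, request, is_healthy):
        # Try each server at most once per request; fail over on unhealthy ones.
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if is_healthy(server):
                return f"{server} handled {request}"
        raise RuntimeError("no healthy servers available")

# Hypothetical three-server pool; 'app-2' is simulated as being down.
lb = LoadBalancer(["app-1", "app-2", "app-3"])
print(lb.route("GET /checkout", is_healthy=lambda s: s != "app-2"))
```

Because every server normally carries only a fraction of the load, the pool can absorb a failed node without any one machine running near its maximum capacity.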

It's also worth backing up your servers regularly and, if you've virtualised the servers, taking snapshots often. This will allow you to roll back to previous versions should a corruption be introduced, and enable you to restore the entire working environment rather than just the files. If backing up your data daily still leaves your business vulnerable, you may want to adopt a Continuous Data Protection (CDP) strategy. This introduces a system into your architecture that backs up data every time a change is made, which is referred to as True CDP. An alternative is Near CDP, which is a more traditional backup system, but one that creates restoration points much more regularly. The former achieves a Recovery Point Objective (RPO) of zero, meaning no data will be lost between backups. The latter will have an RPO higher than zero, but will still be much more resilient to data loss than traditional backup methods.
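To illustrate the difference in code, here is a toy Python sketch (not any vendor's API) contrasting a True CDP store, which captures a restore point on every change, with a Near CDP store, which captures them at short fixed intervals:

```python
import time

class TrueCDP:
    """Toy example: capture a restore point on every change (RPO of zero)."""
    def __init__(self):
        self.restore_points = []

    def write(self, data):
        self.restore_points.append((time.time(), data))  # one backup per change

class NearCDP:
    """Toy example: capture restore points at short, fixed intervals."""
    def __init__(self, interval_seconds=60):
        self.interval = interval_seconds
        self.restore_points = []
        self._last_backup = 0.0

    def write(self, data):
        now = time.time()
        if now - self._last_backup >= self.interval:
            self.restore_points.append((now, data))
            self._last_backup = now
        # Changes made between intervals could be lost, so RPO > 0.

store = TrueCDP()
store.write("order #1001")
store.write("order #1002")
print(len(store.restore_points))  # -> 2: every change is recoverable
```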

Setting Key Performance Indicators (KPIs)

Having taken steps to improve and maintain your systems' availability, you can ensure the steps have been successful by setting KPIs to monitor. The most common KPIs for monitoring uptime are: Mean Time To Repair (MTTR), Mean Time Between Failures (MTBF), and Mean Time To Detect (MTTD). Each plays an important role in maximising your systems' uptime, but there is a degree of overlap between them all. The most effective approaches will incorporate all KPIs, and potentially other, less common ones as well.

MTTR

This is a measure of the average amount of time it takes to repair system failures. Time is measured from the moment work begins on a repair, to the point where the system is fully functional again. This means that MTTR also encompasses any testing associated with the repair.

To calculate annual MTTR, the total amount of time spent on repairs is divided by the number of repair incidents. So, in a very simplified example: if you had four hours of total repair time on the system throughout the year, and two repair incidents, you can deduce that your MTTR is two hours.
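The same calculation in Python, using the hypothetical figures from the example above:

```python
# MTTR = total repair time / number of repair incidents
repair_hours = [3.0, 1.0]          # two incidents totalling four hours
mttr = sum(repair_hours) / len(repair_hours)
print(f"MTTR: {mttr} hours")       # -> 2.0 hours
```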

MTTR is not an exact measure of how much downtime your system has experienced, as the delay between discovery of a failure and the commencement of a repair is not accounted for. This KPI is therefore more useful for identifying how long it is taking to fix issues, which is vital information for forecasting and planning.

MTBF

To calculate MTBF, total uptime is divided by the number of unplanned downtime incidents. For example: a system may have run for 4,000 hours between the first and second downtime incidents, then another 1,500 hours before a third downtime occurrence, and a further 3,000 hours before a fourth downtime incident. To calculate the MTBF in this scenario, we would add all the uptime together to give us 8,500 hours, then divide this by the number of downtime incidents (four) to give an MTBF of 2,125 hours. This is an entirely hypothetical example, of course; an actual MTBF would require much more data on which to base the calculation.
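Again, the arithmetic is simple enough to express in a few lines of Python, using the hypothetical figures above:

```python
# MTBF = total uptime / number of unplanned downtime incidents
uptime_hours = [4000, 1500, 3000]  # hours of operation between incidents
incidents = 4                      # downtime incidents observed
mtbf = sum(uptime_hours) / incidents
print(f"MTBF: {mtbf} hours")       # -> 2125.0 hours
```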

MTBF gives an indication of a system's overall performance and dependability. It gives insights into how long systems can be expected to run without interruption, and provides valuable guidance on when preventative maintenance should be scheduled to avoid unplanned downtimes. The more MTBF data is gathered, the more insights you will have, helping you make decisions on everything from component replacements to alert system improvements.

MTTD

This is the average time it takes from the occurrence of a failure to its detection (which will usually be automated with dedicated software). It is calculated by adding up all the times between failure and detection, and dividing this by the total number of failures. For example: if, within a year, your system experienced a failure that took two minutes and thirty seconds to detect, a second failure that took one minute to detect, a third failure that took five minutes to detect, a fourth failure that took one minute and thirty seconds to detect, and a fifth failure that took thirty seconds to detect, this gives us a total time of ten minutes and thirty seconds. Dividing ten minutes and thirty seconds by five incidents of failure gives an MTTD of two minutes and six seconds.
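In Python, using the detection delays from the example above:

```python
# MTTD = total detection delay / number of failures
detection_seconds = [150, 60, 300, 90, 30]       # delays from the example
mttd = sum(detection_seconds) / len(detection_seconds)
minutes, seconds = divmod(mttd, 60)
print(f"MTTD: {int(minutes)}m {int(seconds)}s")  # -> 2m 6s
```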

Faster detection can result in less downtime, so this is an important metric to observe and, where possible, reduce. There's no substitute for training and educating your team in how to spot failures and what the appropriate responses are, depending on their nature. Maintenance and security teams can also be supported with monitoring software that automatically detects and reports failures. Any software introduced to help lower MTTD should ensure that the data collected can be centrally correlated, so all team members have optimal visibility.

Other metrics include Rate of Occurrence of Failure (ROCOF), which provides insights into the frequency of failures; Probability Of Failure On Demand (POFOD), which measures the likelihood of systems failing when a service is requested, making it useful for systems receiving infrequent requests; and Availability (AVAIL), which measures the likelihood of the system being available during a specific time period, helping to provide insights into the impact of downtime on the user experience.
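These three can also be expressed as simple ratios. The Python sketch below uses made-up figures, and one common formulation of each metric, purely to show the arithmetic:

```python
failures = 6                     # failures observed over the period
operating_hours = 8_760          # one year of operation
demands = 1_200                  # service requests received
failed_demands = 3               # requests that failed
downtime_hours = 12              # total downtime over the period

rocof = failures / operating_hours             # failures per operating hour
pofod = failed_demands / demands               # chance a given request fails
avail = (operating_hours - downtime_hours) / operating_hours

print(f"ROCOF: {rocof:.5f} failures/hour")
print(f"POFOD: {pofod:.4f}")
print(f"AVAIL: {avail:.4%}")
```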

Preventing unplanned downtime

Beyond KPIs, technology teams can prevent failures in a number of ways. Validating code via automated testing during the early stages of the development cycle can help systems reach the higher nines of uptime. Monitoring platforms can be configured to provide detailed error reports that go beyond simple alert notifications and provide context on the nature of the failure. This information is invaluable to teams diagnosing and resolving issues in a timely manner.
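For example, a small unit test run early in the development cycle (shown here with Python's built-in unittest module; the function under test is hypothetical) can catch a fault long before it causes downtime in production:

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("discount must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class TestApplyDiscount(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(100.0, 25), 75.0)

    def test_invalid_discount_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    unittest.main()
```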

Businesses are also moving their assets, infrastructure, and software to the cloud, taking advantage of Kubernetes and microservices to remove single points of failure. This can be achieved via multi-master clusters using several master nodes, each of which has access to the same worker nodes.

Chaos engineering (or 'resiliency engineering') is another way technologists can reveal vulnerabilities in a system and address them before they cause issues for users. This is achieved by running experiments that deliberately inject faults into your systems in a controlled manner, then observing how your monitoring platforms, remedial processes, and other fail-safes perform. This is usually conducted on a small, simple area of the system first. Once you have verified that this passes your tests, you might consider automating it in your CI/CD pipeline so the experiment runs on every build. Teams can then progress to more complex components and system areas, repeating these processes and expanding until comprehensive system coverage is achieved.
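A first chaos experiment can be very small. The Python sketch below (all names and thresholds are hypothetical) injects a fault into one simulated service, checks that the monitoring layer would detect it within an agreed window, and then restores the service:

```python
import random
import time

def inject_fault(services):
    """Deliberately take one service offline in a controlled way."""
    victim = random.choice(services)
    victim["healthy"] = False
    return victim

def monitor_detects(service, timeout_seconds=5.0):
    """Stand-in for a real monitoring platform polling health endpoints."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if not service["healthy"]:
            return True       # a real platform would raise an alert here
        time.sleep(0.1)
    return False

services = [{"name": "checkout", "healthy": True},
            {"name": "search", "healthy": True}]
victim = inject_fault(services)
assert monitor_detects(victim), "monitoring failed to detect the injected fault"
print(f"fault in '{victim['name']}' detected; restoring service")
victim["healthy"] = True      # always end the experiment by restoring state
```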

Improving your uptime is a complex undertaking, and the most effective solutions are likely to be bespoke in order to address the unique requirements of an individual business and its audience. Despite these challenges, the rewards of a more highly available system range from better customer retention to increased turnover, and as such can be vital to growing and maintaining a successful business. That's why many organisations seek out technology partners who specialise in maximising uptime.

If you'd like to explore strategies for improving your systems' uptime, drop us a message or give us a call on +44 (0) 8456 808 805 today.
