Improving Availability: Where to Start

One of the most difficult issues that a technology leader can face is addressing inadequate systems availability or poor 'production' performance. This reference page provides best practices to address availability issues and enable IT leaders to overcome dire situations. These practices, along with the other pages under the production best practice menu, are drawn from long experience and have worked in both large and medium-sized organizations, even when the availability and performance issues seemed intractable.

Production issues have plagued large and small corporate shops, even where large budgets have allowed ample investment. Repeated issues have occurred in most industries, affecting critical services. Even cloud providers such as Microsoft and Google have experienced outages that impacted their cloud email applications. And these are not minor blips: in summer 2013, Microsoft took an entire weekend to fully recover its cloud Outlook service. Large financial services providers have also had repeated outages, such as when Bank of America experienced internet site availability issues. Unfortunately for Bank of America, that was their second outage in six months, though they are not alone in having problems: Chase suffered a similar production outage on their internet services the following week. Further, these service failures are just the regular production issues, not counting the unavailability of websites and services due to a series of DDoS attacks. I would note that outages have continued to substantially impact many shops throughout this decade. In 2017 and 2018, we saw numerous airlines (BA, United, and others) suffer major outages impacting customers and their flights. More recently, TSB, a major bank in the UK, has had significant issues, with the latest occurring in November 2019.

Perhaps 15 or certainly 20 years ago, such production system outages would have drawn far less notice from customers, as front office personnel would have worked alternate systems and manual procedures until the systems were restored. But with customers now accessing the heart of most companies' systems through internet and mobile applications, typically on a 7×24 basis, it is very difficult to avoid direct and widespread impact to customers in the event of a system failure. Your production performance is plainly evident to your customers. And your customers' expectations have continued to rise: they expect your company and your services to be available pretty much whenever they want to use them. And while availability is not the only attribute that customers value (usability, features, service, and pricing factor in importantly as well), companies that consistently meet or exceed consumer availability expectations gain a key edge in the market.

So how do you deliver on current and rising expectations for the availability of your online and mobile services? And if BofA and Chase, large organizations that offer dozens of services online and have massive IT departments, have trouble delivering consistently high availability, how can smaller organizations deliver compelling reliability?

And often, the demand for high availability must be met in an environment where ongoing efficiency drives have eroded the production base and a tight IT labor market has further complicated obtaining adequate expertise. If your organization is struggling with availability, or you are looking to achieve top quartile performance and a competitive service advantage, here's where to start. I would also note that the recommendations here are absolutely time-tested. They work as well today as they did 5 or 10 years ago. And properly applied, they will deliver significant improvements: some shops have seen 90%, 95%, or even 98% reductions in downtime, customer impacts, and number of incidents. So, here are the steps to achieve proper availability.

First, understand that poor availability, at its root, is a quality issue. And quality issues can only be resolved if you address all aspects. You must set quality and availability as a priority, as a critical and primary goal for the organization. And you will need to ensure that your incentives and rewards are aligned to your team’s availability goal.

Second, you will need to address the IT change processes. In organizations with poor availability, poor quality changes are often at the root of 60% to 90% of all incidents. Thus, you must stop the inflow of new defects before you can drain the swamp of systemic issues. You should look to implement a change process based on ITIL, but don't wait for a fully defined process to be implemented. You can start by limiting changes to appropriate windows, ensuring changes are done off hours, when volume is low and the business is not operating. Further, establish regular release dates and windows for major systems and their accompanying subsystems, and utilize a release management process. Avoid changes during key business hours or just before the start of the day. I still remember the 'night programmers' at Ameritrade at the beginning of our transformation there. Staying late one night as CIO in my first month, I noticed two guys come in at 10:30 PM. When I asked what they did, they said, 'We are the night programmers. When something breaks with the nightly batch run, we go in and fix it.' And this was done with no change records, minimal testing, and minimal documentation. Of course, my hair stood on end hearing this. We quickly discontinued that practice and instead made changes as a team, after they were fully engineered and tested, not in the middle of the night on the fly. I would note that combining this action with a number of the other measures mentioned here enabled us to quickly reach a stable platform with the best availability track record of all the online brokerages.

Importantly, you should ensure that adequate change review and documentation is being done by your teams for their changes. Ensure they take accountability for their work and their quality. Drive to an improved change process with templates for reviews, proper documentation, back-out plans, and validation. Ensure a strong technical lead runs the change approval meetings, with the authority to insist on quality. Most failed changes are due to issues with the basics: a lack of adequate review and planning, poor documentation of deployment steps, missing or ineffective validation, or one person doing an implementation in the middle of the night when at least two people should be doing it together (one to do, and one to check). You should establish a proper checklist for teams to use to ensure their work is production ready, that is, properly done so that it will be successful and work well in production. By insisting on high quality changes, you can quickly reduce the number of incidents due to failed changes, and get the breathing space necessary to address more systemic issues.
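To make the checklist idea concrete, here is a minimal sketch in Python of an automated production-readiness check. The checklist items and field names (peer_reviewed, backout_plan, and so on) are my own illustrative assumptions, not a prescribed standard; adapt them to your own change process and tooling.

```python
# Hypothetical change record; in practice this would come from your change tool.
change = {
    "description": "Upgrade payments-engine to v2.4",
    "peer_reviewed": True,
    "deployment_steps_documented": True,
    "backout_plan": True,
    "validation_plan": True,
    "second_implementer_assigned": False,   # one to do, one to check
}

# The basics that every change should carry before it reaches the approval meeting.
REQUIRED_CHECKS = [
    "peer_reviewed",
    "deployment_steps_documented",
    "backout_plan",
    "validation_plan",
    "second_implementer_assigned",
]

missing = [check for check in REQUIRED_CHECKS if not change.get(check)]
if missing:
    print(f"NOT production ready; missing: {', '.join(missing)}")
else:
    print("Change record passes the production-readiness checklist.")
```

Even a simple gate like this, run before the change approval meeting, forces the conversation back to the basics that cause most failed changes.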

Also, you should measure the proportion of incidents due to change. If you experience mediocre or poor availability and failed changes contribute more than 30% of the incidents, you should recognize that change quality is a major contributor to your issues. In many shops with poor availability, failed changes can account for 60% or even 80% of all production issues. You will need to zero in on the areas with chronic change issues. Measure the change success rate (the percentage of changes executed successfully without causing a production incident) of your teams. Publish the results by team (this will help drive more rapid improvement). Work closely with the worst teams to address their lack of quality: identify the root causes and help them solve them. Often, you can quickly find which of your teams have inadequate quality because their change success rates range from a very poor mid-80s percentage to a mediocre mid-90s percentage. Good shops deliver above 98%, and a first quartile shop consistently has a change success rate of 99% or better.
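As an illustration of the metric (the change records, team names, and field names below are hypothetical), a minimal sketch of computing change success rate by team could look like this:

```python
from collections import defaultdict

# Hypothetical change records; in practice these would be exported from your
# change management tool.
changes = [
    {"team": "Payments", "caused_incident": False},
    {"team": "Payments", "caused_incident": True},
    {"team": "Online Banking", "caused_incident": False},
    {"team": "Online Banking", "caused_incident": False},
]

def change_success_rate(changes):
    """Percentage of changes executed without causing a production incident, by team."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for c in changes:
        totals[c["team"]] += 1
        if not c["caused_incident"]:
            successes[c["team"]] += 1
    return {team: 100.0 * successes[team] / totals[team] for team in totals}

for team, rate in sorted(change_success_rate(changes).items()):
    print(f"{team}: {rate:.1f}% change success rate")
```

Publishing exactly this kind of per-team table, month over month, is what creates the healthy pressure to improve.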

Third, ensure all customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). The ECC also has the capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires investment in an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues requires consolidating these capabilities and extending their reach. Even better, identify the top 5 or 10 business services, integrate and extend your component monitoring to enable a dashboard of these top services, and provide this to your ECC. Now you can not only detect issues faster and understand their impacts, but also correlate them faster, because the dashboard shows where (in which components) the business service is failing.
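As a sketch of the business-service dashboard idea (the service-to-component mapping and health values below are hypothetical, not taken from any particular monitoring tool), rolling component monitoring up into service status might look like this:

```python
# Hypothetical mapping of top business services to the components they depend on.
# In practice this would be maintained in your CMDB or monitoring configuration.
SERVICE_COMPONENTS = {
    "Online Banking": ["web-frontend", "auth-service", "core-banking-db"],
    "Mobile Payments": ["api-gateway", "payments-engine", "core-banking-db"],
}

# Latest component health as reported by monitoring ("ok" or "failed").
component_health = {
    "web-frontend": "ok",
    "auth-service": "ok",
    "core-banking-db": "failed",
    "api-gateway": "ok",
    "payments-engine": "ok",
}

def service_status(service):
    """Roll component health up to a business-service view for the ECC dashboard."""
    failed = [c for c in SERVICE_COMPONENTS[service] if component_health.get(c) != "ok"]
    return ("DEGRADED", failed) if failed else ("OK", [])

for svc in SERVICE_COMPONENTS:
    status, failed = service_status(svc)
    suffix = f" (failing components: {', '.join(failed)})" if failed else ""
    print(f"{svc}: {status}{suffix}")
```

The value is in the mapping: when a component fails, the ECC immediately sees which customer-facing services are degraded and where to focus the triage.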

If you have an ECC in place, ensure that all customer impacting issues are fully reported and handled. Underreporting of issues that impact a segment of your customer base, or the siphoning off of a problem to be handled by a local team, is akin to trying to handle a house fire with a garden hose and not calling the fire department. Call the fire department first, and then get the garden hose out while the fire trucks are on their way.

Fourth, you must execute strong root cause analysis and follow-up. These efforts must happen at the individual issue or incident level as well as at a summary or higher level. It is important not just to fix the individual incident and get to root cause for that one incident, but also to look for the overall trends and patterns in your issues. Do they cluster around one application or infrastructure component? Are they caused primarily by change? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Are your designs too complex? Are you using the products in a mainstream or a unique manner, especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g., poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, incompetent staff). Track both the fixes for individual issues and the efforts to address systemic issues. This closed loop improvement yields the actions that eliminate future issues.
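As a simple illustration of looking for patterns across incidents (the incident records and cause categories below are hypothetical), you can start with little more than a tally by component and by cause:

```python
from collections import Counter

# Hypothetical incident records; real data would come from your incident
# management system, with whatever cause taxonomy your shop uses.
incidents = [
    {"component": "payments-engine", "cause": "failed change"},
    {"component": "core-banking-db", "cause": "capacity"},
    {"component": "payments-engine", "cause": "failed change"},
    {"component": "web-frontend", "cause": "software defect"},
]

# Look beyond individual incidents: where do issues cluster, and which causes dominate?
by_component = Counter(i["component"] for i in incidents)
by_cause = Counter(i["cause"] for i in incidents)

print("Incidents by component:", by_component.most_common())
print("Incidents by cause:", by_cause.most_common())
print(f"Share caused by change: {100.0 * by_cause['failed change'] / len(incidents):.0f}%")
```

Even this crude tally, reviewed monthly, will surface the chronic components and the dominant causes that deserve a systemic fix rather than another one-off repair.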

These four efforts will set you on a solid course to improved availability. If you couple these efforts with diligent engagement by senior management and disciplined execution, the improvements will come slowly at first, but then will yield substantial gains that can be sustained.

You can achieve further momentum with work in several areas:

  • Document configurations for all key systems. If you are doing discovery during incidents, it is a clear indicator that your documentation and knowledge base are highly inadequate.
  • Review how incidents are reported. Are they user reported, or did your monitoring identify the issue first? At least 70% of issues should be identified first by you, and eventually you will want to drive this to 90%. If you are lower, you need to invest in improving your monitoring and diagnostic capabilities.
  • Do you report availability in technical measures or business measures? If you report time-based systems availability or the number of incidents by severity, these are technical measures. You should look to implement business-oriented measures, such as customer impact availability, to drive greater transparency and more accurate metrics.
  • In addition to eliminating issues, reduce your customer impacts by reducing the time to restore service (Microsoft can certainly stand to consider this area given their latest outage lasted three days!). Mean time to restore (MTTR; note this is not mean time to repair but mean time to restore service) has three components: time to detect (MTTD), time to diagnose or correlate (MTTC), and time to fix (the time to actually implement the fix that restores service, or MTTF). An IT shop that is effective at resolution will normally see MTTR of two hours or less for its priority issues, with the three components each taking about a third of the time (see the sketch after this list). If your MTTD is high, again look to invest in better monitoring. If your MTTC is high, look to improve correlation and diagnostic tools, systems documentation, or engineering knowledge. And if your MTTF is high, again look to improve documentation (especially of the fix path) or engineering knowledge, but also look to automate recovery or failover procedures.
  • Consider investing in greater resiliency for key systems. It may be that customer expectations of availability exceed your current architecture's capabilities. If so, invest in greater redundancy or build a more highly available platform.
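To make the MTTR breakdown concrete, here is a minimal sketch using hypothetical timestamps from a single incident record; averaged across your priority incidents, these durations become your MTTD, MTTC, MTTF, and MTTR.

```python
from datetime import datetime

# Hypothetical timeline of one priority incident; in practice these timestamps
# would come from your incident management records.
incident = {
    "failure_start": datetime(2019, 11, 5, 9, 0),
    "detected_at":   datetime(2019, 11, 5, 9, 25),   # monitoring alert fires
    "diagnosed_at":  datetime(2019, 11, 5, 10, 5),   # failing component identified
    "restored_at":   datetime(2019, 11, 5, 10, 50),  # service restored to customers
}

def minutes(start, end):
    return (end - start).total_seconds() / 60

ttd = minutes(incident["failure_start"], incident["detected_at"])  # time to detect
ttc = minutes(incident["detected_at"], incident["diagnosed_at"])   # time to correlate/diagnose
ttf = minutes(incident["diagnosed_at"], incident["restored_at"])   # time to fix/restore
ttr = ttd + ttc + ttf                                              # total time to restore

print(f"Detect: {ttd:.0f} min, Diagnose: {ttc:.0f} min, Fix: {ttf:.0f} min, Restore: {ttr:.0f} min")
```

Tracking the three components separately tells you where to invest: monitoring for detection, diagnostic tooling and documentation for correlation, and automation for the fix itself.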

As you can see, providing robust availability for your customers is a complex endeavor. By implementing these steps, you can enable sustainable and substantial progress to top quartile performance and achieve business advantage in today’s 7×24 world.

What would you add to these steps? What were the key factors in your shop’s journey to high availability?

Best, Jim Ditmore
