The Elusive High Availability in the Digital Age

Well, the summer is over, even if we have had great weather into September. My apologies for the delay in a new post, and I know I have several topic requests to fulfill 🙂 Given our own journey at Danske Bank on availability, I thought it was best to re-touch this topic and then come back around to other requests in my next posts. Enjoy and look forward to your comments!

It has been a tough few months for some US airlines with their IT systems availability. Hopefully, you were not caught up in the major delays and frustrations. Both Southwest and Delta suffered major outages in August and September. Add in power outages in Newark that recently affected equipment for multiple airlines, and you have many customers fuming over delays and cancelled flights. And the cost to the airlines was huge: Delta's outage alone is estimated at $100M to $150M, and that does not include the reputation impact. Nor are such outages limited to US airlines; British Airways also suffered a major outage in September. And Delta and Southwest are not unique in their problems: both United and American suffered major failures and widespread impacts in 2015. Even with large IT budgets, and hundreds of millions invested in upgrades over the past few years, airlines are struggling to maintain service in the digital age. The reasons are straightforward:

  • At their core, their services are based on antiquated systems that have been partially refitted and upgraded over decades (the core reservation systems date from the 1960s)
  • Earlier this decade, airlines struggled to make a profit due to oil prices and invested minimally in their IT systems to attack the technical debt. This was further complicated by the multiple systems integrations that mergers required.
  • As airlines have digitalized their customer interfaces and check-in procedures, the previous manual procedures have become backup steps that are infrequently exercised and woefully undermanned when IT systems do fail, resulting in massive service outages.

With digitalization reaching even further into customer interfaces and operations, airlines, like many other industries, must invest in stabilizing their systems, address their technical debt, and get serious about availability. Some should start with the best practices in the previous post, Improving Availability, Where to Start. Others, like many IT shops, have decent availability but still have much to do to reach first quartile availability. If you have made good progress but realize that three 9s, or preferably four 9s, of availability on your key channels is critical for you to win in the digital age, this post covers what you should do.

Let’s start with the foundation. If you can deliver consistently good availability, then your team should already understand:

  • Availability is about quality. Poor availability is a quality issue. You must have a quality culture that emphasizes quality as a desired outcome and doing things right if you wish to achieve high availability.
  • Most defects — which then cause outages — are injected by change. Thus, strong change management processes that identify and eliminate defects are critical to further reduce outages.
  • Monitor and manage to minimize impact. A capable command center with proper monitoring feeds and strong incident management practices may not prevent the defect from occurring but it can greatly reduce the time to restore and the overall customer impact. This directly translates into higher availability.
  • You must learn and improve from the issues. Your incident management process must be coupled with disciplined root cause analysis that ensures teams identify and correct the underlying causes so that future issues are avoided. This continuous learning and improvement is key to reaching high performance.

With this base understanding, and presumably with only smoldering problem areas left in your IT shop, there are excellent extensions that will enable your team to move to first quartile availability with moderate but persistent effort. For many enterprises, this is now a highly desirable business goal. Reliable systems translate to reliable customer interfaces, as customers now access the heart of most companies' systems through internet and mobile applications, typically on a 7×24 basis. Your production performance becomes very evident, very fast, to your customers. And if you are down, they cannot transact, you cannot service them, your company loses real revenue, and more importantly, it damages its reputation, often badly. It is far better to address these problems and gain a key edge in the market by consistently meeting or exceeding customer availability expectations.

First, even if you have moved up from regularly fighting fires, the fact that outages are no longer an everyday occurrence does not mean that IT leadership can stop emphasizing quality. Delivering high quality must be core to your culture and your engineering values. As IT leaders, you must continue to reiterate the importance of quality and demonstrate your commitment to these values by your actions. When there is enormous time pressure to deliver a release, but it is not ready, you delay it until the quality is appropriate. Or you release a lower quality pilot version, with properly set customer and business expectations, that is followed in a timely manner by a quality release. You ensure adequate investment in foundational quality by funding system upgrades and lifecycle efforts so technical debt does not increase. You reward teams for high quality engineering, not for fire-fighting. You advocate inspections, or agile methods, that enable defects to be removed earlier in the lifecycle at lower cost. You invest in automated testing and verification that assures higher quality work at much lower cost. You address redundancy and ensure resiliency in core infrastructure and systems. Single power cord servers still in your data center? Really?? Take care of these long-neglected issues. And if you are not sure, go look for these typical failure points (another being SPOF network connections). We used to call these 'easter eggs', as in the easter eggs that no one found in a preceding year's easter egg hunt, and then you find the old, and quite rotten, egg on your watch. It's no fun, but it is far better to find them before they cause an outage.

Remember that quality is not achieved by never making mistakes (a zero defect goal is not the target); instead, quality is achieved by a continuous improvement approach where defects are analyzed and their causes eliminated, and where your team learns and applies best practices. Your target should be first quartile quality for your industry, as that will provide competitive advantage. When you update the goals, also revisit the rewards of your organization and ensure they are aligned to match these quality goals.

Second, you should build on your robust change management process. To get to median capability, you should have already established clear change review teams and proper change windows, and moved to deliveries through releases. Now, use the data to identify which groups are late in their preparation for changes, or where change defects cluster and why. These insights can improve and streamline the change process (yes, some of the late changes could be due to too many required approvals, for example). Further clusters of issues may be due to specific steps being poorly performed or to inadequate tools. For example, verification is often done as a cursory task and thus seldom catches critical change defects. The result is that the defect is only discovered in production, hours later, when your entire customer base is trying but cannot use the system. Such an outage was likely entirely avoidable with adequate verification, because you would have known at the time of the change that it had failed and could have taken action then to back out the change. The failed change data is your gold mine of information for understanding which groups need to improve and where they should improve. Importantly, be transparent with the data: publish the results by team and by root cause cluster. Transparency improves accountability. As an IT leader, you must then make the necessary investments and align efforts to correct the identified deficiencies and avoid future outages.
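To make the data mining concrete, here is a minimal sketch (in Python, with hypothetical record fields, team names, and cause labels) of the kind of transparency report described above; your change management tool will have its own data export and categories.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ChangeRecord:
    change_id: str
    team: str
    failed: bool        # did the change fail, cause an incident, or get backed out?
    root_cause: str     # e.g. "inadequate verification", "late preparation"

def failed_change_report(changes: list[ChangeRecord]) -> None:
    """Summarize failed changes by team and by root-cause cluster."""
    failed = [c for c in changes if c.failed]

    print("Failed changes by team:")
    for team, count in Counter(c.team for c in failed).most_common():
        print(f"  {team}: {count}")

    print("Failed changes by root-cause cluster:")
    for cause, count in Counter(c.root_cause for c in failed).most_common():
        print(f"  {cause}: {count}")

# Example usage with illustrative records
changes = [
    ChangeRecord("CHG-101", "payments", True, "inadequate verification"),
    ChangeRecord("CHG-102", "payments", False, ""),
    ChangeRecord("CHG-103", "online-banking", True, "late preparation"),
]
failed_change_report(changes)
```

Publishing even this simple breakdown by team and by cause cluster, on a regular cadence, is usually enough to start the accountability conversation.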

Further, you can extend the change process by introducing production ready. Production ready means a system or major update can be introduced into production because it is ready on all the key performance aspects: security, recoverability, reliability, maintainability, usability, and operability. In the typical rush to deliver key features or products, the sustainability of the system is often neglected or omitted. By establishing the Operations team as the final approval gate for a major change to go into production, and leveraging the production ready criteria, organizations can ensure that these often-neglected areas are attended to and properly delivered as part of the normal development process. These steps then enable a much higher performing system in production and avoid customer impacts. For a detailed definition of the production ready process, please see the reference page.
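As an illustration only, here is a minimal sketch of how a production ready gate could be encoded as a checklist that Operations uses for final approval; the criteria mirror the aspects listed above, while the sign-off structure and function names are assumptions for the example.

```python
# A minimal sketch of a production-ready gate, assuming a simple sign-off checklist.
PRODUCTION_READY_CRITERIA = [
    "security",        # e.g. penetration test passed, access model reviewed
    "recoverability",  # e.g. backup/restore and failover tested
    "reliability",     # e.g. load and soak tests meet targets
    "maintainability", # e.g. runbooks and documentation delivered
    "usability",       # e.g. customer journeys validated
    "operability",     # e.g. monitoring, alerting and dashboards in place
]

def approve_for_production(signoffs: dict[str, bool]) -> bool:
    """Operations approves only when every criterion has been signed off."""
    missing = [c for c in PRODUCTION_READY_CRITERIA if not signoffs.get(c, False)]
    if missing:
        print(f"Change rejected; missing sign-offs: {', '.join(missing)}")
        return False
    print("Change approved: production ready on all criteria.")
    return True

# Example: operability has not been signed off, so the release is held back.
approve_for_production({c: True for c in PRODUCTION_READY_CRITERIA} | {"operability": False})
```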

Third, ensure you have consolidated your monitoring and that all significant customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). This modern ECC also has the capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires investment in an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues, these capabilities must be consolidated and integrated.

Once you have an integrated ECC, you can extend it by moving from component monitoring to full channel monitoring. Full channel monitoring is where the entire stack for a critical customer channel (e.g., online banking for financial services or customer shopping for a retailer) has been instrumented so that a comprehensive view can be continuously monitored within the ECC. Not only are all the infrastructure components fully monitored, but the databases, middleware, and software components are instrumented as well. Further, proxy transactions can be, and are, run on a periodic basis to gauge performance and detect any issues. This level of instrumentation requires considerable investment, and thus is normally done only for the most critical channels. It also requires sophisticated toolsets such as AppDynamics. But full channel monitoring enables immediate detection of issues or service failures and, most importantly, enables very rapid correlation of where the fault lies. This rapid correlation can take incident impact from hours to minutes or even seconds. Automated recovery routines can be built to accelerate recovery from given scenarios and reduce impact to seconds. If your company's revenue or service is highly dependent on such a channel, I would highly recommend the investment. A single severe outage that is avoided or greatly reduced can often pay for the entire instrumentation cost.
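For a sense of what the proxy transaction piece looks like in practice, here is a minimal, hand-rolled sketch of a synthetic probe against a hypothetical channel health endpoint; a real full channel monitoring deployment would rely on a toolset such as AppDynamics and feed alerts into the ECC rather than print to a console.

```python
# A minimal sketch of a proxy (synthetic) transaction probe. The endpoint URL,
# thresholds, and intervals are illustrative assumptions.
import time
import urllib.request

CHANNEL_URL = "https://example.com/online-banking/health"  # hypothetical endpoint
LATENCY_THRESHOLD_SECONDS = 2.0
CHECK_INTERVAL_SECONDS = 60

def probe_channel(url: str) -> None:
    """Run one synthetic transaction and report availability and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            elapsed = time.monotonic() - start
            if response.status != 200:
                print(f"ALERT: channel returned HTTP {response.status}")
            elif elapsed > LATENCY_THRESHOLD_SECONDS:
                print(f"WARN: channel slow, {elapsed:.2f}s")
            else:
                print(f"OK: channel healthy, {elapsed:.2f}s")
    except Exception as exc:  # connection failure, timeout, DNS, etc.
        print(f"ALERT: channel probe failed: {exc}")

if __name__ == "__main__":
    while True:
        probe_channel(CHANNEL_URL)
        time.sleep(CHECK_INTERVAL_SECONDS)
```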

Fourth, you cannot be complacent about learning and improving. Whether from failed changes, incident pattern analysis, or industry trends and practices, you and your team should always be seeking to identify improvements. High performance, or here high quality, is never reached in one step, but instead in a series of many steps and adjustments. And given that our IT systems themselves are dynamic and changing over time, we must stay alert to new trends and new issues, and adjust.

Often, where we execute strong root cause analysis and follow-up, we end up focused only at the individual issue or incident level. This is all well and good for correcting the one issue, but if we miss broader patterns we can substantially undershoot optimal performance. As IT leaders, we must always consider both the trees and the forest. It is important not just to focus on fixing the individual incident and getting to root cause for that one incident, but also to look for the overall trends and patterns in your issues. Do they cluster with one application or infrastructure component? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Do you have some teams that create far more defects than the norm? Are your designs too complex? Are you using products in a mainstream or unique manner, especially if you are seeing many OS or product defects? Use these patterns and this analysis to identify the systemic issues your organization must fix. They may be process issues (e.g., poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, incompetent staff). Discuss these issues and the analysis with your management team and engineering leads. Tackle fixing them as a team, with your quality goals prioritizing the efforts. By correcting things both individually and systemically, you can achieve far greater progress. Again, the transparency of the discussions will increase accountability and open up your teams so everyone can focus on the real goals instead of hiding problems.
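A minimal sketch of this kind of pattern analysis, assuming hypothetical incident fields and an illustrative 25% threshold for what counts as a systemic cluster:

```python
# Flag any application, supplier, or root-cause category that contributes an
# outsized share of incidents. Field names and the threshold are assumptions.
from collections import Counter

def systemic_clusters(incidents: list[dict], attribute: str, share_threshold: float = 0.25):
    """Return clusters (by the given attribute) whose share of incidents exceeds the threshold."""
    counts = Counter(i[attribute] for i in incidents)
    total = sum(counts.values())
    return [(value, count) for value, count in counts.most_common()
            if count / total >= share_threshold]

# Illustrative incident records
incidents = [
    {"application": "reservations", "supplier": "vendor-a", "cause": "inadequate testing"},
    {"application": "reservations", "supplier": "vendor-b", "cause": "obsolete hardware"},
    {"application": "check-in",     "supplier": "vendor-a", "cause": "inadequate testing"},
    {"application": "loyalty",      "supplier": "vendor-a", "cause": "failed change"},
]

for attribute in ("application", "supplier", "cause"):
    print(attribute, "->", systemic_clusters(incidents, attribute))
```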

These four extensions to your initial efforts will set your team on a course to achieve top quartile availability. Of course, you must couple these efforts with diligent engagement by senior management, adequate investment, and disciplined execution. Unfortunately, even with all the right measures, providing robust availability for your customers is rarely a straight-line improvement. It is a complex endeavor that requires persistence and adjustment along the way. But by implementing these steps, you can enable sustainable and substantial progress and achieve top quartile performance to provide business advantage in today’s 7×24 digital world.

If your shop is struggling with high availability or major outages, look to apply these practices (or send your CIO the link to this page 🙂 ).

Best, Jim Ditmore

6 thoughts on “The Elusive High Availability in the Digital Age”

  1. Hi Jim,

    As always, an insightful read. Thanks for sharing. I would be interested to know how you can translate the above into a DevOps model for continuous delivery.

    Regards,

    Bhavesh

    1. Bhavesh,

      I trust all is well. For DevOps to truly work in an organization, the team must be extremely mature and disciplined in using it. Presumably, you can complete testing as you progress development, thus verifying the quality of the code and the changes. But if you have incomplete coverage, or inadequate systems integration testing, then it is highly likely you will regularly deploy releases with defects. DevOps is intended to enable organizations to develop and deploy changes more quickly. But all change must be done in a high quality and disciplined manner, or you are simply increasing customer frustration and overall cost. Thus, I recommend organizations consider carefully whether they (and their customers) are really ready for the skill and maturity DevOps requires, or whether an iterative release process with high quality would provide both the speed and, more importantly, the high availability necessary to compete.
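      As a small illustration of the discipline this implies, here is a hedged sketch of a pipeline quality gate that blocks a deployment when coverage or integration results fall short; the thresholds and report format are assumptions, not a prescription.

      ```python
      # A sketch of a DevOps pipeline quality gate: block deployment unless unit test
      # coverage and the integration suite meet agreed thresholds (illustrative values).
      import sys

      def quality_gate(report: dict) -> bool:
          """Return True only if the release meets the agreed quality bars."""
          coverage_ok = report["unit_coverage"] >= 0.90
          integration_ok = report["integration_failures"] == 0
          if not coverage_ok:
              print(f"Blocked: unit coverage {report['unit_coverage']:.0%} is below the 90% bar")
          if not integration_ok:
              print(f"Blocked: {report['integration_failures']} integration test failures")
          return coverage_ok and integration_ok

      if __name__ == "__main__":
          # Example values; in a real pipeline these would come from the test reports.
          report = {"unit_coverage": 0.87, "integration_failures": 2}
          sys.exit(0 if quality_gate(report) else 1)
      ```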

      Best, Jim Ditmore

  2. Hi Jim

    I think this is another good article from your hand. I hope that a lot of people are reading it.
    If I were to add anything to your article, it would be about instilling a continuous improvement mindset. I believe that any service in the service catalogue should have someone who is accountable for the service delivery, and this responsibility should include an improvement plan. I am not suggesting that we should strive for perfection, but merely that we stay in control of the service life cycle and ensure adjustments to improve fit-for-purpose (value of service + "Production Ready" parameters) and cost.
    If you ever think you are finished, then you are wrong.

    Thanks.

  3. Thank you for the new recipe. Definitely, HA nowadays plays an important role in keeping customers satisfied and loyal.
    Test, monitor and test again is probably the most reasonable way to minimize possible negative impact, as, unfortunately, the liabilities covered by external vendor product licenses usually do not protect us from failures. Solution architecture engineering, I think, is the right point at which to build in the features needed to minimize the impact of a possible service disruption. A good choice to achieve that would be logical application/service clustering instead of clustering separate HW components. It could allow simplified full functional channel monitoring and recovery automation, and therefore could, from my point of view, help simplify the whole operations process. Moreover, if the whole solution architecture (both HW and SW) has been engineered and developed with a high degree of functional autonomy (engineered like LEGO cubes), then even with some poor data normalization within the databases, when some components fail the rest could continue with some limitations/restrictions to customers, which is much better than a whole service failure possibly leading to significant loss of reputation.
    On the other hand, the majority of big IT shops nowadays should have more than one geographically separated data center, with at least the major data needed for operations replicated.
    And we should not forget BCP. Even if we invest a lot into HA, nobody is protected against "force majeure", as happened in the Sept. 11 WTC attack, when, for instance, thanks to BCP provisions, NYBOT operations were recovered by the end of the day.
    What do you think: should we concentrate on HW redundancy for HA, or should target #1 be a reliable app architecture that gives us a choice to continue production in a restricted mode? Does the above correlate with DGIT views?
    Rgds, Rimantas

  4. Great stuff Jim,

    Your perception and understanding of the IT world are a great guide in a world always in pursuit of best practice. So thank you for that.

    I focused a lot on the fourth point, on root cause analysis and bringing things into the bigger picture. I saw that you had an article on Big Data from 2013 by Anthony Watson. The focus is still the same today: identify the root cause, so you know the question, and then look for the answer in Big Data to give you the courage to disrupt your organization for the better. So we have the fire fighting covered, we have implemented change management (so-so), and total monitoring is on the way. But the root cause shows that it is the foundation of the IT system that needs to be changed. A lot of systems are 20-30 years old and have been built up through 30 years of add-ons. So how do you find the courage and documentation to start on the renewal journey to disrupt your organization, so you get a better system that is easier to cover through BCP, change management and monitoring, a system that helps you be proactive? Where do you start: big data? root cause? change management?

    Hope my question makes sense. If you have already written an article on how to move to the next generation of IT, then I will be happy to read it…

    Best
    Jan

    1. Jan,

      Thanks for the comments and the question. This is a tough situation for most organizations: they have steadily built up technical debt in their legacy systems, and tackling it in one massive project is usually not possible, either in funding or in time. But I think the way it is solved is by first ensuring a persistent focus on reducing the debt with every release and fix. I recommend starting with documentation. Most development on undocumented systems is hampered substantially by the lack of documentation, resulting in 'discovery' being the primary analysis task (expanding it by 2 to 10 fold). Each fix or release is then much more expensive. So, first invest in documentation. Then, I recommend two key areas to address:
      – automating unit testing for frequently changed areas
      – analyzing the defects and production incidents to determine which areas of the application are worst (these are then the candidates for rewrite or redesign)
      Any major update or release effort must include redesign where you begin to insert APIs. The APIs should correlate fully with your enterprise architecture. By inserting APIs and breaking a complex system into more modular pieces, you can create a much more flexible and easy-to-maintain system. The API work must also align with the target data model wherever possible. Data elements not in the APIs, and data edits occurring out of place, can then be more easily corrected (or moved) after the APIs have been inserted.
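      To make the API insertion concrete, here is a small, hypothetical sketch of wrapping a legacy routine behind a thin API facade so that callers depend on a stable, architecture-aligned interface rather than on the legacy internals; all names and fields are illustrative.

      ```python
      # A hypothetical sketch of inserting an API in front of legacy code. Callers use
      # the facade's contract; the legacy module becomes an implementation detail that
      # can later be redesigned or replaced without touching them.

      def legacy_lookup_account(raw_key: str) -> str:
          """Stand-in for a decades-old routine returning a positional, pipe-delimited record."""
          return f"{raw_key}|SMITH|ACTIVE|2001-04-17"

      class AccountAPI:
          """Thin facade exposing a clear contract aligned to the target data model."""

          def get_account(self, account_id: str) -> dict:
              record = legacy_lookup_account(account_id)
              key, surname, status, opened = record.split("|")
              # Map the legacy layout onto the target data model field names.
              return {"accountId": key, "surname": surname, "status": status, "openedDate": opened}

      # Example usage
      print(AccountAPI().get_account("1234567"))
      ```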

      So the order should be: documentation first, then testing automation and rework of the problematic modules, next APIs with any major redesign (with data model fixes directly related to the APIs), and then the remaining data updates and functional placement.

      Last advice: be patient and persistent. It has often taken 20 years to get into such debt; it will take 2 or 3 years of concerted effort to get out of it, and you will need to ensure that 20 or even 30% of your development effort during this time is focused on the technical debt.

      Hope this helps. Best, Jim Ditmore
