Hi ho! Hi ho! It’s Off to Cloud we go!

With the ongoing stampede to public cloud platforms, it is worth taking a closer look at the factors driving such rapid growth. Amazon, Azure, Google, IBM, and a host of other public cloud services saw continued strong growth in 2018 of 21% to $175B, extending a long run of rapid revenue growth for the industry, according to Gartner figures cited in a recent Forbes article. Public cloud services, under Gartner’s definition, include a broad range of offerings, from traditional SaaS to infrastructure services (IaaS and PaaS) as well as business process services. IaaS, perhaps most closely associated with AWS, is forecast to grow 26% in 2019, with total revenues increasing from $31B in 2018 to $39.5B in 2019. AWS has the lion’s share of this market, with 80% of enterprises either experimenting with or using AWS as their preferred platform. Microsoft’s Azure continues to make inroads as well, with the share of enterprises using the Azure platform increasing from 43% to 58%. And Google proclaimed a recent upsurge in its cloud services in its quarterly earnings announcement. It is worth noting, though, that both traditional SaaS and private cloud implementations are expected to grow at near 30% rates for the next decade – essentially matching or even exceeding public cloud infrastructure growth rates over the same period. The industry with the highest adoption of both private and public cloud is financial services, where adoption (usage) rates above 50% are common and rates close to 100% are occurring, versus median rates of 19% across all industries.

At Danske Bank, we are close to completing a four-year infrastructure transformation program that has migrated our entire application portfolio from proprietary, dedicated server farms in 5 obsolete data centers to a modern private cloud environment in 2 data centers. Of course, we migrated and updated our mainframe complex as well. Over that time, we have also acquired business software that is SaaS-provided and experimented with or leveraged smaller public cloud environments. With this migration, led by our CTO Jan Steen Olsen, we have eliminated nearly all of our infrastructure-layer technical debt, reduced production incidents dramatically (by more than 95%), and correspondingly improved resiliency, security, access management, and performance. Below is a chart that shows the improved customer-impact availability achieved through the migration, insourcing, and adoption of best practices.

These are truly remarkable results that enable Danske Bank to deliver superior service to our customers. Such reliability for online and mobile systems is critical in the digital age. Our IT infrastructure and applications teams worked closely together to accomplish the migration to our new, ‘pristine’ infrastructure. The data center design and migration were driven by our senior engineers with strong input from top industry experts, particularly CS Technology. A critical principle we followed was not to simply move old servers to the new centers but instead to set up a modern and secure ‘enclave’ private cloud and migrate from old to new. Of course, this is a great deal more work and requires extensive updates and testing of the applications. Working closely together, our architects and infrastructure engineers partnered to design our private cloud, establishing templates and services up through our middleware, API, and database layers. There were plenty of bumps in the road, especially in our earliest migrations as we worked out the cloud designs, but our CIO Fredrik Lindstrom and the application teams dug in, partnered with the infrastructure team, made room for the updates and testing, and successfully converted our legacy distributed systems to the new private cloud environments. While certainly a lengthy and complex process, we were ultimately successful. We are now reaping the benefits of a fully modernized cloud environment with rapid server implementation times and lower long-term costs (you can see further guidelines here on how to build a private cloud). In fact, we have benchmarked our private cloud environment, and it is 20 to 70% less expensive than comparable commercial offerings (including AWS and Azure). A remarkable achievement indeed, and as the feather in the cap, the program, led by Magnus Jacobsen, was executed on a relatively flat budget, as we used savings generated from insourcing and consolidations to fund much of the needed investment.

Throughout the design and the migration, we stayed abreast of the cloud investments and results at many peer institutions and elsewhere. We have always viewed our cloud transformation as an infrastructure quality solution that could provide secondary benefits in performance, cycle time, and cost. But our core objective was achieving the availability and resiliency benefits and eliminating the massive risk posed by legacy data center environmentals. Yet much of the dialogue in the industry is focused on cloud as a time-to-market and cost solution for companies with complex legacy environments, enabling them to somehow significantly reduce systems costs and greatly improve development time to market.

Let’s consider how realistic this rationale is. First, how real is the promise of reduced development time to market due to public cloud? Perhaps, if you are comparing an AWS implementation to a traditional proprietary server shop with mediocre service and lengthy delivery times for even rudimentary servers, then, yes, you enable development teams to dial up their server capacity much more easily and quickly. But compared to a modern private cloud implementation, the time to implement a new server for an application is (or should be) comparable. So, on an apples-to-apples basis, public and private cloud are generally comparably quick. More importantly, for a business application or service under development, the server implementation tasks should be done in parallel with the primary development work, with little to no impact on the overall development schedule or time to market. In fact, the largest tasks in application development are often the project initiation, approval, and definition phases (for traditional waterfall) and the project initiation, approval, and initial sprint phases for agile projects. In other words, management decisions and defining what the business wants the solution to do remain the biggest and longest tasks. If you are looking to improve your time to market, these are the areas where IT leadership should focus. Improving your time to get from ‘idea to project’ is typically a good investment in large organizations. Medium and large corporations are often constrained as much by the annual finance process and investment approval steps as by any other factor. We are all familiar with investment processes that require several different organizations to agree and many hurdles to be cleared before an idea can be approved. And the larger the organization, the more likely it is that the investment process is the largest drag on time to market.

Even after you have approval for the idea and the project is funded, the next lengthy step is often ensuring that adequate business, design, and technology resources are allocated and that the project has enough priority to get off the ground. Most large IT organizations are overwhelmed with too many projects, too much work, and not enough time or resources. Proper prioritization and ensuring that not too many projects are in flight at any one time are crucial to enabling projects to proceed at reasonable speed. Once the funding and resources are in place, adopting proper agile approaches (e.g., joint technology and business agile development methods) can greatly improve time to market.

Thus, most time-to-market issues have little to do with infrastructure and cloud options and almost everything to do with management and leadership challenges. And the larger the organization, the harder it is to focus and streamline. Perhaps the most important part of your investment process is deciding what not to do, so that you can focus your efforts on the most important development projects. To attain the prized time to market so important in today’s digital competition, drive instead for a smooth investment process coupled with properly allocated, flexible development teams and agile processes. Streamlining these processes and ensuring effective project startups (project manager assigned, dedicated resources, etc.) will yield material time-to-market improvements. And having a modern cloud environment will then nicely support your streamlined initiatives.

On the cost promise of public cloud, I find it surprising that many organizations are looking to public cloud as a silver bullet for improving their costs. For either legacy or modern applications, the largest costs are software development and software maintenance – anywhere from 50% to 70% of the full lifetime cost of a system. Next comes IT operations – running the systems and the production environment – along with IT security and networks, at around 15-20% of total cost. This leaves 15 to 30% of lifetime cost for infrastructure, including databases, middleware, and messaging as well as the servers and data centers. Thus, servers and storage total perhaps 10-15% of the lifetime cost. Perhaps you can achieve a 10%, or even 20% or 30%, reduction in this cost area, for a total systems cost reduction of 2-5%. And if you have a modern environment, public cloud would actually be at a cost disadvantage (at Danske Bank, our new private cloud costs are 20% to 70% lower than AWS, Azure, and other public clouds). Further, focusing on a 2% or 5% server cost reduction will not transform your overall cost picture in IT. Major efficiency gains in IT will come from far better performance in your software development and maintenance — improving productivity, building a better and more skilled workforce with fewer contractors, or leveraging APIs and other techniques to reduce technical debt and improve software flexibility. It is disingenuous to suggest you are tackling primary systems costs and making a difference for your firm with public cloud. You can deliver 10x greater total systems cost improvements by introducing and rolling out software development best practices, achieving an improved workforce mix, and simplifying your systems landscape than by simply substituting public cloud for your current environment. And as I noted earlier, we have actually achieved lower costs with a private cloud solution versus commercial public cloud offerings.
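The lifetime-cost arithmetic above can be made concrete with a small sketch. The category shares below are assumed midpoints within the ranges cited in the text, not actual figures from any firm:

```python
# Illustrative lifetime-cost breakdown for a typical system. The shares
# are assumed midpoints of the ranges discussed above (not real data).
LIFETIME_COST_SHARES = {
    "software_dev_and_maintenance": 0.60,  # 50-70% of lifetime cost
    "operations_security_networks": 0.18,  # ~15-20%
    "other_infrastructure": 0.10,          # databases, middleware, messaging
    "servers_and_storage": 0.12,           # ~10-15%
}

def total_reduction(server_cost_cut: float) -> float:
    """Total systems-cost reduction from cutting only server/storage spend."""
    return LIFETIME_COST_SHARES["servers_and_storage"] * server_cost_cut

# Even an aggressive 30% cut in server costs moves total lifetime cost
# by only about 3.6% under these assumptions.
print(f"{total_reduction(0.30):.1%}")
```

The point of the sketch is simply that a large percentage cut applied to a small cost category yields a small total saving.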
And there are hidden factors to consider with public cloud. For example, when testing a new app on your private cloud, you can run scripts in off hours to your heart’s content at minimal to no cost, but you would need to watch your usage carefully on a public cloud, as all usage results in costs. The more variable your workload, the more likely it will cost less on a public cloud; conversely, the more stable your total workload, the more likely you can achieve significant savings with a private cloud.

On a final note, with public cloud solutions comes lock-in, not unlike previous generations of proprietary hardware or wholesale outsourcing. I am certain a few of you recall the extensions made to proprietary Unix flavors like AIX and HP-UX that provided modest gains but then increased the lock-in of an application to that vendor’s platform. Of course, the cost increases from these vendors came later, as did migration hurdles to new and better solutions. The same feature-extension game occurs today in the public cloud setting with Azure, AWS, and others. Once you write your applications to take advantage of their proprietary features, you have become an annuity stream for that vendor, and any future migration off of their cloud will be arduous and expensive. Your ability to move to another vendor will typically be eroded and compromised with each system upgrade you implement. Future license and support price increases will need to be accepted unless you are willing to take on a costly migration. And you have now committed your firm’s IT systems and data to be handled elsewhere with less control — potentially a long-term problem in the digital age. Note that your application and upgrade schedules are now determined by the cloud vendor, not by you. If you have legacy applications (as we all do) that rely on an older version of infrastructure software or middleware, these must be upgraded to keep pace, or they stop working. And don’t count on a rollback if problems are found after the cloud vendor’s upgrades.

Perhaps more concerning, in this age of ever-bigger hacks, is that public cloud environments become the biggest targets for hackers, from criminal gangs to state-sponsored actors. And while cloud providers have much larger security resources, there is still a rich target surface for hackers. The recent Capital One breach is a reminder that proper security remains a major task for the cloud customer.

In my judgement, larger corporations are certainly better off maintaining control of their digital capabilities with a private cloud environment than with a public cloud. This will likely be supplemented with a multi-cloud environment to enable key SaaS capabilities or to leverage public cloud scalability and variable expense for non-core applications. And with the improving economics of server technology and better cloud automation tools, these environments can be effectively implemented by medium-sized corporations as well. Having the best digital capabilities — and controlling them for your firm — is key to outcompeting in most industries today. If you have the scale to retain control of your digital environment and data assets, then this is the best course to enabling future digital success.

What is your public or private cloud experience? Has your organization mastered private, public, or multi-cloud? Please share your thoughts and comments.

Best, Jim

P.S. It is worth noting that public clouds are not immune to availability issues either, as reported here.

The Infrastructure Engineering Lifecycle – How to Build and Sustain a Top Quartile Infrastructure

There are over 5,000 books on application development methods on Amazon. There are dozens of industry and government standards that map out methodologies for application development. And for IT operations and IT production processes like problem and change management, IT Service Management and ITIL standards provide excellent guidance and structure. Yet for the infrastructure systems on which applications fully rely, there is scarcely a publication that outlines the approaches organizations should use to build and sustain a robust infrastructure. ITIL ventures slightly into this area but really just re-defines a waterfall application project cycle in infrastructure terms. During many years of building, re-building, and sustaining top quartile infrastructure, I have developed a lifecycle methodology for infrastructure: the ‘Infrastructure Engineering Lifecycle’ (IELC).

The importance of infrastructure should not be overlooked in our digital age. Not only have customer expectations increased for services where they expect ‘always on’ web sites and transaction capabilities, but they also require quick response and seamless integration across offerings. Certainly the software is critical to provide the functionality, but none of these services can be reliably and securely provided without a well-built infrastructure underpinning all of the applications. A top quartile infrastructure delivers outstanding reliability (on the order of 99.9% or better availability), zippy performance, excellent unit costs, all with robust security and resiliency.
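To put the ‘99.9% or better’ availability figure in perspective, a quick calculation shows how much downtime each availability level actually permits per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def annual_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# 99.9% ("three nines") still allows nearly nine hours of downtime a year;
# each additional nine cuts the allowance by a factor of ten.
for level in (0.999, 0.9999):
    print(f"{level:.2%}: {annual_downtime_minutes(level):.0f} minutes/year")
```

This is why top quartile shops chase each additional ‘nine’: the difference between 99.9% and 99.99% is the difference between hours and minutes of customer impact per year.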

Often enterprises make the mistake of addressing infrastructure only when things break, and then they fix or invest only enough to get things running again instead of rebuilding a modern plant correctly. This is unfortunate, because not only will they likely experience further outages and service impacts, but their full infrastructure costs are also likely to be higher for a dated, dysfunctional plant than for an updated, modern one. Unlike most assets, a modern, well-designed IT infrastructure is, in my experience, cheaper to run than a poorly maintained plant with various obsolete or poorly configured elements. Remember that every new generation of equipment can do basically twice as much as the previous one, so you have fewer components, less maintenance, less administration, and fewer things that can go wrong. In addition, a modern plant also boosts time to market for application projects and significantly reduces the portion of time both infrastructure and application engineers spend fixing things.

So, given the critical nature of well-run technology infrastructure in the world of digitalization, how do enterprises and CIOs build and maintain a modern plant with outstanding fit and finish? It is not just about buying lots of new equipment, or counting on a single vendor or cloud provider to take care of all the integration or services. Nearly all major enterprises have a legacy of systems that results in complexity and complicates the ability to deliver reliable services or keep pace with new capabilities. These complexities can rarely be handled by a systems integrator or single service provider. Further, a complete re-build of the infrastructure often requires major capital investment and can put availability even further at risk. The best course, then, is usually not to go ‘all-in’, launching a complete re-build or handing over the keys to a sole outsourcer, but instead to take a ‘spiral optimization’ approach that addresses fundamentals and burning issues first, and then uses the newly acquired capabilities to tackle the more complex or less pressing issues that remain.

A repeated, closed-cycle approach (‘spiral optimization’) is our management approach, coupled with an Infrastructure Engineering Lifecycle (IELC) methodology to build top quartile infrastructure. For the first cycle of the infrastructure rebuild, it is important to address the biggest issues. Front and center, the entire infrastructure team must focus on quality. Poorly designed or built infrastructure becomes a black hole of engineering time, as rework demands grow with each failure or application built upon a teetering platform. And while it must be understood that everything cannot be fixed at once, those things that are undertaken must be done with quality. This includes documenting the systems and getting them correctly into the asset management database. And it includes coming up with a standard design or service offering if none exists. Having 5,000 servers must be viewed as a large expense requiring great care and feeding — and the only thing worse is having 5,000 custom servers because your IT team did not take the time to define the standard, keep it up to date, and maintain and patch it consistently. 5,000 custom servers are a massive expense that likely cannot be effectively and efficiently maintained or secured by any team. There is no cheaper time than the present to begin standardizing and fixing the mess, by requiring that the next server built or significantly updated becomes the new standard. Don’t start this effort, though, until you have the engineering capacity to do it. A standard design done by lousy engineers is not worth the investment. So, as an IT leader, while you are insisting on quality, ensure you have adequate talent to engineer your new standards. If you do not have it on board, leverage top practitioners in the industry to help your team create the new designs.

In addition to quality and starting to do things right, there are several fundamental practices that must be implemented. Your infrastructure engineering work should be guided by the Infrastructure Engineering Lifecycle – a methodology and set of practices that ensure high-quality platforms that are effective, efficient, and sustainable.

The IELC covers all phases of infrastructure platforms – from emerging platforms to standard ones to declining and obsolete ones. Importantly, the IELC comprises three cycles of activity, recognizing that infrastructure requires constant grooming and patching, with inputs typically coming from external parties, while technology advances occur regularly enough that over 3 to 10 years nearly every hardware platform becomes obsolete and must be replaced. The three cycles of activity are:

  • Platform – This is the foundational lifecycle activity, where hardware and utility software are defined, designed, and integrated into a platform to perform a particular service. Generally, for medium and large companies, this is a 3 to 5 year lifecycle. Examples include a server platform, a storage platform, or an email platform.
  • Release – Once a platform is initially designed and implemented, organizations should expect to refresh the platform on a regular basis to incorporate major underlying product or technology enhancements, address significant design flaws or gaps, and improve operational performance and reliability. Releases should be planned at 3 to 12 month intervals over the life of the platform (usually 3 to 5 years).
  • Patch – A patch cycle should also be employed, in which minor upgrades (both fixes and enhancements) are applied on a regular, routine basis. The patch cycle should synchronize both with the underlying patch cycle of the OEM (Original Equipment Manufacturer) for the product and with the security and production requirements of the organization. Usually, patch cycles are used to incorporate security fixes and significant production defect fixes issued by the OEM. Typical patch cycles run from weekly to every 6 months.
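The three cycles above can be captured in a small configuration sketch. The week-based intervals are my translation of the timeframes described in the bullets, used here purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Cycle:
    name: str
    purpose: str
    min_interval_weeks: float  # shortest typical cadence
    max_interval_weeks: float  # longest typical cadence

# The three IELC activity cycles, with cadences translated to weeks
IELC = [
    Cycle("Platform", "define, design, and integrate hardware and utility software",
          156, 260),  # 3 to 5 years
    Cycle("Release", "refresh for major enhancements, design gaps, and reliability",
          13, 52),    # 3 to 12 months
    Cycle("Patch", "routine minor fixes, security and defect patches",
          1, 26),     # weekly to every 6 months
]

for c in IELC:
    print(f"{c.name:8s} every {c.min_interval_weeks:g}-{c.max_interval_weeks:g} "
          f"weeks: {c.purpose}")
```

Encoding the cadences this way makes the nesting explicit: each inner cycle turns faster than the one above it, and several releases and many patches occur within one platform lifetime.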

Below is a diagram that represents the three infrastructure engineering life cycles and the general parameters of the cycles.

Infrastructure Engineering Cycles

In subsequent posts, I will further detail key steps and practices within the cycles and provide templates that I have found effective for infrastructure teams. As a preview, here is a diagram of the cycles with their activities and attributes.

IELC Preview

What key practices or techniques have you used to enable your infrastructure teams to achieve success? I look forward to your thoughts and comments.

Best, Jim Ditmore

 

Getting to Private Cloud: Key Steps to Build Your Cloud

Now that I am back from summer break, I want to continue the discussion on cloud and map out how medium and large enterprises can build their own private clouds. As we’ve discussed previously, software-as-a-service, engineered stacks, and private cloud will be the biggest IT winners in the next five to ten years. Private clouds hold the most potential — in fact, early adopters such as JP Morgan Chase and Fidelity are seeing larger savings and greater benefits than initially anticipated.

While savings is a key driver of the move to private cloud, faster development cycles and better time to market are turning out to be more significant and more valuable to early adopters than initially estimated. And it is not just a speed improvement but a qualitative one: smaller projects can be trialled, and riskier pilots executed, with far greater speed and nominal cost. Organizations can test risky ideas as small, low-cost projects, quickly dispensing with those that fail and accelerating those that show promise. This ‘fast fail’ approach to corporate innovation greatly speeds the selection process, avoids extensive wasted investment in lengthier traditional pilots (that would have failed anyway), and greatly improves time to market for the ideas that succeed.

As for the larger savings, early implementations at scale are seeing savings well in excess of 50%. This is well beyond my estimate of 30% and is occurring in large part because of the vastly reduced labor requirements to build and administer a private cloud versus traditional infrastructure.

So, with greater potential benefits, how should an IT department go about building a private cloud? The fundamental building block of a private cloud is a base of virtualized servers using commodity hardware and leveraging open systems. And of course you need the server engineering and administration expertise to support the platform. There is also a strong early trend toward leveraging open source software for private clouds, from the Linux operating system to OpenNebula and Eucalyptus for infrastructure management. But just having a virtualized server platform does not make a private cloud. Several additional elements are required.

First, establish a set of standardized images that constitute most of the stack. Preferably, that stack will run from the hardware layer to the operating system to the application server layer, and it will include systems management, security, middleware, and database. Ideally, go with a dozen or fewer server images, and certainly no more than 20. Consider everything else custom, to be treated separately and differently from the cloud.

Once you have established your target set of private cloud images, you should build a catalogue and ordering process that is easy, rapid, and transparent. The costs should be clear, and the server units should be processor-months or processor-weeks. You will need to couple the catalogue with highly automated provisioning and de-provisioning. Your objective should be to deliver servers quickly, certainly within hours, preferably within minutes (once the costs are authorized by the customer). De-provisioning should be just as rapid and regular. In fact, you should offer automated ‘sunset’ servers in test and development environments (e.g., 90 days after the servers are allocated, they are automatically returned to the pool). I strongly recommend well-published, clear cost and allocation reporting to drive the right behaviors among your users. It will encourage quicker adoption, better and more efficient usage, and rapid turn-in when servers are no longer needed. With these four prerequisites in place (standard images; a catalogue and easy ordering process; clear costs and allocations; and automated provisioning and de-provisioning) you are ready to start your private cloud.
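The automated ‘sunset’ policy can be sketched in a few lines. The server names, dates, and 90-day default below are hypothetical, illustrating only the reclamation logic:

```python
from datetime import date, timedelta

SUNSET_DAYS = 90  # assumed default sunset for test/dev servers

def servers_to_reclaim(allocations, today):
    """Given a mapping of server name -> allocation date, return the
    test/dev servers past their sunset date, ready for automated
    return to the pool."""
    cutoff = today - timedelta(days=SUNSET_DAYS)
    return sorted(name for name, allocated in allocations.items()
                  if allocated <= cutoff)

# Hypothetical test/dev allocations
allocations = {
    "dev-web-01": date(2019, 1, 5),   # allocated 100 days ago
    "dev-db-02": date(2019, 3, 20),   # allocated 26 days ago
}
print(servers_to_reclaim(allocations, today=date(2019, 4, 15)))
```

A real implementation would notify the owner and offer a renewal window before reclaiming, but the principle is the same: de-provisioning is a scheduled default, not a favor requested of the user.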

Look to build your private cloud in parallel with your traditional data center platforms. There should be both a development and test private cloud and a production private cloud. Seed the cloud with an initial investment of servers of each standard type. Then transition demand into the private cloud as new projects initiate, and grow it project by project.

You could begin by routing small and medium-sized projects to the private cloud environment and, as it builds scale and the provisioning kinks are ironed out, migrate more and more server requests until nearly all requests are routed through your private cloud path. As you achieve scale and prove out your ordering, provisioning, and de-provisioning processes, you can begin to tighten the criteria for projects to proceed with traditional custom servers. Within 6 months, custom, traditional servers should be the rare exception and should be charged fully for the excess costs they will generate.

Once the private cloud is established, you can verify the cost savings and advantages. And there will be additional advantages, such as improved time to market, because server deployment is no longer the long pole in the tent for your development efforts. Well-armed with this data, you can circle back and tackle existing environments and legacy custom servers. While a platform transition on its own is often not a good investment, a transition to private cloud during another event (e.g., a major application release or a server end-of-life migration) should easily become a winning investment. A few early adopters (such as JPMC and Fidelity) are seeing outsized benefits and a strong developer push into these private cloud environments. So, if you build it well, you should be able to reap the same advantages.

How is your cloud journey proceeding? Are there other key steps necessary to be successful? I look forward to hearing your perspective.

Best, Jim Ditmore

 

Looking to Improve IT Production? How to Start

Production issues, as Microsoft and Google can tell you, impact even cloud email apps. A few weeks ago, Microsoft took an entire weekend to fully recover its cloud Outlook service. Perhaps you noted the issues earlier this year in financial services, where Bank of America experienced internet site availability issues. Unfortunately for Bank of America, that was their second outage in 6 months, though they are not alone in having problems, as Chase suffered a similar production outage on their internet services the following week. And these are regular production issues, not the unavailability of websites and services due to a series of DDoS attacks.

Perhaps 10 or certainly 15 years ago, such outages would have drawn far less notice from customers, as front office personnel would have worked alternate systems and manual procedures until the systems were restored. But with customers now accessing the heart of most companies’ systems through internet and mobile applications, typically on a 7×24 basis, it is very difficult to avoid direct and widespread customer impact in the event of a system failure. Your production performance becomes very evident to your customers. And your customers’ expectations have continued to increase: they expect your company and your services to be available pretty much whenever they want to use them. And while availability is not the only attribute that customers value (usability, features, service, and pricing factor in importantly as well), companies that consistently meet or exceed consumer availability expectations gain a key edge in the market.

So how do you deliver to current and rising expectations around the availability of your online and mobile services? And if BofA and Chase, large organizations that offer dozens of online services and have massive IT departments, have issues delivering consistently high availability, how can smaller organizations deliver compelling reliability?

And often, the demand for high availability must be met in an environment where ongoing efficiency drives have eroded the production base and a tight IT labor market has further complicated obtaining adequate expertise. If your organization is struggling with availability, or you are looking to achieve top quartile performance and competitive service advantage, here’s where to start:

First, understand that availability, at its root, is a quality issue. And quality issues can only be changed if you address all aspects. You must set quality and availability as a priority, as a critical and primary goal for the organization. And you will need to ensure that incentives and rewards are aligned to your team’s availability goal.

Second, you will need to address the IT change processes. You should look to implement an ITSM change process based on ITIL, but don’t wait for a fully defined process to be implemented. You can start by limiting changes to appropriate windows. Establish release dates for major systems and accompanying subsystems. Avoid changes during key business hours or just before the start of the day. I still remember the ‘night programmers’ at Ameritrade at the beginning of our transformation there. Staying late one night as CIO in my first month, I noticed two guys come in at 10:30 PM. When I asked what they did, they said, ‘We are the night programmers. When something breaks with the nightly batch run, we go in and fix it.’ And this was done with no change records, minimal testing, and minimal documentation. Of course, my hair stood on end hearing this. We quickly discontinued that practice and instead made changes as a team, after they were fully engineered and tested. Combining this action with a number of other measures mentioned here enabled us to quickly reach a stable platform with the best availability track record of all the online brokerages.

Importantly, you should ensure that adequate change review and documentation is being done by your teams for their changes. Ensure they take accountability for their work and their quality. Drive to an improved change process with templates for reviews, proper documentation, back-out plans, and validation. Most failed changes are due to issues with the basics: a lack of adequate review and planning, poor documentation of deployment steps, missing or ineffective validation, or one person doing an implementation alone in the middle of the night when at least two people should be doing it together (one to do, and one to check).

Also, you should measure the proportion of incidents due to change. If you experience mediocre or poor availability and failed changes contribute more than 30% of your incidents, you should recognize that change quality is a major contributor to your issues. You will need to zero in on the areas with chronic change issues. Measure the change success rate (the percentage of changes executed without causing a production incident) for each of your teams. Publish the results by team (this will help drive more rapid improvement). Often, you can quickly find which of your teams have inadequate quality because their change success rate ranges from a very poor mid-80s percentage to a mediocre mid-90s percentage. Good shops deliver above 98%, and a first-quartile shop consistently has a change success rate of 99% or better.
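The change success rate league table described above is straightforward to compute and publish. A minimal sketch (the team names and counts below are purely illustrative, not data from any real shop):

```python
# Change success rate: the percentage of changes executed without causing
# a production incident. Teams and counts are illustrative examples only.
changes = {
    # team: (changes_executed, changes_causing_incidents)
    "payments":     (250, 30),   # mid-80s percent: a chronic problem area
    "core-banking": (400, 20),   # mid-90s percent: mediocre
    "channels":     (500, 5),    # 99 percent: first quartile
}

success_rate = {
    team: 100.0 * (total - failed) / total
    for team, (total, failed) in changes.items()
}

# Publish the results worst-first to focus improvement efforts.
for team, rate in sorted(success_rate.items(), key=lambda kv: kv[1]):
    flag = ("first quartile" if rate >= 99
            else "good" if rate >= 98
            else "needs attention")
    print(f"{team:13s} {rate:5.1f}%  {flag}")
```

Publishing the sorted list by team, rather than a single aggregate number, is what creates the peer pressure that accelerates improvement.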

Third, ensure all customer-impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center (NOC), where all of your systems and infrastructure are monitored (not just networks). The ECC also has the capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires investment in an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues requires consolidating these capabilities and extending their reach. If you have an ECC in place, ensure that all customer-impacting issues are fully reported and handled. Underreporting of issues that impact a segment of your customer base, or the siphoning off of a problem to be handled by a local team, is akin to trying to handle a house fire with a garden hose and not calling the fire department. Call the fire department first, and then get the garden hose out while the fire trucks are on their way.

Fourth, you must execute strong root cause analysis and follow-up. These efforts must operate at the individual issue or incident level as well as at a summary or higher level. It is important not just to fix each individual incident and get to its root cause, but also to look for the overall trends and patterns across your issues. Do they cluster around one application or infrastructure component? Are they caused primarily by change? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Are your designs too complex? Are you using products in a mainstream or unique manner — especially if you are seeing many OS or product defects? Use these patterns and analyses to identify the systemic issues your organization must fix. They may be process issues (e.g., poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation or inadequately skilled staff). Track both the fixes for individual issues and the efforts to address systemic issues. The systemic efforts will begin to yield improvements that eliminate future issues.
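The pattern-spotting step above amounts to tallying incidents along a few dimensions and flagging concentrations. A minimal sketch (the incident records, field names, and the 30% change-quality threshold from earlier in this post are illustrative):

```python
from collections import Counter

# Hypothetical incident records; fields and values are illustrative.
incidents = [
    {"component": "payments-db", "cause": "failed change"},
    {"component": "payments-db", "cause": "failed change"},
    {"component": "web-portal",  "cause": "capacity"},
    {"component": "payments-db", "cause": "defect"},
    {"component": "atm-switch",  "cause": "failed change"},
]

by_cause = Counter(i["cause"] for i in incidents)
by_component = Counter(i["component"] for i in incidents)

# Flag change quality as a systemic issue if more than 30% of
# incidents stem from failed changes (per the threshold above).
change_share = by_cause["failed change"] / len(incidents)
print(f"Change-related share of incidents: {change_share:.0%}")
if change_share > 0.30:
    print("Systemic issue: change quality")

# Components with chronic issues are candidates for deeper root-cause work.
print("Top components:", by_component.most_common(2))
```

Even a tally this simple, run monthly over real incident tickets, surfaces the clusters that individual post-mortems miss.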

These four efforts will set you on a solid course to improved availability. If you couple these efforts with diligent engagement by senior management and disciplined execution, the improvements will come slowly at first, but then will yield substantial gains that can be sustained.

You can achieve further momentum with work in several areas:

  • Document configurations for all key systems. If you are doing discovery during incidents, it is a clear indicator that your documentation and knowledge base are highly inadequate.
  • Review how incidents are reported. Are they user-reported, or did your monitoring identify the issue first? At least 70% of issues should be identified first by you, and eventually you will want to drive this to a 90% level. If you are lower, then you need to invest in improving your monitoring and diagnostic capabilities.
  • Do you report availability in technical measures or business measures? Time-based systems availability and counts of incidents by severity are technical measures. You should look to implement business-oriented measures, such as customer impact availability, to drive greater transparency and more accurate metrics.
  • In addition to eliminating issues, reduce your customer impacts by reducing the time to restore service (Microsoft can certainly stand to consider this area given that their latest outage lasted three days!). Mean time to restore (MTTR — note this is not mean time to repair but mean time to restore service) has three components: time to detect (MTTD), time to diagnose or correlate (MTTC), and time to fix, i.e., restore service (MTTF). An IT shop that is effective at resolution normally sees MTTR at 2 hours or less for its priority issues, with each of the three components taking about 1/3 of the time. If your MTTD is high, again look to invest in better monitoring. If your MTTC is high, look to improve correlation tools, systems documentation, or engineering knowledge. And if your MTTF is high, again look to improve documentation or engineering knowledge, or automate recovery procedures.
  • Consider investing in greater resiliency for key systems. It may be that customer expectations of availability exceed current architecture capabilities. Thus, you may want to invest in greater resiliency and redundancy or build a more highly available platform.
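The MTTR decomposition in the list above is easy to track once you record detect, correlate, and fix times per incident. A minimal sketch, with purely illustrative timings in minutes:

```python
# MTTR decomposition for priority incidents: detect + correlate + fix.
# Times are in minutes; the records below are illustrative examples.
incidents = [
    {"detect": 30, "correlate": 45, "fix": 40},
    {"detect": 60, "correlate": 30, "fix": 50},
    {"detect": 20, "correlate": 50, "fix": 35},
]

n = len(incidents)
mttd = sum(i["detect"] for i in incidents) / n     # mean time to detect
mttc = sum(i["correlate"] for i in incidents) / n  # mean time to correlate
mttf = sum(i["fix"] for i in incidents) / n        # mean time to fix
mttr = mttd + mttc + mttf                          # mean time to restore

print(f"MTTD {mttd:.0f}m  MTTC {mttc:.0f}m  MTTF {mttf:.0f}m  MTTR {mttr:.0f}m")

# An effective shop keeps MTTR near 120 minutes, with each component
# contributing roughly a third; the largest component tells you where
# to invest (monitoring, correlation tools, or recovery automation).
if mttr > 120:
    print("Above the 2-hour benchmark: attack the largest component first.")
```

Tracking the three components separately, rather than MTTR alone, is what turns the metric into an investment signal.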

As you can see, providing robust availability for your customers is a complex endeavor. By implementing these steps, you can enable sustainable and substantial progress to top quartile performance and achieve business advantage in today’s 7×24 world.

What would you add to these steps? What were the key factors in your shop’s journey to high availability?

Best, Jim Ditmore

Turning the Corner on Data Centers

Recently I covered the ‘green shift’ of servers, where each new server generation not only drives major improvements in compute power but also requires about the same or even less environmentals (power, cooling, space) as the previous generation. Thus, compute efficiency, or compute performance per watt, is improving exponentially. And this trend in servers, which started in 2005 or so, is also being repeated in storage. We have seen a similar improvement in power per terabyte for the past 3 generations (since 2007). The current storage product pipeline suggests this efficiency trend will continue for the next several years. Below is a chart showing representative improvements in storage efficiency (power per terabyte) across storage product generations from a leading vendor.

Power (VA) per Terabyte

With current technology advances, a terabyte of storage on today’s devices requires approximately 1/5 of the power of a device from 5 years ago. And these power requirements could drop even more precipitously with the advent of flash technology. By some estimates, the switch to flash products brings a drop of 70% or more in power and space requirements. In addition to being far more power efficient, flash will offer huge performance advantages for applications, with corresponding reductions in the time to complete workloads. So expect flash storage to quickly convert the market once mainstream product introductions occur. IBM sees this as just around the corner, while other vendors see the flash conversion as 3 or more years out. In either scenario, there are continued major improvements in storage efficiency in the pipeline that deliver far lower power demands even with increasing storage requirements.

Ultimately, with the combined efficiency improvements of both storage and server environments over the next 3 to 5 years, most firms will see a net reduction in data center requirements. Typical corporate data center power requirements are approximately one half servers, one third storage, and the remainder network and other devices. With the two biggest components experiencing dramatic ongoing power efficiency gains, net power and space demand should decline in the coming years for all but the fastest-growing firms. Add in the effects of virtualization, engineered stacks, and SaaS, and the data centers in place today should suffice for most firms if they maintain a healthy replacement pace of older technology and embrace virtualization.
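A back-of-the-envelope sketch of that net-decline argument (the power shares match the rough split above, but the refresh factors and growth rate are illustrative assumptions, not measured figures):

```python
# Rough data center power projection over one refresh cycle.
# Shares follow the approximate split cited above; the efficiency
# factors and growth rate are illustrative assumptions.
server_share, storage_share, other_share = 0.50, 0.33, 0.17

server_power_factor = 0.60   # assume new servers deliver same capacity at 60% power
storage_power_factor = 0.50  # assume storage power per TB roughly halves per generation
demand_growth = 1.30         # assume capacity demand grows 30% over the cycle

new_power = (server_share * server_power_factor
             + storage_share * storage_power_factor) * demand_growth \
            + other_share

print(f"Relative power demand after refresh: {new_power:.2f}x")
```

Under these assumptions, total power demand falls to roughly three quarters of today’s level even while capacity grows 30%, which is the heart of the argument that existing data centers should suffice for most firms.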

Despite such improvements in efficiency, we could still see a major addition in total data center space, because cloud and consumer firms like Facebook are investing major sums in new data centers. This consumer data center boom also shows the effects of growing consumerization in the technology marketplace. Consumerization, which started with PCs and PC software, and then moved to smart phones, has impacted the underlying technologies dramatically. The most advanced compute chips are now those developed for smart phones and video games. Storage technology demand and advances are driven heavily by smart phones and products like the MacBook Air, which already relies solely on flash storage. The biggest and best data centers? No longer the domain of corporate demand; instead, consumer demand (e.g., Gmail, Facebook) drives bigger and more advanced centers. The proportion of data center space dedicated to direct consumer compute needs (à la Gmail or Facebook) versus enterprise compute needs (even for companies that provide direct consumer services) will see a major shift from enterprise to consumer over the next decade. This will follow the shifts in chips and storage, which at one time were driven by the enterprise space (and before that, the government) and are now driven by the consumer segment. And it is highly likely that there will be a surplus of enterprise-class data centers (50K – 200K of raised floor space) in the next 5 years. These centers are too small and inefficient for a consumer data center (500K – 2M or larger), and with declining demand and consolidation effects, plenty of enterprise data center space will be on the market.

As an IT leader, you should ensure your firm is riding the compute and storage efficiency trends. Further multiply these demand reduction effects by leveraging virtualization, engineered stacks, and SaaS (where appropriate). If you have a healthy buffer of data center space now, you could avoid major investments and costs in data centers for the next 5 to 10 years by taking these measures. Those monies can instead be spent on functional investments that drive more direct business value, or drop to the bottom line of your firm. If you have excess data centers, I recommend consolidating quickly and disposing of the space as soon as possible. These assets will be worth far less in the coming years given the likely oversupply. Perhaps you can partner with a cloud firm looking for data center space if your asset is strategic enough for them. Conversely, if you have minimal buffer and see continued higher business growth, it may be possible to acquire good data center assets for far less unit cost than in the past.

For 40 years, technology has ridden Moore’s Law to yield ever-more-powerful processors at lower cost. Its compounding effects have been astounding — and we are now seeing nearly 10 years of similar compounding on the power efficiency side of the equation (below is a chart for processor compute power advances and compute power efficiency advances).

Trend Change for Power Efficiency

The chart above shows how the compute efficiency (performance per watt — green line) has shifted dramatically from its historical trend (blue lines). And it’s improving about as fast as compute performance is improving (red lines), perhaps even faster.

These server and storage advances have resulted in fundamental changes in data centers and their demand trends for corporations. Top IT leaders will take advantage of these trends and be able to direct more IT investment into business functionality and less into the supporting base utility costs of the data center, while still growing compute and storage capacities to meet business needs.

What trends are you seeing in your data center environment? Can you turn the corner on data center demand? Are you able to meet your current and future business needs and growth within your current data center footprint and avoid adding data center capacity?

Best, Jim Ditmore