The Elusive High Availability in the Digital Age

Well, the summer is over, even if we have had great weather into September. My apologies for the delay in a new post, and I know I have several topic requests to fulfill 🙂 Given our own journey at Danske Bank on availability, I thought it was best to revisit this topic and then come back around to other requests in my next posts. Enjoy, and I look forward to your comments!

It has been a tough few months for some US airlines with their IT systems availability. Hopefully, you were not caught up in the major delays and frustrations. Both Southwest and Delta suffered major outages in August and September. Add in power outages affecting equipment and multiple airlines recently in Newark, and you have many customers fuming over delays and cancelled flights. And the cost to the airlines was huge — Delta’s outage alone is estimated to have cost $100M to $150M, and that doesn’t include the reputational impact. Nor are such outages limited to the US airlines, with British Airways also suffering a major outage in September. Delta and Southwest are not unique in their problems; both United and American suffered major failures and widespread impacts in 2015. Even with large IT budgets, and hundreds of millions invested in upgrades over the past few years, airlines are struggling to maintain service in the digital age. The reasons are straightforward:

  • At their core, services are based on antiquated systems that have been partially refitted and upgraded over decades (the core reservation system design dates from the 1960s).
  • Airlines struggled earlier this decade to make a profit due to oil prices, and invested minimally in their IT systems to attack the technical debt. This was further complicated by the multiple systems integrations required by mergers.
  • As airlines have digitalized their customer interfaces and check-in procedures, the previous manual procedures are now backup steps that are infrequently exercised and woefully undermanned when IT systems do fail, resulting in massive service outages.

With digitalization reaching even further into the customer interfaces and operations, airlines, like many other industries, must invest in stabilizing their systems, addressing their technical debt, and getting serious about availability. Some should start with the best practices in the previous post on Improving Availability, Where to Start. Others, like many IT shops, have decent availability but still have much to do to get to first quartile availability. If you have made good progress but realize that three 9’s or preferably four 9’s of availability on your key channels is critical for you to win in the digital age, this post covers what you should do.
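
For reference, it is worth keeping the arithmetic behind those ‘nines’ at hand; a minimal sketch (the targets shown are the standard availability tiers):

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("two 9s", 0.99), ("three 9s", 0.999), ("four 9s", 0.9999)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label} ({availability:.2%}): "
          f"{downtime_min / 60:6.1f} hours/year, {downtime_min / 12:6.1f} min/month")
```

At four 9’s you have less than an hour of customer impact to spend per year, which is why the extensions below matter.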

Let’s start with the foundation. If you can deliver consistently good availability, then your team should already understand:

  • Availability is about quality. Poor availability is a quality issue. If you wish to achieve high availability, you must have a quality culture that emphasizes quality as a desired outcome and doing things right.
  • Most defects — which then cause outages — are injected by change. Thus, strong change management processes that identify and eliminate defects are critical to further reduce outages.
  • Monitor and manage to minimize impact. A capable command center with proper monitoring feeds and strong incident management practices may not prevent the defect from occurring but it can greatly reduce the time to restore and the overall customer impact. This directly translates into higher availability.
  • You must learn and improve from the issues. Your incident management process must be coupled with disciplined root cause analysis that ensures teams identify and correct underlying causes, avoiding future issues. This continuous learning and improvement is key to reaching high performance.

With this base understanding, and presumably with only smoldering problem areas left in your IT shop, there are excellent extensions that will enable your team to move to first quartile availability with moderate but persistent effort. For many enterprises, this is now a highly desirable business goal. Reliable systems translate to reliable customer interfaces, as customers now access the heart of most companies’ systems through internet and mobile applications, typically on a 7×24 basis. Your production performance becomes very evident, very fast, to your customers. And if you are down, they cannot transact, you cannot service them, your company loses real revenue, and more importantly, damages its reputation, often badly. It is far better to address these problems and gain a key edge in the market by consistently meeting or exceeding customer availability expectations.

First, if you have moved up from regularly fighting fires, just because outages are no longer an everyday occurrence does not mean that IT leadership no longer needs to emphasize quality. Delivering high quality must be core to your culture and your engineering values. As IT leaders, you must continue to reiterate the importance of quality and demonstrate your commitment to these values by your actions. When there is enormous time pressure to deliver a release, but it is not ready, you delay it until the quality is appropriate. Or you release a lower quality pilot version, with properly set customer and business expectations, that is followed in a timely manner by a quality release. You ensure adequate investment in foundational quality by funding system upgrades and lifecycle efforts so technical debt does not increase. You reward teams for high quality engineering, not for fire-fighting. You advocate inspections, or agile methods, that enable defects to be removed earlier in the lifecycle at lower cost. You invest in automated testing and verification that enables work to be assured of higher quality at much lower cost. You address redundancy and ensure resiliency in core infrastructure and systems. Single power cord servers still in your data center? Really?? Take care of these long-neglected issues. And if you are not sure, go look for these typical failure points (another common one being single-point-of-failure network connections). We used to call these ‘easter eggs’, as in the easter eggs that no one found in a preceding year’s easter egg hunt, and then you find the old, and quite rotten, easter egg on your watch. It’s no fun, but it is far better to find them before they cause an outage.

Remember that quality is not achieved by not making mistakes — a zero defect goal is not the target — instead, quality is achieved by a continuous improvement approach where defects are analyzed and causes eliminated, and where your team learns and applies best practices. Your target should be first quartile quality for your industry; that is what will provide competitive advantage. When you update the goals, also revisit the rewards of your organization and ensure they align with these quality goals.

Second, you should build on your robust change management process. To get to median capability, you should have already established clear change review teams, proper change windows, and deliveries through releases. Now, use the data to identify which groups are late in their preparation for changes, or where change defects cluster and why. These understandings can improve and streamline the change processes (yes, some late changes could be due, for example, to too many required approvals). Other clusters of issues may be due to specific steps being poorly performed or inadequate tools. For example, verification is often done as a cursory task and thus seldom catches critical change defects. The result is that the defect is only discovered in production, hours later, when your entire customer base is trying but unable to use the system. Such an outage was likely entirely avoidable with adequate verification, because you would have known at the time of the change that it had failed and could have taken action then to back out the change. The failed change data is your gold mine of information to understand which groups need to improve and where. Importantly, be transparent with the data: publish the results by team and by root cause clusters. Transparency improves accountability. As an IT leader, you must then make the necessary investments and align efforts to correct the identified deficiencies and avoid future outages.
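
As a rough illustration of mining that gold mine, here is a minimal sketch that computes change success rate by team and clusters failed changes by cause; the record fields and cause categories are hypothetical stand-ins for whatever your change management tool exports:

```python
from collections import Counter, defaultdict

# Hypothetical change records pulled from your change management tool.
changes = [
    {"team": "payments", "failed": False, "cause": None},
    {"team": "payments", "failed": True,  "cause": "inadequate verification"},
    {"team": "online",   "failed": True,  "cause": "late preparation"},
    {"team": "online",   "failed": False, "cause": None},
    # ... thousands more in practice
]

totals, failures = Counter(), Counter()
causes = defaultdict(Counter)
for c in changes:
    totals[c["team"]] += 1
    if c["failed"]:
        failures[c["team"]] += 1
        causes[c["team"]][c["cause"]] += 1

# Publish by team: transparency improves accountability.
for team in sorted(totals):
    rate = 100 * (1 - failures[team] / totals[team])
    top = causes[team].most_common(1)
    print(f"{team}: {rate:.1f}% change success rate"
          + (f", top cause: {top[0][0]}" if top else ""))
```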

Further, you can extend the change process by introducing a ‘production ready’ gate. Production ready is when a system or major update can be introduced into production because it is ready on all the key performance aspects: security, recoverability, reliability, maintainability, usability, and operability. In our typical rush to deliver key features or products, the sustainability of the system is often neglected or omitted. By establishing the Operations team as the final approval gate for a major change to go into production, and leveraging the production ready criteria, organizations can ensure that these often neglected areas are attended to and properly delivered as part of the normal development process. These steps then enable a much higher performing system in production and avoid customer impacts. For a detailed definition of the production ready process, please see the reference page.

Third, ensure you have consolidated your monitoring and that all significant customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). This modern ECC also has the capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires an investment in an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues, these capabilities must be consolidated and integrated. Once you have an integrated ECC, you can extend it by moving from component monitoring to full channel monitoring. Full channel monitoring is where the entire stack for a critical customer channel (e.g. online banking for financial services or customer shopping for a retailer) has been instrumented so that a comprehensive view can be continuously monitored within the ECC. The instrumentation is such that not only are all the infrastructure components fully monitored, but the databases, middleware, and software components are instrumented as well. Further, proxy transactions are run on a periodic basis to understand performance and detect any issues. This level of instrumentation requires considerable investment — and thus is normally done only for the most critical channels. It also requires sophisticated toolsets such as AppDynamics. But full channel monitoring enables immediate detection of issues or service failures, and most importantly, enables very rapid correlation of where the fault lies. This rapid correlation can take incident impact from hours to minutes or even seconds. Automated recovery routines can be built to accelerate recovery from given scenarios and reduce impact to seconds. If your company’s revenue or service is highly dependent on such a channel, I highly recommend the investment. A single severe outage that is avoided or greatly reduced can often pay for the entire instrumentation cost.
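
To make the proxy transaction idea concrete, here is a minimal sketch of a synthetic probe; the endpoint URL and thresholds are placeholders, it uses the third-party Python requests library, and a commercial toolset such as AppDynamics would provide a far richer version of this:

```python
import time
import requests  # third-party HTTP library

CHANNEL_URL = "https://example.com/online-banking/login"  # placeholder endpoint
LATENCY_SLO_SECONDS = 2.0

def probe() -> dict:
    """Run one synthetic transaction against the channel and classify the result."""
    start = time.monotonic()
    try:
        resp = requests.get(CHANNEL_URL, timeout=10)
        elapsed = time.monotonic() - start
        ok = resp.status_code == 200 and elapsed <= LATENCY_SLO_SECONDS
        return {"ok": ok, "status": resp.status_code, "latency": elapsed}
    except requests.RequestException as exc:
        return {"ok": False, "status": None, "error": str(exc)}

while True:
    result = probe()
    if not result["ok"]:
        print("ALERT: channel probe failed ->", result)  # feed this to ECC alerting
    time.sleep(60)  # run on a periodic basis, e.g. every minute
```

The point of the sketch is detection independent of customers: the probe fails before (or at least as soon as) your customer base does.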

Fourth, you cannot be complacent about learning and improving. Whether from failed changes, incident pattern analysis, or industry trends and practices, you and your team should always be seeking to identify improvements. High performance, or here high quality, is never reached in one step, but instead in a series of many steps and adjustments. And given that our IT systems themselves are dynamic and changing over time, we must be alert to new trends and new issues, and adjust.

Often, where we execute strong root cause analysis and follow-up, we end up focused only at the individual issue or incident level. This is all well and good for correcting the one issue, but if we miss broader patterns we can substantially undershoot optimal performance. As IT leaders, we must always consider both the trees and the forest. It is important to not just focus on fixing the individual incident and getting to root cause for that one incident, but to also look for the overall trends and patterns in your issues. Do they cluster with one application or infrastructure component? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Do you have some teams that create far more defects than the norm? Are your designs too complex? Are you using the products in a mainstream or unique manner – especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g. poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, inadequate staff skills). Discuss these issues and analysis with your management team and engineering leads. Tackle fixing them as a team, with your quality goals prioritizing the efforts. By correcting things both individually and systemically you can achieve far greater progress. Again, the transparency of the discussions will increase accountability and open up your teams so everyone can focus on the real goals as opposed to hiding problems.
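
A minimal sketch of that forest-level view, assuming your incident records are tagged with a few dimensions (the fields and values here are hypothetical):

```python
from collections import Counter

# Hypothetical incident records from your incident management tool.
incidents = [
    {"component": "middleware", "supplier": "vendor-a", "cause": "inadequate testing"},
    {"component": "middleware", "supplier": "vendor-b", "cause": "failed change"},
    {"component": "network",    "supplier": "vendor-a", "cause": "obsolete hardware"},
    # ... a quarter's worth in practice
]

# Cluster across each dimension to surface systemic issues, not just single incidents.
for dimension in ("component", "supplier", "cause"):
    clusters = Counter(rec[dimension] for rec in incidents)
    print(f"Top {dimension} clusters: {clusters.most_common(3)}")
```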

These four extensions to your initial efforts will set your team on a course to achieve top quartile availability. Of course, you must couple these efforts with diligent engagement by senior management, adequate investment, and disciplined execution. Unfortunately, even with all the right measures, providing robust availability for your customers is rarely a straight-line improvement. It is a complex endeavor that requires persistence and adjustment along the way. But by implementing these steps, you can enable sustainable and substantial progress and achieve top quartile performance to provide business advantage in today’s 7×24 digital world.

If your shop is struggling with high availability or major outages, look to apply these practices (or send your CIO the link to this page 🙂 ).

Best, Jim Ditmore

Infrastructure Engineering – Leveraging a Technology Plan

Our recent post discussed using the Infrastructure Engineering Lifecycle (IELC) to enable organizations to build a modern, efficient and robust technology infrastructure. One of the key artifacts that both leverages an IELC approach and helps an infrastructure team properly plan and navigate the cycles is the Technology Plan. Normally, the technology plan is constructed for each major infrastructure ‘component’ (e.g. network, servers, client environment, etc). A well-constructed technology plan creates both the pull – outlining how the platform will meet the key business requirements and technology objectives – and the push – reinforcing proper upkeep and use of the IELC practice.

Digitalization continues to sweep almost every industry, and the ability of firms to deliver digital interfaces and services requires a robust, modern and efficient infrastructure. To deliver an optimal technology infrastructure, one must utilize an ‘evergreen’ approach and maintain an appropriate technology pace matching the industry. Similar to a dolphin riding the bow wave of a ship, a company can optimize both the features and capabilities of its infrastructure and minimize its cost and risk by staying consistently just off the leading pace of the industry. Often companies make the mistake of either surging ahead and expending large resources to get fully leading technology, or eking out and extending the life of technology assets to avoid investment and resource requirements. Neither strategy actually saves money ‘through the cycle’ and both strategies add significant risk for little additional benefit.

Companies that choose to minimize their infrastructure investments and reduce costs by overextending asset lives typically incur greater costs through higher maintenance, greater fix resources required, and lower system performance (and staff productivity). Obviously, extending your desktop PC refresh cycle from 2 years to 4 years is workable and reasonable, but extend the cycle much beyond this and you quickly run into:

  • Integration issues – both internal and external compatibility suffer as your clients and partners have newer versions of office tools that are incompatible with yours
  • Higher maintenance costs – much hardware carries no maintenance cost for the first 2 or 3 years and increasing costs in subsequent years
  • Greater environmental costs – power and cooling savings from newer generation equipment are not realized
  • Longer security patch cycles for older software (though with some benefit, as it is also more stable)
  • Greater complexity and resulting cost within your environment – you must integrate 3 or 4 generations of equipment and software versus 2 or 3
  • Longer incident times – the usual first vendor response to an issue is ‘you need to upgrade to the latest version of the software before we can really fix this defect’

And if you press the envelope further and extend infrastructure life to the end of the vendor’s life cycle or beyond, expect significantly higher failure rates, unsupported or expensively supported software, and much higher repair costs. In my experience, across the multiple times we modernized an overextended infrastructure, we were able to reduce total costs by 20 or 30%, and this included the costs of the modernization. In essence, you can run 4 servers from 3 generations ago on 1 current server, and having modern PCs and laptops means far fewer service issues, fewer service desk calls, far less breakage (people take care of newer stuff) and more productive staff.

Companies that have surged to the leading edge on infrastructure are typically paying a premium for nominal benefit. For the privilege of being first, frontrunners encounter an array of issues including:

  • Experiencing more defects – trying out the latest server, cloud product or engineered appliance means you will find far more defects.
  • Paying a premium – being first with new technology typically means you will pay a premium because it is well before the volumes and competition kick in to drive better pricing.
  • Integration issues – having the latest software version often means third party utilities or extensions have not yet released versions that work properly with it.
  • Higher security exposure – not all the backdoors and gaps have been uncovered yet, as there are not enough users. Thus, hackers have a greater opportunity to find ‘zero day’ flaws and exploit them to attack you.

Typically, the groups that I have inherited that were on the leading edge were there because they either had an excess of resources or were solely focused on technology products (and not business needs). There was inadequate dialogue with the business to ensure the focus was on business priorities versus technology priorities. Thus, the company was often expending 10 to 30% more for little tangible business benefit other than being able to state they were ‘leading edge’. In today’s software world, seldom does the latest infrastructure provide compelling business benefit over and above that of a well-run modern utility infrastructure. Nearly all of the time, the business benefit is derived from compelling services and features enabled by the application software running on the utility. Thus, the shops that are tinkering with leading edge hardware or are always first on the latest version are typically shops pursuing hobbies disconnected from the business imperatives. Only where organizations are operating at massive scale or actually providing infrastructure services as a business does leading edge positioning make business sense.

So, given our objective is to be in the sweet spot riding the industry bow wave, then a good practice to ensure proper consistent pace and connection to the business is a technology plan for each of the major infrastructure components that incorporates the infrastructure engineering lifecycle. A technology plan includes the infrastructure vision and strategy for a component area, defines key services provided in business terms, and maps out an appropriate trajectory and performance for a 2 or 3 year cycle. The technology plan then becomes the roadmap for that particular component and enables management to both plan and track performance against key metrics as well as ensuring evolution of the component with the industry and business needs.

The key components of the technology plan are:

  1. Mission, Vision for that component area
  2. Key requirements/strategy
  3. Services (described in business terms)
  4. Key metrics (definition, explanation)
  5. Current starting point – explanation (SWOT) – as needed by service
  6. Current starting point – Configuration – as needed by service
  7. Target – explanation (of approach) and configuration — also defined by service
  8. Metrics trajectory and target (2 to 3 years)
  9. Gantt chart showing key initiatives, platform refresh or releases, milestones (can be by service)
  10. Configuration snapshots at 6 months (for 2 to 3 years, can be by service)
  11. Investment and resource description
  12. Summary
  13. Appendices
    1. Platform Schedule (2-3 years, as projected)
    2. Platform release schedule (next 1-2 years, as projected)
    3. Patch cycle (next 6-12 months, as planned)

The mission and vision should be derived and cascaded from the overall technology vision and corporate strategy. It should emphasize key tenets of the corporate vision and their implications for the component area. For example, if the corporate strategy is to be ‘easy to do business with’, then the network and server components must support a highly reliable, secure and accessible internet interface. Such reliability and security aspirations then have direct implications on component requirements, objectives and plans.

The services portion of the plan should translate the overall component into the key services provided to the business. For example, network would be translated into data services, general voice services, call center services, internet and data connection services, local branch and office connectivity, wireless and mobility connectivity, and core network and data center connectivity. The service area should be described in business terms with key requirements specified. Further, each service area should then be able to describe the key metrics to be used to gauge its performance and effectiveness. The metrics could be quality, cost, performance, usability, productivity or other metrics.

For each service area of a component, the plan is then constructed. If we take the call center service as the example, the current technology configuration and specific services available would define the current starting point. A SWOT analysis should accompany the current configuration, explaining both strengths and where the service falls short of business needs. Then the target is constructed, where both the overall architecture and approach are described and the target configuration (at a high to medium level of definition) is provided (e.g. where the technology configuration for that area will be in 2 or 3 years).

Then, given the target, the key metrics are mapped from their current to their future levels and a trajectory established that will be the goals for the service over time. This is subsequently filled out with a more detailed plan (Gantt chart) that shows the key initiatives and changes that must be implemented to achieve the target. Snapshots, typically at 6 month intervals, of the service configuration are added to demonstrate detailed understanding of how the transformation is accomplished and enable effective planning and migration. Then the investment and resource needs and adjustments are described to accompany the technology plans.
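
As a simple illustration of the trajectory and snapshots, here is a minimal sketch assuming a straight-line path between current and target levels (the metric and values are hypothetical; real trajectories are often stepped around major milestones):

```python
def trajectory(current: float, target: float, months: int = 24, step: int = 6):
    """Linear goal trajectory from current to target, snapshotted every `step` months."""
    return [
        (m, round(current + (target - current) * m / months, 2))
        for m in range(0, months + 1, step)
    ]

# Hypothetical: drive call center cost per contact from $4.80 to $3.20 over 2 years.
for month, goal in trajectory(4.80, 3.20):
    print(f"month {month:2d}: ${goal:.2f} per contact")
```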

If well done, the technology plan provides an effective roadmap for the entire technology component team: they understand how what they do delivers for the business, where they need to be, and how they will get there. It can be an enormous assist for productivity and practicality.

I will post some good examples of technology plans in the coming months.

Have you leveraged plans like this previously? If so, did they help? I would love to hear from you.

All the best, Jim Ditmore


The Infrastructure Engineering Lifecycle – How to Build and Sustain a Top Quartile Infrastructure

There are over 5,000 books on application development methods on Amazon. There are dozens of industry and government standards that map out methodologies for application development. And for IT operations and IT production processes like problem and change management, IT Service Management and ITIL standards provide excellent guidance and structure. Yet for the infrastructure systems on which the applications fully rely, there is scarcely a publication that outlines the approaches organizations should use to build and sustain a robust infrastructure. ITIL ventures slightly into this area but really just re-defines a waterfall application project cycle in infrastructure terms. During many years of building, re-building, and sustaining top quartile infrastructure, I have developed a lifecycle methodology for infrastructure, the ‘Infrastructure Engineering Life Cycle’ (IELC).

The importance of infrastructure should not be overlooked in our digital age. Not only have customer expectations increased for services where they expect ‘always on’ web sites and transaction capabilities, but they also require quick response and seamless integration across offerings. Certainly the software is critical to provide the functionality, but none of these services can be reliably and securely provided without a well-built infrastructure underpinning all of the applications. A top quartile infrastructure delivers outstanding reliability (on the order of 99.9% or better availability), zippy performance, excellent unit costs, all with robust security and resiliency.

Often enterprises make the mistake of addressing infrastructure only when things break, and they only fix or invest enough to get things back running instead of correctly rebuilding a modern plant. It is unfortunate, because not only will they likely experience further outages and service impacts, but their full infrastructure costs are also likely to be higher for a dated, dysfunctional plant than for an updated, modern one. Unlike most assets, I have found that a modern, well-designed IT infrastructure is cheaper to run than a poorly maintained plant that has various obsolete or poorly configured elements. Remember that every new generation of equipment can basically do twice as much as the previous, so you have fewer components, less maintenance, less administration, and fewer things that can go wrong. In addition, a modern plant boosts time to market for application projects and significantly reduces the portion of time spent on fixing things by both infrastructure and application engineers.

So, given the critical nature of well-run technology infrastructure in the world of digitalization, how do enterprises and CIOs build and maintain a modern plant with outstanding fit and finish? It is not just about buying lots of new equipment, or counting on a single vendor or cloud provider to take care of all the integration or services. Nearly all major enterprises have a legacy of systems that results in complexity and complicates the ability to deliver reliable services or keep pace with new capabilities. These complexities can rarely be handled by a systems integrator or single service provider. Further, a complete re-build of the infrastructure often requires major capital investment and can put availability even further at risk. The best course usually is not to go ‘all-in’, where you launch a complete re-build or hand over the keys to a sole outsourcer, but instead to take a ‘spiral optimization’ approach which addresses fundamentals and burning issues first, and then uses the newly acquired capabilities to advance and address more complex or less pressing remaining issues.

This repeated, closed cycle approach (‘spiral optimization’) is our management approach, coupled with an Infrastructure Engineering Lifecycle (IELC) methodology to build top quartile infrastructure. For the first cycle of the infrastructure rebuild, it is important to address the biggest issues. Front and center, the entire infrastructure team must focus on quality. Poorly designed or built infrastructure becomes a black hole of engineering time, as rework demands grow with each failure or application built upon a teetering platform. And while it must be understood that everything cannot be fixed at once, those things that are undertaken must be done with quality. This includes documenting the systems and getting them correctly into the asset management database. And it includes coming up with a standard design or service offering if none exists. Having 5000 servers must be viewed as a large expense requiring great care and feeding — and the only thing worse is having 5000 custom servers because your IT team did not take the time to define the standard, keep it up to date, and maintain and patch it consistently. 5000 custom servers are a massive expense that likely cannot be effectively and efficiently maintained or secured by any team. There is no cheaper time than the present to begin standardizing and fixing the mess, by requiring that the next server built or significantly updated becomes the new standard. Don’t start this effort, though, until you have the engineering capacity to do it. A standard design done by lousy engineers is not worth the investment. So, as an IT leader, while you are insisting on quality, ensure you have adequate talent to engineer your new standards. If you do not have it on board, leverage top practitioners in the industry to help your team create the new designs.

In addition to quality and starting to do things right, there are several fundamental practices that must be implemented. Your infrastructure engineering work should be guided by the infrastructure engineering lifecycle – which is a methodology and set of practices that ensure high quality platforms that are effective, efficient, and sustainable.

The IELC covers all phases of infrastructure platforms – from emerging platforms to standards to declining and obsolete platforms. Importantly, the IELC comprises three cycles of activity, recognizing that infrastructure requires constant grooming and patching (where inputs typically come from external parties) and, all the while, technology advances regularly occur such that over 3 to 10 years nearly every hardware platform becomes obsolete and must be replaced. The three cycles of activity are:

  • Platform – This is the foundational lifecycle activity where hardware and utility software are defined, designed and integrated into a platform to perform a particular service. Generally, for medium and large companies, this is a 3 to 5 year lifecycle. A few examples could be a server platform, storage platform or an email platform.
  • Release – Once a platform is initially designed and implemented, organizations should expect to refresh the platform on a regular basis to incorporate major underlying product or technology enhancements, address significant design flaws or gaps, and improve operational performance and reliability. Releases should be planned at 3 to 12 month intervals over the life of the platform (which is usually 3 to 5 years).
  • Patch – A patch cycle should also be employed, where minor upgrades (both fixes and enhancements) are applied on a regular, routine basis. The patch cycle should synchronize with both the underlying patch cycle of the OEM (Original Equipment Manufacturer) for the product and the security and production requirements of the organization. Usually, patch cycles are used to incorporate security fixes and significant production defect fixes issued by the OEM. Typical patch cycles run from weekly to every 6 months.
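
To illustrate how the three cycles nest, here is a minimal sketch of a cadence model that flags what is due on a platform; the intervals and dates are hypothetical and simply reflect the ranges above:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Platform:
    name: str
    introduced: date
    life_years: int = 4                # platform cycle: 3 to 5 years
    release_interval_days: int = 180   # release cycle: 3 to 12 months
    patch_interval_days: int = 30      # patch cycle: weekly to every 6 months
    last_release: date = None
    last_patch: date = None

    def status(self, today: date) -> list:
        """Return the lifecycle actions currently due for this platform."""
        due = []
        if today >= self.introduced + timedelta(days=365 * self.life_years):
            due.append("platform replacement due (end of lifecycle)")
        if self.last_release and today >= self.last_release + timedelta(days=self.release_interval_days):
            due.append("release refresh due")
        if self.last_patch and today >= self.last_patch + timedelta(days=self.patch_interval_days):
            due.append("patch cycle due")
        return due

# Hypothetical email platform nearing the end of its platform cycle.
email = Platform("email platform", date(2012, 3, 1),
                 last_release=date(2016, 1, 15), last_patch=date(2016, 8, 1))
print(email.status(date(2016, 9, 30)))
```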

Below is a diagram that represents the three infrastructure engineering life cycles and the general parameters of the cycles.

Infrastructure Engineering Cycles

In subsequent posts, I will further detail key steps and practices within the cycles as well as provide templates that I have found to be effective for infrastructure teams.  As a preview, here is the diagram of the cycles with their activities and attributes.

IELC Preview

What key practices or techniques have you used with your infrastructure teams to enable them to achieve success? I look forward to your thoughts and comments.

Best, Jim Ditmore


Moving from Offshoring to Global Shared Service Centers

My apologies for the delay in my post. It has been a busy few months and it has taken an extended time since there is quite a bit I wish to cover in the global shared service center model. Since my NCAA bracket has completely tanked, I am out of excuses to not complete the writing, so here is the first post with at least one to follow. 

Since the mid-90s, companies have used offshoring to achieve cost and capacity advantages in IT. Offshoring was a favored option to address Y2K issues and has continued to expand at a steady rate throughout the past twenty years. But many companies still approach offshoring as ‘out-tasking’ and fail to leverage the many advantages of a truly global and high performance workforce.

With out-tasking, companies take a limited set of functions or ‘tasks’ and move these to the offshore team. They often achieve initial economic advantage through labor arbitrage, and perhaps some improvement in quality as the tasks are documented and standardized to make it easier to transition the work to the new location. This constitutes the first level of a global team: offshore service provider. But larger benefits are often lost, typically including:

  • further ongoing process improvement,
  • better time to market,
  • wider service times or ‘follow the sun’,
  • and leverage of critical innovation or leadership capabilities of the offshore team.

In fact, the work often stagnates at whatever state it was in when it was transitioned, with little impetus for further improvement. And because lower level tasks are often the work that is shifted offshore while higher level design work remains in the home country, key decisions on design or direction can take an extended period – actually lengthening time to market. In fact, design or direction decisions often become arbitrary or disconnected because the groups – one in the home office, the other in the offshore location – retain significant divides (time of day, perspective, knowledge of the work, understanding of the corporate strategy, etc). At its extreme, the home office becomes the ivory tower and the offshore teams become serf task executors and administrators. Ownership, engagement, initiative and improvement energies are usually lost in these arrangements. And it can be further exacerbated by having contractors at the offshore location, who have a commercial interest in maintaining the status quo (and thus revenue) and who are viewed with less regard by the home country staff. Any changes required are used to increase contractor revenues and margins. These shortcomings erase many of the economic advantages of offshoring over time and further impact the competitiveness of the company in areas such as agility, quality, and leadership development.

A far better way to approach your workforce is to leverage a ‘global footprint and a global team’. This approach is absolutely key for competitive advantage, and essential for competitive parity if you are an international company. There are multiple elements of the ‘global footprint and team’ approach that, when effectively orchestrated by IT leadership, can achieve far better results than any other structure. By leveraging a high performance global approach, you can move from an offshore service provider to a shared service excellence center and, ultimately, to a global service leadership center.

The key elements of a global team approach can be grouped into two areas: high performance global footprint and high performance team. The global footprint elements are:

  • well-selected strategic sites, each with adequate critical mass, strong labor pools and higher education sources
  • proper positioning to meet time-of-day and improved skill and cost mix
  • knowledge and leverage of distinct regional advantages to obtain better customer interface, diverse inputs and designs, or unique skills
  • proper consolidation and segmentation of functions across sites to achieve optimum cost and capability mixes

Global team elements include:

  • consistent global goals and vision across global sites with commensurate rewards and recognition by site
  • a team structure that enables both integrated processes and local and global controls
  • the opportunity for growth globally from a junior position to a senior leader
  • close partnership with local universities and key suppliers at each strategic location
  • opportunity for leadership at all locations

Let’s tackle global footprint today and in a follow on post I will cover global team. First and foremost is selecting the right sites for your company. Your current staff total size and locations will obviously factor heavily into your ultimate site mix. Assess your current sites using the following criteria:

  • Do they have critical mass (typically at least 300 engineers or operations personnel, preferably 500+) that will make the site efficient, productive and enable staff growth?
  • Is the site located where IT talent can be easily sourced? Are there good universities nearby to partner with? Are there business units co-located or customers nearby?
  • Is the site in a low, medium, or high cost location?
  • What is the shift (time zone) of the location?

Once you have classified your current sites with these criteria, you can then assess the gaps. Do you have sites in low-cost locations with strong engineering talent (e.g. India, Eastern Europe)? Do you have medium cost locations (e.g., Ireland or 2nd tier cities in the US midwest)? Do you have too many small sites (e.g., under 100 personnel)? Do you have sites close to key business units or customers? Are no sites located in 3rd shift zones? Remember that your sites are more about the cities they are located in than the countries. A second tier city in India or a first or second tier city in Eastern Europe can often be your best site location because of improved talent acquisition and lower attrition than 1st tier locations in your country or in India.

It is often best to locate your service center where there are strong engineering and business universities nearby that will provide an influx of entry level staff eager to learn and develop. Given staff will be the primary cost factor in your service, ensure you locate in lower cost areas that have good language skills, access to the engineering universities, and appropriate time zones. For example, if you are in Europe, you should look to have one or two consolidated sites located in or just outside 2nd tier cities with strong universities: do not locate in Paris or London, instead base your service desk in or just outside Manchester, Budapest or Vilnius. This will enable you to tap into a lower cost yet high quality labor market that is also likely to provide more part-time workers who will help you cover peak call periods. You can use a similar approach in the US or Asia.

A highly competitive site structure enables you to meet a global optimal cost and capability mix as well. At the most mature global teams in very large companies, we drove for a 20/40/40 cost mix (20% high cost, 40% medium and 40% low cost) where each site is in a strong engineering location. Where possible, we also co-located with key business units. Drive to the optimal mix by selecting 3, 4, or 5 strategic sites that meet the mix target and that will also give you the greatest spread of shift coverage.  Once you have located your sites correctly, you must then of course drive to have effective recruiting, training, and management of the site to achieve outstanding service. Remember also that you must properly consolidate functions to these strategic sites.  Your key functions must be consolidated to 2 or 3 of the sites – you cannot run a successful function where there are multiple small units scattered around your corporate footprint. You will be unable to invest in the needed technology and provide an adequate career path to attract the right staff if it is highly dispersed.

You can easily construct a matrix and assess your current sites against these criteria. Remember, these sites are likely among the most important investments your company will make. If you have a poor portfolio of sites, with inadequate labor resources, ineffective talent pipelines, or other issues, it will impact your company’s ability to attract and retain its most important asset and achieve competitive success. It may take substantial investment and an extended period of time, but achieving an optimal global site structure and global team will provide lasting competitive advantage.
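
To make the matrix concrete, here is a minimal sketch that scores hypothetical sites against the criteria above and checks the headcount mix against a 20/40/40 target; all names, weights and scores are illustrative:

```python
# Hypothetical sites: (name, headcount, cost_tier, talent_score 1-5, shift_coverage 1-5)
sites = [
    ("Bangalore",  650, "low",    4, 4),
    ("Vilnius",    400, "medium", 4, 3),
    ("Chicago",    300, "high",   3, 2),
    ("Smallville",  80, "high",   2, 2),  # below critical mass (~300+)
]

# Score each site: talent and shift coverage, plus a bonus for critical mass.
for name, heads, tier, talent, shift in sites:
    critical_mass = heads >= 300
    score = talent + shift + (2 if critical_mass else 0)
    print(f"{name}: score {score}/12, critical mass: {critical_mass}, cost tier: {tier}")

# Cost mix by headcount versus the 20/40/40 (high/medium/low) target.
total = sum(h for _, h, *_ in sites)
for tier in ("high", "medium", "low"):
    pct = 100 * sum(h for _, h, t, *_ in sites if t == tier) / total
    print(f"{tier} cost locations: {pct:.0f}% of staff")
```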

I will cover the global team aspects in my next post, along with the key factors in moving from an offshore service provider to shared service excellence to shared service leadership.

It would be great to hear your perspectives and any feedback on how you or your company have been either successful (or unsuccessful) at achieving a global team.

Best, Jim Ditmore

Looking to Improve IT Production? How to Start

Production issues, as Microsoft and Google can tell you, impact even cloud email apps. A few weeks ago, Microsoft took an entire weekend to fully recover its cloud Outlook service. Perhaps you noted the problems earlier this year in financial services, where Bank of America experienced internet site availability issues. Unfortunately for Bank of America, that was their second outage in 6 months, though they are not alone in having problems, as Chase suffered a similar production outage on their internet services the following week. And these are regular production issues, not the unavailability of websites and services due to a series of DDoS attacks.

Perhaps 10, or certainly 15, years ago, such outages with production systems would have drawn far less notice from customers, as the front office personnel would have worked alternate systems and manual procedures until the systems were restored. But with customers accessing the heart of most companies’ systems now through internet and mobile applications, typically on a 7×24 basis, it is very difficult to avoid direct and widespread impact to customers in the event of a system failure. Your production performance becomes very evident to your customers. And your customers’ expectations have continued to increase, such that they expect your company and your services to be available pretty much whenever they want to use them. And while being available is not the only attribute that customers value (usability, features, service and pricing factor in importantly as well), companies that consistently meet or exceed consumer availability expectations gain a key edge in the market.

So how do you deliver to current and future rising expectations around availability of your online and mobile services? And if BofA and Chase, which are large organizations that offer dozens of services online and have massive IT departments, have issues delivering consistently high availability, how can smaller organizations deliver compelling reliability?

And often, the demand for high availability must be achieved in an environment where ongoing efficiencies have eroded the production base and a tight IT labor market has further complicated obtaining adequate expertise. If your organization is struggling with availability or you are looking to achieve top quartile performance and competitive service advantage, here’s where to start:

First, understand that availability, at its root, is a quality issue. And quality issues can only be changed if you address all aspects. You must set quality and availability as a priority, as a critical and primary goal for the organization. And you will need to ensure that incentives and rewards are aligned to your team’s availability goal.

Second, you will need to address the IT change processes. You should look to implement an ITSM change process based on ITIL. But don’t wait for a fully defined process to be implemented. You can start by limiting changes to appropriate windows. Establish release dates for major systems and accompanying subsystems. Avoid changes during key business hours or just before the start of the day. I still remember the ‘night programmers’ at Ameritrade at the beginning of our transformation there. Staying late one night as CIO in my first month, I noticed two guys come in at 10:30 PM. When I asked what they did, they said ‘We are the night programmers. When something breaks with the nightly batch run, we go in and fix it.’ All done with no change records, minimal testing and minimal documentation. Of course, my hair stood on end hearing this. We quickly discontinued that practice and instead made changes as a team, after they were fully engineered and tested. I would note that combining this action with a number of other measures mentioned here enabled us to quickly reach a stable platform that had the best track record for availability of all the online brokerages.

Importantly, you should ensure that adequate change review and documentation is being done by your teams for their changes. Ensure they take accountability for their work and their quality. Drive to an improved change process with templates for reviews, proper documentation, back out plans, and validation. Most failed changes are due to issues with the basics: a lack of adequate review and planning, poor documentation of deployment steps, missing or ineffective validation, or one person doing an implementation in the middle of the night when there should be at least two people doing it together (one to do, and one to check).

Also, you should measure the proportion of incidents due to change. If you experience mediocre or poor availability and failed changes contribute to more than 30% of the incidents, you should recognize change quality is a major contributor to your issues. You will need to zero in on the areas with chronic change issues. Measure the change success rate (percentage of changes executed successfully without production incident) of your teams. Publish the results by team (this will help drive more rapid improvement). Often, you can quickly find which of your teams has inadequate quality because their change success rate ranges from a very poor mid-80s percentage to a mediocre mid-90s percentage. Good shops deliver above 98% and a first quartile shop consistently has a change success rate of 99% or better.

Third, ensure all customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). And the ECC also has capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires an investment into an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues requires consolidating these capabilities and extending their reach.  If you have an ECC in place, ensure that all customer impacting issues are fully reported and handled. Underreporting of issues that impact a segment of your customer base, or the siphoning off of a problem to be handled by a local team, is akin to trying to handle a house fire with a garden hose and not calling the fire department. Call the fire department first, and then get the garden hose out while the fire trucks are on their way.

Fourth, you must execute strong root cause analysis and follow-up. These efforts must be at the individual issue or incident level as well as at a summary or higher level. It is important to not just get focused on fixing the individual incident and getting to root cause for that one incident, but to also look for the overall trends and patterns in your issues. Do they cluster with one application or infrastructure component? Are they caused primarily by change? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Are your designs too complex? Are you using the products in a mainstream or unique manner – especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g. poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, incompetent staff). Track both the fixes for individual issues as well as the efforts to address systemic issues. The systemic efforts will begin to yield improvements that eliminate future issues.

These four efforts will set you on a solid course to improved availability. If you couple these efforts with diligent engagement by senior management and disciplined execution, the improvements will come slowly at first, but then will yield substantial gains that can be sustained.

You can achieve further momentum with work in several areas:

  • Document configurations for all key systems. If you are doing discovery during incidents, it is a clear indicator that your documentation and knowledge base are highly inadequate.
  • Review how incidents are reported. Are they user reported or did your monitoring identify the issue first? At least 70% of the issues should be identified first by you, and eventually you will want to drive this to a 90% level. If you are lower, then you need to look to invest in improving your monitoring and diagnostic capabilities.
  • Do you report availability in technical measures or business measures? If you report via time-based systems availability measures or number of incidents by severity, these are technical measures. You should look to implement business-oriented measures, such as customer impact availability, to drive greater transparency and more accurate metrics.
  • In addition to eliminating issues, reduce your customer impacts by reducing the time to restore service (Microsoft can certainly stand to consider this area given their latest outage was three days!). For mean time to restore (MTTR – note this is not mean time to repair but mean time to restore service), there are three components: time to detect (MTTD), time to diagnose or correlate (MTTC), and time to fix (to restore service, or MTTF); a sketch of this decomposition follows this list. An IT shop that is effective at resolution normally will see MTTR at 2 hours or less for its priority issues, where the three components each take about 1/3 of the time. If your MTTD is high, again look to invest in better monitoring. If your MTTC is high, look to improve correlation tools, systems documentation or engineering knowledge. And if your MTTF is high, again look to improve documentation or engineering knowledge, or automate recovery procedures.
  • Consider investing in greater resiliency for key systems. It may be that customer expectations of availability exceed current architecture capabilities. Thus, you may want to invest in greater resiliency and redundancy or build a more highly available platform.
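
Here is the promised sketch of the MTTR decomposition, assuming your incident records carry the four key timestamps (the field names and times are hypothetical):

```python
from datetime import datetime

# Hypothetical priority-incident record with the four key timestamps.
incident = {
    "impact_start": datetime(2013, 2, 11, 9, 0),   # defect begins impacting customers
    "detected":     datetime(2013, 2, 11, 9, 40),  # monitoring (or a user) reports it
    "diagnosed":    datetime(2013, 2, 11, 10, 20), # fault isolated / correlated
    "restored":     datetime(2013, 2, 11, 11, 0),  # service restored to customers
}

mttd = incident["detected"] - incident["impact_start"]   # time to detect
mttc = incident["diagnosed"] - incident["detected"]      # time to diagnose/correlate
mttf = incident["restored"] - incident["diagnosed"]      # time to fix/restore
mttr = incident["restored"] - incident["impact_start"]   # total time to restore

print(f"MTTD {mttd}, MTTC {mttc}, MTTF {mttf}, MTTR {mttr}")
# An effective shop keeps MTTR near 2 hours or less, roughly a third in each component.
```

Whichever of the three components dominates across your priority incidents tells you where to invest first.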

As you can see, providing robust availability for your customers is a complex endeavor. By implementing these steps, you can enable sustainable and substantial progress to top quartile performance and achieve business advantage in today’s 7×24 world.

What would you add to these steps? What were the key factors in your shop’s journey to high availability?

Best, Jim Ditmore

A Cloudy Future: The Rise of Appliances and SaaS

As I mentioned in my previous post, I will be exploring infrastructure trends, and in particular, cloud computing. But while cloud computing is getting most of the marketing press, there are two additional phenomena that are capturing as much if not more of the market: computer appliances and SaaS. So, before we dive deep into cloud, let’s explore these other two trends and then set the stage for a comprehensive cloud discussion that will yield effective strategies for IT leaders.

Computer appliances have been available for decades, typically in the network, security, database and specialized compute spaces. Firewalls and other security devices have long leveraged an appliance approach, where generic technology (CPU, storage, OS) is closely integrated with additional special purpose software and sold and serviced as a packaged solution. Specialized database appliances for data warehousing were quite successful starting in the early 1990s (remember Teradata?).

The tighter integration of appliances gives significant advantage over traditional approaches with generic systems. First, the integrator of the package is often also the supplier of the software, and thus can achieve improved tuning of the software’s performance and capacity with a specific OS and hardware set. Further, this integrated stack requires much less install and implementation effort by the customer. The end result can be impressive performance for similar cost to a traditional generic stack, without the implementation effort or difficulties. Thus appliances can have a compelling performance and business case for the typical medium and large enterprise. And they are compelling for the technology supplier as well, because they command higher prices and much higher margins than the individual components.

It is important to recognize that appliances are part of a normal tug and pull between generic and specialized solutions. In essence, throughout the past 40 years of computing, there has been constant improvement in generic technologies under the march of Moore’s law. And with each advance there are two paths to take: leverage generic technologies and keep your stack loosely coupled so you can continue to leverage the advance of generic components, or closely integrate your stack with the most current components and drive much better performance from this integration.

By their very nature though, appliances become rooted in a particular generation of technology. The initial iteration can be done with the latest version of technology, but the integration will likely result in tight links to the OS, hardware and other underlying layers to wring out every performance improvement available. These tight links yield both the performance improvement and the chains to a particular generation of technology. Once an appliance is developed and marketed successfully, ongoing evolutionary improvements will continue to be made, layering in further links to the original base technology. And the margins themselves are addictive, with the suppliers doing everything possible to maintain them (thus evolutionary, low cost advances will occur, but revolutionary, next generation advances will likely require too high an investment to maintain the margins). This then spells the eventual fading and demise of that appliance, as the generic technologies continue their relentless advance and typically surpass the appliance in 2 or 3 generations. This is represented in the chart below and can be seen in the evolution of data warehousing.

[Chart: The Leapfrog of Appliances and Generics]

The first instances of data warehousing were done using the primary generic platform of the time (the mainframe) and mainstream databases. But with the rise of another generic technology, proprietary chipsets out of the midrange and high-end workstation sector, Teradata and others combined these chipsets with specialized hardware and database software to develop much more powerful data warehouse appliances. From the late 1980s through the 1990s, the Teradata appliance maintained a significant performance and value edge over generic alternatives. But that edge began to fray around 2000, as mainstream databases and server chipsets, combined with low-cost operating systems and storage, could match the performance of Teradata at much lower cost. In this instance, the Teradata appliance held a significant performance advantage for about 10 years, longer than usual, before falling back to or below mainstream generic performance; the value advantage diminished much sooner, of course. Typically, an appliance’s performance advantage lasts 4 to 6 years at most. Thus, early in the cycle (the first 3 to 4 generic generations, or 4 to 5 years), an appliance offering will present material performance and possibly cost advantages over traditional, generic solutions.
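
To make the leapfrog concrete, here is a minimal model of the dynamic. All the numbers are illustrative assumptions (a two-year doubling for generics, a 3x launch advantage and roughly 10% yearly evolutionary gains for the appliance), not vendor benchmarks:

```python
# A minimal leapfrog model. Assumptions (illustrative, not benchmarks):
# generic performance doubles every 2 years per Moore's law; the appliance
# launches with a 3x head start and then gains only ~10% per year.

GENERIC_DOUBLING_YEARS = 2.0
APPLIANCE_HEAD_START = 3.0
APPLIANCE_ANNUAL_GAIN = 1.10

def generic_perf(years: float) -> float:
    return 2 ** (years / GENERIC_DOUBLING_YEARS)

def appliance_perf(years: float) -> float:
    return APPLIANCE_HEAD_START * APPLIANCE_ANNUAL_GAIN ** years

year = 0.0
while appliance_perf(year) > generic_perf(year):
    year += 0.25
print(f"Generics overtake the appliance after ~{year:.1f} years")
```

With these assumed inputs, the crossover lands around year four and a half, in line with the typical 4 to 6 year advantage window.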

As a technology leader, I recommend the following considerations when looking at appliances:

  • If you have real business needs that will drive significant benefit from such performance, then investigate the appliance solution.
  • Keep in mind that in the mid-term the appliance solution will steadily lose its advantage and eventually cost more than the generic solution. Understand where the appliance solution is in its evolution – this will determine its effective life and the likely length of your advantage over generic systems.
  • Factor in the hurdle, or ‘switchback’, costs at the end of the appliance’s life (it will likely require a hefty investment to transition back to generic solutions that have steadily marched forward). A simple cost sketch follows this list.
  • The switchback costs will be much higher where business logic is layered in (e.g., middleware, database or business software appliances) than for network or security appliances, where there is minimal custom business logic.
  • Include the level of integration effort and cost required. Often a few appliances within a generic infrastructure will integrate smoothly and at low cost. On the other hand, weaving multiple appliances into a service stack can cause much higher integration costs and fail to yield the desired results. Remember that you have limited flexibility with an appliance due to its integrated nature, and this can cause issues when appliances are strung together (e.g., a security appliance with a load-balancing appliance with a middleware appliance with a business application appliance and a data warehouse appliance!).
  • Note that in certain areas, security and network in particular, the follow-on to an appliance will often be a next-generation appliance from the same or a different vendor. This is because there is minimal business logic incorporated in the system (yes, there are plenty of parameter settings like firewall rules customized for a business, but the firewall operates essentially the same regardless of the business that uses it).
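
As a rough way to weigh the mid-term cost considerations above, here is a minimal cost sketch. Every figure is an assumption for illustration (in $K), not vendor pricing; substitute your own estimates:

```python
# A rough cost model (all figures assumed $K, for illustration only):
# the appliance wins early, loses its edge mid-life, and incurs a one-time
# 'switchback' cost when you migrate back to generic technology.

YEARS = 8
appliance_cost = [400, 250, 250, 250, 300, 300, 350, 350]  # assumed run cost/yr
generic_cost   = [450, 420, 380, 340, 300, 260, 230, 200]  # assumed, declining with Moore's law
SWITCHBACK = 500       # assumed one-time migration cost back to generic
SWITCHBACK_YEAR = 6    # assumed end of the appliance's effective life

appliance_total = generic_total = 0
for year in range(1, YEARS + 1):
    appliance_total += appliance_cost[year - 1]
    generic_total += generic_cost[year - 1]
    if year == SWITCHBACK_YEAR:
        appliance_total += SWITCHBACK  # the hurdle cost at end of life
    print(f"year {year}: appliance ${appliance_total}K vs generic ${generic_total}K")
```

With these assumed numbers the appliance is cheaper through year 5, but the switchback cost in year 6 flips the comparison; that hurdle is exactly what you should plan for up front.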

With these guidelines, you should be able to make better decisions about when to use an appliance and how much of a premium you should pay.

In my next post, I will cover SaaS and I will then bring these views together with a perspective on cloud in a final post.

What changes or additions would you make when considering appliances? I look forward to your perspective.

Best, Jim Ditmore


In the Heat of Battle: Good Guidelines for Production

If you have been in IT for any stretch, you will have experienced a significant service outage and the blur of pages, conference calls, analysis and actions to recover. Usually such a service incident call occurs at 2 AM, and a fog settles in as a diverse, distributed team tries to sort through the problem and its impacts while seeking to restore service. Often, poor decisions are made or ineffective directions taken in this fog, which extend the outage. Further, as part of the confusion, there can be poor communications with your business partners or customers. Even for large companies with a dedicated IT operations team and command center, the wrong actions and decisions can be made in the heat of battle as work is being done to restore service. While you can chalk many of the errors up to either inherent engineering optimism or a loss of orientation after working a complex problem for many hours, to achieve outstanding service availability you must enable crisp, precise service restoration when an incident occurs. Such precision, and the avoidance of mistakes in ‘the heat of battle’, comes from a clear command line and operational approach. This best practice clarity includes defined incident roles and an operational approach communicated and ready well before such an event. Then everyone operates as a well-coordinated team to restore service as quickly as possible.

We explore these best practice roles and operational approaches in today’s post. These recommended practices have been derived over many years at IT shops that have achieved sustained first quartile production performance*. The first step is to have a production incident management process based on an ITIL approach. Some variation and adaptation of ITIL is of course appropriate to ensure a best fit for your company and operation, but make sure you are leveraging these fundamental industry practices and your team is fully up to speed on them. Further, it is preferable to have a dedicated command center which monitors production and has the resources to manage a significant incident when it occurs.

Assuming those capabilities are in place, there should be clear roles for your technology team in handling a production issue. The incident management roles that should be employed include:

  • Technical leads — there may be one or more technical leads for an incident depending on the nature of the issue and its impact. These leads should have a full understanding of the production environment and be highly capable senior engineers in their specialty. Their role is to diagnose and lead the problem resolution effort in their component area (e.g., storage, network, DBMS). They also must reach out and coordinate with other technical leads to solve issues that lie between specialties (e.g., DBMS and storage).
  • Service lead — the service lead is also an experienced engineer or manager, one who understands all system aspects and delivery requirements of the service that has been impacted. This lead helps direct which restoration efforts take priority based on their knowledge of what is most important to the business. They should also be familiar with, and able to direct, service restoration routines or procedures (e.g., a restart). They also have full knowledge of the related services and potential downstream impacts that must be considered or addressed. And they know which business units and contacts must be engaged to enact issue mitigation while the incident is being worked.
  • Incident lead — the incident lead is a command center member who is experienced in incident management, has strong command skills, and understands problem diagnosis and resolution. Their knowledge and experience should span the available systems monitoring and diagnostic tools, the application and infrastructure components and engineering tools, and a base understanding of the services IT must deliver for the business. The incident lead drives all problem resolution actions as needed, including:
    • engaging and directing component and application technical leads, teams, and restoration efforts,
    • collecting and reporting impact data,
    • escalating as required to ensure adequate resources and talent are focused on the issue.
  • Incident coordinator – in addition to the incident lead, there should also be an incident coordinator. This command center member is knowledgeable on the incident management process and procedures and handles key logistics: setting up conference calls, calling or paging resources, drafting and issuing communications, and, importantly, managing to the incident clock for both escalation and task progress (a minimal sketch of such an escalation clock follows this list). The coordinator can be supplemented by additional command center staff for a given incident, particularly if multiple technical resolution calls are spawned by the incident.
  • Senior IT operations management – for critical issues, it is also appropriate for senior IT operations management to be present on the technical bridge, ensuring proper escalation and response. Further, communications may need to be drafted for senior business personnel providing status, impact, and prognosis. If it is a public issue, it may also be necessary to coordinate with corporate public relations and provide information on the issue.
  • Senior management – as is often the case with a major incident, senior management from all areas of IT, and perhaps even the business, will look to join the technical call and discussions focused on service restoration and problem resolution. While this should be viewed as a natural desire (perhaps similar to slowing down and staring at a traffic accident), business and senior management presence can be disruptive and prevent the team from reaching timely resolution. So here is what they should not do:
    • Don’t join the bridge, announce yourself and ask what is going on; this deflects the team’s attention from the work at hand, wastes several minutes bringing you up to speed, and extends the problem resolution time (I have seen this happen far too often).
    • Don’t look to assign blame; the team will likely slow down or even shut down for fear of repercussions, just when honest, open dialogue is needed most to understand the problem.
    • Don’t jump to conclusions about the problem; the team could be led down the wrong path. Few senior managers are both up to date on the technology and strong enough in problem resolution skills to provide reliable suggestions. If you are one of the few, by all means let the team leverage your experience, but be careful if your track record says otherwise.
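
As referenced above, here is a minimal sketch of the escalation clock the incident coordinator manages. The time marks and actions are assumptions; tune them to your own severity matrix and escalation policy:

```python
# A minimal sketch (assumed thresholds) of the incident clock: as elapsed
# time crosses each mark, the next escalation action becomes due.

from datetime import datetime, timedelta

# Assumed escalation ladder; tune the marks and actions to your organization.
ESCALATION_LADDER = [
    (timedelta(minutes=15), "page component technical leads"),
    (timedelta(minutes=30), "engage service lead and open business communications"),
    (timedelta(minutes=60), "escalate to senior IT operations management"),
    (timedelta(minutes=120), "notify senior business management and consider PR"),
]

def due_escalations(started_at: datetime, now: datetime) -> list[str]:
    """Return every escalation action whose time mark has passed."""
    elapsed = now - started_at
    return [action for mark, action in ESCALATION_LADDER if elapsed >= mark]

# Example: 45 minutes into a 2 AM incident, the first two escalations are due.
start = datetime(2016, 9, 1, 2, 0)
print(due_escalations(start, start + timedelta(minutes=45)))
```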

Before we get to the guidelines to practice during an incident, I also recommend ensuring your team has the appropriate attitude and understanding at the start of an incident. Far too often, problems start small or the local team thinks they have things well in hand. They then avoid escalating the issue or reporting it as a potential critical issue. Meanwhile, critical time is lost, and mistakes made by the local team can compound the issue. By the time escalation to the command center does occur, the customer impact has become severe and the options to resolve it are far more limited. I refer to this as trying to put out the fire with a garden hose. It is important to communicate to the team that it is far better to over-report an issue than to report it late. There is no ‘crying wolf’ when it comes to production. The team should first call the fire department (the command center) with a full potential severity alert, and then go back to fighting the fire with the garden hose. Meanwhile, the command center will mobilize all the needed resources to ensure the fire is put out. If everyone arrives and the fire is already out, all will be happy. And if the fire is raging, you now have the full set of resources needed to overcome it.

Now let’s turn our attention to best practice guidelines to leverage during a serious IT incident.

Guidelines in the Heat of Battle:

1. One change at a time (and track all changes; a minimal tracking sketch follows these guidelines).

2. Focus on restoring service first, but note the potential root causes as you come across them. Remember that most root cause analysis and remediation work comes long after service is restored.

3. Ensure configuration information is documented and maintained through the changes.

4. Go back to the last known stable configuration (backing out all changes if necessary to get there). Don’t let engineering ‘optimism’ forward-engineer to a new solution unless it is the only option.

5. Establish clear command lines (one technical, one business interface) and ensure full command center support. It is best for the business not to participate in the technology calls — it is akin to watching sausage get made (no one would eat it if they saw it being made). Your business will feel the same way about technology if they are on the calls.

6. Overwhelm the problem (escalate and bring in the key resources – yours and the vendor’s). Don’t dribble in resources because it is 4 AM. If you work in IT, and you want to be good, this is part of the job. Get the key resources on the call and hold the vendor to the same bar as you hold your own team.

7. Work in parallel wherever reasonable and possible. This should include spawning parallel activities (and technical bridges) to work multiple promising solutions or backups.

8. Follow the clock and use the command center to ensure activities stay on schedule. You must be able to decide when a path is not working and refocus resources on better options, and the clock is a key input to that decision. Escalation and communication must also occur with rigor to maintain confidence and bring the necessary resources to bear.

9. Peer plan, review and implement. Everything done in an emergency (here, to restore service and fix a problem) is highly likely to inject further defects into your systems. Too many issues have been compounded during a change implementation by a typo or a command executed in the wrong environment. Peer planning, review, and implementation will significantly improve the quality of the changes you make.

10. Be ready for the worst: have additional options and a backout plan for the fix. You will save time and be more creative in driving better solutions if you address potential setbacks proactively rather than waiting for them to happen and then reacting.
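
As referenced in guideline 1, here is a minimal sketch of an incident change log. The structure and names are illustrative assumptions, not a specific tool; it enforces one change at a time and records the peer reviewer and backout step called for in guidelines 9 and 10:

```python
# A minimal sketch (illustrative names, not a specific tool) of an incident
# change log that enforces 'one change at a time' and records the peer
# reviewer and backout step for every change made in the heat of battle.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IncidentChange:
    description: str     # what was changed
    implementer: str     # who executed it
    peer_reviewer: str   # who reviewed the plan and watched the execution
    backout_step: str    # how to undo it if it makes things worse
    started: datetime = field(default_factory=datetime.now)
    completed: Optional[datetime] = None

class IncidentChangeLog:
    def __init__(self) -> None:
        self.changes: list[IncidentChange] = []

    def open_change(self, change: IncidentChange) -> None:
        # One change at a time: refuse a new change while another is open.
        if self.changes and self.changes[-1].completed is None:
            raise RuntimeError("Previous change not yet completed and verified")
        self.changes.append(change)

    def close_change(self) -> None:
        # Mark the current change complete once its effect is verified.
        self.changes[-1].completed = datetime.now()
```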

Recall that the ITIL incident management objective is to ‘restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.’ These guidelines will help you build a best practice incident management capability.

What would you add or change in the guidelines? How have you been able to achieve excellent service restoration and problem management? I look forward to hearing from you.

P.S. Please note that these best practices have been honed over the years in world class availability shops for major corporations, with significant contributions from such colleagues as Gary Greenwald, Cecilia Murphy, Jim Borendame, Chris Gushue, Marty Metzker, Peter Josse, Craig Bright, and Nick Beavis (and others).