The Infrastructure Engineering Lifecycle – How to Build and Sustain a Top Quartile Infrastructure

There are over 5,000 books on application development methods on Amazon. Dozens of industry and government standards map out methodologies for application development. And for IT operations and production processes such as problem and change management, the IT Service Management and ITIL standards provide excellent guidance and structure. Yet for the infrastructure systems on which those applications fully rely, there is scarcely a publication that outlines the approaches organizations should use to build and sustain a robust infrastructure. ITIL ventures slightly into this area but really just re-defines a waterfall application project cycle in infrastructure terms. During many years of building, re-building, and sustaining top quartile infrastructure, I have developed a life cycle methodology for infrastructure, or ‘Infrastructure Engineering Life Cycle’ (IELC).

The importance of infrastructure should not be overlooked in our digital age. Customer expectations have increased: not only do they expect ‘always on’ web sites and transaction capabilities, they also require quick response and seamless integration across offerings. Certainly the software is critical to providing the functionality, but none of these services can be reliably and securely delivered without a well-built infrastructure underpinning all of the applications. A top quartile infrastructure delivers outstanding reliability (on the order of 99.9% or better availability), zippy performance, and excellent unit costs, all with robust security and resiliency.

Often enterprises make the mistake of addressing infrastructure only when things break, and they fix or invest just enough to get things running again instead of correctly rebuilding a modern plant. This is unfortunate because not only will they likely experience further outages and service impacts, but their total infrastructure costs are also likely to be higher for a dated, dysfunctional plant than for an updated, modern one. Unlike most assets, I have found that a modern, well-designed IT infrastructure is cheaper to run than a poorly maintained plant with various obsolete or poorly configured elements. Remember that every new generation of equipment can do roughly twice as much as the previous one, so you have fewer components, less maintenance, less administration, and fewer things that can go wrong. In addition, a modern plant improves time to market for application projects and significantly reduces the portion of time both infrastructure and application engineers spend fixing things.

So, given the critical nature of well-run technology infrastructure in a world of digitalization, how do enterprises and CIOs build and maintain a modern plant with outstanding fit and finish? It is not just about buying lots of new equipment, or counting on a single vendor or cloud provider to take care of all the integration or services. Nearly all major enterprises have a legacy of systems that creates complexity and complicates the ability to deliver reliable services or keep pace with new capabilities. These complexities can rarely be handled by a systems integrator or single service provider. Further, a complete re-build of the infrastructure often requires major capital investment and can put availability even further at risk. The best course is usually not to go ‘all-in’, where you launch a complete re-build or hand over the keys to a sole outsourcer, but instead to take a ‘spiral optimization’ approach that addresses fundamentals and burning issues first, and then uses the newly acquired capabilities to tackle the more complex or less pressing issues that remain.

This repeated, closed-cycle approach (‘spiral optimization’) is our management approach, and it is coupled with an Infrastructure Engineering Lifecycle (IELC) methodology to build top quartile infrastructure. For the first cycle of the infrastructure rebuild, it is important to address the biggest issues. Front and center, the entire infrastructure team must focus on quality. Poorly designed or built infrastructure becomes a black hole of engineering time as rework demands grow with each failure or application built upon a teetering platform. And while it must be understood that everything cannot be fixed at once, those things that are undertaken must be done with quality. This includes documenting the systems and getting them correctly into the asset management database. And it includes coming up with a standard design or service offering if none exists. Having 5,000 servers must be viewed as a large expense requiring great care and feeding, and the only thing worse is having 5,000 custom servers because your IT team did not take the time to define the standard, keep it up to date, and maintain and patch it consistently. 5,000 custom servers are a massive expense that likely cannot be effectively and efficiently maintained or secured by any team. There is no cheaper time than the present to begin standardizing and fixing the mess by requiring that the next server built or significantly updated becomes the new standard. Don’t start this effort, though, until you have the engineering capacity to do it. A standard design done by lousy engineers is not worth the investment. So, as an IT leader, while you are insisting on quality, ensure you have adequate talent to engineer your new standards. If you do not have it on board, leverage top practitioners in the industry to help your team create the new designs.
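
To make the ‘define the standard and hold to it’ point concrete, here is a minimal sketch (in Python, with entirely hypothetical build attributes and host names) of how a standard server build could be expressed as data and used to flag servers that have drifted into ‘custom’ territory. A real implementation would pull the inventory from your asset management database rather than a hard-coded list.

    # Minimal sketch: express a standard server build as data and flag servers
    # that drift from it. STANDARD_BUILD and the inventory records are
    # hypothetical; in practice they would come from your own CMDB.

    STANDARD_BUILD = {
        "os": "RHEL 9.4",
        "cpu_cores": 32,
        "memory_gb": 256,
        "monitoring_agent": "2.8",
        "patch_baseline": "2024-06",
    }

    inventory = [
        {"host": "app-srv-014", "os": "RHEL 9.4", "cpu_cores": 32,
         "memory_gb": 256, "monitoring_agent": "2.8", "patch_baseline": "2024-06"},
        {"host": "app-srv-212", "os": "RHEL 7.9", "cpu_cores": 16,
         "memory_gb": 64, "monitoring_agent": "2.1", "patch_baseline": "2022-11"},
    ]

    def drift(server, standard):
        """Return the attributes where a server deviates from the standard build."""
        return {k: (server.get(k), v) for k, v in standard.items() if server.get(k) != v}

    for server in inventory:
        deviations = drift(server, STANDARD_BUILD)
        status = "standard" if not deviations else f"custom ({len(deviations)} deviations)"
        print(f"{server['host']}: {status}")

The value is less in the script than in the discipline it implies: every server flagged as custom carries an explicit, visible cost of ownership.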

In addition to quality and starting to do things right, there are several fundamental practices that must be implemented. Your infrastructure engineering work should be guided by the infrastructure engineering lifecycle, a methodology and set of practices that ensures high quality platforms that are effective, efficient, and sustainable.

The IELC covers all phases of infrastructure platforms, from emerging platforms to standard platforms to declining and obsolete ones. Importantly, the IELC comprises three cycles of activity, recognizing that infrastructure requires constant grooming and patching, with inputs typically coming from external parties, and that technology advances occur so regularly that over 3 to 10 years nearly every hardware platform becomes obsolete and must be replaced. The three cycles of activity are:

  • Platform – This is the foundational lifecycle activity where hardware and utility software are defined, designed, and integrated into a platform that performs a particular service. Generally, for medium and large companies, this is a 3 to 5 year lifecycle. Examples include a server platform, a storage platform, or an email platform.
  • Release – Once a platform is initially designed and implemented, organizations should expect to refresh it on a regular basis to incorporate major underlying product or technology enhancements, address significant design flaws or gaps, and improve operational performance and reliability. Releases should be planned at 3 to 12 month intervals over the life of the platform (which is usually 3 to 5 years).
  • Patch – A patch cycle should also be employed, where minor upgrades (both fixes and enhancements) are applied on a regular, routine basis. The patch cycle should synchronize with both the underlying patch cycle of the OEM (Original Equipment Manufacturer) for the product and the security and production requirements of the organization. Usually, patch cycles are used to incorporate security fixes and significant production defect fixes issued by the OEM. Typical patch cycles range from weekly to every 6 months. (A simple way to track these three cadences is sketched after this list.)
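
As a rough illustration of how the three cadences could be tracked, here is a minimal Python sketch. The platform name, dates, and the specific intervals chosen are assumptions picked from the ranges above (3 to 5 years for a platform, 3 to 12 months for releases, weekly to every 6 months for patches); they are not prescriptive values.

    # Minimal sketch of tracking the three IELC cycles for a single platform.
    # The intervals below are illustrative choices within the ranges described
    # in the list above.

    from datetime import date, timedelta

    class PlatformLifecycle:
        def __init__(self, name, introduced, platform_years=4,
                     release_months=6, patch_weeks=4):
            self.name = name
            self.introduced = introduced
            self.platform_years = platform_years
            self.release_months = release_months
            self.patch_weeks = patch_weeks

        def next_due(self, last_release, last_patch, today=None):
            """Return upcoming dates for platform refresh, next release, and next patch."""
            today = today or date.today()
            return {
                "platform_refresh": self.introduced + timedelta(days=365 * self.platform_years),
                "next_release": last_release + timedelta(days=30 * self.release_months),
                "next_patch": last_patch + timedelta(weeks=self.patch_weeks),
                "patch_overdue": today > last_patch + timedelta(weeks=self.patch_weeks),
            }

    email_platform = PlatformLifecycle("email", introduced=date(2023, 1, 15))
    print(email_platform.next_due(last_release=date(2024, 9, 1), last_patch=date(2025, 1, 10)))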

Below is a diagram that represents the three infrastructure engineering life cycles and the general parameters of the cycles.

Infrastructure Engineering Cycles

In subsequent posts, I will further detail key steps and practices within the cycles as well as provide templates that I have found to be effective for infrastructure teams. As a preview, here is the diagram of the cycles with their activities and attributes.

IELC Preview

What key practices or techniques have you used for your infrastructure teams to enable them to achieve success? I look forward to your thoughts and comments.

Best, Jim Ditmore

 

Using Organizational Best Practices to Handle Cloud and New Technologies

I have extended and updated this post, which was first published in InformationWeek in March 2013. I think it is a very salient and pragmatic organizational method for IT success. I look forward to your feedback! Best, Jim

IT organizations are challenged to keep up with the latest wave of cloud, mobile and big data technologies, which are outside the traditional areas of staff expertise. Some industry pundits recommend bringing on more technology “generalists,” since cloud services in particular can call on multiple areas of expertise (storage, server, networking). Or they recommend employing IT “service managers” to bundle up infrastructure components and provide service offerings.

But such organizational changes can reduce your team’s expertise and accountability and make it more difficult to deliver services. So how do you grow your organization’s expertise to handle new technologies? At the same time, how do you organize to deliver business demands for more product innovation and faster delivery yet still ensure efficiency, high quality and security?

Rather than acquire generalists and add another layer of cost and decision making to your infrastructure team, consider the following:

Cloud computing. Assign architects or lead engineers to focus on software-as-a-service and infrastructure-as-a-service, ensuring that you have robust estimating and costing models and solid implementation and operational templates. Establish a cloud roadmap that leverages SaaS and IaaS, ensuring that you don’t overreach and end up balkanizing your data center.
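
As an illustration of the kind of estimating and costing model mentioned above, here is a minimal Python sketch comparing a three-year on-premises cost against an assumed IaaS rate. Every figure (capex per server, opex, hourly rate, utilization) is a placeholder; a real model would use your own hardware, facilities, labor, and negotiated cloud pricing.

    # Minimal sketch of an IaaS vs. on-premises costing comparison.
    # All figures are placeholders for illustration only.

    def on_prem_total(servers, server_capex=9000, useful_life_years=4,
                      annual_opex_per_server=2500, years=3):
        """Amortized hardware cost plus power/space/labor over the horizon."""
        annual_capex = servers * server_capex / useful_life_years
        return years * (annual_capex + servers * annual_opex_per_server)

    def iaas_total(servers, hourly_rate=0.45, utilization=0.65, years=3):
        """Pay-per-use cost, crediting the ability to shut idle instances down."""
        hours = years * 365 * 24 * utilization
        return servers * hourly_rate * hours

    workload = 120  # server-equivalents for a hypothetical application portfolio
    print(f"3-year on-prem: ${on_prem_total(workload):,.0f}")
    print(f"3-year IaaS:    ${iaas_total(workload):,.0f}")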

For appliances and private cloud, given their multiple component technologies, let your best component engineers learn adjacent fields. Build multi-disciplinary teams to design and implement these offerings. Above all, though, don’t water down the engineering capacity of your team by selecting generalists who lack depth in a component field. For decades, IT has built complex systems with multiple components by leveraging multi-faceted teams of experts, and cloud is no different.

Where to use ‘service managers’. A frequent flaw in organizations is to employ ‘service managers’ who group multiple infrastructure components (e.g., storage, servers, data centers) into a ‘product’ (e.g., a ‘hosting service’) and provide direction and an interface for this product. This is an entirely artificial layer that removes accountability from the component teams and often makes poor ‘product’ decisions because of limited knowledge and depth. In the end, IT does not deliver ‘hosting services’; IT delivers systems that perform business functions (e.g., for banking, teller or branch functions and ATMs; or for insurance, claims reporting or policy quote and issue). These business functions are the true IT services and are where you should apply a service manager role. Here, a service manager can ensure end-to-end integration and quality, drive better overall transaction performance and reliability, and provide deep expertise on system connections, SLAs, and business needs back across the application and infrastructure component teams. And because the role is directly attached to the business functions to be performed, it will yield high value. These service managers will be invaluable for new development and enhancement work as well as for assisting during production issues.
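
To illustrate what a business-function-level service definition might look like (as opposed to a ‘hosting service’ product), here is a minimal Python sketch. The service, owner, component systems, and targets are all hypothetical; the point is that the SLA is expressed end to end against a business function, not per infrastructure component.

    # Minimal sketch of a business-function service definition a service
    # manager could own. The service, components, and targets are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class BusinessService:
        name: str
        owner: str
        availability_target: float      # end-to-end, e.g. 0.999
        response_ms_target: int         # end-to-end transaction response
        component_systems: list = field(default_factory=list)

        def meets_targets(self, measured_availability, measured_response_ms):
            """Compare measured end-to-end figures against the business SLA."""
            return (measured_availability >= self.availability_target
                    and measured_response_ms <= self.response_ms_target)

    claims_reporting = BusinessService(
        name="claims reporting",
        owner="claims service manager",
        availability_target=0.999,
        response_ms_target=800,
        component_systems=["claims app", "document store", "policy database", "web tier"],
    )

    print(claims_reporting.meets_targets(measured_availability=0.9992, measured_response_ms=640))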

Mobile. If mobile isn’t already the most critical interface for your company, it will be in three to five years. So don’t treat mobile as an afterthought, to be adapted from traditional interfaces. And don’t outsource this capability, as mobile will be pervasive in everything you build.

Build a mobile competency center that includes development, user experience and standards expertise. Then fan out that expertise to all of your development teams, while maintaining the core mobile group to assist with the most difficult efforts. And of course, continue with a central architecture and control of the overall user experience. A consistent mobile look, feel and flow is essentially your company’s brand, invaluable in interacting with customers.

Big data. There are two key aspects of this technology wave: the data (and traditional analytic uses) and real-time data “decisioning,” similar to IBM’s Watson. You can handle the data analytics as an extension of your traditional data warehousing (though on steroids). However, real-time decisioning has the potential to dramatically alter how your organization specifies and encodes business rules.

Consider the possibility that 30% to 50% of all business logic traditionally encoded in 3rd or 4th generation programming languages instead becomes decisioned in real time. This capability will require new development and business analyst skills. For now, cultivate a central team with these skills. As you pilot and determine how to leverage real-time data decisioning more broadly, decide how to seed your broader development teams with these capabilities. In the longer run, I believe these skills will need to be an inherent part of each development team.
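
As a simple illustration of the shift from hard-coded branching logic to rules decisioned at run time, here is a minimal Python sketch in which the business rules are held as data and evaluated against each transaction. The rules and transaction fields are invented for the example; production rule engines and decisioning platforms are far richer.

    # Minimal sketch of business logic held as data-driven rules evaluated at
    # run time instead of hard-coded branches. Rules and fields are illustrative.

    RULES = [
        {"name": "high_value_review", "field": "amount", "op": "gt",
         "value": 10000, "action": "route_to_manual_review"},
        {"name": "new_customer_hold", "field": "customer_age_days", "op": "lt",
         "value": 30, "action": "hold_24h"},
    ]

    OPS = {"gt": lambda a, b: a > b, "lt": lambda a, b: a < b, "eq": lambda a, b: a == b}

    def decide(transaction, rules=RULES):
        """Return the actions triggered by the rules that match this transaction."""
        return [r["action"] for r in rules
                if OPS[r["op"]](transaction.get(r["field"]), r["value"])]

    print(decide({"amount": 15000, "customer_age_days": 400}))   # ['route_to_manual_review']
    print(decide({"amount": 50, "customer_age_days": 5}))        # ['hold_24h']

Changing a threshold or adding a rule then becomes a data change an analyst can review, rather than a code release.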

Competing Demands. Overall, IT organizations must meet several competing demands: work with business partners to deliver competitive advantage; do so quickly in order to respond to (and anticipate) market demands; and provide efficient, consistent quality while protecting the company’s intellectual property, data and customers. In essence, there are business and market drivers that value speed, business knowledge and closeness at a reasonable cost, and risk drivers that value efficiency, quality, security and consistency.

Therefore, we must design an IT organization and systems approach that meets both sets of drivers and accommodates business organizational change. As opposed to organizing around one set of drivers or the other, the best solution is to organize IT as a hybrid organization to deliver both sets of capabilities.

Typically, the functions that should be consolidated and organized centrally to deliver scale, efficiency and quality are infrastructure (especially networks, data centers, servers and storage), IT operations, information security, service desks and anything else that should be run as a utility for the company. The functions to be aligned and organized along business lines to promote agility and innovation are application development (including Web and mature mobile development), data marts and business intelligence.

Some functions, such as database, middleware, testing and project management, can be organized in either mode. But if they aren’t centralized, they’ll require a council to ensure consistent processes, tools, measures and templates.

For services becoming a commodity, or where there’s a critical advantage to having one solution (e.g., one view of the customer for the entire company), it’s best to have a single team or utility that’s responsible (along with a corresponding single senior business sponsor). Where you’re looking to improve speed to market or market knowledge, organize into smaller IT teams closer to the business. The diagram below gives a graphical view of the hybrid organization.

The IT Hybrid Model diagram

With this approach, your IT shop will be able to deliver the best of both worlds. And you can then weave in the new skills and teams required to deliver the latest technologies such as cloud and mobile. You can read more about this hybrid model in our best practice reference page.

Which IT organizational approaches or variations have you seen work best? How are you accommodating new technologies and skills within your teams? Please weigh in with a comment below.

Best, Jim Ditmore

A Cloudy Future: Hard Truths and How to Best Leverage Cloud

We are long into the marketing hype cycle on cloud, which means that clear criteria to assess and evaluate the different cloud options are critical. While cloud computing is often presented as homogeneous, there are many different types, from infrastructure as a service (IaaS) to software as a service (SaaS) and many flavors in between. Perhaps the best examples are Amazon’s infrastructure services (IaaS), Google’s email and office productivity services (SaaS), and Salesforce.com’s customer relationship management (CRM) services (SaaS). Typically, the cloud is envisioned as an accessible, low cost compute utility in the sky that is always available. Given these complexities, what approach should the medium to large enterprise take to best leverage cloud and optimize its data center? And what are the pitfalls? Despite the lofty promise, companies will need to select and build their cloud environment carefully to avoid fracturing their computing capabilities, locking themselves into a single, higher cost environment, or impairing their ability to differentiate and gain competitive advantage – or all three.

The chart below provides an overview of the different types of cloud computing:

Cloud Computing and Variants

 

Note the positioning of the two dominant types of cloud computing:

  • there is the specialized Software-as-a-Service (SaaS), where the entire stack from server to application (even the version) is provided, with minimal variation
  • there is the very generic IaaS or PaaS, where a set of servers and OS version(s) is available along with types of storage; any compatible database, middleware, or application can then be installed and run.

Other types of cloud computing include the private cloud, essentially IaaS that an enterprise builds for itself. The private cloud variant is the evolution of the current corporate virtualized server and storage farm into a more mature service with clearly defined service configurations, offerings, and billing, as well as highly automated provisioning and management.
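
To illustrate the ‘clearly defined service configurations, offerings, billing’ aspect of a private cloud, here is a minimal Python sketch of a small service catalog with a check that provisioning requests stay on standard configurations. The sizes, specifications, and chargeback figures are hypothetical.

    # Minimal sketch of a private cloud service catalog with standard
    # configurations and a check that requests stay on standard.
    # Catalog entries and prices are hypothetical.

    CATALOG = {
        "small":  {"vcpu": 2,  "memory_gb": 8,   "storage_gb": 100,  "monthly_usd": 55},
        "medium": {"vcpu": 8,  "memory_gb": 32,  "storage_gb": 500,  "monthly_usd": 180},
        "large":  {"vcpu": 16, "memory_gb": 128, "storage_gb": 2000, "monthly_usd": 520},
    }

    def provision(request):
        """Accept only catalog configurations; anything else needs an exception review."""
        size = request.get("size")
        if size not in CATALOG:
            return {"status": "rejected", "reason": f"non-standard size '{size}'"}
        return {"status": "approved", "config": CATALOG[size],
                "chargeback_usd": CATALOG[size]["monthly_usd"]}

    print(provision({"app": "policy-quote", "size": "medium"}))
    print(provision({"app": "legacy-batch", "size": "custom-96gb"}))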

Another technology impacting the data center is the engineered stack. Engineered stacks are a further evolution of the computer appliances that have been available for decades: tightly specified, designed, and engineered components integrated to provide superior performance and cost. These devices have typically been in the network, security, database, and specialized compute spaces. Firewalls and other security devices have long leveraged this approach, where generic technology (CPU, storage, OS) is closely integrated with additional special purpose software and sold and serviced as a packaged solution. There has been a steady increase in the number of appliance or engineered stack offerings, moving further into data analytics, application servers, and middleware.

With the landscape set, it is important to understand the technology industry market forces and the customer economics that will drive the data center landscape over the next five years. First, technology vendors will continue to invest in and expand their SaaS and engineered stack offerings because they offer significantly better margins and more certain long term revenue. A SaaS offering gets a far higher Wall Street multiple than traditional software licenses, and for good reason: it can be viewed as a consistent, ongoing revenue stream where the customer is heavily locked in. Similarly for engineered stacks, traditional hardware vendors are racing to integrate as far up the stack as possible, both to create additional value and, more importantly, to gain a lock-in advantage where upgrades, support, and maintenance can be more assured and at higher margin than with traditional commodity servers or storage. It is a higher hurdle to replace an engineered stack than commodity equipment.

The industry investment will be accelerated by customer spend. Both SaaS and engineered stacks provide appealing business value that will justify their selection. For SaaS, it is speed and ease of implementation as well as potentially variable cost. For engineered stacks, it is a performance uplift at potentially lower cost that often makes the sale. Both SaaS and engineered stacks should be selected where the business case makes sense, but with the following cautions:

  • for SaaS:
    • be very careful if it covers core business functionality or processes; you could be locking away your differentiation and ultimate competitiveness
    • before you sign, know how you will get your data back should you stop using the SaaS
    • make sure you have ensured the integrity and security of your data in your vendor’s hands
  • for engineered stacks:
    • understand where the product is in its lifecycle before selecting
    • anticipate the eventual migration path as the product fades at the end of its cycle

For both, ensure you avoid integrating key business logic into the vendor’s product. Otherwise you will be faced with high migration costs at the end of the product’s life or when a more compelling product appears. There are multiple ways to ensure that your key functionality and business rules remain independent and modular, outside of the vendor’s service package.
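
One common way to keep business rules independent of the vendor package (offered here as an illustration, not as the only approach) is to isolate all vendor-specific calls behind a thin adapter so that the rules live in your own module. The sketch below is Python with an entirely hypothetical vendor API; a later migration would then touch only the adapter, not the rules.

    # Minimal sketch: business rules live in your own module; only a thin
    # adapter speaks the (hypothetical) vendor API.

    def discount_rule(order_total, loyal_customer):
        """Company-owned business rule, independent of any vendor package."""
        if loyal_customer and order_total > 1000:
            return 0.10
        return 0.0

    class VendorCrmAdapter:
        """The only place that knows the vendor's field names and call style."""
        def __init__(self, vendor_client):
            self.client = vendor_client   # e.g. a SaaS SDK instance (hypothetical)

        def fetch_order(self, order_id):
            raw = self.client.get_order(order_id)          # vendor-specific call
            return {"total": raw["amt"], "loyal": raw["tier"] == "GOLD"}

    def price_order(adapter, order_id):
        order = adapter.fetch_order(order_id)
        return order["total"] * (1 - discount_rule(order["total"], order["loyal"]))

    class FakeVendorClient:
        """Stand-in for the vendor SDK so the sketch runs on its own."""
        def get_order(self, order_id):
            return {"amt": 1500.0, "tier": "GOLD"}

    print(price_order(VendorCrmAdapter(FakeVendorClient()), "ORD-1"))   # 1350.0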

With these caveats in mind, and a critical eye on your contract to avoid onerous terms and lock-ins, you will be successful with the project level decisions. But you should drive optimization at the portfolio level as well. If you are a medium to large enterprise, you should be driving your internal infrastructure to mature its offerings into an internal private cloud. Virtualization, already widespread in the industry, is just the first step. You should move to eliminate or minimize custom configurations (preferably to less than 20% of your server population). Next, invest in the tools, processes, and engineering so you can heavily automate the provisioning and management of the data center. Doing this will also improve the quality of service.

Make sure that you do not shift so much of your processing to SaaS that you ‘balkanize’ your own utility; your data center utility would then operate subscale and inefficiently. Should you overreach, expect to incur heavy integration costs on subsequent initiatives (because your functionality will be spread across multiple SaaS vendors in many data centers). You can also expect performance issues as your systems operate at WAN speeds rather than LAN speeds across these centers, and expect to lose negotiating position with SaaS providers because you have lost your ‘in-source’ strength.

I would venture that over the next five years a well-managed IT shop will see:

  • the most growth in its SaaS and engineered stack portfolio,
  • a conversion from custom infrastructure to a robust private cloud, with a small sliver of custom remaining for unconverted legacy systems,
  • minimal growth in PaaS and IaaS (growth here is actually driven by small to medium firms).

This transition is represented symbolically in the chart below:

Data Center Transition Over the Next 5 Years

So, on the road to a cloud future, SaaS and engineered stacks will be part of nearly every company’s portfolio. Vendor lock-in could be around every corner, but good IT shops will leverage these capabilities judiciously, develop their own private cloud capabilities, retain critical IP, and avoid the lock-ins. We will see far greater efficiency in the data center as custom configurations are heavily reduced. So while the prospects are indeed ‘cloudy’, the future is potentially bright for the thoughtful IT shop.

What changes or guidelines would you apply when considering cloud computing and the many offerings? I look forward to your perspective.

This post appeared in its original version at InformationWeek on January 4. I have extended and revised it since then.

Best, Jim Ditmore