Celebrate 2013 Technology or Look to 2014?

The year is quickly winding down and 2013 will not be remembered as a stellar year for technology. Between the NSA leaks and Orwellian revelations, the Healthcare.gov mishaps, the cloud email outages (and Yahoo’s is still lingering) and now the 40 million credit identities stolen from Target, 2013 actually was a pretty tough year for the promise of technology to better society.

While the breakneck progress of technology continued, we witnessed so many shortcomings in its implementation. Fundamental gaps in large project delivery and in availability design and implementation continue to plague large and widely used systems. It is as if the primary design lessons of ‘Galloping Gertie’ regarding resonance were never absorbed by bridge builders. The costs of such major flaws in these large systems are certainly similar to those of a failed bridge. And as it turns out, if there is a security flaw or loophole, either the bad guys or the NSA will exploit it. I particularly like the NSA’s use of ‘smiley faces’ on internal presentations when they find a major gap in someone else’s system.

So, given 2013 has shown the world we live in all too clearly, as IT leaders let’s look to 2014 and resolve to do things better. Let’s continue to up the investment in security within our walls and be more demanding of our vendors to improve their security. Better security is the number 2 focus item (behind data analytics) for most firms and the US government. And security spend will increase an out-sized amount even as total spend goes up by 5%. This is good news, but let’s ensure the money is spent well and we make greater progress in 2014. Of course, one key step is to get XP out of your environment by March since it will no longer be patched by Microsoft. For a checklist on security, here is a good start at my best practices security reference page.

As for availability, remember that quality provides the foundation for availability. Whether in design, implementation or change, quality must be woven throughout these processes to enable robust availability and meet the demands of today’s 7×24 mobile consumers. Resolve to move your shop from craft to science in 2014, and make a world of difference for your company’s interface to its customers. Again, if you are wondering how best to start this journey and make real progress, check out this primer on availability.

Now, what should you look for in 2014? As with last January, where I made 6 predictions for 2013, I will make 6 technology predictions for 2014. Here we go!

6. There will be consolidation in the public cloud market as smaller companies fail to gather enough long-term revenue to survive and compete in a market with rapidly falling prices. Nirvanix was the first of many.

5. NSA will get real governance, though it will be secret governance. There is too much of a firestorm for this to continue in current form.

4. Dual SIM phones become available in major markets. This is my personal favorite wish list item and it should come true in the Android space by 4Q.

3. Microsoft’s ‘messy’ OS versions will be reduced, but Microsoft will not deliver on the ‘one’ platform. Expect Microsoft to drop RT and continue to incrementally improve Pro and Enterprise to be more like Windows 7. As for Windows Phone OS, it is a question of sustained market share and the jury is out. It should hang on for a few more years though.

2. With a new CEO, a Microsoft breakup or spinoffs are in the cards. The activist shareholders are holding fire while waiting for the new CEO, but will be applying the flame once again. Effects? How about Office on the iPad? Everyone is giving away software and charging for hardware and services, forcing an eventual change in the Microsoft business model.

1. Flash revolution in the enterprise. What looked at the start of 2013 to be 3 or more years out looks now like this year. The emergence of flash storage at prices (with de-duplication) comparable to traditional storage and 90% reductions in environmentals will become a stampede with the next generation of flash costing significantly less than disk storage.

What are your top predictions? Anything to change or add?

I look forward to your feedback and next week I will assess how my predictions from January 2013 did — we will keep score!

Best, and have a great holiday,

Jim Ditmore

How Did Technology End Up on the Sunday Morning Talk Shows?

It has been two months since the Healthcare.gov launch and by now nearly every American has heard or witnessed the poor performance of the websites. Early on, only one of every five users was able to actually sign in to Healthcare.gov, while poor performance and unavailable systems continue to plague the federal and some state exchanges. Performance was still problematic several weeks into the launch and even as of Friday, November 30, the site was down for 11 hours for maintenance. As of today, December 1, the promised ‘relaunch day’, it appears the site is ‘markedly improved’ but there are plenty more issues to fix.

What a sad state of affairs for IT. So, what do the Healthcare website issues teach us about large project management and execution? Or further, about quality engineering and defect removal?

Soon after the launch, former federal CTO Aneesh Chopra, in an Aspen Institute interview with The New York Times’ Thomas Friedman, shrugged off the website problems, saying that “glitches happen.” Chopra compared the Healthcare.gov downtime to the frequent appearances of Twitter’s “fail whale” as heavy traffic overwhelmed that site during the 2010 soccer World Cup.

But given that the size of the signup audience was well known and that website technology is mature and well understood, how could the government create such an IT mess? Especially given how much lead time the government had (more than three years) and how much it spent on building the site (estimated between $300 million and $500 million).

Perhaps this is not quite so unusual. Industry research suggests that large IT projects are at far greater risk of failure than smaller efforts. A 2012 McKinsey study revealed that 17% of IT projects budgeted at $15 million or higher go so badly as to threaten the company’s existence, and more than 40% of them fail. As bad as the U.S. healthcare website debut is, there are dozens of examples, both government-run and private, of similar debacles.

In a landmark 1995 study, the Standish Group established that only about 17% of IT projects could be considered “fully successful,” another 52% were “challenged” (they didn’t meet budget, quality or time goals) and 30% were “impaired or failed.” In a recent update of that study conducted for ComputerWorld, Standish examined 3,555 IT projects between 2003 and 2012 that had labor costs of at least $10 million and found that only 6.4% of them were successful.

Combining the inherent problems associated with very large IT projects with outdated government practices greatly increases the risk factors. Enterprises of all types can track large IT project failures to several key reasons:

  • Poor or ambiguous sponsorship
  • Confusing or changing requirements
  • Inadequate skills or resources
  • Poor design or inappropriate use of new technology

Unfortunately, strong sponsorship and solid requirements are difficult to come by in a political environment (read: Obamacare), where too many individual and group stakeholders have reason to argue with one another and change the project. Applying the political process of lengthy debates, consensus-building and multiple agendas to defining project requirements is a recipe for disaster.

Furthermore, based on my experience, I suspect the contractors doing the government work encouraged changes, as they saw an opportunity to grow the scope of the project with much higher-margin work (change orders are always much more profitable than the original bid). Inadequate sponsorship and weak requirements were undoubtedly combined with a waterfall development methodology and overall big bang approach usually specified by government procurement methods. In fact, early testimony by the contractors ‘cited a lack of testing on the full system and last-minute changes by the federal agency’.

Why didn’t the project use an iterative delivery approach to hone requirements and interfaces early? Why not start with healthcare site pilots and betas months or even years before the October 1 launch date? The project was underway for three years, yet nothing was made available until October 1. And why did the effort leverage only an already occupied pool of virtualized servers that had little spare capacity for a major new site? For less than 10% of the project costs a massive dedicated farm could have been built.  Further, there was no backup site, nor any monitoring tools implemented. And where was the horizontal scaling design within the application to enable easy addition of capacity for unexpected demand? It is disappointing to see such basic misses in non-functional requirements and design in a major program for a system that is not that difficult or unique.

These basic deliverables and approaches appear to have been fully missed in the implementation of the website. Further, the website code appears to have been quite sloppy, not even using common caching techniques to improve performance. Thus, in addition to suffering from weak sponsorship and ambiguous requirements, this program failed to leverage well-known best practices for the technology and design.
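To make "common caching techniques" concrete, here is a minimal, generic sketch of a time-based cache in front of a slow lookup. It is only an illustration of the technique and not a description of how Healthcare.gov was or should have been built; `TTLCache`, `load_plan_listing` and the cache key are hypothetical names invented for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache: reuse a computed value for ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self.hits += 1          # fresh entry: skip the expensive call
            return entry[1]
        self.misses += 1            # missing or expired: recompute and store
        value = compute()
        self._store[key] = (now, value)
        return value


def load_plan_listing() -> list:
    """Hypothetical stand-in for a slow backend or database call."""
    return ["plan-a", "plan-b", "plan-c"]


cache = TTLCache(ttl_seconds=60)
for _ in range(1000):
    plans = cache.get_or_compute("plans:state=XX", load_plan_listing)
print(cache.hits, cache.misses)
```

With a cache like this, a burst of identical requests reaches the backend once per expiry window rather than once per request, which is the kind of relief a heavily loaded site needs for content that changes rarely.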

One would have thought that, given the scale of and expenditure on the program, top technical resources would have been allocated and would have ensured these practices were used. The feds are scrambling with a “surge” of tech resources for the site. And while the new resources and leadership have made improvements so far, the surge will bring its own problems. It is very difficult to effectively add resources to an already large program. New ideas introduced by the ‘surge’ resources may not be accepted or easily integrated. And if the issues are deeply embedded in the system, it will be difficult for the new team to fully fix the defects. For every 100 defects identified in the first few weeks, my experience with quality suggests there are 2 or 3 times more defects buried in the system. Furthermore, if the project couldn’t handle the “easy” technical work (sound website design and horizontal scalability), how will it handle the more difficult challenges of data quality and security?

These issues will become more apparent in the coming months when the complex integration with backend systems from other agencies and insurance companies becomes stressed. And already the fraudsters are jumping into the fray.

So, what should be done and what are the takeaways for an IT leader? Clear sponsorship and proper governance are table stakes for any big IT project, but in this case more radical changes are in order. Why have all 36 states and the federal government roll out their healthcare exchanges in one waterfall or big bang approach? The sites that are working reasonably well (such as the District of Columbia’s) were developed independently. Divide the work up where possible, and move to an iterative or spiral methodology. Deliver early and often.

Perhaps even use competitive tension by having two contractors compete against each other for each such cycle. Pick the one that worked the best and then start over on the next cycle. But make them sprints, not marathons. Three- or six-month cycles should do it. The team that meets the requirements, on time, will have an opportunity to bid on the next cycle. Any contractor that doesn’t clear the bar gets barred from the next round. Now there’s no payoff for a contractor encouraging endless changes. And you have broken up the work into more doable components that can then be improved in the next implementation.

Finally, use only proven technologies. And why not ask the CIOs or chief technology architects of a few large-scale Web companies to spend a few days reviewing the program and designs at appropriate points? It’s the kind of industry-government partnership we would all like to see.

If you want to learn more about how to manage (and not to manage) large IT programs, I recommend “Software Runaways,” by Robert L. Glass, which documents some spectacular failures. Reading the book is like watching a traffic accident unfold: It’s awful but you can’t tear yourself away. Also, I expand on the root causes of and remedies for IT project failures in my post on project management best practices.

And how about some projects that went well? Here is a great link to the 10 best government IT projects in 2012!

What project management best practices would you add? Please weigh in with a comment below.

Best, Jim Ditmore

This post was first published in late October in InformationWeek and has been updated for this site.

Whither Virtual Desktops?

The enterprise popularity of tablets and smartphones at the expense of PCs and other desktop devices is also sinking desktop virtualization. In addition to the clear link that tablets and smartphones are cannibalizing PC sales, mobility and changing device economics are also impacting corporate desktop virtualization, or VDI.

The heyday of virtual desktop infrastructure came around 2008 to 2010, as companies sought to cut their desktop computing costs — VDI promised savings from 10% to as much as 40%. Those savings were possible despite the additional engineering and server investments required to implement the VDI stack. Some companies even anticipated replacing up to 90% of their PCs with VDI alternatives. Companies sought to reduce desktop costs and address specific issues not well-served by local PCs (e.g., smaller overseas sites with local software licensing and security complexities).

But something happened on the way to VDI dominance. The market changed faster than VDI matured. Employee demand for mobile devices, in line with the BYOD phenomenon, has refocused IT shops on delivering mobile device management capabilities, not VDI, so that employees can securely use their smartphones for work. On-the-go employees are gravitating toward new lightweight laptops, a variety of tablets and other non-desktop innovations that aren’t VDI-friendly. Mobile employees want to use multiple devices; they don’t want to be tied down to a single VDI-based interface. Because the VDI interface is at best cumbersome on a touch device running an OS other than Windows, there will be less and less demand for VDI as the way to connect. And since the dominance of these highly mobile smartphones and tablets will only increase in the next few years as the client device war between Apple, Android, and Microsoft (Nokia) heats up further (and the vendors continue to produce better and cheaper products), VDI’s appeal will fall even further.

Meantime, PC prices, both desktop and laptop, which have declined steadily over the past 4 years, dropping 30-40% (other than Apple’s products, of course), will fall even faster. With the decline in shipments these past 18 months, the entire industry has overcapacity, and the only way out of the situation is to spur demand and consumer interest in PCs through further cost reductions. (Note that the answer is not that Windows 8 will spur demand.) Already Dell and Lenovo are using lower prices to try to hold their volumes steady. And with other devices entering the market (e.g., smart TVs, smart game stations, etc.), it will become a very bloody marketplace. The end result for IT shops will be pretty slick $300 laptops that come fully loaded with Windows (perhaps even Office). At those prices, VDI will have minimal or no cost advantage, especially taking into account the backend VDI engineering costs. And if you can buy a fully equipped $300 laptop or tablet that is preferred by most employees, IT shops will be hard pressed to pass that up and impose VDI. In fact, by late 2014, corporate IT shops could be faced with their VDI solutions costing more than traditional client devices (e.g., that $300 laptop). This is because the major components of VDI costs (servers, engineering work and support) will not drop nearly as quickly as distressed-market PC costs.

There is no escaping the additional engineering time and attention VDI requires. The complex stack (either Citrix or VMware) still requires more engineering than a traditional solution. And with this complexity, there will still be bugs between the various client, VDI and server layers that impact user experience. Recent implementations still show far too many defects between the layers. At Allstate, we have had more than our share of defects in our recent rollout between the virtualization layer, Windows, and third party products. And this is for what should be, by now, a mature technology.

Faced with greater costs, greater demands on scarce engineering resources and employee demand for the latest mobile client devices, organizations will begin to throw in the towel on VDI. Some companies now deploying will reduce the scope of current VDI deployments. Some now looking at VDI will jump instead to mobile-only alternatives more focused on tablets and smartphones. And those with extensive deployments will allow significant erosion of their VDI footprint as internal teams opt for other solutions, employee demand moves to smartphones and tablets or lifecycle events occur. This is a long fall from the lofty goals of 90% deployment from a few years ago. IT shops do not want to be faced with supporting VDI for an employee who also has a tablet, laptop or desktop solution, because it essentially doubles the cost of the client technology environment. In an era of very tight IT budgets, excess VDI deployments will be shed.

One of the more interesting phenomena in the rapidly changing world of technology is when a technology wave gets overtaken well before it peaks. This has occurred many times before (think optical disk storage in the data center) but perhaps most recently with netbooks, where their primary advantages of cost and simplicity were overwhelmed by smartphones (from below) and ultrabooks (from above). Carving out a sustainable market niche on cost alone in the technology world is a very difficult task, especially when you consider that you are reversing long-term industry trends.

Over the past 50 years of computing history, the intelligence and capability has been drawn either to the center or to the very edge. In the 60s, mainframes were the ‘smart’ center and 3270 terminals were the ‘dumb’ edge device. In the 90s, client computing took hold and the ‘edge’ became much smarter with PCs but there was a bulging middle tier of the three tier client compute structure. This middle tier disappeared as hybrid data centers and cloud computing re-centralized computing. And the ‘smart’ edge moved out even farther with smartphones and tablets. While VDI has a ‘smart’ center, it assumes a ‘dumb’ edge, which goes against the grain of long term compute trends. Thus the VDI wave, a viable alternative for a time, will be dissipated in the next few years as the long term compute trends overtake it fully.

I am sure there will still be niche applications, like offshore centers (especially where VDI also enables better control of software licensing), and there will still be small segments of the user population that will swear by the flexibility to access their device from anywhere they can log in without carrying anything, but these are long-term niches. Long term, VDI solutions will have a smaller and smaller portion of the device share, perhaps 10%, maybe even 20%, but not more.

What is your company’s experience with VDI? Where do you see its future?

Best, Jim Ditmore

 This post was first published in InformationWeek on September 13, 2013 and has been slightly revised and updated.

Getting to Private Cloud: Key Steps to Build Your Cloud

Now that I am back from summer break, I want to continue to further the discussion on cloud and map out how medium and large enterprises can build their own private cloud. As we’ve discussed previously, software-as-a-service, engineered stacks and private cloud will be the biggest IT winners in the next five to ten years. Private clouds hold the most potential — in fact, early adopters such as JP Morgan Chase and Fidelity are seeing larger savings and greater benefits than initially anticipated.

While savings is a key reason to move to a private cloud, shorter development cycles and faster time to market are more significant. Organizations can test risky ideas more easily as small, low-cost projects, quickly dispensing with those projects that fail and accelerating those that show more promise.

While savings is a key driver to moving to private cloud, faster development cycles and better time to market are turning out to be both more significant and more valuable to early adopter firms than initially estimated. And it is not just a speed improvement but a qualitative improvement where smaller projects can be trialled or riskier pilots can be executed with far greater speed and nominal costs. This allows a ‘fast fail’ approach on corporate innovation that greatly speeds the selection process, avoids extensive wasted investment in lengthier traditional pilots (that would have failed anyway) and greatly improves time to market on those ideas that are successful.

As for the larger savings, early implementations at scale are seeing savings well in excess of 50%. This is well beyond my estimate of 30% and is occurring in large part because of the vastly reduced labor requirements to build and administer a private cloud versus traditional infrastructure.

So with greater potential benefits, how should an IT department go about building a private cloud? The fundamental building blocks required for private cloud are a base of virtualized servers utilizing commodity servers and leveraging open systems. And of course you need the server engineering and administration expertise to support the platform. There’s also a strong early trend toward leveraging open source software for private clouds, from the Linux operating system to OpenNebula and Eucalyptus for infrastructure management. But just having a virtualized server platform does not result in private cloud. There are several additional elements required.

First, establish a set of standardized images that constitute most of the stack. Preferably, that stack will go from the hardware layer to the operating system to the application server layer, and it will include systems management, security, middleware and database. Ideally, go with a dozen or fewer server images and certainly no more than 20. Consider everything else to be custom and treated separately and differently from the cloud.

Once you have established your target set of private cloud images you should build a catalogue and ordering process that is easy, rapid, and transparent. The costs should be clear, and the server units should be processor-months or processor-weeks. You will need to couple the catalogue with highly automated provisioning and de-provisioning. Your objective should be to deliver servers quickly, certainly within hours, preferably within minutes (once the costs are authorized by the customer). And de-provisioning should be just as rapid and regular. In fact, you should offer automated ‘sunset’ servers in test and development environments (e.g., 90 days after the servers are allocated, they are automatically returned to the pool). I strongly recommend well-published and clear cost and allocation reporting to drive the right behaviors among your users. It will encourage quicker adoption, better and more efficient usage and rapid turn-in when servers are no longer needed. With these 4 prerequisites in place (standard images, a catalogue and easy ordering process, clear costs and allocations, and automated provisioning and de-provisioning) you are ready to start your private cloud.
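As a rough illustration of how a catalogue, processor-week pricing and automated ‘sunset’ might fit together, consider the sketch below. Everything in it is an assumption made for the example: the image names, the prices, the `Server` record and the `servers_to_reclaim` helper are hypothetical and do not come from any particular cloud platform.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical catalogue: standard image -> price per processor-week (made-up numbers).
CATALOGUE = {
    "linux-web": 25.0,
    "linux-appserver": 40.0,
    "linux-database": 60.0,
}

SUNSET_DAYS = 90  # dev/test servers return to the pool this long after allocation


@dataclass
class Server:
    name: str
    image: str
    processors: int
    environment: str  # "dev", "test" or "prod"
    allocated_on: date

    def weekly_cost(self) -> float:
        """Charge shown to the customer, in processor-weeks."""
        return CATALOGUE[self.image] * self.processors

    def sunset_date(self) -> Optional[date]:
        """Only dev and test servers expire automatically; production does not."""
        if self.environment in ("dev", "test"):
            return self.allocated_on + timedelta(days=SUNSET_DAYS)
        return None


def servers_to_reclaim(servers: List[Server], today: date) -> List[Server]:
    """Servers whose sunset date has arrived and can be de-provisioned."""
    return [s for s in servers
            if s.sunset_date() is not None and s.sunset_date() <= today]


fleet = [
    Server("dev-01", "linux-web", 2, "dev", date(2013, 6, 1)),
    Server("test-07", "linux-database", 4, "test", date(2013, 9, 1)),
    Server("prod-03", "linux-appserver", 8, "prod", date(2013, 1, 15)),
]
for server in servers_to_reclaim(fleet, today=date(2013, 10, 1)):
    print(server.name, server.sunset_date(), server.weekly_cost())
```

A scheduled job running a check like this, feeding the same figures into the cost and allocation reports, is one simple way to make de-provisioning as routine as provisioning.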

Look to build your private cloud in parallel to your traditional data center platforms. There should be both a development and test private cloud as well as a production private cloud. Seed the cloud with an initial investment of servers of each standard type. Then transition demand into the private cloud as new projects initiate and proceed to grow it project by project.

You could begin by routing small and medium size projects to the private cloud environment and, as it builds up scale and provisioning kinks are ironed out, migrate more and more server requests until nearly all requests are routed through your private cloud path. As you begin to achieve scale and you prove out your ordering and provisioning (and de-provisioning) processes, you can begin to tighten the criteria for projects to proceed with traditional custom servers. Within 6 months, custom, traditional servers should be the rare exception and should be charged fully for the excess costs they will generate.

Once the private cloud is established you can verify the cost savings and advantages. And there will be additional advantages such as improved time to market because of improvements in the speed of your development efforts given server deployment is no longer a long pole in the tent. Well-armed with this data, you can now circle back and tackle existing environments and legacy custom servers. While often the business case for a platform transition is not a good investment, a transition to private cloud during another event (e.g., major application release, server end-of-life migration) should easily become a winning investment. A few early adopters (such as JPMC or Fidelity) are seeing outsized benefits and strong developer push into these private cloud environments. So, if you build it well, you should be able to reap the same advantages.

How is your cloud journey proceeding? Are there other key steps necessary to be successful? I look forward to hearing your perspective.

Best, Jim Ditmore

 

Looking to Improve IT Production? How to Start

Production issues, as Microsoft and Google can tell you, impact even cloud email apps. A few weeks ago, Microsoft took an entire weekend to fully recover its cloud Outlook service. Perhaps you noted the issues earlier this year in financial services, where Bank of America experienced internet site availability issues. Unfortunately for Bank of America, that was their second outage in 6 months, though they are not alone in having problems, as Chase suffered a similar production outage on their internet services the week following. And these are regular production issues, not the unavailability of websites and services due to a series of DDoS attacks.

Perhaps 10 or certainly 15 years ago, such outages with production systems would have resulted in far less notice by their customers, as the front office personnel would have worked alternate systems and manual procedures until the systems were restored. But with customers accessing the heart of most companies’ systems now through internet and mobile applications, typically on a 7×24 basis, it is very difficult to avoid direct and widespread impact to customers in the event of a system failure. Your production performance becomes very evident to your customers. And your customers’ expectations have continued to increase such that they expect your company and your services to be available pretty much whenever they want to use them. And while being available is not the only attribute that customers value (usability, features, service and pricing factor in importantly as well), companies that consistently meet or exceed consumer availability expectations gain a key edge in the market.

So how do you deliver to current and future rising expectations around availability of your online and mobile services? And if both BofA and Chase, which are large organizations that offer dozens of services online and have massive IT departments, have issues delivering consistently high availability, how can smaller organizations deliver compelling reliability?

And often, the demand for high availability must be met in an environment where ongoing efficiencies have eroded the production base and a tight IT labor market has further complicated obtaining adequate expertise. If your organization is struggling with availability or you are looking to achieve top quartile performance and competitive service advantage, here’s where to start:

First, understand that availability, at its root, is a quality issue. And quality issues can only be changed if you address all aspects. You must set quality and availability as a priority, as a critical and primary goal for the organization. And you will need to ensure that incentives and rewards are aligned to your team’s availability goal.

Second, you will need to address the IT change processes. You should look to implement an ITSM change process based on ITIL. But don’t wait for a fully defined process to be implemented. You can start by limiting changes to appropriate windows. Establish release dates for major systems and accompanying subsystems. Avoid changes during key business hours or just before the start of the day. I still remember the ‘night programmers’ at Ameritrade at the beginning of our transformation there. Staying late one night as CIO in my first month, I noticed two guys come in at 10:30 PM. When I asked what they did, they said ‘We are the night programmers. When something breaks with the nightly batch run, we go in and fix it.’ And this was done with no change records, minimal testing and minimal documentation. Of course, my hair stood on end hearing this. We quickly discontinued that practice and instead made changes as a team, after they were fully engineered and tested. I would note that combining this action with a number of other measures mentioned here enabled us to quickly reach a stable platform that had the best track record for availability among all online brokerages.

Importantly, you should ensure that adequate change review and documentation are being done by your teams for their changes. Ensure they take accountability for their work and their quality. Drive to an improved change process with templates for reviews, proper documentation, back out plans, and validation. Most failed changes are due to issues with the basics: a lack of adequate review and planning, poor documentation of deployment steps, missing or ineffective validation, or one person doing an implementation in the middle of the night when you should have at least two people doing it together (one to do, and one to check).

Also, you should measure the proportion of incidents due to change. If you experience mediocre or poor availability and failed changes contribute to more than 30% of the incidents, you should recognize change quality is a major contributor to your issues. You will need to zero in on the areas with chronic change issues. Measure the change success rate (percentage of changes executed successfully without production incident) of your teams. Publish the results by team (this will help drive more rapid improvement). Often, you can quickly find which of your teams has inadequate quality because their change success rate ranges from a very poor mid-80s percentage to a mediocre mid-90s percentage. Good shops deliver above 98% and a first quartile shop consistently has a change success rate of 99% or better.
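The two measures described above are simple to compute once change and incident records are available. The sketch below assumes a hypothetical record format, `(team, caused_incident)` pairs, along with made-up team names and counts; the thresholds are the ones quoted in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Thresholds quoted in the text above.
GOOD_RATE = 0.98              # good shops deliver above this
FIRST_QUARTILE_RATE = 0.99    # first quartile shops are at or above this
CHANGE_INCIDENT_ALARM = 0.30  # share of incidents caused by failed changes


def change_success_rates(changes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """changes: one (team, caused_incident) pair per executed change."""
    total: Dict[str, int] = defaultdict(int)
    clean: Dict[str, int] = defaultdict(int)
    for team, caused_incident in changes:
        total[team] += 1
        if not caused_incident:
            clean[team] += 1
    return {team: clean[team] / total[team] for team in total}


def grade(rate: float) -> str:
    if rate >= FIRST_QUARTILE_RATE:
        return "first quartile"
    if rate > GOOD_RATE:
        return "good"
    return "needs attention"


def change_share_of_incidents(incidents_from_change: int, total_incidents: int) -> float:
    """Proportion of all incidents that were caused by a change."""
    return incidents_from_change / total_incidents if total_incidents else 0.0


# Made-up sample data for two hypothetical teams.
changes = ([("payments", False)] * 198 + [("payments", True)] * 2
           + [("web", False)] * 85 + [("web", True)] * 15)
for team, rate in sorted(change_success_rates(changes).items()):
    print(f"{team}: {rate:.1%} ({grade(rate)})")

if change_share_of_incidents(12, 30) > CHANGE_INCIDENT_ALARM:
    print("Change quality is a major contributor to incidents")
```

Publishing a table like this by team, month over month, is what makes the chronic problem areas visible.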

Third, ensure all customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). And the ECC also has capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires an investment into an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues requires consolidating these capabilities and extending their reach.  If you have an ECC in place, ensure that all customer impacting issues are fully reported and handled. Underreporting of issues that impact a segment of your customer base, or the siphoning off of a problem to be handled by a local team, is akin to trying to handle a house fire with a garden hose and not calling the fire department. Call the fire department first, and then get the garden hose out while the fire trucks are on their way.

Fourth, you must execute strong root cause analysis and follow-up. These efforts must be at the individual issue or incident level as well as at a summary or higher level. It is important not just to fix the individual incident and get to its root cause but also to look for the overall trends and patterns of your issues. Are they clustered around one application or infrastructure component? Are they caused primarily by change? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Are your designs too complex? Are you using the products in a mainstream or unique manner – especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g., poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, incompetent staff). Track both the fixes for individual issues and the efforts to address systemic issues. The systemic efforts will begin to yield improvements that eliminate future issues.

These four efforts will set you on a solid course to improved availability. If you couple these efforts with diligent engagement by senior management and disciplined execution, the improvements will come slowly at first, but then will yield substantial gains that can be sustained.

You can achieve further momentum with work in several areas:

  • Document configurations for all key systems.  If you are doing discovery during incidents it is a clear indicator that your documentation and knowledge base is highly inadequate.
  • Review how incidents are reported. Are they user reported or did your monitoring identify the issue first? At least 70% of the issues should be identified first by you, and eventually you will want to drive this to a 90% level. If you are lower, then you need to look to invest in improving your monitoring and diagnostic capabilities.
  • Do you report availability in technical measures or business measures? If you report via time-based systems availability measures or number of incidents by severity, these are technical measures. You should look to implement business-oriented measures, such as customer impact availability, to drive greater transparency and more accurate metrics.
  • In addition to eliminating issues, reduce your customer impacts by reducing the time to restore service (Microsoft can certainly stand to consider this area given their latest outage was three days!). For mean time to restore (MTTR – note this is not mean time to repair but mean time to restore service), there are three components: time to detect (MTTD), time to diagnose or correlate (MTTC), and time to fix (to restore service, or MTTF). An IT shop that is effective at resolution normally will see MTTR at 2 hours or less for its priority issues, where the three components each take about 1/3 of the time. If your MTTD is high, again look to invest in better monitoring. If your MTTC is high, look to improve correlation tools, systems documentation or engineering knowledge. And if your MTTF is high, again look to improve documentation or engineering knowledge or automate recovery procedures.
  • Consider investing in greater resiliency for key systems. It may be that customer expectations of availability exceed current architecture capabilities. Thus, you may want to invest in greater resiliency and redundancy or build a more highly available platform.
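
The MTTR breakdown and the monitoring target in the list above can be sketched in a few lines of Python. This is only an illustration: the Incident fields, the function names and the sample numbers are assumptions of mine, while the 2 hour MTTR mark, the 70% detection target and the advice per component come from the bullets.

```python
from dataclasses import dataclass

# Advice per component, paraphrased from the bullet on time to restore.
ADVICE = {
    "mttd": "invest in better monitoring",
    "mttc": "improve correlation tools, systems documentation or engineering knowledge",
    "mttf": "improve documentation or engineering knowledge, or automate recovery",
}


@dataclass
class Incident:
    detect_min: float          # minutes until the issue was detected
    diagnose_min: float        # minutes spent diagnosing or correlating
    fix_min: float             # minutes spent restoring service
    found_by_monitoring: bool  # False if a user reported it first


def summarize(incidents):
    """Average each component; MTTR is the sum of the three averages."""
    n = len(incidents)
    summary = {
        "mttd": sum(i.detect_min for i in incidents) / n,
        "mttc": sum(i.diagnose_min for i in incidents) / n,
        "mttf": sum(i.fix_min for i in incidents) / n,
        "monitoring_share": sum(i.found_by_monitoring for i in incidents) / n,
    }
    summary["mttr"] = summary["mttd"] + summary["mttc"] + summary["mttf"]
    return summary


def recommendations(summary, mttr_target_min=120.0, detect_target=0.70):
    """Point at the slowest component when MTTR exceeds the target."""
    advice = []
    if summary["mttr"] > mttr_target_min:
        slowest = max(("mttd", "mttc", "mttf"), key=lambda k: summary[k])
        advice.append(ADVICE[slowest])
    if summary["monitoring_share"] < detect_target and ADVICE["mttd"] not in advice:
        advice.append(ADVICE["mttd"])
    return advice


# Hypothetical priority incidents from one month.
sample = [Incident(60, 30, 30, False), Incident(100, 40, 40, True)]
summary = summarize(sample)
print(summary)
print(recommendations(summary))
```

In this made-up sample, detection dominates the restore time and only half of the incidents were caught by monitoring first, so both checks point to the same investment.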

As you can see, providing robust availability for your customers is a complex endeavor. By implementing these steps, you can enable sustainable and substantial progress to top quartile performance and achieve business advantage in today’s 7×24 world.

What would you add to these steps? What were the key factors in your shop’s journey to high availability?

Best, Jim Ditmore

Turning the Corner on Data Centers

Recently I covered the ‘green shift’ of servers, where each new server generation not only drives major improvements in compute power but also requires about the same or even less environmentals (power, cooling, space) as the previous generation. Thus, compute efficiency, or compute performance per watt, is improving exponentially. And this trend in servers, which started in 2005 or so, is also being repeated in storage. We have seen a similar improvement in power per terabyte for the past 3 generations (since 2007). The current storage product pipeline suggests this efficiency trend will continue for the next several years. Below is a chart showing representative improvements in storage efficiency (power per terabyte) across storage product generations from a leading vendor.

Power (VA) per Terabyte

With current technology advances, a terabyte of storage on today’s devices requires approximately 1/5 of the amount of power as a device from 5 years ago. And these power requirements could drop even more precipitously with the advent of flash technology. By some estimates, there is a drop of 70% or more in power and space requirements with the switch to flash products. In addition to being far more power efficient, flash will offer huge performance advantages for applications with corresponding time reductions in completing workload. So expect flash storage to quickly convert the market once mainstream product introductions occur. IBM sees this as just around the corner, while other vendors see the flash conversion as 3 or more years out. In either scenario, there are continued major improvements in storage efficiency in the pipeline that deliver far lower power demands even with increasing storage requirements.

Ultimately, with the combined efficiency improvements of both storage and server environments over the next 3 to 5 years, most firms will see a net reduction in data center requirements. The typical corporate data center power requirements are approximately one half server, one third storage, and the rest being network and other devices. With the two biggest components experiencing ongoing dramatic power efficiency trends, the net power and space demand should decline in the coming years for all but the fastest growing firms. Add in the effects of virtualization, engineered stacks and SaaS and the data centers in place today should suffice for most firms if they maintain a healthy replacement pace of older technology and embrace virtualization.
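
To make the net effect concrete, here is a back-of-the-envelope Python sketch. The one half, one third and remainder split comes from the paragraph above, and the storage gain is derived from the 1/5 power per terabyte over 5 years noted earlier. The 30% annual server efficiency gain, the flat "other" load and the premise that the whole estate is refreshed each year are assumptions for illustration only.

```python
def projected_power(years, demand_growth,
                    server_eff_gain=0.30,                # assumed, not from the text
                    storage_eff_gain=5 ** (1 / 5) - 1):  # 1/5 the power per TB in 5 years
    """Relative data center power after `years`, with today's total equal to 1.0.

    Each year demand grows by `demand_growth`, while capacity per watt
    improves by the efficiency gains. Network and other devices are held
    flat, which is a simplifying assumption.
    """
    server, storage, other = 1 / 2, 1 / 3, 1 / 6
    for _ in range(years):
        server *= (1 + demand_growth) / (1 + server_eff_gain)
        storage *= (1 + demand_growth) / (1 + storage_eff_gain)
    return server + storage + other


# Compare a moderate-growth firm with a very fast-growing one.
for growth in (0.20, 0.50):
    print(f"growth {growth:.0%}: relative power after 5 years = "
          f"{projected_power(5, growth):.2f}")
```

Under these assumptions a firm growing capacity at 20% a year ends up below today's power draw, while one growing at 50% a year does not, which matches the "all but the fastest growing firms" caveat.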

Despite such improvements in efficiency, we still could see a major addition in total data center space because cloud and consumer firms like Facebook are investing major sums in new data centers. The resulting consumer data center boom also shows the effects of growing consumerization in the technology marketplace. Consumerization, which started with PCs and PC software, and then moved to smart phones, has impacted the underlying technologies dramatically. The most advanced compute chips are now those developed for smart phones and video games. Storage technology demand and advances are driven heavily by smart phones and products like the MacBook Air, which already use only flash storage. The biggest and best data centers? They are no longer the domain of corporate demand; instead, consumer demand (e.g., Gmail, Facebook) drives bigger and more advanced centers. The proportion of data center space dedicated to direct consumer compute needs (a la Gmail or Facebook) versus enterprise compute needs (even for companies that provide direct consumer services) will see a major shift from enterprise to consumer over the next decade. This will follow the shifts in chips and storage that at one time were driven by the enterprise space (and previously, the government) and are now driven by the consumer segment. And it is highly likely that there will be a surplus of enterprise class data centers (50K – 200K raised floor space) in the next 5 years. These centers are too small and inefficient for a consumer data center (500K – 2M or larger), and with declining demand and consolidation effects, plenty of enterprise data center space will be on the market.

As an IT leader, you should ensure your firm is riding the effects of the compute and storage efficiency trends. Further multiply these demand reduction effects by leveraging virtualization, engineered stacks and SaaS (where appropriate). If you have a healthy buffer of data center space now, you could avoid major investments and costs in data centers in the next 5 to 10 years by taking these measures. Those monies can instead be spent on functional investments that drive more direct business value or drop to the bottom line of your firm. If you have excess data centers, I recommend consolidating quickly and disposing of the space as soon as possible. These assets will be worth far less in the coming years with the likely oversupply. Perhaps you can partner with a cloud firm looking for data center space if your asset is strategic enough for them. Conversely, if you have minimal buffer and see continued higher business growth, it may be possible to acquire good data center assets for far less unit cost than in the past.

For 40 years, technology has ridden Moore’s Law to yield ever-more-powerful processors at lower cost. Its compounding effects have been astounding — and we are now seeing nearly 10 years of similar compounding on the power efficiency side of the equation (below is a chart for processor compute power advances and compute power efficiency advances).

Trend Change for Power Efficiency

The chart above shows how the compute efficiency (performance per watt — green line) has shifted dramatically from its historical trend (blue lines). And it’s improving about as fast as compute performance is improving (red lines), perhaps even faster.

These server and storage advances have resulted in fundamental changes in data centers and their demand trends for corporations. Top IT leaders will take advantage of these trends and be able to direct more IT investment into business functionality and less into the supporting base utility costs of the data center, while still growing compute and storage capacities to meet business needs.

What trends are you seeing in your data center environment? Can you turn the corner on data center demand? Are you able to meet your current and future business needs and growth within your current data center footprint and avoid adding data center capacity?

Best, Jim Ditmore

Using Organizational Best Practices to Handle Cloud and New Technologies

I have extended and updated this post which was first published in InformationWeek in March, 2013. I think it is a very salient and pragmatic organizational method for IT success. I look forward to your feedback! Best, Jim

IT organizations are challenged to keep up with the latest wave of cloud, mobile and big data technologies, which are outside the traditional areas of staff expertise. Some industry pundits recommend bringing on more technology “generalists,” since cloud services in particular can call on multiple areas of expertise (storage, server, networking). Or they recommend employing IT “service managers” to bundle up infrastructure components and provide service offerings.

But such organizational changes can reduce your team’s expertise and accountability and make it more difficult to deliver services. So how do you grow your organization’s expertise to handle new technologies? At the same time, how do you organize to deliver business demands for more product innovation and faster delivery yet still ensure efficiency, high quality and security?

Rather than acquire generalists and add another layer of cost and decision making to your infrastructure team, consider the following:

Cloud computing. Assign architects or lead engineers to focus on software-as-a-service and infrastructure-as-a-service, ensuring that you have robust estimating and costing models and solid implementation and operational templates. Establish a cloud roadmap that leverages SaaS and IaaS, ensuring that you don’t overreach and end up balkanizing your data center.

For appliances and private cloud, given their multiple component technologies, let your best component engineers learn adjacent fields. Build multi-disciplinary teams to design and implement these offerings. Above all, though, don’t water down the engineering capacity of your team by selecting generalists who lack depth in a component field. For decades, IT has built complex systems with multiple components by leveraging multi-faceted teams of experts, and cloud is no different.

Where to use ‘service managers’. A frequent flaw in organizations is to employ ‘service managers’ who group multiple infrastructure components (e.g., storage, servers, data centers) into a ‘product’ (e.g., ‘hosting service’) and provide direction and interface for this product. This is an entirely artificial layer that removes accountability from the component teams and often makes poor ‘product’ decisions because of limited knowledge and depth. In the end IT does not deliver ‘hosting services’; IT delivers systems that meet business functions (e.g., for banking, teller or branch functions, ATMs; or for insurance, claims reporting or policy quote or issue). These business functions are the true IT services and are where you should apply a service manager role. Here, a service manager can ensure end-to-end integration and quality, drive better overall transaction performance and reliability, and provide deep expertise on system connections, SLAs and business needs back across the application and infrastructure component teams. And because the role is directly attached to the business functions to be done, it will yield high value. These service managers will be invaluable for new development and enhancement work as well as for assisting during production issues.

Mobile. If mobile isn’t already the most critical interface for your company, it will be in three to five years. So don’t treat mobile as an afterthought, to be adapted from traditional interfaces. And don’t outsource this capability, as mobile will be pervasive in everything you build.

Build a mobile competency center that includes development, user experience and standards expertise. Then fan out that expertise to all of your development teams, while maintaining the core mobile group to assist with the most difficult efforts. And of course, continue with a central architecture and control of the overall user experience. A consistent mobile look, feel and flow is essentially your company’s brand, invaluable in interacting with customers.

Big data. There are two key aspects of this technology wave: the data (and traditional analytic uses) and real-time data “decisioning,” similar to IBM’s Watson. You can handle the data analytics as an extension of your traditional data warehousing (though on steroids). However, real-time decisioning has the potential to dramatically alter how your organization specifies and encodes business rules.

Consider the possibility that 30% to 50% of all business logic traditionally encoded in third- or fourth-generation programming languages instead becomes decisioned in real time. This capability will require new development and business analyst skills. For now, cultivate a central team with these skills. As you pilot and determine how to more broadly leverage real-time data decisioning, decide how to seed your broader development teams with these capabilities. In the longer run, I believe it will be critical to have these skills as an inherent portion of each development team.

Competing Demands. Overall, IT organizations must meet several competing demands: Work with business partners to deliver competitive advantage; do so quickly in order to respond to (and anticipate) market demands; and provide efficient, consistent quality while protecting the company’s intellectual property, data and customers. In essence, there are business and market drivers that value speed, business knowledge and closeness at a reasonable cost and risk drivers that value efficiency, quality, security and consistency.

Therefore, we must design an IT organization and systems approach that meets both sets of drivers and accommodates business organizational change. As opposed to organizing around one set of drivers or the other, the best solution is to organize IT as a hybrid organization to deliver both sets of capabilities.

Typically, the functions that should be consolidated and organized centrally to deliver scale, efficiency and quality are infrastructure (especially networks, data centers, servers and storage), IT operations, information security, service desks and anything else that should be run as a utility for the company. The functions to be aligned and organized along business lines to promote agility and innovation are application development (including Web and mature mobile development), data marts and business intelligence.

Some functions, such as database, middleware, testing and project management, can be organized in either mode. But if they aren’t centralized, they’ll require a council to ensure consistent processes, tools, measures and templates.

For services becoming a commodity, or where there’s a critical advantage to having one solution (e.g., one view of the customer for the entire company), it’s best to have a single team or utility that’s responsible (along with a corresponding single senior business sponsor). Where you’re looking to improve speed to market or market knowledge, organize into smaller IT teams closer to the business. The diagram below gives a graphical view of the hybrid organization.

The IT Hybrid Model diagram
With this approach, your IT shop will be able to deliver the best of both worlds. And you can then weave in the new skills and teams required to deliver the latest technologies such as cloud and mobile. You can read more about this hybrid model in our best practice reference page.

Which IT organizational approaches or variations have you seen work best? How are you accommodating new technologies and skills within your teams? Please weigh in with a comment below.

Best, Jim Ditmore

Massive Mobile Shifts and Keeping Score

As the first quarter of 2013 has come to a close, we see a technology industry moving at an accelerated pace. Consumerization is driving a faster level of change, with consequent impacts on the technology ecosystem and the companies occupying different perches. From rapidly growing BYOD demand to the projected demise of the PC, we are seeing consumers shift their computing choices much faster than corporations, and some suppliers struggling to keep up. These rapid shifts require corporate IT groups to follow more quickly in their services. From implementing MDM (mobile device management), to increasing the bandwidth of wireless networks, to adopting tablets and smartphones as the primary customer interfaces for future development, IT teams must adjust to ensure effective services and a competitive parity or advantage.

Let’s start with mobile. Consumers today use their smartphones to conduct much of their everyday business. And they use the devices if not for the entire transaction, then often to research or initiate the transaction. The lifeblood of most retail commerce has heavily shifted to the mobile channel. Thus, companies must have significant and effective mobile presence to achieve competitive advantage (or even survive). Mobile has become the first order of delivery for company services. Next in importance is the internet and then internal systems for call centers and staff. And since the vast majority of mobile devices (smartphone or tablet) are not Windows-based (nor is the internet), application development shops need to build or augment current Windows-oriented skills to enable native mobile development. Back end systems must be re-engineered to more easily support mobile apps.  And given your company’s competitive edge may be determined by its mobile apps, you need to be cautious about fully outsourcing this critical work.

Internally, such devices are becoming a pervasive feature in the corporate landscape. It is important to be able to accommodate many of the choices of your company’s staff and yet still secure and manage the client device environment. Thus, implementations of MDM to manage these devices and enable corporate security on the portion of the device that contains company data are increasing at a rapid pace. Further, while relatively few companies currently have a corporate app store, this will become a prevalent feature within a few years and companies will shift from a ‘push’ model of software deployment to a ‘pull’ model. Further consequences of the rapid adoption of mobile devices by staff include such items as needing to implement wireless at your company sites, adding visitor wireless capabilities (like a Starbucks wifi), or just increasing the capacity to handle the additional load (a 50% increase in internal wifi demand in January is not unheard of as everyone returns to the office with their Christmas gifts).

A further consequence of the massive shift to smartphones and tablets is the diminishing reach and impact of Microsoft, based on Gartner’s latest analysis and projections. The shift away from PCs and towards tablets in the consumer markets reduces the largest revenue sources of Microsoft. It is stunning to realize that Microsoft, with its long consumer market history, could become ever more dependent on the enterprise versus consumer market. Yet, because the consumer’s choices are rapidly making inroads into the corporate device market, even this will be a safe harbor for only a limited time. With Windows 8, Microsoft tried to address both markets with one OS platform, perhaps not succeeding well in either. A potential outcome for Microsoft is to introduce the reported ‘Blue’ OS version, which will be a complete touch interface (versus a hybrid touch and traditional). Yet, Microsoft has struggled to gain traction against Android and iOS tablets and smartphones, so it is hard to see how this will yield significant share improvement. And with new Chrome devices and a reputed cheap iPhone coming, perhaps even Gartner’s projections for Microsoft are optimistic. The last overwhelming consumer OS competitive success Microsoft had was against OS/2 and IBM; Apple iOS and Google Android are far different competitors! With the consumer space exceedingly difficult to make much headway in, my top prediction for 2013 is that Microsoft will subsequently introduce a new Windows ‘classic’ to satisfy the millions of corporate desktops where touch interfaces are inadequate or applications have not been redesigned. Otherwise, enterprises may stand pat on the current versions for an extended period, depriving Microsoft of critical revenue streams. Subsequent to the 1st version of this post, there were reports of Microsoft introducing Windows 8 stripped of the ‘Metro’ or touch interface!
Corporate IT shops need to monitor these outcomes because once a shift occurs, there could be a rapid transition not just in the OS, but in the productivity suites and email as well.

There is also upheaval in the PC supplier base as a result of the worldwide sales decline of 13.9% (year over year in Q1). As also predicted here in January, HP struggled the most among the top 3 of HP, Lenovo and Dell. HP was down almost 24%, barely retaining the title of top volume manufacturer. Lenovo was flat, delivering the best performance in a declining market. Lenovo delivered 11.7 million units in the quarter, just below HP’s 12 million units. Dell suffered a 10.9% drop, which, given the company is up for sale, is remarkable. Acer and other smaller firms saw major drops in sales as well (more than 31% for Acer). The ongoing decline of the market will see massive impact on the smaller market participants, with consolidation and fallout likely occurring late this year and early in 2014. The real question is whether HP can turn around their rapid decline. It will be a difficult task because the smartphone, tablet and Chromebook onslaught is occurring when HP is facing a rejuvenated Lenovo and a very aggressive Dell. Ultrabooks will provide some margin and volume improvement, but not enough to make up for the declines. Current course suggests that early 2014 will see a declining market where Lenovo is comfortably leading, followed by a lagging HP fighting tooth and nail with Dell for 2nd place. HP must pull off a major product refresh, supply chain tightening, and aggressive sales to turn it around. It will be a tall order.

Perhaps the next consumerization influence will be the greater use of desktop video. Many of our employees have experienced the pretty good video of Skype or Facetime and potentially will be expecting similar experiences in the corporate conversations. Current internal networks often do not have the bandwidth for such casual and common video interactions, especially for smaller campuses or remote offices. It will be important for IT shops to manage the introduction of the capabilities so that more critical workload is not impacted.

How is your company’s progress on mobile? Do you have an app store? Have you implemented desktop video? I look forward to hearing from you.

Best, Jim Ditmore

Cloud Trends: Turning the Tide on Data Centers

A recent study by Intel shows that the compute load that required 184 single-core processors in 2005 can now be handled by just 21 processors, so that roughly every nine servers are replaced by one.
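
The arithmetic behind the nine-to-one figure is easy to check. In the small Python sketch below, the two processor counts come from the study as cited; the 8 year span used for the annualized rate is my assumption, since only the 2005 starting point is stated.

```python
processors_2005 = 184
processors_now = 21

# Consolidation ratio implied by the two counts.
ratio = processors_2005 / processors_now
print(f"consolidation ratio: {ratio:.2f} to 1")

# Assumption: "now" is roughly 8 years after 2005. Adjust to taste.
assumed_years = 8
annual_gain = ratio ** (1 / assumed_years) - 1
print(f"implied compounding gain per processor: {annual_gain:.0%} per year")
```

The annualized figure is only as good as the assumed span, but it gives a feel for how quickly the per-processor capability has been compounding.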

For 40 years, technology rode Moore’s Law to yield ever-more-powerful processors at lower cost. Its compounding effect was astounding: One of the best analogies is that we now have more processing power in a smart phone than the Apollo astronauts had when they landed on the moon. At the same time though, the electrical power requirements for those processors continued to increase at a similar rate as the increase in transistor count. While new technologies (CMOS, for example) provided a one-time step-down in power requirements, each turn-up in processor frequency and density resulted in similar power increases.

As a result, by the 2000-2005 timeframe there were industry concerns regarding the amount of power and cooling required for each rack in the data center. And with the enormous increase in servers spurred by Internet commerce, most IT shops have labored for the past decade to supply adequate data center power and cooling.

Meantime, most IT shops have experienced compute and storage growth rates of 20% to 50% a year, requiring either additional data centers or major increases in power and cooling capacity at existing centers. Since 2008, there has been some alleviation due to both slower business growth and the benefits of virtualization, which has let companies reduce their number of servers by as much as 10 to 1 for 30% to 70% of their footprint. But IT shops can deploy virtualization only once, suggesting that they’ll be staring at a data center build or major upgrade in the next few years.
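
To see what a 10 to 1 reduction across 30% to 70% of the footprint does to the total server count, here is a small Python sketch. The function name and the 1,000 server starting estate are hypothetical; the ratio and the percentages are the ones quoted above.

```python
def servers_after_virtualization(total_servers, virtualized_pct, ratio=10):
    """Servers remaining after consolidating part of the estate.

    virtualized_pct is a whole-number percentage of the estate that gets
    virtualized; those servers collapse onto hosts at `ratio` to 1.
    """
    virtualized = total_servers * virtualized_pct // 100
    hosts = -(-virtualized // ratio)  # ceiling division: partial hosts round up
    untouched = total_servers - virtualized
    return untouched + hosts


# Hypothetical estate of 1,000 physical servers.
for pct in (30, 70):
    remaining = servers_after_virtualization(1000, pct)
    print(f"{pct}% virtualized at 10:1 -> {remaining} servers remain")
```

Because the saving is a one-time step, the remaining count starts growing again with business demand once the virtualization program is complete.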

But an interesting thing has happened to server power efficiency. Before 2006, such efficiency improvements were nominal, represented by the solid blue line below. Even if your data center kept the number of servers steady but just migrated to the latest model, it would need significant increases in power and cooling. You’d experience greater compute performance, of course, but your power and cooling would increase in a corresponding fashion. Since 2006, however, compute efficiency (green line) has improved dramatically, even outpacing the improvement in processor performance (red lines).

Trend Change for Power Efficiency

The chart above shows how the compute efficiency (performance per watt — green line) has shifted dramatically from its historical trend (blue lines). And it’s improving about as fast as compute performance is improving (red lines), perhaps even faster. The chart above is for the HP DL 380 server line over the past decade, but most servers are showing a similar shift.

This stunning shift is likely to continue for several reasons. Power and cooling costs continue to be a significant proportion of overall server operating costs. Most companies now assess power efficiency when evaluating which server to buy. Server manufacturers can differentiate themselves by improving power efficiency. Furthermore, there’s a proliferation of appliances or “engineered stacks” that eke significantly better performance from conventional technology within a given power footprint.

A key underlying reason for future increases in compute efficiency is the fact that chipset technologies are increasingly driven by the requirements for consumer mobile devices. One of the most important requirements of the consumer market is improved battery life, which also places a premium on energy-efficient processors. Chip (and power efficiency) advances and designs in the consumer market will flow back into the corporate (server) market. An excellent example is HP’s Moonshot program which leverages ARM chips (previously typically used in consumer devices only) for a purported 80%+ reduction in power consumption. Expect this power efficiency trend to continue for the next five and possibly the next 10 years.

So how does this propitious trend impact the typical IT shop? For one thing, it reduces the need to build another data center. If you have some buffer room now in your data center and you can move most of your server estate to a private cloud (virtualized, heavily standardized, automated), then you will deliver more compute power yet also see a leveling and then a reduction in the number of servers (blue line) and a similar trend in the power consumed (green line).

Traditional versus Optimized Server Trends

This analysis assumes 5% to 10% business growth (translating to a need for a 15% to 20% increase in server performance/capacity). You’ll have to employ best practices in capacity and performance management to get the most from your server and storage pools, but the long-term payoff is big. If you don’t leverage these technologies and approaches, your future is the red and purple lines on the chart: ever-rising compute and data center costs over the coming years.

By applying these approaches, you can do more than stem the compute cost tide; you can turn it. Have you started this journey? Have you been able to reduce the total number of servers in your environment? Are you able to meet your current and future business needs and growth within your current data center footprint?

What changes or additions to this approach would you make? I look forward to your thoughts and perspective.

Best, Jim Ditmore

Note this post was first published on January 23rd in InformationWeek. It has been updated since then. 

First Quarter Technology Trends to Note

For those looking for the early signs of spring, crocuses and flowering quince are excellent harbingers. For those looking for signs of technology trends and shifts, I thought it would be worthwhile to point out some new ones and provide further emphasis or confirmation of a few recent ones:

1. Enterprise server needs have flattened, and the cumulative effect of cloud, virtualization, SaaS, and appliances means the corporate server market has fully matured. The 1Q13 numbers bear out that this trend is continuing (as mentioned here last month). Some big vendors are even seeing revenue declines in this space. Unit declines are possible in the near future. The result will be consolidation in the enterprise server and software industry. VMware, BMC and CA have already seen their share prices fall as investors are concerned the growth years are behind them. Make sure your contracts consider potential acquisitions or change of control.

2. Can dual SIM smartphones be just around the corner? Actually, they are now officially here. Samsung just launched a Galaxy dual SIM in China, so perhaps it will not be long for other device makers to follow suit. Dual SIM enables excellent flexibility – just what the carriers do not want. Consider when you travel overseas, you will be able to insert a local SIM into your phone and handle all local or regional calls at low rates, and will still receive your ‘home’ number calls. Or for everyone who carries two devices, one for business and one for personal, now you still can keep your business and personal numbers separate, but only have one device.

3. Further evidence has appeared of the massive compromises enterprises are experiencing due to Advanced Persistent Threats (APTs). Most recently, Mandiant published a report that ties the Chinese government and the PLA to a broad set of compromises of US corporations and entities over many years. If you have not begun to move your enterprise security from a traditional perimeter model to a post-perimeter design, make the investment. You can likely bet you are already compromised, so you need to not only lock the doors but also thwart those who have breached your perimeter. A post here late last year covers many of the measures you need to take as an IT leader.

4. Big data and decision sciences could drive major change in both software development and business analytics. It may not reach the level of change that computers brought to, say, payroll departments and finance accountants in the 1980s, but it could be wide-ranging. Consider that perhaps 1/3 to 1/2 of all business logic now encoded in systems (by analysts and software developers) could instead be handled by data models and analytics that make business rules and decisions in real time. Perhaps all of the business analysts and software developers will then move to developing the models and proving them out, or we could see a fundamental shift in the skills demanded in the workplace. We still have accountants, of course; they just no longer do the large amount of administrative tasks. Now, perhaps applying this to legal work….

5. The explosion in mobile continues apace. Wireless data traffic is growing at 60 to 70% per year and is projected to continue at this pace. The use of the mobile phone as your primary commerce device is likely to become real for most purchases in the next 5 years. So businesses are under enormous pressure to adapt and innovate in this space. Apps that can gracefully handle poor data connections (not everywhere is 4G) and hurried consumers will be critical for businesses. Unfortunately, there are not enough of these.
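
To put the growth rate in item 5 in perspective, here is a rough sketch of how 60 to 70% annual growth compounds. The five-year horizon and the function name are my own illustrative assumptions, and the calculation simply assumes the rate holds steady every year.

```python
def traffic_multiple(annual_growth: float, years: int) -> float:
    """Return how many times larger traffic is after `years` of compound growth.

    Assumes a constant annual growth rate, which is a simplification.
    """
    return (1.0 + annual_growth) ** years


# Illustrative horizon of 5 years (an assumption, not a forecast)
for rate in (0.60, 0.70):
    print(f"{rate:.0%} per year for 5 years -> {traffic_multiple(rate, 5):.1f}x")
```

Under those assumptions, wireless data traffic would be roughly 10 to 14 times today's level in five years, which is why adapting now matters.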

Any additions you would make? Please send me a note.

Best, Jim Ditmore