The Pandemic and Information Security

With the lockdown due to COVID-19, nearly every organization has moved rapidly to remote access for employees and to a much higher degree of internet sales and service for customers. For a number of teams I have talked to, the transition was relatively smooth, and they have generally been able to maintain productivity for most of their staff. This is good news, as it gives employees more flexibility (and perhaps their commute time back) and enables organizations to continue to perform despite the lockdowns and quarantines. At the same time, though, the increase in digital work and interactions expands vulnerabilities and the potential for cybercrime.

These increased vulnerabilities are likely to linger, as organizations are moving cautiously back to previous ways of working and only slowly returning to offices. In fact, remote working will likely increase substantially from previous norms, especially as corporations now consider whether they really need all that expensive downtown office space. Maybe they can reduce their space by 20%, 30% or even 50% as employees come into the office far less frequently and, when there, use office hoteling schemes.

Customer buying patterns have likely changed for good as well. The demise of thousands or even tens of thousands of retail stores during the crisis will be difficult to recover from, if recovery happens at all. And new habits of buying things online, or buying less, are likely to stick. So your digital interfaces are more important than ever. And they are more lucrative than ever for hackers.

So what to do?

I checked in with some top information security leaders, including Tom Bartolomeo, Peter Makohon and a few others, and compiled key actions to take. These actions, along with your existing defenses, should help keep your enterprise safe during these changing times.

First and foremost is education — for customers and employees. When introducing or reinforcing how to use the remote office tools, remind your staff how to avoid common security issues. For your customers, send out regular reminders of how to keep themselves safe — especially when shopping or doing financial transactions online. Ensure that mid-transaction (e.g. on a payments screen), they are informed of the potential for fraud and the characteristics of typical fraudulent scenarios for that transaction, so that they consider whether they too have been tricked before completing it. On corporate email, make sure your system automatically and clearly delineates email that does not originate from inside the company — this can prevent many typical invoice or CEO fraud scenarios. Remind employees to be extra careful going to new sites, as there have been instances of ‘watering hole’ setups targeting common sites such as the CDC site. New fraud schemes abound online, and not just fake masks for sale. These reminders and educational tips, when done creatively and clearly, can ensure your staff and customers do not easily fall for common fraud schemes.
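To make the external-email delineation concrete, here is a minimal sketch of the kind of rule a mail gateway could apply; the internal domain, the tag text, and the filter hook are illustrative assumptions rather than any specific product’s configuration.

```python
# Minimal sketch: tag mail that did not originate inside the company.
# The domain and banner text are illustrative placeholders, not a real policy.
from email import message_from_string
from email.utils import parseaddr

INTERNAL_DOMAIN = "example.com"   # assumed internal domain
EXTERNAL_TAG = "[EXTERNAL]"

def tag_external(raw_message: str) -> str:
    """Prepend a warning tag to the subject of mail from outside the company."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    domain = sender.rsplit("@", 1)[-1].lower() if "@" in sender else ""
    if domain != INTERNAL_DOMAIN:
        subject = msg.get("Subject", "")
        if not subject.startswith(EXTERNAL_TAG):
            del msg["Subject"]
            msg["Subject"] = f"{EXTERNAL_TAG} {subject}"
    return msg.as_string()

# A message from a look-alike domain gets tagged before it reaches the inbox.
print(tag_external(
    "From: CEO <ceo@examp1e.com>\nSubject: Urgent wire transfer\n\nPlease pay this invoice today."
))
```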

Second, given the likely dramatic changes in traffic and end devices, actively review and monitor these to ensure you do not have endpoints that have been compromised or have weak protections. Wherever possible, improve the security measures on the endpoints: ensure employee home or remote work devices have antivirus, provide secure email, conferencing, and communications methods, and leverage VPN and encryption as much as possible.
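As one way to review endpoint protections at scale, here is a small sketch that scans an exported device inventory and flags machines missing baseline controls; the file name, field names, and thresholds are assumptions for illustration only.

```python
# Minimal sketch: flag remote endpoints with weak protections from an inventory export.
import csv

# Baseline controls every remote endpoint should have (illustrative field names).
REQUIRED = {"antivirus_enabled": "true", "disk_encrypted": "true", "vpn_configured": "true"}
MIN_OS_BUILD = 19041  # illustrative minimum patch level

def weak_endpoints(inventory_csv: str):
    """Return (hostname, list-of-gaps) for devices missing baseline protections."""
    flagged = []
    with open(inventory_csv, newline="") as f:
        for row in csv.DictReader(f):
            gaps = [k for k, v in REQUIRED.items() if row.get(k, "").lower() != v]
            if int(row.get("os_build") or 0) < MIN_OS_BUILD:
                gaps.append("os_build_below_baseline")
            if gaps:
                flagged.append((row.get("hostname", "unknown"), gaps))
    return flagged

for host, gaps in weak_endpoints("endpoint_inventory.csv"):
    print(f"{host}: {', '.join(gaps)}")
```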

Third, double check the security of your internet-facing systems, including customer conferencing systems and applications that have suddenly seen an increase in traffic because they are now the preferred way to do business. Direct your red teams or an outside firm to re-test your defenses and identify vulnerabilities. Look for anomalies in login sources. Review the dark web for fraud schemes targeting your company or industry. Generally, be on high alert for potential new ways of defrauding your company or breaching its defenses for data.
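For spotting anomalies in login sources, a simple first pass is to flag any login from a location not previously seen for that user. The sketch below assumes a chronological stream of (user, country) events and is only a starting point; real implementations would also weigh ASN, device, and time-of-day signals.

```python
# Minimal sketch: flag logins from countries not previously seen for a user.
from collections import defaultdict

def new_location_logins(login_events):
    """login_events: iterable of (user, country) tuples in chronological order."""
    seen = defaultdict(set)
    alerts = []
    for user, country in login_events:
        if seen[user] and country not in seen[user]:
            alerts.append((user, country))   # a location this user has never logged in from
        seen[user].add(country)
    return alerts

events = [("alice", "DK"), ("alice", "DK"), ("alice", "RU"), ("bob", "US")]
print(new_location_logins(events))   # [('alice', 'RU')]
```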

Given you are likely doing a much higher proportion of business digitally, your exposure and potential revenue loss due to a DDoS or similar attack are much higher. Review and test your current capabilities to detect and thwart such attacks. Ensure your VPN provider also has DDoS protection. Strengthen as necessary. There has been an upsurge of attacks on VPN services like Pulse, so your team must keep these services up to date with patches and minimize configuration drift.

Review the infrastructure and application changes that were made to enable the organization to operate during COVID, as many were made in haste. It is possible that some changes inadvertently opened a door or window that hackers could take advantage of. Inventory your new or updated attack surfaces and ensure you have adequate protections. If you have substantial gaps, consider leveraging advanced site hardening technology like Codesealer to add a layer of protection while you correct or update your underlying components.

Finally, talk to your key suppliers and ensure they are stepping up to the plate and have the means to do so. Many past breaches have occurred via a supplier’s network, so their security posture is important to the true security of your ecosystem. Ransomware attacks against smaller firms are also on the rise, but proactive measures that you help them implement can make the difference and keep them safe.

Unfortunately, with the economic distress that has accompanied the coronavirus shutdowns, cybercrime is a lucrative option for many. With your staff and customers on the internet more than ever, it is important to help them stay safe and to assure the security of your corporate and customer assets.

What have you experienced in the COVID-19 era? What would you recommend for better security and safety?

Look forward to your recommendations or questions!

Best, Jim Ditmore

Hi ho! Hi ho! It’s Off to Cloud we go!

With the ongoing stampede to public cloud platforms, it is worth a closer look at some of the factors leading to such rapid growth. Amazon, Microsoft, Google, IBM and a host of other public cloud providers saw continued strong growth in 2018 of 21%, to $175B, extending a long run of rapid revenue growth for the industry, according to Gartner in a recent Forbes article. Public cloud services, under Gartner’s definition, include a broad range of services from more traditional SaaS to infrastructure services (IaaS and PaaS) as well as business process services. IaaS, perhaps most closely associated with AWS, is forecast to grow 26% in 2019, with total revenues increasing from $31B in 2018 to $39.5B in 2019. AWS does have the lion’s share of this market, with 80% of enterprises either experimenting with or using AWS as their preferred platform. Microsoft’s Azure continues to make inroads as well, with the share of enterprises using the Azure platform increasing from 43% to 58%. And Google proclaimed a recent upsurge in its cloud services in its quarterly earnings announcement. It is worth noting, though, that both more traditional SaaS and private cloud implementations are expected to grow at near 30% rates for the next decade – essentially matching or even exceeding public cloud infrastructure growth rates over the same period. The industry with the highest adoption of both private and public cloud is financial services, where adoption (usage) rates above 50% are common and rates close to 100% are occurring, versus median rates of 19% for all industries.

At Danske Bank, we are close to completing a four-year infrastructure transformation program that has migrated our entire application portfolio from proprietary, dedicated server farms in 5 obsolete data centers to a modern private cloud environment in 2 data centers. Of course, we migrated and updated our mainframe complex as well. Over that time, we have also acquired business software that is SaaS-provided and experimented with or leveraged smaller public cloud environments. With this migration, led by our CTO Jan Steen Olsen, we have eliminated nearly all of our infrastructure-layer technical debt, reduced production incidents dramatically (by more than 95%), and correspondingly improved resiliency, security, access management, and performance. Below is a chart that shows the improved customer impact availability achieved through the migration, insourcing, and adoption of best practices.

These are truly remarkable results that enable Danske Bank to deliver superior service to our customers. Such reliability for online and mobile systems is critical in the digital age. Our IT infrastructure and applications teams worked closely together to accomplish the migration to our new, ‘pristine’ infrastructure. The data center design and migration were driven by our senior engineers with strong input from top industry experts, particularly CS Technology. A critical principle we followed was not to just move old servers to the new centers but instead to set up a modern and secure ‘enclave’ private cloud and migrate old to new. Of course, this is a great deal more work and requires extensive updates and testing of the applications. Working closely together, our architects and infrastructure engineers partnered to design our private cloud, which established templates and services up through our middleware, API, and database layers. There were plenty of bumps in the road, especially in our earliest migrations as we worked out the cloud designs, but our CIO Fredrik Lindstrom and the application teams dug in, partnered with the infrastructure team, made room for the updates and testing, and successfully converted our legacy distributed systems to the new private cloud environments. While certainly a lengthy and complex process, we were ultimately successful. We are now reaping the benefits of a fully modernized cloud environment with rapid server implementation times and lower long-term costs (you can see further guidelines here on how to build a private cloud). In fact, we have benchmarked our private cloud environment and it is 20 to 70% less expensive than comparable commercial offerings (including AWS and Azure). A remarkable achievement indeed, and as the feather in the cap, the program, led by Magnus Jacobsen, was executed on a relatively flat budget, as we used savings generated from insourcing and consolidations to fund much of the needed investment.

Throughout the design and the migration, we have stayed abreast of the cloud investments and results at many peer institutions and elsewhere. We have always looked at our cloud transformation as an infrastructure quality solution that could provide secondary benefits in performance, cycle time, and cost. But our core objective was achieving the availability and resiliency benefits and eliminating the massive risk posed by legacy data center environmentals. Yet much of the dialogue in the industry is focused on cloud as a time-to-market and cost solution for companies with complex legacy environments, enabling them to somehow significantly reduce systems costs and greatly improve development time to market.

Let’s consider how realistic this rationale is. First, how real is the promise of reduced development time to market due to public cloud? Perhaps, if you are comparing an AWS implementation to a traditional proprietary server shop with mediocre service and lengthy delivery times for even rudimentary servers, then yes, you enable development teams to dial up their server capacity much more easily and quickly. But compared to a modern private cloud implementation, the time to implement a new server for an application is (or should be) comparable. So, on an apples-to-apples basis, public and private cloud are generally comparably quick. More importantly, for a business application or service being developed, the server implementation tasks should be done in parallel with the primary development work, with little to no impact on the overall development schedule or time to market. In fact, the largest tasks that take up time in application development are often the project initiation, approval, and definition phases (for traditional waterfall) or the project initiation, approval, and initial sprint phases (for agile projects). In other words, management decisions and defining what the business wants the solution to do remain the biggest and longest tasks. If you are looking to improve your time to market, these are the areas where IT leadership should focus. Improving your time to get from ‘idea to project’ is typically a good investment in large organizations. Medium and large corporations are often constrained as much by the annual finance process and investment approval steps as by any other factor. We are all familiar with investment processes that require several different organizations to agree and many hurdles to be cleared before an idea can be approved. And the larger the organization, the more likely the investment process is the largest drag on time to market.

Even after you have approval for the idea and the project is funded, the next lengthy step is often ensuring that adequate business, design, and technology resources are allocated and there is enough priority to get the project off the ground. Most large IT organizations are overwhelmed with too many projects, too much work and not enough time or resources. Proper prioritization, and ensuring that not too many projects are in flight at any one time, are crucial to enable projects to proceed at reasonable speed. Once the funding and resources are in place, then adopting proper agile approaches (e.g. joint technology and business development agile methods) can greatly improve the time to market.

Thus, most time-to-market issues have little to do with infrastructure and cloud options and almost everything to do with management and leadership challenges. And the larger the organization, the harder it is to focus and streamline. Perhaps the most important part of your investment process is deciding what not to do, so that you can focus your efforts on the most important development projects. To attain the prized time to market so important in today’s digital competition, drive instead for a smooth investment process coupled with properly allocated and flexible development teams and agile processes. Streamlining these processes and ensuring effective project startups (project manager assigned, dedicated resources, etc.) will yield material time-to-market improvements. And having a modern cloud environment will then nicely support your streamlined initiatives.

On the cost promise of public cloud, I find it surprising that many organizations are looking to public cloud as a silver bullet for improving their costs. For either legacy or modern applications, the largest costs are software development and software maintenance – ranging anywhere from 50% to 70% of the full lifetime cost of a system. Next comes IT operations – running the systems and the production environment – as well as IT security and networks, at around 15-20% of total cost. This leaves 15 to 30% of lifetime cost for infrastructure, including databases, middleware, and messaging as well as the servers and data centers. Thus, the servers and storage total perhaps 10-15% of the lifetime cost. Perhaps you can achieve a 10%, or even 20% or 30%, reduction in this cost area, for a total systems cost reduction of 2-5%. And if you have a modern environment, public cloud would actually be at a cost disadvantage (at Danske Bank, our new private cloud costs are 20% to 70% lower than AWS, Azure, and other public clouds). Further, focusing on a 2% or 5% server cost reduction will not transform your overall cost picture in IT. Major efficiency gains in IT will come from far better performance in your software development and maintenance — improving productivity, having a better and more skilled workforce with fewer contractors, or leveraging APIs and other techniques to reduce technical debt and improve software flexibility. It is disingenuous to suggest you are tackling primary systems costs and making a difference for your firm with public cloud. You can deliver 10x the total systems cost improvement by introducing and rolling out software development best practices, achieving an improved workforce mix, and simplifying your systems landscape than by simply substituting public cloud for your current environment. And as I noted earlier, we have actually achieved lower costs with a private cloud solution versus commercial public cloud offerings. There are also hidden factors to consider with public cloud. For example, when testing a new app on your private cloud, you can run the scripts in off hours to your heart’s content at minimal to no cost, but you would need to watch your usage carefully on a public cloud, as all usage results in costs. The more variable your workload, the more likely it could cost less on a public cloud; conversely, the more stable your total workload, the more likely you can achieve significant savings with private cloud.
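To make the arithmetic concrete, here is a quick back-of-the-envelope calculation using rough midpoints of the ranges above; the exact splits will vary by shop, so treat the numbers as illustrative.

```python
# Back-of-the-envelope: what a server/storage saving means for total system cost.
# Percentages approximate the ranges discussed above; adjust for your own shop.
dev_and_maintenance = 0.60   # 50-70% of lifetime cost
ops_security_network = 0.18  # roughly 15-20%
infrastructure = 0.22        # remainder: databases, middleware, servers, data centers
server_and_storage = 0.15    # servers and storage, roughly 10-15% of lifetime cost

for reduction in (0.10, 0.20, 0.30):
    total_savings = server_and_storage * reduction
    print(f"{reduction:.0%} server/storage cut -> {total_savings:.1%} of total system cost")
# Prints roughly 1.5%, 3.0%, 4.5% -- i.e. the 2-5% range discussed above.
```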

On a final note, with public cloud solutions comes lock-in, not unlike previous generations of proprietary hardware or wholesale outsourcing. I am certain a few of you recall the extensions done to proprietary Unix flavors like AIX and HP-UX that provided modest gains but then increased lock-in of an application to that vendor’s platform. Of course, the cost increases from these vendors came later, as did migration hurdles to new and better solutions. The same feature extension game occurs today in the public cloud setting with Azure, AWS and others. Once you write your applications to take advantage of their proprietary features, you have become an annuity stream for that vendor, and any future migration off of their cloud will be arduous and expensive. Your ability to move to another vendor will typically be eroded and compromised with each system upgrade you implement. Future license and support price increases will need to be accepted unless you are willing to take on a costly migration. And you have now committed your firm’s IT systems and data to be handled elsewhere, with less control — potentially a long-term problem in the digital age. Note that your application and upgrade schedules are now determined by the cloud vendor, not by you. If you have legacy applications (as we all do) that rely on an older version of infrastructure software or middleware, these must be upgraded to keep pace, otherwise they will stop working. And don’t count on a rollback if problems are found after the cloud vendor’s upgrades.

Perhaps more concerning, in this age of ever bigger hacks, is that public cloud environments become the biggest targets for hackers, from criminal gangs to state-sponsored actors. And while the cloud providers have much larger security resources, there is still a rich target surface for hackers. The recent Capital One breach is a reminder that proper security remains a major task for the cloud customer.

In my judgement, larger corporations are certainly better off maintaining control of their digital capabilities with a private cloud environment than with a public cloud. This will likely be supplemented with a multi-cloud environment to enable key SaaS capabilities or to leverage public cloud scalability and variable expense for non-core applications. And with the improving economies of server technology and improved cloud automation tools, these environments can also be effectively implemented by medium-sized corporations. Having the best digital capabilities — and controlling them for your firm — is key to outcompeting in most industries today. If you have the scale to retain control of your digital environment and data assets, then this is the best course to enabling future digital success.

What is your public or private cloud experience? Has your organization mastered private, public, or multi-cloud? Please share your thoughts and comments.

Best, Jim

P.S. It is worth noting that public clouds are not immune to availability issues either, as reported here.

The Elusive High Availability in the Digital Age

Well, the summer is over, even if we have had great weather into September. My apologies for the delay in a new post, and I know I have several topic requests to fulfill 🙂 Given our own journey at Danske Bank on availability, I thought it was best to re-touch this topic and then come back around to other requests in my next posts. Enjoy and look forward to your comments!

It has been a tough few months for some US airlines and their IT systems availability. Hopefully, you were not caught up in the major delays and frustrations. Both Southwest and Delta suffered major outages in August and September. Add in power outages affecting equipment and multiple airlines recently in Newark, and you have many customers fuming over delays and cancelled flights. And the cost to the airlines was huge — Delta’s outage alone is estimated at $100M to $150M, and that doesn’t include the reputation impact. Nor are such outages limited to the US airlines, with British Airways also suffering a major outage in September. Delta and Southwest are not unique in their problems; both United and American suffered major failures and widespread impacts in 2015. Even with large IT budgets, and hundreds of millions invested in upgrades over the past few years, airlines are struggling to maintain service in the digital age. The reasons are straightforward:

  • At their core, services are based on antiquated systems that have been partially refitted and upgraded over decades (the core reservation system is from the 1960s)
  • Airlines struggled earlier this decade to make a profit due to oil prices, and invested minimally in their IT systems to attack the technical debt. This was further complicated by multiple integrations that had to be executed due to mergers.
  • As they have digitalized their customer interfaces and flight checkout procedures, the previous manual procedures are now backup steps that are infrequently exercised and woefully undermanned when IT systems do fail, resulting in massive service outages.

With digitalization reaching even further into customer interfaces and operations, airlines, like many other industries, must invest in stabilizing their systems, address their technical debt, and get serious about availability. Some should start with the best practices in the previous post on Improving Availability, Where to Start. Others, like many IT shops, have decent availability but still have much to do to get to first quartile. If you have made good progress but realize that three 9’s or preferably four 9’s of availability on your key channels is critical for you to win in the digital age, this post covers what you should do.

Let’s start with the foundation. If you can deliver consistently good availability, then your team should already understand:

  • Availability is about quality. Poor availability is a quality issue. You must have a quality culture that emphasizes quality as a desired outcome and doing things right if you wish to achieve high availability.
  • Most defects — which then cause outages — are injected by change. Thus, strong change management processes that identify and eliminate defects are critical to further reduce outages.
  • Monitor and manage to minimize impact. A capable command center with proper monitoring feeds and strong incident management practices may not prevent the defect from occurring but it can greatly reduce the time to restore and the overall customer impact. This directly translates into higher availability.
  • You must learn and improve from the issues. Your incident management process must be coupled with a disciplined root cause analysis that ensures teams identify and correct underlying causes that will avoid future issues. This continuous learning and improvement is key to reaching high performance.

With this base understanding, and presumably with only smoldering areas of problems left in your IT shop, there are excellent extensions that will enable your team to move to first quartile availability with moderate but persistent effort. For many enterprises, this is now a highly desirable business goal. Reliable systems translate to reliable customer interfaces, as customers now access the heart of most companies’ systems through internet and mobile applications, typically on a 7×24 basis. Your production performance becomes very evident, very fast, to your customers. And if you are down, they cannot transact, you cannot service them, your company loses real revenue, and, more importantly, damages its reputation, often badly. It is far better to address these problems and gain a key edge in the market by consistently meeting or exceeding customer availability expectations.

First, if you have moved up from regularly fighting fires, the fact that outages are no longer an everyday occurrence does not mean that IT leadership no longer needs to emphasize quality. Delivering high quality must be core to your culture and your engineering values. As IT leaders, you must continue to reiterate the importance of quality and demonstrate your commitment to these values by your actions. When there is enormous time pressure to deliver a release but it is not ready, you delay it until the quality is appropriate. Or you release a lower-quality pilot version, with properly set customer and business expectations, that is followed in a timely manner by a quality release. You ensure adequate investment in foundational quality by funding system upgrades and lifecycle efforts so technical debt does not increase. You reward teams for high quality engineering, and not for fire-fighting. You advocate inspections, or agile methods, that enable defects to be removed earlier in the lifecycle at lower cost. You invest in automated testing and verification that enables work to be assured of higher quality at much lower cost. You address redundancy and ensure resiliency in core infrastructure and systems. Single power cord servers still in your data center? Really?? Take care of these long-neglected issues. And if you are not sure, go look for these typical failure points (another being SPOF network connections). We used to call these ‘easter eggs’, as in the easter egg that no one found in the preceding year’s hunt, and then you find the old, and quite rotten, egg on your watch. It’s no fun, but it is far better to find them before they cause an outage.

Remember that quality is not achieved by never making mistakes — a zero-defect goal is not the target. Instead, quality is achieved by a continuous improvement approach where defects are analyzed and their causes eliminated, and where your team learns and applies best practices. Your target should be first quartile quality for your industry, which will provide competitive advantage. When you update the goals, also revisit and ensure you have aligned your organization’s rewards to match these quality goals.

Second, you should build on your robust change management process. To get to median capability, you should have already established clear change review teams and proper change windows, and moved to deliveries through releases. Now, use the data to identify which groups are late in their preparation for changes, or where change defects cluster and why. These insights can improve and streamline the change process (yes, some of the late changes could be due to too many required approvals, for example). Further clusters of issues may be due to specific steps being poorly performed or inadequate tools. For example, verification is often done as a cursory task and thus seldom catches critical change defects. The result is that the defect is only discovered in production, hours later, when your entire customer base is trying to use the system but cannot. Such an outage was likely entirely avoidable with adequate verification, because you would have known at the time of the change that it had failed and could have taken action then to back it out. The failed change data is your gold mine of information to understand which groups need to improve and where. Importantly, be transparent with the data: publish the results by team and by root cause cluster. Transparency improves accountability. As an IT leader, you must then make the necessary investments and align efforts to correct the identified deficiencies and avoid future outages.
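As an illustration of mining the failed change data, the short sketch below tallies failed changes by team and by root cause from a list of change records; the record fields and causes are assumptions for the example.

```python
# Minimal sketch: cluster failed changes by team and by root cause so the
# improvement discussion is driven by data rather than anecdote.
from collections import Counter

changes = [
    {"team": "payments", "status": "failed", "cause": "inadequate verification"},
    {"team": "payments", "status": "success", "cause": None},
    {"team": "channels", "status": "failed", "cause": "late preparation"},
    {"team": "payments", "status": "failed", "cause": "inadequate verification"},
]

failed = [c for c in changes if c["status"] == "failed"]
by_team = Counter(c["team"] for c in failed)
by_cause = Counter(c["cause"] for c in failed)

print("Failed changes by team: ", dict(by_team))
print("Failed changes by cause:", dict(by_cause))
print("Change success rate: {:.0%}".format(1 - len(failed) / len(changes)))
```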

Further, you can extend the change process by introducing production readiness. Production ready means a system or major update can be introduced into production because it is ready on all the key performance aspects: security, recoverability, reliability, maintainability, usability, and operability. In our typical rush to deliver key features or products, the sustainability of the system is often neglected or omitted. By establishing the Operations team as the final approval gate for a major change to go into production, and leveraging the production ready criteria, organizations can ensure that these often-neglected areas are attended to and properly delivered as part of the normal development process. These steps enable a much higher performing system in production and avoid customer impacts. For a detailed definition of the production ready process, please see the reference page.

Third, ensure you have consolidated your monitoring and that all significant customer-impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). This modern ECC also has the capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires an investment in an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues, these capabilities must be consolidated and integrated. Once you have an integrated ECC, you can extend it by moving from component monitoring to full channel monitoring. Full channel monitoring is where the entire stack for a critical customer channel (e.g. online banking for financial services or customer shopping for a retailer) has been instrumented so that a comprehensive view can be continuously monitored within the ECC. The instrumentation is such that not only are all the infrastructure components fully monitored, but the databases, middleware, and software components are instrumented as well. Further, proxy transactions can be, and are, run on a periodic basis to understand performance and detect issues. This level of instrumentation requires considerable investment — and thus is normally done only for the most critical channels. It also requires sophisticated toolsets such as AppDynamics. But full channel monitoring enables immediate detection of issues or service failures and, most importantly, enables very rapid correlation of where the fault lies. This rapid correlation can take incident impact from hours to minutes or even seconds. Automated recovery routines can be built to accelerate recovery from given scenarios and reduce impact to seconds. If your company’s revenue or service is highly dependent on such a channel, I would highly recommend the investment. A single severe outage that is avoided or greatly reduced can often pay for the entire instrumentation cost.
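To illustrate the proxy-transaction idea, here is a minimal synthetic probe that exercises a customer channel endpoint once a minute and raises an alert on failure or slow response; the URL, threshold, and alert stub are placeholders, and commercial tools such as AppDynamics provide far richer instrumentation than this sketch.

```python
# Minimal sketch of a proxy transaction: periodically exercise a critical
# customer path and alert when it fails or slows down.
import time
import urllib.request

CHANNEL_URL = "https://www.example.com/online-banking/health"   # placeholder endpoint
LATENCY_THRESHOLD_SECONDS = 2.0

def alert(message: str):
    # In practice this would page the ECC or open an incident automatically.
    print("ALERT:", message)

def probe_once():
    start = time.monotonic()
    try:
        with urllib.request.urlopen(CHANNEL_URL, timeout=10) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    elapsed = time.monotonic() - start
    if not ok or elapsed > LATENCY_THRESHOLD_SECONDS:
        alert(f"Channel check failed or slow: ok={ok}, latency={elapsed:.2f}s")

while True:
    probe_once()
    time.sleep(60)   # run the proxy transaction once a minute
```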

Fourth, you cannot be complacent about learning and improving. Whether from failed changes, incident pattern analysis, or industry trends and practices, you and your team should always be seeking to identify improvements. High performance or here, high quality, is never reached in one step, but instead in a series of many steps and adjustments. And given our IT systems themselves are dynamic and changing over time, we must be alert to new trends, new issues, and adjust.

Often, where we execute strong root cause analysis and follow-up, we end up focused only at the individual issue or incident level. This is all well and good for correcting the one issue, but if we miss broader patterns we can substantially undershoot optimal performance. As IT leaders, we must always consider both the trees and the forest. It is important not just to fix the individual incident and get to root cause for that one incident, but also to look for the overall trends and patterns in your issues. Do they cluster around one application or infrastructure component? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Do you have some teams that create far more defects than the norm? Are your designs too complex? Are you using the products in a mainstream or unique manner – especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g. poor testing), application or infrastructure issues (e.g. obsolete hardware), or other issues (e.g. lack of documentation, incompetent staff). Discuss these issues and the analysis with your management team and engineering leads. Tackle fixing them as a team, with your quality goals prioritizing the efforts. By correcting things both individually and systemically you can achieve far greater progress. Again, the transparency of the discussions will increase accountability and open up your teams, so everyone can focus on the real goals as opposed to hiding problems.

These four extensions to your initial efforts will set your team on a course to achieve top quartile availability. Of course, you must couple these efforts with diligent engagement by senior management, adequate investment, and disciplined execution. Unfortunately, even with all the right measures, providing robust availability for your customers is rarely a straight-line improvement. It is a complex endeavor that requires persistence and adjustment along the way. But by implementing these steps, you can enable sustainable and substantial progress and achieve top quartile performance to provide business advantage in today’s 7×24 digital world.

If your shop is struggling with high availability or major outages, look to apply these practices (or send your CIO the link to this page 🙂 ).

Best, Jim Ditmore

How Did Technology End Up on the Sunday Morning Talk Shows?

It has been two months since the Healthcare.gov launch, and by now nearly every American has heard of or witnessed the poor performance of the websites. Early on, only one of every five users was able to actually sign in to Healthcare.gov, while poor performance and unavailable systems continued to plague the federal and some state exchanges. Performance was still problematic several weeks into the launch, and even as of Friday, November 30, the site was down for 11 hours for maintenance. As of today, December 1, the promised ‘relaunch day’, it appears the site is ‘markedly improved’ but there are plenty more issues to fix.

What a sad state of affairs for IT. So, what do the Healthcare.gov website issues teach us about large project management and execution? Or further, about quality engineering and defect removal?

Soon after the launch, former federal CTO Aneesh Chopra, in an Aspen Institute interview with The New York Times‘ Thomas Friedman, shrugged off the website problems, saying that “glitches happen.” Chopra compared the Healthcare.gov downtime to the frequent appearances of Twitter’s “fail whale” as heavy traffic overwhelmed that site during the 2010 soccer World Cup.

But given that the size of the signup audience was well known and that website technology is mature and well understood, how could the government create such an IT mess? Especially given how much lead time the government had (more than three years) and how much it spent on building the site (estimated between $300 million and $500 million).

Perhaps this is not quite so unusual. Industry research suggests that large IT projects are at far greater risk of failure than smaller efforts. A 2012 McKinsey study revealed that 17% of IT projects budgeted at $15 million or higher go so badly as to threaten the company’s existence, and more than 40% of them fail. As bad as the U.S. healthcare website debut is, there are dozens of examples, both government-run and private, of similar debacles.

In a landmark 1995 study, the Standish Group established that only about 17% of IT projects could be considered “fully successful,” another 52% were “challenged” (they didn’t meet budget, quality or time goals) and 30% were “impaired or failed.” In a recent update of that study conducted for ComputerWorld, Standish examined 3,555 IT projects between 2003 and 2012 that had labor costs of at least $10 million and found that only 6.4% of them were successful.

Combining the inherent problems associated with very large IT projects with outdated government practices greatly increases the risk factors. Enterprises of all types can track large IT project failures to several key reasons:

  • Poor or ambiguous sponsorship
  • Confusing or changing requirements
  • Inadequate skills or resources
  • Poor design or inappropriate use of new technology

Unfortunately, strong sponsorship and solid requirements are difficult to come by in a political environment (read: Obamacare), where too many individual and group stakeholders have reason to argue with one another and change the project. Applying the political process of lengthy debates, consensus-building and multiple agendas to defining project requirements is a recipe for disaster.

Furthermore, based on my experience, I suspect the contractors doing the government work encouraged changes, as they saw an opportunity to grow the scope of the project with much higher-margin work (change orders are always much more profitable than the original bid). Inadequate sponsorship and weak requirements were undoubtedly combined with a waterfall development methodology and overall big bang approach usually specified by government procurement methods. In fact, early testimony by the contractors ‘cited a lack of testing on the full system and last-minute changes by the federal agency’.

Why didn’t the project use an iterative delivery approach to hone requirements and interfaces early? Why not start with healthcare site pilots and betas months or even years before the October 1 launch date? The project was underway for three years, yet nothing was made available until October 1. And why did the effort leverage only an already occupied pool of virtualized servers that had little spare capacity for a major new site? For less than 10% of the project costs, a massive dedicated farm could have been built. Further, there was no backup site, nor were any monitoring tools implemented. And where was the horizontal scaling design within the application to enable easy addition of capacity for unexpected demand? It is disappointing to see such basic misses in non-functional requirements and design in a major program for a system that is not that difficult or unique.

These basic deliverables and approaches appear to have been entirely missed in the implementation of the website. Further, the website code appears to have been quite sloppy, not even using common caching techniques to improve performance. Thus, in addition to suffering from weak sponsorship and ambiguous requirements, this program failed to leverage well-known best practices for the technology and design.
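As a small illustration of the kind of caching that was apparently skipped, the sketch below memoizes a slow reference-data lookup so repeated requests are served from memory; the lookup function is a stand-in for an expensive backend call, not the actual site’s code.

```python
# Minimal sketch: memoize slow reference-data lookups so repeated requests
# don't hit the backend every time.
from functools import lru_cache
import time

@lru_cache(maxsize=1024)
def plan_catalog(state: str):
    # Stand-in for an expensive backend or database call.
    time.sleep(0.5)
    return [f"{state}-plan-{i}" for i in range(3)]

start = time.perf_counter()
plan_catalog("VA")                      # first call pays the backend cost
first = time.perf_counter() - start

start = time.perf_counter()
plan_catalog("VA")                      # subsequent calls are served from cache
second = time.perf_counter() - start

print(f"first call: {first:.3f}s, cached call: {second:.6f}s")
```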

One would have thought that, given the scale and expenditure of the program, top technical resources would have been allocated and would have ensured these practices were used. The feds are scrambling with a “surge” of tech resources for the site. And while the new resources and leadership have made improvements so far, the surge will bring its own problems. It is very difficult to effectively add resources to an already large program. New ideas introduced by the ‘surge’ resources may not be accepted or easily integrated. And if the issues are deeply embedded in the system, it will be difficult for the new team to fully fix the defects. For every 100 defects identified in the first few weeks, my experience with quality suggests there are 2 or 3 times more defects buried in the system. Furthermore, if the project couldn’t handle the “easy” technical work — sound website design and horizontal scalability — how will it handle the more difficult challenges of data quality and security?

These issues will become more apparent in the coming months when the complex integration with backend systems from other agencies and insurance companies becomes stressed. And already the fraudsters are jumping into the fray.

So, what should be done and what are the takeaways for an IT leader? Clear sponsorship and proper governance are table stakes for any big IT project, but in this case more radical changes are in order. Why have all 36 states and the federal government roll out their healthcare exchanges in one waterfall or big bang approach? The sites that are working reasonably well (such as the District of Columbia’s) developed them independently. Divide the work up where possible, and move to an iterative or spiral methodology. Deliver early and often.

Perhaps even use competitive tension by having two contractors compete against each other for each such cycle. Pick the one that worked the best and then start over on the next cycle. But make them sprints, not marathons. Three- or six-month cycles should do it. The team that meets the requirements, on time, will have an opportunity to bid on the next cycle. Any contractor that doesn’t clear the bar gets barred from the next round. Now there’s no payoff for a contractor encouraging endless changes. And you have broken up the work into more doable components that can then be improved in the next implementation.

Finally, use only proven technologies. And why not ask the CIOs or chief technology architects of a few large-scale Web companies to spend a few days reviewing the program and designs at appropriate points. It’s the kind of industry-government partnership we would all like to see.

If you want to learn more about how to manage (and not to manage) large IT programs, I recommend “Software Runaways,” by Robert L. Glass, which documents some spectacular failures. Reading the book is like watching a traffic accident unfold: it’s awful, but you can’t tear yourself away. Also, I expand on the root causes of and remedies for IT project failures in my post on project management best practices.

And how about some projects that went well? Here is a great link to the 10 best government IT projects in 2012!

What project management best practices would you add? Please weigh in with a comment below.

Best, Jim Ditmore

This post was first published in late October in InformationWeek and has been updated for this site.

First Quarter Technology Trends to Note

For those looking for the early signs of spring, crocuses and flowering quince are excellent harbingers. For those looking for signs of technology trends and shifts, I thought it would be worthwhile to point out some new ones and provide further emphasis or confirmation of a few recent ones:

1. Enterprise server needs have flattened, and the cumulative effect of cloud, virtualization, SaaS, and appliances will mean the corporate server market has fully matured. The 1Q13 numbers bear out that this trend is continuing (as mentioned here last month). Some big vendors are even seeing revenue declines in this space. Unit declines are possible in the near future. The result will be consolidation in the enterprise server and software industry. VMWare, BMC and CA have already seen their share prices fall as investors are concerned the growth years are behind them. Make sure your contracts consider potential acquisitions or change of control.

2. Can dual SIM smartphones be just around the corner? Actually, they are now officially here. Samsung just launched a dual SIM Galaxy in China, so it will likely not be long before other device makers follow suit. Dual SIM enables excellent flexibility – just what the carriers do not want. Consider that when you travel overseas, you will be able to insert a local SIM into your phone and handle all local or regional calls at low rates, while still receiving calls to your ‘home’ number. Or, for everyone who carries two devices, one for business and one for personal, you can now keep your business and personal numbers separate but carry only one device.

3. Further evidence has appeared of the massive compromises enterprises are experiencing due to Advanced Persistent Threats (APTs). Most recently, Mandiant published a report that ties the Chinese government and the PLA to a broad set of compromises of US corporations and entities over many years. If you have not begun to move your enterprise security from a traditional perimeter model to a post-perimeter design, make the investment. You can likely bet you are already compromised; you need to not only lock the doors but also thwart those who have breached your perimeter. A post here late last year covers many of the measures you need to take as an IT leader.

4. Big data and decision sciences could drive major change in both software development and business analytics. It may not be the level of change that computers had on, say, payroll departments and finance accountants in the 1980s, but it could be wide-ranging. Consider that perhaps 1/3 to 1/2 of all business logic now encoded in systems (by analysts and software developers) could instead be handled by data models and analytics that make business rules and decisions in real time. Perhaps all of the business analysts and software developers will then move to developing the models and proving them out, or we could see a fundamental shift in the skills demanded in the workplace. We still have accountants, of course; they just no longer do the large amount of administrative tasks. Now, perhaps applying this to legal work…

5. The explosion in mobile continues apace. Wireless data traffic is growing at 60 to 70% per year and is projected to continue at this pace. The use of the mobile phone as your primary commerce device is likely to become real for most purchases in the next 5 years, so businesses are under enormous pressure to adapt and innovate in this space. Apps that can gracefully handle poor data connections (not everywhere is 4G) and hurried consumers will be critical; unfortunately, there are not enough of these (a minimal sketch of one such technique follows below).
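Here is the sketch referenced above: a hedged example of handling a poor connection with retries, exponential backoff, and a graceful fallback to cached data; the URL and cache contents are illustrative assumptions.

```python
# Minimal sketch: retry a mobile API call with exponential backoff and fall
# back to cached data when the connection is poor, instead of an endless spinner.
import json
import random
import time
import urllib.request

def fetch_with_backoff(url, retries=4, cached=None):
    delay = 0.5
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=3) as response:
                return json.load(response)
        except OSError:
            time.sleep(delay + random.uniform(0, 0.25))  # jitter avoids retry storms
            delay *= 2
    # Degrade gracefully: show last-known-good data rather than failing outright.
    return cached if cached is not None else {"status": "offline", "items": []}

print(fetch_with_backoff("https://api.example.com/catalog",
                         cached={"items": ["last-known-good"]}))
```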

Any additions you would make? Please send me a note.

Best, Jim Ditmore


Delivering Efficiency and Cost Reduction: Long term tactics II

We have spent the past several weeks covering how to deliver efficiency and cost reduction in IT, and particularly how to do it while maintaining or improving your capabilities. In my last post we discussed two long-term tactics that drive efficiency while also achieving improved capability. In this post we will cover another key long-term tactic that you can leverage to achieve world class cost and performance: leveraging metrics and benchmarking.

Let’s assume you have begun to tackle the quality and team issues in your shop. One of the key weapons that enables you to make better decisions as a manager, and to identify the areas you need to focus on, is having a level of transparency around the operational performance of your shop. This doesn’t mean your team must produce reams of reports for senior management or for you. In fact, that is typically wasted effort, as it is as much a show as real data. What you require to obtain effective operational transparency is the regular reporting of key operational metrics as they are used by the teams themselves. You and your senior team should be reviewing the Key Process Metrics (KPMs) that are used by the operational teams to drive the processes day-to-day and week-to-week. These would include things such as change success rates by team, or a summary of project statuses with regular project reporting behind each status.

The ability to produce and manage using metrics depends substantially on the maturity of your teams. You should match the metrics to be reported and leveraged to your team’s maturity, with an eye toward mastering their current level and moving to the next.

For example, for production metrics you should start, if a Level 1 or 2 organization, with having basic asset information and totals by type and by business along with basic change success volumes and statistics and incident volumes and statistics. A level 3 organization would have additional information on causal factors. So in addition to change or incident statistics there would be data on why (or root cause) for a failed change or a production incident. A level 4 organization would have information on managing the process improvement itself. Thus you would have information on how long root cause takes to complete, how many are getting resolved, what areas have chronic issues, etc. Consider the analogy of having information on a car. At the lowest level, we have information on the car’s position, then we add information on the car’s velocity, and then on its acceleration at the highest level.

These three levels of information (basic position and volumes, then causes and verification metrics, then patterns and improvement metrics and effects) can be applied across any IT service or component. And if applied, your management team can move from guessing what to do, to knowing what the performance and capabilities are, to understanding what is causing problems, to scientifically fixing those problems. I would note that many IT managers view putting the metrics in place as a laborious, time-consuming step that may not yield results. They would rather fly and direct by the seat of their pants. This is why so many shops have failed improvement programs and a lack of effective results. It is far better to take a scientific, process-oriented approach and garner repeatable results that enable your improvement programs to accelerate and redouble their impact over time.

Another advantage of this level of transparency is that both your team and your business partners can now see measurable results rather than promises or rosy descriptions. You should always seek to make key data a commodity in your organization, and you should publish the key process metrics broadly. This will reinforce your goals and the importance of quality throughout your team, and you will be surprised at the additional benefits gathered from employees who can now do a better job because they have visibility into the impact of their efforts.

And when you do develop key process metrics, make sure you include metrics that have business meaning. For example, why publish metrics on the number of incidents or outage time (or conversely, system time availability) to your business partners? While there is some merit in these data for the technical team, you should publish for the business in more appropriate terms. For availability, publish the customer impact availability (the total number of transactions failed or impacted for customers divided by the total number of transactions that typically occur or would have occurred that month) plus the number of customers impacted. If you tell the business that 50,000 customers were impacted by IT last month, versus telling them you had 98.7% system time availability, it becomes real for them. And they will then better understand the importance of investing in quality production.
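A worked example of the two measures may help; the transaction volumes and outage minutes below are illustrative, chosen so the system-time figure lands near the 98.7% mentioned above.

```python
# Worked example: customer-impact availability versus system-time availability.
transactions_expected = 4_000_000   # transactions that typically occur in the month
transactions_impacted = 50_000      # failed or impacted during incidents

minutes_in_month = 30 * 24 * 60
outage_minutes = 560                # total outage duration, regardless of when it happened

customer_impact_availability = 1 - transactions_impacted / transactions_expected
system_time_availability = 1 - outage_minutes / minutes_in_month

print(f"System time availability:     {system_time_availability:.1%}")   # ~98.7%
print(f"Customer impact availability: {customer_impact_availability:.2%}")
print(f"Customers/transactions impacted: {transactions_impacted:,}")
```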

In essence, by taking a metrics-based approach, we in IT are following in the footsteps of the manufacturing revolution in process that has been underway for the past 60 years. Your team should know the basics of Continuous Process Improvement and Lean. Perhaps give them some books on Deming or others. We should run our shops as a factory for all of our routine or volume services. Our businesses must compete in a lean manufacturing world, and our work is not handcraft; it should be formalized and improved to yield astounding gains that compound over 12, 24 or 36 months.

On a final note, once you have the base metrics in place, you should benchmark. You will have one of two good outcomes: either you will find out you have great performance and are a leader in the industry (and you can then market this to your business sponsors), or you will find many areas you must improve, and you now know what to work on for the next 12 months.


Have you seen success by applying these approaches? How has your team embraced or resisted taking such a disciplined approach? What would you add or change?


Best, Jim