If you have been in IT for any stretch, you will have experienced a significant service outage and the blur of pages, conference calls, analysis, and actions to recover. Usually such a service incident call occurs at 2 AM, and a fog sets in as a diverse, distributed team tries to sort through the problem and its impacts while working to restore service. Too often, poor decisions or ineffective directions taken in this fog extend the outage. Adding to the confusion, communications with your business partners or customers can be poor. Even in large companies with a dedicated IT operations team and command center, the wrong actions and decisions can be made in the heat of battle as work is being done to restore service. While you can chalk many of the errors up to either inherent engineering optimism or a loss of orientation after working a complex problem for many hours, achieving outstanding service availability requires crisp, precise service restoration when an incident occurs. Such precision, and the avoidance of mistakes in ‘the heat of battle’, comes from a clear command structure and operational approach. This ‘best practice’ clarity includes defined incident roles and an operational approach communicated and ready well before such an event. Then everyone operates as a well-coordinated team to restore service as quickly as possible.
We explore these best practice roles and operational approaches in today’s post. These recommended practices have been derived over many years at IT shops that have achieved sustained first quartile production performance*. The first step is to have a production incident management process based on an ITIL approach. Some variation and adaptation of ITIL is of course appropriate to ensure the best fit for your company and operation, but make sure you are leveraging these fundamental industry practices and your team is fully up to speed on them. Further, it is preferable to have a dedicated command center that monitors production and has the resources to manage a significant incident when it occurs.
Assuming those capabilities are in place, there should be clear roles for your technology team in handling a production issue. The incident management roles that should be employed include the following (a minimal sketch of how these role assignments might be recorded follows the list):
- Technical leads — there may be one or more technical leads for an incident depending on the nature of the issue and its impact. These leads should have a full understanding of the production environment and be highly capable senior engineers in their specialty. Their role is to diagnose and lead the problem resolution effort in their component area (e.g. storage, network, DBMS, etc.). They also must reach out and coordinate with other technical leads to solve those issues that lie between specialties (e.g. DBMS and storage).
- Service lead — the service lead is also an experienced engineer or manager, one who understands all system aspects and delivery requirements of the service that has been impacted. This lead helps direct which restoration efforts are a priority based on their knowledge of what is most important to the business. They should also be familiar with, and able to direct, service restoration routines or procedures (e.g. a restart). They will have full knowledge of the related services and potential downstream impacts that must be considered or addressed. And they will know which business units and contacts must be engaged to enact issue mitigation while the incident is being worked.
- Incident lead — the incident lead is a command center member who is experienced in incident management, has strong command skills, and understands problem diagnosis and resolution. Their general knowledge and experience should span the systems monitoring and diagnostic tools available, the application and infrastructure components and engineering tools, and a base understanding of the services IT must deliver for the business. The incident lead drives all problem resolution actions as needed, including:
- engaging and directing component and application technical leads, teams, and restoration efforts,
- collecting and reporting impact data,
- escalating as required to ensure adequate resources and talent are focused on the issue.
- Incident coordinator – in addition to the incident lead, there should also be an incident coordinator. This command center member is knowledgeable in the incident management process and procedures and handles key logistics, including setting up conference calls, calling or paging resources, drafting and issuing communications, and, importantly, managing to the incident clock for both escalation and task progress. The coordinator can be supplemented by additional command center staff for a given incident, particularly if multiple technical resolution calls are spawned by the incident.
- Senior IT operations management – for critical issues, it is also appropriate for senior IT operations management to be present on the technical bridge to ensure proper escalation and response occur. Further, communications may need to be drafted for senior business personnel providing status, impact, and prognosis. If it is a public issue, it may also be necessary to coordinate with corporate public relations and provide information on the issue.
- Senior management – as is often the case with a major incident, senior management from all areas of IT, and perhaps even the business, will look to join the technical call and the discussions focused on service restoration and problem resolution. While this should be viewed as a natural desire (perhaps similar to slowing down to stare at a traffic accident), business and senior management presence can be disruptive and prevent the team from resolving the issue in a timely manner. So here is what they are not to do:
- Don’t join the bridge, announce yourself, and ask what is going on; this deflects the team’s attention from the work at hand, wastes several minutes bringing you up to speed, and extends the problem resolution time (I have seen this happen far too often).
- Don’t look to assign blame; the team will likely slow down or even shut down due to fear of repercussions just when honest, open dialogue is needed most to understand the problem.
- Don’t jump to conclusions on the problem; the team could be led down the wrong path. Few senior managers are up to date enough on the technology and have strong enough problem resolution skills to provide reliable suggestions. If you are one of them, go ahead and let the team leverage your experience, but be careful if your track record says otherwise.
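Returning to the core roles above, here is a minimal sketch (in Python, with hypothetical names, contacts, and field names) of how a command center might record who holds each role on an incident bridge, so paging and handoffs are unambiguous:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ROLES = ("technical_lead", "service_lead", "incident_lead", "incident_coordinator")

@dataclass
class RoleAssignment:
    role: str             # one of ROLES
    name: str             # who holds the role for this incident
    contact: str          # pager, phone, or chat handle
    component: str = ""   # e.g. "storage" or "DBMS" for technical leads

@dataclass
class IncidentBridge:
    incident_id: str
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    assignments: list = field(default_factory=list)

    def assign(self, role, name, contact, component=""):
        if role not in ROLES:
            raise ValueError(f"unknown incident role: {role}")
        self.assignments.append(RoleAssignment(role, name, contact, component))

# Hypothetical example: staffing the bridge for incident INC-0042.
bridge = IncidentBridge(incident_id="INC-0042")
bridge.assign("incident_lead", "A. Rivera", "pager:1234")
bridge.assign("incident_coordinator", "J. Patel", "pager:3456")
bridge.assign("service_lead", "M. Chen", "pager:9012")
bridge.assign("technical_lead", "K. Osei", "pager:5678", component="DBMS")
```

Whatever tool you use, having these assignments recorded at the start of an incident keeps the command center from hunting for contacts in the fog.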
Before we get to the guidelines to practice during an incident, I also recommend ensuring your team has the appropriate attitude and understanding at the start of an incident. Far too often, problems start small, or the local team thinks they have the issue well in hand. They then avoid escalating it or reporting it as a potential critical issue. Meanwhile, critical time is lost, and mistakes made by the local team can compound the issue. By the time escalation to the command center does occur, the customer impact has become severe and the options to resolve are far more limited. I refer to this as trying to put out the fire with a garden hose. It is important to communicate to the team that it is far better to over-report an issue than to report it late. There is no ‘crying wolf’ when it comes to production. The team should first call the fire department (the command center) with a full potential severity alert, and then go back to putting out the fire with the garden hose. Meanwhile, the command center will mobilize all the needed resources to arrive and ensure the fire is put out. If everyone arrives and the fire is already out, all will be happy. And if the fire is raging, you now have the full set of resources to properly overcome the issue.
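As a concrete illustration of the ‘call the fire department first’ point, here is a minimal sketch in Python of declaring an incident at its full potential severity before local triage continues; the function, severity scheme, and service names are hypothetical, not a real paging API:

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("incident")

def declare_incident(summary, potential_severity, service):
    """Open the incident at the highest severity it could plausibly reach.

    Downgrading later is cheap; escalating late is not.
    """
    incident = {
        "summary": summary,
        "severity": potential_severity,  # declared potential impact, not current impact
        "service": service,
        "declared_at": datetime.now(timezone.utc).isoformat(),
    }
    # In a real shop this would page the command center and open a ticket;
    # a log line keeps the sketch self-contained.
    log.warning("Declared incident: %s", incident)
    return incident

# The local team spots a small problem: declare first, then keep working it.
declare_incident(
    summary="Order API error rate rising on node 3",
    potential_severity=1,   # assume the worst until proven otherwise
    service="order-processing",
)
# The local team continues triage (the garden hose) while the command
# center mobilizes the broader response in parallel.
```

Whether this is a script, a ticketing workflow, or simply a phone call, the key design choice is that severity is declared on potential impact rather than current impact.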
Now let’s turn our attention to best practice guidelines to leverage during a serious IT incident.
Guidelines in the Heat of Battle:
1. One change at a time (and track all changes)
2. Focus on restoring service first, but note the potential root causes as you come across them. Remember, most root cause analysis and remediation work comes long after service is restored.
3. Ensure configuration information is documented and maintained through the changes
4. Go back to the last known stable configuration (back out all changes if necessary to get back to that stable configuration). Don’t let engineering ‘optimism’ drive you to forward engineer to a new solution unless it is the only option.
5. Establish clear command lines (one technical, one as the business interface) and ensure full command center support. It is best for the business not to participate in the technology calls — it is akin to watching sausage get made (no one would eat it if they saw it being made). Your business partners will feel the same way about technology if they are on the calls.
6. Overwhelm the problem (escalate and bring in the key resources – yours and the vendor’s). Don’t dribble in resources because it is 4 AM. If you work in IT and you want to be good, this is part of the job. Get the key resources on the call and hold the vendor to the same bar as you hold your own team.
7. Work in parallel wherever reasonable and possible. This should include spawning parallel activities (and technical bridges) to work multiple reasonable solutions or backups.
8. Follow the clock and use the command center to ensure activities stay on schedule. You must be able to decide when a path is not working and focus resources on better options, and the clock is a key component of that decision. Escalation and communication must also occur with rigor to maintain confidence and bring the necessary resources to bear.
9. Peer plan, review, and implement. Everything done in an emergency (here, to restore service and fix a problem) carries a high risk of injecting further defects into your systems. Too many incidents have been compounded during a change implementation when a typo occurs or a command is executed in the wrong environment. Peer planning, review, and implementation will significantly improve the quality of the changes you make (a minimal sketch of such a peer-verified change record follows this list).
10. Be ready for the worst: have additional options and a backout plan for the fix. You will save time and arrive at better solutions if you address potential setbacks proactively rather than waiting for them to happen and then reacting.
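To tie several of these guidelines together (1, 3, 4, 9, and 10), here is a minimal sketch in Python of an emergency change record and change log; the class and field names are hypothetical rather than a prescribed tool. Each change is executed one at a time, carries a snapshot of the prior configuration and a backout step, and is blocked unless it has an independent reviewer and targets the environment the session is actually working in:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class EmergencyChange:
    description: str
    target_env: str                 # e.g. "prod-db-cluster-2"
    planned_by: str
    reviewed_by: str                # must differ from planned_by (guideline 9)
    prior_config: dict              # snapshot taken before the change (guideline 3)
    apply: Callable[[], None]
    backout: Callable[[], None]     # guideline 10: always have a backout plan
    applied_at: Optional[datetime] = None

@dataclass
class IncidentChangeLog:
    current_env: str
    changes: list = field(default_factory=list)

    def execute(self, change: EmergencyChange) -> None:
        # Peer review and environment confirmation before anything runs.
        if change.reviewed_by == change.planned_by:
            raise RuntimeError("change must be reviewed by someone other than the planner")
        if change.target_env != self.current_env:
            raise RuntimeError(
                f"change targets {change.target_env} but this session is {self.current_env}"
            )
        change.apply()
        change.applied_at = datetime.now(timezone.utc)
        self.changes.append(change)  # one change at a time, every change tracked (guideline 1)

    def backout_all(self) -> None:
        # Guideline 4: return to the last known stable configuration by
        # backing out every tracked change, most recent first.
        for change in reversed(self.changes):
            change.backout()
        self.changes.clear()
```

The point of the sketch is the discipline, not the tooling: a shared log of one-at-a-time, peer-reviewed, backout-ready changes is what lets you confidently return to the last known stable configuration when a path does not work out.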
Recall that the ITIL incident management objective is to ‘restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.’ These guidelines will help you build a best practice incident management capability.
What would you add or change in the guidelines? How have you been able to achieve excellent service restoration and problem management? I look forward to hearing from you.