In the Heat of Battle: Command Center Best Practices

If you have been in IT for any stretch, you will have experienced a significant service outage and the blur of pages, conference calls, analysis, and actions to recover. Usually such a service incident call occurs at 2 AM, and a fog settles in as a diverse, distributed team tries to sort through the problem and its impacts while seeking to restore service. Often, poor decisions are made or ineffective directions taken in this fog, which extend the outage. Further, as part of the confusion, there can be poor communications with your business partners or customers. Even in large companies with a dedicated IT operations team and command center, the wrong actions and decisions can be taken in the heat of battle as work is being done to restore service. While you can chalk many of these errors up to inherent engineering optimism or a loss of orientation after working a complex problem for many hours, to achieve outstanding service availability you must enable crisp, precise service restoration when an incident occurs. Such precision and avoidance of mistakes in ‘the heat of battle’ come from a clear line of command and an effective, well-rehearsed operational approach. This ‘best practice’ clarity includes defined incident roles and operational procedures communicated and ready well before such an event. Then everyone operates as a well-coordinated team to restore service as quickly as possible.

We explore these best practice roles and operational approaches in today’s post. These recommended practices have been derived over many years at IT shops that have achieved sustained first quartile production performance*. With these practices and robust operational capabilities in place, your operations team should be able to achieve a median time to restore service (MTTR) for Sev 0s, 1s, and 2s (customer-impacting incidents) of less than 1 hour, and in some cases as low as 30 minutes. Within that, you can target a median time to detect of five minutes, with the median time to correlate and the median time to fix each targeted at 25 minutes.
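
As a rough illustration of how these targets roll up, the sketch below computes median detect, correlate, fix, and restore times from incident timestamps. It is a minimal, hypothetical example in Python (the record fields and sample data are illustrative, not drawn from any particular monitoring or ticketing tool):

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records: when the incident started, was detected,
# was correlated to a likely cause, and was restored.
incidents = [
    {"start": "2013-03-02 02:00", "detected": "2013-03-02 02:04",
     "correlated": "2013-03-02 02:27", "restored": "2013-03-02 02:52"},
    {"start": "2013-03-09 14:10", "detected": "2013-03-09 14:16",
     "correlated": "2013-03-09 14:40", "restored": "2013-03-09 15:05"},
]

def minutes_between(a: str, b: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

detect    = median(minutes_between(i["start"], i["detected"]) for i in incidents)
correlate = median(minutes_between(i["detected"], i["correlated"]) for i in incidents)
fix       = median(minutes_between(i["correlated"], i["restored"]) for i in incidents)
restore   = median(minutes_between(i["start"], i["restored"]) for i in incidents)

print(f"Median time to detect:    {detect:.0f} min (target 5)")
print(f"Median time to correlate: {correlate:.0f} min (target 25)")
print(f"Median time to fix:       {fix:.0f} min (target 25)")
print(f"Median time to restore:   {restore:.0f} min (target < 60)")
```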

The first step is to have a production incident management process based on an ITIL approach. Some variation and adaptation of ITIL is of course appropriate to ensure a best fit for your company and operation, but make sure you are leveraging these fundamental industry practices and your team is fully up to speed on them. Further, it is preferable to have a dedicated command center which monitors production and has the resources for managing a significant incident when it occurs. If your enterprise is serious about customer service, then an operations command center and team are a fundamental investment.

While it is best to have those capabilities in place, there should also be clear roles for your technology and operations team for handling a production issue. The incident management roles that should be employed include:

  • Technical leads — there may be one or more technical leads for an incident depending on the nature of the issue and impact. These leads should have a full understanding of the production environment and be highly capable senior engineers in their specialty. Their role is to diagnose and lead the problem resolution effort in their component area (e.g., storage, network, DBMS, application). They also must reach out and coordinate with other technical leads to solve issues that lie between specialties (e.g., DBMS and storage), which is often the case.
  • Service lead — the service lead is also an experienced engineer or manager, one who understands all systems aspects and delivery requirements of the service that has been impacted. This lead will help direct which restoration efforts are a priority based on their knowledge of what is most important to the business. They would also be familiar with customer requirements and business constraints and be able to direct service restoration routines or procedures (e.g., a restart). They will also have full knowledge of the related services and potential downstream impacts that must be considered or addressed (e.g., will this impact crucial month-end processing?). And they will know which business units and contacts must be engaged to enact issue mitigation while the incident is being worked (e.g., go to manual modes or put up a different web intro page).
  • Incident lead — the incident lead is a command center member who is experienced in incident management, has strong command skills, and understands problem diagnosis and resolution. Their broad knowledge and experience should extend from the systems monitoring and diagnostic tools available to the application and infrastructure components and engineering tools, along with a base understanding of the services IT must deliver for the business. While this is often a hard-to-find resource, you will often find such people within an organization because they are the informal ‘go-to’ person used to solve hard problems. If they have the requisite communication skills as well, then promote them to one of your incident leads and give them the support to become even stronger. The incident lead will drive all problem diagnostic and resolution actions as needed, including:
    • review the situation and establish preliminary areas for analysis and inspection
    • engage and direct component and application technical leads and teams, both for diagnostic and restoration efforts
    • guide the collection and reporting of impact data
    • escalate as required to senior Operations Management to ensure adequate resources and talent are focused on the issue
  • Incident coordinator — in addition to the incident lead, there should also be an incident coordinator. This command center member is knowledgeable on the incident management process and procedures and handles key logistics, including setting up conference calls, calling or paging resources, drafting and issuing communications, and, importantly, managing the incident clock for both escalation and task progress (see the escalation clock sketch after this list). The coordinator can be supplemented by additional command center staff for a given incident, particularly if multiple technical resolution calls are spawned by the incident.
  • Senior IT operations management — for critical issues, it is also appropriate for senior IT operations management to be present on the technical bridge to ensure proper escalation and response occur. Further, communications may need to be drafted for senior business personnel providing status, impact, and prognosis. If it is a public issue, it may also be necessary to coordinate with corporate public relations and provide information on the issue.
  • Senior management — as is often the case with a major incident, senior management from all areas of IT and perhaps even the business will look to join the technical call and the discussions focused on service restoration and problem resolution. While this should be viewed as a natural desire (perhaps similar to slowing down and staring at a traffic accident), business and senior management presence can be disruptive and prevent the team from achieving a timely resolution. So here is what they are not to do:
    • Don’t join the bridge, announce yourself, and ask what is going on; this deflects the team’s attention from the work at hand and wastes several minutes bringing you up to speed, extending the problem resolution time (I have seen this happen far too often).
    • Don’t look to blame; the team will likely slow or even shut down due to fear of repercussions, when honest, open dialogue is needed most to understand the problem.
    • Don’t jump to conclusions on the problem; the team could be led down the wrong path. Few senior managers are up to date on the technology and have strong enough problem resolution skills to provide reliable suggestions. If you are one of the few, go ahead and let the team leverage your experience, but be careful if your track record says otherwise.
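
As referenced in the incident coordinator role above, a minimal sketch of the escalation clock follows. The checkpoints and timings here are illustrative assumptions, not a prescribed schedule; the point is simply that the coordinator tracks elapsed time and flags overdue escalation and communication actions.

```python
from datetime import timedelta

# Hypothetical escalation clock tracked by the incident coordinator.
# Offsets are from incident declaration; the thresholds are illustrative only.
ESCALATION_SCHEDULE = [
    (timedelta(minutes=0),  "Page technical leads, service lead, and incident lead"),
    (timedelta(minutes=15), "Issue first business/customer impact communication"),
    (timedelta(minutes=30), "Escalate to senior IT operations management if unresolved"),
    (timedelta(minutes=45), "Review progress; spawn parallel resolution paths if stalled"),
    (timedelta(minutes=60), "Notify senior management; reassess restoration options"),
]

def overdue_actions(elapsed: timedelta, completed: set) -> list:
    """Return escalation actions whose time has passed but which are not yet done."""
    return [action for when, action in ESCALATION_SCHEDULE
            if elapsed >= when and action not in completed]

# Example: 35 minutes into the incident with the first two actions already done.
done = {
    "Page technical leads, service lead, and incident lead",
    "Issue first business/customer impact communication",
}
print(overdue_actions(timedelta(minutes=35), done))
# -> ['Escalate to senior IT operations management if unresolved']
```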

Before we get to the guidelines to practice during an incident, I also recommend ensuring your team has the appropriate attitude and understanding at the start of an incident. Far too often, problems start small, or the local team thinks they have things well in hand. They then avoid escalating the issue or reporting it as a potential critical issue. Meanwhile, critical time is lost, and mistakes made by the local team can compound the issue. By the time escalation to the command center does occur, the customer impact has become severe and the options to resolve are far more limited. I refer to this as trying to put out the fire with a garden hose. It is important to communicate to the team that it is far better to over-report an issue than to report it late. There is no ‘crying wolf’ when it comes to production. The team should first call the fire department (the command center) with a full potential severity alert, and then go back to putting out the fire with the garden hose. Meanwhile, the command center will mobilize all the needed resources to ensure the fire is put out. If everyone arrives and the fire is already out, all will be happy. And if the fire is raging, you now have the full set of resources to properly overcome the issue.

Now let’s turn our attention to best practice guidelines to leverage during a serious IT incident.

Guidelines in the Heat of Battle:

1. One change at a time (and track all changes; a simple tracking sketch follows these guidelines)

2. Focus on restoring service first, but note the root causes as you come across them. Remember that most root cause analysis and remediation work comes long after service is restored.

3. Ensure configuration information is documented and maintained through the changes

4. Go back to the last known stable configuration (back out all changes if necessary to get back to that stable configuration). Don’t let engineering ‘optimism’ drive you to forward-engineer a new solution unless it is the only option.

5. Establish clear command lines (one for the technical effort, one for the business interface) and ensure full command center support. It is best for the business not to participate in the technology calls — it is akin to watching sausage get made (no one would eat it if they saw it being made). Your business partners will feel the same way about technology if they are on the calls.

6. Overwhelm the problem (escalate and bring in the key resources – yours and the vendor’s). Don’t dribble in resources because it is 4 AM. If you work in IT and you want to be good, this is part of the job. Get the key resources on the call and ensure you hold the vendor to the same bar as your own team.

7. Work in parallel wherever reasonable and possible. You should try to maintain broader efforts with alternative paths and avoid the ‘tunnel vision’ that groups fall into so easily in a crisis. This should include spawning parallel activities (and technical bridges) to work multiple reasonable solutions or backups. Don’t work one solution path singularly, only to find it fails and you must now serially restart another path.

8. Follow the clock and use the command center to ensure activities stay on schedule. You must be able to decide when a path is not working and focus resources on better options, and the clock is a key component of that decision. Escalation and communication must also occur with rigor to maintain confidence and bring the necessary resources to bear.

9. Peer plan, review, and four-eyes implement. Everything done in an emergency (here, to restore service and fix a problem) is, unfortunately, highly likely to inject further defects into your systems. Too many service outages have been extended or complicated by a simple mistake made while executing the solution or fix, such as running the command in the wrong environment. Also, remember that 70% of all fixes have another defect embedded in them. Peer planning, review, and four-eyes implementation will significantly improve the quality of the changes you implement.

10. Be ready for the worst: have additional options and a backout plan for the fix. You will save time and be more creative in driving better solutions if you address potential setbacks proactively rather than waiting for them to happen and then reacting.
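
To make guidelines 1, 3, and 9 concrete, here is a minimal sketch of an incident change log that records one change at a time, verifies the target environment, and enforces a four-eyes check before anything is executed. The structure and field names are hypothetical illustrations rather than a prescribed tool; most shops would capture the same information in their incident or change management system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

EXPECTED_ENV = "production"  # the environment the incident bridge is working on

@dataclass
class IncidentChange:
    description: str           # what will be changed (one change at a time)
    target_env: str            # environment the command will run against
    implementer: str           # engineer executing the change
    reviewer: str              # second pair of eyes (must differ from implementer)
    backout_plan: str          # how to revert if the change makes things worse
    executed_at: Optional[datetime] = None

change_log: List[IncidentChange] = []

def execute_change(change: IncidentChange) -> None:
    """Gate and record a single change during the incident."""
    if change.target_env != EXPECTED_ENV:
        raise ValueError(f"Refusing to run against '{change.target_env}' "
                         f"(expected '{EXPECTED_ENV}')")
    if change.reviewer == change.implementer:
        raise ValueError("Four-eyes check failed: reviewer must differ from implementer")
    if not change.backout_plan:
        raise ValueError("No backout plan recorded for this change")
    change.executed_at = datetime.now()
    change_log.append(change)  # every change is tracked, in order
    # ... the implementer now carries out the actual change ...
```

Tracked this way, the change log also gives you the exact sequence to back out if you need to return to the last known stable configuration (guideline 4).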

Recall that the ITIL incident management objective is to ‘restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.’ These best practice guidelines will help you drive to a best practice incident management capability.

What would you add or change in the guidelines? How have you been able to achieve excellent service restoration and problem management? I look forward to hearing from you.

Best, Jim Ditmore

P.S. Please note that these best practices have been honed over the years in world-class availability shops at major corporations, with significant contributions from such colleagues as Gary Greenwald, Cecilia Murphy, Jim Borendame, Chris Gushue, Marty Metzker, Peter Josse, Craig Bright, and Nick Beavis (and others).

17 thoughts on “In the Heat of Battle: Command Center Best Practices”

  1. Hi,

    I work in an IT operations environment where we are not really required to be proactive. If something breaks, we escalate it. How do we transform such an environment into one where people are more proactive and are allowed to resolve issues before escalating them?

    1. Tsholo,

      I think there are two sides to how to improve things in your environment. One is leading by example, and this is typically the most important. Good leaders gain influence because their teams and peers know that they will back up what they say with action. People are much more inclined to follow someone who has a ‘Do as I do’, rather than a ‘Do as I say’, approach. So, start first in your own work or your team. You can improve being proactive by having the right data. Collect the information on issues and problems in your area. Like most things with systems, the data will indicate clusters. Investigate the clusters further and understand why the issues are occurring. Once understood, you can now be proactive and initiate action to address the underlying causes. The second side is to be a proponent of change. Having courage and saying the right things in the right settings is also a key element of leadership. Be measured and have data when speaking in these situations. Express the effects of the desired change in terms of how it will help your enterprise be more successful (like, ‘our customers will not experience these service losses if we do X’). Identify well-understood ways to improve your current practices (e.g., ITIL methods), and be a proponent of adopting these best practices that have helped other companies (including, likely, your competitors) improve. Find allies and join forces to improve accountability and pride in the work your organization does.

      I think these are the best ways to tackle a reactive and complacent environment where external influences have not yet made change compelling. I wish you the best on this journey.

      Jim Ditmore

  2. Hello Jim

    What are the best proactive monitoring methods we can introduce from an IT networking standpoint when dealing with a contact center environment?

    Thanx

    1. Dear Dhruv,

      I think you want to start with the normal workload monitoring and routing capabilities within the contact center. For example, your call center managers will be anxious to know queue depth and wait time in real time throughout the work day. These capabilities are included with any traditional call center management software. From a network perspective, you particularly want to ensure you have proactive circuit quality and utilization monitoring for all connections to the call center. Call abandonment rates are another key indicator that can identify if you have issues with either quality or wait time. Monitoring these metrics will give you a solid understanding of the environment. Remember though, the appropriate practice for call centers is to set the network connections and the call management software and servers to be highly resilient so that on any one failure, you have rapid failover with minimal impact to the call center.
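
      As a rough illustration, the sketch below shows the kind of threshold check you might run against these metrics each reporting interval. The field names and thresholds are illustrative assumptions, not the output of any particular call center or network monitoring product:

```python
# Hypothetical interval statistics pulled from call center management software
# and network monitoring (field names and thresholds are illustrative only).
interval = {
    "calls_offered": 420,
    "calls_abandoned": 38,
    "avg_wait_seconds": 95,
    "circuit_utilization_pct": 78,
}

ABANDON_RATE_THRESHOLD = 0.05        # alert above 5% abandonment
WAIT_THRESHOLD_SECONDS = 60          # alert above a 60-second average wait
CIRCUIT_UTILIZATION_THRESHOLD = 80   # alert as circuits approach saturation

abandon_rate = interval["calls_abandoned"] / interval["calls_offered"]

alerts = []
if abandon_rate > ABANDON_RATE_THRESHOLD:
    alerts.append(f"Abandonment rate {abandon_rate:.1%} exceeds target")
if interval["avg_wait_seconds"] > WAIT_THRESHOLD_SECONDS:
    alerts.append(f"Average wait of {interval['avg_wait_seconds']}s exceeds target")
if interval["circuit_utilization_pct"] > CIRCUIT_UTILIZATION_THRESHOLD:
    alerts.append("Circuit utilization approaching saturation")

for alert in alerts:
    print(alert)
```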

      I hope this answers your question. Best, Jim Ditmore

  3. We incorporate all the best practices in the article (so nice to see a confirming document). We are looking for information on best practices for command center logistical support, i.e., onsite vs. offsite. Any thoughts on that?
    We do “follow the sun” for physical command center support (California, India, Singapore, and Texas). Most of our support today consists of gathering key leads on-site for the six teams that are heavily involved with the installation/activation. We can have up to 14 staff in one room during the first 24 to 36 hours, with additional staff onsite or on call. Our management wants to know if there are best practices for the “lower support” periods. We run a 4-day / 24-hour onsite command center running Friday to Monday. But on Sunday it is slow going, then Monday picks up with activity. Are there any best practices on supporting major installations for teams in multiple remote settings, where no one would be on-site (i.e., working from a hotel, home, Starbucks, or other remote sites)? Outside of being on your laptop during your support shift and having conference calls, tools like Meeting Place, or using IM, are there any other tools that happen to work well and are considered part of “best practice” when operating in this mode?

    1. Dear Jeannine,

      Good questions; I have some initial thoughts but want to check with some of the IT operations managers in the industry first. Just one clarification: are you running a 4-day / 24-hour schedule or a 7-day / 24-hour schedule?

      Best, Jim Ditmore

  4. Sir,

    Have you published any additional information on this topic? This has direct bearing on a new Command Center initiative within our organization.

    Thanks,
    Roy M

    1. Roy,

      I can get you plenty of additional information. What topics or areas are of particular interest? I also recommend you or your team visit recent successful implementations of command centers. JPMorgan has an excellent center, as does Allstate. If you need a contact there, I may be able to arrange one.

      Best, Jim

  5. Hi Jim,
    I’ve been working on the feasibility phase of building an operations centre and we are at a point where we would like to dig deeper in a few areas.

    The area where we have the least amount of consensus is an in-house vs. outsourced Ops Center. This is particularly evident when comparing the security monitoring function (which would prefer an outsourced solution) and data center monitoring and operations (which would prefer to build the capability in house). Right now I’m thinking that if the business cases are truly different, then maybe having a hybrid solution isn’t a bad thing. I’d like to hear more from you on the two options and under what scenarios each makes sense.

    1. Dear Daniel,

      My apologies for the delay in getting back to you. I think it is reasonable to do a hybrid where the primary command center is in-house and the SOC is outsourced, particularly if your IT budget is stretched. Typically your applications and underlying support infrastructure are quite custom and require significant knowledge and experience to understand and manage correctly. But most security attacks and threats follow patterns that a supplier who is on top of the latest trends and activity will be able to detect and know how to thwart or roll back. It is very important that the SOC operate 7×24, which may also be easier to attain with a supplier with critical mass. I hope you and your team have been able to progress this effectively; don’t hesitate to write me directly if you have further questions. Best, Jim Ditmore

  6. Hi Jim,

    Very interesting sharing, and thank you.

    I need some advice. I run operations in APAC and am in the midst of setting up a new global operations center site in APAC to cover the follow-the-sun support model. Currently we already have 2 GOCs in other regions and are in the midst of setting up one in APAC for business needs.

    I took note of the staffing capabilities, which are basically already in my plan, but would also like to know if there are any other guidelines, processes, or procedures to run and maintain a successful global ops center, and the best KPI measurements. The focus is only infrastructure services (server, DB, network, etc.).

  7. Jim,

    One of my customers is running an outsourced and mature service desk. They want to merge multiple service desks catering to different product lines and yet have an exceptionally agile escalation management desk that can handle all the major incidents/situations flagged as Sev 1.

    Any thoughts on how the escalation management desk/function should be run, and are there tools in the market that can simplify this operation?

    Thanks
    Jay

    1. Dear Jay T.

      An incident management/escalation management approach should follow the ITIL model. The ITIL model also maps out how service desks interlink with the incident management and escalation processes. Obviously, if the service desk receives a call from a user reporting an incident, it should be logged in the service desk system and tagged for incident management. One of the filtering questions is whether the incident is impacting multiple users or key channels; if so, it is immediately flagged to your command center and escalation occurs right away. The best incident and service desk management system in my view is ServiceNow, which enables a fully integrated capability between your command center, service monitoring, and your service desk.
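
      As a rough illustration of that filtering step, here is the kind of triage check a service desk could apply at logging time. The field names, channel list, and threshold are hypothetical illustrations, not ServiceNow’s actual data model or API:

```python
# Hypothetical triage check applied when a service desk ticket is logged.
KEY_CHANNELS = {"online banking", "payments", "branch"}  # illustrative channel names

def flag_to_command_center(ticket: dict) -> bool:
    """Return True if the incident warrants immediate command center escalation."""
    multiple_users = ticket.get("affected_user_count", 0) > 1
    key_channel = ticket.get("channel") in KEY_CHANNELS
    return multiple_users or key_channel

ticket = {"affected_user_count": 40, "channel": "online banking"}
if flag_to_command_center(ticket):
    print("Potential Sev 1: notify the command center and start the escalation clock")
```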

      Hope this helps, apologize for the delay getting back to you.

      Best, Jim Ditmore

  8. I was recently brought on board to direct an organization which includes the ServiceNow team, the Service Desk, and the IT Command Center. The Service Desk and IT Command Center are two separate teams run by two different managers. Due to my leadership changes, I have started reviewing my org structure. I have thought about moving the IT Command Center and Service Desk under one umbrella and one manager, with team leads under that person. The current IT Command Center manager would lead only the ServiceNow team. Our IT Command Center has not reached a maturity of being proactive for many reasons. The Service Desk is not where we need it to be. For us to reach the SPOC as defined on this website, I am looking to align both groups together.

    What are your thoughts and feedback?

    1. Samuel, I recommend that you address how your users call in so they reach a SPOC. You should publish only one master phone number, and you should have self-service and the ability to submit on the web as well. But all of it should get directed to your service desk (and not to the command center). The command center should focus on monitoring and automated detection of issues. It should get the serious incidents that are called into the service desk directly (even via a red phone line) immediately after the initial logging and triage by the service desk. Since these are two very different disciplines (service desk best practices versus command center and monitoring best practices), I think it would be difficult for one manager to master both sets of practices, much less juggle the very high demands of both roles. I think instead you should recruit an experienced manager for one of the two functions, and coach and develop the other one. It is fine to have a separate ServiceNow and systems management tools team as well.

      I hope this provides some further guidance for you; please let me know if you have further questions. Best, Jim Ditmore

      1. Thank you greatly. Yes, we do have all customers funneled through the Service Desk with one published number. The IT Command Center is primarily for IT engineering teams, event monitoring, and notification ownership within IT and with the customer. The reason for putting it under one manager was ownership of incidents, but I think it is better to keep them separate. What are your thoughts regarding field services, who support end user devices on premise when the service desk can’t resolve an issue? Is it better for them to be under the same director as the service desk?
