Incident Management is a learned skill. The corollary is that there are a bunch of instinctive behaviours that people follow which are not helpful during an incident.
If you see these happening in your org, it’s time to gently encourage folks to learn more about incident management, or to start offering some training. Try to avoid the urge to yell “Bingo!” when you have seen them all- this is rarely helpful during an incident response.
The Anti-Patterns #
The Clusterfuck or “Kicked Anthill” 🐜 🐜 🐜 #
The shout goes up “Production is down!” (or “the server”, or “BigCustomer”) and everyone within earshot drops what they are doing and huddles around the oncaller’s desk. Or piles into the chat thread, asking questions, offering their unrequested observations, and generally carrying on as though they are an equal stakeholder in the situation.
This is the most common of the naive incident response antipatterns- folks sincerely believe they are helping by dropping what they are doing to try to contribute to resolving the incident.
This is bad for two reasons- it confuses the incident response, and it takes folks away from other work they could be doing. If your IM processes are robust, the oncaller will be able to either deal with the problem themselves or escalate and loop in the folks they need. If everyone is always keeping some part of their attention on possible incidents they could jump into, they will experience more stress and a higher risk of burnout. And if that isn’t bad enough, from a management point of view, they won’t be fully concentrating on their project work.
All Hands on Deck 🧊 🚢 🫡 #
This is a panic reaction where someone in authority (CEO, CTO, VP, random manager) has decided to be proactive about a problem and that everyone who works under them needs to only be working on this specific problem.
For big incidents needing a lot of headcount, this may be appropriate. In most cases though, it takes engineers away from other important work and requires folks without relevant skills for the incident to be little more than cheerleaders for the day.
For example, BigCustomer (tm) has raised an incident by direct-messaging the CEO and now literally everyone in the company is told to stop work and help.
Race to the Moon 🚀 🌔 🚀 #
Teams (or individuals) are in competition with each other to solve the problem first. Rather than sharing lines of enquiry, evidence found, or theories disproved, they end up hiding details from others and bragging about their progress. Instead of addressing the (customer’s) problem, they see it as an opportunity to prove their own skill / value / importance.
This is a toxic behaviour which reinforces silos, delays incident resolution, and prevents effective post-incident learning.
Magic Meeting ✨ 🖥️ ✨ #
All the participants in the initial incident response start/join a video call and only discuss the incident verbally and synchronously in the meeting. No notes are taken. No evidence is shared. No decisions are communicated outside the video call.
Sometimes a synchronous video call is the best way to rapidly share information, reach a decision, and get feedback from stakeholders. But it is essential that the information shared and the decision reached are also communicated and recorded outside the call:
- So the evidence and decision are available for post-incident learning, and
- Any responders joining the incident later have access to the same context and information as everyone else (eg if the incident lasts multiple hours and needs to be handed over)
This is why I never configure incident management software to automatically create a video call, just a dedicated chat channel. If a call is needed, it can be easily started. We establish a default behaviour of communicating important information via the chat channel where it will be visible to all responders.
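To make that default concrete, here is a sketch of what such a configuration might look like. All the key names below are invented for illustration- they are not taken from any specific incident management product, which will have its own settings for channel creation and call behaviour.

```yaml
# Hypothetical incident-tooling defaults (illustrative only; key names
# are invented, not from any specific product).
incident_defaults:
  create_chat_channel: true          # dedicated channel per incident
  channel_name_template: "inc-{id}-{slug}"
  auto_start_video_call: false       # calls are opt-in, started by responders
  channel_topic: "Post evidence, decisions, and handovers here"
```

The point is the defaults, not the syntax: the chat channel exists from the first moment, and a call is something a responder deliberately starts when it is genuinely useful.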
Nothing To See Here 🙈 🚧 #
An incident happens, and the team responsible either try to fix it without mentioning it at all, or simply share something very generic like “There was a problem with the web server, but it’s all fixed now”.
This is especially common in teams without psychological safety, or where there is a cultural expectation to hide or deny mistakes.
Sometimes the outage (or risk) is due to a bug that is seen as preventable or careless and folks don’t want to admit to that. But everyone is always doing the best they can with the resources available to them at the time. And every human makes mistakes, and every sufficiently complex system can have unexpected behaviours.
If the incident is hidden, we lose an opportunity to learn from it and improve the system. Owners of any downstream systems will be scratching their heads, trying to work out what just happened. Affected customers or colleagues won’t be notified or supported when they could be.
This can also be seen in incidents where the owning team or engineer skips the post-incident learning phase. They just mark it resolved and move on- no lessons learned, no opportunity to spot, prevent, or improve on the situation in future.
Leave It With Me 👤 🔒 #
Related to Nothing To See Here, some engineers will not share their knowledge, investigations, evidence, or remediation steps with others. Even if they are part of the incident chat channel, they are acting as a Lone Wolf.
You might get a response like “Fixing it now” or “Fixed” without any details, or any chance for the Incident Lead to coordinate the response and make decisions about what steps are appropriate for the incident at hand. Frequently, they will end up deploying production changes without review (by peers or the Incident Lead).
This may come from a place of embarrassment (at the inevitable existence of a bug in a system they built) or a mistaken assumption that the problem is solely theirs to fix and no-one else’s business.
This can lead to overlapping, clashing, or mutually incompatible changes to the system during the incident. If outage symptoms mysteriously disappear without explanation, the Incident Lead can be led into bad decisions. If the lone wolf engineer’s changes cause more problems, it is much harder to understand the progress of the incident. It also makes it hard or impossible to learn from the incident, or to share investigation and resolution steps within the team.
Ways Forward #
Practical Training / Workshops / Wheel of Misfortune #
The most effective approach I have seen for addressing these antipatterns is running practical training sessions.
Start by explaining the behaviours and outcomes you would like to see, along with any standard tooling or processes within your organisation.
Then make sure everyone has a go at doing all parts of your incident management process with different roles. Several times. The first time, they are likely to fall into one or more of the antipatterns. This is a great opportunity for them to realise how that behaviour led to worse outcomes overall, and now they can try again without making that mistake.
Actions:
- Raising an incident
- Sending comms
- Coordinating a response
- Handing over Incident Lead
- Verifying service is restored
- Compiling a post-incident report / post-mortem
Roles:
- First Responder
- Incident Lead (initial, and replacement after handover)
- External Stakeholder
Having well-defined roles ahead of time helps avoid the Kicked Anthill approach. When engineers have a better idea of what is (and is not) expected of them during an incident, they find it easier to follow the process in future.
Again, having specific roles within a constrained example helps make the point that not everyone needs to be involved all the time during an incident. So you can help avoid All Hands On Deck, especially if all managers and senior leadership take part in the sessions.
I cannot overstate how much folks learn about good incident management behaviours by having to compile a post-incident report with a detailed timeline of what happened when, and what was learned in the process. All by themselves, they will see the downside of the Magic Meeting and Race to the Moon and Nothing to See Here and Leave It With Me.
What seemed like unnecessary over-communication and verbosity, too much typing in a chat channel, and time wasted when we could be resolving the incident quicker, all starts to make sense.
Match the process to the problem #
An Incident Management process which works for other organisations may not be a good fit for yours.
If the terminology does not suit your team’s culture, it’s ok to change the terms. For example, Google uses the term Incident Commander (IC) but other roles in the response (like Comms Lead or Ops Lead) use less martial terminology. I prefer the term Incident Lead myself, as it is consistent with the other roles and less intimidating. But you may want to keep Incident Commander to reinforce the idea that the IC is the one making decisions and other responders need to respect that part of the role.
Maybe your team is stuck in fire-fighting mode, with more incidents than you can sustainably handle. In which case, you may want to streamline the post-incident learning process and do a weekly review rather than a deep-dive on every incident. Consider changing the post-incident doc template to something with a few checkboxes and optional notes so that you can still learn something from every incident.
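A lightweight template along these lines might be enough for the weekly review. This is a sketch, not a prescribed format- the incident ID, fields, and checkboxes are placeholders to adapt to your own process:

```markdown
## INC-123: one-line summary

Duration: 35 min | Severity: minor | Lead: @oncaller

- [ ] Affected customers/colleagues were notified
- [ ] Monitoring caught the issue before customers did
- [ ] Fix is permanent (not a restart or rollback workaround)
- [ ] Follow-up tasks filed and linked

Optional notes (a sentence or two):
What surprised us? What would make this faster next time?
```

Even this minimal form preserves a timeline anchor, a severity, and a handful of yes/no signals you can trend across incidents, without demanding a deep-dive write-up every time.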
Focus on the goal of Incident Management #
The goal is not just to resolve this incident quickly, but to have fewer and smaller incidents over time, and to get better at detecting, resolving, and (ultimately) avoiding similar incidents completely.
Once this makes sense, it becomes natural to share information at every stage, and take the time to learn from every incident, and make those lessons count in terms of meaningful changes to systems and processes over time.
When there is an Incident Management process that makes sense, evolves over time, shows concrete results, is easy to follow, and has the full backing of senior leadership, you will see fewer and fewer examples of the antipatterns listed here.