Introducing Incident Management

Published: 2024-01-13 19:54:23 | Author: 伯安知心

Foreword

Today we talk about the significance of incident management. We start with some simple concepts. What exactly is an incident? What are incident management and problem management? Why are they so important? What can we do, and what should we do to make them better? Let me go through these one by one.

Basic Concepts

Basically, incidents are unplanned production outages that disrupt the end-user experience and require immediate, organized intervention. The ITIL definition of an incident is "Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service." I have to say this definition reads a little like "government speak". Let me give it a more easily understood meaning: "Any event that reduces the quality of our service." An incident, then, could be a downtime-related event, an event that slows response times for end users, or an event that causes incorrect or unexpected results to be returned to end users.

Incident management, also referred to here as issue management, is defined by the ITIL as "to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price." Thus, management of an incident really becomes the management of the impact of the incident. We love this definition and this approach because it separates cause from impact. We want to resolve an incident as quickly as possible, but that does not necessarily mean understanding its root cause. Rapidly resolving an incident is also critical to the perception of scale: once a scalability-related incident occurs, it starts to create the perception of a lack of scalability.

To recap, an incident is an unwanted event in our system that impacts our availability or service levels, and incident management is about the timely and cost-effective resolution of incidents to bring the system back to perceived normal behavior.

Now let us talk about problem management. Officially, a problem is defined as "the unknown cause of one or more incidents, often identified as a result of multiple similar incidents." Incident management and problem management are therefore quite different, and we can see the purposeful separation of events from their causes. This simple separation of the definitions of incident and problem helps us in our everyday work by forcing us to think about their resolution differently. If we attempt to find the root cause of every incident before restoring service, we will have lower availability than if we separate the restoration of service from the identification of cause. Furthermore, the skills necessary to restore service and manage a system back to proper operation may very well be different from those necessary to identify the root cause of any given incident. If that is the case, serializing the two processes not only wastes engineering time but also destroys shareholder value. What does that mean?

It means there is a real conflict between these two processes: problem management is often at odds with incident management. The rapid restoration of service often conflicts with the forensic data gathering necessary for problem management. For now, recognize that there is a benefit in thinking separately about the actions needed to restore service and the actions needed to resolve problems.

What is incident management? Incident management is the set of actions taken in a select order to mitigate and resolve critical incidents to restore service health as quickly as possible.

Incident Management Stages

There is no standardized set of stages for incident management, but most processes share some common elements that we need to think about. The following stages are one example.

Detect: Outages are detected proactively via monitoring and alerts set up on the infrastructure, or through user reports arriving via various customer support channels.

Create: Incidents are created for the detected outages, which triggers the incident management process. Ideally, an organization can rely on a ticket management system similar to Atlassian's JIRA to log incident details.

Classify: Incidents are classified based on established guidelines. It is highly recommended to draft these guidelines in alignment with business needs. Multiple terminologies are used across the industry today, but we will stick to a major, medium, and minor categorization to keep it simple. The incident management process and sense of urgency remain the same for all incidents, but classifying them helps prioritize when multiple incidents are ongoing simultaneously.

Troubleshoot: The person who initially reported the incident consults the internal on-call runbook and, to the best of their knowledge, escalates the incident to the on-call engineers of the respective service. Escalations continue until the root cause of the issue is identified; sometimes an incident involves multiple teams working together to find the problem.

Resolve: As their highest priority, the teams involved focus on identifying the steps to mitigate the ongoing incident in the shortest amount of time possible. The key is to take intelligent risks and be decisive in the steps that follow. Once the issue is mitigated, teams focus on resolving the root cause to prevent the problem from recurring. Throughout the resolution process, communication with internal and external stakeholders is essential.

Review: The incident review usually takes place after the root cause has been identified. The team involved in the incident and the stakeholders get together to review the incident in detail. Their goal is to identify what went wrong and what could be improved to prevent similar issues or resolve them faster in the future, and to define short-term and long-term action items to improve the process and the stack.

Follow up: Incident action items are reviewed regularly at the management level to ensure that all of them are resolved. Critical metrics around incidents, such as TTD (time to detect), TTM (time to mitigate), TTR (time to resolution), and SLAs (service level agreements), are evaluated to determine incident management effectiveness and identify the strategic investment areas to improve the reliability of the services.
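The stages and the major/medium/minor classification described above can be sketched as a small data model. This is only an illustration: the names `Severity`, `STAGES`, `Incident`, and `prioritize`, as well as the sample incident titles, are hypothetical and not taken from any particular ticketing tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means more urgent; the three levels follow the article.
    MAJOR = 1
    MEDIUM = 2
    MINOR = 3


# Stage names taken from the list above, in order.
STAGES = ["detect", "create", "classify", "troubleshoot",
          "resolve", "review", "follow_up"]


@dataclass
class Incident:
    title: str
    severity: Severity
    stage: str = "detect"

    def advance(self) -> str:
        """Move to the next stage; stay put once the last stage is reached."""
        index = STAGES.index(self.stage)
        if index < len(STAGES) - 1:
            self.stage = STAGES[index + 1]
        return self.stage


def prioritize(incidents):
    """Order simultaneous incidents so the most severe come first."""
    return sorted(incidents, key=lambda incident: incident.severity)


# Hypothetical incidents that are ongoing at the same time.
ongoing = [
    Incident("Slow search responses", Severity.MEDIUM),
    Incident("Checkout returns errors", Severity.MAJOR),
    Incident("Typo on help page", Severity.MINOR),
]
for incident in prioritize(ongoing):
    print(incident.severity.name, incident.title)
```

Sorting by severity only decides which incident gets attention first; every incident still moves through the same stages, which matches the point that the process itself does not change with the classification.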

Metrics to Measure

As the saying goes in SRE circles, "what gets measured gets fixed." The following are standard metrics that should be measured and tracked across all incidents and across the organization.

Time To Detect (TTD)

Time to detect is the time it takes to detect the outage manually or via automated alerts from its start time. Teams can adopt more comprehensive alert coverage with fresher signals to detect outages faster.

Time To Mitigate (TTM)

Time to mitigate is the time taken to mitigate the user impact, measured from the start of the incident. Mitigation steps are temporary solutions that hold until the root cause of the issue is addressed. Striving for a better TTM helps increase the availability of the service. Many companies serve users from multiple regions in an active-active mode and redirect traffic to healthy regions to mitigate incidents faster. Similarly, redundancy at the service or node level helps mitigate faster in some situations.

Time To Resolution (TTR)

Time to resolution is the time taken to fully resolve the incident, measured from the start of the incident. It helps the organization better understand its ability to detect and fix root causes. As troubleshooting makes up a significant part of the resolution lifecycle, teams can adopt sophisticated observability tools to help engineers uncover root causes faster.
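The three durations above are all measured from the start of the incident, so they can be derived from four timestamps. The following is a minimal sketch; the function name `incident_durations` and the sample timestamps are made up for illustration, and it assumes that detection, mitigation, and resolution happen in that order.

```python
from datetime import datetime, timedelta


def incident_durations(started, detected, mitigated, resolved):
    """Return (TTD, TTM, TTR), each measured from the start of the incident."""
    if not (started <= detected <= mitigated <= resolved):
        raise ValueError("timestamps must be in chronological order")
    return detected - started, mitigated - started, resolved - started


# Illustrative timestamps, not real incident data.
start = datetime(2024, 1, 13, 10, 0)
ttd, ttm, ttr = incident_durations(
    started=start,
    detected=start + timedelta(minutes=4),
    mitigated=start + timedelta(minutes=25),
    resolved=start + timedelta(hours=3),
)
print("TTD:", ttd, "TTM:", ttm, "TTR:", ttr)
```

Because all three values share the same starting point, TTD can never exceed TTM, and TTM can never exceed TTR, which is a useful sanity check on recorded incident data.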

Key Incident Metadata

Incident metadata includes the number of incidents, root cause type, services impacted, root cause service, and detection method. It helps the organization identify the TBF (Time Between Failures), and the goal of the organization is to increase the Mean Time Between Failures. Analyzing this metadata helps identify the hot spots in the operational aspect of the organization.
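As a rough illustration of Mean Time Between Failures, the sketch below averages the gaps between consecutive incident start times. This is only one possible convention, chosen here as an assumption; other definitions subtract repair time or divide total uptime by the number of failures. The function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(start_times):
    """Average gap between consecutive incident start times."""
    ordered = sorted(start_times)
    if len(ordered) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Illustrative incident start dates: the gaps are 10 days and 20 days.
starts = [datetime(2024, 1, 1), datetime(2024, 1, 11), datetime(2024, 1, 31)]
mtbf = mean_time_between_failures(starts)
print("MTBF:", mtbf)
```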

Availability of Services

Service availability is the percentage of uptime of service over a period of time. The availability metric is used as a quantitative measure of resiliency.  
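The percentage described above is uptime divided by the length of the measurement period. A minimal sketch, assuming downtime is tracked as a total duration per period; the function name and the sample figures are illustrative only.

```python
from datetime import timedelta


def availability_percent(period, downtime):
    """Uptime as a percentage of the measurement period."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if not timedelta(0) <= downtime <= period:
        raise ValueError("downtime must be between zero and the period")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * ((period - downtime) / period)


# Illustrative figures: 43 minutes 12 seconds of downtime in a 30-day period.
value = availability_percent(timedelta(days=30), timedelta(minutes=43, seconds=12))
print(f"{value:.3f}%")
```

In this example the downtime is one thousandth of the period, so the result is roughly 99.9 percent, often called "three nines".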

Conclusion

We learned that incident resolution and problem management should be thought of as two separate and sometimes competing processes.

This article discussed the incident management process and showed how it can help organizations manage chaos and resolve incidents faster. Incident management frameworks come in various flavors, but the ideas presented here are generic enough to customize and adapt in organizations of any size.

Organizations planning to introduce an incident management framework can start small by collecting data around incidents. This data will help them understand the inefficiencies in the current system, or the lack thereof, and provide comparative data to measure the progress of the new incident management process about to be introduced. Once they have a better sense of the requirements, they can start with a basic framework that suits the organization's size without creating additional overhead. As needed, they can introduce other steps or tools into the process.

Organizations looking to improve their current incident management process must take a deliberate "test, measure, tweak, and repeat" approach. The focus should be on identifying what is broken in the current process, making incremental changes, and measuring the results. Start small and build from there.