Unrecovered Incidents


The Unrecovered Incidents Explorer displays, in one place, all alert-level incident records in the current workspace. It gives users the full context of each alert incident, speeds up the recognition and understanding of incidents, and reduces alert fatigue by associating incidents with their monitors and alert strategies.

The Unrecovered Incidents data source queries incident data, aggregates it using df_fault_id as the unique identifier, and displays the most recent results. The Explorer visualizes the key data points of each incident, from its level to the triggered threshold baseline. Incident level, duration, alert notifications, monitors, incident content, and the historical trigger trend chart together form a comprehensive view, so you can analyze incidents from different perspectives and make more informed response decisions.
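The aggregation described above can be illustrated with a minimal Python sketch. It is not the product's implementation: only df_fault_id comes from this page, while the "time" field (an epoch timestamp) and the "level" field are hypothetical names used for illustration.

```python
from typing import Iterable


def latest_per_fault(records: Iterable[dict]) -> list[dict]:
    """Keep only the most recent record for each df_fault_id."""
    latest: dict[str, dict] = {}
    for rec in records:
        fault_id = rec["df_fault_id"]
        # "time" is a hypothetical timestamp field; larger means newer.
        if fault_id not in latest or rec["time"] > latest[fault_id]["time"]:
            latest[fault_id] = rec
    # Return the newest incidents first.
    return sorted(latest.values(), key=lambda r: r["time"], reverse=True)


# Hypothetical sample data: two records share the fault id "a".
records = [
    {"df_fault_id": "a", "time": 1, "level": "warning"},
    {"df_fault_id": "a", "time": 3, "level": "critical"},
    {"df_fault_id": "b", "time": 2, "level": "error"},
]
cards = latest_per_fault(records)
```

In this sketch, each returned record corresponds to one incident card, and its level is the level of the last trigger for that df_fault_id.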

Incident Card

Incident Level

Based on the trigger condition configuration of the monitor, the following status statistics are generated:

  • Fatal (fatal)
  • Critical (critical)
  • Error (error)
  • Warning (warning)
  • No Data (nodata)

In the Unrecovered Incidents Explorer, the level of each incident is defined as the level at which the detection object last triggered the incident.

For more details, refer to Incident Level Description.

Incident Title

The incident title displayed in the Unrecovered Incidents Explorer comes directly from the title set during the monitor rule configuration. It represents the title used when the detection object last triggered the incident.

Duration

Indicates the time from when the current detection object first triggered an anomaly and generated an incident until the end time of the current time widget, e.g., 5 minutes (08/20 17:53:00 ~ 17:57:38).
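As a rough illustration of how such a duration label could be derived, the sketch below takes the first trigger time and the end of the selected time range and formats them like the example above. The function name and the rounding to the nearest minute are assumptions; the page does not state how the product rounds.

```python
from datetime import datetime


def format_duration(first_triggered: datetime, window_end: datetime) -> str:
    """Format the time between the first trigger and the end of the time range."""
    seconds = (window_end - first_triggered).total_seconds()
    if seconds < 0:
        raise ValueError("window_end must not be earlier than first_triggered")
    # Assumption: the label is rounded to the nearest whole minute.
    minutes = round(seconds / 60)
    unit = "minute" if minutes == 1 else "minutes"
    return (
        f"{minutes} {unit} "
        f"({first_triggered:%m/%d %H:%M:%S} ~ {window_end:%H:%M:%S})"
    )


label = format_duration(
    datetime(2024, 8, 20, 17, 53, 0),   # the year is arbitrary here
    datetime(2024, 8, 20, 17, 57, 38),
)
```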

Alert Notification

Shows the alert notification status at the time the current detection object last triggered the incident. There are three possible statuses:

  • Mute: The current incident is affected by a mute rule, and no alert notification was sent externally.
  • Identifier of the actual notified Notification Target: For example, a DingTalk bot, WeCom bot, or Lark bot.
  • -: No external alert notification was triggered.
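The three statuses can be thought of as a simple decision, sketched below. The function and its parameters are hypothetical; the page does not describe how the product stores the mute state or the notified targets.

```python
def notification_label(muted: bool, notified_targets: list[str]) -> str:
    """Return the text shown in the Alert Notification field (illustrative only)."""
    if muted:
        # A mute rule applies, so nothing was sent externally.
        return "Mute"
    if notified_targets:
        # Show the identifiers of the targets that were actually notified.
        return ", ".join(notified_targets)
    # No external alert notification was triggered.
    return "-"
```

For example, `notification_label(False, ["DingTalk bot"])` yields the target identifier, while `notification_label(False, [])` yields `-`.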

Monitor Detection Type

Refers to the monitor type.

Detection Object

If a group-by (by) query was used for the detection metric during monitor rule configuration, the incident card displays the corresponding filter condition, e.g., source:kodo-servicemap.

Incident Content

The incident content at the time the current detection object last triggered the incident. It comes from the preset content in the monitor rule configuration.

Historical Trigger Trend Chart

This trend is displayed using a Window function. The historical trend of the detection result value shows the actual data from the last 60 detections.

The chart shows the historical trend of incident anomalies based on the detection result values of the unrecovered incident. The threshold value configured in the monitor detection rule is drawn as a reference line, and the detection result from the last time the current detection object triggered the incident is specifically marked. Vertical lines in the trend chart let you quickly locate the time points at which incidents were triggered. The detection interval corresponding to this detection result is also displayed, which helps you evaluate how the incident developed and what impact it had.
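The data behind such a chart can be sketched as follows: keep the last 60 detection values and flag the points that reach the threshold. This is only an illustration. The function name is hypothetical, and the ">=" comparison is an assumption, since the actual operator depends on the trigger condition configured in the monitor.

```python
def build_trend(values: list[float], threshold: float, max_points: int = 60) -> list[dict]:
    """Return the most recent detection values, flagging threshold breaches."""
    recent = values[-max_points:]  # the last 60 detections by default
    return [
        # Assumption: a point counts as triggered when it reaches the threshold.
        {"value": v, "triggered": v >= threshold}
        for v in recent
    ]


# Hypothetical detection results: 100 values, threshold at 95.
trend = build_trend([float(i) for i in range(100)], threshold=95.0)
trigger_positions = [i for i, point in enumerate(trend) if point["triggered"]]
```

In a chart, the threshold would be drawn as the horizontal reference line and the flagged positions as vertical markers.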

Management Card

Display Items

The Unrecovered Incidents list supports the following display styles:

  • Standard: Displays the incident title, detection dimension, and incident content.
  • Extended: In addition to standard information, also displays the historical trend of the detection result value for the unrecovered incident.
  • List: Displays incident data in list form.

Mute Incident

In large-scale monitoring scenarios, manually handling a large number of similar alerts is tedious, time-consuming, and prone to omissions. To avoid this, you can mute the rule directly on the current page.

  1. Hover over a single incident and click Mute on the right side.
  2. Select the mute time type.
  3. Confirm.

Mute Time Type

Supports customizing the mute start time and end time, or quickly setting it to 1 hour, 6 hours, 12 hours, 1 day, or 1 week.


  1. Select the mute start time and duration.
  2. Select the mute cycle starting from a certain moment.
  3. Select the mute expiration time. You can choose to repeat forever according to the above time or repeat until a specific moment.
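The quick mute options (1 hour, 6 hours, 12 hours, 1 day, 1 week) amount to adding a fixed duration to the mute start time. The sketch below illustrates this; the names MUTE_PRESETS, mute_window, and is_muted are hypothetical, and treating the end time as exclusive is an assumption.

```python
from datetime import datetime, timedelta

# Quick presets listed on this page, mapped to durations.
MUTE_PRESETS = {
    "1 hour": timedelta(hours=1),
    "6 hours": timedelta(hours=6),
    "12 hours": timedelta(hours=12),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
}


def mute_window(start: datetime, preset: str) -> tuple[datetime, datetime]:
    """Return the (start, end) of a mute window for a quick preset."""
    return start, start + MUTE_PRESETS[preset]


def is_muted(now: datetime, start: datetime, end: datetime) -> bool:
    """Assumption: the window includes its start and excludes its end."""
    return start <= now < end


start, end = mute_window(datetime(2024, 8, 20, 18, 0, 0), "6 hours")
```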

Recover Incident

An incident is considered recovered when its status is normal (df_sub_status = ok).

  • To recover a single rule, use the button on the right side of the rule, go to the Monitor settings, or recover it manually.
  • If you click "Recover All", all abnormal incidents in the current list will be recovered, and you can choose whether to associate Issues.

Incident recovery is divided into four types:

Name | df_status | Description
Recovered | ok | If the previously detected "Critical", "Error", or "Warning" abnormal incidents are not triggered again within N detections, they are considered recovered.
No Data Recovered | ok | Data stopped being reported and then resumed reporting, judged as recovered.
No Data Considered Recovered | ok | Detection data interruption is considered a normal state.
Manually Recovered | ok | User manually clicks to recover, supports single/batch recovery.
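The "Recovered" rule (no abnormal trigger within N detections) can be sketched as a check over the most recent detection results. This is a simplified illustration, not the product's logic: the function name and the representation of detection results as a list of level strings are assumptions.

```python
ABNORMAL_LEVELS = {"critical", "error", "warning"}


def is_recovered(detection_levels: list[str], n: int) -> bool:
    """True if none of the last n detections triggered an abnormal level."""
    if n <= 0:
        raise ValueError("n must be a positive number of detections")
    if len(detection_levels) < n:
        # Not enough detections yet to decide.
        return False
    return all(level not in ABNORMAL_LEVELS for level in detection_levels[-n:])
```

For example, with N = 3, the history `["critical", "ok", "ok", "ok"]` counts as recovered, while `["critical", "ok", "warning", "ok"]` does not.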
