
Unresolved Incidents


The Unresolved Incidents Explorer centrally displays all incident records in the workspace that are currently at an alert level, giving users the full context of alert incidents and accelerating comprehension and awareness. By associating monitors, alert strategies, and related information, it also helps reduce alert fatigue.

The Unresolved Incidents data source aggregates event data at query time, using df_fault_id as the unique identifier, and displays the most recent result for each identifier. The Explorer visualizes a series of key data points, from incident level to the triggered threshold baseline. Incident level, duration, alert notifications, monitors, incident content, and historical trigger trend charts together form a comprehensive view, helping you analyze and understand incidents from different angles and make more informed response decisions.

Core Logic

| Mechanism | Description |
| --- | --- |
| Aggregation Dimension | Uses df_fault_id as the unique identifier, aggregating multiple event triggers for the same detection object into a single record. Multiple abnormal triggers from the same host under the same monitoring rule therefore appear as one aggregated record in the Unresolved Incidents list, preventing alert storms. |
| Time Window | By default, displays events with df_status != ok from the last 48 hours. You can adjust the time range; queries support data from at most the last 7 days. |
| Status Determination | Takes the level of the most recent triggered event for the detection object as the current display level. If the incident level changes while triggering (e.g., from warning to critical), the list shows the latest level. |
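The aggregation and status-determination rules above can be sketched in Python. This is an illustrative model only: the field names df_fault_id and df_status come from this document, while the event structure, timestamps, and function names are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical event records; df_fault_id and df_status are documented fields,
# the rest of the structure is illustrative.
events = [
    {"df_fault_id": "host-a:cpu", "df_status": "warning",  "ts": datetime(2024, 8, 20, 17, 53)},
    {"df_fault_id": "host-a:cpu", "df_status": "critical", "ts": datetime(2024, 8, 20, 17, 57)},
    {"df_fault_id": "host-b:mem", "df_status": "ok",       "ts": datetime(2024, 8, 20, 17, 55)},
]

def unresolved_incidents(events, now, window_hours=48):
    """Aggregate events by df_fault_id and keep the latest record per object."""
    cutoff = now - timedelta(hours=window_hours)
    latest = {}
    for e in events:
        if e["ts"] < cutoff:
            continue  # outside the default 48-hour window
        fid = e["df_fault_id"]
        if fid not in latest or e["ts"] > latest[fid]["ts"]:
            latest[fid] = e  # most recent event wins (status determination)
    # only incidents whose latest status is not ok remain unresolved
    return {fid: e for fid, e in latest.items() if e["df_status"] != "ok"}

now = datetime(2024, 8, 20, 18, 0)
print(unresolved_incidents(events, now))
```

Note how host-a:cpu collapses into a single record at its latest level (critical), while host-b:mem drops out because its most recent status is ok.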

Incident Card

Before clicking into an incident's details, the card presents key information in a structured way:

Incident Level

Based on the monitor's trigger condition configuration, incidents are classified into the following levels, in decreasing order of severity:

  • Fatal (fatal)
  • Critical (critical)
  • Important (error)
  • Warning (warning)
  • Data Gap (nodata)

In the Unresolved Incidents Explorer, the level of each incident is defined as the level when that detection object last triggered an event. This means if the same fault first triggered a warning and later escalated to critical, the card displays the level as critical.

For more details, refer to Incident Level Description.
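The severity ordering above can be expressed as a simple rank map, useful for sorting incident cards from most to least severe. The numeric ranks and function name here are illustrative assumptions, not platform internals.

```python
# Severity ranks reflecting the order listed in this document (lower = more severe).
SEVERITY = {"fatal": 0, "critical": 1, "error": 2, "warning": 3, "nodata": 4}

def sort_by_severity(incidents):
    """Order incident cards from most to least severe."""
    return sorted(incidents, key=lambda i: SEVERITY[i["level"]])

cards = [{"level": "warning"}, {"level": "fatal"}, {"level": "error"}]
print([c["level"] for c in sort_by_severity(cards)])  # most severe first
```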

Incident Title

The incident title displayed in the Unresolved Incidents Explorer comes directly from the title set during the monitor rule configuration. It represents the title used when that detection object last triggered an event. Titles typically contain key variable substitutions, such as hostname, metric value, etc., facilitating quick identification of the problematic object.

Duration

Indicates the time from when the current detection object first triggered an abnormal event to the end time of the currently selected time range, e.g., 5 minutes (08/20 17:53:00 ~ 17:57:38). The duration reflects how long the fault has persisted and is an important basis for assessing impact scope and urgency. If it exceeds the expected recovery time, prioritize handling or escalate.
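A minimal sketch of the duration calculation, using the example timestamps from this section. Rounding to the nearest minute is an assumption made here so the output matches the document's example; the actual display logic may differ.

```python
from datetime import datetime

def incident_duration(first_trigger, window_end):
    """Duration from the first abnormal trigger to the end of the time range."""
    delta = window_end - first_trigger
    minutes = round(delta.total_seconds() / 60)  # assumed rounding, for illustration
    span = f"{first_trigger:%m/%d %H:%M:%S} ~ {window_end:%H:%M:%S}"
    return f"{minutes} minutes ({span})"

print(incident_duration(datetime(2024, 8, 20, 17, 53, 0),
                        datetime(2024, 8, 20, 17, 57, 38)))
# → 5 minutes (08/20 17:53:00 ~ 17:57:38)
```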

Alert Notification

The alert notification status for the last triggered event of the current detection object, indicating whether the incident has reached relevant personnel. It mainly includes the following three statuses:

  • Mute: The current incident is covered by a mute rule; the event is recorded within the system, but no external alert notification is sent. Suitable for known issues or maintenance windows.
  • Identifier of the actual sent notification target: Includes DingTalk bot, WeCom bot, Lark bot, etc., indicating the alert has been successfully pushed to the corresponding channel.
  • -: No external alert notification was triggered. Possible reasons include no notification configured for the monitor, notification target failure, or exceeding notification frequency limits.

Monitor Detection Type

Refers to the monitor type, identifying which detection rule triggered this event, such as threshold detection, log detection, anomaly detection, etc. The detection type allows quick location of the monitor configuration entry for rule adjustment or temporary disabling.

Detection Object

When configuring monitor rules, if the detection metric uses a by-group query, the event card displays the group's filter condition, e.g., source:kodo-servicemap. This indicates the event is the detection result for a specific dimension group rather than a global aggregation. Clicking the detection object tag quickly filters other events with the same dimension.
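The tag-based filtering described above amounts to matching a dimension key/value pair across incident cards. The data structures and function name below are hypothetical; only the source:kodo-servicemap example comes from the document.

```python
def filter_by_dimension(incidents, key, value):
    """Return incidents whose detection object carries the same dimension tag."""
    return [i for i in incidents if i["dimensions"].get(key) == value]

# Hypothetical incident cards with dimension tags from by-group queries.
cards = [
    {"title": "High latency", "dimensions": {"source": "kodo-servicemap"}},
    {"title": "CPU spike",    "dimensions": {"host": "web-01"}},
]
print([c["title"] for c in filter_by_dimension(cards, "source", "kodo-servicemap")])
```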

Incident Content

The incident content from the last triggered event of the current detection object, sourced from the pre-configured content in the monitor rule. It represents the incident content when that detection object last triggered an event. The content typically includes:

  • Specific metric value at trigger time
  • Comparison result with the threshold
  • Pre-set troubleshooting suggestions or handling guidance
  • Complete description after variable substitution

Historical Trigger Trend Chart

This trend is computed with a window function, showing the historical trend of the detection result value over the last 60 actual data checks.

Based on the detection result value of the current unresolved incident, the chart displays the historical anomaly trend. The trigger threshold configured in the monitor detection rule is drawn as a clear reference line, and the system marks the detection result of the last triggered event for the current detection object. A vertical line in the trend chart lets you quickly locate the exact time the event was triggered, and the corresponding detection interval for that result is shown alongside, providing an intuitive tool for assessing the event's development and impact.
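The windowing behind the trend chart can be sketched as keeping the last 60 detection results and flagging which ones breach the configured threshold. The readings, threshold value, and comparison direction below are illustrative assumptions.

```python
def trend_window(results, threshold, window=60):
    """Keep the last `window` detection results and flag threshold breaches."""
    recent = results[-window:]
    return [(value, value >= threshold) for value in recent]

# 70 hypothetical CPU readings; only the last 60 feed the trend chart.
readings = [50.0] * 65 + [91.0, 95.0, 88.0, 97.0, 99.0]
points = trend_window(readings, threshold=90.0)
print(len(points), sum(1 for _, breached in points if breached))  # → 60 4
```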

Managing Incident Cards

Display Items

The Unresolved Incidents list supports the following display styles to accommodate different information density needs across scenarios:

  • Standard: Displays incident title, detection dimension, and incident content.
  • Extended: In addition to standard information, also displays the historical trend of the detection result value for unresolved incidents.
  • List: Displays incident data in list form, with customizable fields.

Muting Incidents

In large-scale monitoring scenarios, manually handling a large number of similar alerts is cumbersome, time-consuming, and error-prone. Instead, you can Mute rules directly on the current page. During the mute period, events continue to be detected and recorded, but no alert notifications are sent. This suits the following scenarios:

  • Known issues are being fixed, requiring temporary noise reduction.
  • Planned maintenance windows, expected anomalies.
  • Batch alerts for non-production environments or low-priority systems.

Mute Operation Steps

  1. Hover over a single incident, click Mute on the right side.
  2. Select the Mute Time Type.
  3. Confirm.

Mute Time Type

Supports customizing the mute start time and end time, or quickly setting the duration to 1 hour, 6 hours, 12 hours, 1 day, or 1 week.


  1. Select the mute start time and duration.
  2. Select the mute cycle starting from a certain point in time.
  3. Select the mute expiration time: repeat forever according to the time above, or repeat until a specific moment.
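The mute time types above can be modeled as resolving a mute period from either a quick-set preset or an explicit end time. The preset keys and function name are assumptions for illustration; only the durations come from the document.

```python
from datetime import datetime, timedelta

# Quick-set presets listed in the document; the keys are illustrative.
PRESETS = {"1h": timedelta(hours=1), "6h": timedelta(hours=6),
           "12h": timedelta(hours=12), "1d": timedelta(days=1),
           "1w": timedelta(weeks=1)}

def mute_window(start, preset=None, end=None):
    """Resolve a mute period from either a preset duration or an explicit end."""
    if preset is not None:
        return start, start + PRESETS[preset]
    return start, end  # custom start/end times

start = datetime(2024, 8, 20, 18, 0)
print(mute_window(start, preset="6h"))
```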

Recovering Incidents

An incident is considered recovered when its status returns to normal (df_sub_status = ok). Recovery means the detection object no longer meets the monitor's abnormal trigger conditions, or has been manually confirmed as resolved.

  • Single Recovery: Recover an incident via the button on the right side of the rule, through the Monitor settings, or by manual recovery.
  • Batch Recovery: Click "Recover All at Once" to recover all abnormal incidents in the current list.

Recovered incidents are divided into four types:

| Name | df_status | Description |
| --- | --- | --- |
| Recovery | ok | Previously detected "Fatal", "Critical", "Important", or "Warning" abnormal incidents. If not triggered again within N checks, the incident is considered recovered. This is the most common automatic recovery type, indicating the metric has returned to the normal range. |
| Data Gap Recovery | ok | Data stopped reporting and then resumed; judged as recovery. |
| Data Gap Treated as Recovery | ok | A detection data gap occurs and is treated as normal status. |
| Manual Recovery | ok | The user manually clicks to recover; supports single/batch recovery. |
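The automatic "Recovery" rule above (no re-trigger within N checks) can be sketched as follows. The value of N is configured on the monitor and left unspecified in this document, so the default here is purely illustrative, as are the status lists.

```python
def is_recovered(recent_checks, n=3):
    """Treat an incident as recovered when the last n checks are all ok.

    `recent_checks` is a list of statuses ordered oldest to newest; n stands in
    for the monitor's recovery-check count (unspecified in the document).
    """
    return len(recent_checks) >= n and all(s == "ok" for s in recent_checks[-n:])

print(is_recovered(["critical", "ok", "ok", "ok"]))   # recovered
print(is_recovered(["critical", "ok", "critical"]))   # still abnormal
```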

Further Reading