
Unresolved Incidents


The Unresolved Incidents Explorer centrally displays all incident records in the workspace that are currently at an alert level, giving users the full context of alert incidents and accelerating comprehension and awareness. By associating monitors, alert strategies, and related information, it also helps reduce alert fatigue.

The Unresolved Incidents data source aggregates event data at query time, using df_fault_id as the unique identifier, and displays the most recent result for each identifier. The Explorer visualizes a series of key data points, from incident level to the triggered threshold baseline. Incident level, duration, alert notifications, monitors, incident content, and historical trigger trend charts together form a comprehensive view, helping you analyze and understand incidents from different angles and make more informed response decisions.

Core Logic

| Mechanism | Description |
| --- | --- |
| Aggregation Dimension | Uses df_fault_id as the unique identifier, aggregating multiple event triggers for the same detection object into a single record. Multiple abnormal triggers from the same host under the same monitoring rule therefore appear as one aggregated record in the Unresolved Incidents list, preventing alert storms. |
| Time Window | By default, displays events with df_status != ok from the last 48 hours. You can adjust the time range; queries support data from at most the last 7 days. |
| Status Determination | Takes the level of the most recent triggered event for the detection object as the current display level. If the incident level changes while triggering (e.g., from warning to critical), the list shows the latest level. |
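The aggregation and status-determination rules above can be sketched in Python. This is an illustrative model only: the field names df_fault_id and df_status come from this document, while the event structure, timestamps, and function names are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical event records; df_fault_id and df_status are documented fields,
# the rest of the structure is illustrative.
events = [
    {"df_fault_id": "host-a:cpu", "df_status": "warning",  "ts": datetime(2024, 8, 20, 17, 53)},
    {"df_fault_id": "host-a:cpu", "df_status": "critical", "ts": datetime(2024, 8, 20, 17, 57)},
    {"df_fault_id": "host-b:mem", "df_status": "ok",       "ts": datetime(2024, 8, 20, 17, 55)},
]

def unresolved_incidents(events, now, window_hours=48):
    """Aggregate events by df_fault_id and keep the latest record per object."""
    cutoff = now - timedelta(hours=window_hours)
    latest = {}
    for e in events:
        if e["ts"] < cutoff:
            continue  # outside the default 48-hour window
        fid = e["df_fault_id"]
        if fid not in latest or e["ts"] > latest[fid]["ts"]:
            latest[fid] = e  # most recent event wins (status determination)
    # only incidents whose latest status is not ok remain unresolved
    return {fid: e for fid, e in latest.items() if e["df_status"] != "ok"}

now = datetime(2024, 8, 20, 18, 0)
print(unresolved_incidents(events, now))
```

Note how host-a:cpu collapses into a single record at its latest level (critical), while host-b:mem drops out because its most recent status is ok.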

Incident Card

Before clicking into an incident's details, the card presents key information in a structured way:

Incident Level

Based on the monitor's trigger condition configuration, incidents are classified into the following levels, in decreasing order of severity:

  • Fatal (fatal)
  • Critical (critical)
  • Important (error)
  • Warning (warning)
  • Data Gap (nodata)

In the Unresolved Incidents Explorer, the level of each incident is defined as the level when that detection object last triggered an event. This means if the same fault first triggered a warning and later escalated to critical, the card displays the level as critical.

For more details, refer to Incident Level Description.
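The severity ordering above can be expressed as a simple rank map, useful for sorting incident cards from most to least severe. The numeric ranks and function name here are illustrative assumptions, not platform internals.

```python
# Severity ranks reflecting the order listed in this document (lower = more severe).
SEVERITY = {"fatal": 0, "critical": 1, "error": 2, "warning": 3, "nodata": 4}

def sort_by_severity(incidents):
    """Order incident cards from most to least severe."""
    return sorted(incidents, key=lambda i: SEVERITY[i["level"]])

cards = [{"level": "warning"}, {"level": "fatal"}, {"level": "error"}]
print([c["level"] for c in sort_by_severity(cards)])  # most severe first
```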

Incident Title

The incident title displayed in the Unresolved Incidents Explorer comes directly from the title set during the monitor rule configuration. It represents the title used when that detection object last triggered an event. Titles typically contain key variable substitutions, such as hostname, metric value, etc., facilitating quick identification of the problematic object.

Duration

Indicates the time from when the current detection object first triggered an abnormal event to the end time of the currently selected time range, e.g., 5 minutes (08/20 17:53:00 ~ 17:57:38). The duration reflects how long the fault has persisted and is an important basis for assessing impact scope and urgency. If it exceeds the expected recovery time, prioritize handling or escalate.
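A minimal sketch of the duration calculation, using the example timestamps from this section. Rounding to the nearest minute is an assumption made here so the output matches the document's example; the actual display logic may differ.

```python
from datetime import datetime

def incident_duration(first_trigger, window_end):
    """Duration from the first abnormal trigger to the end of the time range."""
    delta = window_end - first_trigger
    minutes = round(delta.total_seconds() / 60)  # assumed rounding, for illustration
    span = f"{first_trigger:%m/%d %H:%M:%S} ~ {window_end:%H:%M:%S}"
    return f"{minutes} minutes ({span})"

print(incident_duration(datetime(2024, 8, 20, 17, 53, 0),
                        datetime(2024, 8, 20, 17, 57, 38)))
# → 5 minutes (08/20 17:53:00 ~ 17:57:38)
```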

Alert Notification

The alert notification status for the last triggered event of the current detection object, indicating whether the incident has reached relevant personnel. It mainly includes the following three statuses:

  • Mute: The current incident is covered by a mute rule; the event is recorded within the system, but no external alert notification is sent. Suitable for known issues or maintenance windows.
  • Identifier of the actual sent notification target: Includes DingTalk bot, WeCom bot, Lark bot, etc., indicating the alert has been successfully pushed to the corresponding channel.
  • -: No external alert notification was triggered. Possible reasons include no notification configured for the monitor, notification target failure, or exceeding notification frequency limits.

Monitor Detection Type

Refers to the monitor type, identifying which detection rule triggered this event, such as threshold detection, log detection, anomaly detection, etc. The detection type allows quick location of the monitor configuration entry for rule adjustment or temporary disabling.

Detection Object

When configuring monitor rules, if the detection metric uses a by-group query, the event card displays the group's filter condition, e.g., source:kodo-servicemap. This indicates the event is the detection result for a specific dimension group rather than a global aggregation. Clicking the detection object tag quickly filters other events with the same dimension.
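The tag-based filtering described above amounts to matching a dimension key/value pair across incident cards. The data structures and function name below are hypothetical; only the source:kodo-servicemap example comes from the document.

```python
def filter_by_dimension(incidents, key, value):
    """Return incidents whose detection object carries the same dimension tag."""
    return [i for i in incidents if i["dimensions"].get(key) == value]

# Hypothetical incident cards with dimension tags from by-group queries.
cards = [
    {"title": "High latency", "dimensions": {"source": "kodo-servicemap"}},
    {"title": "CPU spike",    "dimensions": {"host": "web-01"}},
]
print([c["title"] for c in filter_by_dimension(cards, "source", "kodo-servicemap")])
```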

Incident Content

The incident content from the last triggered event of the current detection object, sourced from the pre-configured content in the monitor rule. It represents the incident content when that detection object last triggered an event. The content typically includes:

  • Specific metric value at trigger time
  • Comparison result with the threshold
  • Pre-set troubleshooting suggestions or handling guidance
  • Complete description after variable substitution

Historical Trigger Trend Chart

This trend is computed with a window function, showing the historical trend of the detection result value over the last 60 actual data checks.

Based on the detection result value of the current unresolved incident, the chart displays the historical anomaly trend. The trigger threshold configured in the monitor detection rule is drawn as a clear reference line, and the system marks the detection result of the last triggered event for the current detection object. A vertical line in the trend chart lets you quickly locate the exact time the event was triggered, and the corresponding detection interval for that result is shown alongside, providing an intuitive tool for assessing the event's development and impact.
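The windowing behind the trend chart can be sketched as keeping the last 60 detection results and flagging which ones breach the configured threshold. The readings, threshold value, and comparison direction below are illustrative assumptions.

```python
def trend_window(results, threshold, window=60):
    """Keep the last `window` detection results and flag threshold breaches."""
    recent = results[-window:]
    return [(value, value >= threshold) for value in recent]

# 70 hypothetical CPU readings; only the last 60 feed the trend chart.
readings = [50.0] * 65 + [91.0, 95.0, 88.0, 97.0, 99.0]
points = trend_window(readings, threshold=90.0)
print(len(points), sum(1 for _, breached in points if breached))  # → 60 4
```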

Managing Incident Cards

Display Items

The Unresolved Incidents list supports the following display styles to accommodate different information density needs across scenarios:

  • Standard: Displays incident title, detection dimension, and incident content.
  • Extended: In addition to standard information, also displays the historical trend of the detection result value for unresolved incidents.
  • List: Displays incident data in list form, with customizable fields.

Muting Incidents

In large-scale monitoring scenarios, manually handling a large number of similar alerts is cumbersome, time-consuming, and error-prone. Instead, you can Mute rules directly on the current page. During the mute period, events continue to be detected and recorded, but no alert notifications are sent. This suits the following scenarios:

  • Known issues are being fixed, requiring temporary noise reduction.
  • Planned maintenance windows, expected anomalies.
  • Batch alerts for non-production environments or low-priority systems.

Mute Operation Steps

  1. Hover over a single incident, click Mute on the right side.
  2. Select the Mute Time Type.
  3. Confirm.

Mute Time Type

Supports customizing the mute start time and end time, or quickly setting the duration to 1 hour, 6 hours, 12 hours, 1 day, or 1 week.


  1. Select the mute start time and duration.
  2. Select the mute cycle starting from a certain point in time.
  3. Select the mute expiration time: repeat forever according to the time above, or repeat until a specific moment.
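The mute time types above can be modeled as resolving a mute period from either a quick-set preset or an explicit end time. The preset keys and function name are assumptions for illustration; only the durations come from the document.

```python
from datetime import datetime, timedelta

# Quick-set presets listed in the document; the keys are illustrative.
PRESETS = {"1h": timedelta(hours=1), "6h": timedelta(hours=6),
           "12h": timedelta(hours=12), "1d": timedelta(days=1),
           "1w": timedelta(weeks=1)}

def mute_window(start, preset=None, end=None):
    """Resolve a mute period from either a preset duration or an explicit end."""
    if preset is not None:
        return start, start + PRESETS[preset]
    return start, end  # custom start/end times

start = datetime(2024, 8, 20, 18, 0)
print(mute_window(start, preset="6h"))
```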

Recovering Incidents

An incident is considered recovered when its status returns to normal (df_sub_status = ok). Recovery means the detection object no longer meets the monitor's abnormal trigger conditions, or has been manually confirmed as resolved.

  • Single Recovery: Recover an incident via the button on the right side of the rule, through the Monitor settings, or by manual recovery.
  • Batch Recovery: Click "Recover All at Once" to recover all abnormal incidents in the current list.

Recovered incidents are divided into four types:

| Name | df_status | Description |
| --- | --- | --- |
| Recovery | ok | Previously detected "Fatal", "Critical", "Important", or "Warning" abnormal incidents. If not triggered again within N checks, the incident is considered recovered. This is the most common automatic recovery type, indicating the metric has returned to the normal range. |
| Data Gap Recovery | ok | Data stopped reporting and then resumed; judged as recovery. |
| Data Gap Treated as Recovery | ok | A detection data gap occurs and is treated as normal status. |
| Manual Recovery | ok | The user manually clicks to recover; supports single/batch recovery. |
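The automatic "Recovery" rule above (no re-trigger within N checks) can be sketched as follows. The value of N is configured on the monitor and left unspecified in this document, so the default here is purely illustrative, as are the status lists.

```python
def is_recovered(recent_checks, n=3):
    """Treat an incident as recovered when the last n checks are all ok.

    `recent_checks` is a list of statuses ordered oldest to newest; n stands in
    for the monitor's recovery-check count (unspecified in the document).
    """
    return len(recent_checks) >= n and all(s == "ok" for s in recent_checks[-n:])

print(is_recovered(["critical", "ok", "ok", "ok"]))   # recovered
print(is_recovered(["critical", "ok", "critical"]))   # still abnormal
```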

Further Reading