Incident Center
The Incident Center is a core module of the TrueWatch platform for managing incidents, that is, abnormal events automatically discovered by system monitoring. It provides a standardized handling process that covers the entire incident lifecycle, from discovery through resolution to post-mortem review.
What is an Incident
In TrueWatch, an Incident is an abnormal system event that is automatically detected and generated by the monitors you have configured. When a monitor detects out-of-range metrics, log errors, or application performance degradation, it triggers the creation of an incident record.
Incidents have the following core characteristics:
- Incidents are created entirely by monitoring rules; no manual step is needed to open one.
- The system automatically merges repeated alerts that share a root cause and arrive within a short period into a single incident, which suppresses alert storms.
- Every step, from discovery through handling to review, is traceable.
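The merging behavior described in the list above can be illustrated with a minimal sketch. It assumes alerts are grouped by a hypothetical `fingerprint` string and a fixed time window; the class names, the fingerprint format, and the 5-minute default are illustrative assumptions, not TrueWatch's actual grouping logic.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    fingerprint: str      # hypothetical grouping key, e.g. monitor name plus host
    first_seen: float     # Unix timestamp of the first alert
    last_seen: float      # Unix timestamp of the most recent alert
    alert_count: int = 1


class IncidentMerger:
    """Folds repeated alerts with the same fingerprint into one incident."""

    def __init__(self, window_seconds: float = 300.0):
        # The 5-minute window is an assumed value, not a documented default.
        self.window_seconds = window_seconds
        self._latest = {}  # fingerprint -> most recent Incident

    def ingest(self, fingerprint: str, timestamp: float):
        """Return (incident, created) for an incoming alert."""
        current = self._latest.get(fingerprint)
        if current is not None and timestamp - current.last_seen <= self.window_seconds:
            # Same fingerprint inside the window: merge instead of creating a new incident.
            current.last_seen = timestamp
            current.alert_count += 1
            return current, False
        incident = Incident(fingerprint, timestamp, timestamp)
        self._latest[fingerprint] = incident
        return incident, True
```

In this sketch, two alerts for the same fingerprint that arrive two minutes apart end up in one incident with `alert_count == 2`, while an alert that arrives after the window has lapsed starts a new incident.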
Where Do Incidents Come From
Incidents are generated solely by the monitors you have configured in advance. These monitoring rules continuously inspect your infrastructure, applications, logs, and other data. When any data falls outside the normal range defined by a rule, it is judged "abnormal" and triggers the incident process.
For information on how to configure monitors to generate incidents, please refer to the Monitor Configuration Guide.
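As a rough illustration of how a rule turns out-of-range data into an incident, the sketch below evaluates a single upper-bound threshold. `ThresholdRule`, `evaluate`, and the metric name are hypothetical and do not reflect the TrueWatch monitor configuration format, which is covered in the Monitor Configuration Guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdRule:
    name: str
    metric: str
    upper_bound: float  # values above this are treated as abnormal


def evaluate(rule: ThresholdRule, samples: dict) -> Optional[str]:
    """Return an incident title if the rule is violated, otherwise None."""
    value = samples.get(rule.metric)
    if value is None:
        # No data for this metric; this sketch does not treat missing data as abnormal.
        return None
    if value > rule.upper_bound:
        return f"{rule.name}: {rule.metric}={value} exceeds {rule.upper_bound}"
    return None
```

Real monitors typically evaluate windows of data rather than a single sample; the single-sample check here only keeps the example short.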
Core Concepts
The Incident Center is built around the following three core concepts, which together form a complete incident response system:
| Concept | Description | Key Mechanism |
|---|---|---|
| Incident | An abnormal event that requires human handling. | Status flow: Open (unhandled) → Working (in progress) → Resolved (recovered) → Closed |
| On-call | The person or team responsible for receiving incident notifications. | Incidents are routed automatically by tag matching, so notifications reach the right person immediately. |
| Escalation | The mechanism that takes over when an incident is not handled promptly. | When an incident receives no response within the time limit, the system automatically notifies additional personnel or managers, level by level, to avoid handling delays. |
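Tag-based routing, as described for the On-call concept, might look like the following sketch. The rule structure, the "first matching rule wins" order, and the fallback team are assumptions made for illustration; the source does not specify how TrueWatch resolves multiple matches.

```python
from dataclasses import dataclass


@dataclass
class OnCallRule:
    team: str
    match_tags: dict  # every key/value pair here must be present on the incident


def route(incident_tags: dict, rules: list, default_team: str = "default-oncall") -> str:
    """Return the team of the first rule whose tags all match the incident."""
    for rule in rules:
        if all(incident_tags.get(key) == value for key, value in rule.match_tags.items()):
            return rule.team
    # Assumed fallback so that an unmatched incident still reaches someone.
    return default_team
```

Listing the most specific rules first keeps a broad rule (for example, one that matches only on `env`) from shadowing a narrower one.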
How to Manage Incidents
All incidents are automatically aggregated into the Incident Center for unified management. From here you can perform every operation, from viewing and handling to review.
1. View the Incident List
On the Incident List page, you can browse all incidents in one place and quickly grasp the overall status. The list supports filtering by multiple dimensions, such as status, severity, and assignee, which helps you prioritize the most critical incidents.
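The same kind of filtering can be sketched over an in-memory list of incidents. The field names, the severity labels, and their ordering below are assumptions for illustration and are not taken from the TrueWatch data model.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed severity labels, most urgent first.
SEVERITY_ORDER = {"critical": 0, "error": 1, "warning": 2, "info": 3}


@dataclass
class IncidentRow:
    id: str
    status: str
    severity: str
    assignee: Optional[str]


def filter_incidents(rows, status=None, severity=None, assignee=None):
    """Keep rows matching every filter that is set, most severe first."""
    selected = [
        row for row in rows
        if (status is None or row.status == status)
        and (severity is None or row.severity == severity)
        and (assignee is None or row.assignee == assignee)
    ]
    # Unknown severity labels sort after all known ones.
    return sorted(selected, key=lambda row: SEVERITY_ORDER.get(row.severity, len(SEVERITY_ORDER)))
```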
2. Dive into Incident Details
Clicking on any incident will take you to its Details page. This is the core workspace for resolving incidents, providing you with three key capabilities:
- Complete Context: The system automatically associates and displays end-to-end data related to the incident, including performance metrics, error logs, trace spans, and infrastructure topology, so you do not need to search across multiple modules manually.
- Visualized Impact Scope: Based on data from the last 2 hours, it visualizes the incident's impact scope, helping you quickly assess the severity and reach of the problem.
- Collaboration Timeline: All status changes, assignee handovers, team discussions, and key operations are automatically recorded by the system, forming a complete, auditable handling timeline for easy post-mortem review.
3. Follow the Handling Process
Incident handling follows a standardized process so that each step is clear and controllable:
Step 1: Automated Notification and Response
The system automatically notifies the primary responsible person based on preset On-call Rules. If the incident is not handled within the specified time limit, it will automatically notify subsequent personnel or teams according to the Escalation Policy, forming a multi-level response guarantee to ensure no incident is missed.
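A multi-level escalation policy of this kind can be sketched as a list of time thresholds. The level structure, the recipient names, and the rule that acknowledgement stops escalation are assumptions; the actual Escalation Policy settings in TrueWatch may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: float  # minutes since the incident opened
    notify: tuple         # recipients added once this level is reached


def recipients_due(levels, minutes_open: float, acknowledged: bool) -> list:
    """Return everyone who should have been notified by now, in escalation order."""
    if acknowledged:
        # Assumption: once someone responds, no further escalation happens.
        return []
    due = []
    for level in sorted(levels, key=lambda lvl: lvl.after_minutes):
        if minutes_open >= level.after_minutes:
            for recipient in level.notify:
                if recipient not in due:
                    due.append(recipient)
    return due
```

With levels at 0, 15, and 30 minutes, an unacknowledged incident that has been open for 20 minutes has reached the first two levels only.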
Step 2: Analysis and Root Cause Identification Based on Aggregated Information
The assignee reviews all associated data on the Incident Details page and uses this aggregated view to identify the root cause, without switching between multiple tools.
Step 3: Standardized Process Tracking
The handling process must follow the standard status flow (Open → Working → Resolved → Closed). All key operations and team communications are automatically recorded by the system, so the entire process is traceable and responsibility for each action is clear.
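The status flow and its audit trail can be modeled as a small state machine. This sketch assumes the flow is strictly linear, with no reopening or skipping of states, which the source does not confirm; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field

# Each status maps to the only status allowed to follow it (assumed strictly linear).
NEXT_STATUS = {"Open": "Working", "Working": "Resolved", "Resolved": "Closed"}


class InvalidTransition(ValueError):
    """Raised when a status change does not follow the standard flow."""


@dataclass
class TrackedIncident:
    status: str = "Open"
    timeline: list = field(default_factory=list)  # (actor, old_status, new_status) entries

    def advance(self, new_status: str, actor: str) -> None:
        """Move to the next status and record who made the change."""
        if NEXT_STATUS.get(self.status) != new_status:
            raise InvalidTransition(f"{self.status} -> {new_status} is not allowed")
        self.timeline.append((actor, self.status, new_status))
        self.status = new_status
```

Recording the actor alongside each transition is what makes the timeline usable for post-mortem review: every status change can be traced back to a person.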