
Incident


The Incident Explorer centrally analyzes error data in APM. Through it, you can:

  • View error history trends: Track how the frequency of specific error types or sources changes over time, using charts such as Top List and Time Series;

  • Analyze error distribution: Quickly locate high-frequency error sources, such as service error rate and resource endpoint error rate;

  • Aggregate similar errors: Automatically group error requests that share the same exception stack or similar error characteristics, so you do not have to inspect individual traces one by one;

  • ...

Data Display

The Incident Explorer provides several analysis views, organized into lists and charts.

List

Displays detailed records and aggregation results of APM errors in the current workspace, including occurrence time, error type, error message, associated services, and resources.

In list mode, two analysis modes are provided:

All Errors

Lists every Span that is marked as an error (status=error) and carries an error type (error_type), so you can view all error records that meet both conditions.
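As an illustration only, the All Errors condition can be read as a simple predicate over Span records. The sketch below assumes Spans are plain dictionaries with status and error_type keys; the span_id key and the function names are hypothetical and are not part of the product.

```python
from typing import Dict, Iterable, List


def is_error_span(span: Dict) -> bool:
    """Return True when the span is marked as an error and has an error type."""
    return span.get("status") == "error" and bool(span.get("error_type"))


def all_errors(spans: Iterable[Dict]) -> List[Dict]:
    """Keep only the spans that satisfy both All Errors conditions."""
    return [span for span in spans if is_error_span(span)]


# Hypothetical sample data for illustration.
spans = [
    {"span_id": "a1", "status": "error", "error_type": "TimeoutError"},
    {"span_id": "b2", "status": "ok"},
    {"span_id": "c3", "status": "error"},  # no error_type, so it is excluded
]

print([span["span_id"] for span in all_errors(spans)])
```

A Span that is marked as an error but has no error_type is left out, which matches the two-part condition described above.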

Data Details

In the Incident Explorer, click any error to view its trace details, including service, error type, content, distribution chart, details, trace details, extended attributes, and associated logs, hosts, network, etc.

The error distribution chart on the error details page aggregates highly similar error traces based on the error_message and error_type fields. The bucket interval is selected automatically according to the chosen time range, and the chart shows how the error is distributed over time.
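The idea behind the distribution chart can be sketched as time bucketing with an interval derived from the time range. This is a simplified sketch under stated assumptions: the candidate intervals and the 60-bucket limit are invented for illustration, and "high similarity" is reduced to an exact match on error_type and error_message. The product's actual rules are not documented here.

```python
from collections import Counter

# Hypothetical candidate bucket widths in seconds (not the product's actual values).
CANDIDATE_INTERVALS = [60, 300, 900, 3600, 21600, 86400]


def pick_interval(range_seconds, max_buckets=60):
    """Return the smallest candidate interval that yields at most max_buckets buckets."""
    for interval in CANDIDATE_INTERVALS:
        if range_seconds / interval <= max_buckets:
            return interval
    return CANDIDATE_INTERVALS[-1]


def error_distribution(errors, start, end, error_type, error_message):
    """Count matching errors per time bucket within [start, end)."""
    interval = pick_interval(end - start)
    counts = Counter()
    for error in errors:
        # Simplification: exact field match stands in for similarity.
        if (error["error_type"], error["error_message"]) != (error_type, error_message):
            continue
        if not start <= error["timestamp"] < end:
            continue
        bucket = start + ((error["timestamp"] - start) // interval) * interval
        counts[bucket] += 1
    return interval, dict(sorted(counts.items()))


# Hypothetical sample data: timestamps are seconds from the start of the range.
errors = [
    {"timestamp": 10, "error_type": "TimeoutError", "error_message": "upstream timed out"},
    {"timestamp": 50, "error_type": "TimeoutError", "error_message": "upstream timed out"},
    {"timestamp": 70, "error_type": "TimeoutError", "error_message": "upstream timed out"},
    {"timestamp": 20, "error_type": "KeyError", "error_message": "missing id"},
]

print(error_distribution(errors, 0, 3600, "TimeoutError", "upstream timed out"))
```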

Displays error details.

Displays field information under the current error trace service.

TOBY AI Error Analysis

TrueWatch provides the ability to parse error data with one click. It uses large models to automatically extract key information from the data, combined with online search engines and operation knowledge bases, to quickly analyze possible fault causes and provide preliminary solutions.

  1. Click a single record to open its details page;
  2. Click the "TOBY AI Error Analysis" button in the upper right corner;
  3. The anomaly analysis will start automatically.

Pattern

Automatically groups similar errors and identifies high-frequency patterns, displaying the top 10,000 error Spans within the selected time range. The similarity of error trace data is calculated on the clustering fields and common patterns are extracted, which helps you quickly discover abnormal traces and locate problems.

By default, errors are clustered on the error_message field. You can customize the clustering fields, up to a maximum of 3.
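The similarity algorithm itself is not described here. As a rough illustration of what clustering on selected fields means, the sketch below masks variable tokens (numbers and hex values) in each clustering field and groups Spans by the resulting signature. The masking rule and all function names are assumptions made for this example, not the product's method.

```python
import re
from collections import Counter

MAX_FIELDS = 3       # clustering fields can be customized, up to 3
MAX_SPANS = 10_000   # only the top 10,000 error Spans are considered

# Assumed masking rule: standalone decimal numbers and hex literals become "*".
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def signature(span, fields):
    """Build a pattern signature from the masked values of the clustering fields."""
    return tuple(_VARIABLE.sub("*", str(span.get(field, ""))) for field in fields)


def patterns(spans, fields=("error_message",)):
    """Return (signature, count) pairs, most frequent pattern first."""
    if not 1 <= len(fields) <= MAX_FIELDS:
        raise ValueError("between 1 and 3 clustering fields are required")
    counts = Counter(signature(span, fields) for span in spans[:MAX_SPANS])
    return counts.most_common()


# Hypothetical sample data for illustration.
spans = [
    {"error_message": "timeout after 30 ms"},
    {"error_message": "order 123 not found"},
    {"error_message": "timeout after 45 ms"},
]

for pattern, count in patterns(spans):
    print(count, pattern)
```

Passing more fields, for example ("error_message", "error_type"), produces finer-grained patterns because Spans must then agree on every masked field to land in the same group.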

Pattern Details

In the Pattern list, click any error to view all associated traces.

On the associated traces page, you can sort by document count in ascending or descending order (descending by default).

Click a record on the associated traces page to open its details page, where you can perform the following operations:

  • View the host and service where the error occurred, error distribution, etc.;
  • Click the icon in the upper right corner of the details page to export the current data;
  • Perform AI Intelligent Analysis on the current error details;
  • Click to jump to the associated traces of the current error details.

Charts

Charts aggregate the data with the count, last, first, or count_distinct function and group it by the fields set in the by condition. The following chart types are available and can be selected as needed:

  • Top List
  • Time Series
  • Pie Chart
  • Treemap
  • Grouped Table Chart
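To make the four functions concrete, here is a minimal sketch of how each one could behave when applied to a field and grouped by another field. It assumes rows are dictionaries with a timestamp key and that first and last refer to the earliest and latest value by time; those semantics and the aggregate helper are assumptions for illustration.

```python
from collections import defaultdict


def aggregate(rows, func, field, by):
    """Apply func to the values of field, grouped by the by field."""
    groups = defaultdict(list)
    # Sort by time so that first/last mean earliest/latest (assumed semantics).
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        groups[row[by]].append(row[field])
    operations = {
        "count": len,
        "first": lambda values: values[0],
        "last": lambda values: values[-1],
        "count_distinct": lambda values: len(set(values)),
    }
    operation = operations[func]
    return {key: operation(values) for key, values in groups.items()}


# Hypothetical sample data for illustration.
rows = [
    {"timestamp": 1, "service": "cart", "error_type": "TimeoutError"},
    {"timestamp": 2, "service": "cart", "error_type": "ValueError"},
    {"timestamp": 3, "service": "cart", "error_type": "ValueError"},
    {"timestamp": 4, "service": "pay", "error_type": "KeyError"},
]

for func in ("count", "first", "last", "count_distinct"):
    print(func, aggregate(rows, func, "error_type", "service"))
```

Each result is one value per group, which is the shape a Top List, Pie Chart, or Treemap needs; a Time Series would additionally bucket the rows by time.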

Issue Auto Discovery

After you enable the "Issue Auto Discovery" configuration, the system counts abnormal data by the configured grouping dimensions, traces the error stack, and automatically merges subsequent similar problems, finally generating an Issue. Issues generated this way give you the context and root cause of a problem quickly, which shortens the time needed to resolve it.

Configure

Note

You must configure the rules before enabling this configuration. Without rules, it cannot be enabled.

  1. Data Source: The enabling entry of the current configuration page;

    • For the data source, you can add filter conditions. The system then queries only the data that meets these conditions, which narrows the data range.

  2. Combination Dimensions: Categorizes and counts data by the values of the configured fields, including service, version, resource, and error_type;

  3. Detection Frequency: The system determines the time range of the queried data according to the selected frequency. The options are 5 minutes, 10 minutes, 15 minutes, 30 minutes, and 1 hour;

  4. Issue Definition: After this configuration is enabled, Issues are presented as defined here. To avoid losing information, fill in the fields in order.

    • The Title and Description of the Issue support the following template variables:

    Variable         Meaning
    count            Count
    service          Service Name
    version          Version
    resource         Resource Name
    error_type       Error Type
    error_message    Error Content
    error_stack      Error Stack
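To show how template variables might fill an Issue title, the sketch below substitutes values into a template string. The placeholder syntax is not specified above, so the double-brace form ({{ service }}) is an assumption, as is the render helper; only the seven variable names come from the table.

```python
import re

# Variable names taken from the table above.
SUPPORTED_VARIABLES = {
    "count", "service", "version", "resource",
    "error_type", "error_message", "error_stack",
}

# Assumed placeholder syntax: {{ name }}, with optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace supported placeholders with their values; leave others untouched."""
    def replace(match):
        name = match.group(1)
        if name not in SUPPORTED_VARIABLES:
            return match.group(0)
        return str(values.get(name, ""))
    return _PLACEHOLDER.sub(replace, template)


# Hypothetical title template and values for illustration.
title = render(
    "[{{ service }}] {{error_type}} occurred {{ count }} times",
    {"service": "cart", "error_type": "TimeoutError", "count": 12},
)
print(title)
```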

After saving the configuration and enabling it, Issues automatically discovered and generated by the system will be displayed in Incident.
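The counting step of Issue Auto Discovery can be pictured as grouping recent errors by the combination dimensions. The sketch below is an approximation under stated assumptions: it treats the detection frequency as the length of the query window, uses epoch-second timestamps, and introduces a hypothetical count_by_dimensions helper. How the system actually merges groups into Issues is not covered.

```python
from collections import Counter

# Combination dimensions listed in the configuration.
DIMENSIONS = ("service", "version", "resource", "error_type")


def count_by_dimensions(errors, now, frequency_seconds, dimensions=DIMENSIONS):
    """Count errors per dimension combination within the last detection window."""
    window_start = now - frequency_seconds  # assumed: window length equals the frequency
    counts = Counter()
    for error in errors:
        if window_start <= error["timestamp"] < now:
            counts[tuple(error.get(dim) for dim in dimensions)] += 1
    return counts


# Hypothetical sample data for illustration.
errors = [
    {"timestamp": 950, "service": "cart", "version": "1.2", "resource": "/checkout", "error_type": "TimeoutError"},
    {"timestamp": 990, "service": "cart", "version": "1.2", "resource": "/checkout", "error_type": "TimeoutError"},
    {"timestamp": 995, "service": "pay", "version": "2.0", "resource": "/charge", "error_type": "KeyError"},
    {"timestamp": 100, "service": "cart", "version": "1.2", "resource": "/checkout", "error_type": "TimeoutError"},
]

# A 5-minute detection frequency evaluated at time 1000.
for group, count in count_by_dimensions(errors, now=1000, frequency_seconds=300).items():
    print(count, group)
```

Each resulting group supplies the count, service, version, resource, and error_type values that the Issue title and description templates can reference.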
