
Threshold Detection


Threshold detection monitors anomalies in Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, and other data types. Rules are configured with thresholds; when a threshold is exceeded, the system triggers an alert and notifies the relevant personnel. Multi-metric detection is also supported, and a different alert level can be configured for each metric. For example, you can monitor whether the memory usage of a host is abnormally high.

Detection Configuration

Detection Frequency

The execution frequency of the detection rule; the default is 5 minutes.

In addition to the preset options provided by the system, you can enter a custom crontab expression to schedule detection by second, minute, hour, day of month, month, and day of week.

When a custom crontab detection frequency is used, the available detection intervals are the last 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 6 hours, 12 hours, and 24 hours.
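As a hypothetical illustration (the exact field order and syntax accepted by the platform may differ), a six-field crontab expression covering second, minute, hour, day of month, month, and day of week that runs a detection every 30 seconds could look like:

```
# ┌──────── second (0-59)
# │ ┌────── minute (0-59)
# │ │ ┌──── hour (0-23)
# │ │ │ ┌── day of month (1-31)
# │ │ │ │ ┌ month (1-12)
# │ │ │ │ │ ┌ day of week (0-6)
  */30 * * * * *
```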

Detection Interval

The time range for querying the detection metric. The available detection intervals vary depending on the detection frequency.

Detection Frequency    Detection Interval (Dropdown Options)
30s                    1m / 5m / 15m / 30m / 1h / 3h
1m                     1m / 5m / 15m / 30m / 1h / 3h
5m                     5m / 15m / 30m / 1h / 3h
15m                    15m / 30m / 1h / 3h / 6h
30m                    30m / 1h / 3h / 6h
1h                     1h / 3h / 6h / 12h / 24h
6h                     6h / 12h / 24h
12h                    12h / 24h
24h                    24h

Detection Metrics

  1. Data Type: The data type being detected; options include Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, and Network.

  2. Aggregation Algorithm: Provides multiple aggregation algorithms, including:

    • Avg by (average value)
    • Min by (minimum value)
    • Max by (maximum value)
    • Sum by (sum)
    • Last (last value)
    • First by (first value)
    • Count by (number of data points)
    • Count_distinct by (number of unique data points)
    • p50 (median value)
    • p75 (value at the 75th percentile)
    • p90 (value at the 90th percentile)
    • p99 (value at the 99th percentile)
  3. Detection Dimensions: String-type (keyword) fields in the data can be configured as detection dimensions, up to three fields. The combination of the detection dimension fields identifies a specific detection object. The system checks whether that object's statistical metrics meet the trigger conditions and, if so, generates an event.

    • For example, with the detection dimensions host and host_ip, a detection object can be represented as {host: host1, host_ip: 127.0.0.1}. For the Logs data type, the default detection dimensions are status, host, service, source, and filename.
  4. Filter Conditions: Filter the detected data by metric tags to limit the detection scope. One or more tag filter conditions can be added; non-metric data also supports fuzzy-match and fuzzy-not-match filters.

  5. Alias: Custom name for the detection metric.

  6. Query Method: Supports simple queries, expression queries, PromQL queries, and data source queries.
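The grouping-and-aggregation step described above can be sketched in Python (a simplified model with hypothetical field names; the platform's query engine performs this internally):

```python
from collections import defaultdict

def aggregate(points, dimensions, agg="avg"):
    """Group data points by the detection-dimension fields and compute
    one aggregated statistic per detection object."""
    groups = defaultdict(list)
    for p in points:
        # A detection object is the combination of dimension values,
        # e.g. {"host": "host1", "host_ip": "127.0.0.1"}.
        key = tuple((d, p[d]) for d in dimensions)
        groups[key].append(p["value"])

    aggregators = {
        "avg": lambda vs: sum(vs) / len(vs),
        "min": min,
        "max": max,
        "sum": sum,
        "count": len,
        "last": lambda vs: vs[-1],
        "first": lambda vs: vs[0],
    }
    return {key: aggregators[agg](vs) for key, vs in groups.items()}
```

Each resulting value is then compared against the trigger conditions for its detection object.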

Cross-Workspace Query Metrics

After authorization, you can select detection metrics from other workspaces under the current account. Once the monitor rule is created, alerts can be configured across workspaces.

Note

After selecting another workspace, the detection metric dropdown options will only display the data types that have been authorized in the current workspace.

Trigger Conditions

Set the trigger conditions for each alert level: any one of critical, error, warning, or normal can be configured.

Configure the trigger condition and its severity. When the query result contains multiple values, an event is generated if any value meets the trigger condition.

For more details, refer to Event Level Description.

If continuous trigger judgment is enabled, an event is generated only after the trigger condition has been met multiple times consecutively; the maximum is 10 consecutive detections.
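The consecutive-trigger rule can be sketched as follows (a simplified model, not the platform's actual implementation):

```python
def should_fire(history, required):
    """Generate an event only when the trigger condition has been met on
    the last `required` consecutive detections (capped at 10).

    `history` is a list of booleans, one per detection, oldest first.
    """
    required = min(required, 10)  # documented upper limit
    if len(history) < required:
        return False
    return all(history[-required:])
```

For example, with `required=3`, three consecutive hits fire an event, while a miss in between resets the run.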

Alert Levels

  1. Critical (Red), Error (Orange), Warning (Yellow): Based on the configured condition judgment operators.

    For more operator details, refer to Operator Description;

    For the likeTrue and likeFalse truth table details, refer to Truth Table Description.

  2. Normal (Green): Based on the configured number of detections, as explained below:

    • Each execution of a detection task counts as 1 detection. For example, if detection frequency = 5 minutes, then 1 detection = 5 minutes;
    • You can customize the number of detections. For example, if detection frequency = 5 minutes, then 3 detections = 15 minutes.
    • Normal: After the detection rule takes effect, if a critical, error, or warning anomaly event occurs and the detection results return to normal within the configured number of detections, a recovery alert event is generated.
    ⚠ Recovery alert events are not restricted by Alert Silence. If the number of detections for recovery alert events is not set, the alert event will not recover and will remain in the Events > Unrecovered Events list.

Recovery Conditions

After enabling recovery conditions, you can set the recovery conditions and severity for the current monitor. When the query result contains multiple values, a recovery event is generated if any value meets the recovery condition.

When the alert level changes from low to high, the corresponding alert event is sent; when it returns to normal, a normal recovery event is sent.

Note
  • The recovery conditions will only be displayed when all trigger conditions are >, >=, <, <=;
  • The recovery threshold for the corresponding level must be less than the trigger threshold (e.g., critical recovery threshold < critical trigger threshold).
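The constraint in the note can be expressed as a small validation check (a hypothetical helper, not a platform API):

```python
def recovery_threshold_valid(op, trigger, recovery):
    """For greater-than style triggers (>, >=), the recovery threshold
    must be below the trigger threshold; for less-than style triggers
    (<, <=), it must be above. Other operators do not support recovery
    conditions."""
    if op in (">", ">="):
        return recovery < trigger
    if op in ("<", "<="):
        return recovery > trigger
    raise ValueError("recovery conditions only apply to >, >=, <, <=")
```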

Recovery Alert Logic

After enabling "Recovery Conditions," the system uses the Fault ID as a unique identifier to manage the entire lifecycle of the alert (including creating Issues, etc.).

When hierarchical recovery is also enabled:

  • The platform will configure a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., critical, warning)

  • The alert status and recovery status for each level are calculated independently

  • The original Fault ID identifier's alert lifecycle is not affected

Therefore, when the monitor triggers an alert for the first time (i.e., starting a new alert lifecycle), the system simultaneously generates two alert messages. They appear similar because:

  1. The first alert comes from the overall detection (check), representing the start of the entire fault lifecycle (based on the original rule);

  2. The second alert comes from hierarchical detection (critical/error/warning/…), indicating that the hierarchical recovery function has been activated, used to present the specific alert level and its subsequent recovery status (e.g., critical_ok).

In these alert messages, the df_monitor_checker_sub field is the core basis for distinguishing the two types of alerts:

  • check: Represents the result of the overall detection;
  • Other values (e.g., critical, error, warning, etc.): Correspond to the results of the hierarchical detection rules.

Therefore, when an alert is triggered for the first time, two records will appear, with similar content but different sources and purposes.

df_monitor_checker_sub   T+0        T+1           T+2        T+3
check                    check      error         warning    ok
critical                 critical   critical_ok
error                               error         error_ok
warning                                           warning    warning_ok
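A consumer of these alert events could separate the two record types by inspecting df_monitor_checker_sub (a sketch; the event payload shape is an assumption):

```python
def classify_alert(event):
    """Distinguish the overall-detection record from hierarchical ones
    by the df_monitor_checker_sub field."""
    sub = event.get("df_monitor_checker_sub")
    if sub == "check":
        return "overall"       # result of the overall detection
    return "hierarchical"      # per-level rule, e.g. critical / critical_ok
```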

Data Gap

For the data gap status, seven handling strategies can be configured.

  1. Linked to the detection interval: evaluate the detection metric's query result over the recent interval and do not trigger an event;

  2. Linked to the detection interval: evaluate the detection metric's query result over the recent interval and treat a missing result as 0; the result is then re-compared against the thresholds configured in Trigger Conditions above to decide whether an anomaly event is triggered.

  3. After a custom data-gap duration, trigger one of: a data gap event, a critical event, an error event, a warning event, or a recovery event. With this strategy, it is recommended to configure the custom data-gap duration >= the detection interval. If the configured duration <= the detection interval, a detection may satisfy both the data gap condition and an anomaly condition at the same time; only the data gap result is applied in that case.

Information Generation

Enabling this option generates "information" events for detection results that do not match any of the trigger conditions above.

Note

If trigger conditions, data gap, and information generation are configured simultaneously, the triggering is judged according to the following priority: data gap > trigger conditions > information event generation.
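The priority order in the note can be sketched as follows (a simplified model; names are illustrative, and a missing query result stands in for a data gap):

```python
def evaluate(result, gap_action, trigger_cond, info_enabled):
    """Apply the documented priority:
    data gap > trigger conditions > information event generation."""
    if result is None:            # no data returned: data-gap handling wins
        return gap_action
    if trigger_cond(result):      # threshold met: anomaly event
        return "anomaly_event"
    if info_enabled:              # otherwise, optional information event
        return "information_event"
    return None
```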

Other Configurations

For more details, refer to Rule Configuration.