Threshold Detection¶
Used to monitor anomalies in Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, Security Check, and Network data. Users set thresholds; when a threshold is exceeded, the system triggers an alert and notifies the relevant personnel. Multi-metric detection is also supported, and a different alert level can be configured for each metric. A typical example is monitoring whether host memory usage is abnormally high.
Detection Configuration¶
Detection Frequency¶
The execution frequency of the detection rule; defaults to 5 minutes.
In addition to the options provided by the system, you can enter a custom crontab expression to schedule detection by minute, hour, day, month, and weekday.
When a custom crontab detection frequency is used, the available detection intervals are the last 1 minute, last 5 minutes, last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, and last 24 hours.
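For reference, standard five-field crontab syntax (minute, hour, day of month, month, day of week) looks like this; these example expressions are illustrative and not taken from the product UI:

```
# minute (0-59)  hour (0-23)  day-of-month (1-31)  month (1-12)  day-of-week (0-6)
*/15 * * * *     # run every 15 minutes
0 */6 * * *      # run at minute 0 of every 6th hour
30 2 * * 1       # run at 02:30 every Monday
```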
Detection Interval¶
The time range for querying detection metrics. The available detection intervals vary depending on the detection frequency.
Detection Frequency | Detection Interval (Dropdown Options) |
---|---|
30s | 1m/5m/15m/30m/1h/3h |
1m | 1m/5m/15m/30m/1h/3h |
5m | 5m/15m/30m/1h/3h |
15m | 15m/30m/1h/3h/6h |
30m | 30m/1h/3h/6h |
1h | 1h/3h/6h/12h/24h |
6h | 6h/12h/24h |
12h | 12h/24h |
24h | 24h |
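The frequency-to-interval mapping above can be captured as a simple lookup table. This is an illustrative sketch for pre-validating a rule before submitting it; the names `ALLOWED_INTERVALS` and `is_valid_interval` are assumptions, not a real platform API.

```python
# Hypothetical lookup: detection frequency -> detection intervals offered
# in the dropdown (taken from the table above).
ALLOWED_INTERVALS = {
    "30s": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "1m":  ["1m", "5m", "15m", "30m", "1h", "3h"],
    "5m":  ["5m", "15m", "30m", "1h", "3h"],
    "15m": ["15m", "30m", "1h", "3h", "6h"],
    "30m": ["30m", "1h", "3h", "6h"],
    "1h":  ["1h", "3h", "6h", "12h", "24h"],
    "6h":  ["6h", "12h", "24h"],
    "12h": ["12h", "24h"],
    "24h": ["24h"],
}

def is_valid_interval(frequency: str, interval: str) -> bool:
    """Return True if the detection interval is offered for the given frequency."""
    return interval in ALLOWED_INTERVALS.get(frequency, [])
```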
Detection Metrics¶
- Data Type: the data type to detect, including Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, Security Check, and Network.
- Aggregation Algorithm: multiple aggregation algorithms are provided:
    - Avg by (Average)
    - Min by (Minimum)
    - Max by (Maximum)
    - Sum by (Sum)
    - Last (Last Value)
    - First by (First Value)
    - Count by (Data Point Count)
    - Count_distinct by (Unique Data Point Count)
    - p50 (Median)
    - p75 (75th Percentile)
    - p90 (90th Percentile)
    - p99 (99th Percentile)
- Detection Dimension: string-type (`keyword`) fields in the data can be configured as detection dimensions, with up to three fields supported. Combining multiple detection dimension fields determines a specific detection object; the system checks whether the object's statistical metrics meet the trigger conditions and, if so, generates an event. For example, with the detection dimensions `host` and `host_ip`, a detection object can be represented as {host: host1, host_ip: 127.0.0.1}. When the data type is Logs, the default detection dimensions are `status`, `host`, `service`, `source`, and `filename`.
- Filter Conditions: filter detection data by metric tags to limit the detection scope. One or more tag filter conditions can be added; for non-metric data, fuzzy match and fuzzy not-match filters are also supported.
- Alias: a custom name for the detection metric.
- Query Method: simple query, expression query, PromQL query, and data source query are supported.
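To make the interaction of detection dimensions and aggregation concrete, here is a hypothetical sketch (the field names and the `aggregate_by_dimensions` helper are illustrative, not product code) that groups records by up to three keyword fields and applies an Avg-style aggregation per detection object:

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_dimensions(records, dimensions, value_field):
    """Group records by the given dimension fields and average value_field.

    Each distinct key tuple identifies one detection object, e.g.
    ("host1", "127.0.0.1") for dimensions ["host", "host_ip"].
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        groups[key].append(rec[value_field])
    return {key: mean(values) for key, values in groups.items()}

# Illustrative data: memory usage samples from two hosts.
records = [
    {"host": "host1", "host_ip": "127.0.0.1", "mem_pct": 91.0},
    {"host": "host1", "host_ip": "127.0.0.1", "mem_pct": 93.0},
    {"host": "host2", "host_ip": "10.0.0.2",  "mem_pct": 40.0},
]
result = aggregate_by_dimensions(records, ["host", "host_ip"], "mem_pct")
```

Each aggregated value is then compared against the trigger conditions for its detection object.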
Cross-Workspace Query Metrics¶
After authorization, you can select detection metrics from other workspaces under the current account. After the monitor rule is successfully created, cross-workspace alert configuration can be achieved.
Note
After selecting another workspace, the detection metric dropdown options will only display data types that have been authorized in the current workspace.
Trigger Conditions¶
Set the trigger conditions for alert levels: You can configure any one of the trigger conditions for Critical, Error, Warning, or Normal.
Configure trigger conditions and severity. When the query result has multiple values, any value that meets the trigger condition will generate an event.
For more details, refer to Event Level Description.
If continuous trigger judgment is enabled, you can configure the system to generate an event after the trigger condition is met multiple times consecutively. The maximum limit is 10 times.
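The continuous-trigger rule can be sketched as follows. The `should_fire` helper is hypothetical, not part of the product; it simply checks that the condition held for the last N detections, with N capped at 10 as stated above:

```python
def should_fire(breach_history, required_consecutive):
    """Return True if the trigger condition held for the last N detections.

    breach_history: list of booleans, one per detection task execution
                    (True = trigger condition met on that run).
    required_consecutive: N, capped at 10 per the documented limit.
    """
    if required_consecutive > 10:
        raise ValueError("the platform caps consecutive triggers at 10")
    if len(breach_history) < required_consecutive:
        return False  # not enough detections yet
    return all(breach_history[-required_consecutive:])
```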
Alert Levels¶
- Critical (red), Error (orange), Warning (yellow): judged by the configured condition operators. For operator details, refer to Operator Description; for the `likeTrue` and `likeFalse` truth tables, refer to Truth Table Description.
- Normal (green): judged by the configured number of detections, as follows:
    - Each execution of a detection task counts as one detection. For example, if the detection frequency is 5 minutes, then 1 detection = 5 minutes.
    - The number of detections can be customized. For example, if the detection frequency is 5 minutes, then 3 detections = 15 minutes.

Level | Description |
---|---|
Normal | After the detection rule takes effect, if a Critical, Error, or Warning event has been generated and the data detection result returns to normal within the configured number of detections, a recovery alert event is generated. Recovery alert events are not subject to Alert Silence restrictions. If the number of detections for recovery is not configured, the alert event never recovers and remains in the Events > Unrecovered Events list. |
Recovery Conditions¶
After enabling recovery conditions, you can set the recovery conditions and severity for the current monitor. When the query result contains multiple values, any value that meets the recovery condition generates a recovery event.
When the alert level changes from low to high, the corresponding alert event is sent; when it returns to normal, a normal recovery event is sent.
Note
- Recovery conditions will only be displayed when all trigger conditions are >, >=, <, <=;
- The recovery threshold for the corresponding level must be less than the trigger threshold (e.g., Critical recovery threshold < Critical trigger threshold).
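The requirement that each recovery threshold sit below its trigger threshold is classic hysteresis: it prevents the alert from flapping around a single value. A minimal sketch, assuming a `>` trigger condition; the thresholds and the `evaluate` helper are illustrative, not product code:

```python
def evaluate(value, state, trigger_threshold=90.0, recovery_threshold=80.0):
    """Return the next alert state ('critical' or 'normal') for one detection."""
    # Rule from the note above: recovery threshold < trigger threshold.
    assert recovery_threshold < trigger_threshold
    if state == "normal" and value > trigger_threshold:
        return "critical"   # trigger condition met
    if state == "critical" and value <= recovery_threshold:
        return "normal"     # recovery condition met
    return state            # values in between keep the last state
```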
Recovery Alert Logic¶
After enabling "Recovery Conditions," the system uses the Fault ID as a unique identifier to manage the entire lifecycle of the alert (including creating Issues, etc.).
When hierarchical recovery is also enabled:
- The platform configures a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., `critical`, `warning`);
- The alert status and recovery status of each level are calculated independently;
- The alert lifecycle identified by the original Fault ID is not affected.

Therefore, when the monitor first triggers an alert (i.e., starts a new alert lifecycle), the system simultaneously generates two similar alert messages:

- The first alert comes from the overall detection (`check`) and represents the start of the entire fault lifecycle (based on the original rule);
- The second alert comes from the hierarchical detection (`critical` / `error` / `warning` / ...) and indicates that hierarchical recovery is active, showing the specific alert level and its subsequent recovery status (e.g., `critical_ok`).
In the above, the `df_monitor_checker_sub` field is the core basis for distinguishing the two types of alerts:

- `check`: the result of the overall detection;
- Other values (e.g., `critical`, `error`, `warning`): the results of the hierarchical detection rules.
Therefore, when an alert is first triggered, two records will appear, similar in content but different in source and purpose.
df_monitor_checker_sub | T+0 | T+1 | T+2 | T+3 |
---|---|---|---|---|
check | check | error | warning | ok |
critical | critical | critical_ok | | |
error | | error | error_ok | |
warning | | | warning | warning_ok |
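A hypothetical simulation of this behavior (the `hierarchical_events` helper is illustrative, not product code): the overall `check` stream records each observed level, while each per-level stream fires `<level>` when that level becomes active and `<level>_ok` when it is left behind:

```python
def hierarchical_events(levels):
    """Simulate hierarchical recovery for one fault lifecycle.

    levels: the overall level observed at T+0, T+1, ... ('ok' = recovered).
    Returns a list of event dicts keyed by df_monitor_checker_sub.
    """
    events = []
    active = None
    for t, level in enumerate(levels):
        # Overall detection stream: one 'check' record per detection.
        events.append({"t": t, "df_monitor_checker_sub": "check", "status": level})
        if level != active:
            if active not in (None, "ok"):
                # The previous level recovered, e.g. critical -> critical_ok.
                events.append({"t": t, "df_monitor_checker_sub": active,
                               "status": f"{active}_ok"})
            if level != "ok":
                # A new level fired, e.g. error.
                events.append({"t": t, "df_monitor_checker_sub": level,
                               "status": level})
            active = level
    return events
```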
Data Gap¶
For the data-gap state, seven strategies can be configured:

- Link to the detection interval time range and judge the query result of the detection metric over the last few minutes: do not trigger an event.
- Link to the detection interval time range and judge the query result of the detection metric over the last few minutes: treat the query result as 0. The query result is then compared with the threshold configured in the Trigger Conditions above to determine whether to trigger an anomaly event.
- Custom-fill the detection interval value, trigger a data gap event, trigger a critical event, trigger an error event, trigger a warning event, or trigger a recovery event. When choosing one of these strategies, it is recommended that the configured data-gap time be >= the detection interval. If the configured time is <= the detection interval, both the data-gap condition and an abnormal condition may be met at the same time; in that case, only the data-gap result is applied.
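Two of the strategies above, skipping the gap versus treating it as 0, can be sketched like this (the `handle_gap` helper and its strategy names are illustrative, not a real API):

```python
def handle_gap(query_result, strategy, trigger):
    """Evaluate one detection window.

    query_result: list of datapoints, possibly empty (a data gap).
    strategy: 'skip' (do not trigger) or 'treat_as_zero' (compare 0 to the threshold).
    trigger: predicate implementing the configured trigger condition.
    """
    if query_result:  # data present: evaluate the trigger condition normally
        return "event" if any(trigger(v) for v in query_result) else "ok"
    if strategy == "skip":
        return "ok"   # gap: do not trigger an event
    if strategy == "treat_as_zero":
        return "event" if trigger(0) else "ok"
    raise ValueError(f"unknown strategy: {strategy}")
```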
Information Generation¶
Enabling this option will generate "Information" events for detection results that do not match the above trigger conditions.
Note
When trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is as follows: data gap > trigger conditions > information event generation.
Other Configurations¶
For more details, refer to Rule Configuration.