Threshold Detection
Threshold detection monitors anomalies in Metrics, LOG, Infrastructure, Resource Catalog, Events, APM, RUM, Security Check, and Network data. Users set thresholds, and the system triggers alerts and notifies the relevant personnel when a threshold is exceeded. Multi-metric detection is also supported, with a different alert level configurable for each metric. A typical use is monitoring whether host memory usage is abnormally high.
Detection Configuration
Detection Frequency
The execution frequency of the detection rule; the default is 5 minutes.
In addition to the specific options provided by the system, you can also input a custom crontab task to configure scheduled tasks based on minutes, hours, days, months, weeks, etc.
When using a custom Crontab detection frequency, the detection intervals include the last 1 minute, last 5 minutes, last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, and last 24 hours.
Detection Interval
The time range for querying the detection metric. The available detection intervals vary depending on the detection frequency.
Detection Frequency | Detection Interval (Dropdown Options) |
---|---|
30s | 1m/5m/15m/30m/1h/3h |
1m | 1m/5m/15m/30m/1h/3h |
5m | 5m/15m/30m/1h/3h |
15m | 15m/30m/1h/3h/6h |
30m | 30m/1h/3h/6h |
1h | 1h/3h/6h/12h/24h |
6h | 6h/12h/24h |
12h | 12h/24h |
24h | 24h |
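
The table above can be encoded as a lookup to validate a frequency and interval pair before a rule is saved. This is a minimal sketch in Python; the names `ALLOWED_INTERVALS` and `is_valid_interval` are hypothetical and are not part of the product's API.

```python
# Allowed detection intervals per detection frequency, copied from the table above.
ALLOWED_INTERVALS = {
    "30s": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "1m": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "5m": ["5m", "15m", "30m", "1h", "3h"],
    "15m": ["15m", "30m", "1h", "3h", "6h"],
    "30m": ["30m", "1h", "3h", "6h"],
    "1h": ["1h", "3h", "6h", "12h", "24h"],
    "6h": ["6h", "12h", "24h"],
    "12h": ["12h", "24h"],
    "24h": ["24h"],
}


def is_valid_interval(frequency: str, interval: str) -> bool:
    """Return True if `interval` is offered for the given `frequency`."""
    return interval in ALLOWED_INTERVALS.get(frequency, [])


print(is_valid_interval("5m", "15m"))  # True
print(is_valid_interval("5m", "1m"))   # False
```

In every row the shortest offered interval is at least as long as the frequency, so a detection run never queries a window shorter than the gap between runs.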
Detection Metric
- Data Type: The current data type being detected, including Metrics, LOG, Infrastructure, Resource Catalog, Events, APM, RUM, Security Check, and Network data types.
- Aggregation Algorithm: Provides multiple aggregation algorithms, including:
    - Avg by (average value)
    - Min by (minimum value)
    - Max by (maximum value)
    - Sum by (sum)
    - Last (last value)
    - First by (first value)
    - Count by (data point count)
    - Count_distinct by (unique data point count)
    - p50 (median value)
    - p75 (value at the 75th percentile)
    - p90 (value at the 90th percentile)
    - p99 (value at the 99th percentile)
- Detection Dimension: You can configure string-type (`keyword`) fields in the data as detection dimensions, with up to three fields supported. By combining multiple detection dimension fields, the specific detection object can be determined. The system will determine whether the statistical metric of the object meets the trigger conditions, and if so, an event will be generated.
    - For example, selecting detection dimensions `host` and `host_ip`, the detection object can be represented as {host: host1, host_ip: 127.0.0.1}. When the detection object is "LOG", the default detection dimensions are `status`, `host`, `service`, `source`, and `filename`.
- Filter Conditions: Filter the detection data based on the metric tags to limit the detection scope. Supports adding one or more tag filter conditions, and non-metric data supports fuzzy matching and fuzzy non-matching filters.
- Alias: Custom detection metric name.
- Query Method: Supports simple queries, expression queries, PromQL queries, and data source queries.
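
To illustrate how detection dimensions identify a detection object, the sketch below groups sample points by `host` and `host_ip` and applies an average, in the spirit of "Avg by". It is a simplified model and not the platform's query engine; the `avg_by` helper, the sample data, and the `mem_used_percent` field name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample points; `mem_used_percent` is an assumed metric name.
points = [
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 91.0},
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 95.0},
    {"host": "host2", "host_ip": "10.0.0.2", "mem_used_percent": 40.0},
]


def avg_by(points, metric, dimensions):
    """Average `metric` per detection object, keyed by the dimension values."""
    if len(dimensions) > 3:
        raise ValueError("at most three detection dimensions are supported")
    groups = defaultdict(list)
    for point in points:
        key = tuple(point[d] for d in dimensions)
        groups[key].append(point[metric])
    return {key: mean(values) for key, values in groups.items()}


result = avg_by(points, "mem_used_percent", ["host", "host_ip"])
for key, value in result.items():
    # Each key corresponds to one detection object, e.g. {host: host1, host_ip: 127.0.0.1}.
    print(dict(zip(["host", "host_ip"], key)), value)
```

Each distinct combination of dimension values is evaluated against the trigger conditions separately, so one rule can raise events for several hosts.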
Trigger Conditions
Set the trigger conditions for alert levels: You can configure any of the trigger conditions for Critical, Important, Warning, or Normal.
Configure the trigger conditions and severity. When the query result has multiple values, any value meeting the trigger condition will generate an event.
For more details, refer to Event Level Description.
If Continuous Trigger Judgment is enabled, you can configure the system to generate an event after the trigger condition is met multiple times consecutively. The maximum limit is 10 times.
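Continuous Trigger Judgment can be pictured as a check over the most recent detection results. The sketch below assumes a `>` operator and treats "consecutive" as the latest N results; the function name `should_generate_event` is hypothetical and the platform's actual evaluation may differ.

```python
def should_generate_event(results, threshold, required_consecutive):
    """Return True once the latest `required_consecutive` results all exceed `threshold`."""
    # The text above states a maximum of 10 consecutive times.
    if not 1 <= required_consecutive <= 10:
        raise ValueError("consecutive count must be between 1 and 10")
    if len(results) < required_consecutive:
        return False
    return all(value > threshold for value in results[-required_consecutive:])


# Three consecutive results above 90: an event would be generated.
print(should_generate_event([50, 92, 95, 97], 90, 3))  # True
# The run is broken by 50, so no event is generated yet.
print(should_generate_event([92, 50, 95], 90, 3))      # False
```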
Alert Levels
- Critical (Red), Important (Orange), Warning (Yellow): Based on the configured condition judgment operators.
    For more operator details, refer to Operator Description. For details on the `likeTrue` and `likeFalse` truth tables, refer to Truth Table Description.
- Normal (Green): Based on the configured detection count, as explained below:
    - Each execution of a detection task counts as 1 detection, e.g., if `Detection Frequency = 5 minutes`, then 1 detection = 5 minutes;
    - You can customize the detection count, e.g., if `Detection Frequency = 5 minutes`, then 3 detections = 15 minutes.

Level | Description |
---|---|
Normal | After the detection rule takes effect, if Critical, Important, or Warning events occur, and the data detection result returns to normal within the configured custom detection count, a recovery alert event is generated. |

Recovery alert events are not restricted by Alert Silence. If the recovery alert event detection count is not set, the alert event will not recover and will remain in the Events > Unrecovered Events List.
Recovery Conditions
After enabling recovery conditions, you can set recovery conditions and severity for the current monitor. When the query result has multiple values, any value meeting the recovery condition will generate a recovery event.
When the alert level changes from low to high, the corresponding alert event is sent; when returning to normal, a normal recovery event is sent.
Note
- Recovery conditions are only displayed when all trigger conditions are >, >=, <, <=;
- The recovery threshold for the corresponding level must be less than the trigger threshold (e.g., Critical recovery threshold < Critical trigger threshold).
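
Keeping the recovery threshold below the trigger threshold creates a hysteresis band, which avoids flapping when a value hovers near the trigger threshold. The sketch below assumes a `>` trigger operator and a `<` recovery comparison; the function `next_state` and the example thresholds (90 and 80) are hypothetical.

```python
def next_state(alerting: bool, value: float, trigger: float, recovery: float) -> bool:
    """Return the new alerting state for one detection result."""
    if recovery >= trigger:
        raise ValueError("recovery threshold must be less than the trigger threshold")
    if not alerting:
        # Enter the alert state only when the trigger threshold is exceeded.
        return value > trigger
    # Stay in the alert state until the value drops below the recovery threshold.
    return value >= recovery


state = False
history = []
for value in [85, 92, 85, 79]:
    state = next_state(state, value, trigger=90, recovery=80)
    history.append(state)

print(history)  # [False, True, True, False]
```

The third value (85) is below the trigger threshold but still above the recovery threshold, so the alert stays open until the fourth value (79).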
Recovery Alert Logic
After enabling "Recovery Conditions," the system uses the Fault ID as a unique identifier to manage the entire alert lifecycle (including creating Issues, etc.).
When hierarchical recovery is also enabled:
- The platform will configure a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., `critical`, `warning`)
- The alert and recovery status for each level is calculated independently
- The original Fault ID identifier's alert lifecycle is not affected

Therefore, when the monitor first triggers an alert (i.e., starting a new alert lifecycle), the system simultaneously generates two alert messages. They appear similar because:

- The first alert source: Overall detection (`check`), representing the start of the entire fault lifecycle (based on the original rule);
- The second alert source: Hierarchical detection (`critical`/`error`/`warning`/…), indicating that the hierarchical recovery function has been activated, used to present the specific alert level and its subsequent recovery status (e.g., `critical_ok`).

In the above, the `df_monitor_checker_sub` field is the core basis for distinguishing the two types of alerts:

- `check`: Represents the result of the overall detection;
- Other values (e.g., `critical`, `error`, `warning`, etc.): Correspond to the results of the hierarchical detection rules.

Thus, when an alert is first triggered, two records appear, similar in content but different in source and purpose.

`df_monitor_checker_sub` | T+0 | T+1 | T+2 | T+3 |
---|---|---|---|---|
`check` | `check` | `error` | `warning` | `ok` |
`critical` | `critical` | `critical_ok` | | |
`error` | `error` | `error_ok` | | |
`warning` | `warning` | `warning_ok` | | |
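
When consuming events downstream, the two sources can be separated on the `df_monitor_checker_sub` field. The sketch below is illustrative only: the event records are modeled as plain dictionaries, and the `title` field and the `split_by_source` helper are hypothetical.

```python
# Hypothetical event records; only `df_monitor_checker_sub` comes from the text above.
events = [
    {"title": "memory high", "df_monitor_checker_sub": "check"},
    {"title": "memory high", "df_monitor_checker_sub": "critical"},
    {"title": "memory recovered", "df_monitor_checker_sub": "critical_ok"},
]


def split_by_source(events):
    """Separate overall-detection events from hierarchical-detection events."""
    overall, hierarchical = [], []
    for event in events:
        if event.get("df_monitor_checker_sub") == "check":
            overall.append(event)
        else:
            hierarchical.append(event)
    return overall, hierarchical


overall, hierarchical = split_by_source(events)
print(len(overall), len(hierarchical))  # 1 2
```

Deduplicating notifications on this field is one way to avoid treating the two initial records as separate incidents.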
Data Gap
Seven strategies can be configured for the data gap status:
- Linked to the detection interval time range: evaluate the query result of the detection metric over the last few minutes and do not trigger an event.
- Linked to the detection interval time range: evaluate the query result of the detection metric over the last few minutes and treat the query result as 0. The result is then compared again with the thresholds configured in the Trigger Conditions above to determine whether to trigger an abnormal event.
- Custom detection interval value, with one of five actions: trigger a data gap event, trigger a critical event, trigger an important event, trigger a warning event, or trigger a recovery event. When selecting this type of strategy, it is recommended to set the custom data gap time >= the detection interval. If the configured time <= the detection interval, both the data gap and the abnormal conditions may be met at the same time, in which case only the data gap processing result is applied.
Information Generation
Enabling this option will generate "Information" events for detection results that do not match the above trigger conditions.
Note
If trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is as follows: Data Gap > Trigger Conditions > Information Event Generation.
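
The priority order can be read as a simple cascade in which the first matching branch wins. This is a minimal sketch under that reading; the function `select_outcome` and its return labels are hypothetical and do not reflect the platform's internal names.

```python
def select_outcome(data_gap_matched: bool, trigger_matched: bool, info_enabled: bool):
    """Pick the event type following Data Gap > Trigger Conditions > Information."""
    if data_gap_matched:
        return "data_gap"
    if trigger_matched:
        return "trigger"
    if info_enabled:
        # Information events cover results that matched no trigger condition.
        return "information"
    return None


# Both a data gap and a trigger condition are met: the data gap result is applied.
print(select_outcome(True, True, True))    # data_gap
print(select_outcome(False, True, True))   # trigger
print(select_outcome(False, False, True))  # information
```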
Other Configurations
For more details, refer to Rule Configuration.