On-call¶
The on-call feature helps teams establish a 24/7 incident response mechanism: each incident has a clearly responsible person, and an incident that is not handled within the timeout period escalates automatically, achieving "guaranteed alert delivery".
Core Concepts¶
On-call Rules¶
On-call rules define who is responsible for which type of incident at what time. Each rule includes the following elements:
- On-call personnel: Members, teams, or notification targets.
- Time period: The effective time range for the on-call duty (supports timezone settings).
- Matching tags/dimensions: Determines which incidents are routed to this rule.
- Escalation strategy: Notification escalation rules when incidents are not handled within the timeout period.
Escalation Strategy¶
An escalation strategy is a multi-level notification mechanism attached to an on-call rule. When an incident is not claimed or resolved within a specified time, the system will gradually expand the notification scope according to preset levels, ensuring no incident is missed.
Matching Tag Logic¶
Incidents automatically match on-call rules based on their tags. Matching rules support:
- AND: Multiple tags must be satisfied simultaneously (full match).
- OR: Any one of multiple tags being satisfied is sufficient (partial match).
- Wildcard: `key:value*` supports prefix matching.
Example:
Incident tags: {service:payment, env:prod, team:backend}
- On-call Rule A: Tags `service:payment AND env:prod` → Matches ✓
- On-call Rule B: Tag `team:frontend` → Does not match ✗
- On-call Rule C: No tags (global) → Matches ✓ (fallback)
Note
If no matching tags are set, the on-call rule is considered "globally matched" and will receive all incidents not matched by other rules.
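The matching logic described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function names `condition_matches` and `rule_matches` and the list-of-strings rule format are assumptions made for the example. It treats a rule with no conditions as matching everything and does not model whether global rules act purely as a fallback.

```python
def condition_matches(condition: str, incident_tags: dict[str, str]) -> bool:
    """Check one `key:value` condition; a trailing `*` on the value is a prefix wildcard."""
    key, _, pattern = condition.partition(":")
    value = incident_tags.get(key)
    if value is None:
        return False  # the incident does not carry this tag at all
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def rule_matches(conditions: list[str], incident_tags: dict[str, str], mode: str = "AND") -> bool:
    """Return True if the rule's conditions match the incident's tags."""
    if not conditions:
        return True  # no tags configured: the rule is treated as global
    results = [condition_matches(c, incident_tags) for c in conditions]
    return all(results) if mode == "AND" else any(results)


incident = {"service": "payment", "env": "prod", "team": "backend"}
rule_a = rule_matches(["service:payment", "env:prod"], incident, "AND")
rule_b = rule_matches(["team:frontend"], incident)
rule_c = rule_matches([], incident)
```

With the incident tags from the example, `rule_a` and `rule_c` evaluate to a match and `rule_b` does not.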
On-call Calendar¶
The on-call calendar provides a visual scheduling view, making it easy to quickly understand current and future on-call arrangements.
- Default view: After entering the on-call page, "My On-call" is displayed by default. The calendar on the right highlights all on-call schedules the current user participates in, and the list on the left shows all on-call rules that include the current user.
- All on-call: After clicking "All On-call", the calendar on the right displays the on-call schedules for all members, and the list on the left shows all on-call rules.
- Default on-call: The built-in "Default On-call" is always displayed in the on-call list and cannot be deleted or hidden.
- View details: Click a colored block or member name on the calendar to display detailed information about that on-call duty, including the associated on-call rule, escalation strategy, and specific on-call time periods. The top left corner supports switching timezones and dates to view historical or future schedules.
On-call Management¶
The "On-call Management" page centrally displays all on-call rules in a list format. Each rule lists key information such as on-call timezone, execution cycle, on-call personnel, matching tags, and escalation strategy. The list includes both the system default on-call and custom on-call rules. Clicking any entry takes you to its detail page for in-depth configuration.
To ensure incident notifications are delivered accurately and every incident has a clear owner through to resolution, configuring an on-call strategy centers on the following two layers of protection:
- Clarify "who is responsible when": By setting on-call personnel, effective time periods, and enabling notification rotation (supports automatic handover by day, week, etc.), the system achieves clear duty scheduling and automatic handover, ensuring there is always a clear "first responder" at any time.
- Preset escalation path ("how to escalate if no response"): By configuring escalation strategies, a "T+N minutes" multi-level notification timeline is built. When an incident is not handled within the set time, the system will automatically escalate the alert notification to other levels of members or broader teams according to this rule, ensuring critical incidents are always delivered.
Create an On-call Rule¶
Creating an on-call rule requires completing the following configuration steps.
Basic Information¶
- Enter the on-call rule name.
- Select the timezone on which the on-call duty is based.
- Select the time period covered by this on-call duty. The effective time (start and end times) defines exactly when the on-call duty is in force.
Matching Tags/Dimensions (Optional)¶
This section determines which incidents will be handled by this rule. If no tags/dimensions are added, this rule will be globally matched.
- Matching tags:
  - Select existing tags from the dropdown list.
  - Enter a new tag directly to create it on the spot, or go to "Global Tags" to manage tags.
- Matching dimensions:
  - You can select detection dimensions (e.g., `service`, `host`) and set specific matching values.
  - Supports logical relationships: AND (full match, all conditions must be met) or OR (partial match, any one condition being met is sufficient). Default is AND.
  - Values support wildcards in the format `key:value*`; e.g., `service:auth*` can match `auth-api`, `auth-service`, etc.
On-call Personnel Settings¶
- Select on-call personnel: Can be one or more members, or an entire team.
- Enable rotation: If rotation is needed, enable the rotation function. Set the rotation cycle (e.g., daily, weekly, monthly), and the system will automatically schedule rotations in the order of the member list, visually displaying the rotation effect in the calendar on the right.
Rotation example (screenshots): the calendar before enabling rotation and the calendar after enabling rotation.
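As a rough illustration of rotation in member-list order, the sketch below computes who is on call on a given day. The function name `on_call_member`, the `rotation_start` parameter, and the fixed-length cycle in days are assumptions for the example; it covers daily and weekly cycles only, since monthly cycles do not have a fixed length.

```python
from datetime import date


def on_call_member(members: list[str], rotation_start: date, day: date, cycle_days: int = 7) -> str:
    """Pick the on-call member for `day`, rotating through `members` in list order."""
    if not members:
        raise ValueError("at least one on-call member is required")
    elapsed_days = (day - rotation_start).days
    if elapsed_days < 0:
        raise ValueError("day is before the rotation start")
    # Each completed cycle advances the rotation by one member, wrapping around.
    return members[(elapsed_days // cycle_days) % len(members)]


team = ["A", "B", "C"]
second_week = on_call_member(team, date(2024, 1, 1), date(2024, 1, 8))
```

Here `second_week` is the second member in the list, because exactly one weekly cycle has elapsed since the rotation started.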
Note
If no on-call personnel are configured for the current rule, you cannot add an escalation strategy.
Configure Escalation Strategy¶
The escalation strategy ensures that when an incident is not handled within the timeout period, the notification scope automatically expands to more people or higher levels. ❗️ The escalation strategy is the core of an on-call rule, and configuring it is strongly recommended.
Timeline Mechanism (T+N)¶
All time point calculations are based on the incident generation moment (recorded as T=0). The system triggers notifications at each level sequentially according to preset time intervals:
| Trigger Time | Level | Description |
|---|---|---|
| T+0 | Level 0 | Immediate notification when incident occurs (initial) |
| T+5 minutes | Level 1 | First-level escalation |
| T+15 minutes | Level 2 | Second-level escalation |
| T+30 minutes | Level 3 | Third-level escalation |
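A minimal sketch of the T+N timeline, assuming the offsets from the table above. The names `TIMELINE`, `levels_reached`, and `next_escalation` are hypothetical and exist only for this example; the sketch looks at elapsed time alone and ignores severity and status conditions.

```python
# (level, minutes after T=0), mirroring the example table
TIMELINE = [(0, 0), (1, 5), (2, 15), (3, 30)]


def levels_reached(elapsed_minutes: float, timeline=TIMELINE) -> list[int]:
    """Return every level whose trigger time has already passed."""
    return [level for level, offset in timeline if elapsed_minutes >= offset]


def next_escalation(elapsed_minutes: float, timeline=TIMELINE):
    """Return (level, minutes remaining) for the next level, or None after the last one."""
    for level, offset in timeline:
        if offset > elapsed_minutes:
            return level, offset - elapsed_minutes
    return None
```

For an incident that has been open for 16 minutes, `levels_reached(16)` covers Level 0 through Level 2, and `next_escalation(16)` reports Level 3 with 14 minutes remaining.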
Level Configuration Description¶
1. Level 0 (Initial Notification) (Required)
   - Trigger timing: Immediate notification when the incident occurs (T=0).
   - Notification targets: Fixed as the on-call personnel configured in the current on-call rule. Cannot add other personnel.
   - Notification method: Select individually for each notification target (email, SMS, phone call, multiple selections allowed).
2. Level 1~10 (Escalation Levels) (Optional)
   - Trigger conditions: The following conditions must be met simultaneously to trigger this level:
     - The incident duration has reached the set wait time (e.g., T+20 minutes).
     - The incident severity is within the specified range (e.g., only effective for P0, P1).
     - The incident status is a specified value (e.g., Open or Working).
   - Notification targets: Can add new personnel or teams on top of the original notification targets. That is, Level 1 notification targets = Level 0 notification targets + Level 1 added personnel.
   - Notification method: Set the notification method individually for the added personnel.
Note
The incident severity and status ranges for higher levels must not exceed the ranges selected for lower levels. For example, if Level 0 applies to P0/P1, then Level 1 can only select a subset of P0 or P1 (cannot expand to P2).
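The trigger conditions for an escalation level (wait time reached, severity in range, status in range) can be expressed as a single conjunction. The sketch below is illustrative only: `LevelCondition` and `level_should_trigger` are hypothetical names, and the severity and status strings are taken from the examples in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelCondition:
    wait_minutes: int            # minutes after T=0 at which the level may fire
    severities: frozenset[str]   # e.g. {"P0", "P1"}
    statuses: frozenset[str]     # e.g. {"Open", "Working"}


def level_should_trigger(cond: LevelCondition, elapsed_minutes: float, severity: str, status: str) -> bool:
    """All three conditions must hold at the same time for the level to fire."""
    return (
        elapsed_minutes >= cond.wait_minutes
        and severity in cond.severities
        and status in cond.statuses
    )


level_1 = LevelCondition(20, frozenset({"P0", "P1"}), frozenset({"Open", "Working"}))
fires = level_should_trigger(level_1, elapsed_minutes=20, severity="P0", status="Open")
```

In this sketch a P2 incident, or one whose status has moved outside the configured set, never triggers the level regardless of how long it has been open.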
Repeat Notification Mechanism¶
Within each level, you can choose whether to enable repeat notifications:
- Disable repeat notification: The level sends only one notification, then waits to enter the next level.
- Enable repeat notification: Periodically sends notifications at the set frequency (e.g., every 5 minutes) until the incident status changes or enters the next level.
Note
The repeat interval must be less than the wait time to enter the next level, otherwise it cannot be set.
Example:
- Level 1 wait time: 30 minutes
- Repeat interval: 5 minutes
- Final effect: Sends a notification at T+5, T+10, T+15, T+20, T+25, T+30 minutes.
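The repeat schedule in this example can be reproduced with a short helper. This is a sketch that follows the example's arithmetic, not a statement of how the system computes it; `repeat_offsets` is a hypothetical name, and the check mirrors the rule that the repeat interval must be less than the wait time.

```python
def repeat_offsets(wait_minutes: int, interval_minutes: int) -> list[int]:
    """List the minute offsets at which repeat notifications are sent within one level."""
    if interval_minutes <= 0:
        raise ValueError("repeat interval must be positive")
    if interval_minutes >= wait_minutes:
        raise ValueError("repeat interval must be less than the wait time")
    # One notification per interval, up to and including the wait time.
    return list(range(interval_minutes, wait_minutes + 1, interval_minutes))


offsets = repeat_offsets(30, 5)  # [5, 10, 15, 20, 25, 30]
```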
Note
If the last level (e.g., Level 10) has repeat notifications enabled and the incident is never handled, the system keeps repeating notifications indefinitely until someone claims or resolves the incident.
Handling Cross-shift Handovers¶
If the incident duration spans a shift handover time, subsequent escalation notifications will be transferred to the new on-call personnel and executed according to the new on-call personnel's escalation strategy.
Example:
- An incident occurs at 23:55, and the on-call person at that time is A.
- The wait time for Level 1 in the escalation strategy is 15 minutes, configured to repeat every 5 minutes.
- The first repeat notification triggers 5 minutes after the incident occurs (i.e., 0:00). At this time, the on-call person has switched to B, so this notification will be sent to B, and all subsequent escalation notifications (including remaining repeats and the next level) will be executed according to B's escalation strategy.
After the day boundary, the system continues processing the incident according to the escalation rules of the new on-call person, B.
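The handover behavior in this example amounts to resolving the on-call person at the moment each notification is sent rather than when the incident was created. The sketch below assumes a hypothetical shift table of `(start, end, person)` tuples; `on_call_at` and `repeat_recipients` are illustrative names, and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def on_call_at(shifts, moment: datetime):
    """Return the person whose shift covers `moment` (start inclusive, end exclusive)."""
    for start, end, person in shifts:
        if start <= moment < end:
            return person
    return None


def repeat_recipients(shifts, incident_time: datetime, interval_minutes: int, count: int):
    """Resolve the recipient of each repeat notification at its own send time."""
    recipients = []
    for i in range(1, count + 1):
        send_time = incident_time + timedelta(minutes=i * interval_minutes)
        recipients.append((send_time, on_call_at(shifts, send_time)))
    return recipients


day = datetime(2024, 1, 1)
shifts = [
    (day, day + timedelta(days=1), "A"),
    (day + timedelta(days=1), day + timedelta(days=2), "B"),
]
incident_time = datetime(2024, 1, 1, 23, 55)
schedule = repeat_recipients(shifts, incident_time, interval_minutes=5, count=2)
```

With these shifts, the incident itself falls in A's shift, while both repeats in `schedule` (at 0:00 and 0:05) resolve to B.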
Note
It is recommended to consider cross-day scenarios when configuring escalation strategies to ensure incidents can be effectively responded to at any time.
Deduplication for Multiple Escalation Strategies¶
When the same incident matches multiple on-call rules (thus matching multiple escalation strategies), the system automatically performs notification deduplication to ensure the same user does not receive duplicate notifications. The deduplication logic is based on user, incident, and notification content.
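Deduplication keyed on user, incident, and notification content might look like the following. This is a sketch under assumptions: the dictionary field names (`user`, `incident_id`, `content`) and the function name `deduplicate` are invented for the example, and the system's actual comparison of "notification content" is not specified here.

```python
def deduplicate(notifications: list[dict]) -> list[dict]:
    """Keep the first notification for each (user, incident, content) combination."""
    seen = set()
    unique = []
    for note in notifications:
        key = (note["user"], note["incident_id"], note["content"])
        if key not in seen:
            seen.add(key)
            unique.append(note)
    return unique


pending = [
    {"user": "A", "incident_id": 1, "content": "payment down", "rule": "Rule A"},
    {"user": "A", "incident_id": 1, "content": "payment down", "rule": "Rule C"},
    {"user": "B", "incident_id": 1, "content": "payment down", "rule": "Rule C"},
]
to_send = deduplicate(pending)
```

User A matched through two rules but appears only once in `to_send`; user B is unaffected.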
Escalation Strategy Configuration Example¶
Scenario: Escalation strategy for core service P0 incidents
| Level | Wait Time | Applicable Conditions | Notification Targets | Notification Method |
|---|---|---|---|---|
| Level 0 | T+0 | Severity = P0 | Current on-call person A | SMS + Email |
| Level 1 | T+5 minutes | Severity = P0, Status = Open/Working | + On-call team lead B | B: Phone call |
| Level 2 | T+15 minutes | Severity = P0, Status = Open/Working | + Department manager C | C: Phone call |
| Level 3 | T+30 minutes | Severity = P0, Status = Open/Working | + CTO D | D: Phone call + SMS |
In this example:
- When the incident occurs, the current on-call person A is notified immediately.
- If the incident is not handled after 5 minutes, the on-call team lead B is added to the notification (notification targets are now A + B).
- If still not handled after 15 minutes, the department manager C is added to the notification (notification targets are now A + B + C).
- If still not handled after 30 minutes, the CTO D is added to the notification, and Level 3 has repeat notifications enabled (e.g., every 10 minutes) until someone responds.
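The cumulative growth of notification targets in this scenario can be sketched as follows. The data is copied from the example table; the structure `EXAMPLE_LEVELS` and the function `notification_targets` are hypothetical, and the sketch ignores the severity and status conditions to focus on how targets accumulate.

```python
# (level, minutes after T=0, {added person: notification methods}) from the example table
EXAMPLE_LEVELS = [
    (0, 0, {"A": ["SMS", "Email"]}),
    (1, 5, {"B": ["Phone call"]}),
    (2, 15, {"C": ["Phone call"]}),
    (3, 30, {"D": ["Phone call", "SMS"]}),
]


def notification_targets(elapsed_minutes: float, levels=EXAMPLE_LEVELS) -> dict[str, list[str]]:
    """Accumulate targets level by level: each level adds people on top of the previous ones."""
    targets: dict[str, list[str]] = {}
    for _level, offset, added in levels:
        if elapsed_minutes >= offset:
            targets.update(added)
    return targets


at_15_minutes = notification_targets(15)
```

At T+15 minutes the targets are A, B, and C, each with their own notification methods; D is added only once T+30 is reached.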
Notification Method Description¶
Prerequisites
- The notified personnel must have configured corresponding contact methods (email, phone number) in their preference settings, otherwise they cannot receive notifications via those channels.
- If "on-call phone" or "on-call email" are additionally configured in the "Preference Settings", the system will prioritize using these dedicated contact methods for notifications to improve reliability and distinguishability.
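The priority given to dedicated on-call contact details can be pictured as a simple fallback lookup. This is only a sketch: the preference keys (`on_call_email`, `email`, `on_call_phone`, `phone`) and the function `resolve_contact` are assumptions for illustration, as is the choice to apply the on-call phone to both SMS and phone calls.

```python
def resolve_contact(preferences: dict, channel: str):
    """Prefer the dedicated on-call contact; fall back to the regular one; None if neither is set."""
    if channel == "email":
        return preferences.get("on_call_email") or preferences.get("email")
    if channel in ("sms", "phone"):
        # Assumption: the on-call phone number is used for both SMS and phone calls.
        return preferences.get("on_call_phone") or preferences.get("phone")
    raise ValueError(f"unknown channel: {channel}")


prefs = {"email": "a@example.com", "phone": "+15550100", "on_call_phone": "+15550199"}
phone_target = resolve_contact(prefs, "phone")
email_target = resolve_contact(prefs, "email")
```

A result of `None` corresponds to the case where the person has not configured a contact method for that channel and therefore cannot be reached through it.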
The system supports three notification channels. You can select individually for each notification target:
| Method | Description | Applicable Scenarios |
|---|---|---|
| Email | Sends email notifications containing incident details and links. | Non-urgent incidents, scenarios requiring detailed information. |
| SMS | Sends SMS notifications with concise content, only containing key information and links. | Scenarios requiring timely awareness but not immediate phone response. |
| Phone | IVR voice call. After connection, the alert content can be played, and keypress confirmation is required. ❗️If you need to configure on-call phone numbers for contacts in different timezones/regions, be sure to use the +area code format. | Urgent incidents, ensuring information delivery, suitable for nighttime or high-priority scenarios. |
Default On-call¶
The system has a built-in "Default On-call", which is a simplified version of an on-call rule suitable for simple scenarios. Its characteristics are as follows:
- Only the on-call personnel, personnel rotation, and escalation strategy can be configured.
- Non-configurable items: timezone (left empty; follows the system timezone) and matching tags/dimensions (cannot be set; the rule always matches globally).
- The default on-call is always displayed in the on-call list and cannot be deleted.
Rule Limitations¶
- An on-call rule can have a maximum of 10 escalation levels (Level 1~10) in addition to Level 0.
- The wait time for a single level cannot exceed 360 minutes (6 hours). A rule that exceeds this limit cannot be saved.
- The incident severity and status ranges for higher levels must be subsets of the ranges selected for lower levels.
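These limits lend themselves to a pre-save check. The sketch below is a hypothetical validator, not the product's own: the level dictionaries, the constant names, and `validate_levels` are assumptions, and it reads the level limit as Level 0 plus at most 10 escalation levels.

```python
MAX_ESCALATION_LEVELS = 10   # Level 1~10, assumed to exclude Level 0
MAX_WAIT_MINUTES = 360       # 6 hours


def validate_levels(levels: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the configuration passes these checks.

    `levels[0]` is Level 0. Each level is a dict with `wait_minutes` (int),
    `severities` (set) and `statuses` (set).
    """
    errors = []
    if len(levels) - 1 > MAX_ESCALATION_LEVELS:
        errors.append(f"more than {MAX_ESCALATION_LEVELS} escalation levels")
    for index, level in enumerate(levels):
        if level["wait_minutes"] > MAX_WAIT_MINUTES:
            errors.append(f"Level {index}: wait time exceeds {MAX_WAIT_MINUTES} minutes")
        if index > 0:
            lower = levels[index - 1]
            # Higher levels may only narrow the ranges chosen for the level below.
            if not level["severities"] <= lower["severities"]:
                errors.append(f"Level {index}: severities are not a subset of Level {index - 1}")
            if not level["statuses"] <= lower["statuses"]:
                errors.append(f"Level {index}: statuses are not a subset of Level {index - 1}")
    return errors
```

Returning all problems at once, rather than stopping at the first, makes it easier to fix a configuration in a single pass.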
Configuration Checklist
Before saving the on-call rule, it is recommended to confirm item by item:
- Does Level 0 include the current on-call personnel? (Included by default)
- Is the wait time for each level reasonable? (Consider that nighttime response may require more time)
- Does the final level include the contact person who "must be reached no matter what"?
- If repeat notifications are enabled, is the repeat interval less than the wait time for the next level?
- Have all notification targets configured their corresponding contact methods (especially phone)?
- Does the continuity of the escalation strategy meet requirements for cross-day scenarios?
Next Steps¶
After configuring the on-call rules, you can see the on-call information automatically associated with incidents in the Incident List. When an incident occurs, the system will automatically notify the corresponding personnel according to your set rules and execute the escalation strategy after timeout, ensuring every incident receives a timely response.



