AWS OpenSearch¶
AWS OpenSearch monitoring covers connections, requests, latency, slow queries, and more.
Configuration¶
Install Func¶
We recommend enabling TrueWatch Integration - Extensions - DataFlux Func (Automata): all prerequisites are installed automatically, and you can continue directly with the script installation.
To deploy Func yourself, refer to Self-deployed Func.
Install Script¶
Note: Prepare an Amazon AK with the required permissions in advance (for simplicity, you can grant the global read-only permission `ReadOnlyAccess`).
Managed Version Activation Script¶
- Log in to the TrueWatch console
- Click on the 【Integration】 menu, select 【Cloud Account Management】
- Click on 【Add Cloud Account】, select 【AWS】, and fill in the required information on the interface. If the cloud account information has been configured before, skip this step
- Click on 【Test】, and after a successful test, click on 【Save】. If the test fails, please check if the relevant configuration information is correct and test again
- In the 【Cloud Account Management】 list, find the added cloud account and click on it to enter the details page
- Click on the 【Integration】 button on the cloud account details page, find `AWS OpenSearch` under the `Not Installed` list, and click on the 【Install】 button to open the installation interface and complete the installation
Manual Activation Script¶
- Log in to the Func console, click on 【Script Market】, enter the TrueWatch script market, and search for `integration_aws_open_search`
- Click on 【Install】, and enter the corresponding parameters: AWS AK ID, AK Secret, and account name
- Click on 【Deploy Startup Script】. The system automatically creates a `Startup` script set and configures the corresponding startup scripts
- After enabling, you can see the corresponding automatic trigger configuration in 「Management / Automatic Trigger Configuration」. Click on 【Execute】 to run it once immediately without waiting for the scheduled time. After a short wait, you can view the execution task records and corresponding logs
Verification¶
- In 「Management / Automatic Trigger Configuration」, confirm that the corresponding task has an automatic trigger configuration, and check the task records and logs for exceptions
- In TrueWatch, check if there is asset information in 「Infrastructure / Custom」
- In TrueWatch, check if there is corresponding monitoring data in 「Metrics」
Metrics¶
After configuring AWS OpenSearch, the default measurement sets are as follows. You can collect more metrics through configuration; see AWS CloudWatch Metrics Details.
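For orientation, the metrics listed below are published by CloudWatch under the `AWS/ES` namespace, keyed by the `DomainName` and `ClientId` (account ID) dimensions. The following is a minimal sketch, not the collector's own code, that builds the request parameters for a CloudWatch `GetMetricStatistics` call. The function name `build_metric_request`, the domain name, and the account ID are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(domain_name, account_id, metric_name, statistic, minutes=10):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one-minute period, as several metrics below require
        "Statistics": [statistic],
    }


# Placeholder domain name and account ID (assumptions, not real values).
params = build_metric_request("my-domain", "123456789012", "ClusterStatus.red", "Maximum")
print(params["Namespace"], params["MetricName"], params["Statistics"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

The statistic passed in should be one of the "Related statistics" listed for the metric in the table, for example Maximum for the `ClusterStatus.*` metrics.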
Cluster Metrics¶
Amazon OpenSearch Service provides the following metrics for clusters.
Metric | Description |
---|---|
ClusterStatus.green | A value of 1 indicates that all index shards are allocated to nodes in the cluster. Related statistics: Maximum |
ClusterStatus.yellow | A value of 1 indicates that the primary shards of all indices are allocated to nodes in the cluster, but the replica shards of at least one index are not. For more information, see Yellow Cluster Status. Related statistics: Maximum |
ClusterStatus.red | A value of 1 indicates that the primary and replica shards of at least one index are not allocated to nodes in the cluster. For more information, see Red Cluster Status. Related statistics: Maximum |
Shards.active | The total number of active primary and replica shards. Related statistics: Maximum, Sum |
Shards.unassigned | The number of shards not allocated to nodes in the cluster. Related statistics: Maximum, Sum |
Shards.delayedUnassigned | The number of shards whose node allocation is delayed due to timeout settings. Related statistics: Maximum, Sum |
Shards.activePrimary | The number of active primary shards. Related statistics: Maximum, Sum |
Shards.initializing | The number of shards being initialized. Related statistics: Sum |
Shards.relocating | The number of shards being relocated. Related statistics: Sum |
Nodes | The number of nodes in the OpenSearch Service cluster, including dedicated master nodes and UltraWarm nodes. For more information, see Changing Configuration in Amazon OpenSearch Service. Related statistics: Maximum |
SearchableDocuments | The total number of searchable documents across all data nodes in the cluster. Related statistics: Minimum, Maximum, Average |
CPUUtilization | The percentage of CPU utilization for data nodes in the cluster. Maximum shows the node with the highest CPU utilization. Average represents all nodes in the cluster. This metric is also available for individual nodes. Related statistics: Maximum, Average |
ClusterUsedSpace | The total amount of used space in the cluster. You must keep a one-minute period to get an accurate value. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. Related statistics: Minimum, Maximum |
ClusterIndexWritesBlocked | Indicates whether your cluster is accepting or blocking incoming write requests. A value of 0 indicates that the cluster is accepting requests. A value of 1 indicates that requests are being blocked. Common causes include FreeStorageSpace being too low or JVMMemoryPressure being too high. To mitigate this, consider increasing disk space or scaling the cluster. Related statistics: Maximum |
FreeStorageSpace | The available space on each data node in the cluster. Sum shows the total available space in the cluster, but you must keep a one-minute period to get an accurate value. Minimum and Maximum show the nodes with the smallest and largest available space, respectively. This metric is also available for individual nodes. OpenSearch Service throws a ClusterBlockException when this metric reaches 0. To recover, you must delete indices, add larger instances, or add EBS-based storage to existing instances. To learn more, see Insufficient Available Storage Space. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. |
JVMMemoryPressure | The maximum percentage of the Java heap used across all data nodes in the cluster. OpenSearch Service uses half of the instance's RAM for the Java heap, with a maximum heap size of 32 GiB. You can vertically scale the instance's RAM up to 64 GiB, at which point you can horizontally scale by adding instances. See Recommended CloudWatch Alarms for Amazon OpenSearch Service. Related statistics: Maximum. Note: The logic of this metric changed in service software R20220323. For more information, see Release Notes. |
JVMGCYoungCollectionCount | The number of times "young generation" garbage collection has run. A large and growing number of runs is normal for cluster operations. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionTime | The time spent by the cluster performing "old generation" garbage collection, in milliseconds. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCYoungCollectionTime | The time spent by the cluster performing "young generation" garbage collection, in milliseconds. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionCount | The number of times "old generation" garbage collection has run. In clusters with sufficient resources, this number should remain small and grow infrequently. This metric is also captured at the node level. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
IndexingLatency | The difference in the total time (in milliseconds) spent on all indexing operations in a node between minute N and minute (N-1). |
IndexingRate | The number of indexing operations per minute. |
SearchLatency | The difference in the total time (in milliseconds) spent on all searches in a node between minute N and minute (N-1). |
SearchRate | The total number of search requests per minute across all shards on data nodes. |
SegmentCount | The number of segments on a data node. The more segments you have, the longer each search takes. OpenSearch sometimes merges smaller segments into larger ones. Related node statistics: Maximum, Average. Related cluster statistics: Sum, Maximum, Average |
SysMemoryUtilization | The percentage of instance memory in use. A high value for this metric is normal and usually does not indicate a problem with the cluster. For a better indication of potential performance and stability issues, see the JVMMemoryPressure metric. Related node statistics: Minimum, Maximum, Average. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsConcurrentConnections | The number of active concurrent connections to OpenSearch Dashboards. If this number is consistently high, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapTotal | The amount of heap memory allocated to OpenSearch Dashboards in MiB. Different EC2 instance types may affect the exact memory allocation. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUsed | The absolute amount of heap memory used by OpenSearch Dashboards in MiB. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUtilization | The maximum percentage of available heap memory used by OpenSearch Dashboards. If this value exceeds 80%, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsResponseTimesMaxInMillis | The maximum time (in milliseconds) taken by OpenSearch Dashboards to respond to a request. If requests consistently take a long time to return results, consider increasing the size of the instance type. Related node statistics: Maximum. Related cluster statistics: Maximum, Average |
OpenSearchDashboardsOS1MinuteLoad | The one-minute CPU load average for OpenSearch Dashboards. Ideally, CPU load should remain below 1.00. While temporary spikes are fine, if this metric is consistently above 1.00, we recommend increasing the size of the instance type. Related node statistics: Average. Related cluster statistics: Average, Maximum |
OpenSearchDashboardsRequestTotal | The total number of HTTP requests made to OpenSearch Dashboards. If your system is slow or you see a large number of dashboard requests, consider increasing the size of the instance type. Related node statistics: Sum. Related cluster statistics: Sum |
ThreadpoolForce_mergeQueue | The number of queued tasks in the force merge thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
ThreadpoolForce_mergeRejected | The number of rejected tasks in the force merge thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolForce_mergeThreads | The size of the force merge thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchQueue | The number of queued tasks in the search thread pool. If the queue size is consistently large, consider scaling your cluster. The maximum size of the search queue is 1000. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchRejected | The number of rejected tasks in the search thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolSearchThreads | The size of the search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
Threadpoolsql-workerQueue | The number of queued tasks in the SQL search thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
Threadpoolsql-workerRejected | The number of rejected tasks in the SQL search thread pool. If this number continues to grow, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
Threadpoolsql-workerThreads | The size of the SQL search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteQueue | The number of queued tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteRejected | The number of rejected tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteThreads | The size of the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
CoordinatingWriteRejected | The total number of rejections that have occurred on coordinating nodes due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
ReplicaWriteRejected | The total number of rejections that have occurred on replica shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
PrimaryWriteRejected | The total number of rejections that have occurred on primary shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
ReadLatency | The latency of read operations on an EBS volume in seconds. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadThroughput | The throughput of read operations on an EBS volume in bytes per second. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadIOPS | The number of input and output (I/O) operations per second for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteIOPS | The number of input and output (I/O) operations per second for write operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteLatency | The latency of write operations on an EBS volume in seconds. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
BurstBalance | The percentage of I/O credits remaining in the burst bucket of an EBS volume. A value of 100 indicates that the volume has accumulated the maximum number of credits. If this percentage is below 70%, see Low EBS Burst Balance. For domains with gp3 volume types and domains with gp2 volumes larger than 1000 GiB, the burst balance remains at 0. Related statistics: Minimum, Maximum, Average |
CurrentPointInTime | The number of active PIT search contexts in a node. |
TotalPointInTime | The number of expired PIT search contexts since the node started. |
HasActivePointInTime | A value of 1 indicates that there is an active PIT context on the node since the node started. A value of 0 indicates there is not. |
HasUsedPointInTime | A value of 1 indicates that there has been an expired PIT context on the node since the node started. A value of 0 indicates there has not. |
AsynchronousSearchInitializedRate | The number of asynchronous searches initialized in the past 1 minute. |
AsynchronousSearchRunningCurrent | The number of asynchronous searches currently running. |
AsynchronousSearchCompletionRate | The number of asynchronous searches successfully completed in the past 1 minute. |
AsynchronousSearchFailureRate | The number of asynchronous searches that completed and failed in the past 1 minute. |
AsynchronousSearchPersistRate | The number of asynchronous searches persisted in the past 1 minute. |
AsynchronousSearchRejected | The total number of asynchronous searches rejected since the node started. |
AsynchronousSearchCancelled | The total number of asynchronous searches cancelled since the node started. |
SQLRequestCount | The number of requests to the _sql API. Related statistics: Sum |
SQLUnhealthy | A value of 1 indicates that, in response to certain requests, the SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Other requests will continue to succeed. A value of 0 indicates that there have been no recent failures. If you see a persistent value of 1, troubleshoot the requests your client is making to the plugin. Related statistics: Maximum |
SQLDefaultCursorRequestCount | Similar to SQLRequestCount, but only counts paginated requests. Related statistics: Sum |
SQLFailedRequestCountByCusErr | The number of requests to the _sql API that failed due to client issues. For example, a request might return HTTP status code 400 due to IndexNotFoundException. Related statistics: Sum |
SQLFailedRequestCountBySysErr | The number of requests to the _sql API that failed due to server issues or functional limitations. For example, a request might return HTTP status code 503 due to VerificationException. Related statistics: Sum |
OldGenJVMMemoryPressure | The maximum percentage of the Java heap used for the "old generation" on all data nodes in the cluster. This metric is also captured at the node level. Related statistics: Maximum |
OpenSearchDashboardsHealthyNodes (formerly KibanaHealthyNodes) | The health check for OpenSearch Dashboards. If the minimum, maximum, and average are all equal to 1, Dashboards is functioning normally. If you have 10 nodes, the maximum is 1, the minimum is 0, and the average is 0.7, then 7 nodes (70%) are functioning normally and 3 nodes (30%) are unhealthy. Related statistics: Minimum, Maximum, Average |
InvalidHostHeaderRequests | The number of HTTP requests to the OpenSearch cluster that contain an invalid (or missing) host header. Valid requests include the domain hostname as the host header value. OpenSearch Service rejects invalid requests to public access domains without restrictive access policies. We recommend applying restrictive access policies to all domains. If you see a large value for this metric, confirm that your OpenSearch client includes the domain hostname (and not, for example, its IP address) in its requests. Related statistics: Sum |
OpenSearchRequests (previously ElasticsearchRequests) | The number of requests made to the OpenSearch cluster. Related statistics: Sum |
2xx, 3xx, 4xx, 5xx | The number of requests to the domain that resulted in the specified HTTP response code (2xx, 3xx, 4xx, 5xx). Related statistics: Sum |
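Several of the descriptions above carry small pieces of arithmetic or decision logic. The sketch below restates three of them in Python: turning the three `ClusterStatus.*` flags into one label, converting the `OpenSearchDashboardsHealthyNodes` average into a node count, and estimating the Java heap size behind `JVMMemoryPressure`. All function names are hypothetical and chosen for illustration; the rules come from the table text, not from any TrueWatch or AWS API.

```python
def cluster_status(green, yellow, red):
    """Collapse the ClusterStatus.* Maximum values (each 0 or 1) into one label.

    Red is checked first because it signals the most severe condition.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"


def healthy_dashboard_nodes(average, node_count):
    """Estimate healthy nodes from the OpenSearchDashboardsHealthyNodes Average."""
    return round(average * node_count)


def jvm_heap_gib(instance_ram_gib):
    """Heap is half of the instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, 32)


print(cluster_status(green=0, yellow=1, red=0))
print(healthy_dashboard_nodes(0.7, 10))
print(jvm_heap_gib(16), jvm_heap_gib(128))
```

With the values from the table's own example, an average of 0.7 across 10 nodes corresponds to 7 healthy nodes. The heap cap also shows why memory beyond 64 GiB per instance no longer enlarges the heap.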
Objects¶
The structure of the collected AWS OpenSearch object data is as follows. You can view the object data in 「Infrastructure / Custom」
{
  "measurement": "aws_opensearch",
  "tags": {
    "name": "df-prd-es",
    "EngineVersion": "Elasticsearch_7.10",
    "DomainId": "5882XXXXX135/df-prd-es",
    "DomainName": "df-prd-es",
    "ClusterConfig": "{JSON data of instance types and instance counts in the domain}",
    "ServiceSoftwareOptions": "{JSON data of the current state of the service software}",
    "region": "cn-northwest-1",
    "RegionId": "cn-northwest-1"
  },
  "fields": {
    "EBSOptions": "{JSON data of the Elastic Block Store for the specified domain}",
    "Endpoints": "{JSON data of the mapping of domain endpoints for submitting indexing and search requests}",
    "message": "{JSON data of the instance}"
  }
}
Note: Fields in `tags` and `fields` may change with subsequent updates.
Tip 1: The value of `tags.name` is the instance ID, used as a unique identifier.
Tip 2: The `tags.name` in this script corresponds to the `DomainName` data field. When using this script, ensure that there are no duplicate `DomainName` values across multiple AWS accounts.
Tip 3: `tags.ClusterConfig`, `tags.Endpoint`, `tags.ServiceSoftwareOptions`, `fields.message`, `fields.EBSOptions`, and `fields.Endpoints` are all JSON-serialized strings.
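Because those values are JSON-serialized strings rather than nested objects, a consumer has to decode them before reading individual settings. The following is a minimal sketch assuming a record shaped like the example above. The sample values, the `JSON_STRING_KEYS` set, and the `decode_json_strings` helper are hypothetical and only for illustration.

```python
import json

# Hypothetical record shaped like the object example; values are illustrative.
record = {
    "measurement": "aws_opensearch",
    "tags": {
        "name": "df-prd-es",
        "DomainName": "df-prd-es",
        "ClusterConfig": json.dumps({"InstanceType": "r6g.large.search", "InstanceCount": 3}),
    },
    "fields": {
        "EBSOptions": json.dumps({"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100}),
    },
}

# Keys documented as JSON-serialized strings.
JSON_STRING_KEYS = {
    "ClusterConfig", "Endpoint", "ServiceSoftwareOptions",
    "message", "EBSOptions", "Endpoints",
}


def decode_json_strings(section, keys=JSON_STRING_KEYS):
    """Return a copy of a tags/fields dict with the listed keys parsed from JSON."""
    decoded = {}
    for key, value in section.items():
        if key in keys and isinstance(value, str):
            try:
                decoded[key] = json.loads(value)
            except json.JSONDecodeError:
                decoded[key] = value  # keep the raw string if it is not valid JSON
        else:
            decoded[key] = value
    return decoded


tags = decode_json_strings(record["tags"])
fields = decode_json_strings(record["fields"])
print(tags["ClusterConfig"]["InstanceCount"], fields["EBSOptions"]["VolumeType"])
```

Falling back to the raw string on a decode error keeps the helper tolerant of field changes, which the note above says may occur in later updates.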