AWS OpenSearch¶
Monitoring for AWS OpenSearch, including connections, requests, latency, and slow queries.
Configuration¶
Install Func¶
We recommend enabling TrueWatch Integration - Extensions - DataFlux Func (Automata): all prerequisites are installed automatically, so you can proceed directly to script installation.
If you deploy Func yourself, refer to Self-deploying Func.
Install Script¶
Note: Prepare an Amazon AK with the required permissions in advance (for simplicity, you can directly grant the global read-only permission `ReadOnlyAccess`).
Enable Script in Automata¶
- Log in to the TrueWatch console
- Click on the 【Integration】 menu, select 【Cloud Account Management】
- Click on 【Add Cloud Account】, select 【AWS】, and fill in the required information on the interface. If you have already configured the cloud account information before, you can skip this step
- Click on 【Test】. After the test succeeds, click on 【Save】. If the test fails, check that the configuration information is correct and retest
- Click on 【Cloud Account Management】 to see the added cloud account in the list, then click on the corresponding cloud account to enter the details page
- Click on the 【Integration】 button on the cloud account details page, find `AWS OpenSearch` under the `Not Installed` list, and click on the 【Install】 button to open the installation interface and complete the installation.
Manually Enable Script¶
- Log in to the Func console, click on 【Script Market】, enter the TrueWatch script market, and search for: `integration_aws_open_search`
- Click on 【Install】, then enter the corresponding parameters: AWS AK ID, AK Secret, and account name.
- Click on 【Deploy Startup Script】. The system automatically creates the `Startup` script set and configures the corresponding startup script.
- After enabling, you can see the corresponding automatic trigger configuration in 「Manage / Automatic Trigger Configuration」. Click on 【Execute】 to run it immediately without waiting for the scheduled time. After a while, you can check the execution task records and corresponding logs.
Verification¶
- In 「Manage / Automatic Trigger Configuration」, confirm that the task has an automatic trigger configuration, and check the task records and logs for exceptions
- In TrueWatch, check in 「Infrastructure / Custom」 to see if there is asset information
- In TrueWatch, check in 「Metrics」 to see if there is corresponding monitoring data
Metrics¶
After AWS OpenSearch is configured, the metrics listed in the sections below are collected by default. You can collect more metrics through configuration; see AWS CloudWatch Metrics Details.
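When checking a metric directly against CloudWatch, it can help to see what a query for a single OpenSearch metric might look like. The sketch below only builds the request parameters and sends nothing. It assumes the OpenSearch Service metrics are published under the `AWS/ES` namespace with the `DomainName` and `ClientId` (AWS account ID) dimensions; the helper name `build_metric_query` and the account ID are hypothetical, and this is not the collection script's own implementation.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(domain_name, account_id, metric_name,
                       statistic="Maximum", minutes=60, period=60, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Hypothetical helper: the namespace and dimensions are assumptions
    about how OpenSearch Service metrics are published.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,          # seconds per datapoint
        "Statistics": [statistic],
    }


# Placeholder account ID; the domain name is taken from the sample object data.
query = build_metric_query("df-prd-es", "123456789012", "ClusterStatus.red")

# With boto3 installed and credentials configured, the query could be sent as:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="cn-northwest-1")
#   datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
```

Keeping the parameter construction separate from the network call makes it easy to inspect the query before spending any API requests.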
Cluster Metrics¶
Amazon OpenSearch Service provides the following metrics for clusters.
Metric | Description |
---|---|
ClusterStatus.green | A value of 1 indicates that all index shards are allocated to nodes in the cluster. Related statistics: Maximum |
ClusterStatus.yellow | A value of 1 indicates that the primary shards of all indices are allocated to nodes in the cluster, but the replica shards of at least one index are not. For more information, see Yellow Cluster Status. Related statistics: Maximum |
ClusterStatus.red | A value of 1 indicates that the primary and replica shards of at least one index are not allocated to nodes in the cluster. For more information, see Red Cluster Status. Related statistics: Maximum |
Shards.active | The total number of active primary and replica shards. Related statistics: Maximum, Sum |
Shards.unassigned | The number of shards not allocated to nodes in the cluster. Related statistics: Maximum, Sum |
Shards.delayedUnassigned | The number of shards whose node allocation has been delayed due to timeout settings. Related statistics: Maximum, Sum |
Shards.activePrimary | The number of active primary shards. Related statistics: Maximum, Sum |
Shards.initializing | The number of shards being initialized. Related statistics: Sum |
Shards.relocating | The number of shards being relocated. Related statistics: Sum |
Nodes | The number of nodes in the OpenSearch Service cluster, including dedicated master nodes and UltraWarm nodes. For more information, see Changing Configuration in Amazon OpenSearch Service. Related statistics: Maximum |
SearchableDocuments | The total number of searchable documents across all data nodes in the cluster. Related statistics: Minimum, Maximum, Average |
CPUUtilization | The percentage of CPU utilization for data nodes in the cluster. The maximum shows the node with the highest CPU utilization. The average represents all nodes in the cluster. This metric is also available for individual nodes. Related statistics: Maximum, Average |
ClusterUsedSpace | The total used space for the cluster. You must wait one minute to get an accurate value. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. Related statistics: Minimum, Maximum |
ClusterIndexWritesBlocked | Indicates whether your cluster is accepting or blocking incoming write requests. A value of 0 means the cluster is accepting requests. A value of 1 means requests are being blocked. Common causes include FreeStorageSpace being too low or JVMMemoryPressure being too high. To mitigate this, consider increasing disk space or scaling the cluster. Related statistics: Maximum |
FreeStorageSpace | The available space on each data node in the cluster. Sum shows the total available space for the cluster, but you must wait one minute to get an accurate value. Minimum and Maximum show the nodes with the smallest and largest available space, respectively. This metric is also available for individual nodes. The OpenSearchClusterBlockException is thrown when this metric reaches 0. To recover, you must delete indices, add larger instances, or add EBS-based storage to existing instances. To learn more, see Insufficient Free Storage Space. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. |
JVMMemoryPressure | The maximum percentage of Java heap used for all data nodes in the cluster. OpenSearch Service uses half of an instance's RAM for the Java heap, with a maximum heap size of 32 GiB. You can vertically scale an instance's RAM up to 64 GiB, at which point you can horizontally scale by adding instances. See Recommended CloudWatch Alarms for Amazon OpenSearch Service. Related statistics: Maximum. Note: The logic for this metric changed in service software R20220323. For more information, see Release Notes. |
JVMGCYoungCollectionCount | The number of times "young generation" garbage collection has run. A large and growing number of runs is normal for cluster operations. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionTime | The time, in milliseconds, that the cluster has spent performing "old generation" garbage collection. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCYoungCollectionTime | The time, in milliseconds, that the cluster has spent performing "young generation" garbage collection. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
JVMGCOldCollectionCount | The number of times "old generation" garbage collection has run. In a cluster with sufficient resources, this number should remain small and grow infrequently. This metric is also captured at the node level. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
IndexingLatency | The difference in the total time, in milliseconds, that all indexing operations in a node took between minute N and minute (N-1). |
IndexingRate | The number of indexing operations per minute. |
SearchLatency | The difference in the total time, in milliseconds, that all searches in a node took between minute N and minute (N-1). |
SearchRate | The total number of search requests per minute for all shards on a data node. |
SegmentCount | The number of segments on a data node. The more segments you have, the longer each search takes. OpenSearch sometimes merges smaller segments into larger ones. Related node statistics: Maximum, Average. Related cluster statistics: Sum, Maximum, Average |
SysMemoryUtilization | The percentage of instance memory in use. A high value for this metric is normal and does not usually indicate a problem with the cluster. For a better indication of potential performance and stability issues, see the JVMMemoryPressure metric. Related node statistics: Minimum, Maximum, Average. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsConcurrentConnections | The number of active concurrent connections to OpenSearch Dashboards. If this number is consistently high, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapTotal | The amount of heap memory, in MiB, allocated to OpenSearch Dashboards. Different EC2 instance types may affect the exact memory allocation. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUsed | The absolute amount of heap memory, in MiB, used by OpenSearch Dashboards. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
OpenSearchDashboardsHeapUtilization | The percentage of the maximum available heap memory used by OpenSearch Dashboards. If this value exceeds 80%, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Minimum, Maximum, Average |
OpenSearchDashboardsResponseTimesMaxInMillis | The maximum time, in milliseconds, that OpenSearch Dashboards took to respond to a request. If requests consistently take a long time to return results, consider increasing the size of the instance type. Related node statistics: Maximum. Related cluster statistics: Maximum, Average |
OpenSearchDashboardsOS1MinuteLoad | The one-minute CPU load average for OpenSearch Dashboards. Ideally, CPU load should remain below 1.00. While temporary spikes are fine, if this metric is consistently above 1.00, we recommend increasing the size of the instance type. Related node statistics: Average. Related cluster statistics: Average, Maximum |
OpenSearchDashboardsRequestTotal | The total number of HTTP requests made to OpenSearch Dashboards. If your system is slow or you see a large number of dashboard requests, consider increasing the size of the instance type. Related node statistics: Sum. Related cluster statistics: Sum |
ThreadpoolForce_mergeQueue | The number of queued tasks in the force merge thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
ThreadpoolForce_mergeRejected | The number of rejected tasks in the force merge thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolForce_mergeThreads | The size of the force merge thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchQueue | The number of queued tasks in the search thread pool. If the queue size is consistently large, consider scaling your cluster. The maximum size for the search queue is 1000. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolSearchRejected | The number of rejected tasks in the search thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
ThreadpoolSearchThreads | The size of the search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
Threadpoolsql-workerQueue | The number of queued tasks in the SQL search thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum, Maximum, Average |
Threadpoolsql-workerRejected | The number of rejected tasks in the SQL search thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum. Related cluster statistics: Sum |
Threadpoolsql-workerThreads | The size of the SQL search thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteQueue | The number of queued tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteRejected | The number of rejected tasks in the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
ThreadpoolWriteThreads | The size of the write thread pool. Related node statistics: Maximum. Related cluster statistics: Average, Sum |
CoordinatingWriteRejected | The total number of rejections that have occurred on coordinating nodes due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
ReplicaWriteRejected | The total number of rejections that have occurred on replica shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
PrimaryWriteRejected | The total number of rejections that have occurred on primary shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum. Related cluster statistics: Average, Sum. This metric is available in version 7.1 and later. |
ReadLatency | The latency, in seconds, for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadThroughput | The throughput, in bytes per second, for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
ReadIOPS | The number of input and output (I/O) operations per second for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteIOPS | The number of input and output (I/O) operations per second for write operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
WriteLatency | The latency, in seconds, for write operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average |
BurstBalance | The percentage of I/O credits remaining in the burst bucket of an EBS volume. A value of 100 means the volume has accumulated the maximum number of credits. If this percentage is below 70%, see Low EBS Burst Capacity Balance. For domains with gp3 volume types and domains with gp2 volumes larger than 1000 GiB, the burst balance remains at 0. Related statistics: Minimum, Maximum, Average |
CurrentPointInTime | The number of active PIT search contexts in a node. |
TotalPointInTime | The number of expired PIT search contexts since the node started. |
HasActivePointInTime | A value of 1 indicates that an active PIT context has existed on the node since the node started. A value of 0 indicates that it has not. |
HasUsedPointInTime | A value of 1 indicates that an expired PIT context has existed on the node since the node started. A value of 0 indicates that it has not. |
AsynchronousSearchInitializedRate | The number of asynchronous searches initialized in the last minute. |
AsynchronousSearchRunningCurrent | The number of asynchronous searches currently running. |
AsynchronousSearchCompletionRate | The number of asynchronous searches successfully completed in the last minute. |
AsynchronousSearchFailureRate | The number of asynchronous searches that completed and failed in the last minute. |
AsynchronousSearchPersistRate | The number of asynchronous searches persisted in the last minute. |
AsynchronousSearchRejected | The total number of asynchronous searches rejected since the node started. |
AsynchronousSearchCancelled | The total number of asynchronous searches canceled since the node started. |
SQLRequestCount | The number of requests to the _sql API. Related statistics: Sum |
SQLUnhealthy | A value of 1 indicates that, in response to certain requests, the SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Other requests will continue to succeed. A value of 0 indicates that no recent failures have occurred. If you see a consistent value of 1, troubleshoot the requests your client makes to the plugin. Related statistics: Maximum |
SQLDefaultCursorRequestCount | Similar to SQLRequestCount, but only counts paged requests. Related statistics: Sum |
SQLFailedRequestCountByCusErr | The number of requests to the _sql API that failed due to client issues. For example, a request might return an HTTP status code of 400 due to an IndexNotFoundException. Related statistics: Sum |
SQLFailedRequestCountBySysErr | The number of requests to the _sql API that failed due to server issues or functional limitations. For example, a request might return an HTTP status code of 503 due to a VerificationException. Related statistics: Sum |
OldGenJVMMemoryPressure | The maximum percentage of Java heap used for the "old generation" on all data nodes in the cluster. This metric is also captured at the node level. Related statistics: Maximum |
OpenSearchDashboardsHealthyNodes (formerly KibanaHealthyNodes) | The health check for OpenSearch Dashboards. If the minimum, maximum, and average all equal 1, Dashboards is functioning normally. If you have 10 nodes, the maximum is 1, the minimum is 0, and the average is 0.7, then 7 nodes (70%) are functioning normally and 3 nodes (30%) are unhealthy. Related statistics: Minimum, Maximum, Average |
InvalidHostHeaderRequests | The number of HTTP requests to an OpenSearch cluster that contain an invalid (or missing) host header. Valid requests include the domain hostname as the host header value. OpenSearch Service rejects invalid requests to public access domains without restrictive access policies. We recommend applying restrictive access policies to all domains. If you see a large value for this metric, confirm that your OpenSearch client includes the domain hostname (and not, for example, its IP address) in its requests. Related statistics: Sum |
OpenSearchRequests (formerly ElasticsearchRequests) | The number of requests made to an OpenSearch cluster. Related statistics: Sum |
2xx, 3xx, 4xx, 5xx | The number of requests to the domain that resulted in the specified HTTP response code (2xx, 3xx, 4xx, 5xx). Related statistics: Sum |
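Several rows in the table describe small calculations: the three `ClusterStatus.*` flags, the `OpenSearchDashboardsHealthyNodes` average, and the JVM heap sizing rule. The following is a minimal sketch of how those descriptions could be turned into checks. The function names are hypothetical and not part of any TrueWatch or AWS API, and the precedence red over yellow over green is an assumption for the case where more than one flag is set.

```python
def cluster_status(green, yellow, red):
    """Map the three ClusterStatus.* values (each 0 or 1) to a status name.

    Assumption: if several flags are set, the most severe one wins.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"


def healthy_dashboard_nodes(average, node_count):
    """Estimate healthy nodes from the Average of OpenSearchDashboardsHealthyNodes."""
    return round(average * node_count)


def jvm_heap_gib(instance_ram_gib):
    """Java heap size: half of the instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, 32)


# The example from the table: 10 nodes with an average of 0.7.
print(healthy_dashboard_nodes(0.7, 10))
# A 64 GiB instance reaches the 32 GiB heap cap.
print(jvm_heap_gib(64))
```

The heap cap is why the table suggests scaling horizontally once an instance has 64 GiB of RAM: more RAM on one node no longer increases the heap.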
Object¶
The collected AWS OpenSearch object data structure can be seen in 「Infrastructure - Custom」.
```json
{
  "measurement": "aws_opensearch",
  "tags": {
    "name": "df-prd-es",
    "EngineVersion": "Elasticsearch_7.10",
    "DomainId": "5882XXXXX135/df-prd-es",
    "DomainName": "df-prd-es",
    "ClusterConfig": "{JSON data of instance types and instance counts in the domain}",
    "ServiceSoftwareOptions": "{JSON data of the current state of the service software}",
    "region": "cn-northwest-1",
    "RegionId": "cn-northwest-1"
  },
  "fields": {
    "EBSOptions": "{JSON data of Elastic Block Store data for the specified domain}",
    "Endpoints": "{JSON data of a map of domain endpoints for submitting indexing and search requests}",
    "message": "{JSON data of instance}"
  }
}
```

Note: The fields in `tags` and `fields` may change with subsequent updates.

Tip 1: The value of `tags.name` is the instance ID, used as a unique identifier.

Tip 2: The data field corresponding to `tags.name` in this script is `DomainName`. When using this script, ensure that there are no duplicate `DomainName` values across multiple AWS accounts.

Tip 3: `tags.ClusterConfig`, `tags.Endpoint`, `tags.ServiceSoftwareOptions`, `fields.message`, `fields.EBSOptions`, and `fields.Endpoints` are all JSON-serialized strings.
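Because several tag and field values are JSON-serialized strings, they need to be parsed before their contents can be inspected. The sketch below shows one way to do that with the standard library. The helper `decode_object` and the sample values (`InstanceCount`, `VolumeSize`, and so on) are hypothetical placeholders, not actual output of the script.

```python
import json

# Keys documented as JSON-serialized strings, grouped by section.
JSON_STRING_KEYS = {
    "tags": ("ClusterConfig", "ServiceSoftwareOptions"),
    "fields": ("EBSOptions", "Endpoints", "message"),
}


def decode_object(record):
    """Return a copy of the record with JSON-serialized strings parsed."""
    decoded = {"measurement": record.get("measurement")}
    for section, keys in JSON_STRING_KEYS.items():
        values = dict(record.get(section, {}))
        for key in keys:
            raw = values.get(key)
            if isinstance(raw, str):
                try:
                    values[key] = json.loads(raw)
                except json.JSONDecodeError:
                    pass  # leave values that are not valid JSON untouched
        decoded[section] = values
    return decoded


# Hypothetical sample with made-up values, for illustration only.
sample = {
    "measurement": "aws_opensearch",
    "tags": {"name": "df-prd-es", "ClusterConfig": '{"InstanceCount": 3}'},
    "fields": {"EBSOptions": '{"EBSEnabled": true, "VolumeSize": 100}'},
}
decoded = decode_object(sample)
print(decoded["tags"]["ClusterConfig"]["InstanceCount"])
print(decoded["fields"]["EBSOptions"]["VolumeSize"])
```

Since the note above warns that the fields may change, the helper skips missing keys and leaves unparseable values as they are instead of raising.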