
AWS OpenSearch

Monitor AWS OpenSearch, including connections, requests, latency, and slow queries.

Configuration

Install Func

It is recommended to enable TrueWatch Integration - Extensions - DataFlux Func (Automata), which installs all prerequisites automatically. Then proceed with the script installation.

If you deploy Func yourself, refer to Self-deploying Func

Install Script

Note: Prepare an Amazon AK (access key) with the required permissions in advance. For simplicity, you can grant the global read-only permission ReadOnlyAccess.

Enable Script in Automata

  1. Log in to the TrueWatch console
  2. Click on the 【Integration】 menu, select 【Cloud Account Management】
  3. Click on 【Add Cloud Account】, select 【AWS】, and fill in the required information. If you have already configured the cloud account, you can skip this step
  4. Click on 【Test】. After the test succeeds, click on 【Save】. If the test fails, check that the configuration is correct and test again
  5. Click on 【Cloud Account Management】 to see the added cloud account in the list, then click on the cloud account to open its details page
  6. Click on the 【Integration】 button on the cloud account details page, find AWS OpenSearch in the Not Installed list, and click on 【Install】 to open the installation dialog and complete the installation

Manually Enable Script

  1. Log in to the Func console, click on 【Script Market】, enter the TrueWatch script market, search for: integration_aws_open_search

  2. Click on 【Install】, then enter the corresponding parameters: AWS AK ID, AK Secret, and account name.

  3. Click on 【Deploy Startup Script】. The system automatically creates the Startup script set and configures the corresponding startup script.

  4. After the script is enabled, the corresponding automatic trigger configuration appears in 「Manage / Automatic Trigger Configuration」. Click on 【Execute】 to run it immediately instead of waiting for the scheduled time. After a short while, check the task execution records and the corresponding logs.

Verification

  1. In 「Manage / Automatic Trigger Configuration」, confirm that the task has an automatic trigger configuration, and check the task records and logs for exceptions
  2. In TrueWatch, check 「Infrastructure / Custom」 for asset information
  3. In TrueWatch, check 「Metrics」 for the corresponding monitoring data

Metrics

After AWS OpenSearch is configured, the default measurement set is as follows. You can collect more metrics through configuration; see AWS CloudWatch Metrics Details.
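To check that a metric from the table below exists in CloudWatch before adding it to the collection, a query along the following lines may help. This is a minimal sketch, not part of the TrueWatch script: the function names `build_metric_request` and `fetch_datapoints` are hypothetical. It assumes the `boto3` package is installed and AWS credentials are configured, and it relies on CloudWatch publishing OpenSearch Service metrics under the `AWS/ES` namespace with the `DomainName` and `ClientId` (AWS account ID) dimensions.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(domain_name, account_id, metric_name, statistic,
                         minutes=10, period=60, now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    The period defaults to 60 seconds, matching the one-minute
    granularity that most metrics in this document assume.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": [statistic],
    }


def fetch_datapoints(region, **request):
    """Send the request to CloudWatch and return datapoints sorted by time."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

For example, `fetch_datapoints("cn-northwest-1", **build_metric_request("df-prd-es", "<account id>", "CPUUtilization", "Maximum"))` would return the last ten minutes of maximum CPU utilization for that domain, assuming the account ID placeholder is replaced with a real value.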

Cluster Metrics

Amazon OpenSearch Service provides the following metrics for clusters.

Metric Description
ClusterStatus.green A value of 1 indicates that all index shards are allocated to nodes in the cluster. Related statistics: Maximum
ClusterStatus.yellow A value of 1 indicates that all primary shards of all indices are allocated to nodes in the cluster, but at least one index's replica shard is not. For more information, see Yellow Cluster Status. Related statistics: Maximum
ClusterStatus.red A value of 1 indicates that at least one index's primary and replica shards are not allocated to nodes in the cluster. For more information, see Red Cluster Status. Related statistics: Maximum
Shards.active The total number of active primary and replica shards. Related statistics: Maximum, Sum
Shards.unassigned The number of shards not allocated to nodes in the cluster. Related statistics: Maximum, Sum
Shards.delayedUnassigned The number of shards whose node allocation has been delayed due to timeout settings. Related statistics: Maximum, Sum
Shards.activePrimary The number of active primary shards. Related statistics: Maximum, Sum
Shards.initializing The number of shards being initialized. Related statistics: Sum
Shards.relocating The number of shards being relocated. Related statistics: Sum
Nodes The number of nodes in the OpenSearch Service cluster, including dedicated master nodes and UltraWarm nodes. For more information, see Changing Configuration in Amazon OpenSearch Service. Related statistics: Maximum
SearchableDocuments The total number of searchable documents across all data nodes in the cluster. Related statistics: Minimum, Maximum, Average
CPUUtilization The percentage of CPU utilization for data nodes in the cluster. The maximum shows the node with the highest CPU utilization. The average represents all nodes in the cluster. This metric is also available for individual nodes. Related statistics: Maximum, Average
ClusterUsedSpace The total used space for the cluster. You must leave the period at one minute to get an accurate value. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB. Related statistics: Minimum, Maximum
ClusterIndexWritesBlocked Indicates whether your cluster is accepting or blocking incoming write requests. A value of 0 means the cluster is accepting requests. A value of 1 means requests are being blocked. Some common factors include: FreeStorageSpace being too low or JVMMemoryPressure being too high. To mitigate this, consider increasing disk space or scaling the cluster. Related statistics: Maximum
FreeStorageSpace The available space on each data node in the cluster. Sum shows the total available space for the cluster, but you must leave the period at one minute to get an accurate value. Minimum and Maximum show the nodes with the smallest and largest available space, respectively. This metric is also available for individual nodes. OpenSearch Service throws a ClusterBlockException when this metric reaches 0. To recover, you must delete indices, add larger instances, or add EBS-based storage to existing instances. To learn more, see Insufficient Free Storage Space. The OpenSearch Service console displays this value in GiB. The Amazon CloudWatch console displays it in MiB.
JVMMemoryPressure The maximum percentage of Java heap used for all data nodes in the cluster. OpenSearch Service uses half of an instance's RAM for the Java heap, with a maximum heap size of 32 GiB. You can vertically scale an instance's RAM up to 64 GiB, at which point you can horizontally scale by adding instances. See Recommended CloudWatch Alarms for Amazon OpenSearch Service. Related statistics: Maximum. Note: The logic for this metric changed in service software R20220323. For more information, see Release Notes.
JVMGCYoungCollectionCount The number of times "young generation" garbage collection has run. A large and growing number of runs is normal for cluster operations. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
JVMGCOldCollectionTime The time, in milliseconds, that the cluster has spent performing "old generation" garbage collection. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
JVMGCYoungCollectionTime The time, in milliseconds, that the cluster has spent performing "young generation" garbage collection. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
JVMGCOldCollectionCount The number of times "old generation" garbage collection has run. In a cluster with sufficient resources, this number should remain small and grow infrequently. This metric is also captured at the node level. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
IndexingLatency The difference in the total time, in milliseconds, that all indexing operations in a node took between minute N and minute (N-1).
IndexingRate The number of indexing operations per minute.
SearchLatency The difference in the total time, in milliseconds, that all searches in a node took between minute N and minute (N-1).
SearchRate The total number of search requests per minute for all shards on a data node.
SegmentCount The number of segments on a data node. The more segments you have, the longer each search takes. OpenSearch sometimes merges smaller segments into larger ones. Related node statistics: Maximum, Average Related cluster statistics: Sum, Maximum, Average
SysMemoryUtilization The percentage of instance memory in use. A high value for this metric is normal and does not usually indicate a problem with the cluster. For a better indication of potential performance and stability issues, see the JVMMemoryPressure metric. Related node statistics: Minimum, Maximum, Average Related cluster statistics: Minimum, Maximum, Average
OpenSearchDashboardsConcurrentConnections The number of active concurrent connections to OpenSearch Dashboards. If this number is consistently high, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
OpenSearchDashboardsHeapTotal The amount of heap memory, in MiB, allocated to OpenSearch Dashboards. Different EC2 instance types may affect the exact memory allocation. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
OpenSearchDashboardsHeapUsed The absolute amount of heap memory, in MiB, used by OpenSearch Dashboards. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
OpenSearchDashboardsHeapUtilization The percentage of the maximum available heap memory used by OpenSearch Dashboards. If this value exceeds 80%, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Minimum, Maximum, Average
OpenSearchDashboardsResponseTimesMaxInMillis The maximum time, in milliseconds, that OpenSearch Dashboards took to respond to a request. If requests consistently take a long time to return results, consider increasing the size of the instance type. Related node statistics: Maximum Related cluster statistics: Maximum, Average
OpenSearchDashboardsOS1MinuteLoad The one-minute CPU load average for OpenSearch Dashboards. Ideally, CPU load should remain below 1.00. While temporary spikes are fine, if this metric is consistently above 1.00, we recommend increasing the size of the instance type. Related node statistics: Average Related cluster statistics: Average, Maximum
OpenSearchDashboardsRequestTotal The total number of HTTP requests made to OpenSearch Dashboards. If your system is slow or you see a large number of dashboard requests, consider increasing the size of the instance type. Related node statistics: Sum Related cluster statistics: Sum
ThreadpoolForce_mergeQueue The number of queued tasks in the force merge thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
ThreadpoolForce_mergeRejected The number of rejected tasks in the force merge thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum
ThreadpoolForce_mergeThreads The size of the force merge thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
ThreadpoolSearchQueue The number of queued tasks in the search thread pool. If the queue size is consistently large, consider scaling your cluster. The maximum size for the search queue is 1000. Related node statistics: Maximum Related cluster statistics: Average, Sum
ThreadpoolSearchRejected The number of rejected tasks in the search thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum
ThreadpoolSearchThreads The size of the search thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
Threadpoolsql-workerQueue The number of queued tasks in the SQL search thread pool. If the queue size is consistently large, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum, Maximum, Average
Threadpoolsql-workerRejected The number of rejected tasks in the SQL search thread pool. If this number is consistently growing, consider scaling your cluster. Related node statistics: Maximum Related cluster statistics: Sum
Threadpoolsql-workerThreads The size of the SQL search thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
ThreadpoolWriteQueue The number of queued tasks in the write thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
ThreadpoolWriteRejected The number of rejected tasks in the write thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
ThreadpoolWriteThreads The size of the write thread pool. Related node statistics: Maximum Related cluster statistics: Average, Sum
CoordinatingWriteRejected The total number of rejections that have occurred on coordinating nodes due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum Related cluster statistics: Average, Sum This metric is available in version 7.1 and later.
ReplicaWriteRejected The total number of rejections that have occurred on replica shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum Related cluster statistics: Average, Sum This metric is available in version 7.1 and later.
PrimaryWriteRejected The total number of rejections that have occurred on primary shards due to indexing pressure since the last OpenSearch Service process start. Related node statistics: Maximum Related cluster statistics: Average, Sum This metric is available in version 7.1 and later.
ReadLatency The latency, in seconds, for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average
ReadThroughput The throughput, in bytes per second, for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average
ReadIOPS The number of input and output (I/O) operations per second for read operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average
WriteIOPS The number of input and output (I/O) operations per second for write operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average
WriteLatency The latency, in seconds, for write operations on an EBS volume. This metric is also available for individual nodes. Related statistics: Minimum, Maximum, Average
BurstBalance The percentage of I/O credits remaining in the burst bucket of an EBS volume. A value of 100 means the volume has accumulated the maximum number of credits. If this percentage is below 70%, see Low EBS Burst Capacity Balance. For domains with gp3 volume types and domains with gp2 volumes larger than 1000 GiB, the burst balance remains at 0. Related statistics: Minimum, Maximum, Average
CurrentPointInTime The number of active PIT search contexts in a node.
TotalPointInTime The number of expired PIT search contexts since the node started.
HasActivePointInTime A value of 1 indicates that an active PIT context has existed on the node since the node started. A value of 0 indicates that it has not.
HasUsedPointInTime A value of 1 indicates that an expired PIT context has existed on the node since the node started. A value of 0 indicates that it has not.
AsynchronousSearchInitializedRate The number of asynchronous searches initialized in the last 1 minute.
AsynchronousSearchRunningCurrent The number of asynchronous searches currently running.
AsynchronousSearchCompletionRate The number of asynchronous searches successfully completed in the last 1 minute.
AsynchronousSearchFailureRate The number of asynchronous searches completed and failed in the last minute.
AsynchronousSearchPersistRate The number of asynchronous searches persisted in the last 1 minute.
AsynchronousSearchRejected The total number of asynchronous searches rejected since the node started.
AsynchronousSearchCancelled The total number of asynchronous searches canceled since the node started.
SQLRequestCount The number of requests to the _sql API. Related statistics: Sum
SQLUnhealthy A value of 1 indicates that the SQL plugin will return a 5xx response code or pass invalid query DSL to OpenSearch in response to a specific request. Other requests will continue to succeed. A value of 0 indicates that no recent failures have occurred. If you see a consistent value of 1, troubleshoot the requests your client makes to the plugin. Related statistics: Maximum
SQLDefaultCursorRequestCount Similar to SQLRequestCount, but only counts paged requests. Related statistics: Sum
SQLFailedRequestCountByCusErr The number of requests to the _sql API that failed due to client issues. For example, a request might return an HTTP status code of 400 due to an IndexNotFoundException. Related statistics: Sum
SQLFailedRequestCountBySysErr The number of requests to the _sql API that failed due to server issues or functional limitations. For example, a request might return an HTTP status code of 503 due to a VerificationException. Related statistics: Sum
OldGenJVMMemoryPressure The maximum percentage of Java heap used for the "old generation" on all data nodes in the cluster. This metric is also captured at the node level. Related statistics: Maximum
OpenSearchDashboardsHealthyNodes (formerly KibanaHealthyNodes) The health check for OpenSearch Dashboards. If the minimum, maximum, and average all equal 1, the dashboard is functioning normally. If you have 10 nodes, the maximum is 1, the minimum is 0, and the average is 0.7, this means that 7 nodes (70%) are functioning normally and 3 nodes (30%) are unhealthy. Related statistics: Minimum, Maximum, Average
InvalidHostHeaderRequests The number of HTTP requests to an OpenSearch cluster that contain an invalid (or missing) host header. Valid requests include the domain hostname as the host header value. OpenSearch Service rejects invalid requests to public access domains without restrictive access policies. We recommend applying restrictive access policies to all domains. If you see a large value for this metric, confirm that your OpenSearch client includes the domain hostname (and not, for example, its IP address) in its requests. Related statistics: Sum
OpenSearchRequests (previously ElasticsearchRequests) The number of requests made to an OpenSearch cluster. Related statistics: Sum
2xx, 3xx, 4xx, 5xx The number of requests to the domain that resulted in the specified HTTP response code (2xx, 3xx, 4xx, 5xx). Related statistics: Sum
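Several rows in the table above describe simple rules for reading a metric value. The sketch below restates three of them as Python functions: deriving a cluster color from the three ClusterStatus metrics, estimating unhealthy Dashboards nodes from the OpenSearchDashboardsHealthyNodes average, and the heap size rule given under JVMMemoryPressure. The function names are hypothetical and this is an illustration of the descriptions only, not code used by the integration.

```python
def cluster_status(green, yellow, red):
    """Return the most severe cluster color whose metric value is 1.

    Arguments are the Maximum statistics of ClusterStatus.green,
    ClusterStatus.yellow, and ClusterStatus.red.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"  # no status metric reported a value of 1


def unhealthy_dashboard_nodes(node_count, average):
    """Estimate unhealthy nodes from the Average of
    OpenSearchDashboardsHealthyNodes (1 = healthy, 0 = unhealthy)."""
    healthy = round(node_count * average)
    return node_count - healthy


def expected_heap_gib(instance_ram_gib):
    """Java heap size: half of the instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, 32)
```

With the example from the table, `unhealthy_dashboard_nodes(10, 0.7)` gives 3, and `expected_heap_gib(64)` gives 32, the point at which adding instances is suggested instead of adding RAM.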

Object

The collected AWS OpenSearch object data structure can be seen in 「Infrastructure - Custom」.

{
  "measurement": "aws_opensearch",
  "tags": {
    "name"                  : "df-prd-es",
    "EngineVersion"         : "Elasticsearch_7.10",
    "DomainId"              : "5882XXXXX135/df-prd-es",
    "DomainName"            : "df-prd-es",
    "ClusterConfig"         : "{JSON data of instance types and instance counts in the domain}",
    "ServiceSoftwareOptions": "{JSON data of the current state of the service software}",
    "region"                : "cn-northwest-1",
    "RegionId"              : "cn-northwest-1"
  },
  "fields": {
    "EBSOptions": "{JSON data of Elastic Block Store data for the specified domain}",
    "Endpoints" : "{JSON data of a map of domain endpoints for submitting indexing and search requests}",
    "message"   : "{JSON data of instance}"
  }
}

Note: The fields in tags and fields may change with subsequent updates.

Tip 1: The value of tags.name is the instance ID, used as a unique identifier.

Tip 2: The data field corresponding to tags.name in this script is DomainName. When using this script, ensure that there are no duplicate DomainName values across multiple AWS accounts.

Tip 3: tags.ClusterConfig, tags.Endpoint, tags.ServiceSoftwareOptions, fields.message, fields.EBSOptions, fields.Endpoints are all JSON serialized strings.
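Because several values in the object are JSON serialized strings, they need one more decoding step before their contents can be read. The following is a minimal sketch of that step using only the Python standard library. The function name `decode_object` is hypothetical, and the sample values in the usage note are illustrative rather than real collected data.

```python
import json

# Keys that hold JSON serialized strings, per section of the object.
JSON_STRING_KEYS = {
    "tags": ("ClusterConfig", "Endpoint", "ServiceSoftwareOptions"),
    "fields": ("EBSOptions", "Endpoints", "message"),
}


def decode_object(record):
    """Return a copy of the object with JSON string values parsed.

    Values that are not valid JSON are kept as the original string,
    so a format change in one key does not break the whole record.
    """
    decoded = {"measurement": record.get("measurement"),
               "tags": {}, "fields": {}}
    for section in ("tags", "fields"):
        for key, value in record.get(section, {}).items():
            if key in JSON_STRING_KEYS[section] and isinstance(value, str):
                try:
                    value = json.loads(value)
                except json.JSONDecodeError:
                    pass  # keep the raw string
            decoded[section][key] = value
    return decoded
```

For instance, given a record whose tags.ClusterConfig is the string `'{"InstanceType": "r6g.large.search", "InstanceCount": 3}'`, `decode_object(record)["tags"]["ClusterConfig"]["InstanceCount"]` evaluates to 3, while plain string tags such as name and region are returned unchanged.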