Skip to content

AWS SageMaker

Collect AWS SageMaker metrics

Configuration

Install Func

It is recommended to activate the TrueWatch integration - extension - DataFlux Func (Automata): all prerequisites are automatically installed, please proceed with the script installation.

If you want to deploy Func yourself, refer to Self-deploy Func

Install Script

Note: Please prepare the Amazon AK in advance (for simplicity, you can directly grant the global read-only permission ReadOnlyAccess).

Manually Activate Script

  1. Log in to the Func console, click on 【Script Market】, enter the TrueWatch script market, and search for integration_aws_sagemaker.

  2. After clicking 【Install】, enter the corresponding parameters: AWS AK ID, AK Secret, and account name.

  3. Click 【Deploy Startup Script】, the system will automatically create the Startup script set and configure the corresponding startup script.

  4. After activation, you can see the corresponding automatic trigger configuration in 「Manage / Automatic Trigger Configuration」. Click 【Execute】 to immediately execute once without waiting for the scheduled time. After a while, you can check the execution task records and corresponding logs.

Verification

  1. In 「Manage / Automatic Trigger Configuration」, confirm whether the corresponding task has the corresponding automatic trigger configuration, and you can also check the corresponding task records and logs to see if there are any anomalies.
  2. In TrueWatch, check if asset information exists in 「Infrastructure / Custom」.
  3. In TrueWatch, check if there is corresponding monitoring data in 「Metrics」.

Metrics

After configuring Amazon CloudWatch, the default Measurement is as follows. More metrics can be collected through configuration:

Amazon CloudWatch AWS SageMaker Metrics Details

Inference Component Metrics

Metric Description
CPUUtilizationNormalized The normalized CPU utilization metric value reported by each inference component replica, ranging from 0%-100%. If the NumberOfCpuCoresRequired parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit.
GPUMemoryUtilizationNormalized The normalized GPU memory utilization metric value reported by each inference component replica.
GPUUtilizationNormalized The normalized GPU utilization metric value reported by each inference component replica. If the NumberOfAcceleratorDevicesRequired parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit.
MemoryUtilizationNormalized The normalized memory utilization value reported by each inference component replica. If the MinMemoryRequiredInMb parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit.
Dimensions of Inference Component Metrics
Dimension Description
InferenceComponentName Filter inference component metrics.

Multi-Model Endpoint Model Loading Metrics

Metric Description
ModelLoadingWaitTime The time interval that the invocation request waits for downloading, loading, or both downloading and loading the target model to run inference. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelUnloadingTime The time interval taken to unload the model through the container's UnloadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelDownloadingTime The time interval taken to download the model from Amazon S3. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelLoadingTime The time interval taken to load the model through the container's LoadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelCacheHit The number of InvokeEndpoint requests sent to the loaded model in a multi-model endpoint. The "Average" statistic shows the ratio of requests for the loaded model. Unit: None. Valid statistics: Average, Sum, Sample Count.
Dimensions of Multi-Model Endpoint Model Loading Metrics
Dimension Description
EndpointName, VariantName Filter endpoint invocation metrics for the specified endpoint and variant's ProductionVariant.