AWS SageMaker¶

Collect AWS SageMaker metrics

Configuration¶

Install Func¶

We recommend activating TrueWatch Integration - Extension - DataFlux Func (Automata): all prerequisites are installed automatically, so you can proceed directly to the script installation.

If you want to deploy Func yourself, refer to Self-deploy Func.

Install Script¶

Note: Prepare an Amazon access key (AK) in advance. For simplicity, you can grant it the global read-only permission ReadOnlyAccess.

Manually Activate Script¶

Log in to the Func console, click 【Script Market】 to open the TrueWatch script market, and search for integration_aws_sagemaker.
Click 【Install】, then enter the required parameters: AWS AK ID, AK Secret, and account name.
Click 【Deploy Startup Script】. The system automatically creates the Startup script set and configures the corresponding startup script.
Once activated, the corresponding automatic trigger configuration appears in 「Manage / Automatic Trigger Configuration」. Click 【Execute】 to run it once immediately instead of waiting for the scheduled time. After a short while, check the task execution records and the corresponding logs.

Verification¶

In 「Manage / Automatic Trigger Configuration」, confirm that the corresponding task has an automatic trigger configuration, then check the task records and logs for anomalies.
In TrueWatch, check if asset information exists in 「Infrastructure / Custom」.
In TrueWatch, check if there is corresponding monitoring data in 「Metrics」.

Metrics¶

Once Amazon CloudWatch collection is configured, the following measurements are collected by default. More metrics can be collected through configuration:

Amazon CloudWatch AWS SageMaker Metrics Details

Inference Component Metrics¶

Metric	Description
CPUUtilizationNormalized	The normalized CPU utilization reported by each inference component replica, ranging from 0% to 100%. If the NumberOfCpuCoresRequired parameter is set, the value is the utilization relative to the reservation; otherwise, it is the utilization relative to the limit.
GPUMemoryUtilizationNormalized	The normalized GPU memory utilization metric value reported by each inference component replica.
GPUUtilizationNormalized	The normalized GPU utilization reported by each inference component replica. If the NumberOfAcceleratorDevicesRequired parameter is set, the value is the utilization relative to the reservation; otherwise, it is the utilization relative to the limit.
MemoryUtilizationNormalized	The normalized memory utilization reported by each inference component replica. If the MinMemoryRequiredInMb parameter is set, the value is the utilization relative to the reservation; otherwise, it is the utilization relative to the limit.

Dimensions of Inference Component Metrics¶

Dimension	Description
InferenceComponentName	Filters metrics by the name of the inference component.
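
To illustrate how the InferenceComponentName dimension is used, here is a minimal sketch that builds one query entry for the CloudWatch GetMetricData API. The namespace `/aws/sagemaker/InferenceComponents`, the helper name `build_query`, and the component name `my-component` are assumptions made for illustration; they do not come from this page, so verify the namespace in your own CloudWatch console before relying on it.

```python
# Assumption: inference component metrics are published under this namespace.
NAMESPACE = "/aws/sagemaker/InferenceComponents"


def build_query(metric_name, component_name, stat="Average", period=60):
    """Build one MetricDataQueries entry for CloudWatch GetMetricData."""
    return {
        # Query ids must start with a lowercase letter.
        "Id": metric_name.lower(),
        "MetricStat": {
            "Metric": {
                "Namespace": NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "InferenceComponentName", "Value": component_name}
                ],
            },
            "Period": period,  # aggregation window in seconds
            "Stat": stat,
        },
        "ReturnData": True,
    }


# "my-component" is a hypothetical inference component name.
query = build_query("CPUUtilizationNormalized", "my-component")

# With boto3 installed and credentials configured, the query could be sent as:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.get_metric_data(MetricDataQueries=[query], StartTime=start, EndTime=end)
```

Keeping the query construction separate from the API call makes it easy to inspect the dimension filter before any request is sent.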

Multi-Model Endpoint Model Loading Metrics¶

Metric	Description
ModelLoadingWaitTime	The time an invocation request waits for the target model to be downloaded, loaded, or both before inference can run. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelUnloadingTime	The time taken to unload the model through the container's UnloadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelDownloadingTime	The time taken to download the model from Amazon S3. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelLoadingTime	The time taken to load the model through the container's LoadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count.
ModelCacheHit	The number of InvokeEndpoint requests sent to a multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Unit: None. Valid statistics: Average, Sum, Sample Count.
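
The relationship between the ModelCacheHit statistics, and the microsecond unit of the timing metrics, can be sketched as follows. The function names and sample numbers are hypothetical; the only facts relied on are that a CloudWatch Average equals Sum divided by Sample Count and that the timing metrics above are reported in microseconds.

```python
def cache_hit_ratio(hit_sum, sample_count):
    """Return the share of requests served by an already loaded model.

    Equivalent to the Average statistic: Sum / Sample Count.
    Returns None when no requests were recorded in the period.
    """
    if sample_count == 0:
        return None
    return hit_sum / sample_count


def microseconds_to_milliseconds(value_us):
    """Convert a timing metric value from microseconds to milliseconds."""
    return value_us / 1000.0


# Hypothetical period: 60 requests, 45 of which hit a loaded model.
ratio = cache_hit_ratio(45, 60)

# Hypothetical ModelLoadingWaitTime sample of 250000 microseconds.
wait_ms = microseconds_to_milliseconds(250000)
```

A low ratio suggests that models are frequently being loaded on demand, which usually shows up as higher ModelLoadingWaitTime values.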

Dimensions of Multi-Model Endpoint Model Loading Metrics¶

Dimension	Description
EndpointName, VariantName	Filters endpoint invocation metrics for a ProductionVariant of the specified endpoint and variant.
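
As a rough sketch of how these two dimensions and the valid statistics listed above fit together, the helper below assembles keyword arguments for the CloudWatch GetMetricStatistics API and rejects statistics a metric does not support. The helper name `build_request`, the `AWS/SageMaker` namespace, and the endpoint and variant names are assumptions for illustration. The statistic names use the CloudWatch API spellings (Minimum, Maximum, SampleCount) for the Min, Max, and Sample Count entries in the table.

```python
from datetime import datetime, timedelta, timezone

TIMING_STATS = {"Average", "Sum", "Minimum", "Maximum", "SampleCount"}

# Valid statistics per metric, taken from the table above.
VALID_STATS = {
    "ModelLoadingWaitTime": TIMING_STATS,
    "ModelUnloadingTime": TIMING_STATS,
    "ModelDownloadingTime": TIMING_STATS,
    "ModelLoadingTime": TIMING_STATS,
    "ModelCacheHit": {"Average", "Sum", "SampleCount"},
}


def build_request(metric_name, endpoint_name, variant_name, statistics,
                  minutes=60, period=300):
    """Build keyword arguments for a GetMetricStatistics call."""
    invalid = set(statistics) - VALID_STATS[metric_name]
    if invalid:
        raise ValueError(
            f"{metric_name} does not support: {sorted(invalid)}"
        )
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",  # assumed namespace
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # aggregation window in seconds
        "Statistics": sorted(statistics),
    }


# "my-endpoint" and "AllTraffic" are hypothetical names.
request = build_request(
    "ModelLoadingWaitTime", "my-endpoint", "AllTraffic", ["Average", "Maximum"]
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**request)
```

Both dimensions are supplied together because CloudWatch matches a metric only when the full dimension set is given.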