AWS SageMaker¶
Collect AWS SageMaker metrics
Configuration¶
Install Func¶
It is recommended to activate the TrueWatch integration - extension - DataFlux Func (Automata): all prerequisites are automatically installed, please proceed with the script installation.
If you want to deploy Func yourself, refer to Self-deploy Func
Install Script¶
Note: Please prepare the Amazon AK in advance (for simplicity, you can directly grant the global read-only permission
ReadOnlyAccess
).
Manually Activate Script¶
-
Log in to the Func console, click on 【Script Market】, enter the TrueWatch script market, and search for
integration_aws_sagemaker
. -
After clicking 【Install】, enter the corresponding parameters: AWS AK ID, AK Secret, and account name.
-
Click 【Deploy Startup Script】, the system will automatically create the
Startup
script set and configure the corresponding startup script. -
After activation, you can see the corresponding automatic trigger configuration in 「Manage / Automatic Trigger Configuration」. Click 【Execute】 to immediately execute once without waiting for the scheduled time. After a while, you can check the execution task records and corresponding logs.
Verification¶
- In 「Manage / Automatic Trigger Configuration」, confirm whether the corresponding task has the corresponding automatic trigger configuration, and you can also check the corresponding task records and logs to see if there are any anomalies.
- In TrueWatch, check if asset information exists in 「Infrastructure / Custom」.
- In TrueWatch, check if there is corresponding monitoring data in 「Metrics」.
Metrics¶
After configuring Amazon CloudWatch, the default Measurement is as follows. More metrics can be collected through configuration:
Amazon CloudWatch AWS SageMaker Metrics Details
Inference Component Metrics¶
Metric | Description |
---|---|
CPUUtilizationNormalized | The normalized CPU utilization metric value reported by each inference component replica, ranging from 0%-100%. If the NumberOfCpuCoresRequired parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit. |
GPUMemoryUtilizationNormalized | The normalized GPU memory utilization metric value reported by each inference component replica. |
GPUUtilizationNormalized | The normalized GPU utilization metric value reported by each inference component replica. If the NumberOfAcceleratorDevicesRequired parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit. |
MemoryUtilizationNormalized | The normalized memory utilization value reported by each inference component replica. If the MinMemoryRequiredInMb parameter is set, it shows the reserved utilization; otherwise, it shows the utilization beyond the limit. |
Dimensions of Inference Component Metrics¶
Dimension | Description |
---|---|
InferenceComponentName | Filter inference component metrics. |
Multi-Model Endpoint Model Loading Metrics¶
Metric | Description |
---|---|
ModelLoadingWaitTime | The time interval that the invocation request waits for downloading, loading, or both downloading and loading the target model to run inference. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count. |
ModelUnloadingTime | The time interval taken to unload the model through the container's UnloadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count. |
ModelDownloadingTime | The time interval taken to download the model from Amazon S3. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count. |
ModelLoadingTime | The time interval taken to load the model through the container's LoadModel API call. Unit: microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count. |
ModelCacheHit | The number of InvokeEndpoint requests sent to the loaded model in a multi-model endpoint. The "Average" statistic shows the ratio of requests for the loaded model. Unit: None. Valid statistics: Average, Sum, Sample Count. |
Dimensions of Multi-Model Endpoint Model Loading Metrics¶
Dimension | Description |
---|---|
EndpointName, VariantName | Filter endpoint invocation metrics for the specified endpoint and variant's ProductionVariant. |