ML Training Pipeline End-to-End Monitoring

Products involved

  1. Platform for AI (PAI)
  2. ApsaraDB RDS

Scenario

A data scientist runs ML training jobs on PAI that read/write large datasets in RDS. When training is slow or fails, they need to monitor PAI job metrics (GPU utilization, training logs) alongside RDS performance (slow queries, database CPU) to pinpoint whether the bottleneck is in the compute layer or the data layer.

How the products combine

  1. pai · pai-monitor-jobs: Platform for AI (PAI). Monitor and debug AI jobs. See pai/pai-monitor-jobs.

  2. rds · rds-monitor-performance: ApsaraDB RDS. Monitor and analyze database performance metrics. See rds/rds-monitor-performance.
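
To compare the two products' metrics, the samples need to sit on a shared time axis. The sketch below is one possible way to do that, assuming the PAI GPU utilization and the RDS metrics have already been exported as plain tuples. The function name `align_by_minute`, the tuple layouts, and the one-minute bucket size are all illustrative assumptions, not part of either product's API.

```python
from collections import defaultdict
from statistics import mean


def align_by_minute(gpu_samples, rds_samples):
    """Join two metric series on a shared one-minute bucket.

    Hypothetical input shapes (assumptions, not a product export format):
      gpu_samples: iterable of (epoch_seconds, gpu_util_percent)
      rds_samples: iterable of (epoch_seconds, db_cpu_percent, slow_query_count)
    Returns one row per minute that appears in both series, sorted by time.
    """
    gpu = defaultdict(list)
    for ts, util in gpu_samples:
        gpu[ts // 60 * 60].append(util)

    cpu = defaultdict(list)
    slow = defaultdict(int)
    for ts, db_cpu, slow_count in rds_samples:
        bucket = ts // 60 * 60
        cpu[bucket].append(db_cpu)
        slow[bucket] += slow_count

    rows = []
    # Keep only minutes where both layers reported data.
    for minute in sorted(gpu.keys() & cpu.keys()):
        rows.append({
            "minute": minute,
            "gpu_util": mean(gpu[minute]),
            "db_cpu": mean(cpu[minute]),
            "slow_queries": slow[minute],
        })
    return rows
```

Rows where `gpu_util` drops while `db_cpu` or `slow_queries` rises are the minutes worth inspecting first in the training logs and the slow query records.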

Typical questions

Q: How can I diagnose whether a slow PAI training job that reads from RDS is bottlenecked by compute or by data performance?

A: Monitor the PAI job metrics (GPU utilization, training logs) alongside the RDS performance metrics (slow queries, database CPU) over the same time window. If GPUs sit idle while slow queries or database CPU climb, the job is likely waiting on the data layer. If GPUs stay busy while RDS looks healthy, the bottleneck is more likely in the compute layer.
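
The decision rule in this answer can be written down as a small heuristic. The following is a minimal sketch, assuming the metric values for one time window have already been collected into lists. The function name `classify_bottleneck` and all three thresholds are hypothetical placeholders to tune per workload, not values recommended by PAI or RDS.

```python
from statistics import mean

# Hypothetical thresholds (assumptions for illustration only).
GPU_IDLE_BELOW = 30.0      # average GPU utilization (%) treated as "waiting"
GPU_BUSY_ABOVE = 85.0      # average GPU utilization (%) treated as "saturated"
DB_CPU_HIGH_ABOVE = 80.0   # average database CPU (%) treated as "stressed"


def classify_bottleneck(gpu_util, db_cpu, slow_queries):
    """Return 'data layer', 'compute layer', or 'inconclusive'.

    gpu_util, db_cpu: lists of percentages sampled over the same window.
    slow_queries: list of slow query counts over that window.
    """
    avg_gpu = mean(gpu_util)
    avg_cpu = mean(db_cpu)
    db_stressed = avg_cpu >= DB_CPU_HIGH_ABOVE or sum(slow_queries) > 0

    if avg_gpu < GPU_IDLE_BELOW and db_stressed:
        # GPUs are mostly idle while the database struggles.
        return "data layer"
    if avg_gpu >= GPU_BUSY_ABOVE and not db_stressed:
        # GPUs are saturated and the database looks healthy.
        return "compute layer"
    return "inconclusive"


if __name__ == "__main__":
    print(classify_bottleneck([10, 20, 15], [90, 95, 85], [3, 0, 2]))
```

An "inconclusive" result is deliberate: mixed signals usually mean the training logs need a closer look before blaming either layer.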