Debug slow AI job querying database

An AI training or inference job on PAI reads data from OceanBase (or RDS) and runs slowly. The developer monitors the PAI job to identify resource bottlenecks, traces the slowdown to slow database queries, and then optimizes those SQL queries in OceanBase.

Products involved

Scenario

When a PAI training or inference job shows high io_wait or prolonged GPU idle time, the bottleneck typically stems from data ingestion rather than compute. This workflow correlates PAI resource telemetry with OceanBase query diagnostics to isolate slow SQL, apply targeted optimizations, and restore pipeline throughput.
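
As a rough way to confirm that data ingestion rather than compute dominates a training loop, the sketch below times how long each step waits for its batch versus how long it spends computing. The names `timed_batches`, `fetch_fraction`, and `train_step` are hypothetical and are not part of PAI or OceanBase; any iterable of batches and any callable can stand in for them.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def timed_batches(batches: Iterable) -> Iterator[Tuple[object, float]]:
    """Yield (batch, seconds spent waiting for that batch to arrive)."""
    it = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)  # blocks while the data source produces the batch
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def fetch_fraction(batches: Iterable, train_step: Callable[[object], None]) -> float:
    """Return the share of wall-clock time spent waiting on data (0.0 to 1.0)."""
    fetch_s = 0.0
    compute_s = 0.0
    for batch, waited in timed_batches(batches):
        fetch_s += waited
        start = time.perf_counter()
        train_step(batch)  # hypothetical compute step
        compute_s += time.perf_counter() - start
    total = fetch_s + compute_s
    return fetch_s / total if total else 0.0
```

A fraction close to 1.0 suggests the job is mostly waiting on the data source, which is the situation this workflow addresses; a fraction close to 0.0 points back at compute.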

Integration steps

  1. Retrieve PAI job metrics: Use the PAI CLI to pull compute and I/O stats for your TrainingJobId.
   pai-cli training-job describe --job-id <TrainingJobId> --metrics cpu,gpu,io_wait

   Flag jobs where io_wait exceeds 40% during the data-loading phase.

  2. Extract the offending SQL: Fetch job logs via the PAI API endpoint /api/v1/jobs/{TrainingJobId}/logs. Filter for JDBC execution traces and copy the exact query string.
  3. Analyze execution plan in OceanBase: Connect to your cluster and run:
   EXPLAIN SELECT * FROM training_data WHERE feature_ts BETWEEN '2024-01-01' AND '2024-01-02';

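   Look for a full table scan operator (shown as TABLE SCAN or TABLE FULL SCAN, depending on the OceanBase version) or a high estimated cost in the plan output.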

  4. Verify runtime performance: Query OceanBase’s audit view to confirm whether the statement performed a full table scan (table_scan) and how long it took (elapsed_time, in microseconds):
   SELECT query_sql, elapsed_time, table_scan FROM oceanbase.GV$OB_SQL_AUDIT WHERE query_sql LIKE '%training_data%';
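  5. Apply optimization: Add an index whose leading column matches the filter predicate, then flush the plan cache so stale plans are discarded. The index covers the query only if the SELECT list is limited to the indexed columns; with SELECT *, OceanBase still has to look up the base table rows.
   CREATE INDEX idx_feature_ts ON training_data(feature_ts, label_id);
   ALTER SYSTEM FLUSH PLAN CACHE;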
  6. Validate: Restart the PAI job. Re-run step 1 and confirm io_wait drops below 15% and GPU utilization stabilizes.
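
The threshold checks and the log filtering in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the 40% and 15% thresholds come from the steps, but the log line layout, the `classify_io_wait` and `extract_queries` helpers, and the "watch" band between the two thresholds are hypothetical and should be adapted to what your job actually emits.

```python
import re
from statistics import mean
from typing import Iterable, List, Sequence

FLAG_THRESHOLD = 40.0    # percent io_wait above which a job is flagged
TARGET_THRESHOLD = 15.0  # percent io_wait expected after optimization


def classify_io_wait(samples: Sequence[float]) -> str:
    """Classify a job by its average io_wait percentage."""
    if not samples:
        raise ValueError("at least one io_wait sample is required")
    avg = mean(samples)
    if avg > FLAG_THRESHOLD:
        return "flag"
    if avg < TARGET_THRESHOLD:
        return "ok"
    return "watch"  # hypothetical middle band, not defined in the steps


# Assumed log layout: the SQL text appears on one line and ends at ';' or end of line.
SQL_PATTERN = re.compile(r"\b(SELECT\b.+?)(?:;|$)", re.IGNORECASE)


def extract_queries(log_lines: Iterable[str]) -> List[str]:
    """Return the distinct SELECT statements found in log lines, in first-seen order."""
    seen: List[str] = []
    for line in log_lines:
        match = SQL_PATTERN.search(line)
        if match:
            query = match.group(1).strip()
            if query not in seen:
                seen.append(query)
    return seen
```

Each string returned by `extract_queries` can be pasted after EXPLAIN in step 3. Only SELECT statements are matched here; widen the pattern if the job also issues other statement types.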

Architecture

PAI orchestrates compute workloads and pulls training batches via JDBC/ODBC. OceanBase executes SQL queries, manages indexes, and streams result sets back to PAI containers. Diagnostic data comes from both sides: PAI emits node-level metrics to CloudMonitor, while OceanBase exposes query execution plans and audit trails. The integration correlates PAI’s I/O telemetry with OceanBase’s SQL diagnostics to isolate data-fetch latency.
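
One way to picture the correlation between the two data sources is to match io_wait spikes against the execution windows of audited queries. The sketch below assumes both have already been exported into plain Python objects; `MetricSample`, `AuditRow`, and `queries_during_spikes` are hypothetical names, and the sketch assumes elapsed time is expressed in microseconds and timestamps in seconds on a shared clock.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MetricSample:
    ts: float        # sample time, seconds since epoch
    io_wait: float   # percent


@dataclass(frozen=True)
class AuditRow:
    start_ts: float  # query start time, seconds since epoch
    elapsed_us: int  # execution time in microseconds (assumed unit)
    query_sql: str


def queries_during_spikes(
    samples: Iterable[MetricSample],
    rows: Iterable[AuditRow],
    threshold: float = 40.0,
) -> List[AuditRow]:
    """Return queries whose execution window contains an io_wait spike, slowest first."""
    spike_times = [s.ts for s in samples if s.io_wait > threshold]
    hits = []
    for row in rows:
        end_ts = row.start_ts + row.elapsed_us / 1_000_000
        if any(row.start_ts <= t <= end_ts for t in spike_times):
            hits.append(row)
    return sorted(hits, key=lambda r: r.elapsed_us, reverse=True)
```

Queries at the top of the returned list are the first candidates for EXPLAIN. Overlap in time is only a hint, not proof of causation, so confirm each candidate against its execution plan.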

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I debug and resolve a slow AI training job that is bottlenecked by database queries?
A: Correlate PAI I/O telemetry with OceanBase SQL diagnostics to isolate and optimize the offending SQL. Start by using the PAI CLI to flag jobs with io_wait exceeding 40%, extract the exact query from the job logs, and analyze its execution plan in OceanBase. Then add an index aligned with your filter predicates, flush the plan cache, and restart the job to verify that io_wait drops below 15%.