---
Title: Monitor jobs
URL Source: https://company-skill.com/p/pai/pai-monitor-jobs
Language: en
Description: You need to understand why an AI training job failed, track its resource consumption (CPU/GPU/memory), or investigate system-wide performance issues across your PAI workspace. Typical User Questions:…
---

# Monitor jobs

Part of **Platform for AI (PAI)**. Route queries via `POST https://company-skill.com/api/route`.

## What You Want to Do

You need to understand why an AI training job failed, track its resource consumption (CPU/GPU/memory), or investigate system-wide performance issues across your PAI workspace.

**Typical User Questions**:
- Why did my training job fail?
- Can I get metrics from running jobs?

## Decision Tree

Pick the best path for your situation:

- **If** you need the exact error stack trace or event timeline for a specific `TrainingJobId` → Use `pai-training-job` (go to *pai/pai-training-job*)
- **If** you are analyzing system-level resource usage (e.g., `GpuCoreUsage`, `CpuCoreUsage`) for a job using `InstanceId`, `StartTime`, and `EndTime` → Use `pai-training-job` (go to *pai/pai-training-job*)
- **If** you need cluster-wide or user-level metrics across a `ResourceGroupID` or `WorkspaceId` using `FromTime`/`ToTime` and `TimeStep` → Use `pai-monitor` (go to *pai/pai-monitor*)
- **If** you are debugging node-level failures and need system logs tied to a `NodeId` with `NextToken` pagination → Use `pai-monitor` (go to *pai/pai-monitor*)
- **Otherwise (default)** → Start with **pai-training-job** if you have a specific failed job ID; use **pai-monitor** for broader infrastructure health checks.
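
The decision tree above can be sketched as a small routing helper. This is only an illustration: the function name `choose_skill` and its keyword arguments are hypothetical, and the precedence (job-scoped identifiers win over infrastructure identifiers) is an assumption rather than documented behavior.

```python
from typing import Optional

TRAINING_JOB_SKILL = "pai/pai-training-job"
MONITOR_SKILL = "pai/pai-monitor"


def choose_skill(
    training_job_id: Optional[str] = None,
    instance_id: Optional[str] = None,
    resource_group_id: Optional[str] = None,
    workspace_id: Optional[str] = None,
    node_id: Optional[str] = None,
) -> str:
    """Return the detail skill suggested by the decision tree (hypothetical helper)."""
    # Job-scoped questions: error stack traces, event timelines, per-instance metrics.
    if training_job_id or instance_id:
        return TRAINING_JOB_SKILL
    # Cluster-, workspace-, or node-level questions.
    if resource_group_id or workspace_id or node_id:
        return MONITOR_SKILL
    # Default: without a specific job ID, treat the request as a broader health check.
    return MONITOR_SKILL


print(choose_skill(training_job_id="job-123"))
print(choose_skill(node_id="node-7"))
```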

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|--------------|
| Path 1: Console / Dashboard | Job-scoped failure analysis, logs, events, and metrics | medium | Yes | Yes | Supports `PageNumber`/`PageSize` pagination for large log/metric datasets | `pai/api/pai-training-job` |
| Path 2: Console / Dashboard | Resource-group, workspace, user, and node-level monitoring | medium | Yes | Yes | Requires UNIX timestamps (`FromTime`, `ToTime`) for system log queries | `pai/api/pai-monitor` |

## Path Details

### Path 1: Console / Dashboard
**Brief Description**: This path uses PAI’s Training Job Management REST API to fetch detailed logs, error information, events, and metrics for a specific training job. Key APIs include `Get Training Job Error Information`, `ListTrainingJobLogs`, and `GetJobMetrics`, which support filtering by `StartTime`/`EndTime` and return granular resource metrics like `GpuCoreUsage`, `MemoryUsage`, and `DiskWriteRate`.

**Key technical facts**:
- Billing: Per-request billing model where each API call counts as one request regardless of success or failure.
- Auth: Bearer Token authentication via Authorization header
- Regions: cn-hangzhou, cn-shanghai, cn-beijing
- Prerequisites: Set environment variable DASHSCOPE_API_KEY; Install requests library for Python examples
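
As a minimal sketch of the auth prerequisite, the snippet below builds the `Authorization` header from the `DASHSCOPE_API_KEY` environment variable. The helper name `build_auth_headers` is hypothetical; the resulting dictionary could be passed as `headers=` to `requests`, but no endpoint is shown because this page does not specify one.

```python
import os


def build_auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Build a Bearer Token header from an environment variable (hypothetical helper)."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"Authorization": f"Bearer {token}"}
```

Reading the token at call time rather than at import time keeps the helper usable in scripts that set the variable late, for example in tests.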

**When to Use**:
- Need deep failure analysis via `Get Training Job Error Information`
- Require audit trail of job execution via `List Training Job Events`
- Must retrieve time-bounded logs using `StartTime` and `EndTime`
- Need job-scoped resource metrics like `GpuCoreUsage`, `CpuCoreUsage`, or `NetworkInputRate`

**When NOT to Use**:
- Monitoring entire `ResourceGroupID` or user-level (`UserId`) system health
- Querying system logs or diagnostic results (use `pai-monitor` instead)
- Fetching aggregate resource statistics across multiple jobs

**Known Limitations**:
- Only supports synchronous API calls (no streaming)
- Requires `PageNumber` and `PageSize` for paginated responses on large datasets
- Some APIs require explicit service names (e.g., `PaiStudio`, `pai-dlc`)
- Error monitoring requires enabling `EnableErrorMonitoringInAIMaster`; health checks need `EnableSanityCheck`
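
The `PageNumber`/`PageSize` limitation can be handled with a simple loop. The sketch below assumes pages are numbered from 1 and that a page shorter than `PageSize` marks the end of the data; neither is confirmed on this page. `fetch_page` is a hypothetical callable that you supply to perform the real API call and return the items of one page.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[Dict[str, Any]]],
    page_size: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield items across pages using PageNumber/PageSize style pagination."""
    page_number = 1  # assumption: the first page is 1
    while True:
        items = fetch_page(page_number, page_size)
        yield from items
        # Assumption: a short (or empty) page means there is nothing left to read.
        if len(items) < page_size:
            break
        page_number += 1


# Usage with an in-memory stand-in for the real API call.
logs = [{"line": i} for i in range(5)]


def fake_fetch(page_number: int, page_size: int) -> List[Dict[str, Any]]:
    start = (page_number - 1) * page_size
    return logs[start:start + page_size]


print(len(list(iter_pages(fake_fetch, page_size=2))))
```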

### Path 2: Console / Dashboard
**Brief Description**: This path leverages PAI’s Monitoring & Observability REST API to retrieve user-level metrics (`GetUserViewMetrics`), system logs (`ListSyslogs`), and diagnostic task results. It supports building dashboards using metrics like `GPUUsageRate` and `CPUUsageRate` across `WorkspaceId` or `ResourceGroupID`, with time ranges specified via `FromTime` and `ToTime`.

**Key technical facts**:
- Billing: Per-request billing, meaning each successful or failed API call counts as one billable request.
- Auth: Bearer Token authentication via Authorization header
- Regions: cn-hangzhou, cn-shanghai, cn-beijing
- Prerequisites: Set environment variable DASHSCOPE_API_KEY; RAM permissions for PAI Monitoring & Observability actions

**When to Use**:
- Monitoring resource group-level CPU/GPU/memory via `GetUserViewMetrics`
- Debugging node failures using `ListSyslogs` with `NodeId`
- Running or retrieving results from diagnostic tasks (`ListDiagnosticResults`)
- Analyzing user-level (`UserId`) job distribution (`CpuJobNames`, `GpuJobNames`)

**When NOT to Use**:
- Retrieving specific training job error details (use `pai-training-job`’s `Get Training Job Error Information`)
- Accessing custom training metrics like loss/accuracy (not system metrics)
- Managing job lifecycle (creation/deletion)

**Known Limitations**:
- `ListSyslogs` requires UNIX timestamps (seconds) for `FromTime`/`ToTime`
- Diagnostic task timestamps must be precise to the minute
- `GetJobMetrics` only directly supports pay-as-you-go and general-purpose compute; other types require CloudMonitor API
- All APIs are synchronous (no real-time streaming)
- Some APIs (e.g., `DeleteResourceLog`) have low QPS limits (10 QPS)
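
The timestamp limitations above are easy to get wrong, so here is a small sketch of the conversions. It assumes that naive `datetime` values are meant as UTC and that "precise to the minute" can be satisfied by truncating seconds and microseconds; both are assumptions, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_unix_seconds(dt: datetime) -> int:
    """Convert a datetime to a UNIX timestamp in whole seconds."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def floor_to_minute(dt: datetime) -> datetime:
    """Drop seconds and microseconds so the value is precise to the minute."""
    return dt.replace(second=0, microsecond=0)


from_time = to_unix_seconds(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(from_time)  # 1704067200
```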

## FAQ

Q: Which path should I start with?
A: If you have a specific failed `TrainingJobId`, start with **pai-training-job**. If you’re investigating unexplained slowness across multiple jobs or nodes, use **pai-monitor**.

Q: What if I need GPU usage for a single job but used `pai-monitor`?
A: You’ll hit a gap: `pai-monitor`’s `GetUserViewMetrics` shows aggregate `GPUUsageRate` per resource group, not per-job `GpuCoreUsage`. Use `pai-training-job`’s `GetJobMetrics` with `MetricType=GpuCoreUsage` instead.

Q: What if I try to get system logs for a `NodeId` using `pai-training-job`?
A: That request will fail: `pai-training-job` APIs don’t expose system/kernel logs. Only `pai-monitor`’s `ListSyslogs` supports `NodeId`-based queries with `NextToken` pagination.

Q: Can I use `StartTime`/`EndTime` in both paths?
A: Not under the same names. `pai-training-job` uses `StartTime`/`EndTime` to filter job logs and metrics, while `pai-monitor` takes its time range as `FromTime`/`ToTime`, given as UNIX timestamps, for system logs and diagnostics.

Q: Do both paths support pagination?
A: Yes—`pai-training-job` uses `PageNumber`/`PageSize`, while `pai-monitor` uses `NextToken` for `ListSyslogs` and diagnostic results.
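
A token-based loop differs from page-number pagination in that each response tells you how to request the next batch. The sketch below assumes the call returns the items plus an optional next token, and that a missing or empty token means the listing is complete; the response shape is not specified on this page. `fetch` is a hypothetical callable that wraps the real `ListSyslogs` request.

```python
from typing import Any, Callable, Iterator, List, Optional, Tuple


def iter_by_token(
    fetch: Callable[[Optional[str]], Tuple[List[Any], Optional[str]]],
) -> Iterator[Any]:
    """Yield items from a NextToken style listing until no token is returned."""
    token: Optional[str] = None  # the first request carries no token
    while True:
        items, token = fetch(token)
        yield from items
        # Assumption: an absent or empty token marks the last batch.
        if not token:
            break


# Usage with an in-memory stand-in for the real API call.
batches = {None: (["log-a", "log-b"], "t1"), "t1": (["log-c"], None)}
print(list(iter_by_token(lambda token: batches[token])))
```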

Q: Are custom training metrics (e.g., loss) available in `pai-monitor`?
A: No. `pai-monitor` only provides system metrics (`CPUUsageRate`, etc.). Custom metrics like loss must be retrieved via `pai-training-job`’s `Get Training Job Metrics`.

Q: What if my job isn’t pay-as-you-go—can I still get metrics via `pai-monitor`?
A: Not directly. `pai-monitor`’s `GetJobMetrics` only supports pay-as-you-go and general-purpose compute. For other job types, you must use CloudMonitor API instead.

## Related queries

monitor AI job, debug training job, view training logs, check job failure reason, get job metrics, inspect GPU usage, diagnose PAI job, track resource utilization, how to debug PAI, why did my job fail, can I see error logs, monitor CPU memory GPU, get system logs PAI, query diagnostic results, chec

---
Part of [Platform for AI (PAI)](https://company-skill.com/p/pai.md) · https://company-skill.com/llms.txt