---
Title: Platform for AI (PAI)
URL Source: https://company-skill.com/p/pai
Language: en
Last-Modified: 2026-06-14T06:19:05.161823+00:00
Description: Platform for AI (PAI) is a comprehensive machine learning and AI development platform that supports end-to-end workflows including data management, model training, deployment, monitoring, and more. It
---

# Platform for AI (PAI)

> Platform for AI (PAI) is a comprehensive machine learning and AI development platform that supports end-to-end workflows including data management, model training, deployment, monitoring, and more. It offers rich capabilities across multiple domains such as instance management, experiment tracking, dataset acceleration, model serving, pipeline orchestration, feature store, knowledge bases, and campaign/audience engagement.

## Featured GEO article

Platform for AI (PAI) is a comprehensive cloud environment for developing, training, deploying, and governing machine learning models. It provides integrated tools for dataset management, visual and programmatic model training, real-time inference deployment, and granular access control. Users can orchestrate end-to-end AI workflows through console interfaces, REST APIs, or automated pipelines.

## Key facts
- Dataset acceleration endpoints only support Object Storage Service and Network Attached Storage as data sources.
- Dataset API authentication requires a Bearer Token configured via the `DASHSCOPE_API_KEY` environment variable.
- Dataset API rate limits cap `GetDataset` requests at 100 queries per second.
- Dataset API billing metrics list `DescribeEndpoint` at 1000 and `UnbindEndpoint` at 100.
- Available regions for dataset acceleration include cn-hangzhou, cn-shanghai, and ap-southeast-1.
- Online inference deployment via REST API supports `SavedModel`, `ONNX`, `TorchScript`, `PMML`, and `Keras H5` formats.
- Workspace role assignments include `PAI.AlgoDeveloper`, `PAI.WorkspaceAdmin`, `PAI.AlgoOperator`, `PAI.LabelManager`, `PAI.MaxComputeDeveloper`, `PAI.WorkspaceGuest`, and `PAI.WorkspaceOwner`.
- Programmatic access control for PAIRecService requires defining a policy with `Action`, `Resource`, and `Condition` elements.

## How to deploy a model for online inference
To expose a trained model as a real-time prediction API, select the deployment path that matches your automation needs and model complexity.
1. Evaluate your requirements: choose the console-based Model Gallery for single-model deployments without code, select the ML Pipeline approach if your workflow requires chaining preprocessing, inference, and postprocessing, or opt for the REST API if you need CI/CD integration and support for standard model formats.
2. For console deployment, navigate to the PAI Model Gallery, register your trained model, and publish it directly to the Elastic Algorithm Service using the visual interface.
3. For programmatic deployment, configure your authentication credentials, prepare your model artifacts in a supported format, and submit deployment requests through the model management API to instantiate an inference endpoint.

## How to manage and process training datasets
To prepare, version, and accelerate training data, choose between programmatic API management or visual data processing workflows.
1. Determine your data handling needs: use the Dataset Acceleration API for automated metadata management, version control, and slot lifecycle configuration, or use the visual Machine Learning Designer for no-code statistical analysis and transformation.
2. For API-driven management, authenticate using a Bearer Token, ensure your data resides in supported storage services, and execute operations to create datasets, assign labels, and configure acceleration endpoints.
3. For visual processing, open the designer interface, import your dataset, and apply built-in components such as Normality Test, Pearson Coefficient, Box Plot, Histogram, MTable Assembler, Data Pivoting, Columns to vector, or Imputer Train to clean and analyze your features before training.

## How to manage platform access and permissions
To control team collaboration and secure AI resources, assign workspace roles or configure cross-account RAM policies based on your access scope.
1. Identify your permission scope: use the PAI console workspace interface for internal team role assignment, or use the RAM authorization API for programmatic, cross-account, or fine-grained service access.
2. For workspace management, navigate to the AI WorkSpace section, locate the target workspace, and assign predefined roles to members using their unique identifiers.
3. For programmatic authorization, construct a policy document containing the required Action, Resource, and Condition elements, attach it to the target identity using an AccessKey, and verify that the configuration adheres to minimum permission principles.
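
The policy document from step 3 can be sketched as a Python dictionary serialized to JSON. This is a minimal illustration, not a confirmed PAIRecService policy: the `Version`/`Statement`/`Effect` wrapper follows the general RAM policy layout, while the action name `pairec:ListInstances`, the resource ARN pattern, and the source-IP condition are hypothetical placeholders.

```python
import json


def build_policy(actions, resources, condition=None):
    """Assemble a policy document with Action, Resource, and Condition elements."""
    if not actions or not resources:
        raise ValueError("a policy needs at least one action and one resource")
    statement = {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": list(resources),
    }
    if condition:
        statement["Condition"] = condition
    return {"Version": "1", "Statement": [statement]}


# Hypothetical values, for illustration only.
policy = build_policy(
    actions=["pairec:ListInstances"],
    resources=["acs:pairec:*:*:instances/*"],
    condition={"IpAddress": {"acs:SourceIp": ["203.0.113.0/24"]}},
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Listing only the actions and resources a caller actually needs is what keeps the document in line with the minimum permission principle.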

## How to train a machine learning model
To build and optimize AI models for vision, language, or generative tasks, configure training workflows through the console or API.
1. Select your training environment and algorithm from the platform catalog, ensuring compatibility with your target task such as computer vision, natural language processing, or pose estimation.
2. Prepare your training dataset and configure experiment parameters using the visual experiment management tools or programmatic API endpoints.
3. Submit the training job, monitor its execution through the workload dashboard, and retrieve the trained model artifacts for subsequent evaluation or deployment.

## How to monitor and debug AI jobs
To track execution health and troubleshoot failures, access centralized logs, performance metrics, and diagnostic events for running workloads.
1. Navigate to the training job management interface or query the workload API to retrieve real-time status updates and execution logs.
2. Review system metrics and error diagnostics to identify bottlenecks, resource constraints, or configuration mismatches during model training or inference.
3. Apply corrective actions by adjusting resource quotas, modifying job parameters, or restarting failed instances based on the diagnostic output.

## Frequently Asked Questions

**Q: How do I deploy a model for online inference?**
A: Select your preferred deployment method based on automation needs: use the PAI console Model Gallery for no-code single-model deployment to Elastic Algorithm Service, chain preprocessing and inference using ML Pipeline, or integrate with CI/CD systems via the REST API for formats like SavedModel, ONNX, and TorchScript.

**Q: What's the best way to deploy a model?**
A: The best approach depends on your workflow complexity and automation requirements. For most standard use cases, the Model Gallery path provides the simplest no-code deployment, while the REST API is optimal for automated pipelines and custom model formats.

**Q: How do I manage and process training datasets?**
A: Use the Dataset Acceleration API for programmatic versioning, metadata management, and slot configuration, or leverage the visual Machine Learning Designer to apply no-code components like Normality Test, Pearson Coefficient, and Imputer Train for statistical analysis and data cleaning.

**Q: What's the best way to manage training data?**
A: For automated, CI/CD-integrated workflows, the API path offers precise control over dataset metadata and acceleration slots. For exploratory analysis and visual feature engineering, the designer interface provides built-in statistical tools without requiring code.

**Q: How do I manage access and permissions?**
A: Control access by assigning predefined workspace roles through the console for internal team collaboration, or define RAM authorization policies with Action, Resource, and Condition elements for programmatic, cross-account, or fine-grained service access.

**Q: What's the best way to manage permissions?**
A: Use the workspace interface for straightforward role assignment to team members, and switch to the RAM API when you require automated policy enforcement, minimum permission compliance, or integration with external identity systems.

**Q: How do I monitor and debug AI jobs?**
A: Access centralized execution logs, performance metrics, and diagnostic events through the training job management interface or workload API to track job health and identify runtime errors.

**Q: What's the best way to monitor an AI job?**
A: The most effective method combines real-time metric dashboards with detailed log retrieval, allowing you to quickly pinpoint resource constraints, configuration mismatches, or execution failures during training or inference.

**Q: How do I train a machine learning model?**
A: Configure your experiment by selecting an algorithm compatible with your target task, prepare your training data, set hyperparameters through the console or API, and submit the job to the compute cluster for execution.

**Q: What's the best way to train a model?**
A: Leverage the platform's visual experiment management tools for guided workflow setup and parameter tuning, or use the programmatic training job API for automated, scalable model development across vision, language, and generative tasks.

## Key terms
Elastic Algorithm Service is the platform's managed inference hosting environment that scales deployed models to handle real-time prediction requests.
Dataset Acceleration is the process of optimizing data access and storage configurations to reduce latency during model training workflows.
RAM Policy is a security configuration that defines fine-grained access rules using Action, Resource, and Condition elements to control cross-account or programmatic service permissions.
ML Pipeline is a visual workflow orchestration tool that chains data preprocessing, model inference, and postprocessing steps into a single deployable service.
Workspace Role is a predefined permission set, such as PAI.AlgoDeveloper or PAI.WorkspaceAdmin, that grants specific platform capabilities to team members within a shared environment.

## Sources
The authoritative source for all technical specifications, API endpoints, configuration

Platform for AI (PAI) is available as agent-callable skills via DaaS. Route any question to the best skill with `POST https://company-skill.com/api/route` `{"query": "...", "product": "pai"}`.
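
The routing call above can be sketched with Python's standard library. The URL and the `query`/`product` body fields come from the text; the `Content-Type` header is an assumption, and the response format is not described in the source, so the sketch only builds the request and leaves sending it to the caller.

```python
import json
import urllib.request

ROUTE_URL = "https://company-skill.com/api/route"


def build_route_request(query, product="pai"):
    """Build (but do not send) the POST request for the routing endpoint."""
    body = json.dumps({"query": query, "product": product}).encode("utf-8")
    return urllib.request.Request(
        ROUTE_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_route_request("How do I deploy my trained model as an API endpoint?")
# To send it: urllib.request.urlopen(request). Response handling is omitted
# because the response schema is not documented here.
```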

## What you can do

### [Deploy inference](https://company-skill.com/p/pai/pai-deploy-inference.md)

## What You Want to Do

You have a trained machine learning model and want to expose it as an online API endpoint for real-time predictions. You may be working with a single model file or a full pipeline that includes preprocessing and postprocessing steps.

**Typical User Questions**:
- How do I deploy my trained model as an API endpoint?
- Can I deploy a pipeline model in PAI?

## Decision Tree

Pick the best path for your situation:

- **If** you are using the PAI console and want to deploy a single registered model to Elastic Algorithm Service (EAS) with no code → Use Model Gallery EAS (go to *pai/pai-model*)
- **If** your solution requires chaining preprocessing, model inference, and postprocessing into one unified service → Use ML Pipeline (go to *pai/pai-model*)
- **If** you need to integrate model deployment into CI/CD using programmatic calls and your model is in a format like SavedModel, ONNX, or TorchScript → Use REST API (go to *pai/pai-model*)
- **Otherwise (default)** → Start with **Model Gallery EAS**, as it’s the simplest no-code option for most single-model use cases.
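
The decision tree above can be read as a first-match-wins function. The sketch below is only a restatement of those rules in Python; the function name and its boolean parameters are hypothetical and do not correspond to any PAI API.

```python
def choose_deployment_path(
    console_single_model=False,
    needs_chained_steps=False,
    needs_cicd=False,
):
    """Return the deployment path name, checking the rules in the listed order."""
    if console_single_model:
        return "Model Gallery EAS"
    if needs_chained_steps:
        return "ML Pipeline"
    if needs_cicd:
        return "REST API"
    # Default branch: the simplest no-code option.
    return "Model Gallery EAS"


print(choose_deployment_path(needs_chained_steps=True))
```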

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| Model Gallery EAS | Elastic Algorithm Service (EAS) | low | No | No | No-code deployment via PAI console | `pai/guide/pai-model` |
| ML Pipeline | Chained preprocessing, inference, and postprocessing | medium | No | No | Supports unified pipeline deployment without code | `pai/guide/pai-model` |
| REST API | CI/CD | medium | Yes | Yes | Supports formats: SavedModel, ONNX, TorchScript, PMML, Keras H5, etc. | `pai/api/pai-model` |

## Path Details

### Path 1: Model Gallery EAS

**Best For**: Elastic Algorithm Service (EAS)

**Brief Description**: This is a no-code method that uses the PAI Model Gallery interface to deploy a previously registered model directly to Elastic Algorithm Service (EAS). It is ideal for standard single-model inference scenarios.

**When to Use**:  
- You have a model already registered in Model Gallery  
- You want immediate deployment without writing code  
- Your use case involves a single inference model (not a pipeline)

**When NOT to Use**:  
- You need to automate deployment across environments  
- Your model requires custom preprocessing logic not embedded in the model  
- You require programmatic control over versioning or metadata

### Path 2: ML Pipeline 

**Brief Description**: This approach allows you to deploy an entire machine learning workflow—including data transformation, model inference, and result formatting—as a single online service. It uses PAI’s visual pipeline tools and requires no code.

**When to Use**:  
- Your prediction requires input normalization or feature engineering before inference  
- You combine multiple models or logic steps in sequence  
- You prefer GUI-based orchestration over scripting

**When NOT to Use**:  
- You only need to serve a standalone model file  
- You require integration with external automation systems  
- You need fine-grained control over Docker images or runtime environments

### Path 3: REST API 

**Best For**: CI/CD 

**Brief Description**: This method uses PAI Model Management REST APIs such as `CreateModel` and `CreateModelVersion` to programmatically register and manage models. It supports structured metadata via `FrameworkType`, `FormatType`, and `InferenceSpec`, and requires authentication using `Authorization: Bearer <your_api_key>` with the `DASHSCOPE_API_KEY` environment variable set in your `AIWorkSpace`.

**Key technical facts**:  
- Billing: Per-request billing—each API call counts as one request regardless of success or failure.  
- Supported formats: SavedModel, ONNX, TorchScript, PMML, Keras H5, Frozen Pb, Caffe Prototxt, XGBoost, AlinkModel, OfflineModel  
- Auth method: Authorization: Bearer <your_api_key>  
- Regions available: cn-hangzhou, cn-shanghai, cn-beijing  
- Prerequisites: DASHSCOPE_API_KEY environment variable set, RAM permissions for model operations  
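
Before calling the API, a pre-flight check against the facts above can catch unsupported inputs early. The sketch below is a local helper under stated assumptions: the format and region names are copied from the list, but the exact strings the `FormatType` field accepts are not confirmed here, and the function names are hypothetical.

```python
SUPPORTED_FORMATS = (
    "SavedModel", "ONNX", "TorchScript", "PMML", "Keras H5",
    "Frozen Pb", "Caffe Prototxt", "XGBoost", "AlinkModel", "OfflineModel",
)
SUPPORTED_REGIONS = ("cn-hangzhou", "cn-shanghai", "cn-beijing")


def is_supported_format(format_name):
    """Case-insensitive membership test against the listed model formats."""
    known = {name.lower() for name in SUPPORTED_FORMATS}
    return format_name.strip().lower() in known


def preflight(format_name, region):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not is_supported_format(format_name):
        problems.append(f"format not listed as supported: {format_name}")
    if region not in SUPPORTED_REGIONS:
        problems.append(f"region not listed as available: {region}")
    return problems


print(preflight("pickle", "cn-beijing"))
```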

**When to Use**:  
- Need programmatic model registration for CI/CD pipelines  
- Working with models in supported formats (SavedModel, ONNX, TorchScript, etc.)  
- Require fine-grained control over model metadata, labels, and versions  
- Automating model management across multiple workspaces  

**When NOT to Use**:  
- Need immediate inference endpoint without separate deployment step  
- Working with custom Docker images or unsupported model formats  
- Prefer GUI-based deployment without writing code  
- Require auto-scaling or A/B testing configuration during deployment  

**Known Limitations**:  
- Does not support direct deployment to EAS — only model registration and version management  
- No built-in inference endpoint creation — requires separate deployment step  
- Metrics field limited to 8192 characters after serialization  
- TensorBoard shared URLs have maximum validity of 604800 seconds (7 days)  
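
The two numeric limits above are easy to check client-side. The sketch below assumes the metrics payload is serialized as JSON, which the source does not state, and both helper names are hypothetical.

```python
import json

MAX_METRICS_CHARS = 8192
MAX_SHARED_URL_SECONDS = 604800  # 7 days * 24 h * 60 min * 60 s


def metrics_fit(metrics):
    """True if the JSON-serialized metrics stay within the character limit."""
    return len(json.dumps(metrics)) <= MAX_METRICS_CHARS


def clamp_url_validity(requested_seconds):
    """Cap a requested TensorBoard shared-URL validity at the 7-day maximum."""
    if requested_seconds <= 0:
        raise ValueError("validity must be positive")
    return min(requested_seconds, MAX_SHARED_URL_SECONDS)


print(metrics_fit({"accuracy": 0.93, "loss": 0.21}))
print(clamp_url_validity(30 * 24 * 60 * 60))
```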

## FAQ

Q: Which path should I start with?  
A: If you’re new to PAI and deploying a single model, start with **Model Gallery EAS**. It’s the fastest no-code option.

Q: What if I need to deploy a Scikit-learn model saved as a `.pkl` file through the REST API path?  
A: You’ll hit a limitation: the REST API only supports specific formats such as SavedModel, ONNX, and PMML. Pickle files aren’t listed, so deployment will fail unless the model is converted.

Q: What if I need an immediate inference endpoint but chose the REST API path?  
A: You’ll find that the API only registers the model (`CreateModel`, `CreateModelVersion`) but doesn’t create an endpoint. You must perform a separate deployment step to EAS, adding complexity.

Q: Can I use the REST API without setting `DASHSCOPE_API_KEY`?  
A: No — the `Authorization: Bearer` header requires a valid API key, and the `DASHSCOPE_API_KEY` environment variable is a prerequisite for authentication in your `AIWorkSpace`.

Q: Does the pipeline deployment support custom Python packages?  
A: Documentation does not specify — see the detail skill for environment customization options.

Q: Are all three paths available in the `cn-beijing` region?  
A: The REST API path is confirmed available in `cn-hangzhou`, `cn-shanghai`, and `cn-beijing`. Region availability for the GUI paths is not documented — check the detail skill.

### [Manage data](https://company-skill.com/p/pai/pai-manage-data.md)

## What You Want to Do

You need to either **manage the lifecycle and metadata of your training datasets** (e.g., create versions, tag files, configure acceleration) or **perform data transformations and statistical analysis** (e.g., encode strings, compute correlations, visualize distributions).

**Typical User Questions**:
- How do I preprocess data before training in PAI?
- Can I run statistical analysis on my dataset in PAI?
- How to calculate feature correlation or perform normality tests?
- How to encode string features or handle missing values without code?

## Decision Tree

Pick the best path for your situation:

- **If** you need to programmatically create, version, or manage dataset metadata using scripts or CI/CD pipelines → Use **API** (go to *pai/pai-dataset*)
- **If** your data source is **OSS** or **NAS** and you require fine-grained control over **SlotLifeCycle**, **EndpointId**, or **SlotId** → Use **API** (go to *pai/pai-dataset*)
- **If** you want to run **Normality Test**, **Pearson Coefficient**, **Box Plot**, or **Histogram** without writing code → Use **Console / Dashboard** (go to *pai/pai-processing*)
- **If** you need to use components like **MTable Assembler**, **Data Pivoting**, **Columns to vector**, or **Imputer Train** in a visual workflow → Use **Console / Dashboard** (go to *pai/pai-processing*)
- **Otherwise (default)** → Start with **Console / Dashboard** if you're exploring data or lack programming resources; use the API path only if automation or integration is required.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| API | Automated dataset metadata, versioning, and acceleration management | medium | Yes | Yes | Only supports **OSS** and **NAS** as data sources for acceleration slots | `pai/api/pai-dataset` |
| Console / Dashboard | No-code statistical analysis and data transformation | low | No | No | Includes components like **Normality Test**, **Pearson Coefficient**, and **Box Plot** | `pai/guide/pai-processing` |

## Path Details

### Path 1: API 

**Brief Description**: The PAI Dataset Acceleration API is a RESTful service that enables programmatic management of dataset metadata, versions, and acceleration slots. It supports operations like creating datasets, adding labels, and configuring **SlotLifeCycle** policies. Key APIs include **DescribeEndpoint**, **UnbindEndpoint**, and **SlotLifeCycle**, and it requires authentication via **Bearer Token** using the **DASHSCOPE_API_KEY** environment variable.

**Key technical facts**:
- Billing: API billing metrics list `DescribeEndpoint` at 1000 and `UnbindEndpoint` at 100
- Auth method: Bearer Token (Authorization: Bearer $DASHSCOPE_API_KEY)
- Regions available: cn-hangzhou, cn-shanghai, ap-southeast-1
- Prerequisites: `DASHSCOPE_API_KEY` environment variable set; data stored in OSS or NAS
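
The authentication and prerequisite facts above can be sketched as a small helper that builds the `Authorization` header and checks the region and data source before any call is made. The function names are hypothetical, and no endpoint URL is shown because the source does not give one.

```python
import os

ACCELERATION_REGIONS = ("cn-hangzhou", "cn-shanghai", "ap-southeast-1")
ACCELERATION_SOURCES = ("OSS", "NAS")


def build_auth_headers(env=None):
    """Build the Bearer header from DASHSCOPE_API_KEY (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY")
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}


def can_accelerate(region, data_source_type):
    """True only for the listed regions and the OSS/NAS data sources."""
    return (
        region in ACCELERATION_REGIONS
        and data_source_type.upper() in ACCELERATION_SOURCES
    )


# Example with an explicit mapping instead of the real environment.
headers = build_auth_headers({"DASHSCOPE_API_KEY": "sk-example"})
print(headers["Authorization"])
```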

**Known Limitations**:
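- Data sources: only OSS and NAS are supported for acceleration. `DataSourceType` lists OSS, NAS, and CPFS, but acceleration applies to OSS/NAS only.
- Label restrictions: `aliyun`, `acs`, `http://`, and `https://` are restricted strings, and a 128-character limit applies.
- Rate limit: `GetDataset` is capped at 100 QPS.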

### Path 2: Console / Dashboard
**Brief Description**: This path uses **Machine Learning Designer** in PAI, offering a no-code visual interface with prebuilt components. You can drag and drop modules like **MTable Assembler**, **Data Pivoting**, **Normality Test**, **Pearson Coefficient**, **Box Plot**, **Histogram**, **Columns to vector**, and **Imputer Train** to build data workflows. Components support tasks like missing value imputation, feature scaling, statistical testing, and visualization through **Field Setting** and **Execution Tuning**.
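
For intuition about what a component such as **Pearson Coefficient** reports, the textbook correlation formula can be sketched in a few lines of plain Python. This is the standard definition computed locally, not the Designer component's implementation, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near 1 or -1 indicates a strong linear relationship between two features, while a value near 0 indicates little linear relationship.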

**Key technical facts**:
- Billing: billing metrics list MTable Assembler and Pearson Coefficient at 1000
- Auth method: SSO login through the PAI console UI
- Prerequisites: access to PAI; Normality Test requires DOUBLE or BIGINT input columns

**When NOT to Use**:
- You need dataset metadata, versioning, or acceleration management (use the `pai-dataset` API instead)

- Columns to vector MTable Expander STRING MTABLE
- Box Plot Machine Learning Studio 

## FAQ

Q: Which path should I start with?
A: If you're exploring your data, running statistics, or lack coding resources, start with **Console / Dashboard**. Only choose the API path if you need to automate dataset creation/versioning or integrate with external systems.

Q: What if I need to compute feature correlation but used the API path?
A: You'll hit a dead end — the **pai-dataset** API manages metadata and acceleration slots but cannot compute **Pearson Coefficient** or run **Normality Test**. These require the visual components in **pai-processing**.

Q: What if my data is in HDFS but I chose the API path?
A: You’ll fail during setup — the API only supports **OSS** and **NAS** as data sources for acceleration slots. HDFS is not supported, so dataset creation will error out.

Q: Can I use **MTable Assembler** or **Data Pivoting** in the API path?
A: No — these are exclusive to **Machine Learning Designer**. The API path has no equivalent for assembling tables or pivoting data; it only handles dataset-level metadata.

Q: What happens if I try to delete a single version in the API path?
A: You can’t when the version is **v1**: it cannot be deleted on its own. This limitation forces full dataset deletion if you need to remove that version.

Q: Do I need **DASHSCOPE_API_KEY** for statistical components like **Box Plot**?
A: No. Statistical analysis uses **SSO** authentication in the PAI console. **DASHSCOPE_API_KEY** and **Bearer Token** are only required for the **pai-dataset** API calls.

Q: Can I combine both paths in one workflow?
A: Yes — you can use the API to create and version a dataset stored in **OSS**, then load it into **Machine Learning Designer** to apply **Imputer Train**, **Histogram**, or other components for analysis.

### [Manage permissions](https://company-skill.com/p/pai/pai-manage-permissions.md)

## What You Want to Do

You want to control who can access your PAI workspaces, models, or datasets by assigning roles or defining fine-grained access policies. This includes adding team members with specific capabilities or programmatically granting cross-account service access.

**Typical User Questions**:
- How do I set up RAM policies for PAI resources?
- Can I control who accesses my models or datasets?

## Decision Tree

Pick the best path for your situation:

- **If** you are managing permissions **within a single PAI workspace** using predefined roles like `PAI.AlgoDeveloper` or `PAI.WorkspaceAdmin` → Use **Console / Dashboard** (go to *pai/pai-workspace*)
- **If** you need to define **RAM authorization policies for PAIRecService** using `Action`, `Resource`, and `Condition` elements for programmatic or cross-account access → Use **API RAM** (go to *pai/pai-instance*)
- **Otherwise (default)** → Start with the **workspace interface approach**, as it’s suitable for most team collaboration scenarios and requires no coding.

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| Console / Dashboard | Team role assignment within a workspace | medium | No | No | Only supports predefined roles like `PAI.AlgoDeveloper` and `PAI.WorkspaceAdmin` | `pai/guide/pai-workspace` |
| API RAM | Programmatic or cross-account access to `PAIRecService` | high | Yes | Yes | Requires defining `Policy` with `Action`, `Resource`, `Condition`, and `ARN` for `PAIRecService` | `pai/api/pai-instance` |

## Path Details

### Path 1: Console / Dashboard
**Brief Description**: This approach uses the PAI console to manage workspace members via role assignment. Administrators can use APIs like `CreateMember` and `ListMembers` or navigate to **Console > AI WorkSpace > Workspaces > Roles** to assign roles such as `PAI.AlgoDeveloper` or `PAI.WorkspaceAdmin` using a user’s `member UID` and `workspace ID`.

**Key technical facts**:  
*(No runtime, billing, or instance data provided — these features are unrelated to permission management)*

**Known Limitations**:
- Only the seven predefined roles can be assigned: `PAI.AlgoDeveloper`, `PAI.AlgoOperator`, `PAI.LabelManager`, `PAI.MaxComputeDeveloper`, `PAI.WorkspaceAdmin`, `PAI.WorkspaceGuest`, `PAI.WorkspaceOwner`
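
Because only these predefined roles are assignable, a script that prepares role assignments can reject anything else before it reaches the platform. The sketch below is a local validation helper; the returned dictionary keys are hypothetical and are not the `CreateMember` request schema.

```python
PREDEFINED_ROLES = frozenset({
    "PAI.AlgoDeveloper",
    "PAI.AlgoOperator",
    "PAI.LabelManager",
    "PAI.MaxComputeDeveloper",
    "PAI.WorkspaceAdmin",
    "PAI.WorkspaceGuest",
    "PAI.WorkspaceOwner",
})


def plan_role_assignment(member_uid, workspace_id, role):
    """Validate the role name and return a plain record of the assignment."""
    if role not in PREDEFINED_ROLES:
        raise ValueError(f"not a predefined workspace role: {role}")
    # Hypothetical record layout, used only by this sketch.
    return {"member_uid": member_uid, "workspace_id": workspace_id, "role": role}


print(plan_role_assignment("uid-123", "ws-456", "PAI.AlgoDeveloper"))
```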

### Path 2: API RAM 

**Brief Description**: This method configures RAM authorization for `PAIRecService` by defining a `Policy` that includes `Action`, `Resource`, and `Condition` elements. It uses Alibaba Cloud `ARN` identifiers and requires an `AccessKey` for authentication. Policies must adhere to `Minimum Permissions` principles and are managed entirely via API or SDK—no console UI is available.

**Key technical facts**:  
*(No runtime, billing, or instance data provided — these features are unrelated to permission management)*

- Scoped to `PAIRecService`: policies are defined with `Action`, `Resource`, and `Condition` elements.

## FAQ

Q: Which path should I start with?  
A: Start with the **workspace interface** if you’re managing a team within one project. Only use the API path if you need automation, cross-account access, or are working specifically with `PAIRecService`.

Q: What if I need to grant access to a service account but used the workspace interface?  
A: You’ll hit a limitation: the workspace UI only accepts human `member UID`s and predefined roles—it cannot assign permissions to service roles or external accounts.

Q: What if I try to define a custom role like “ModelViewer” using the console?  
A: You’ll be blocked—the console only supports the seven predefined roles (e.g., `PAI.AlgoDeveloper`, `PAI.WorkspaceAdmin`). Custom roles require RAM policy definition via API.

Q: Can I use the API method to manage regular team members in a workspace?  
A: Not effectively—the RAM API path is scoped to `PAIRecService` authorization and doesn’t integrate with workspace membership APIs like `CreateMember` or `ListMembers`.

Q: Do both paths use the same authentication method?  
A: Both ultimately rely on Alibaba Cloud identity, but the console path uses session-based login, while the API path requires explicit credentials like `AccessKey` and uses `Authorization: Bearer $DASHSCOPE_API_KEY`.

Q: Where do I find the list of available workspace roles?  
A: Use the `ListWorkspaceRoles` API or navigate to **Console > AI WorkSpace > Workspaces > Roles**—this shows all assignable roles including `PAI.WorkspaceOwner` and `PAI.LabelManager`.

Q: Is there overlap between `role assignment` in workspaces and RAM `Policy` definitions?  
A: No—they operate at different layers. Workspace roles control UI and job-level access within a project; RAM policies control API-level access to specific services like `PAIRecService` using `ARN` and `Condition`.

### [Monitor jobs](https://company-skill.com/p/pai/pai-monitor-jobs.md)

## What You Want to Do

You need to understand why an AI training job failed, track its resource consumption (CPU/GPU/memory), or investigate system-wide performance issues across your PAI workspace.

**Typical User Questions**:
- Why did my training job fail?
- Can I get metrics from running jobs?

## Decision Tree

Pick the best path for your situation:

- **If** you need the exact error stack trace or event timeline for a specific `TrainingJobId` → Use **Training Job Management API** (go to *pai/pai-training-job*)
- **If** you are analyzing system-level resource usage (e.g., `GpuCoreUsage`, `CpuCoreUsage`) for a job using `InstanceId`, `StartTime`, and `EndTime` → Use **Training Job Management API** (go to *pai/pai-training-job*)
- **If** you need cluster-wide or user-level metrics across a `ResourceGroupID` or `WorkspaceId` using `FromTime`/`ToTime` and `TimeStep` → Use **Monitoring & Observability API** (go to *pai/pai-monitor*)
- **If** you are debugging node-level failures and need system logs tied to a `NodeId` with `NextToken` pagination → Use **Monitoring & Observability API** (go to *pai/pai-monitor*)
- **Otherwise (default)** → Start with **Training Job Management API** if you have a specific failed job ID; use **Monitoring & Observability API** for broader infrastructure health checks.
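
The routing rules above depend on which identifier you hold, which can be sketched as a small function. The function and parameter names are hypothetical; only the skill identifiers `pai/pai-training-job` and `pai/pai-monitor` come from the text.

```python
def choose_monitoring_skill(
    training_job_id=None,
    instance_id=None,
    resource_group_id=None,
    workspace_id=None,
    node_id=None,
):
    """Pick the detail skill from the identifiers available to the caller."""
    # Job-scoped identifiers point to the training job APIs.
    if training_job_id or instance_id:
        return "pai/pai-training-job"
    # Group, workspace, and node identifiers point to the monitoring APIs,
    # which are also the fallback for broader infrastructure health checks.
    return "pai/pai-monitor"


print(choose_monitoring_skill(training_job_id="train-001"))
print(choose_monitoring_skill(node_id="node-17"))
```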

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
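| Training Job Management API | Failure analysis, logs, and metrics for a specific training job | medium | Yes | Yes | Supports `PageNumber`/`PageSize` pagination for large log/metric datasets | `pai/api/pai-training-job` |
| Monitoring & Observability API | Workspace-level or resource-group-level metrics and system logs | medium | Yes | Yes | Requires UNIX timestamps (`FromTime`, `ToTime`) for system log queries | `pai/api/pai-monitor` |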

## Path Details

### Path 1: Training Job Management API
**Brief Description**: This path uses PAI’s Training Job Management REST API to fetch detailed logs, error information, events, and metrics for a specific training job. Key APIs include `Get Training Job Error Information`, `ListTrainingJobLogs`, and `GetJobMetrics`, which support filtering by `StartTime`/`EndTime` and return granular resource metrics like `GpuCoreUsage`, `MemoryUsage`, and `DiskWriteRate`.

**Key technical facts**:
- Billing: Per-request billing model where each API call counts as one request regardless of success or failure.
- Auth: Bearer Token authentication via Authorization header
- Regions: cn-hangzhou, cn-shanghai, cn-beijing
- Prerequisites: Set environment variable DASHSCOPE_API_KEY; Install requests library for Python examples

**When to Use**:
- Need deep failure analysis via `Get Training Job Error Information`
- Require audit trail of job execution via `List Training Job Events`
- Must retrieve time-bounded logs using `StartTime` and `EndTime`
- Need job-scoped resource metrics like `GpuCoreUsage`, `CpuCoreUsage`, or `NetworkInputRate`

**When NOT to Use**:
- Monitoring entire `ResourceGroupID` or user-level (`UserId`) system health
- Querying system logs or diagnostic results (use `pai-monitor` instead)
- Fetching aggregate resource statistics across multiple jobs

**Known Limitations**:
- Only supports synchronous API calls (no streaming)
- Requires `PageNumber` and `PageSize` for paginated responses on large datasets
- Some APIs require explicit service names (e.g., `PaiStudio`, `pai-dlc`)
- Error monitoring requires enabling `EnableErrorMonitoringInAIMaster`; health checks need `EnableSanityCheck`
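
`PageNumber`/`PageSize` pagination can be sketched as a generator that keeps requesting pages until a short or empty page comes back. The `fetch_page` callable is a hypothetical wrapper around a call such as `ListTrainingJobLogs`; starting the numbering at 1 is an assumption, and the fake fetcher below only stands in for the real API.

```python
def iter_pages(fetch_page, page_size=50):
    """Yield items from every page until a short or empty page is returned."""
    page_number = 1  # assumed to be 1-based
    while True:
        items = fetch_page(page_number, page_size)
        if not items:
            break
        yield from items
        if len(items) < page_size:
            break
        page_number += 1


# Stand-in for the real API call: serves slices of an in-memory list.
LOG_LINES = [f"log line {i}" for i in range(120)]


def fake_fetch_page(page_number, page_size):
    start = (page_number - 1) * page_size
    return LOG_LINES[start:start + page_size]


all_lines = list(iter_pages(fake_fetch_page, page_size=50))
print(len(all_lines))
```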

### Path 2: Monitoring & Observability API
**Brief Description**: This path leverages PAI’s Monitoring & Observability REST API to retrieve user-level metrics (`GetUserViewMetrics`), system logs (`ListSyslogs`), and diagnostic task results. It supports building dashboards using metrics like `GPUUsageRate` and `CPUUsageRate` across `WorkspaceId` or `ResourceGroupID`, with time ranges specified via `FromTime` and `ToTime`.

**Key technical facts**:
- Billing: Per-request billing, meaning each successful or failed API call counts as one billable request.
- Auth: Bearer Token authentication via Authorization header
- Regions: cn-hangzhou, cn-shanghai, cn-beijing
- Prerequisites: Set environment variable DASHSCOPE_API_KEY; RAM permissions for PAI Monitoring & Observability actions

**When to Use**:
- Monitoring resource group-level CPU/GPU/memory via `GetUserViewMetrics`
- Debugging node failures using `ListSyslogs` with `NodeId`
- Running or retrieving results from diagnostic tasks (`ListDiagnosticResults`)
- Analyzing user-level (`UserId`) job distribution (`CpuJobNames`, `GpuJobNames`)

**When NOT to Use**:
- Retrieving specific training job error details (use `pai-training-job`’s `Get Training Job Error Information`)
- Accessing custom training metrics like loss/accuracy (not system metrics)
- Managing job lifecycle (creation/deletion)

**Known Limitations**:
- `ListSyslogs` requires UNIX timestamps (seconds) for `FromTime`/`ToTime`
- Diagnostic task timestamps must be precise to the minute
- `GetJobMetrics` only directly supports pay-as-you-go and general-purpose compute; other types require CloudMonitor API
- All APIs are synchronous (no real-time streaming)
- Some APIs (e.g., `DeleteResourceLog`) have low QPS limits (10 QPS)
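
The timestamp rules above are easy to get wrong, so here is a minimal sketch of preparing a `FromTime`/`ToTime` pair in UNIX seconds and flooring a timestamp to the minute. The helper names are hypothetical and not part of any PAI SDK; the sketch only shows the conversion, not the API call itself.

```python
from datetime import datetime, timezone


def to_unix_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole UNIX seconds."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp())


def floor_to_minute(moment: datetime) -> datetime:
    """Drop seconds and microseconds so the timestamp is minute-precise."""
    return moment.replace(second=0, microsecond=0)


def syslog_time_range(start: datetime, end: datetime) -> dict:
    """Build a FromTime/ToTime pair (hypothetical helper, UNIX seconds)."""
    if end <= start:
        raise ValueError("end must be after start")
    return {"FromTime": to_unix_seconds(start), "ToTime": to_unix_seconds(end)}


start = datetime(2026, 1, 1, 0, 0, 30, tzinfo=timezone.utc)
end = datetime(2026, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
print(syslog_time_range(floor_to_minute(start), end))
```

Requiring timezone-aware datetimes is a design choice of this sketch: it avoids silently interpreting naive values in the local timezone.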

## FAQ

Q: Which path should I start with?
A: If you have a specific failed `TrainingJobId`, start with the `pai-training-job` path. If you’re investigating unexplained slowness across multiple jobs or nodes, use the `pai-monitor` path (Path 2: Console / Dashboard).

Q: What if I need GPU usage for a single job but used `pai-monitor`?
A: You’ll hit a gap: `pai-monitor`’s `GetUserViewMetrics` shows aggregate `GPUUsageRate` per resource group, not per-job `GpuCoreUsage`. Use `pai-training-job`’s `GetJobMetrics` with `MetricType=GpuCoreUsage` instead.

Q: What if I try to get system logs for a `NodeId` using `pai-training-job`?
A: You’ll fail—`pai-training-job` APIs don’t expose system/kernel logs. Only `pai-monitor`’s `ListSyslogs` supports `NodeId`-based queries with `NextToken` pagination.

Q: Can I use `StartTime`/`EndTime` in both paths?
A: Both paths filter by time range, but the parameters differ: `pai-training-job` uses `StartTime`/`EndTime` for job log and metric filtering, while `pai-monitor` requires UNIX timestamps (`FromTime`/`ToTime`) for system logs and diagnostics.

Q: Do both paths support pagination?
A: Yes—`pai-training-job` uses `PageNumber`/`PageSize`, while `pai-monitor` uses `NextToken` for `ListSyslogs` and diagnostic results.
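
The two pagination styles can be sketched as generic iterators. This is an illustration under stated assumptions: the `fetch` callables and the in-memory stand-in data are hypothetical, and a real `fetch` would wrap the corresponding API call and map its response fields.

```python
from typing import Callable, Iterator, Optional


def iter_pages_by_number(fetch: Callable[[int, int], list], page_size: int = 50) -> Iterator:
    """Yield items from a PageNumber/PageSize-style API until a short page appears."""
    page_number = 1
    while True:
        items = fetch(page_number, page_size)
        yield from items
        if len(items) < page_size:
            return
        page_number += 1


def iter_pages_by_token(fetch: Callable[[Optional[str]], tuple]) -> Iterator:
    """Yield items from a NextToken-style API until no token is returned."""
    token = None
    while True:
        items, token = fetch(token)
        yield from items
        if not token:
            return


# Stand-in fetchers over in-memory data (hypothetical, no network calls).
JOBS = [f"job-{i}" for i in range(5)]
LOG_PAGES = {None: (["a", "b"], "t1"), "t1": (["c"], None)}


def fake_list_jobs(page_number: int, page_size: int) -> list:
    start = (page_number - 1) * page_size
    return JOBS[start:start + page_size]


def fake_list_syslogs(token: Optional[str]) -> tuple:
    return LOG_PAGES[token]


print(list(iter_pages_by_number(fake_list_jobs, page_size=2)))
print(list(iter_pages_by_token(fake_list_syslogs)))
```

Keeping the loop separate from the fetch function means the same iterator works for any endpoint that follows the same pagination style.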

Q: Are custom training metrics (e.g., loss) available in `pai-monitor`?
A: No. `pai-monitor` only provides system metrics (`CPUUsageRate`, etc.). Custom metrics like loss must be retrieved via `pai-training-job`’s `Get Training Job Metrics`.

Q: What if my job isn’t pay-as-you-go—can I still get metrics via `pai-monitor`?
A: Not directly. `pai-monitor`’s `GetJobMetrics` only supports pay-as-you-go and general-purpose compute. For other job types, you must use CloudMonitor API instead.

### [Train model](https://company-skill.com/p/pai/pai-train-model.md)

## What You Want to Do

You want to train a machine learning model on Alibaba Cloud’s Platform for AI (PAI), whether it's a standard computer vision task like GAN or image classification, a custom architecture like a pose estimation model, or a full pipeline such as a recommendation system.

**Typical User Questions**:
- How can I train an image classification model on PAI?
- What options do I have to train a GAN in PAI?
- Can I train pose estimation models using PAI?
- Is there a visual way to build and train models?
- How do I train a model with PyTorch on PAI?
- What training frameworks does PAI support?
- How do I train a recommendation system model?

## Decision Tree

Pick the best path for your situation:

- **If** you are training a standard CV/NLP model (e.g., GAN, image classification, object detection) using a prebuilt GPU-optimized container like `registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu` and want to avoid writing code → Use CV/NLP GAN (go to *pai/pai-image*)
- **If** you need full control over training code and environment, require specific instance types like `ecs.gn5-c8g1.2xlarge`, or need interactive terminal access via `CreateInstanceWebTerminal` → Use Pose (go to *pai/pai-instance*)
- **If** you are integrating training into an automated workflow, need to programmatically monitor metrics like `MetricType=GpuMemoryUsage`, or manage jobs via `ListTrainingJobs` and `HyperParameters` → Use API (go to *pai/pai-training-job*)
- **Otherwise (default)** → Use Designer ML if you prefer a drag-and-drop interface, are building end-to-end pipelines (e.g., for recommendation systems), and can structure workflows as a `PAI-Flow manifest` with `apiVersion: core/v1` (go to *pai/pai-pipeline*).
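
The decision tree above can be sketched as a small function that checks the conditions in the order they are listed. The function and parameter names are hypothetical; the returned strings are the path names used in this guide.

```python
def choose_training_path(
    standard_task_no_code: bool = False,
    needs_full_control: bool = False,
    needs_automation: bool = False,
) -> str:
    """Return the path name, checking conditions in the order the tree lists them."""
    if standard_task_no_code:
        return "CV/NLP GAN"
    if needs_full_control:
        return "Pose"
    if needs_automation:
        return "API"
    # Default branch: drag-and-drop, end-to-end pipelines.
    return "Designer ML"


print(choose_training_path(needs_automation=True))
```

Because the checks run top to bottom, a situation that matches several conditions resolves to the first matching path, which mirrors how the tree reads.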

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|--------------|
| CV/NLP GAN | Standard CV/NLP tasks on prebuilt GPU images | Low | No | No | Max model size: `Size=2` (GB); uses an ACR image with `system.chipType=GPU` | `pai/guide/pai-image` |
| Pose | Full control over code and environment, with terminal access | High | Yes | No | Supports `EcsSpec=ecs.gn5-c8g1.2xlarge` and `AcceleratorType=GPU` | `pai/guide/pai-instance` |
| API | Automated workflows and programmatic job management | Medium | Yes | Yes | Uses Bearer Token authentication; each call is billed regardless of success | `pai/api/pai-training-job` |
| Designer ML | Drag-and-drop, end-to-end pipelines | Medium | No | No | Requires a `PAI-Flow manifest` with `apiVersion: core/v1`; supports DataWorks scheduling | `pai/guide/pai-pipeline` |

## Path Details

### Path 1: CV/NLP GAN

**Brief Description**: This approach uses prebuilt container images from Alibaba Cloud Container Registry (ACR) to launch training without writing code. You register images using `AddImage`, specify metadata like `system.chipType=GPU` and `Accessibility=PUBLIC`, and reference them via `ImageUri`. The system supports `Custom images` that comply with naming rules and size constraints. You can list available images using `ListImages`.

**Key technical facts**:
- Max model size: `Size=2` (GB)
- Runtimes: GPU
- Custom Docker: Yes

**When to Use**:
- Needing to quickly start common CV/NLP tasks (e.g., image generation, classification, object detection)
- Using prebuilt GPU-optimized images like `registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu`
- Avoiding writing training code by relying on built-in templates

**When NOT to Use**:
- Requiring full control over training environment and code (choose pai-instance)
- Needing automation workflow integration (choose pai-training-job)
- Preferring drag-and-drop UI over image management (choose pai-pipeline)

**Known Limitations**:
- Only supports configuration via predefined labels like `system.chipType=GPU`
- Custom images must follow naming rules (1–50 chars, lowercase letters, digits, hyphens)
- Image size must be explicitly set in GB
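
As a rough sketch, the naming rule and explicit size requirement above can be checked before calling `AddImage`. The request field names `Name` and `Labels` (with `Key`/`Value` entries) are assumptions for illustration; only `ImageUri`, `Size`, `Accessibility`, and the `system.chipType` label appear in this guide, so confirm the real schema in the detail skill.

```python
import re

# Naming rule from the limitations above: 1-50 chars, lowercase letters, digits, hyphens.
IMAGE_NAME_PATTERN = re.compile(r"[a-z0-9-]{1,50}")


def is_valid_image_name(name: str) -> bool:
    """Return True when the whole name satisfies the naming rule."""
    return IMAGE_NAME_PATTERN.fullmatch(name) is not None


def build_add_image_request(name: str, image_uri: str, size_gb: int,
                            chip_type: str = "GPU",
                            accessibility: str = "PUBLIC") -> dict:
    """Assemble a hypothetical AddImage request body after local validation."""
    if not is_valid_image_name(name):
        raise ValueError(f"invalid image name: {name!r}")
    if size_gb <= 0:
        raise ValueError("size_gb must be a positive number of GB")
    return {
        "Name": name,                # assumed field name
        "ImageUri": image_uri,
        "Size": size_gb,             # explicit size in GB
        "Accessibility": accessibility,
        "Labels": [{"Key": "system.chipType", "Value": chip_type}],  # assumed shape
    }


request = build_add_image_request(
    "nlp-gpu-01",
    "registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu",
    size_gb=2,
)
print(request)
```

Validating locally first avoids spending a call on a request that would be rejected for its name or missing size.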

### Path 2: Pose

**Brief Description**: This method provisions dedicated compute resources (e.g., ECS instances) where you fully manage the training stack. You create jobs via `CreateTrainingJob`, specify hardware like `EcsSpec=ecs.gn5-c8g1.2xlarge`, and access terminals using `CreateInstanceWebTerminal`. Resource management involves `ResourceId`, `ListNodes`, and node operations like `OperateNode` with `Cordon`.

**Key technical facts**:
- Supported instance types: ecs.gn5-c8g1.2xlarge, ml.gu7xf.8xlarge-gu108
- Runtimes: —
- Custom Docker: —

**When to Use**:
- Needing full control over training environment and code (e.g., custom Pose models)
- Requiring specific high-performance instance types like `ecs.gn5-c8g1.2xlarge`
- Needing interactive terminal access to debug via `CreateInstanceWebTerminal`

**When NOT to Use**:
- Wanting quick setup for standard tasks without code (choose pai-image)
- Needing automated pipeline integration (choose pai-training-job)
- Preferring no-code drag-and-drop interface (choose pai-pipeline)

**Known Limitations**:
- Requires manual resource group and machine group management (e.g., `CreateResourceGroup`)
- Instance type must be explicitly specified (e.g., `EcsSpec=ecs.gn5-c8g1.2xlarge`)
- Training jobs are submitted through `CreateTrainingJob`, whose parameter set is complex

### Path 3: API

**Brief Description**: This RESTful API approach enables programmatic lifecycle management of training jobs. You list jobs with `ListTrainingJobs`, define `HyperParameters`, track metrics like `MetricType=GpuMemoryUsage`, and authenticate via `Bearer Token authentication`. Jobs are identified by `TrainingJobId`, and pagination uses `PageNumber`.

**Key technical facts**:
- Billing model: Per-request billing model where each API call counts as one request regardless of success or failure
- Regions available: cn-hangzhou, cn-shanghai, cn-beijing
- Auth method: Authorization: Bearer <your_api_key>
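
A minimal sketch of building the `Authorization` header follows. It assumes the key is stored in the `DASHSCOPE_API_KEY` environment variable, as other API skills in this guide describe; the function name is hypothetical.

```python
import os


def bearer_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Build request headers with a Bearer token read from the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"{env_var} is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control, and failing early on a missing key avoids sending an unauthenticated request that would still count as a billed call.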

**When to Use**:
- Integrating training into automated workflows or large-scale experiments
- Programmatically monitoring training metrics via `GetTrainingJobMetrics`
- Managing job labels and templates via `UpdateTrainingJobLabels`

**When NOT to Use**:
- Wanting rapid setup for common tasks without coding (choose pai-image)
- Needing interactive debugging environment (choose pai-instance)
- Preferring visual drag-and-drop interface (choose pai-pipeline)

**Known Limitations**:
- Every API call is billed, including failed requests
- Rate-limited to 100 QPS per account
- Requires handling pagination (`PageNumber`, `PageSize`) for large datasets
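
Because failed calls are billed and the account is capped at 100 QPS, a client-side throttle may help. The sketch below spaces calls at a fixed minimum interval; it is an illustration that only governs a single process, and the class name is hypothetical.

```python
import time


class MinIntervalThrottle:
    """Space calls at least 1/max_qps seconds apart (client-side only)."""

    def __init__(self, max_qps: float, clock=time.monotonic, sleep=time.sleep):
        if max_qps <= 0:
            raise ValueError("max_qps must be positive")
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


throttle = MinIntervalThrottle(max_qps=100)
for _ in range(3):
    throttle.wait()  # call the API after this returns
```

The clock and sleep functions are injectable so the pacing logic can be tested without real delays. Several processes sharing one account would each need a lower `max_qps` to stay under the shared limit.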

### Path 4: Designer ML

**Brief Description**: PAI Designer enables no-code construction of end-to-end ML workflows using drag-and-drop components. Pipelines are defined via `PAI-Flow manifest` with `apiVersion: core/v1`, support `artifact` passing between nodes, and can integrate with `DataWorks scheduling` using `global variable` bindings. Execution uses `CreatePipeline` and `CreatePipelineRun`. This path covers Designer use cases such as building recommendation systems or risk control applications.

**Key technical facts**:
- Prerequisites: Access to PAI Designer console, data availability (e.g., MaxCompute tables), permissions to deploy models
- Runtimes: —
- Custom Docker: —

**When to Use**:
- Preferring drag-and-drop UI without writing training scripts
- Building full pipelines (e.g., recommendation systems, risk control apps)
- Needing integration with `DataWorks scheduling` via `global variable` binding

**When NOT to Use**:
- Training single CV/NLP tasks quickly (choose pai-image)
- Requiring full code/environment control (choose pai-instance)
- Needing API-driven automation instead of UI (choose pai-training-job)

**Known Limitations**:
- Pipeline definitions must conform to `PAI-Flow manifest` format (`apiVersion: core/v1`)
- Node status queries only support `Logical` or `Physical` types
- Cannot delete a pipeline if referenced by another (`withSequence` dependencies)
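
A small pre-flight check for the manifest format requirement might look like the sketch below. It assumes the manifest has already been parsed into a dictionary and verifies only the `apiVersion` field named in this guide; the rest of the PAI-Flow schema is not covered, and the function name is hypothetical.

```python
REQUIRED_API_VERSION = "core/v1"


def check_manifest_api_version(manifest: dict) -> None:
    """Raise ValueError when the parsed manifest does not declare core/v1."""
    found = manifest.get("apiVersion")
    if found != REQUIRED_API_VERSION:
        raise ValueError(
            f"expected apiVersion {REQUIRED_API_VERSION!r}, found {found!r}"
        )


check_manifest_api_version({"apiVersion": "core/v1"})
```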

## FAQ

Q: Which path should I start with?
A: If you're new and training standard models (e.g., image classification, GAN), start with CV/NLP GAN. If you're building a full pipeline like a recommender, use Designer ML.

Q: What if I need to debug my training script interactively but chose CV/NLP GAN?
A: You’ll hit a dead end — the template path doesn’t provide shell access or code modification. You’ll need to switch to Pose to use `CreateInstanceWebTerminal`.

Q: What if I’m running large-scale hyperparameter sweeps but used Designer ML?
A: You’ll lack programmatic control — Designer doesn’t expose APIs like `ListTrainingJobs` or `HyperParameters` for bulk job management. Use API instead.

Q: Can I use custom Docker images in all paths?
A: Only `pai-image` explicitly supports `Custom images` and `ACR image` registration. Other paths may support it, but documentation doesn’t confirm — check their detail skills.

Q: What happens if I exceed the 100 QPS limit in API?
A: API calls will be throttled. Because every call is billed under the per-request model, including failed ones, throttled retries add cost without making progress.

Q: Can I schedule recurring training in CV/NLP GAN?
A: No — this path lacks automation hooks like `TrainingJobId` tracking or `PageNumber`-based job listing. Use `pai-training-job` or integrate `pai-pipeline` with `DataWorks scheduling`.

Q: Why does CV/NLP GAN enforce `Size=2`?
A: The example sets `Size=2`, a 2 GB limit for registered images. Larger models won’t fit unless the platform allows higher values; verify in the detail skill.

Q: What if I want to list available training images but chose Designer ML?
A: You won't have access to `ListImages` — that capability is only available in the CV/NLP GAN path.

Q: What if I'm building a recommendation system but chose API?
A: You’ll miss out on drag-and-drop pipeline composition and DataWorks scheduling integration. Designer ML is built for end-to-end pipelines of this kind.


## Frequently asked questions

### Should I use the API or the console for my task?

Use the **console (guide skills)** for one-off tasks, exploration, or visual workflows (e.g., building a pipeline in ML Designer). Use the **API (api skills)** for automation, integration into CI/CD, or programmatic control at scale.

### How do I authenticate API calls to PAI?

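Use your Alibaba Cloud AccessKey pair and sign each request with the Alibaba Cloud OpenAPI signature method; the official SDKs handle signing for you. For enhanced security, use RAM roles or temporary tokens when running inside Alibaba Cloud environments (e.g., ECS). The API skills in this guide document Bearer Token authentication through the `Authorization` header instead, so follow the detail skill for the API you are calling.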

### I can’t see my resources in the console—what’s wrong?

Verify: (1) you’re in the correct region, (2) your workspace is selected, and (3) your RAM user has permissions to view the resource type (e.g., `pai:DescribeModels`).

### My training job failed—how do I debug it?

First, check the **training job logs and error info** via the API (`GetTrainingJobErrorInfo`, `GetTrainingJobLogs`) or in the console under the job’s “Logs” tab. Common causes include insufficient quota, invalid image, or code errors.

### How do I grant team members access to my PAI workspace?

In the **Workspace & Identity Management** section (console), add members and assign workspace roles (e.g., `PAI.WorkspaceAdmin`, `PAI.AlgoDeveloper`, `PAI.WorkspaceGuest`). For fine-grained control, attach custom RAM policies to their accounts.

### How do I deploy a model for online inference?

Publish the trained model as a scalable API or real-time service. Three alternative paths cover different workflow requirements.

### How do I manage and process training datasets?

Create, version, preprocess, and analyze your data files with the dataset tooling. Two alternative paths cover these data operations.

### How do I manage platform access and permissions?

Configure workspace roles, RAM policies, and resource access controls. Two alternative paths walk through these security settings.

### How do I monitor and debug AI jobs?

Access logs, metrics, and error diagnostics for running or failing workloads. Two alternative paths retrieve this operational information.

## Cross-product integrations

- [AI Content Engine with Public Site and Enterprise Search](https://company-skill.com/p/_combos/ai-content-engine-with-public-site-and-enterpris-9db7c8.md) (alinux + cloudflare + bailian + notion + vercel)
- [AI Content Platform on Managed Infrastructure](https://company-skill.com/p/_combos/ai-content-platform-on-managed-infrastructure-265158.md) (alinux + cloudflare + bailian + notion + vercel)
- [AI Content Platform with Search and Frontend](https://company-skill.com/p/_combos/ai-content-platform-with-search-and-frontend-d3ca31.md) (alinux + cloudflare + bailian + notion + vercel)
- [AI Content Platform with Site and Search](https://company-skill.com/p/_combos/ai-content-platform-with-site-and-search-7bf25b.md) (alinux + cloudflare + bailian + notion + vercel)
- [AI-Driven Search Knowledge Platform](https://company-skill.com/p/_combos/ai-driven-search-knowledge-platform-803ad0.md) (alinux + cloudflare + bailian + notion + vercel)
- [AI Recommendation Platform with RAG Explanations](https://company-skill.com/p/_combos/ai-recommendation-platform-with-rag-explanations-8803cd.md) (airec + alinux + opensearch + bailian + es)
- [AIRec with Custom Models and Semantic Search](https://company-skill.com/p/_combos/airec-with-custom-models-and-semantic-search-fe8869.md) (airec + alinux + opensearch + cloudflare + bailian)
- [Branded SaaS Onboarding with Custom Stripe Elements](https://company-skill.com/p/_combos/branded-saas-onboarding-with-custom-stripe-eleme-eab729.md) (clerk + stripe + ecs + oss + terraform)

## Use with an AI agent

```bash
curl -s https://company-skill.com/api/route \
  -H 'Content-Type: application/json' \
  -d '{"query": "...", "product": "pai"}'
```

MCP server: https://company-skill.com/api/mcp/pai.py

---
Machine-readable: https://company-skill.com/llms.txt · https://company-skill.com/sitemap.xml