---
Title: Train model
URL Source: https://company-skill.com/p/pai/pai-train-model
Language: en
Description: You want to train a machine learning model on Alibaba Cloud’s Platform for AI (PAI), whether it's a standard computer vision task like GAN or image classification, a custom architecture like a pose…
---

# Train model

Part of **Platform for AI (PAI)**. Route queries via `POST https://company-skill.com/api/route`.

## What You Want to Do

You want to train a machine learning model on Alibaba Cloud’s Platform for AI (PAI), whether it is a standard computer vision task such as GAN training or image classification, a custom architecture such as a pose estimation model, or a full pipeline such as a recommendation system.

**Typical User Questions**:
- How can I train an image classification model on PAI?
- What options do I have to train a GAN in PAI?
- Can I train pose estimation models using PAI?
- Is there a visual way to build and train models?
- How do I train a model with PyTorch on PAI?
- What training frameworks does PAI support?
- How do I train a recommendation system model?

## Decision Tree

Pick the best path for your situation:

- **If** you are training a standard CV/NLP model (e.g., GAN, image classification, object detection) using a prebuilt GPU-optimized container like `registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu` and want to avoid writing code → Use CV/NLP GAN (go to *pai/pai-image*)
- **If** you need full control over training code and environment, require specific instance types like `ecs.gn5-c8g1.2xlarge`, or need interactive terminal access via `CreateInstanceWebTerminal` → Use Pose (go to *pai/pai-instance*)
- **If** you are integrating training into an automated workflow, need to programmatically monitor metrics like `MetricType=GpuMemoryUsage`, or manage jobs via `ListTrainingJobs` and `HyperParameters` → Use API (go to *pai/pai-training-job*)
- **Otherwise (default)** → Use Designer ML if you prefer a drag-and-drop interface, are building end-to-end pipelines (e.g., for recommendation systems), and can structure workflows using a `PAI-Flow manifest` with `apiVersion: core/v1` (go to *pai/pai-pipeline*)
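
The branches above are evaluated top to bottom, with Designer ML as the fallback. A minimal sketch of that ordering follows; the `TrainingNeeds` flags and the `choose_path` function are hypothetical names introduced for illustration, and the returned strings simply reuse the skill names from the tree.

```python
from dataclasses import dataclass


@dataclass
class TrainingNeeds:
    """Hypothetical flags summarizing the conditions in the decision tree."""
    standard_cv_nlp_no_code: bool = False   # prebuilt image, no training code
    full_control_or_terminal: bool = False  # custom code, specific EcsSpec, web terminal
    automated_workflow: bool = False        # programmatic job and metric management


def choose_path(needs: TrainingNeeds) -> str:
    """Return the detail skill for the first matching branch, top to bottom."""
    if needs.standard_cv_nlp_no_code:
        return "pai/pai-image"         # CV/NLP GAN
    if needs.full_control_or_terminal:
        return "pai/pai-instance"      # Pose
    if needs.automated_workflow:
        return "pai/pai-training-job"  # API
    return "pai/pai-pipeline"          # Designer ML (default)
```

Because the first match wins, a request that needs both full control and automation resolves to `pai/pai-instance` in this sketch; reorder the checks if your priorities differ.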

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|--------------|
| CV/NLP GAN | Standard CV/NLP tasks without writing code | low | No | No | Max model size: `Size=2` (GB); uses ACR image with `system.chipType=GPU` | `pai/guide/pai-image` |
| Pose | Full control over training code and environment | high | Yes | No | Supports `EcsSpec=ecs.gn5-c8g1.2xlarge` and `AcceleratorType=GPU` | `pai/guide/pai-instance` |
| API | Automated workflows and programmatic job management | medium | Yes | Yes | Uses Bearer Token authentication; each call billed regardless of success | `pai/api/pai-training-job` |
| Designer ML | Drag-and-drop, end-to-end pipelines | medium | No | No | Requires PAI-Flow manifest with `apiVersion: core/v1` and supports DataWorks scheduling | `pai/guide/pai-pipeline` |

## Path Details

### Path 1: CV/NLP GAN

**Brief Description**: This approach uses prebuilt container images from Alibaba Cloud Container Registry (ACR) to launch training without writing code. You register images using `AddImage`, specify metadata like `system.chipType=GPU` and `Accessibility=PUBLIC`, and reference them via `ImageUri`. The system supports `Custom images` that comply with naming rules and size constraints. You can list available images using `ListImages`.

**Key technical facts**:
- Max model size: `Size=2` (in GB)
- Runtimes: GPU
- Custom Docker: Yes

**When to Use**:
- Needing to quickly start common CV/NLP tasks (e.g., image generation, classification, object detection)
- Using prebuilt GPU-optimized images like `registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu`
- Avoiding writing training code by relying on built-in templates

**When NOT to Use**:
- Requiring full control over training environment and code (choose pai-instance)
- Needing automation workflow integration (choose pai-training-job)
- Preferring drag-and-drop UI over image management (choose pai-pipeline)

**Known Limitations**:
- Only supports configuration via predefined labels like `system.chipType=GPU`
- Custom images must follow naming rules (1–50 chars, lowercase letters, digits, hyphens)
- Image size must be explicitly set in GB
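
The naming and size constraints above can be checked before calling `AddImage`. The sketch below is illustrative only: the helper names are hypothetical, and the layout of the returned dictionary (including the `Name` and `Labels` keys) is an assumption rather than a confirmed request schema. Only `ImageUri`, `Accessibility=PUBLIC`, `Size`, and `system.chipType=GPU` come from this page.

```python
import re

# Naming rule from the limitations above: 1-50 characters drawn from
# lowercase letters, digits, and hyphens.
IMAGE_NAME_RE = re.compile(r"[a-z0-9-]{1,50}")


def is_valid_image_name(name: str) -> bool:
    """Check a custom image name against the documented naming rule."""
    return IMAGE_NAME_RE.fullmatch(name) is not None


def build_add_image_params(name: str, image_uri: str, size_gb: int) -> dict:
    """Assemble illustrative AddImage parameters (field layout is an assumption)."""
    if not is_valid_image_name(name):
        raise ValueError(f"invalid image name: {name!r}")
    if size_gb <= 0:
        raise ValueError("image size must be set explicitly, in GB")
    return {
        "Name": name,                # assumed field name
        "ImageUri": image_uri,
        "Accessibility": "PUBLIC",
        "Size": size_gb,             # in GB
        "Labels": [{"Key": "system.chipType", "Value": "GPU"}],  # assumed layout
    }
```

For example, `build_add_image_params("nlp-gpu", "registry.cn-hangzhou.aliyuncs.com/pai-compression/nlp:gpu", 2)` passes both checks, while a name such as `NLP_GPU` is rejected before any request is made.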

### Path 2: Pose 

**Brief Description**: This method provisions dedicated compute resources (e.g., ECS instances) where you fully manage the training stack. You create jobs via `CreateTrainingJob`, specify hardware like `EcsSpec=ecs.gn5-c8g1.2xlarge`, and access terminals using `CreateInstanceWebTerminal`. Resource management involves `ResourceId`, `ListNodes`, and node operations like `OperateNode` with `Cordon`.

**Key technical facts**:
- Supported instance types: ecs.gn5-c8g1.2xlarge, ml.gu7xf.8xlarge-gu108
- Runtimes: —
- Custom Docker: —

**When to Use**:
- Needing full control over training environment and code (e.g., custom Pose models)
- Requiring specific high-performance instance types like `ecs.gn5-c8g1.2xlarge`
- Needing interactive terminal access to debug via `CreateInstanceWebTerminal`

**When NOT to Use**:
- Wanting quick setup for standard tasks without code (choose pai-image)
- Needing automated pipeline integration (choose pai-training-job)
- Preferring no-code drag-and-drop interface (choose pai-pipeline)

**Known Limitations**:
- Requires manual resource group and machine group management (e.g., `CreateResourceGroup`)
- Instance type must be explicitly specified (e.g., `EcsSpec=ecs.gn5-c8g1.2xlarge`)
- Training jobs are submitted through `CreateTrainingJob`, which takes a complex set of parameters
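
Since the instance type must always be stated explicitly, it can help to fail early when it is missing. The following is a minimal sketch under stated assumptions: `build_compute_spec` and `is_listed_spec` are hypothetical helpers, and the dictionary is not the confirmed `CreateTrainingJob` schema; it only carries the `EcsSpec` and `AcceleratorType` values named on this page.

```python
# Instance types named in this section; the platform may offer others.
LISTED_ECS_SPECS = {"ecs.gn5-c8g1.2xlarge", "ml.gu7xf.8xlarge-gu108"}


def build_compute_spec(ecs_spec: str, accelerator_type: str = "GPU") -> dict:
    """Return the hardware portion of a job request (layout is an assumption)."""
    if not ecs_spec:
        raise ValueError("EcsSpec must be specified explicitly")
    return {"EcsSpec": ecs_spec, "AcceleratorType": accelerator_type}


def is_listed_spec(ecs_spec: str) -> bool:
    """True if the type is one of those named in this section."""
    return ecs_spec in LISTED_ECS_SPECS
```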

### Path 3: API 

**Brief Description**: This RESTful API approach enables programmatic lifecycle management of training jobs. You list jobs with `ListTrainingJobs`, define `HyperParameters`, track metrics like `MetricType=GpuMemoryUsage`, and authenticate with a Bearer token. Jobs are identified by `TrainingJobId`, and pagination uses `PageNumber` and `PageSize`.

**Key technical facts**:
- Billing model: per-request; each API call counts as one billed request, whether it succeeds or fails
- Regions available: cn-hangzhou, cn-shanghai, cn-beijing
- Auth method: Authorization: Bearer <your_api_key>
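
The auth method above amounts to one HTTP header. A small sketch of building it follows; the `auth_headers` helper and the `PAI_API_KEY` environment variable are hypothetical names chosen for illustration.

```python
import os
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization header in the 'Bearer <your_api_key>' form."""
    # PAI_API_KEY is an assumed variable name, not one defined by the platform.
    key = api_key or os.environ.get("PAI_API_KEY")
    if not key:
        raise ValueError("no API key provided")
    return {"Authorization": f"Bearer {key}"}
```

Reading the key from the environment keeps it out of source control; pass it explicitly in tests.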

**When to Use**:
- Integrating training into automated workflows or large-scale experiments
- Programmatically monitoring training metrics via `GetTrainingJobMetrics`
- Managing job labels and templates via `UpdateTrainingJobLabels`

**When NOT to Use**:
- Wanting rapid setup for common tasks without coding (choose pai-image)
- Needing interactive debugging environment (choose pai-instance)
- Preferring visual drag-and-drop interface (choose pai-pipeline)

**Known Limitations**:
- Every API call is billed, including failed requests
- Rate-limited to 100 QPS per account
- Requires handling pagination (`PageNumber`, `PageSize`) for large result sets
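
Because every call is billed, a pagination loop should stop as soon as the listing is exhausted. The sketch below assumes a hypothetical `fetch_page(page_number, page_size)` callable that wraps `ListTrainingJobs` and returns one page of jobs as a list; it also assumes that a page shorter than `PageSize` marks the end, which the actual API may signal differently (for example, with a total count).

```python
from typing import Callable, Iterator, List


def iter_training_jobs(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield jobs page by page until a short or empty page signals the end."""
    page_number = 1
    while True:
        jobs = fetch_page(page_number, page_size)  # one billed request per call
        yield from jobs
        if len(jobs) < page_size:
            break
        page_number += 1
```

When the total number of jobs is an exact multiple of `page_size`, this loop makes one extra request that returns an empty page. A larger `page_size` reduces the number of billed calls for the same listing.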

### Path 4: Designer ML 

**Brief Description**: PAI Designer enables no-code construction of end-to-end ML workflows using drag-and-drop components. Pipelines are defined via `PAI-Flow manifest` with `apiVersion: core/v1`, support `artifact` passing between nodes, and can integrate with `DataWorks scheduling` using `global variable` bindings. Execution uses `CreatePipeline` and `CreatePipelineRun`. This path covers Designer use cases such as building recommendation systems or risk control applications.

**Key technical facts**:
- Prerequisites: Access to PAI Designer console, data availability (e.g., MaxCompute tables), permissions to deploy models
- Runtimes: —
- Custom Docker: —

**When to Use**:
- Preferring drag-and-drop UI without writing training scripts
- Building full pipelines (e.g., recommendation systems, risk control apps)
- Needing integration with `DataWorks scheduling` via `global variable` binding

**When NOT to Use**:
- Training single CV/NLP tasks quickly (choose pai-image)
- Requiring full code/environment control (choose pai-instance)
- Needing API-driven automation instead of UI (choose pai-training-job)

**Known Limitations**:
- Pipeline definitions must conform to `PAI-Flow manifest` format (`apiVersion: core/v1`)
- Node status queries only support `Logical` or `Physical` types
- Cannot delete a pipeline if referenced by another (`withSequence` dependencies)
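
These constraints can be checked locally before calling the pipeline APIs. The following is a rough sketch, not the platform's validation logic: the function names are hypothetical, the manifest is assumed to be an already-parsed dictionary, and `references` is an assumed mapping from each pipeline ID to the IDs of pipelines it references.

```python
from typing import Dict, Set

REQUIRED_API_VERSION = "core/v1"
NODE_STATUS_TYPES = {"Logical", "Physical"}


def check_manifest(manifest: dict) -> None:
    """Raise if the manifest does not declare apiVersion: core/v1."""
    if manifest.get("apiVersion") != REQUIRED_API_VERSION:
        raise ValueError(f"apiVersion must be {REQUIRED_API_VERSION!r}")


def check_node_status_type(status_type: str) -> None:
    """Raise if a node status query uses an unsupported type."""
    if status_type not in NODE_STATUS_TYPES:
        raise ValueError("node status type must be 'Logical' or 'Physical'")


def can_delete(pipeline_id: str, references: Dict[str, Set[str]]) -> bool:
    """A pipeline is deletable only if no other pipeline references it."""
    return not any(
        pipeline_id in refs
        for owner, refs in references.items()
        if owner != pipeline_id
    )
```

With `references = {"parent": {"child"}, "child": set()}`, `can_delete("child", references)` is false until `parent` is removed or updated.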

## FAQ

Q: Which path should I start with?
A: If you're new and training standard models (e.g., image classification, GAN), start with CV/NLP GAN. If you're building a full pipeline like a recommender, use Designer ML.

Q: What if I need to debug my training script interactively but chose CV/NLP GAN?
A: You’ll hit a dead end — the template path doesn’t provide shell access or code modification. You’ll need to switch to Pose to use `CreateInstanceWebTerminal`.

Q: What if I’m running large-scale hyperparameter sweeps but used Designer ML?
A: You’ll lack programmatic control, because Designer doesn’t expose `ListTrainingJobs` or `HyperParameters` for bulk job management. Use API instead.

Q: Can I use custom Docker images in all paths?
A: Only `pai-image` explicitly supports `Custom images` and `ACR image` registration. Other paths may support it, but documentation doesn’t confirm — check their detail skills.

Q: What happens if I exceed the 100 QPS limit in API?
A: API calls will be throttled. Since every call is billed, including failed ones, throttled requests incur costs without making progress.
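
One way to avoid paying for throttled calls is to pace requests on the client side. The sketch below is a simple fixed-interval throttle, not a feature of the PAI API; the `Throttle` class is a hypothetical helper, and how the server counts requests toward the 100 QPS limit is not specified here, so a value below 100 leaves some margin.

```python
import time


class Throttle:
    """Space out calls so that at most max_qps calls start per second."""

    def __init__(self, max_qps: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Call `throttle.wait()` immediately before each API request, for example with `throttle = Throttle(max_qps=80)`.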

Q: Can I schedule recurring training in CV/NLP GAN?
A: No — this path lacks automation hooks like `TrainingJobId` tracking or `PageNumber`-based job listing. Use `pai-training-job` or integrate `pai-pipeline` with `DataWorks scheduling`.

Q: Why does CV/NLP GAN enforce `Size=2`?
A: The example shows a 2 GB limit for registered images. Larger models won’t fit unless the platform allows higher values; verify in the detail skill.

Q: What if I want to list available training images but chose Designer ML?
A: You won't have access to `ListImages` — that capability is only available in the CV/NLP GAN path.

Q: What if I'm building a recommendation system but chose API?
A: You’ll miss out on drag-and-drop pipeline composition and DataWorks scheduling integration. Designer ML is the path built for end-to-end pipelines such as recommendation systems.

## Related queries

train model, training ml model, how to train model, train machine learning model, model training, train deep learning model, train gan, train pose estimation, train image classifier, visual model training, no code model training, automated model training, custom model training, train with pytorch, t

---
Part of [Platform for AI (PAI)](https://company-skill.com/p/pai.md) · https://company-skill.com/llms.txt
