---
Title: Troubleshoot failure
URL Source: https://company-skill.com/p/airec/airec-troubleshoot-failure
Language: en
Description: You’re trying to diagnose and resolve a failure that occurred during the deployment of an AIRec system or instance. The failure could be at the configuration validation stage, during instance…
---

# Troubleshoot failure

Part of **AIRec**. Route queries via `POST https://company-skill.com/api/route`.

## What You Want to Do

You’re trying to diagnose and resolve a failure that occurred during the deployment of an AIRec system or instance. The failure could be at the configuration validation stage, during instance provisioning, or while services are converging to their target state.

- Why did my AIRec deployment fail?
- How to debug AIRec installation errors?

## Decision Tree

Pick the best path for your situation:

- **If** your deployment fails with `AccessDeniedException`, `ValidationError`, or `Deployment timed out after 30 minutes` → Use AIRec (go to *airec/airec-deployment*)
- **If** your deployment passes initial validation but shows `NOT YET FOUND`, `PrepareResource failed`, `ServiceNotInDesiredState`, or `RollingTaskFailed` → Use Console / Dashboard (go to *airec/airec-instance*)
- **If** the health check endpoint returns `503` or you need to validate configuration against schema → Use AIRec (go to *airec/airec-deployment*)
- **Otherwise (default)** → Start with **AIRec**, as most early-stage failures (e.g., permissions, config, model format) are handled there.
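
The branches above can be sketched as a small lookup. This is an illustrative helper, not part of AIRec: the function name `choose_path`, the constant names, and the idea of matching on substrings of the error text are all assumptions. The error strings and skill identifiers are taken from the decision tree.

```python
from typing import Optional

# Skill identifiers as written in the decision tree.
DEPLOYMENT_PATH = "airec/airec-deployment"
INSTANCE_PATH = "airec/airec-instance"

# Error strings listed in the first and second branches.
DEPLOYMENT_MARKERS = (
    "AccessDeniedException",
    "ValidationError",
    "Deployment timed out after 30 minutes",
)
INSTANCE_MARKERS = (
    "NOT YET FOUND",
    "PrepareResource failed",
    "ServiceNotInDesiredState",
    "RollingTaskFailed",
)


def choose_path(error_text: str, health_status: Optional[int] = None) -> str:
    """Hypothetical helper: map an error message to a troubleshooting path."""
    if any(marker in error_text for marker in DEPLOYMENT_MARKERS):
        return DEPLOYMENT_PATH
    if any(marker in error_text for marker in INSTANCE_MARKERS):
        return INSTANCE_PATH
    if health_status == 503:
        return DEPLOYMENT_PATH
    # Default branch: most early-stage failures are handled by the AIRec path.
    return DEPLOYMENT_PATH
```

For example, `choose_path("RollingTaskFailed on role X")` returns `airec/airec-instance`, while an unrecognized message falls through to the default.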

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|--------------|
| AIRec | Permission, configuration validation, timeout, and health check failures | medium | No | No | ModelFormat must be ONNX or PMML; InstanceType limited to standard or high-memory | `airec/troubleshooting/airec-deployment` |
| Console / Dashboard | Provisioning stalls and services that fail to converge after validation passes | high | No | No | Requires access to Apsara Infrastructure Management Framework console and Operation Logs | `airec/troubleshooting/airec-instance` |

## Path Details

### Path 1: AIRec

**Brief Description**: This path diagnoses system-level deployment failures using tools like `aliyun airec DescribeInstanceLogs`, `ValidateDeploymentConfig`, and the Deployment Planner. It addresses issues such as a missing `AliyunAIRecFullAccess` policy, invalid parameters, and health check failures where the endpoint returns `503`. **InstanceType must be standard or high-memory**.

**Key technical facts**:
- Supported model formats: ONNX, PMML
- Supported instance types: standard, high-memory

**When to Use**:
- Deployment fails with `AccessDeniedException` due to missing `AliyunAIRecFullAccess` policy
- Configuration fails schema validation (`ValidationError`)
- `Deployment timed out after 30 minutes` (common with large models)
- `Health check endpoint returns 503` after deployment completes

**When NOT to Use**:
- Problem is specific to PXE boot or hardware failure on a physical machine
- Rolling task is stuck or service state won’t converge (`ServiceNotInDesiredState`)
- You need to inspect cluster resource requests in the Apsara Infrastructure Management Framework

**Known Limitations**:
- Only supports standard and high-memory instance types
- Model format must be ONNX or PMML
- Default deployment timeout is 30 minutes; large models require manual adjustment
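
The format and instance-type limits above can be checked locally before submitting a deployment. The sketch below is an assumption-laden illustration: the `precheck` function, the flat dictionary layout, and the case-sensitive comparison are hypothetical, and it does not replace `ValidateDeploymentConfig`. Only the key names `ModelFormat` and `InstanceType` and their allowed values come from this page.

```python
from typing import Dict, List

# Allowed values as stated in the Known Limitations list.
ALLOWED_MODEL_FORMATS = {"ONNX", "PMML"}
ALLOWED_INSTANCE_TYPES = {"standard", "high-memory"}


def precheck(config: Dict[str, str]) -> List[str]:
    """Hypothetical local check; returns a list of human-readable problems."""
    problems = []
    model_format = config.get("ModelFormat")
    if model_format not in ALLOWED_MODEL_FORMATS:
        problems.append(
            f"ModelFormat {model_format!r} is not supported; use ONNX or PMML"
        )
    instance_type = config.get("InstanceType")
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        problems.append(
            f"InstanceType {instance_type!r} is not supported; "
            "use standard or high-memory"
        )
    return problems
```

An empty list means neither limit is violated; a config with a TensorFlow model on a `gpu` instance would produce two entries.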

### Path 2: Console / Dashboard

**Brief Description**: This path focuses on deep diagnostics within the `Apsara Infrastructure Management Framework`, using the `Cluster Dashboard`, `Server Role List`, and `Operation Logs`. It helps when deployment progresses past validation but stalls during instance provisioning or service activation, often showing errors like `NOT YET FOUND` or `PrepareResource failed`.

**Key technical facts**:
- Prerequisites include successful `IDC_CHECK`, accessible OPS1 server, and installation package mounted at `/mnt`
- Password login to Linux instances is blocked by RAM policy (`ecs:PasswordCustomized` denied); SSH key pairs required

**When to Use**:
- Error logs contain `NOT YET FOUND` or `PrepareResource failed`
- Service status shows `ServiceNotInDesiredState` or `RollingTaskFailed`
- Installation progress stalls at specific percentages (e.g., 30%, 70%)
- You need to inspect `dmesg | grep -i error` or verify DNS resolution (`nslookup ais-deploy.internal`)

**When NOT to Use**:
- Failure occurs during configuration validation (e.g., `Invalid value for parameter`)
- Error explicitly states `AccessDeniedException` (indicates missing IAM permissions)

**Known Limitations**:
- Only applicable in Apsara Infrastructure Management Framework environments
- Does not handle model format or configuration parameter validation (use airec-deployment path instead)
- Requires physical/virtual machine console access to run low-level diagnostics
- Password-based login is explicitly denied by RAM policy
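
The two low-level checks named under **When to Use** (`dmesg | grep -i error` and `nslookup ais-deploy.internal`) can be approximated in a script. This is a rough sketch under stated assumptions: `error_lines` and `resolves` are hypothetical helper names, the filter only mimics the case-insensitive match of `grep -i error` on text you have already captured, and the DNS check uses the system resolver rather than `nslookup` itself.

```python
import socket
from typing import List


def error_lines(log_text: str) -> List[str]:
    """Keep lines containing 'error' in any letter case, like `grep -i error`."""
    return [line for line in log_text.splitlines() if "error" in line.lower()]


def resolves(hostname: str) -> bool:
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return False
```

On the affected machine, you could pass captured `dmesg` output to `error_lines` and call `resolves("ais-deploy.internal")` to confirm that the deployment host name resolves.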

## FAQ

Q: Which path should I start with?
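A: Start with **AIRec** (the system-level path) unless you already see signs of partial deployment success (e.g., instances created, rolling tasks started); in that case use **Console / Dashboard** (the instance-level path). Most permission, config, and timeout errors are caught by the AIRec path.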

Q: What if I get `AccessDeniedException` but use the instance-level path?
A: You’ll waste time inspecting cluster dashboards when the real issue is missing the `AliyunAIRecFullAccess` IAM policy — which only the system-level path addresses.

Q: What if my model is in TensorFlow format but I follow the system-level path?
A: You’ll hit a hard failure because the system-level path only accepts a `ModelFormat` of ONNX or PMML. TensorFlow models aren’t accepted, per its limitations.

Q: Can I use the instance-level path if my deployment fails before any instances are created?
A: No. If no instances exist, there’s nothing to inspect in the `Server Role List` or `Cluster Dashboard`. That’s a system-level failure (e.g., `ValidationError`).

Q: Why does the system-level path mention `Deployment timed out after 30 minutes`?
A: AIRec enforces a 30-minute default timeout for deployments. Large models may exceed this, causing failure — the system-level path guides you to adjust timeouts manually.

Q: What happens if I try to troubleshoot a `RollingTaskFailed` error using the system-level path?
A: You won’t find relevant logs — rolling task status is only visible in the `Apsara Infrastructure Management Framework`’s `Operation Logs`, which the instance-level path uses.

Q: Is `Health check endpoint returns 503` always a system-level issue?
A: Yes. A 503 indicates the service was deployed but isn’t healthy — often due to misconfiguration or runtime incompatibility, which the system-level path validates via `DescribeInstanceLogs` and schema checks.

Q: What if I specify an unsupported InstanceType like 'gpu' but choose the AIRec path?
A: You’ll encounter a validation error because the AIRec path only supports an `InstanceType` of standard or high-memory. Other types are rejected during deployment planning.

Q: What if I try to use password authentication on a Linux instance in the instance-level path?
A: You’ll be denied access because RAM policies explicitly block `ecs:PasswordCustomized`; you must use SSH key pairs instead.

## Related queries

troubleshoot deployment failure, deployment failed, deployment error, deployment stuck, how to debug AIRec, why did deployment fail, AIRec install error, deployment timeout, service not ready, health check failed, AccessDeniedException, ValidationError, RollingTaskFailed, ServiceNotInDesiredState, N

---
Part of [AIRec](https://company-skill.com/p/airec.md) · https://company-skill.com/llms.txt
