---
Title: Transcribe speech
URL Source: https://company-skill.com/p/bailian/bailian-transcribe-speech
Language: en
Description: You want to convert spoken audio into text, translate speech across languages in real-time or in pre-recorded files, or integrate speech recognition directly into a native mobile application. Typical…
---

# Transcribe speech

Part of **Bailian (Alibaba Cloud Model Studio)**. Route queries via `POST https://company-skill.com/api/route`.

## What You Want to Do

You want to convert spoken audio into text, translate speech across languages in real-time or in pre-recorded files, or integrate speech recognition directly into a native mobile application.

**Typical User Questions**:
- How do I transcribe audio files?
- How does Bailian do real-time speech recognition?
- Real-time speech translation API
- How do I integrate the speech recognition SDK on mobile?
- Translate live audio streams
- How do I add custom hotwords to speech recognition?

## Decision Tree

Pick the best path for your situation:

- **If** your primary goal is single-language speech-to-text transcription using models like `paraformer-v2` or `fun-asr`, and you need to process pre-recorded files up to 2GB via **Async Task** or live streams via **WebSocket** → Use **ASR Transcription API** (go to *bailian/api/bailian-asr*)
- **If** you need cross-language speech-to-speech translation, live stream localization, or multi-language file dubbing using models like `qwen3.5-livetranslate-flash-realtime` or `gummy-realtime-v1` → Use **Speech Translation & Dubbing API** (go to *bailian/api/bailian-translation*)
- **If** you are building a native Android or iOS app and need to embed pre-compiled SDKs (like `nuisdk.framework` or AAR files) with mobile-optimized security → Use **Mobile SDK Integration** (go to *bailian/guide/bailian-asr*)
- **Otherwise (default)** → Use **ASR Transcription API**, as it is the most versatile backend solution for general speech-to-text tasks and custom hotword management.
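
The decision tree above can be sketched as a small routing function. This is only an illustration: `Requirements` and `choose_path` are hypothetical names, and checking translation before mobile when both flags are set is an assumption that follows the order of the bullets.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical flags describing what the caller needs."""
    cross_language: bool = False     # translation, localization, or dubbing
    native_mobile_app: bool = False  # Android/iOS app embedding a pre-compiled SDK


def choose_path(req: Requirements) -> str:
    """Return the path name suggested by the decision tree."""
    if req.cross_language:
        return "Speech Translation & Dubbing API"
    if req.native_mobile_app:
        return "Mobile SDK Integration"
    # Single-language transcription and the default case both land here.
    return "ASR Transcription API"


print(choose_path(Requirements(native_mobile_app=True)))
```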

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| ASR Transcription API | Core speech-to-text transcription, custom hotword management, and live streaming recognition. | Medium | Yes | Yes | Billed per second of audio processed (e.g., ~0.00022 CNY/sec for fun-asr). | `bailian/api/bailian-asr` |
| Speech Translation & Dubbing API | Cross-language speech-to-speech translation, live stream localization, and multi-language file dubbing. | Medium | Yes | Yes | Real-time translation limits image input to a max of 2 images per second for visual context. | `bailian/api/bailian-translation` |
| Mobile SDK Integration | Embedding on-device or mobile-optimized speech recognition into Android/iOS applications. | High | Yes | No | Uses short-lived temporary API keys (valid for 60 seconds) for secure mobile auth. | `bailian/guide/bailian-asr` |

## Path Details

### Path 1: ASR Transcription API

**Best For**: Core speech-to-text transcription, custom hotword management, and live streaming recognition.

**Brief Description**: A stateless HTTP and **WebSocket** API service for transcribing live audio streams or pre-recorded audio files into text using models like **paraformer-v2** and **fun-asr**. It supports custom hotwords (speech-biasing) and speaker diarization to improve domain-specific accuracy.

**Key technical facts**:
- Billing: Billed per second of audio processed (e.g., ~0.00022 CNY/sec for fun-asr) or per 1,000 tokens.
- Max concurrency: 100 QPS per model for REST APIs; up to 10 concurrent WebSocket connections per host.
- Regions available: China (Beijing), International (Singapore).
- Prerequisites: **DASHSCOPE_API_KEY** environment variable configured; publicly accessible URLs for async file transcription (max 100 URLs, max 2GB per file).
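
To get a feel for per-second billing, the example rate above can be turned into a rough estimate. This sketch assumes the quoted ~0.00022 CNY/sec figure for `fun-asr` and a hypothetical helper name `estimate_cost_cny`; actual pricing may differ by model and region.

```python
from decimal import Decimal

# Example rate quoted above for fun-asr; check current pricing before relying on it.
FUN_ASR_CNY_PER_SECOND = Decimal("0.00022")


def estimate_cost_cny(audio_seconds: int, rate: Decimal = FUN_ASR_CNY_PER_SECOND) -> Decimal:
    """Estimate per-second billing for a given audio duration."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return rate * audio_seconds


# One hour of audio at the example rate.
print(estimate_cost_cny(3600))  # 0.79200
```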

**When to Use**:
- Need to transcribe pre-recorded audio files up to 2GB and 12 hours in duration using the **Async Task** API (requires header `X-DashScope-Async: enable`).
- Require speaker diarization or custom hotword management to improve domain-specific accuracy.
- Building a backend service that processes live audio streams via WebSocket with low latency.
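
A minimal sketch of preparing an **Async Task** request, without sending it. Only the `X-DashScope-Async: enable` header, the `DASHSCOPE_API_KEY` variable, and the 100-URL limit come from this page; the endpoint URL, the Bearer scheme, and the `input.file_urls` body layout are assumptions to verify against the detail skill.

```python
import json
import os

# Assumed endpoint for the China (Beijing) region; verify before use.
ASYNC_TRANSCRIPTION_URL = "https://dashscope.aliyuncs.com/api/v1/services/audio/asr/transcription"
MAX_FILE_URLS = 100  # limit stated in the prerequisites


def build_async_request(file_urls, model="paraformer-v2", api_key=None):
    """Return (headers, body) for an async transcription task without sending it."""
    if not file_urls:
        raise ValueError("at least one file URL is required")
    if len(file_urls) > MAX_FILE_URLS:
        raise ValueError(f"at most {MAX_FILE_URLS} URLs per task, got {len(file_urls)}")
    for url in file_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url}")
    api_key = api_key or os.environ["DASHSCOPE_API_KEY"]
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable",  # required header for the Async Task API
    }
    # Body layout (model / input.file_urls) is an assumption for illustration.
    body = json.dumps({"model": model, "input": {"file_urls": list(file_urls)}})
    return headers, body
```

The returned pair can be passed to any HTTP client as a `POST` to `ASYNC_TRANSCRIPTION_URL`. Validating the URL count locally avoids a round trip for a request that would be rejected anyway.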

**When NOT to Use**:
- Need cross-language speech-to-speech translation or live stream localization.
- Building a native Android/iOS app and want to embed on-device or mobile-optimized SDKs without managing raw WebSocket frames.

**Known Limitations**:
- **Async Task** file transcription requires audio files to be hosted on publicly accessible URLs.
- **WebSocket** connections may time out if there is prolonged silence without a heartbeat or finish-task event.
- API keys are region-specific; a China (Beijing) key will not work on the International (Singapore) endpoint.
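
One way to avoid the silence timeout is to close the task explicitly once no audio has been sent for a while. The sketch below is hedged: the `header`/`payload` message layout and the 20-second threshold are assumptions, since this page does not state the server's actual timeout or wire format.

```python
import json
import time
from typing import Optional

IDLE_LIMIT_SECONDS = 20.0  # hypothetical threshold; the real server timeout is not stated here


def finish_task_message(task_id: str) -> str:
    """Build a finish-task event. The header/payload layout is an assumption."""
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })


def should_finish(last_audio_at: float, now: Optional[float] = None,
                  idle_limit: float = IDLE_LIMIT_SECONDS) -> bool:
    """True once no audio has been sent for at least `idle_limit` seconds."""
    now = time.monotonic() if now is None else now
    return now - last_audio_at >= idle_limit
```

In a streaming loop, you would record `time.monotonic()` after each audio frame and send `finish_task_message(task_id)` over the socket when `should_finish` returns `True`.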

### Path 2: Speech Translation & Dubbing API

**Best For**: Cross-language speech-to-speech translation, live stream localization, and multi-language file dubbing.

**Brief Description**: An API service for real-time speech-to-speech translation, audio/video file dubbing, and cross-language live stream localization using models like **qwen3.5-livetranslate-flash-realtime** and **gummy-realtime-v1**. It handles multiple **modalities** including audio and visual context to disambiguate terms.

**Key technical facts**:
- Billing: Billed per 1,000 tokens (e.g., 0.002 CNY/1K tokens for qwen-mt-plus) or per second of active connection (e.g., 0.00015 CNY/sec for gummy-realtime-v1).
- Max concurrency: 100 QPS per model; max 10 concurrent WebSocket connections for real-time translation.
- Regions available: China (Beijing), International (Singapore).
- Prerequisites: **DASHSCOPE_API_KEY** environment variable configured; OpenAI SDK >= 1.0.0 or DashScope SDK >= 1.14.0.

**When to Use**:
- Need to perform real-time speech-to-speech translation with voice cloning (`session.enable_voice_clone`) to preserve the original speaker's voice.
- Translating live video streams where visual context (image frames) is needed to improve translation accuracy.
- Require cross-language dubbing for pre-recorded audio/video files using the OpenAI-compatible streaming API.

**When NOT to Use**:
- Only need single-language speech recognition without cross-language translation.
- Need to translate text embedded within static images while preserving the original layout.

**Known Limitations**:
- Gummy models only support translation into one target language at a time (`translationLanguages` max length 1).
- When using the OpenAI Python SDK for file translation, custom parameters like `translation_options` must be wrapped in the `extra_body` dictionary.
- Real-time audio and video translation limits image input to a maximum of 2 images per second for visual context.
- Real-time streaming requires you to manage the input audio buffer yourself, for example by sending `input_audio_buffer.append` events.
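
A minimal sketch of feeding audio to the input buffer in fixed-size chunks. The event type `input_audio_buffer.append` is named on this page; the base64 `audio` field follows the OpenAI Realtime convention and is an assumption here, as is the chunk size.

```python
import base64
import json

CHUNK_BYTES = 3200  # hypothetical: 100 ms of 16 kHz, 16-bit mono PCM


def audio_append_events(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield JSON events that append successive audio chunks to the input buffer."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Base64 payload under "audio" is assumed, not confirmed by this page.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# 7000 bytes of silence split into chunks of 3200, 3200, and 600 bytes.
events = list(audio_append_events(b"\x00" * 7000))
print(len(events))  # 3
```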

### Path 3: Mobile SDK Integration

**Best For**: Embedding on-device or mobile-optimized speech recognition into Android/iOS applications.

**Brief Description**: A console-guided integration path for embedding Alibaba Cloud's speech recognition capabilities into native Android and iOS applications using pre-compiled SDKs. It involves configuring native build phases like **Embed & Sign** and **Link Binary With Libraries** to ensure proper framework loading.

**Key technical facts**:
- Billing: Billed per minute or per 1,000 tokens depending on the underlying ASR model selected in the console (e.g., 0.002 CNY/min for qwen3-asr-flash-realtime).
- Runtimes: Android (AAR / C++ via android_libs), iOS (via Xcode).
- Auth method: **short-lived temporary API key** (valid for 60 seconds) recommended for mobile applications.
- Prerequisites: Android Studio or Xcode installed; API key obtained from Model Management console.

**When to Use**:
- Building a native Android or iOS application and need pre-compiled SDKs (AAR/framework) to handle audio streaming and microphone access.
- Need to implement secure mobile authentication using a **short-lived temporary API key** to prevent long-term key compromise in client apps.
- Want to use the Bailian Console's 'SDK Download and Integration' wizard to quickly scaffold mobile speech recognition features.
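
Because the temporary key is valid for only 60 seconds, the client needs to refresh it before it expires. The logic is sketched in Python for consistency with the other examples on this page; in a real app it would be ported to Kotlin or Swift. `TemporaryKeyCache`, `fetch_key`, and the 10-second margin are hypothetical.

```python
import time
from typing import Callable, Optional

KEY_TTL_SECONDS = 60.0         # validity stated above
REFRESH_MARGIN_SECONDS = 10.0  # hypothetical safety margin


class TemporaryKeyCache:
    """Caches a short-lived key and refetches it shortly before it expires."""

    def __init__(self, fetch_key: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_key = fetch_key  # hypothetical call to your own backend
        self._clock = clock
        self._key: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a key that still has at least the safety margin left."""
        now = self._clock()
        if self._key is None or now >= self._expires_at - REFRESH_MARGIN_SECONDS:
            self._key = self._fetch_key()
            self._expires_at = now + KEY_TTL_SECONDS
        return self._key
```

Keeping the long-term key on your backend and handing the app only what `fetch_key` returns is the point of the design: a leaked temporary key stops working within a minute.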

**When NOT to Use**:
- Building a backend Python/Java service to process large pre-recorded audio files via Async Tasks.
- Need cross-language speech-to-speech translation or live video stream localization.

**Known Limitations**:
- Requires manual SDK integration steps, such as adding AAR files to `app/libs` or setting `nuisdk.framework` to 'Embed & Sign' in Xcode Build Phases.
- Does not provide the full backend Async Task file transcription API directly; focuses on real-time mobile streaming and UI integration (e.g., using `DashGummySpeechRecognizerActivity.java` or `DashGummySpeechTranscriberViewController`).
- High-concurrency TTS optimization requires specific Java SDK environment variables which are backend-focused, not mobile-native.

## FAQ

Q: Which path should I start with?
A: If you are building a backend service and just need accurate speech-to-text transcription with custom hotwords, start with the **ASR Transcription API**. It is the most versatile default for processing both live WebSocket streams and large pre-recorded files via Async Tasks.

Q: What if I need to translate a live video stream but chose the ASR Transcription API?
A: The ASR Transcription API cannot serve this use case: it outputs single-language text only and supports neither image frame inputs nor `session.enable_voice_clone`. Switch to the **Speech Translation & Dubbing API**.

Q: What if I am building a native Android app but chose the ASR Transcription API backend approach?
A: Calling the raw backend API from a mobile app risks exposing your long-term `DASHSCOPE_API_KEY` in client code and forces you to manage raw WebSocket frames yourself. Use the **Mobile SDK Integration** path to leverage pre-compiled SDKs and secure 60-second temporary API keys.

Q: Can I use the Speech Translation API to translate text inside a static image?
A: No. The Speech Translation & Dubbing API is designed for audio/video streams and dubbing. If you need to translate text embedded within static images while preserving the layout, you should look into Qwen-MT-Image Async Task instead.

Q: How do I pass custom translation parameters when using the OpenAI Python SDK for file dubbing?
A: When using the OpenAI-compatible endpoint for translation, you cannot pass custom parameters directly. You must wrap parameters like `translation_options` inside the `extra_body` dictionary in your API call.
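
A minimal sketch of the wrapping described above. `extra_body` is a standard argument of the OpenAI Python SDK; the keys inside `translation_options` (`source_lang`, `target_lang`) and the helper names are assumptions for illustration, and the model name is left to the caller.

```python
def build_translation_kwargs(messages, model, source_lang="auto", target_lang="English"):
    """Build keyword arguments for `client.chat.completions.create`."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        # Non-standard parameters go inside extra_body, not at the top level.
        "extra_body": {
            # Key names inside translation_options are assumptions for illustration.
            "translation_options": {"source_lang": source_lang, "target_lang": target_lang},
        },
    }


def translate(client, messages, model):
    """`client` is expected to be an `openai.OpenAI` instance for the compatible endpoint."""
    return client.chat.completions.create(**build_translation_kwargs(messages, model))
```

Passing `translation_options=...` as a top-level keyword would be rejected by the SDK as an unexpected argument, which is why the wrapper is needed.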

Q: What happens if my audio file is 3GB and I try to use the ASR Async Task API?
A: The Async Task API enforces a strict maximum of 2GB per file, so a 3GB file will be rejected. Split the audio into chunks of at most 2GB each and host every chunk at a publicly accessible URL.
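
A rough way to plan the split, assuming a roughly constant bitrate so that equal-duration parts are also roughly equal in size. Treating 2GB as binary gigabytes is an assumption, and `chunks_needed` is a hypothetical helper; the 12-hour limit is the duration cap stated for the Async Task API.

```python
import math

MAX_FILE_BYTES = 2 * 1024**3      # 2GB limit; binary gigabytes are an assumption
MAX_DURATION_SECONDS = 12 * 3600  # 12-hour limit stated for the Async Task API


def chunks_needed(size_bytes: int, duration_seconds: float) -> int:
    """Smallest number of equal parts that keeps each part within both limits."""
    by_size = math.ceil(size_bytes / MAX_FILE_BYTES)
    by_duration = math.ceil(duration_seconds / MAX_DURATION_SECONDS)
    return max(1, by_size, by_duration)


# A 3GB, one-hour recording needs two parts.
print(chunks_needed(3 * 1024**3, 3600))  # 2
```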

Q: What if I use a China (Beijing) API key on the International (Singapore) endpoint?
A: API keys are strictly region-specific. If you use a Beijing key on the Singapore endpoint, your authentication will fail with an invalid token error. Ensure your environment variable matches the region of the endpoint you are calling.
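
One way to keep the key and endpoint in sync is to resolve both from the environment in one place. In this sketch the `DASHSCOPE_REGION` variable and the two base URLs are assumptions; only `DASHSCOPE_API_KEY` and the two region names come from this page.

```python
import os

# Base URLs are assumptions; confirm them in the console for your account.
REGION_BASE_URLS = {
    "beijing": "https://dashscope.aliyuncs.com",
    "singapore": "https://dashscope-intl.aliyuncs.com",
}


def resolve_endpoint(env=os.environ):
    """Return (base_url, api_key), refusing regions this table does not know."""
    region = env.get("DASHSCOPE_REGION", "beijing")  # hypothetical variable
    if region not in REGION_BASE_URLS:
        raise ValueError(f"unknown region: {region!r}")
    # The key must have been issued for the same region as the endpoint.
    return REGION_BASE_URLS[region], env["DASHSCOPE_API_KEY"]
```

Failing fast on an unknown region is a deliberate choice: it surfaces a misconfiguration at startup instead of as an authentication error on the first request.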

## Related queries

transcribe audio, speech to text, ASR, automatic speech recognition, translate speech, speech translation, audio dubbing, live stream translation, real-time speech recognition, how to transcribe, how to translate audio, voice to text, audio transcription, speech recognition SDK, mobile ASR, on-devic
