---
Title: Extract documents
URL Source: https://company-skill.com/p/bailian/bailian-extract-documents
Language: en
Description: You want to extract text, structured data, or visual information from documents (like PDFs) and images, or translate text embedded within visual media while maintaining the original context and…
---

# Extract documents

Part of **Bailian (Alibaba Cloud Model Studio)**. Route queries via `POST https://company-skill.com/api/route`.

## What You Want to Do

You want to extract text, structured data, or visual information from documents (like PDFs) and images, or translate text embedded within visual media while maintaining the original context and layout.

**Typical User Questions**:
- How do I extract data from a PDF?
- OCR text extraction from images
- GUI automation using vision models
- Image translation preserving layout

## Decision Tree

Pick the best path for your situation:

- **If** you need to perform complex visual reasoning using `qvq-max` with `enable_thinking`, execute GUI automation using `gui-plus`, or mine structured data from complex PDF layouts using `qwen-doc-turbo` → Use **Multimodal Vision & Document Mining** (go to *bailian/bailian-multimodal*)
- **If** you need to translate text embedded in images while preserving the original layout using `qwen-mt-image`, or perform real-time speech-to-speech translation using `qwen3.5-livetranslate-flash-realtime` → Use **Specialized OCR & Image Translation** (go to *bailian/bailian-translation*)
- **Otherwise (default)** → Use **Multimodal Vision & Document Mining**. It is the most versatile path for general document understanding, raw OCR, and visual question-answering tasks.
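
The decision tree above can be sketched as a small routing helper. This is only an illustration: the task labels (`image_translation`, `realtime_speech_translation`) are hypothetical names invented here, and only the two target skill paths come from the tree itself.

```python
from enum import Enum


class Path(Enum):
    # Skill paths exactly as written in the decision tree.
    MULTIMODAL = "bailian/bailian-multimodal"
    TRANSLATION = "bailian/bailian-translation"


# Hypothetical task labels for this sketch (not part of any Bailian API).
TRANSLATION_TASKS = frozenset({"image_translation", "realtime_speech_translation"})


def choose_path(task: str) -> Path:
    """Return the translation path for translation tasks, else the default."""
    if task in TRANSLATION_TASKS:
        return Path.TRANSLATION
    # Visual reasoning, GUI automation, PDF mining, raw OCR, and any
    # unrecognized task fall through to the default path.
    return Path.MULTIMODAL
```

Defaulting to the multimodal path mirrors the "Otherwise" branch: an unknown task never raises, it simply lands on the more general path.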

## Path Comparison

| Path | Best For | Complexity | Code Required | Automation | Key Fact | Detail Skill |
|------|----------|------------|---------------|------------|----------|-------------|
| Multimodal Vision & Document Mining | Complex visual reasoning, GUI automation, and extracting structured data from complex PDF layouts using large vision models. | Medium | Yes | Yes | Document Data Mining is limited to max 253,952 input tokens per request. | `bailian/api/bailian-multimodal` |
| Specialized OCR & Image Translation | High-precision raw text extraction, layout preservation, and translating text embedded within images using dedicated OCR models. | Low | Yes | Yes | Image translation query API has a default rate limit of 1 RPS for polling task status. | `bailian/api/bailian-translation` |

## Path Details

### Path 1: Multimodal Vision & Document Mining

**Best For**: Complex visual reasoning, GUI automation, and extracting structured data from complex PDF layouts using large vision models.

**Brief Description**: 
This path uses the Bailian Multimodal Understanding and Interaction APIs to process images, videos, and documents. It relies on vision models such as `qvq-max` for step-by-step visual reasoning, `qwen-doc-turbo` for deep document data mining, and `gui-plus` for UI interaction. You can tune extraction behavior with parameters such as `vl_high_resolution_images`, `file_parsing_strategy`, and `ocr_options` to handle complex layouts and high-resolution inputs.

**Key technical facts**:
- Billing: Per-token billing model. Input tokens (including text, image, video, and audio tokens) and output tokens are priced separately.
- Concurrency: 100 QPS per model, with a maximum of 10 concurrent requests per model.
- Auth: Bearer Token (Header: Authorization: Bearer $DASHSCOPE_API_KEY)
- Regions: China (Beijing), Singapore (International), US (Virginia)
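
As a minimal sketch of the auth fact above, the header can be built from the `DASHSCOPE_API_KEY` environment variable. The helper name `build_auth_headers` is an assumption made for this example; only the header format comes from this page.

```python
import os
from typing import Mapping


def build_auth_headers(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the Bearer token header from DASHSCOPE_API_KEY."""
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment.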

**When to Use**:
- User needs to extract structured data, tables, and specific fields from complex PDF and document files using `qwen-doc-turbo`.
- User requires GUI automation and interaction based on UI screenshots using `gui-plus` models.
- User needs complex visual reasoning and step-by-step image analysis using `qvq-max` or thinking models.

**When NOT to Use**:
- User needs to translate text embedded within images while preserving the original layout (use `qwen-mt-image` in Specialized OCR & Image Translation).
- User needs real-time speech-to-speech translation or live audio/video stream translation with voice cloning (use `qwen3.5-livetranslate-flash-realtime` in Specialized OCR & Image Translation).

**Known Limitations**:
- Video maximum duration is 2 hours for qwen3.6 series, 1 hour for qwen3-vl series, and 10 minutes for other models.
- OCR is limited to a maximum of 8K tokens per request and a maximum image size of 10 MB.
- Document Data Mining is limited to max 253,952 input tokens per request, max 32,768 output tokens, and max 9,000 tokens per message.
- Audio understanding is limited to max 40 minutes of audio per request.
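
The limits above lend themselves to a pre-flight check before a request is sent. The sketch below is illustrative only: the function names are invented, matching model series by name prefix is an assumption, and token counts are expected to come from whatever tokenizer the caller already uses.

```python
from typing import Sequence

# Limits copied from the list above.
DOC_MINING_MAX_INPUT_TOKENS = 253_952
DOC_MINING_MAX_OUTPUT_TOKENS = 32_768
DOC_MINING_MAX_TOKENS_PER_MESSAGE = 9_000


def max_video_seconds(model: str) -> int:
    """Video duration cap by model series (prefix matching is an assumption)."""
    name = model.lower()
    if name.startswith("qwen3.6"):
        return 2 * 60 * 60
    if name.startswith("qwen3-vl"):
        return 60 * 60
    return 10 * 60


def check_doc_mining_request(
    input_tokens: int,
    max_output_tokens: int,
    message_tokens: Sequence[int] = (),
) -> list[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if input_tokens > DOC_MINING_MAX_INPUT_TOKENS:
        problems.append(f"input tokens {input_tokens} exceed {DOC_MINING_MAX_INPUT_TOKENS}")
    if max_output_tokens > DOC_MINING_MAX_OUTPUT_TOKENS:
        problems.append(f"output tokens {max_output_tokens} exceed {DOC_MINING_MAX_OUTPUT_TOKENS}")
    for index, count in enumerate(message_tokens):
        if count > DOC_MINING_MAX_TOKENS_PER_MESSAGE:
            problems.append(f"message {index} has {count} tokens, over {DOC_MINING_MAX_TOKENS_PER_MESSAGE}")
    return problems
```

Returning every violation at once, rather than raising on the first, lets a caller report all problems with a document in a single pass.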

### Path 2: Specialized OCR & Image Translation

**Best For**: High-precision raw text extraction, layout preservation, and translating text embedded within images using dedicated OCR models.

**Brief Description**: 
This path utilizes Bailian Translation and Localization APIs specifically designed for machine translation and image localization. It features `qwen-mt-image` for layout-preserving image translation, `qwen-mt-plus` for domain-specific text translation, and `gummy-realtime-v1` for live audio streams. You can control translation behavior using `translation_options`, `domainHint`, and `imageSegment`, and handle long-running image tasks by passing the `X-DashScope-Async: enable` header to retrieve a `task_id` for polling.
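
A minimal sketch of opting in to asynchronous processing: the helper below adds the `X-DashScope-Async: enable` header to an existing header dictionary. The helper name is hypothetical, and the request URL and body are deliberately left out because this page does not specify them.

```python
from typing import Mapping


def with_async_enabled(headers: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of the headers with asynchronous task mode switched on."""
    merged = dict(headers)  # copy so the caller's mapping is not mutated
    merged["X-DashScope-Async"] = "enable"
    return merged
```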

**Key technical facts**:
- Billing: Text/OCR billed per 1,000 tokens; Audio/Video billed per 1,000 tokens or per second; Image translation billed per successfully generated image; Real-time Speech (Gummy) billed per second of active connection.
- Concurrency: 100 QPS per model; max 10 concurrent WebSocket connections for real-time translation; default 1 RPS for polling the image translation task status API.
- Auth: Bearer Token (Header: Authorization: Bearer $DASHSCOPE_API_KEY)
- Regions: China (Beijing), Singapore (International), US (Virginia)

**When to Use**:
- User needs to translate text embedded within images while preserving the original layout using `qwen-mt-image`.
- User requires real-time speech-to-speech translation or live audio/video stream translation with voice cloning using `qwen3.5-livetranslate-flash-realtime`.
- User needs machine translation with custom terminology, translation memory, and domain prompting using `qwen-mt-plus`.

**When NOT to Use**:
- User needs to extract structured data, tables, and specific fields from complex PDF and document files (use `qwen-doc-turbo` in Multimodal Vision & Document Mining).
- User needs GUI automation and interaction based on UI screenshots (use `gui-plus` in Multimodal Vision & Document Mining).

**Known Limitations**:
- Image translation limits: Max 100 MB per image, dimensions between 15x15 and 8192x8192 pixels, and URL cannot contain Chinese characters.
- The image translation query API has a default rate limit of 1 RPS; if you need status updates more often than that, use async task callbacks instead of polling.
- Qwen-MT text translation models have a max of 8,192 tokens per request.
- Gummy short-sentence models have a max 1 minute audio duration per task.
- Real-time translation (Gummy) only supports one target language for translation at a time.
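
The image translation limits above can be checked locally before a task is submitted. This is a sketch under stated assumptions: 1 MB is taken as 1024 * 1024 bytes, the dimension bounds are treated as inclusive, "Chinese characters" is approximated by the basic CJK Unified Ideographs block, and the function name is invented for this example.

```python
import re

MAX_IMAGE_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MIN_SIDE_PX = 15
MAX_SIDE_PX = 8192
# Assumption: only the basic CJK Unified Ideographs block is checked.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def image_translation_problems(url: str, size_bytes: int, width: int, height: int) -> list[str]:
    """Return a list of limit violations; an empty list means the image qualifies."""
    problems = []
    if _CJK.search(url):
        problems.append("URL contains Chinese characters")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("image is larger than 100 MB")
    sides_ok = all(MIN_SIDE_PX <= side <= MAX_SIDE_PX for side in (width, height))
    if not sides_ok:
        problems.append("dimensions must be between 15x15 and 8192x8192 pixels")
    return problems
```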

## FAQ

Q: Which path should I start with?
A: Start with **Multimodal Vision & Document Mining** if your primary goal is to read, understand, or extract data from PDFs and images into text or JSON formats. It is the most robust path for general document intelligence and visual reasoning.

Q: What if I need to extract structured tables from a 50-page PDF but chose Specialized OCR & Image Translation?
A: The translation path cannot do this: it lacks document data mining capabilities, and its text translation models are limited to 8,192 tokens per request. Use `qwen-doc-turbo` in the Multimodal path instead, which supports up to 253,952 input tokens per request and is intended for complex PDF layout extraction.

Q: What if I need to translate a UI screenshot while preserving the original layout but chose Multimodal Vision & Document Mining?
A: The layout will not be preserved. The Multimodal path handles extraction and reasoning; it does not generate layout-preserving images. Use `qwen-mt-image` in the Specialized OCR & Image Translation path to generate a translated image with the original visual layout intact.

Q: Can I use the multimodal path for real-time speech-to-speech translation during a live video stream?
A: No. The Multimodal path only supports audio understanding (up to 40 minutes per request) for analysis, not live translation. For real-time speech-to-speech translation with voice cloning, you must use `qwen3.5-livetranslate-flash-realtime` in the Specialized OCR & Image Translation path.

Q: What are the concurrency limits if I process a large batch of document translations?
A: Both paths support 100 QPS per model for standard API calls. However, if you are using the async image translation API in the Translation path, polling the task status endpoint is strictly rate-limited to 1 RPS by default. You should implement async task callbacks instead of aggressive polling to avoid throttling.
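
Where polling is still needed, the 1 RPS limit can be respected by spacing status checks at least one second apart. The sketch below is illustrative: `fetch_status` is a caller-supplied function (the real status endpoint is not described on this page), and the status strings `SUCCEEDED` and `FAILED` are assumed terminal values that may differ in practice.

```python
import time
from typing import Callable, Collection


def poll_task(
    fetch_status: Callable[[str], str],
    task_id: str,
    *,
    terminal: Collection[str] = ("SUCCEEDED", "FAILED"),  # assumed status names
    min_interval: float = 1.0,  # one request per second
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a terminal status, never starting calls closer than min_interval."""
    for _ in range(max_attempts):
        started = clock()
        status = fetch_status(task_id)
        if status in terminal:
            return status
        # Sleep only for the part of the interval the call itself did not use.
        sleep(max(0.0, min_interval - (clock() - started)))
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} checks")
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in production the defaults apply.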

Q: How do I handle high-resolution images for OCR in the Multimodal path?
A: Enable the `vl_high_resolution_images` parameter in your API request so the model processes the image at full resolution rather than downscaling it. Keep in mind the hard limits of 10 MB per image and 8K tokens per OCR request.
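
A small sketch of that advice: check the image size first, then switch the flag on. The helper name is hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and where the returned option belongs in the request body depends on the API variant you call, which this page does not specify.

```python
MAX_OCR_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def high_res_options(image_size_bytes: int) -> dict[str, bool]:
    """Return the high-resolution flag after checking the 10 MB image limit."""
    if image_size_bytes > MAX_OCR_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB OCR limit")
    # Placement of this option inside the request is left to the caller.
    return {"vl_high_resolution_images": True}
```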

## Related queries

extract data from PDF, document data mining, image text extraction, OCR text extraction, extract information from documents, image translation, document understanding, how to extract data from PDF, how to do OCR, how to translate image text, how to read PDF, Qwen-VL, Qwen-OCR, Qwen-Doc-Turbo, Qwen-M

---
Part of [Bailian (Alibaba Cloud Model Studio)](https://company-skill.com/p/bailian.md) · https://company-skill.com/llms.txt
