DaaS / Products / Document Extraction to Searchable Index Pipeline

Document Extraction to Searchable Index Pipeline

A developer extracts text and structured data from PDFs, scanned images, and other documents using Bailian's document understanding capabilities, then ingests the processed content into Elasticsearch to build a searchable full-text index over previously unsearchable document archives.

Products involved

Scenario

When developers must unlock full-text search across legacy PDFs, scanned invoices, or image-heavy archives, they need a pipeline that converts unstructured pixels into query-ready text. This workflow combines Bailian’s vision-language extraction with Elasticsearch’s high-throughput ingestion and relevance tuning to transform static document dumps into a production-grade, low-latency search index.

Integration steps

Configure Bailian Extraction: Set DASHSCOPE_API_KEY and target the qvq-max model. Call POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions with {"model": "qvq-max", "enable_thinking": true} to preserve complex layouts during OCR.
Extract & Normalize Payloads: Send base64-encoded documents. Parse the JSON response to isolate {"doc_id": "inv_001", "content": "...", "tables": [...]}. Strip headers/footers using regex before indexing.
Define ES Index Mapping: Run PUT /document-archive with mappings.properties.content.type: "text" and mappings.properties.doc_id.type: "keyword". Set "settings.index.refresh_interval": "30s" to optimize write throughput.
Batch Ingest via Bulk API: Use POST /_bulk with {"index": {"_index": "document-archive", "_id": "inv_001"}} followed by the document JSON. Keep payloads under 5MB per request to sustain up to 100 QPS.
Stage & Commit Changes: Ingest with ?refresh=false to buffer writes. Once the batch completes, trigger POST /document-archive/_refresh to atomically commit staged documents to the searchable index.
Apply Relevance Tuning: Deploy neural reranking via POST /_search with "rank": {"type": "rrf", "window_size": 100} or load domain terms using PUT /_ingest/pipeline/synonym-pipeline to expand query matching.

Architecture

Raw files enter a client-side processor or message queue. Bailian acts as the extraction layer, running multimodal inference to convert PDFs/images into structured JSON. The normalized output is pushed to Elasticsearch via the Bulk API. Elasticsearch handles inverted index construction, storage, and query routing, while optional ingest pipelines and rerankers intercept queries to refine result ordering before returning to the client.

Prerequisites

Active Bailian workspace with qvq-max model quota and API key
Elasticsearch/OpenSearch cluster (v7.10+ or OpenSearch 2.x) with REST API access
Python/Node.js runtime with requests or official elasticsearch client
Cluster storage sized at ~2x raw document volume to accommodate index overhead

Common pitfalls

Context Overflow: Large multi-page PDFs exceed qvq-max limits; chunk by page or section before sending to Bailian.
Refresh Throttling: Setting refresh=true per document drops throughput below 100 QPS; always use refresh=false and batch 500–1000 docs.
Layout Degradation: Disabling enable_thinking strips table boundaries and reading order; validate extraction output against complex forms before indexing.
Noise Pollution: Indexing raw OCR artifacts (page numbers, watermarks) dilutes BM25 scores; filter via ingest_pipeline remove processors before bulk ingestion.

Typical questions

extract PDF text and make it searchable
OCR documents then index in elasticsearch
build search over scanned documents
从PDF提取数据并建立搜索索引
文档抽取后导入Elasticsearch
how to index scanned PDFs in elasticsearch
document processing pipeline to search
批量文档OCR识别后入库ES

FAQ

Q: How do I extract text from PDFs or scanned documents and index them in Elasticsearch to make them searchable? A: You can build a searchable index by using Bailian's vision-language extraction to convert documents into structured JSON and then ingesting the normalized payloads into Elasticsearch via the Bulk API. Configure the qvq-max model with enable_thinking set to true to preserve complex layouts, then batch ingest the cleaned text with refresh=false to maintain high throughput. Finally, trigger a manual refresh and apply relevance tuning like neural reranking or synonym pipelines to optimize search results.