DaaS / Products / Document Extraction to Searchable Index Pipeline

Document Extraction to Searchable Index Pipeline

A developer extracts text and structured data from PDFs, scanned images, and other documents using Bailian's document understanding capabilities, then ingests the processed content into Elasticsearch to build a searchable full-text index over previously unsearchable document archives.

Products involved

Scenario

When developers must unlock full-text search across legacy PDFs, scanned invoices, or image-heavy archives, they need a pipeline that converts unstructured pixels into query-ready text. This workflow combines Bailian’s vision-language extraction with Elasticsearch’s high-throughput ingestion and relevance tuning to transform static document dumps into a production-grade, low-latency search index.

Integration steps

  1. Configure Bailian Extraction: Set DASHSCOPE_API_KEY and target the qvq-max model. Call POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions with {"model": "qvq-max", "enable_thinking": true} to preserve complex layouts during OCR.
  2. Extract & Normalize Payloads: Send base64-encoded documents. Parse the JSON response to isolate {"doc_id": "inv_001", "content": "...", "tables": [...]}. Strip headers/footers using regex before indexing.
  3. Define ES Index Mapping: Run PUT /document-archive with mappings.properties.content.type: "text" and mappings.properties.doc_id.type: "keyword". Set "settings.index.refresh_interval": "30s" to optimize write throughput.
  4. Batch Ingest via Bulk API: Use POST /_bulk with {"index": {"_index": "document-archive", "_id": "inv_001"}} followed by the document JSON. Keep payloads under 5MB per request to sustain up to 100 QPS.
  5. Stage & Commit Changes: Ingest with ?refresh=false to buffer writes. Once the batch completes, trigger POST /document-archive/_refresh to atomically commit staged documents to the searchable index.
  6. Apply Relevance Tuning: Deploy neural reranking via POST /_search with "rank": {"type": "rrf", "window_size": 100} or load domain terms using PUT /_ingest/pipeline/synonym-pipeline to expand query matching.

Architecture

Raw files enter a client-side processor or message queue. Bailian acts as the extraction layer, running multimodal inference to convert PDFs/images into structured JSON. The normalized output is pushed to Elasticsearch via the Bulk API. Elasticsearch handles inverted index construction, storage, and query routing, while optional ingest pipelines and rerankers intercept queries to refine result ordering before returning to the client.

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I extract text from PDFs or scanned documents and index them in Elasticsearch to make them searchable? A: You can build a searchable index by using Bailian's vision-language extraction to convert documents into structured JSON and then ingesting the normalized payloads into Elasticsearch via the Bulk API. Configure the qvq-max model with enable_thinking set to true to preserve complex layouts, then batch ingest the cleaned text with refresh=false to maintain high throughput. Finally, trigger a manual refresh and apply relevance tuning like neural reranking or synonym pipelines to optimize search results.