Document AI RAG Pipeline

A developer extracts text and structured data from unstructured source documents (PDFs, scanned images) using Bailian's document extraction, ingests the processed content into Elasticsearch as a searchable index, and then builds a RAG knowledge base and retrieval pipeline on top to power an AI-driven document Q&A application.

Products involved

Scenario

Use this pipeline when you need to transform unstructured enterprise documents (PDFs, scanned invoices, technical manuals) into a searchable, AI-ready knowledge base. It bridges high-accuracy visual/text extraction with scalable vector search, enabling low-latency, context-grounded Q&A applications without manual data preprocessing.

Integration steps

  1. Extract document content: Call Bailian’s extraction API with qvq-max for complex layout parsing:
   curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-process/extract \
     -H "Authorization: Bearer $BAILIAN_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model": "qvq-max", "input": {"url": "https://docs.example.com/manual.pdf"}, "parameters": {"enable_thinking": true, "output_format": "json"}}'
  2. Parse and chunk output: Extract text_blocks and tables from the JSON response. Split text into 512-token chunks with 10% overlap using tiktoken.
  3. Generate embeddings: Vectorize each chunk via Bailian’s text-embedding-v3:
   import os
   import dashscope

   # Read the API key from the environment instead of hard-coding it.
   dashscope.api_key = os.getenv("BAILIAN_API_KEY")
   # chunks is the list of text chunks from the previous step.
   resp = dashscope.TextEmbedding.call(model="text-embedding-v3", input=chunks)
   vectors = [item["embedding"] for item in resp["output"]["embeddings"]]
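  4. Index in Elasticsearch: Push chunks and vectors with the _bulk API for high-throughput writes (up to 100 QPS). Send the file with --data-binary, because -d strips the newlines that the bulk format depends on:
   curl -X POST "https://$ES_HOST:9200/doc-rag-index/_bulk?refresh=wait_for" \
     -H "Content-Type: application/x-ndjson" \
     --data-binary @bulk_payload.json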

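In the index mapping, define the embedding field as dense_vector with dims: 1024 (matching the vector size produced by text-embedding-v3) and index: true so that the field can be used for KNN search.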
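
The chunking and bulk-indexing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names chunk_tokens and build_bulk_payload, the text field name, and the cosine similarity setting are hypothetical choices that the source does not specify. The chunker works on any token sequence, so it does not depend on a particular tokenizer.

```python
import json


def chunk_tokens(tokens, size=512, overlap_ratio=0.10):
    """Split a token sequence into windows of `size` that overlap by a fraction."""
    overlap = int(size * overlap_ratio)  # 51 tokens for size=512 at 10%
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap_ratio must leave a positive step")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks


# Assumed mapping: the "text" field and cosine similarity are illustrative.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}


def build_bulk_payload(index, texts, vectors):
    """Build an NDJSON _bulk body: one action line, then one document line."""
    lines = []
    for i, (text, vector) in enumerate(zip(texts, vectors)):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        lines.append(json.dumps({"text": text, "embedding": vector}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

With tiktoken, the token list could come from an encoding's encode() method and each chunk could be turned back into text with decode(); which encoding to use is left open here because the source does not name one. The string returned by build_bulk_payload can be written to bulk_payload.json as-is.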

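  5. Register ES as a RAG data source: Attach the index to Bailian’s knowledge base. The endpoint value is spliced in with double quotes so that the shell expands $ES_HOST, which it would not do inside single quotes:
   curl -X POST https://dashscope.aliyuncs.com/api/v1/knowledge-bases \
     -H "Authorization: Bearer $BAILIAN_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"name": "doc-rag-kb", "data_source": {"type": "elasticsearch", "endpoint": "'"$ES_HOST"'", "index": "doc-rag-index", "vector_field": "embedding"}}'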
  6. Configure retrieval and reranking: Set top_k: 5, enable hybrid search (BM25 + KNN), and attach gte-rerank to filter context before LLM generation.
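
To make the hybrid search setting concrete, the sketch below builds an Elasticsearch 8.x search body that combines a BM25 match clause with an approximate KNN clause. It only illustrates what such a query might look like if issued directly against the index. The source does not describe how Bailian forms its queries internally, and the function name build_hybrid_query, the text field, and the num_candidates default are assumptions.

```python
def build_hybrid_query(question, query_vector, top_k=5, num_candidates=50):
    """Combine a BM25 match clause and a KNN clause in one search body."""
    if num_candidates < top_k:
        raise ValueError("num_candidates must be >= top_k")
    return {
        "size": top_k,
        # Lexical (BM25) scoring on the chunk text; field name is assumed.
        "query": {"match": {"text": question}},
        # Approximate nearest-neighbour search on the embedding field.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "_source": ["text"],
    }
```

The returned dictionary can be serialized to JSON and posted to the index's _search endpoint. The hits would then be passed to the reranker before being used as context.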

Architecture

Data flows unidirectionally: Bailian’s qvq-max extracts and structures raw documents → chunks are embedded via text-embedding-v3 → payloads are batched into Elasticsearch for scalable hybrid search → Bailian’s RAG engine queries ES, reranks results, and injects context into the generation prompt. ES handles storage and retrieval; Bailian manages AI reasoning, embedding, and orchestration.
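
The last stage of this flow, injecting retrieved context into the generation prompt, is handled by Bailian's RAG engine. For intuition only, the sketch below shows one way reranked passages might be assembled into a grounded prompt. The function name build_grounded_prompt, the instruction wording, and the max_chars budget are hypothetical and do not reflect Bailian's actual prompt format.

```python
def build_grounded_prompt(question, passages, max_chars=4000):
    """Number the retrieved passages and place them ahead of the question."""
    context_lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(entry)
        used += len(entry)
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Numbering the passages lets the generated answer point back to the chunks it drew on, and the character budget keeps low-ranked passages from crowding out the question.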

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I build a RAG pipeline from PDF and scanned documents?
A: The Document AI RAG Pipeline transforms unstructured enterprise documents such as PDFs and scanned invoices into a searchable, AI-ready knowledge base by combining high-accuracy visual extraction with scalable vector search. To implement it, extract content with qvq-max through Bailian’s extraction API, chunk the output, generate embeddings with text-embedding-v3, index the chunks and vectors in Elasticsearch, and then register that index as a Bailian knowledge base.