A developer extracts text and structured data from unstructured source documents (PDFs, scanned images) using Bailian's document extraction, ingests the processed content into Elasticsearch as a searchable index, and then builds a RAG knowledge base and retrieval pipeline on top to power an AI-driven document Q&A application.
Use this pipeline when you need to transform unstructured enterprise documents (PDFs, scanned invoices, technical manuals) into a searchable, AI-ready knowledge base. It bridges high-accuracy visual/text extraction with scalable vector search, enabling low-latency, context-grounded Q&A applications without manual data preprocessing.
Extract content with qvq-max for complex layout parsing:
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/document-process/extract \
-H "Authorization: Bearer $BAILIAN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "qvq-max", "input": {"url": "https://docs.example.com/manual.pdf"}, "parameters": {"enable_thinking": true, "output_format": "json"}}'
Parse text_blocks and tables from the JSON response. Split text into 512-token chunks with 10% overlap using tiktoken.
Generate embeddings with text-embedding-v3:
import os
import dashscope

dashscope.api_key = os.getenv("BAILIAN_API_KEY")
# chunks: the list of text chunks produced by the splitting step
resp = dashscope.TextEmbedding.call(model="text-embedding-v3", input=chunks)
vectors = [item["embedding"] for item in resp["output"]["embeddings"]]
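The chunking step (512-token windows with 10% overlap) can be sketched as a sliding window over a token sequence. This is a minimal illustration, not the pipeline's actual implementation: the function name `chunk_tokens` is hypothetical, and a whitespace split stands in for tiktoken token ids so the example runs without extra dependencies.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 512,
                 overlap_ratio: float = 0.10) -> List[List[T]]:
    """Split a token sequence into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 512 * 0.10 -> 51 tokens
    step = chunk_size - overlap                # distance between window starts
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # this window reached the end
            break
    return chunks


# Stand-in tokenizer (assumption): whitespace split instead of tiktoken ids.
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

With tiktoken, the same function could be applied to the ids returned by an encoding's `encode` method, with each window passed through `decode` before embedding. Whether a given tiktoken encoding matches the token counts seen by text-embedding-v3 is not established here, so treat the 512 limit as approximate in that setup.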
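Index the chunks and vectors with _bulk for high-throughput writes (up to 100 QPS). Send the file with --data-binary, because -d strips the newlines that the bulk format requires:
curl -X POST "https://$ES_HOST:9200/doc-rag-index/_bulk?refresh=wait_for" \
-H "Content-Type: application/x-ndjson" \
--data-binary @bulk_payload.json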
Map embedding as dense_vector with dims: 1024 and index: true.
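The mapping and bulk payload described above might be assembled as follows. This is a sketch under stated assumptions: the `content` field name, the `chunk-<n>` document ids, and the `cosine` similarity setting are illustrative choices that the text does not specify. Only `embedding`, `dims: 1024`, and `index: true` come from the steps above.

```python
import json
from typing import Dict, List

EMBEDDING_DIMS = 1024  # must equal the embedding model's output size

# Index mapping: a text field for BM25 plus an indexed dense_vector for KNN.
INDEX_MAPPING: Dict = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},   # hypothetical field name
            "embedding": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",    # assumption; not stated in the text
            },
        }
    }
}


def build_bulk_payload(index: str, chunks: List[str],
                       vectors: List[List[float]]) -> str:
    """Return an NDJSON _bulk body: one action line and one document line per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    lines = []
    for i, (text, vec) in enumerate(zip(chunks, vectors)):
        # Catch dimension mismatches before Elasticsearch rejects the document.
        if len(vec) != EMBEDDING_DIMS:
            raise ValueError(
                f"vector {i} has {len(vec)} dims, expected {EMBEDDING_DIMS}")
        lines.append(json.dumps({"index": {"_index": index, "_id": f"chunk-{i}"}}))
        lines.append(json.dumps({"content": text, "embedding": vec}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


payload = build_bulk_payload(
    "doc-rag-index",
    ["hello", "world"],
    [[0.0] * EMBEDDING_DIMS, [1.0] * EMBEDDING_DIMS],
)
```

Writing `payload` to bulk_payload.json produces a file suitable for the bulk request, and `INDEX_MAPPING` can be serialized as the body of the index-creation request. Validating vector length up front surfaces a model or mapping mismatch at build time rather than as a rejected document.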
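Register the index as a Bailian knowledge base. The quoting around $ES_HOST lets the shell expand the variable inside the otherwise single-quoted JSON body:
curl -X POST https://dashscope.aliyuncs.com/api/v1/knowledge-bases \
-H "Authorization: Bearer $BAILIAN_API_KEY" \
-H "Content-Type: application/json" \
-d '{"name": "doc-rag-kb", "data_source": {"type": "elasticsearch", "endpoint": "'"$ES_HOST"'", "index": "doc-rag-index", "vector_field": "embedding"}}'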
Set top_k: 5, enable hybrid search (BM25 + KNN), and attach gte-rerank to filter context before LLM generation.
Data flows in one direction: Bailian’s qvq-max extracts and structures raw documents → chunks are embedded via text-embedding-v3 → payloads are batched into Elasticsearch for scalable hybrid search → Bailian’s RAG engine queries ES, reranks results, and injects context into the generation prompt. ES handles storage and retrieval; Bailian manages AI reasoning, embedding, and orchestration.
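To make the hybrid retrieval step concrete, the sketch below builds an Elasticsearch search body that combines a BM25 `match` query with an approximate `knn` clause and returns the top 5 hits. In the pipeline described here Bailian issues the retrieval itself, so this only illustrates what an equivalent direct query could look like. The helper name `build_hybrid_query`, the `content` field, and the `num_candidates` value are assumptions.

```python
from typing import Dict, List


def build_hybrid_query(question: str, query_vector: List[float],
                       top_k: int = 5, num_candidates: int = 50) -> Dict:
    """Build a search body combining a BM25 match with approximate KNN."""
    if num_candidates < top_k:
        raise ValueError("num_candidates must be >= top_k")
    return {
        "size": top_k,
        "query": {"match": {"content": question}},  # BM25 lexical side
        "knn": {                                     # vector side
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "_source": ["content"],  # return only the chunk text
    }


# Placeholder vector; in practice this is the embedding of the question.
body = build_hybrid_query("How do I reset the device?", [0.1] * 1024)
```

The resulting dictionary can be sent as the JSON body of a `_search` request against doc-rag-index. The returned chunk texts would then be passed to a reranker before being placed in the generation prompt.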
Prerequisites:
Bailian API key (BAILIAN_API_KEY)
Elasticsearch cluster with dense_vector and knn support
Python with the dashscope and elasticsearch SDKs installed
Common pitfalls:
Oversized chunks cause text-embedding-v3 to truncate; enforce strict 512-token limits with overlap.
Omitting ?refresh=wait_for on bulk writes delays the point at which new documents become retrievable; always use an explicit refresh or schedule a background sync.
dense_vector dims must exactly match the model output (1024), or KNN queries will throw mapping errors.
Tune the hybrid search alpha (0.5–0.7) in Bailian’s retrieval config.
Q: How do I build a RAG pipeline from PDF and scanned documents?
A: The Document AI RAG Pipeline transforms unstructured enterprise documents like PDFs and scanned invoices into a searchable, AI-ready knowledge base by combining high-accuracy visual extraction with scalable vector search. You can implement this workflow by extracting content via Bailian’s qvq-max API, chunking the output, generating embeddings with text-embedding-v3, and indexing the vectors in Elasticsearch before registering the index as a Bailian knowledge base.