DaaS / Products / Deploy Complete RAG System with AI Models

Deploy Complete RAG System with AI Models

Deploy AI models (embeddings and LLM) on Alibaba Cloud Linux for inference serving, then deploy a RAG application using Elasticsearch as the vector knowledge base that calls these models to build an end-to-end enterprise AI chatbot with retrieval-augmented generation.

Products involved

Scenario

Developers use this workflow when they need a self-hosted, enterprise-grade RAG chatbot that processes proprietary documents without sending data to third-party APIs. By deploying embedding and LLM models directly on Alibaba Cloud Linux and using Elasticsearch as a low-latency vector knowledge base, teams achieve full data sovereignty, customizable retrieval pipelines, and scalable inference.

Integration steps

Prepare Alibaba Cloud Linux: Launch an ECS instance with alinux3 image. Install NVIDIA drivers and Docker: sudo yum install -y nvidia-driver docker-ce.
Deploy Inference Server: Run vLLM containers for your LLM and embedding models:

   docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
     --model Qwen/Qwen-7B-Chat --tensor-parallel-size 1 --api-key sk-xxx

Configure Elasticsearch Index: Create a vector index with dense_vector mapping for 1024-dim embeddings:

   PUT /rag-knowledge
   { "mappings": { "properties": { "text": { "type": "text" }, "embedding": { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" } } }

Ingest & Embed Documents: Chunk documents, call http://localhost:8000/v1/embeddings, and bulk-index into ES:

   es = elasticsearch.Elasticsearch("https://<es-endpoint>:9200", basic_auth=("elastic", "<pwd>"))
   es.bulk(index="rag-knowledge", operations=[{"index": {"_id": i}}, {"text": chunk, "embedding": vec}] for i, (chunk, vec) in enumerate(batch))

Build Retrieval Pipeline: Query ES with knn search using the query embedding:

   POST /rag-knowledge/_search
   { "knn": { "field": "embedding", "query_vector": [0.12, ...], "k": 5, "num_candidates": 100 } }

Synthesize Response: Pass top-k chunks to http://localhost:8000/v1/chat/completions with temperature: 0.1 and max_tokens: 2048.

Architecture

Alibaba Cloud Linux hosts the inference endpoints (LLM + embeddings) via GPU-accelerated containers. The RAG application acts as the orchestrator: it routes user queries to the local embedding API, performs vector similarity search against Elasticsearch’s knn engine, and feeds the top-k retrieved passages back to the local LLM for answer generation. Elasticsearch exclusively manages document storage, metadata filtering, and sub-millisecond vector retrieval.

Prerequisites

Alibaba Cloud ECS instance running Alibaba Cloud Linux 3.2104 LTS with ≥1 NVIDIA A10 GPU (or CPU fallback with quantized models)
Elasticsearch 8.x cluster with security enabled and knn plugin active
Valid model weights (Hugging Face or ModelScope) and ≥50GB storage
Python 3.10+ with elasticsearch, requests, and tiktoken installed

Common pitfalls

Dimension mismatch: Embedding output dims must exactly match the ES dense_vector mapping; otherwise, knn queries fail with mapper_parsing_exception.
GPU memory fragmentation: Running both embedding and LLM containers on the same GPU without CUDA_VISIBLE_DEVICES isolation causes OOM crashes.
ES security handshake: Self-signed certificates block Python ES clients; explicitly set verify_certs=False or mount CA bundles.
Context window overflow: Retrieving too many chunks exceeds the LLM’s context limit; implement strict max_tokens and chunk-size caps (≤512 tokens).

Typical questions

deploy rag system with models
build ai chatbot with knowledge base
部署rag系统
搭建知识库问答系统
deploy embedding model and rag
ai model serving with vector search
complete rag pipeline deployment
企业级rag部署

FAQ

Q: How do I deploy a complete RAG system with AI models and a knowledge base? A: You can deploy an enterprise-grade RAG chatbot by hosting vLLM containers for LLM and embedding models on Alibaba Cloud Linux and configuring Elasticsearch as a vector knowledge base. This workflow requires creating a dense vector index, bulk-ingesting document embeddings, and using k-nearest neighbor searches to retrieve top-k passages for the LLM to synthesize responses.