DaaS / Products / Deploy Complete RAG System with AI Models

Deploy Complete RAG System with AI Models

Deploy AI models (embeddings and LLM) on Alibaba Cloud Linux for inference serving, then deploy a RAG application using Elasticsearch as the vector knowledge base that calls these models to build an end-to-end enterprise AI chatbot with retrieval-augmented generation.

Products involved

Scenario

Developers use this workflow when they need a self-hosted, enterprise-grade RAG chatbot that processes proprietary documents without sending data to third-party APIs. By deploying embedding and LLM models directly on Alibaba Cloud Linux and using Elasticsearch as a low-latency vector knowledge base, teams achieve full data sovereignty, customizable retrieval pipelines, and scalable inference.

Integration steps

  1. Prepare Alibaba Cloud Linux: Launch an ECS instance with alinux3 image. Install NVIDIA drivers and Docker: sudo yum install -y nvidia-driver docker-ce.
  2. Deploy Inference Server: Run vLLM containers for your LLM and embedding models:
   docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
     --model Qwen/Qwen-7B-Chat --tensor-parallel-size 1 --api-key sk-xxx
  1. Configure Elasticsearch Index: Create a vector index with dense_vector mapping for 1024-dim embeddings:
   PUT /rag-knowledge
   { "mappings": { "properties": { "text": { "type": "text" }, "embedding": { "type": "dense_vector", "dims": 1024, "index": true, "similarity": "cosine" } } }
  1. Ingest & Embed Documents: Chunk documents, call http://localhost:8000/v1/embeddings, and bulk-index into ES:
   es = elasticsearch.Elasticsearch("https://<es-endpoint>:9200", basic_auth=("elastic", "<pwd>"))
   es.bulk(index="rag-knowledge", operations=[{"index": {"_id": i}}, {"text": chunk, "embedding": vec}] for i, (chunk, vec) in enumerate(batch))
  1. Build Retrieval Pipeline: Query ES with knn search using the query embedding:
   POST /rag-knowledge/_search
   { "knn": { "field": "embedding", "query_vector": [0.12, ...], "k": 5, "num_candidates": 100 } }
  1. Synthesize Response: Pass top-k chunks to http://localhost:8000/v1/chat/completions with temperature: 0.1 and max_tokens: 2048.

Architecture

Alibaba Cloud Linux hosts the inference endpoints (LLM + embeddings) via GPU-accelerated containers. The RAG application acts as the orchestrator: it routes user queries to the local embedding API, performs vector similarity search against Elasticsearch’s knn engine, and feeds the top-k retrieved passages back to the local LLM for answer generation. Elasticsearch exclusively manages document storage, metadata filtering, and sub-millisecond vector retrieval.

Prerequisites

Common pitfalls

Typical questions

FAQ

Q: How do I deploy a complete RAG system with AI models and a knowledge base? A: You can deploy an enterprise-grade RAG chatbot by hosting vLLM containers for LLM and embedding models on Alibaba Cloud Linux and configuring Elasticsearch as a vector knowledge base. This workflow requires creating a dense vector index, bulk-ingesting document embeddings, and using k-nearest neighbor searches to retrieve top-k passages for the LLM to synthesize responses.