vLLM Production Deployment Guide: A Complete Blueprint for Self-Hosted Enterprise Inference
Throughput up 8x, costs down 60%: the honest record of one vLLM migration
In November 2025, Lin Zihang, head of the algorithm team at an AI company, hit a problem that kept him up at night: his team had built an in-house document intelligence system, deployed on four A100s and served through Hugging Face's stock transformers inference interface.
Six months after launch, the customer base had grown from zero to 3,200 enterprises, and traffic had climbed from a few hundred requests per day to a peak of 800 requests per minute.
Then the trouble started. Four A100s with 80GB of memory each should have been more than enough on paper, but the actual numbers were dismal:
- GPU utilization: 35% at peak, just 18% on average
- Average response time: crept from 1.2s at launch to 4.8s
- Request queue: more than 200 requests backed up at peak hours
- Monthly server cost: ¥28,000 in cloud rental for the four A100s
Lin took the problem to an AI engineering group chat, where a veteran's entire answer was: "switch to vLLM."
Unconvinced, he pulled two engineers into a 48-hour head-to-head test. The results left him speechless:
| Metric | transformers inference | vLLM |
|---|---|---|
| Throughput (tokens/s) | 1,200 | 9,800 |
| P50 latency | 1.8s | 0.6s |
| P99 latency | 4.8s | 1.4s |
| GPU utilization | 18% | 76% |
| Concurrent requests handled | 8 | 64 |
Same hardware, more than eight times the throughput. After migrating, the team scaled down from four A100s to two, and the monthly server bill dropped from ¥28,000 to ¥11,000, a cut of over 60%.
This article is the complete technical write-up of what Lin's team learned from that migration.
vLLM core principles: why PagedAttention is so fast
To understand vLLM's performance advantage, you first need to understand the problem it solves.
Memory waste in traditional inference
The core working set of LLM inference is the KV Cache. For every token processed, the model caches the K and V matrices of the current and all preceding tokens for use in later attention steps.
The traditional approach pre-allocates a contiguous slab of GPU memory for each request.
For example, with the context length set to 4096, each request pre-allocates roughly 500MB of GPU memory. But the average conversation runs only about 800 tokens and touches roughly 100MB; the other 400MB sits allocated yet empty.
With 20 concurrent requests, effective utilization can fall to around 20%, with 80% of GPU memory wasted outright. This is the root cause of that 18% GPU utilization figure.
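To make that concrete, here is a back-of-envelope sketch of the arithmetic. The model dimensions (32 layers, 8 KV heads via GQA, head_dim 128, FP16) are illustrative assumptions for a 7B-class model, not values taken from any particular config:
# kv_waste_estimate.py -- back-of-envelope sketch; the dimensions below are assumptions
# for a 7B-class GQA model (32 layers, 8 KV heads, head_dim 128, FP16), not vLLM code.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

def kv_cache_bytes(num_tokens: int) -> int:
    # K and V each store layers x kv_heads x head_dim values per token
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * num_tokens

preallocated = kv_cache_bytes(4096)  # static slab for the full context window
used = kv_cache_bytes(800)           # the average conversation length
print(f"preallocated: {preallocated / 2**20:.0f} MiB")  # ~512 MiB
print(f"actually used: {used / 2**20:.0f} MiB")         # ~100 MiB
print(f"utilization: {used / preallocated:.0%}")        # ~20%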
PagedAttention: inspired by OS virtual memory
vLLM's PagedAttention borrows the paging idea from operating-system memory management:
- Physical GPU memory is carved into fixed-size pages (blocks), each holding the KV Cache of a fixed number of tokens
- On-demand allocation: a request gets exactly as many pages as it has generated tokens for, with nothing reserved up front
- Logical-to-physical mapping: each request sees a virtually contiguous address space while the physical blocks can be scattered
- Cross-request sharing: identical prefixes across requests (such as a shared System Prompt) can point to the same physical block instead of being stored twice
Traditional approach (pre-allocated contiguous memory):
Request A: [████████████████████░░░░░░░░░░] 50% actually used
Request B: [█████████████░░░░░░░░░░░░░░░░░] 40% actually used
Request C: [████████░░░░░░░░░░░░░░░░░░░░░░] 25% actually used
→ large amounts of GPU memory sit idle
PagedAttention (on-demand allocation):
Request A: [Block1][Block2][Block3]
Request B: [Block4][Block5]
Request C: [Block6]
[Block7][Block8]... (free, waiting for new requests)
→ significantly higher GPU utilization
This design lets vLLM host far more concurrent requests in the same GPU memory. On top of it, Continuous Batching raises throughput further: instead of waiting for a full batch to assemble, new requests are slotted into the running batch the moment they arrive.
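To make the bookkeeping tangible, here is a toy sketch of the idea (illustrative only; vLLM's real block manager is far more involved). The pool size is an arbitrary example value, though 16 tokens per block matches vLLM's default:
# paged_kv_toy.py -- toy model of PagedAttention-style block allocation, not vLLM internals.
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size is also 16)

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))     # physical block ids
        self.tables: dict[str, list[int]] = {}  # request id -> block table (logical -> physical)

    def grow(self, request_id: str, num_tokens: int) -> None:
        table = self.tables.setdefault(request_id, [])
        # Allocate a new physical block only when the request actually outgrows its last one
        while len(table) * BLOCK_SIZE < num_tokens:
            table.append(self.free.pop(0))

    def release(self, request_id: str) -> None:
        # Finished requests return their blocks to the pool immediately
        self.free.extend(self.tables.pop(request_id, []))

pool = BlockPool(num_blocks=1024)
pool.grow("A", 50)   # 50 tokens -> 4 blocks, instead of a 4096-token slab
pool.grow("B", 20)   # 20 tokens -> 2 blocks
print(pool.tables)   # {'A': [0, 1, 2, 3], 'B': [4, 5]}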
Installation and environment: version compatibility matters
vLLM has strict requirements on CUDA, PyTorch, and Python versions; a version mismatch is the number-one cause of failed installs.
Version compatibility table (latest stable releases at the time of writing)
| vLLM version | Python | CUDA | PyTorch | Recommended GPUs |
|---|---|---|---|---|
| 0.6.x | 3.10-3.12 | 12.1 | 2.3.x | A100/H100/RTX4090 |
| 0.5.x | 3.9-3.11 | 11.8 | 2.1.x | A100/V100 |
| 0.4.x | 3.9-3.11 | 11.8 | 2.0.x | V100/A10 |
Installation steps
# 1. Check the CUDA version
nvidia-smi | grep "CUDA Version"
# CUDA Version: 12.1
# 2. Create a Python virtual environment (recommended)
conda create -n vllm python=3.11 -y
conda activate vllm
# 3. Install PyTorch (must match your CUDA version)
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121
# 4. Install vLLM
pip install vllm==0.6.2
# 5. Verify the installation
python -c "import vllm; print(vllm.__version__)"
# 0.6.2
# 6. Quick smoke test
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m')  # tiny model, for testing only
outputs = llm.generate(['Hello'], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
"

Common installation problems
# Problem 1: CUDA version mismatch
# Error: RuntimeError: CUDA error: no kernel image is available for execution on the device
# Fix: reinstall the PyTorch build matching your CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu$(nvcc --version | grep -oP '\d+\.\d+' | head -1 | tr -d '.')
# Problem 2: out of GPU memory
# Error: torch.cuda.OutOfMemoryError
# Fix: use a quantized model or lower max-model-len
vllm serve model_name --max-model-len 4096 --gpu-memory-utilization 0.85
# Problem 3: flash-attn missing
# Error: ImportError: No module named 'flash_attn'
pip install flash-attn --no-build-isolation

Single-GPU and multi-GPU deployment
Single-GPU startup (command line)
# Basic startup (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/deepseek-r1-7b \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name deepseek-r1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --dtype bfloat16
# Verify the server is up
curl http://localhost:8000/v1/models
# {"object":"list","data":[{"id":"deepseek-r1","object":"model",...}]}
# Test call
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello, please introduce vLLM"}],
        "max_tokens": 200
    }'

Multi-GPU deployment (tensor parallelism)
# Two GPUs (recommended for 14B+ models).
# --tensor-parallel-size 2 shards the model across two GPUs;
# --enable-prefix-caching reuses the System Prompt KV Cache across requests.
# (Comments cannot follow a trailing backslash, so they live above the command.)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/deepseek-r1-14b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --enable-prefix-caching
# Four GPUs (for 70B-class models).
# --pipeline-parallel-size stays at 1 here; raise it only for models too large for tensor parallelism alone.
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/qwen2.5-72b \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90

Key parameter reference
| Parameter | Description | Recommended value |
|---|---|---|
| `--max-model-len` | Maximum context length (drives KV Cache memory) | size to your GPU memory; 8192 is a good start |
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM may claim | 0.85-0.90 |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | equal to the GPU count |
| `--max-num-seqs` | Maximum concurrent sequences | 64-256 |
| `--max-num-batched-tokens` | Maximum tokens per batch | 4096-32768 |
| `--enable-prefix-caching` | Prefix KV Cache reuse | strongly recommended |
| `--dtype` | Computation precision | bfloat16 (A100) / float16 |
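The same knobs exist on vLLM's Python API for offline or embedded use; a minimal sketch follows (the model path and values are placeholders to adapt to your hardware):
# offline_engine.py -- the serve flags above map onto vLLM's Python engine arguments;
# the model path and values here are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/deepseek-r1-7b",
    max_model_len=8192,            # --max-model-len
    gpu_memory_utilization=0.90,   # --gpu-memory-utilization
    tensor_parallel_size=1,        # --tensor-parallel-size
    max_num_seqs=64,               # --max-num-seqs
    enable_prefix_caching=True,    # --enable-prefix-caching
    dtype="bfloat16",              # --dtype
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)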
Model loading: Hugging Face and local weights
Loading from Hugging Face
# Configure a HuggingFace mirror (faster downloads inside China)
export HF_ENDPOINT=https://hf-mirror.com
# Download the model locally
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --local-dir /data/models/deepseek-r1-7b \
    --local-dir-use-symlinks False
# Or start directly from a HuggingFace model ID (downloads automatically)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --download-dir /data/models

Loading from local weights
# Expected local weight directory layout
/data/models/deepseek-r1-7b/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
└── model-00004-of-00004.safetensors
# Load directly from the local path
python -m vllm.entrypoints.openai.api_server \
--model /data/models/deepseek-r1-7b \
    --tokenizer /data/models/deepseek-r1-7b  # tokenizer path (usually the same as the model)

Connecting Spring AI to vLLM's OpenAI-compatible API
vLLM's API is fully OpenAI-compatible, so Spring AI can connect through the stock OpenAI starter with no custom integration code.
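Before wiring up Spring, the compatibility is easy to sanity-check from Python with the official openai client; the base_url and model name below assume the serve command shown earlier:
# compat_check.py -- quick sanity check of vLLM's OpenAI-compatible endpoint.
# base_url and model assume the serve command from the previous section.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Briefly introduce vLLM."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)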
Project structure
vllm-spring-ai/
├── pom.xml
├── src/main/
│ ├── java/com/laozhang/vllm/
│ │ ├── VllmApplication.java
│ │ ├── config/
│ │ │ ├── VllmAIConfig.java
│ │ │ └── LoadBalancerConfig.java
│ │ ├── controller/
│ │ │ └── InferenceController.java
│ │ ├── service/
│ │ │ ├── VllmChatService.java
│ │ │ └── VllmHealthService.java
│ │ └── monitor/
│ │ └── VllmMetricsCollector.java
│ └── resources/
│       └── application.yml

pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.3.4</version>
<relativePath/>
</parent>
<groupId>com.laozhang</groupId>
<artifactId>vllm-spring-ai</artifactId>
<version>1.0.0</version>
<description>vLLM + Spring AI enterprise inference service integration</description>
<properties>
<java.version>17</java.version>
<spring-ai.version>1.0.0</spring-ai.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>${spring-ai.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- WebFlux: streaming support, plus the WebClient used to scrape vLLM's Prometheus metrics -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<!-- Use the OpenAI starter to reach vLLM's OpenAI-compatible API -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<!-- Resilience4j (circuit breaker + rate limiting) -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-reactor</artifactId>
<version>2.2.0</version>
</dependency>
<!-- Prometheus monitoring -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>

application.yml
server:
  port: 8080

spring:
  application:
    name: vllm-spring-ai
  # Reach vLLM through the OpenAI starter
  ai:
    openai:
      # vLLM's OpenAI-compatible endpoint
      base-url: http://vllm-server:8000/v1
      # vLLM requires no API key by default; any placeholder value works
      api-key: "not-needed"
      chat:
        options:
          model: deepseek-r1  # must match vLLM's --served-model-name
          temperature: 0.7
          max-tokens: 2048
          # vLLM-specific parameters can be passed via additional model request fields

# Multiple vLLM instances (load balancing)
vllm:
  instances:
    - url: http://vllm-server-1:8000
      weight: 50
      model: deepseek-r1
    - url: http://vllm-server-2:8000
      weight: 50
      model: deepseek-r1

# Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      vllm:
        sliding-window-size: 20
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
  bulkhead:
    instances:
      vllm:
        max-concurrent-calls: 32
        max-wait-duration: 10s
  timelimiter:
    instances:
      vllm:
        timeout-duration: 60s

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics,info
  metrics:
    tags:
      application: ${spring.application.name}

logging:
  level:
    com.laozhang: DEBUG

VllmAIConfig.java
package com.laozhang.vllm.config;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
 * vLLM wiring.
 * vLLM speaks the OpenAI protocol, so we reuse Spring AI's OpenAI starter directly.
 */
@Configuration
public class VllmAIConfig {
@Value("${spring.ai.openai.base-url}")
private String vllmBaseUrl;
@Value("${spring.ai.openai.chat.options.model}")
private String modelName;
    /**
     * Build a ChatModel pointed at vLLM (reusing the OpenAI client)
     */
@Bean
public OpenAiChatModel vllmChatModel() {
        // Create an OpenAI API instance whose base-url points at vLLM
OpenAiApi openAiApi = new OpenAiApi(vllmBaseUrl, "not-needed");
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel(modelName)
.withTemperature(0.7f)
.withMaxTokens(2048)
.withTopP(0.9f)
.build();
return new OpenAiChatModel(openAiApi, options);
}
@Bean
public ChatClient vllmChatClient(OpenAiChatModel vllmChatModel) {
return ChatClient.builder(vllmChatModel)
            .defaultSystem("You are a professional AI assistant. Give accurate, useful answers.")
.build();
}
}

VllmChatService.java (production-grade service implementation)
package com.laozhang.vllm.service;
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
/**
 * vLLM inference service.
 * Adds circuit breaking, timeouts, and bulkhead isolation for production-grade reliability.
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class VllmChatService {
private final ChatClient vllmChatClient;
    /**
     * Chat call with the full set of reliability guards
     */
@CircuitBreaker(name = "vllm", fallbackMethod = "chatFallback")
@Bulkhead(name = "vllm", type = Bulkhead.Type.SEMAPHORE)
@TimeLimiter(name = "vllm")
public CompletableFuture<String> chat(String message) {
return CompletableFuture.supplyAsync(() -> {
long start = System.currentTimeMillis();
try {
String response = vllmChatClient.prompt()
.user(message)
.call()
.content();
log.info("vLLM inference success, latency={}ms", System.currentTimeMillis() - start);
return response;
} catch (Exception e) {
log.error("vLLM inference failed: {}", e.getMessage(), e);
throw e;
}
});
}
    /**
     * Streaming inference
     */
@CircuitBreaker(name = "vllm", fallbackMethod = "streamFallback")
public Flux<String> chatStream(String message) {
return Flux.defer(() -> vllmChatClient.prompt()
.user(message)
.stream()
.content())
.timeout(Duration.ofSeconds(60))
.doOnError(e -> log.error("Stream error: {}", e.getMessage()));
}
    /**
     * Batch inference (leaning on vLLM's Continuous Batching)
     */
    public Flux<String> chatBatch(java.util.List<String> messages) {
        return Flux.fromIterable(messages)
            .flatMap(msg ->
                Mono.fromCompletableFuture(chat(msg))
                    .onErrorReturn("Inference failed, please retry"),
                8  // cap in-flight calls at 8
            );
    }
    /**
     * Circuit-breaker fallback
     */
    public CompletableFuture<String> chatFallback(String message, Exception e) {
        log.warn("vLLM circuit breaker triggered, using fallback. Error: {}", e.getMessage());
        return CompletableFuture.completedFuture(
            "The AI service is busy right now, please try again later. (Detail: " + e.getClass().getSimpleName() + ")"
        );
    }
    public Flux<String> streamFallback(String message, Exception e) {
        log.warn("vLLM stream circuit breaker triggered: {}", e.getMessage());
        return Flux.just("The AI service is busy right now, please try again later.");
    }
}

Quantized deployment: INT8/INT4 to save GPU memory
Quantization is the key to running large models within a limited memory budget: it trades a little precision for a lot of GPU memory.
Memory requirements compared
| Model | FP32 | BF16/FP16 | INT8 (W8A8) | INT4 (GPTQ) |
|---|---|---|---|---|
| 7B | 28GB | 14GB | 7GB | 4GB |
| 14B | 56GB | 28GB | 14GB | 8GB |
| 70B | 280GB | 140GB | 70GB | 40GB |
AWQ quantization (recommended; smallest accuracy loss)
# Install the quantization toolkit
pip install autoawq
# Run the quantization script
python quantize_awq.py

# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Path to the original model
model_path = "/data/models/deepseek-r1-7b"
# Where to save the quantized model
quant_path = "/data/models/deepseek-r1-7b-awq-int4"
# Quantization config (INT4, group size 128)
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
print("加载模型...")
model = AutoAWQForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print("开始量化(约需30-60分钟)...")
model.quantize(tokenizer, quant_config=quant_config)
print("保存量化模型...")
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"量化完成,保存到: {quant_path}")# 使用量化模型启动vLLM
python -m vllm.entrypoints.openai.api_server \
--model /data/models/deepseek-r1-7b-awq-int4 \
--quantization awq \
--dtype half \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# Quantization results
# BF16 original: 14GB GPU memory, 42 tokens/s
# INT4 AWQ: 4.5GB GPU memory, 38 tokens/s (about 3% accuracy loss, comparable speed)

Monitoring: vLLM's built-in Prometheus metrics
vLLM ships with a complete Prometheus metrics endpoint; no extra instrumentation code is needed.
Key vLLM Prometheus metrics
# Fetch vLLM metrics
curl http://vllm-server:8000/metrics
# Key metrics:
# vllm:num_requests_running            requests currently being processed
# vllm:num_requests_waiting            requests waiting in the queue
# vllm:gpu_cache_usage_perc            GPU KV Cache usage (scale out above 0.9)
# vllm:generation_tokens_total         total tokens generated
# vllm:prompt_tokens_total             total prompt tokens received
# vllm:time_to_first_token_seconds     time-to-first-token distribution
# vllm:time_per_output_token_seconds   time per output token
# vllm:request_success_total           successful requests
# vllm:request_params_n                the n sampling parameter per request (parallel generations)

VllmMetricsCollector.java (pulling vLLM metrics into Spring)
package com.laozhang.vllm.monitor;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
// The JDK has no AtomicDouble; AtomicReference<Double> works fine with Micrometer gauges
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Scrapes vLLM's Prometheus endpoint and republishes the values through the app's own Micrometer registry
 */
@Slf4j
@Component
public class VllmMetricsCollector {
private final WebClient webClient;
    private final AtomicReference<Double> requestsRunning = new AtomicReference<>(0.0);
    private final AtomicReference<Double> requestsWaiting = new AtomicReference<>(0.0);
    private final AtomicReference<Double> gpuCacheUsage = new AtomicReference<>(0.0);
@Value("${spring.ai.openai.base-url}")
private String vllmBaseUrl;
    public VllmMetricsCollector(MeterRegistry registry) {
        this.webClient = WebClient.builder().build();
        // Republish the scraped values as Micrometer gauges
        Gauge.builder("vllm.requests.running", requestsRunning, r -> r.get())
            .description("vLLM requests currently running")
            .register(registry);
        Gauge.builder("vllm.requests.waiting", requestsWaiting, r -> r.get())
            .description("vLLM requests waiting in the queue")
            .register(registry);
        Gauge.builder("vllm.gpu.cache.usage", gpuCacheUsage, r -> r.get())
            .description("vLLM GPU KV Cache usage ratio")
            .register(registry);
    }
    /**
     * Scrape vLLM metrics every 15 seconds
     */
@Scheduled(fixedDelay = 15000)
public void collectMetrics() {
String metricsUrl = vllmBaseUrl.replace("/v1", "") + "/metrics";
webClient.get()
.uri(metricsUrl)
.retrieve()
.bodyToMono(String.class)
.subscribe(
metrics -> parseAndUpdateMetrics(metrics),
error -> log.warn("采集vLLM指标失败: {}", error.getMessage())
);
}
private void parseAndUpdateMetrics(String metricsText) {
updateGauge(metricsText, "vllm:num_requests_running", requestsRunning);
updateGauge(metricsText, "vllm:num_requests_waiting", requestsWaiting);
updateGauge(metricsText, "vllm:gpu_cache_usage_perc", gpuCacheUsage);
}
    private void updateGauge(String text, String metricName, AtomicReference<Double> gauge) {
        Pattern pattern = Pattern.compile(metricName + "\\{[^}]*\\}\\s+([\\d.]+)");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            gauge.set(Double.parseDouble(matcher.group(1)));
        }
    }
}

InferenceController.java
package com.laozhang.vllm.controller;
import com.laozhang.vllm.service.VllmChatService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
@Slf4j
@RestController
@RequestMapping("/inference")
@RequiredArgsConstructor
public class InferenceController {
private final VllmChatService vllmChatService;
    /**
     * Single-shot inference
     */
@PostMapping
public CompletableFuture<ResponseEntity<Map<String, String>>> infer(
@RequestBody Map<String, String> request) {
String message = request.get("message");
return vllmChatService.chat(message)
.thenApply(result -> ResponseEntity.ok(Map.of("result", result)));
}
    /**
     * Streaming inference
     */
@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> inferStream(@RequestBody Map<String, String> request) {
return vllmChatService.chatStream(request.get("message"));
}
    /**
     * Batch inference
     */
@PostMapping("/batch")
public Flux<String> inferBatch(@RequestBody Map<String, List<String>> request) {
return vllmChatService.chatBatch(request.get("messages"));
}
}

High availability: load balancing across multiple instances
Nginx configuration (load balancing vLLM instances)
# /etc/nginx/conf.d/vllm.conf
upstream vllm_backend {
    least_conn;  # least-connections balancing (a good fit for long-lived inference requests)
    server vllm-server-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server vllm-server-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    keepalive 32;  # keep upstream connections alive
}
server {
    listen 80;
    server_name vllm-api.internal;
    # Inference API proxy
    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # required for upstream keepalive
        # Timeouts (inference can be slow)
        proxy_connect_timeout 10s;
        proxy_send_timeout 120s;
        proxy_read_timeout 120s;
        # Buffering must be off for streaming output
        proxy_buffering off;
        proxy_cache off;
    }
    # Metrics
    location /metrics {
        # Proxies instance metrics (pair with Prometheus federation to aggregate across instances)
        proxy_pass http://vllm_backend/metrics;
    }
}

LoadBalancerConfig.java (load balancing on the Spring side)
package com.laozhang.vllm.config;

import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Round-robin load balancing across vLLM instances on the Spring side.
 * For deployments that do not put Nginx in front of vLLM.
 */
@Slf4j
@Component
public class LoadBalancerConfig {
@Data
public static class VllmInstance {
private String url;
private int weight;
private String model;
private volatile boolean healthy = true;
}
private final List<ChatClient> clientPool;
private final AtomicInteger counter = new AtomicInteger(0);
public LoadBalancerConfig(VllmProperties properties) {
this.clientPool = properties.getInstances().stream()
.map(instance -> {
OpenAiApi api = new OpenAiApi(instance.getUrl() + "/v1", "not-needed");
OpenAiChatModel model = new OpenAiChatModel(api,
OpenAiChatOptions.builder()
.withModel(instance.getModel())
.build());
return ChatClient.builder(model).build();
})
.toList();
log.info("已初始化 {} 个vLLM实例", clientPool.size());
}
    /**
     * Round-robin to the next client
     */
    public ChatClient nextClient() {
        // floorMod keeps the index non-negative even after the counter overflows
        int idx = Math.floorMod(counter.getAndIncrement(), clientPool.size());
        return clientPool.get(idx);
    }
}

@Data
@Component
@ConfigurationProperties(prefix = "vllm")
class VllmProperties {
private List<LoadBalancerConfig.VllmInstance> instances;
}

Performance results at a glance
vLLM vs. other inference frameworks (A100 80G, DeepSeek-R1-7B, concurrency 32)
| Framework | Throughput (tokens/s) | P50 latency | P99 latency | GPU utilization | Memory efficiency |
|---|---|---|---|---|---|
| Transformers (naive) | 1,200 | 1.8s | 6.2s | 18% | low |
| Transformers + batching | 3,500 | 1.2s | 3.8s | 45% | medium |
| TGI (HuggingFace) | 6,200 | 0.9s | 2.1s | 65% | medium-high |
| vLLM 0.6.x | 9,800 | 0.6s | 1.4s | 76% | high |
| vLLM + AWQ INT4 | 11,200 | 0.5s | 1.2s | 82% | very high |
FAQ
Q1: Which models does vLLM support? Can it run Chinese models?
vLLM supports virtually every mainstream open-source LLM, including the DeepSeek-R1 family, Qwen2.5, Llama 3, and ChatGLM4. The rule of thumb: if the weights are on HuggingFace and the architecture is a standard Transformer variant, vLLM almost certainly supports it. When a model is not supported, it is usually due to a custom architecture; search the vLLM GitHub issues for workarounds.
Q2: vLLM or Ollama, how do I choose?
Ollama fits local developer testing, single machines, and small teams that want plug-and-play simplicity. vLLM fits enterprise production: high throughput, multi-instance deployments, and fine-grained performance tuning. In short: Ollama is the out-of-the-box Swiss Army knife; vLLM is the production engine you can tune to the last screw.
Q3: What should --max-model-len be set to?
A rough sizing formula: KV Cache per sequence ≈ max_model_len × layers × kv_heads × head_dim × 2 (K and V) × 2 bytes (FP16). Start from real business needs: if 90% of conversations stay under 4096 tokens, set 4096 rather than 32768; a smaller max-model-len leaves room for more concurrent requests. Start with --gpu-memory-utilization 0.90 --max-model-len 4096 and tune based on monitoring.
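As a rough sizing sketch (the GQA dimensions below are assumptions for a Qwen2.5-7B-class model; read the real values from your model's config.json):
# kv_budget.py -- rough estimate of how many full-length sequences fit in the KV budget.
# Assumed dimensions (check config.json): 28 layers, 4 KV heads, head_dim 128, FP16;
# the 80GB card and ~15GB weight footprint are also assumptions.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 28, 4, 128, 2
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V per token
kv_budget = 80e9 * 0.90 - 15e9  # usable GPU memory minus model weights

for max_len in (4096, 32768):
    seqs = kv_budget / (kv_per_token * max_len)
    print(f"max_model_len={max_len}: ~{seqs:.0f} full-length sequences fit")
# smaller max_model_len -> far more concurrent sequences in the same budget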
Q4: How does vLLM's OpenAI-compatible API differ from the real OpenAI API?
Functionally about 99% compatible. The main gaps: (1) OpenAI's file-upload APIs are not supported; (2) some advanced parameters (such as logit_bias) may behave slightly differently; (3) function calling / tool use depends on the underlying model's own capabilities. For Spring AI's standard usage (chat, streaming, embeddings) it is fully transparent.
Q5: How should monitoring and alerting be set up for vLLM in production?
Key alert thresholds: vllm:gpu_cache_usage_perc > 0.95 (KV Cache nearly full, scale out), vllm:num_requests_waiting > 50 (queue backlog), vllm:time_to_first_token_seconds P95 > 3s (responses slowing down). Configure these in Prometheus AlertManager and wire them to a Feishu or DingTalk webhook for timely notification.
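For teams not yet running AlertManager, a minimal polling probe can cover the same thresholds; the metrics URL and webhook below are placeholders, and the payload format follows Feishu's custom-bot webhook convention (adjust for DingTalk):
# alert_probe.py -- minimal polling alternative to AlertManager; URL and thresholds are
# example values, and the payload follows Feishu's custom-bot webhook convention.
import re, time, requests

METRICS_URL = "http://vllm-server:8000/metrics"
WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/xxx"  # placeholder
THRESHOLDS = {"vllm:gpu_cache_usage_perc": 0.95, "vllm:num_requests_waiting": 50}

def scrape(text: str, name: str) -> float:
    # Match "metric_name{labels} value" or "metric_name value" at line start
    m = re.search(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([\d.eE+-]+)", text, re.M)
    return float(m.group(1)) if m else 0.0

while True:
    body = requests.get(METRICS_URL, timeout=5).text
    for name, limit in THRESHOLDS.items():
        value = scrape(body, name)
        if value > limit:
            msg = f"vLLM alert: {name}={value} exceeds {limit}"
            requests.post(WEBHOOK, json={"msg_type": "text", "content": {"text": msg}})
    time.sleep(15)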
Q6: How does vLLM handle multi-tenancy (several business units sharing one inference service)?
Recommended approach: route tenants at the Spring AI gateway layer, tag each tenant, and carry tenant identity in request headers. vLLM itself is tenant-unaware; the gateway handles per-tenant rate limiting (at most N requests per tenant per minute) and usage accounting (per-tenant token consumption). If model-level isolation is required, deploy separate vLLM instances per tenant.
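A gateway-side rate limiter is simple to sketch; this token-bucket example is illustrative (the quota numbers are arbitrary) and not tied to any particular gateway framework:
# tenant_limiter.py -- gateway-side per-tenant token bucket, as described above;
# a sketch of the idea only, with arbitrary example quotas.
import time
from collections import defaultdict

RATE = 60 / 60.0  # refill rate: 60 requests per minute per tenant
BURST = 10        # burst budget per tenant

class TenantLimiter:
    def __init__(self):
        # tenant id -> (remaining tokens, timestamp of last refill)
        self.state = defaultdict(lambda: (float(BURST), time.monotonic()))

    def allow(self, tenant_id: str) -> bool:
        tokens, last = self.state[tenant_id]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last call
        if tokens >= 1:
            self.state[tenant_id] = (tokens - 1, now)
            return True
        self.state[tenant_id] = (tokens, now)
        return False

limiter = TenantLimiter()
print(limiter.allow("tenant-a"))  # True until the burst budget is spent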
