vLLM Production Deployment Guide: A Complete Blueprint for Self-Hosted Enterprise Inference
Throughput up 8x, costs down 60%: the honest record of one vLLM migration
In November 2025, Lin Zihang, head of the algorithm team at an AI company, hit a problem that kept him up at night: his team had built an in-house document intelligence system, deployed on four A100s and served through Hugging Face's stock transformers inference interface.
Six months after launch, the customer base had grown from zero to 3,200 enterprises, and traffic had climbed from a few hundred requests per day to a peak of 800 requests per minute.
Then the trouble started. Four A100s with 80GB of memory each should have been more than enough on paper, but the actual numbers were dismal:
- GPU utilization: 35% at peak, just 18% on average
- Average response time: crept from 1.2s at launch to 4.8s
- Request queue: more than 200 requests backed up at peak hours
- Monthly server cost: ¥28,000 in cloud rental for the four A100s
Lin took the problem to an AI engineering group chat, where a veteran's entire answer was: "switch to vLLM."
Unconvinced, he pulled two engineers into a 48-hour head-to-head test. The results left him speechless:
| Metric | transformers inference | vLLM |
|---|---|---|
| Throughput (tokens/s) | 1,200 | 9,800 |
| P50 latency | 1.8s | 0.6s |
| P99 latency | 4.8s | 1.4s |
| GPU utilization | 18% | 76% |
| Concurrent requests handled | 8 | 64 |
Same hardware, more than eight times the throughput. After migrating, the team scaled down from four A100s to two, and the monthly server bill dropped from ¥28,000 to ¥11,000, a cut of over 60%.
This article is the complete technical write-up of what Lin's team learned from that migration.
vLLM core principles: why PagedAttention is so fast
To understand vLLM's performance advantage, you first need to understand the problem it solves.
Memory waste in traditional inference
The core working set of LLM inference is the KV Cache. For every token processed, the model caches the K and V matrices of the current and all preceding tokens for use in later attention steps.
The traditional approach pre-allocates a contiguous slab of GPU memory for each request.
For example, with the context length set to 4096, each request pre-allocates roughly 500MB of GPU memory. But the average conversation runs only about 800 tokens and touches roughly 100MB; the other 400MB sits allocated yet empty.
With 20 concurrent requests, effective utilization can fall to around 20%, with 80% of GPU memory wasted outright. This is the root cause of that 18% GPU utilization figure.
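To make that concrete, here is a back-of-envelope sketch of the arithmetic. The model dimensions (32 layers, 8 KV heads via GQA, head_dim 128, FP16) are illustrative assumptions for a 7B-class model, not values taken from any particular config:
# kv_waste_estimate.py -- back-of-envelope sketch; the dimensions below are assumptions
# for a 7B-class GQA model (32 layers, 8 KV heads, head_dim 128, FP16), not vLLM code.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

def kv_cache_bytes(num_tokens: int) -> int:
    # K and V each store layers x kv_heads x head_dim values per token
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * num_tokens

preallocated = kv_cache_bytes(4096)  # static slab for the full context window
used = kv_cache_bytes(800)           # the average conversation length
print(f"preallocated: {preallocated / 2**20:.0f} MiB")  # ~512 MiB
print(f"actually used: {used / 2**20:.0f} MiB")         # ~100 MiB
print(f"utilization: {used / preallocated:.0%}")        # ~20%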
PagedAttention: inspired by OS virtual memory
vLLM's PagedAttention borrows the paging idea from operating-system memory management:
- Physical GPU memory is carved into fixed-size pages (blocks), each holding the KV Cache of a fixed number of tokens
- On-demand allocation: a request gets exactly as many pages as it has generated tokens for, with nothing reserved up front
- Logical-to-physical mapping: each request sees a virtually contiguous address space while the physical blocks can be scattered
- Cross-request sharing: identical prefixes across requests (such as a shared System Prompt) can point to the same physical block instead of being stored twice
Traditional approach (pre-allocated contiguous memory):
Request A: [████████████████████░░░░░░░░░░] 50% actually used
Request B: [█████████████░░░░░░░░░░░░░░░░░] 40% actually used
Request C: [████████░░░░░░░░░░░░░░░░░░░░░░] 25% actually used
→ large amounts of GPU memory sit idle
PagedAttention (on-demand allocation):
Request A: [Block1][Block2][Block3]
Request B: [Block4][Block5]
Request C: [Block6]
[Block7][Block8]... (free, waiting for new requests)
→ significantly higher GPU utilization
This design lets vLLM host far more concurrent requests in the same GPU memory. On top of it, Continuous Batching raises throughput further: instead of waiting for a full batch to assemble, new requests are slotted into the running batch the moment they arrive.
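To make the bookkeeping tangible, here is a toy sketch of the idea (illustrative only; vLLM's real block manager is far more involved). The pool size is an arbitrary example value, though 16 tokens per block matches vLLM's default:
# paged_kv_toy.py -- toy model of PagedAttention-style block allocation, not vLLM internals.
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size is also 16)

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))     # physical block ids
        self.tables: dict[str, list[int]] = {}  # request id -> block table (logical -> physical)

    def grow(self, request_id: str, num_tokens: int) -> None:
        table = self.tables.setdefault(request_id, [])
        # Allocate a new physical block only when the request actually outgrows its last one
        while len(table) * BLOCK_SIZE < num_tokens:
            table.append(self.free.pop(0))

    def release(self, request_id: str) -> None:
        # Finished requests return their blocks to the pool immediately
        self.free.extend(self.tables.pop(request_id, []))

pool = BlockPool(num_blocks=1024)
pool.grow("A", 50)   # 50 tokens -> 4 blocks, instead of a 4096-token slab
pool.grow("B", 20)   # 20 tokens -> 2 blocks
print(pool.tables)   # {'A': [0, 1, 2, 3], 'B': [4, 5]}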
Installation and environment: version compatibility matters
vLLM has strict requirements on CUDA, PyTorch, and Python versions; a version mismatch is the number-one cause of failed installs.
Version compatibility table (latest stable releases at the time of writing)
| vLLM version | Python | CUDA | PyTorch | Recommended GPUs |
|---|---|---|---|---|
| 0.6.x | 3.10-3.12 | 12.1 | 2.3.x | A100/H100/RTX4090 |
| 0.5.x | 3.9-3.11 | 11.8 | 2.1.x | A100/V100 |
| 0.4.x | 3.9-3.11 | 11.8 | 2.0.x | V100/A10 |
Installation steps
# 1. Check the CUDA version
nvidia-smi | grep "CUDA Version"
# CUDA Version: 12.1
# 2. Create a Python virtual environment (recommended)
conda create -n vllm python=3.11 -y
conda activate vllm
# 3. Install PyTorch (must match your CUDA version)
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121
# 4. Install vLLM
pip install vllm==0.6.2
# 5. Verify the installation
python -c "import vllm; print(vllm.__version__)"
# 0.6.2
# 6. Quick smoke test
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m')  # tiny model, for testing only
outputs = llm.generate(['Hello'], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
"

Common installation problems
# Problem 1: CUDA version mismatch
# Error: RuntimeError: CUDA error: no kernel image is available for execution on the device
# Fix: reinstall the PyTorch build matching your CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu$(nvcc --version | grep -oP '\d+\.\d+' | head -1 | tr -d '.')
# Problem 2: out of GPU memory
# Error: torch.cuda.OutOfMemoryError
# Fix: use a quantized model or lower max-model-len
vllm serve model_name --max-model-len 4096 --gpu-memory-utilization 0.85
# Problem 3: flash-attn missing
# Error: ImportError: No module named 'flash_attn'
pip install flash-attn --no-build-isolation

Single-GPU and multi-GPU deployment
Single-GPU startup (command line)
# Basic startup (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/deepseek-r1-7b \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name deepseek-r1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --dtype bfloat16
# Verify the server is up
curl http://localhost:8000/v1/models
# {"object":"list","data":[{"id":"deepseek-r1","object":"model",...}]}
# Test call
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello, please introduce vLLM"}],
        "max_tokens": 200
    }'

Multi-GPU deployment (tensor parallelism)
# Two GPUs (recommended for 14B+ models).
# --tensor-parallel-size 2 shards the model across two GPUs;
# --enable-prefix-caching reuses the System Prompt KV Cache across requests.
# (Comments cannot follow a trailing backslash, so they live above the command.)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/deepseek-r1-14b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --enable-prefix-caching
# Four GPUs (for 70B-class models).
# --pipeline-parallel-size stays at 1 here; raise it only for models too large for tensor parallelism alone.
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/qwen2.5-72b \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90

Key parameter reference
| Parameter | Description | Recommended value |
|---|---|---|
| `--max-model-len` | Maximum context length (drives KV Cache memory) | size to your GPU memory; 8192 is a good start |
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM may claim | 0.85-0.90 |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | equal to the GPU count |
| `--max-num-seqs` | Maximum concurrent sequences | 64-256 |
| `--max-num-batched-tokens` | Maximum tokens per batch | 4096-32768 |
| `--enable-prefix-caching` | Prefix KV Cache reuse | strongly recommended |
| `--dtype` | Computation precision | bfloat16 (A100) / float16 |
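The same knobs exist on vLLM's Python API for offline or embedded use; a minimal sketch follows (the model path and values are placeholders to adapt to your hardware):
# offline_engine.py -- the serve flags above map onto vLLM's Python engine arguments;
# the model path and values here are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/deepseek-r1-7b",
    max_model_len=8192,            # --max-model-len
    gpu_memory_utilization=0.90,   # --gpu-memory-utilization
    tensor_parallel_size=1,        # --tensor-parallel-size
    max_num_seqs=64,               # --max-num-seqs
    enable_prefix_caching=True,    # --enable-prefix-caching
    dtype="bfloat16",              # --dtype
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)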
Model loading: Hugging Face and local weights
Loading from Hugging Face
# Configure a HuggingFace mirror (faster downloads inside China)
export HF_ENDPOINT=https://hf-mirror.com
# Download the model locally
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --local-dir /data/models/deepseek-r1-7b \
    --local-dir-use-symlinks False
# Or start directly from a HuggingFace model ID (downloads automatically)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --download-dir /data/models

Loading from local weights
# Expected local weight directory layout
/data/models/deepseek-r1-7b/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
└── model-00004-of-00004.safetensors
# Load directly from the local path
python -m vllm.entrypoints.openai.api_server \
--model /data/models/deepseek-r1-7b \
    --tokenizer /data/models/deepseek-r1-7b  # tokenizer path (usually the same as the model)

Connecting Spring AI to vLLM's OpenAI-compatible API
vLLM's API is fully OpenAI-compatible, so Spring AI can connect through the stock OpenAI starter with no custom integration code.
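Before wiring up Spring, the compatibility is easy to sanity-check from Python with the official openai client; the base_url and model name below assume the serve command shown earlier:
# compat_check.py -- quick sanity check of vLLM's OpenAI-compatible endpoint.
# base_url and model assume the serve command from the previous section.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Briefly introduce vLLM."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)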
Project structure
vllm-spring-ai/
├── pom.xml
├── src/main/
│ ├── java/com/laozhang/vllm/
│ │ ├── VllmApplication.java
│ │ ├── config/
│ │ │ ├── VllmAIConfig.java
│ │ │ └── LoadBalancerConfig.java
│ │ ├── controller/
│ │ │ └── InferenceController.java
│ │ ├── service/
│ │ │ ├── VllmChatService.java
│ │ │ └── VllmHealthService.java
│ │ └── monitor/
│ │ └── VllmMetricsCollector.java
│ └── resources/
│       └── application.yml

pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.3.4</version>
<relativePath/>
</parent>
<groupId>com.laozhang</groupId>
<artifactId>vllm-spring-ai</artifactId>
<version>1.0.0</version>
<description>vLLM + Spring AI enterprise inference service integration</description>
<properties>
<java.version>17</java.version>
<spring-ai.version>1.0.0</spring-ai.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>${spring-ai.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- WebFlux: streaming support, plus the WebClient used to scrape vLLM's Prometheus metrics -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<!-- Use the OpenAI starter to reach vLLM's OpenAI-compatible API -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<!-- Resilience4j (circuit breaker + rate limiting) -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-reactor</artifactId>
<version>2.2.0</version>
</dependency>
<!-- Prometheus monitoring -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>

application.yml
server:
  port: 8080

spring:
  application:
    name: vllm-spring-ai
  # Reach vLLM through the OpenAI starter
  ai:
    openai:
      # vLLM's OpenAI-compatible endpoint
      base-url: http://vllm-server:8000/v1
      # vLLM requires no API key by default; any placeholder value works
      api-key: "not-needed"
      chat:
        options:
          model: deepseek-r1  # must match vLLM's --served-model-name
          temperature: 0.7
          max-tokens: 2048
          # vLLM-specific parameters can be passed via additional model request fields

# Multiple vLLM instances (load balancing)
vllm:
  instances:
    - url: http://vllm-server-1:8000
      weight: 50
      model: deepseek-r1
    - url: http://vllm-server-2:8000
      weight: 50
      model: deepseek-r1

# Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      vllm:
        sliding-window-size: 20
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 5
  bulkhead:
    instances:
      vllm:
        max-concurrent-calls: 32
        max-wait-duration: 10s
  timelimiter:
    instances:
      vllm:
        timeout-duration: 60s

management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,metrics,info
  metrics:
    tags:
      application: ${spring.application.name}

logging:
  level:
    com.laozhang: DEBUG

VllmAIConfig.java
package com.laozhang.vllm.config;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
 * vLLM wiring.
 * vLLM speaks the OpenAI protocol, so we reuse Spring AI's OpenAI starter directly.
 */
@Configuration
public class VllmAIConfig {
@Value("${spring.ai.openai.base-url}")
private String vllmBaseUrl;
@Value("${spring.ai.openai.chat.options.model}")
private String modelName;
    /**
     * Build a ChatModel pointed at vLLM (reusing the OpenAI client)
     */
@Bean
public OpenAiChatModel vllmChatModel() {
        // Create an OpenAI API instance whose base-url points at vLLM
OpenAiApi openAiApi = new OpenAiApi(vllmBaseUrl, "not-needed");
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel(modelName)
.withTemperature(0.7f)
.withMaxTokens(2048)
.withTopP(0.9f)
.build();
return new OpenAiChatModel(openAiApi, options);
}
@Bean
public ChatClient vllmChatClient(OpenAiChatModel vllmChatModel) {
return ChatClient.builder(vllmChatModel)
            .defaultSystem("You are a professional AI assistant. Give accurate, useful answers.")
.build();
}
}

VllmChatService.java (production-grade service implementation)
package com.laozhang.vllm.service;
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
/**
 * vLLM inference service.
 * Adds circuit breaking, timeouts, and bulkhead isolation for production-grade reliability.
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class VllmChatService {
private final ChatClient vllmChatClient;
    /**
     * Chat call with the full set of reliability guards
     */
@CircuitBreaker(name = "vllm", fallbackMethod = "chatFallback")
@Bulkhead(name = "vllm", type = Bulkhead.Type.SEMAPHORE)
@TimeLimiter(name = "vllm")
public CompletableFuture<String> chat(String message) {
return CompletableFuture.supplyAsync(() -> {
long start = System.currentTimeMillis();
try {
String response = vllmChatClient.prompt()
.user(message)
.call()
.content();
log.info("vLLM inference success, latency={}ms", System.currentTimeMillis() - start);
return response;
} catch (Exception e) {
log.error("vLLM inference failed: {}", e.getMessage(), e);
throw e;
}
});
}
    /**
     * Streaming inference
     */
@CircuitBreaker(name = "vllm", fallbackMethod = "streamFallback")
public Flux<String> chatStream(String message) {
return Flux.defer(() -> vllmChatClient.prompt()
.user(message)
.stream()
.content())
.timeout(Duration.ofSeconds(60))
.doOnError(e -> log.error("Stream error: {}", e.getMessage()));
}
    /**
     * Batch inference (leaning on vLLM's Continuous Batching)
     */
    public Flux<String> chatBatch(java.util.List<String> messages) {
        return Flux.fromIterable(messages)
            .flatMap(msg ->
                Mono.fromCompletableFuture(chat(msg))
                    .onErrorReturn("Inference failed, please retry"),
                8  // cap in-flight calls at 8
            );
    }
    /**
     * Circuit-breaker fallback
     */
    public CompletableFuture<String> chatFallback(String message, Exception e) {
        log.warn("vLLM circuit breaker triggered, using fallback. Error: {}", e.getMessage());
        return CompletableFuture.completedFuture(
            "The AI service is busy right now, please try again later. (Detail: " + e.getClass().getSimpleName() + ")"
        );
    }
    public Flux<String> streamFallback(String message, Exception e) {
        log.warn("vLLM stream circuit breaker triggered: {}", e.getMessage());
        return Flux.just("The AI service is busy right now, please try again later.");
    }
}

Quantized deployment: INT8/INT4 to save GPU memory
Quantization is the key to running large models within a limited memory budget: it trades a little precision for a lot of GPU memory.
Memory requirements compared
| Model | FP32 | BF16/FP16 | INT8 (W8A8) | INT4 (GPTQ) |
|---|---|---|---|---|
| 7B | 28GB | 14GB | 7GB | 4GB |
| 14B | 56GB | 28GB | 14GB | 8GB |
| 70B | 280GB | 140GB | 70GB | 40GB |
AWQ quantization (recommended; smallest accuracy loss)
# Install the quantization toolkit
pip install autoawq
# Run the quantization script
python quantize_awq.py

# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Path to the original model
model_path = "/data/models/deepseek-r1-7b"
# Where to save the quantized model
quant_path = "/data/models/deepseek-r1-7b-awq-int4"
# Quantization config (INT4, group size 128)
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
print("加载模型...")
model = AutoAWQForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print("开始量化(约需30-60分钟)...")
model.quantize(tokenizer, quant_config=quant_config)
print("保存量化模型...")
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"量化完成,保存到: {quant_path}")# 使用量化模型启动vLLM
python -m vllm.entrypoints.openai.api_server \
--model /data/models/deepseek-r1-7b-awq-int4 \
--quantization awq \
--dtype half \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# Quantization results
# BF16 original: 14GB GPU memory, 42 tokens/s
# INT4 AWQ: 4.5GB GPU memory, 38 tokens/s (about 3% accuracy loss, comparable speed)

Monitoring: vLLM's built-in Prometheus metrics
vLLM ships with a complete Prometheus metrics endpoint; no extra instrumentation code is needed.
Key vLLM Prometheus metrics
# Fetch vLLM metrics
curl http://vllm-server:8000/metrics
# Key metrics:
# vllm:num_requests_running            requests currently being processed
# vllm:num_requests_waiting            requests waiting in the queue
# vllm:gpu_cache_usage_perc            GPU KV Cache usage (scale out above 0.9)
# vllm:generation_tokens_total         total tokens generated
# vllm:prompt_tokens_total             total prompt tokens received
# vllm:time_to_first_token_seconds     time-to-first-token distribution
# vllm:time_per_output_token_seconds   time per output token
# vllm:request_success_total           successful requests
# vllm:request_params_n                the n sampling parameter per request (parallel generations)

VllmMetricsCollector.java (pulling vLLM metrics into Spring)
package com.laozhang.vllm.monitor;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
// The JDK has no AtomicDouble; AtomicReference<Double> works fine with Micrometer gauges
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Scrapes vLLM's Prometheus endpoint and republishes the values through the app's own Micrometer registry
 */
@Slf4j
@Component
public class VllmMetricsCollector {
private final WebClient webClient;
    private final AtomicReference<Double> requestsRunning = new AtomicReference<>(0.0);
    private final AtomicReference<Double> requestsWaiting = new AtomicReference<>(0.0);
    private final AtomicReference<Double> gpuCacheUsage = new AtomicReference<>(0.0);
@Value("${spring.ai.openai.base-url}")
private String vllmBaseUrl;
    public VllmMetricsCollector(MeterRegistry registry) {
        this.webClient = WebClient.builder().build();
        // Republish the scraped values as Micrometer gauges
        Gauge.builder("vllm.requests.running", requestsRunning, r -> r.get())
            .description("vLLM requests currently running")
            .register(registry);
        Gauge.builder("vllm.requests.waiting", requestsWaiting, r -> r.get())
            .description("vLLM requests waiting in the queue")
            .register(registry);
        Gauge.builder("vllm.gpu.cache.usage", gpuCacheUsage, r -> r.get())
            .description("vLLM GPU KV Cache usage ratio")
            .register(registry);
    }
    /**
     * Scrape vLLM metrics every 15 seconds
     */
@Scheduled(fixedDelay = 15000)
public void collectMetrics() {
String metricsUrl = vllmBaseUrl.replace("/v1", "") + "/metrics";
webClient.get()
.uri(metricsUrl)
.retrieve()
.bodyToMono(String.class)
.subscribe(
metrics -> parseAndUpdateMetrics(metrics),
error -> log.warn("采集vLLM指标失败: {}", error.getMessage())
);
}
private void parseAndUpdateMetrics(String metricsText) {
updateGauge(metricsText, "vllm:num_requests_running", requestsRunning);
updateGauge(metricsText, "vllm:num_requests_waiting", requestsWaiting);
updateGauge(metricsText, "vllm:gpu_cache_usage_perc", gpuCacheUsage);
}
    private void updateGauge(String text, String metricName, AtomicReference<Double> gauge) {
        Pattern pattern = Pattern.compile(metricName + "\\{[^}]*\\}\\s+([\\d.]+)");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            gauge.set(Double.parseDouble(matcher.group(1)));
        }
    }
}

InferenceController.java
package com.laozhang.vllm.controller;
import com.laozhang.vllm.service.VllmChatService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
@Slf4j
@RestController
@RequestMapping("/inference")
@RequiredArgsConstructor
public class InferenceController {
private final VllmChatService vllmChatService;
    /**
     * Single-shot inference
     */
@PostMapping
public CompletableFuture<ResponseEntity<Map<String, String>>> infer(
@RequestBody Map<String, String> request) {
String message = request.get("message");
return vllmChatService.chat(message)
.thenApply(result -> ResponseEntity.ok(Map.of("result", result)));
}
    /**
     * Streaming inference
     */
@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> inferStream(@RequestBody Map<String, String> request) {
return vllmChatService.chatStream(request.get("message"));
}
    /**
     * Batch inference
     */
@PostMapping("/batch")
public Flux<String> inferBatch(@RequestBody Map<String, List<String>> request) {
return vllmChatService.chatBatch(request.get("messages"));
}
}

High availability: load balancing across multiple instances
Nginx configuration (load balancing vLLM instances)
# /etc/nginx/conf.d/vllm.conf
upstream vllm_backend {
    least_conn;  # least-connections balancing (a good fit for long-lived inference requests)
    server vllm-server-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server vllm-server-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    keepalive 32;  # keep upstream connections alive
}
server {
    listen 80;
    server_name vllm-api.internal;
    # Inference API proxy
    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # required for upstream keepalive
        # Timeouts (inference can be slow)
        proxy_connect_timeout 10s;
        proxy_send_timeout 120s;
        proxy_read_timeout 120s;
        # Buffering must be off for streaming output
        proxy_buffering off;
        proxy_cache off;
    }
    # Metrics
    location /metrics {
        # Proxies instance metrics (pair with Prometheus federation to aggregate across instances)
        proxy_pass http://vllm_backend/metrics;
    }
}

LoadBalancerConfig.java (load balancing on the Spring side)
package com.laozhang.vllm.config;

import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Round-robin load balancing across vLLM instances on the Spring side.
 * For deployments that do not put Nginx in front of vLLM.
 */
@Slf4j
@Component
public class LoadBalancerConfig {
@Data
public static class VllmInstance {
private String url;
private int weight;
private String model;
private volatile boolean healthy = true;
}
private final List<ChatClient> clientPool;
private final AtomicInteger counter = new AtomicInteger(0);
public LoadBalancerConfig(VllmProperties properties) {
this.clientPool = properties.getInstances().stream()
.map(instance -> {
OpenAiApi api = new OpenAiApi(instance.getUrl() + "/v1", "not-needed");
OpenAiChatModel model = new OpenAiChatModel(api,
OpenAiChatOptions.builder()
.withModel(instance.getModel())
.build());
return ChatClient.builder(model).build();
})
.toList();
log.info("已初始化 {} 个vLLM实例", clientPool.size());
}
    /**
     * Round-robin to the next client
     */
    public ChatClient nextClient() {
        // floorMod keeps the index non-negative even after the counter overflows
        int idx = Math.floorMod(counter.getAndIncrement(), clientPool.size());
        return clientPool.get(idx);
    }
}

@Data
@Component
@ConfigurationProperties(prefix = "vllm")
class VllmProperties {
private List<LoadBalancerConfig.VllmInstance> instances;
}

Performance results at a glance
vLLM vs. other inference frameworks (A100 80G, DeepSeek-R1-7B, concurrency 32)
| Framework | Throughput (tokens/s) | P50 latency | P99 latency | GPU utilization | Memory efficiency |
|---|---|---|---|---|---|
| Transformers (naive) | 1,200 | 1.8s | 6.2s | 18% | low |
| Transformers + batching | 3,500 | 1.2s | 3.8s | 45% | medium |
| TGI (HuggingFace) | 6,200 | 0.9s | 2.1s | 65% | medium-high |
| vLLM 0.6.x | 9,800 | 0.6s | 1.4s | 76% | high |
| vLLM + AWQ INT4 | 11,200 | 0.5s | 1.2s | 82% | very high |
FAQ
Q1: Which models does vLLM support? Can it run Chinese models?
vLLM supports virtually every mainstream open-source LLM, including the DeepSeek-R1 family, Qwen2.5, Llama 3, and ChatGLM4. The rule of thumb: if the weights are on HuggingFace and the architecture is a standard Transformer variant, vLLM almost certainly supports it. When a model is not supported, it is usually due to a custom architecture; search the vLLM GitHub issues for workarounds.
Q2: vLLM or Ollama, how do I choose?
Ollama fits local developer testing, single machines, and small teams that want plug-and-play simplicity. vLLM fits enterprise production: high throughput, multi-instance deployments, and fine-grained performance tuning. In short: Ollama is the out-of-the-box Swiss Army knife; vLLM is the production engine you can tune to the last screw.
Q3: What should --max-model-len be set to?
A rough sizing formula: KV Cache per sequence ≈ max_model_len × layers × kv_heads × head_dim × 2 (K and V) × 2 bytes (FP16). Start from real business needs: if 90% of conversations stay under 4096 tokens, set 4096 rather than 32768; a smaller max-model-len leaves room for more concurrent requests. Start with --gpu-memory-utilization 0.90 --max-model-len 4096 and tune based on monitoring.
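As a rough sizing sketch (the GQA dimensions below are assumptions for a Qwen2.5-7B-class model; read the real values from your model's config.json):
# kv_budget.py -- rough estimate of how many full-length sequences fit in the KV budget.
# Assumed dimensions (check config.json): 28 layers, 4 KV heads, head_dim 128, FP16;
# the 80GB card and ~15GB weight footprint are also assumptions.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 28, 4, 128, 2
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V per token
kv_budget = 80e9 * 0.90 - 15e9  # usable GPU memory minus model weights

for max_len in (4096, 32768):
    seqs = kv_budget / (kv_per_token * max_len)
    print(f"max_model_len={max_len}: ~{seqs:.0f} full-length sequences fit")
# smaller max_model_len -> far more concurrent sequences in the same budget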
Q4: How does vLLM's OpenAI-compatible API differ from the real OpenAI API?
Functionally about 99% compatible. The main gaps: (1) OpenAI's file-upload APIs are not supported; (2) some advanced parameters (such as logit_bias) may behave slightly differently; (3) function calling / tool use depends on the underlying model's own capabilities. For Spring AI's standard usage (chat, streaming, embeddings) it is fully transparent.
Q5: How should monitoring and alerting be set up for vLLM in production?
Key alert thresholds: vllm:gpu_cache_usage_perc > 0.95 (KV Cache nearly full, scale out), vllm:num_requests_waiting > 50 (queue backlog), vllm:time_to_first_token_seconds P95 > 3s (responses slowing down). Configure these in Prometheus AlertManager and wire them to a Feishu or DingTalk webhook for timely notification.
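For teams not yet running AlertManager, a minimal polling probe can cover the same thresholds; the metrics URL and webhook below are placeholders, and the payload format follows Feishu's custom-bot webhook convention (adjust for DingTalk):
# alert_probe.py -- minimal polling alternative to AlertManager; URL and thresholds are
# example values, and the payload follows Feishu's custom-bot webhook convention.
import re, time, requests

METRICS_URL = "http://vllm-server:8000/metrics"
WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/xxx"  # placeholder
THRESHOLDS = {"vllm:gpu_cache_usage_perc": 0.95, "vllm:num_requests_waiting": 50}

def scrape(text: str, name: str) -> float:
    # Match "metric_name{labels} value" or "metric_name value" at line start
    m = re.search(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([\d.eE+-]+)", text, re.M)
    return float(m.group(1)) if m else 0.0

while True:
    body = requests.get(METRICS_URL, timeout=5).text
    for name, limit in THRESHOLDS.items():
        value = scrape(body, name)
        if value > limit:
            msg = f"vLLM alert: {name}={value} exceeds {limit}"
            requests.post(WEBHOOK, json={"msg_type": "text", "content": {"text": msg}})
    time.sleep(15)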
Q6: How does vLLM handle multi-tenancy (several business units sharing one inference service)?
Recommended approach: route tenants at the Spring AI gateway layer, tag each tenant, and carry tenant identity in request headers. vLLM itself is tenant-unaware; the gateway handles per-tenant rate limiting (at most N requests per tenant per minute) and usage accounting (per-tenant token consumption). If model-level isolation is required, deploy separate vLLM instances per tenant.
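A gateway-side rate limiter is simple to sketch; this token-bucket example is illustrative (the quota numbers are arbitrary) and not tied to any particular gateway framework:
# tenant_limiter.py -- gateway-side per-tenant token bucket, as described above;
# a sketch of the idea only, with arbitrary example quotas.
import time
from collections import defaultdict

RATE = 60 / 60.0  # refill rate: 60 requests per minute per tenant
BURST = 10        # burst budget per tenant

class TenantLimiter:
    def __init__(self):
        # tenant id -> (remaining tokens, timestamp of last refill)
        self.state = defaultdict(lambda: (float(BURST), time.monotonic()))

    def allow(self, tenant_id: str) -> bool:
        tokens, last = self.state[tenant_id]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last call
        if tokens >= 1:
            self.state[tenant_id] = (tokens - 1, now)
            return True
        self.state[tenant_id] = (tokens, now)
        return False

limiter = TenantLimiter()
print(limiter.allow("tenant-a"))  # True until the burst budget is spent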
