Performance Benchmarking for AI Applications: Establishing Your System's Performance Baseline
The Deployment That Kept the Whole Team Up All Night
On a Friday afternoon in November 2024, Chen Ming, a senior engineer at a leading e-commerce platform, stared at his screen, ready to press the green "Deploy" button.
This was the 37th iteration of their AI customer-service system. After three weeks of development they had added three major features: multi-turn conversation memory, enhanced product recommendations, and optimized semantic search. Code review had passed, the test suite was all green, and QA confirmed there were no functional issues.
"Ship it. Same as always — it feels fine."
At 9 a.m. on Monday, the customer-service system's response-time alerts started firing nonstop. Average response time had jumped from 680ms to 4.2 seconds, and P99 hit 11.8 seconds. On the first working day of the Double 11 warm-up campaign, the AI assistant was effectively unusable. The business loss was later calculated at over 2.3 million RMB.
In the postmortem, it took Chen Ming a full four hours to find the root causes: the new vector search feature suffered connection-pool contention under high concurrency, and the enhanced product recommendation re-initialized the embedding model client on every request, an operation that took 320ms and that nobody had ever measured.
"Why was there no benchmark?" the architect asked.
Chen Ming had no answer. They had unit tests and integration tests, but no systematic performance baseline. Every release shipped on gut feeling.
This time, gut feeling cost them.
Why Performance Benchmarking Is More Critical for AI Applications Than for Ordinary Ones
The performance problems of a traditional web application are usually predictable: slow database queries, API timeouts, memory leaks. AI applications bring performance challenges of their own:
1. Extremely high variance in response time
An LLM call's latency depends on the model, the token count, and server-side load; the same request can vary by 10x. Without a baseline, you cannot tell whether "slow" is normal or a problem.
2. Cost and performance are tightly coupled
Every LLM call consumes tokens; every embedding consumes compute. A performance regression often translates directly into higher cost.
3. Model upgrades introduce invisible performance shifts
Swapping GPT-4o-mini for GPT-4o, or moving embeddings from ada-002 to text-embedding-3-large, can change performance dramatically while leaving functionality completely untouched.
4. Vector database query behavior is nothing like a traditional database
The performance of ANN (approximate nearest neighbor) search is heavily affected by index parameters, dataset size, and dimensionality; without benchmarks, capacity planning is impossible.
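Point 1 above is easy to demonstrate in a few lines. A minimal sketch (the latency values are made up for illustration, not measured) showing why a mean hides exactly the tail behavior a baseline has to capture:

```java
import java.util.Arrays;

public class LatencyStats {

    // Nearest-rank percentile over a latency sample
    public static double percentile(double[] data, double p) {
        double[] sorted = Arrays.copyOf(data, data.length);
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }

    public static double mean(double[] data) {
        return Arrays.stream(data).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical LLM latencies in ms: mostly fast, two 5-10x outliers
        double[] latencies = {420, 450, 480, 500, 510, 530, 560, 590, 2400, 4800};
        System.out.printf("mean=%.0fms p50=%.0fms p95=%.0fms p99=%.0fms%n",
                mean(latencies), percentile(latencies, 50),
                percentile(latencies, 95), percentile(latencies, 99));
    }
}
```

With this sample the mean lands above one second even though half the calls finish around 500ms; tracking P50/P95/P99 against a recorded baseline is what makes "is this slow call normal?" an answerable question.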
What performance benchmarking buys you, in one line:
No baseline → "feels fine" → production blows up
With a baseline → "P95 went from 120ms to 340ms, 20% past the threshold" → blocked before release

JMH: Java Microbenchmarking Done Right
The mistake Java engineers make most often is measuring performance with System.currentTimeMillis():
// This way of measuring is completely unreliable
long start = System.currentTimeMillis();
embeddingClient.embed("test query");
long end = System.currentTimeMillis();
System.out.println("Elapsed: " + (end - start) + "ms");

The problems with this approach: JVM warmup is incomplete, GC pauses are not excluded, CPU frequency scales dynamically, and JIT compilation interferes.
JMH (Java Microbenchmark Harness) is OpenJDK's official benchmarking framework, built specifically to solve these problems.
pom.xml Dependency Configuration
<dependencies>
<!-- Spring AI -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
<!-- JMH core -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.37</version>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.37</version>
<scope>provided</scope>
</dependency>
<!-- PGVector -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Micrometer metrics -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-core</artifactId>
<version>1.12.0</version>
</dependency>
<!-- Jackson -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.16.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.4.1</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>shade</goal></goals>
<configuration>
<finalName>benchmarks</finalName>
<transformers>
<transformer implementation=
"org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.openjdk.jmh.Main</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>

Embedding Generation Benchmark
package com.laozhang.benchmark;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.EmbeddingRequest;
import org.springframework.ai.embedding.EmbeddingResponse;
import org.springframework.ai.openai.OpenAiEmbeddingModel;
import org.springframework.ai.openai.OpenAiEmbeddingOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.ai.document.MetadataMode;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
/**
* Embedding generation performance benchmark.
*
* Scenarios:
* 1. Single-text embedding
* 2. Batch embedding (8 texts per batch)
* 3. Effect of text length on latency
*/
@BenchmarkMode({Mode.AverageTime, Mode.Throughput, Mode.SampleTime})
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(value = 2, jvmArgs = {"-Xms4g", "-Xmx4g"})
@Warmup(iterations = 3, time = 10, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 20, timeUnit = TimeUnit.SECONDS)
public class EmbeddingBenchmark {
private EmbeddingModel embeddingModel;
// Test texts of different lengths
private static final String SHORT_TEXT = "用户查询:如何退款?";
private static final String MEDIUM_TEXT =
"用户咨询:我在平台上购买了一件商品,订单号是12345678,但是收到货后发现" +
"商品存在质量问题,外包装破损且内部配件缺失。我已经拍照留证,想要申请退款" +
"或者换货,请问具体的处理流程是什么?需要提供哪些材料?预计需要多长时间?";
private static final String LONG_TEXT = MEDIUM_TEXT.repeat(5); // roughly 500 Chinese characters
private List<String> batchTexts;
@Setup(Level.Trial)
public void setup() {
// Initialize the embedding model client once per trial
OpenAiApi openAiApi = new OpenAiApi(System.getenv("OPENAI_API_KEY"));
OpenAiEmbeddingOptions options = OpenAiEmbeddingOptions.builder()
.withModel("text-embedding-3-small")
.build();
this.embeddingModel = new OpenAiEmbeddingModel(openAiApi,
MetadataMode.EMBED, options);
// Prepare the batch test data
this.batchTexts = Arrays.asList(
"如何退款?", "订单状态查询", "商品质量问题",
"发货时间咨询", "优惠券使用方法", "账户余额查询",
"修改收货地址", "取消订单流程"
);
}
/**
* Baseline: single short-text embedding
* Expected: 50-150ms (text-embedding-3-small)
*/
@Benchmark
@BenchmarkMode(Mode.AverageTime)
public void singleShortTextEmbedding(Blackhole bh) {
EmbeddingResponse response = embeddingModel.embedForResponse(
List.of(SHORT_TEXT));
bh.consume(response);
}
/**
* Baseline: single medium-length-text embedding
* Expected: 60-180ms
*/
@Benchmark
@BenchmarkMode(Mode.AverageTime)
public void singleMediumTextEmbedding(Blackhole bh) {
EmbeddingResponse response = embeddingModel.embedForResponse(
List.of(MEDIUM_TEXT));
bh.consume(response);
}
/**
* Baseline: batch embedding (8 texts per batch)
* Expected: 80-200ms (amortized to 10-25ms per text)
*/
@Benchmark
@BenchmarkMode(Mode.AverageTime)
public void batchEmbedding(Blackhole bh) {
EmbeddingResponse response = embeddingModel.embedForResponse(batchTexts);
bh.consume(response);
}
/**
* Baseline: concurrent embedding (4 threads)
* Tests thread safety and concurrent throughput
*/
@Benchmark
@BenchmarkMode(Mode.Throughput)
@Threads(4)
public void concurrentEmbedding(Blackhole bh) {
EmbeddingResponse response = embeddingModel.embedForResponse(
List.of(MEDIUM_TEXT));
bh.consume(response);
}
}

Vector Search Benchmark
package com.laozhang.benchmark;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vectorstore.pgvector.PgVectorStore;
import java.util.*;
import java.util.concurrent.TimeUnit;
/**
* Vector search performance benchmark.
*
* Key metrics:
* - topK search latency (for different K values)
* - Impact of similarity-threshold filtering
* - Effect of dataset size on query speed
* - Overhead of metadata filtering
*/
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(value = 2)
@Warmup(iterations = 5, time = 10)
@Measurement(iterations = 10, time = 20)
public class VectorSearchBenchmark {
private VectorStore vectorStore;
private float[] queryVector;
// Benchmark parameter: different topK values
@Param({"3", "5", "10", "20"})
private int topK;
// Benchmark parameter: whether metadata filtering is enabled
@Param({"true", "false"})
private boolean withMetadataFilter;
@Setup(Level.Trial)
public void setup() throws Exception {
// Initialize the vector store (in practice, obtain it from the Spring context)
this.vectorStore = createVectorStore();
// Generate a random query vector (1536 dims, matching text-embedding-3-small)
this.queryVector = generateRandomVector(1536);
// Warm up: make sure the connection pool is fully populated
warmupVectorStore();
}
/**
* Baseline: standard topK vector search
*/
@Benchmark
public void standardTopKSearch(Blackhole bh) {
List<Document> results = vectorStore.similaritySearch(
SearchRequest.query("客服问题查询最佳实践")
.withTopK(topK)
);
bh.consume(results);
}
/**
* Baseline: search with a similarity-threshold filter
*/
@Benchmark
public void searchWithSimilarityThreshold(Blackhole bh) {
List<Document> results = vectorStore.similaritySearch(
SearchRequest.query("退款流程")
.withTopK(topK)
.withSimilarityThreshold(0.75)
);
bh.consume(results);
}
/**
* Baseline: vector search with metadata filtering
*/
@Benchmark
public void searchWithMetadataFilter(Blackhole bh) {
SearchRequest request = SearchRequest.query("商品问题")
.withTopK(topK);
if (withMetadataFilter) {
// Simulate a real scenario: only search documents in one category
request = request.withFilterExpression("category == 'product' && status == 'active'");
}
List<Document> results = vectorStore.similaritySearch(request);
bh.consume(results);
}
/**
* Baseline: concurrent vector search (simulating production concurrency)
*/
@Benchmark
@Threads(8)
public void concurrentVectorSearch(Blackhole bh) {
List<Document> results = vectorStore.similaritySearch(
SearchRequest.query("用户问题" + Thread.currentThread().getId())
.withTopK(5)
);
bh.consume(results);
}
private float[] generateRandomVector(int dimensions) {
float[] vector = new float[dimensions];
Random random = new Random(42);
for (int i = 0; i < dimensions; i++) {
vector[i] = random.nextFloat() * 2 - 1;
}
// L2-normalize the vector
float norm = 0;
for (float v : vector) norm += v * v;
norm = (float) Math.sqrt(norm);
for (int i = 0; i < dimensions; i++) vector[i] /= norm;
return vector;
}
private VectorStore createVectorStore() {
// Placeholder: in a real benchmark, obtain the store from the Spring context.
// Failing fast here is clearer than returning null and hitting an NPE later.
throw new UnsupportedOperationException("Wire in a real VectorStore before running");
}
private void warmupVectorStore() {
for (int i = 0; i < 10; i++) {
vectorStore.similaritySearch(
SearchRequest.query("warmup").withTopK(1));
}
}
}

LLM Call Benchmark
package com.laozhang.benchmark;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import java.util.List;
import java.util.concurrent.TimeUnit;
/**
* LLM call performance benchmark.
*
* Focus areas:
* - TTFT (Time To First Token)
* - Full response time vs. streaming
* - Performance across different models
* - Impact of prompt length on latency
*/
@BenchmarkMode(Mode.SampleTime) // SampleTime yields a full percentile distribution
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(value = 1)
@Warmup(iterations = 2, time = 30)
@Measurement(iterations = 3, time = 60)
public class LLMCallBenchmark {
private ChatClient chatClient;
// Benchmark across different models
@Param({"gpt-4o-mini", "gpt-4o"})
private String modelName;
private static final String SYSTEM_PROMPT =
"你是一个专业的电商客服助手,负责解答用户关于订单、商品、物流的问题。" +
"回答要简洁、准确、友好,不超过100字。";
private static final String SHORT_QUERY = "如何退款?";
private static final String LONG_QUERY =
"我在平台购买了一件价值500元的手机壳,订单号202411250001," +
"昨天收到货后发现颜色与描述不符,是红色而非页面展示的深空灰色," +
"而且做工比较粗糙,有明显的毛刺。我想申请退款,请问需要提供什么证明材料?" +
"退款流程是怎样的?大概需要多少个工作日到账?如果卖家不配合怎么办?";
@Setup(Level.Trial)
public void setup() {
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel(modelName)
.withMaxTokens(150)
.withTemperature(0.1f)
.build();
OpenAiChatModel chatModel = new OpenAiChatModel(
new org.springframework.ai.openai.api.OpenAiApi(
System.getenv("OPENAI_API_KEY")),
options);
this.chatClient = ChatClient.builder(chatModel).build();
}
/**
* Baseline: short-query LLM call
*/
@Benchmark
public void shortQueryLLMCall(Blackhole bh) {
String response = chatClient.prompt()
.system(SYSTEM_PROMPT)
.user(SHORT_QUERY)
.call()
.content();
bh.consume(response);
}
/**
* Baseline: long-query LLM call
*/
@Benchmark
public void longQueryLLMCall(Blackhole bh) {
String response = chatClient.prompt()
.system(SYSTEM_PROMPT)
.user(LONG_QUERY)
.call()
.content();
bh.consume(response);
}
/**
* Baseline: time to first token of a streaming response
* Approximates TTFT by recording when the first chunk arrives
*/
@Benchmark
public void streamingFirstTokenLatency(Blackhole bh) {
long[] firstTokenTime = {0};
long startTime = System.nanoTime();
chatClient.prompt()
.system(SYSTEM_PROMPT)
.user(LONG_QUERY)
.stream()
.chatResponse()
.doOnNext(chunk -> {
if (firstTokenTime[0] == 0 &&
chunk.getResult() != null &&
chunk.getResult().getOutput().getContent() != null) {
firstTokenTime[0] = System.nanoTime() - startTime;
}
})
.blockLast();
bh.consume(firstTokenTime[0]);
}
/**
* Baseline: full RAG pipeline (embedding + vector search + LLM)
* The closest thing to an end-to-end production benchmark
*/
@Benchmark
public void fullRAGPipeline(Blackhole bh) {
// The full RAG pipeline would run here:
// 1. Generate the query embedding
// 2. Vector-search the top 5 documents
// 3. Build the augmented prompt
// 4. Call the LLM (only this step is shown below)
String response = chatClient.prompt()
.system(SYSTEM_PROMPT)
.user(u -> u.text(SHORT_QUERY))
.call()
.content();
bh.consume(response);
}
}

Baseline Database: Making Performance History Traceable
Benchmark results must not just be printed and thrown away. They need to be persisted so you can do trend analysis and regression detection.
MySQL Schema Design
-- Benchmark suite table
CREATE TABLE benchmark_suite (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
suite_name VARCHAR(100) NOT NULL COMMENT 'Suite name, e.g. embedding-benchmark',
version VARCHAR(50) NOT NULL COMMENT 'Application version',
git_commit VARCHAR(40) COMMENT 'Git commit hash',
git_branch VARCHAR(100) COMMENT 'Git branch',
environment VARCHAR(20) NOT NULL COMMENT 'Environment: local/ci/staging',
java_version VARCHAR(20) COMMENT 'JVM version',
os_info VARCHAR(100) COMMENT 'Operating system info',
cpu_info VARCHAR(200) COMMENT 'CPU model',
memory_gb INT COMMENT 'Memory size (GB)',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_suite_name_created (suite_name, created_at),
INDEX idx_version (version)
) COMMENT='Benchmark suite runs';
-- Benchmark result table
CREATE TABLE benchmark_result (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
suite_id BIGINT NOT NULL COMMENT 'FK to benchmark_suite',
benchmark_class VARCHAR(200) NOT NULL COMMENT 'Fully qualified benchmark class name',
benchmark_method VARCHAR(100) NOT NULL COMMENT 'Benchmark method name',
benchmark_params JSON COMMENT 'Parameter combination, e.g. {"topK":"5","withFilter":"true"}',
mode VARCHAR(20) NOT NULL COMMENT 'JMH mode: AverageTime/Throughput/SampleTime',
score DOUBLE NOT NULL COMMENT 'Benchmark score',
score_error DOUBLE COMMENT 'Error margin',
unit VARCHAR(20) NOT NULL COMMENT 'Unit: ms/ops',
p50 DOUBLE COMMENT 'P50 latency',
p75 DOUBLE COMMENT 'P75 latency',
p90 DOUBLE COMMENT 'P90 latency',
p95 DOUBLE COMMENT 'P95 latency',
p99 DOUBLE COMMENT 'P99 latency',
p999 DOUBLE COMMENT 'P99.9 latency',
min_value DOUBLE COMMENT 'Minimum',
max_value DOUBLE COMMENT 'Maximum',
sample_count INT COMMENT 'Number of samples',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (suite_id) REFERENCES benchmark_suite(id),
INDEX idx_suite_method (suite_id, benchmark_method),
INDEX idx_class_method_created (benchmark_class, benchmark_method, created_at)
) COMMENT='Detailed benchmark results';
-- Performance baseline table (the baseline confirmed at each release)
CREATE TABLE performance_baseline (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
benchmark_class VARCHAR(200) NOT NULL,
benchmark_method VARCHAR(100) NOT NULL,
benchmark_params JSON,
baseline_score DOUBLE NOT NULL COMMENT 'Baseline score',
baseline_p99 DOUBLE COMMENT 'Baseline P99',
regression_threshold_pct DOUBLE DEFAULT 10.0 COMMENT 'Regression threshold (%)',
is_active BOOLEAN DEFAULT TRUE COMMENT 'Whether this is the currently active baseline',
established_by VARCHAR(50) COMMENT 'Established by',
established_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
notes TEXT COMMENT 'Notes',
INDEX idx_class_method (benchmark_class, benchmark_method),
INDEX idx_active (is_active)
) COMMENT='Performance baseline definitions';
-- Performance regression alert table
CREATE TABLE performance_regression_alert (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
suite_id BIGINT NOT NULL,
benchmark_method VARCHAR(100) NOT NULL,
baseline_score DOUBLE NOT NULL,
current_score DOUBLE NOT NULL,
regression_pct DOUBLE NOT NULL COMMENT 'Regression percentage',
severity VARCHAR(10) NOT NULL COMMENT 'WARNING/CRITICAL',
resolved BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (suite_id) REFERENCES benchmark_suite(id)
) COMMENT='Performance regression alerts';

Java Persistence Implementation
package com.laozhang.benchmark.storage;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.results.BenchmarkResult;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;
import java.util.*;
/**
* Persistence layer for benchmark results
*/
@Repository
public class BenchmarkResultRepository {
private final JdbcTemplate jdbcTemplate;
private final ObjectMapper objectMapper;
public BenchmarkResultRepository(JdbcTemplate jdbcTemplate,
ObjectMapper objectMapper) {
this.jdbcTemplate = jdbcTemplate;
this.objectMapper = objectMapper;
}
/**
* Save the results of one complete benchmark run.
*
* @param suiteName suite name
* @param version application version
* @param gitCommit Git commit hash
* @param results JMH run results
* @return the generated suite_id
*/
@Transactional
public Long saveBenchmarkRun(String suiteName, String version,
String gitCommit, Collection<RunResult> results) {
// 1. Create the suite record
Long suiteId = createBenchmarkSuite(suiteName, version, gitCommit);
// 2. Save each individual benchmark result
for (RunResult runResult : results) {
saveSingleResult(suiteId, runResult);
}
return suiteId;
}
private Long createBenchmarkSuite(String suiteName, String version,
String gitCommit) {
String sql = """
INSERT INTO benchmark_suite
(suite_name, version, git_commit, git_branch, environment,
java_version, os_info, cpu_info)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""";
jdbcTemplate.update(sql,
suiteName, version, gitCommit,
System.getenv("GIT_BRANCH"),
System.getenv().getOrDefault("CI", "false").equals("true") ? "ci" : "local",
System.getProperty("java.version"),
System.getProperty("os.name") + " " + System.getProperty("os.version"),
getCpuInfo()
);
return jdbcTemplate.queryForObject(
"SELECT LAST_INSERT_ID()", Long.class);
}
private void saveSingleResult(Long suiteId, RunResult runResult) {
BenchmarkResult primaryResult = runResult.getPrimaryResult();
// Extract the primary statistics
double score = primaryResult.getScore();
double scoreError = primaryResult.getScoreError();
// Extract percentiles (only available in SampleTime mode)
Map<Double, Double> percentiles = extractPercentiles(runResult);
// Serialize the parameter combination
String paramsJson = extractParamsJson(runResult);
String sql = """
INSERT INTO benchmark_result
(suite_id, benchmark_class, benchmark_method, benchmark_params,
mode, score, score_error, unit, p50, p75, p90, p95, p99, p999,
min_value, max_value, sample_count)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""";
jdbcTemplate.update(sql,
suiteId,
runResult.getParams().getBenchmark(),
extractMethodName(runResult.getParams().getBenchmark()),
paramsJson,
runResult.getParams().getMode().name(), // JMH mode (AverageTime/Throughput/SampleTime)
score,
scoreError,
primaryResult.getScoreUnit(),
percentiles.get(0.50),
percentiles.get(0.75),
percentiles.get(0.90),
percentiles.get(0.95),
percentiles.get(0.99),
percentiles.get(0.999),
extractStatistic(runResult, "MIN"),
extractStatistic(runResult, "MAX"),
(int) runResult.getPrimaryResult().getSampleCount()
);
}
/**
* Query the historical performance trend for a given benchmark method
*/
public List<Map<String, Object>> getPerformanceTrend(String benchmarkMethod,
int lastNRuns) {
String sql = """
SELECT
bs.version,
bs.git_commit,
bs.created_at,
br.score,
br.p95,
br.p99
FROM benchmark_result br
JOIN benchmark_suite bs ON br.suite_id = bs.id
WHERE br.benchmark_method = ?
AND bs.environment = 'ci'
ORDER BY bs.created_at DESC
LIMIT ?
""";
return jdbcTemplate.queryForList(sql, benchmarkMethod, lastNRuns);
}
/**
* Fetch the currently active baseline
*/
public Optional<Map<String, Object>> getActiveBaseline(String benchmarkMethod) {
String sql = """
SELECT baseline_score, baseline_p99, regression_threshold_pct
FROM performance_baseline
WHERE benchmark_method = ? AND is_active = TRUE
ORDER BY established_at DESC
LIMIT 1
""";
List<Map<String, Object>> results = jdbcTemplate.queryForList(sql, benchmarkMethod);
return results.isEmpty() ? Optional.empty() : Optional.of(results.get(0));
}
private Map<Double, Double> extractPercentiles(RunResult runResult) {
Map<Double, Double> percentiles = new HashMap<>();
try {
var stats = runResult.getPrimaryResult().getStatistics();
percentiles.put(0.50, stats.getPercentile(50));
percentiles.put(0.75, stats.getPercentile(75));
percentiles.put(0.90, stats.getPercentile(90));
percentiles.put(0.95, stats.getPercentile(95));
percentiles.put(0.99, stats.getPercentile(99));
percentiles.put(0.999, stats.getPercentile(99.9));
} catch (Exception e) {
// Percentiles exist only in SampleTime mode; other modes return an empty map
}
return percentiles;
}
private String extractMethodName(String fullBenchmarkName) {
String[] parts = fullBenchmarkName.split("\\.");
return parts[parts.length - 1];
}
private String extractParamsJson(RunResult runResult) {
try {
Map<String, String> params = new LinkedHashMap<>();
runResult.getParams().getParamsKeys()
.forEach(key -> params.put(key, runResult.getParams().getParam(key)));
return params.isEmpty() ? null : objectMapper.writeValueAsString(params);
} catch (Exception e) {
return null;
}
}
private Double extractStatistic(RunResult runResult, String stat) {
try {
var stats = runResult.getPrimaryResult().getStatistics();
// Statistics exposes min/max directly; no need to approximate via percentiles
return stat.equals("MIN") ? stats.getMin() : stats.getMax();
} catch (Exception e) {
return null;
}
}
private String getCpuInfo() {
try {
// macOS-specific; on Linux, read /proc/cpuinfo instead
Process p = Runtime.getRuntime().exec(new String[]{"sysctl", "-n", "machdep.cpu.brand_string"});
return new String(p.getInputStream().readAllBytes()).trim();
} catch (Exception e) {
return System.getProperty("os.arch");
}
}
}

CI Integration: Run Benchmarks Automatically on Every Commit
GitHub Actions Workflow Configuration
# .github/workflows/benchmark.yml
name: Performance Benchmark
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  # Run the full benchmark suite once a day at 2:00 AM CST
  schedule:
    - cron: '0 18 * * *' # UTC 18:00 = CST 02:00
env:
  JAVA_VERSION: '21'
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  DATABASE_URL: ${{ secrets.BENCHMARK_DB_URL }}
jobs:
  benchmark:
    name: Run Performance Benchmarks
    runs-on: ubuntu-latest
    # Prefer a dedicated performance box; shared runners make results unstable
    # runs-on: [self-hosted, benchmark]
    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_DB: benchmark_test
          POSTGRES_USER: benchmark
          POSTGRES_PASSWORD: ${{ secrets.PG_PASSWORD }}
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history is needed for baseline comparison
      - name: Set up JDK ${{ env.JAVA_VERSION }}
        uses: actions/setup-java@v4
        with:
          java-version: ${{ env.JAVA_VERSION }}
          distribution: 'temurin'
      - name: Cache Maven packages
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-m2-${{ hashFiles('**/pom.xml') }}
      - name: Get Git metadata
        id: git_meta
        run: |
          echo "commit_hash=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
          echo "branch_name=$(git rev-parse --abbrev-ref HEAD)" >> $GITHUB_OUTPUT
          echo "app_version=$(mvn help:evaluate -Dexpression=project.version -q -DforceStdout)" >> $GITHUB_OUTPUT
      - name: Build benchmark JAR
        run: mvn clean package -P benchmark -DskipTests -q
      - name: Initialize test data
        run: |
          psql ${{ env.DATABASE_URL }} -f scripts/benchmark-data-setup.sql
      - name: Run Embedding benchmarks
        run: |
          java -jar target/benchmarks.jar \
            ".*EmbeddingBenchmark.*" \
            -rf json \
            -rff benchmark-results-embedding.json \
            -jvmArgs "-Xms4g -Xmx4g" \
            -foe true
      - name: Run Vector Search benchmarks
        run: |
          java -jar target/benchmarks.jar \
            ".*VectorSearchBenchmark.*" \
            -rf json \
            -rff benchmark-results-vector.json \
            -jvmArgs "-Xms4g -Xmx4g" \
            -foe true
      - name: Save results to database
        run: |
          java -cp target/benchmarks.jar \
            com.laozhang.benchmark.BenchmarkResultSaver \
            --suite-name=full-benchmark \
            --version=${{ steps.git_meta.outputs.app_version }} \
            --git-commit=${{ steps.git_meta.outputs.commit_hash }} \
            --results=benchmark-results-embedding.json,benchmark-results-vector.json
      - name: Check performance regression
        id: regression_check
        run: |
          java -cp target/benchmarks.jar \
            com.laozhang.benchmark.RegressionChecker \
            --suite-name=full-benchmark \
            --fail-on-regression=true \
            --threshold=10 \
            --output=regression-report.json
        continue-on-error: true
      - name: Generate HTML report
        run: |
          java -cp target/benchmarks.jar \
            com.laozhang.benchmark.ReportGenerator \
            --results=benchmark-results-*.json \
            --regression=regression-report.json \
            --output=benchmark-report.html
      - name: Upload benchmark report
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-report-${{ steps.git_meta.outputs.commit_hash }}
          path: |
            benchmark-report.html
            benchmark-results-*.json
            regression-report.json
          retention-days: 90
      - name: Comment PR with benchmark results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('regression-report.json', 'utf8'));
            let comment = '## Benchmark Results\n\n';
            if (report.regressions.length > 0) {
              comment += '### ⚠️ Performance regressions detected\n\n';
              comment += '| Benchmark | Baseline | Current | Regression | Severity |\n';
              comment += '|---------|------|------|---------|--------|\n';
              report.regressions.forEach(r => {
                comment += `| ${r.method} | ${r.baseline}ms | ${r.current}ms | ${r.pct}% | ${r.severity} |\n`;
              });
            } else {
              comment += '### ✅ No performance regression\n\n';
            }
            comment += '\n### Key metrics\n\n';
            comment += '| Benchmark | P50 | P95 | P99 |\n';
            comment += '|---------|-----|-----|-----|\n';
            report.summary.forEach(s => {
              comment += `| ${s.method} | ${s.p50}ms | ${s.p95}ms | ${s.p99}ms |\n`;
            });
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
      - name: Fail if critical regression detected
        if: steps.regression_check.outcome == 'failure'
        run: |
          echo "Critical performance regression detected! Check the benchmark report."
          exit 1

Performance Regression Detection: Statistics, Not Intuition
Simply comparing "current value vs. baseline value" is not enough, because benchmark results carry statistical noise of their own. The right approach is a statistical test.
package com.laozhang.benchmark.regression;
import org.apache.commons.math3.stat.descriptive.SummaryStatistics;
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;
import org.apache.commons.math3.stat.inference.TTest;
import java.util.*;
/**
* Performance regression detector.
*
* Uses statistical methods (Mann-Whitney U test + effect size) to decide whether
* performance has genuinely regressed, instead of a naive percentage comparison,
* reducing both false positives and false negatives.
*/
public class StatisticalRegressionDetector {
private final double significanceLevel; // significance level, typically 0.05
private final double minEffectSize; // minimum effect size (Cohen's d), typically 0.2
private final double regressionThreshold; // regression threshold in percent
public StatisticalRegressionDetector(double significanceLevel,
double minEffectSize,
double regressionThreshold) {
this.significanceLevel = significanceLevel;
this.minEffectSize = minEffectSize;
this.regressionThreshold = regressionThreshold;
}
/**
* Detect whether a statistically significant performance regression exists.
*
* @param baselineSamples baseline samples (milliseconds)
* @param currentSamples current samples (milliseconds)
* @return regression detection result
*/
public RegressionResult detect(double[] baselineSamples,
double[] currentSamples) {
// 1. Compute basic statistics
SummaryStatistics baselineStats = computeStats(baselineSamples);
SummaryStatistics currentStats = computeStats(currentSamples);
double baselineMean = baselineStats.getMean();
double currentMean = currentStats.getMean();
double changePercent = (currentMean - baselineMean) / baselineMean * 100;
// 2. Mann-Whitney U test (non-parametric; does not assume a normal distribution)
MannWhitneyUTest mwTest = new MannWhitneyUTest();
double pValue = mwTest.mannWhitneyUTest(baselineSamples, currentSamples);
boolean isStatisticallySignificant = pValue < significanceLevel;
// 3. Compute the effect size (Cohen's d)
double cohensD = calculateCohensD(baselineSamples, currentSamples);
boolean hasSignificantEffect = Math.abs(cohensD) >= minEffectSize;
// 4. Compute percentile changes
double p95Change = calculatePercentileChange(baselineSamples, currentSamples, 95);
double p99Change = calculatePercentileChange(baselineSamples, currentSamples, 99);
// 5. Combine into a verdict
RegressionSeverity severity = determineSeverity(
changePercent, p99Change, isStatisticallySignificant, hasSignificantEffect);
return RegressionResult.builder()
.baselineMean(baselineMean)
.currentMean(currentMean)
.changePercent(changePercent)
.p95Change(p95Change)
.p99Change(p99Change)
.pValue(pValue)
.cohensD(cohensD)
.isStatisticallySignificant(isStatisticallySignificant)
.hasSignificantEffect(hasSignificantEffect)
.severity(severity)
.isRegression(severity != RegressionSeverity.NONE)
.explanation(generateExplanation(changePercent, pValue, cohensD, severity))
.build();
}
/**
* Cohen's d effect size:
* d < 0.2: negligible
* d 0.2-0.5: small effect
* d 0.5-0.8: medium effect
* d > 0.8: large effect
*/
private double calculateCohensD(double[] group1, double[] group2) {
SummaryStatistics stats1 = computeStats(group1);
SummaryStatistics stats2 = computeStats(group2);
double mean1 = stats1.getMean();
double mean2 = stats2.getMean();
double sd1 = stats1.getStandardDeviation();
double sd2 = stats2.getStandardDeviation();
// Pooled standard deviation
double pooledSD = Math.sqrt(
((group1.length - 1) * sd1 * sd1 + (group2.length - 1) * sd2 * sd2) /
(group1.length + group2.length - 2)
);
return (mean2 - mean1) / pooledSD;
}
private double calculatePercentileChange(double[] baseline, double[] current,
int percentile) {
double baselineP = getPercentile(baseline, percentile);
double currentP = getPercentile(current, percentile);
return (currentP - baselineP) / baselineP * 100;
}
private double getPercentile(double[] data, int percentile) {
double[] sorted = Arrays.copyOf(data, data.length);
Arrays.sort(sorted);
int index = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
}
private RegressionSeverity determineSeverity(double changePercent,
double p99Change,
boolean statisticallySig,
boolean effectSig) {
// Require both statistical significance and a meaningful effect size
if (!statisticallySig || !effectSig) {
return RegressionSeverity.NONE;
}
// P99 regressed > 50% or mean regressed > 30%: critical
if (p99Change > 50 || changePercent > 30) {
return RegressionSeverity.CRITICAL;
}
// P99 regressed > 20% or mean regressed > 15%: warning
if (p99Change > 20 || changePercent > 15) {
return RegressionSeverity.WARNING;
}
// Mean regressed beyond regressionThreshold: informational
if (changePercent > regressionThreshold) {
return RegressionSeverity.INFO;
}
return RegressionSeverity.NONE;
}
private String generateExplanation(double changePercent, double pValue,
double cohensD, RegressionSeverity severity) {
if (severity == RegressionSeverity.NONE) {
return String.format(
"Mean changed %.1f%%, p-value=%.4f (not significant), Cohen's d=%.2f (negligible effect); no action needed",
changePercent, pValue, cohensD);
}
String effectDesc = Math.abs(cohensD) < 0.5 ? "small" :
Math.abs(cohensD) < 0.8 ? "medium" : "large";
return String.format(
"Mean regressed %.1f%%, statistically significant (p=%.4f < 0.05), %s effect size (d=%.2f); investigate",
changePercent, pValue, effectDesc, cohensD);
}
private SummaryStatistics computeStats(double[] data) {
SummaryStatistics stats = new SummaryStatistics();
for (double v : data) stats.addValue(v);
return stats;
}
public enum RegressionSeverity {
NONE, INFO, WARNING, CRITICAL
}
// A small hand-written builder for RegressionResult (kept dependency-free rather than using Lombok's @Builder)
public record RegressionResult(
double baselineMean,
double currentMean,
double changePercent,
double p95Change,
double p99Change,
double pValue,
double cohensD,
boolean isStatisticallySignificant,
boolean hasSignificantEffect,
RegressionSeverity severity,
boolean isRegression,
String explanation
) {
public static Builder builder() { return new Builder(); }
public static class Builder {
private double baselineMean, currentMean, changePercent;
private double p95Change, p99Change, pValue, cohensD;
private boolean isStatisticallySignificant, hasSignificantEffect, isRegression;
private RegressionSeverity severity;
private String explanation;
public Builder baselineMean(double v) { baselineMean = v; return this; }
public Builder currentMean(double v) { currentMean = v; return this; }
public Builder changePercent(double v) { changePercent = v; return this; }
public Builder p95Change(double v) { p95Change = v; return this; }
public Builder p99Change(double v) { p99Change = v; return this; }
public Builder pValue(double v) { pValue = v; return this; }
public Builder cohensD(double v) { cohensD = v; return this; }
public Builder isStatisticallySignificant(boolean v) { isStatisticallySignificant = v; return this; }
public Builder hasSignificantEffect(boolean v) { hasSignificantEffect = v; return this; }
public Builder isRegression(boolean v) { isRegression = v; return this; }
public Builder severity(RegressionSeverity v) { severity = v; return this; }
public Builder explanation(String v) { explanation = v; return this; }
public RegressionResult build() {
return new RegressionResult(baselineMean, currentMean, changePercent,
p95Change, p99Change, pValue, cohensD,
isStatisticallySignificant, hasSignificantEffect,
severity, isRegression, explanation);
}
}
}
}

Vector Database Comparison: Measured Results
The numbers below were measured on a 32-core / 64GB RAM / NVMe SSD server, on datasets of 100K to 10M 1536-dimensional vectors:
| Database | Dataset size | TopK | P50 (ms) | P95 (ms) | P99 (ms) | QPS | Memory |
|---|---|---|---|---|---|---|---|
| PGVector | 100K | 5 | 8 | 35 | 68 | 1,800 | 2.1GB |
| PGVector | 1M | 5 | 45 | 120 | 180 | 650 | 18.5GB |
| PGVector | 10M | 5 | 380 | 950 | 1,420 | 85 | 185GB |
| Qdrant | 100K | 5 | 3 | 8 | 15 | 8,500 | 0.8GB |
| Qdrant | 1M | 5 | 12 | 32 | 45 | 2,800 | 7.2GB |
| Qdrant | 10M | 5 | 28 | 75 | 120 | 1,200 | 68GB |
| Milvus | 100K | 5 | 2 | 5 | 10 | 15,000 | 1.2GB |
| Milvus | 1M | 5 | 8 | 20 | 28 | 5,200 | 11.5GB |
| Milvus | 10M | 5 | 18 | 48 | 72 | 2,800 | 105GB |
Selection guidance:
- Under 500K vectors, team already knows PostgreSQL → PGVector (lowest operational cost)
- 500K-5M vectors, latency-sensitive → Qdrant (best price/performance)
- Over 5M vectors, very high concurrency → Milvus (highest performance ceiling)
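As a sanity check on the memory column, the raw vector payload is rows × dimensions × 4 bytes (float32), and the ANN index adds a multiplier on top. A back-of-the-envelope sketch — the 3x overhead factor here is an assumption for illustration, not a measured constant:

```java
public class VectorCapacityEstimate {

    // Raw float32 payload in GiB: rows * dims * 4 bytes
    public static double rawPayloadGb(long rows, int dims) {
        return rows * (double) dims * 4 / (1024.0 * 1024 * 1024);
    }

    // Estimated total footprint with an assumed index overhead factor
    // (HNSW graph links, metadata); tune the factor to your own measurements
    public static double estimatedFootprintGb(long rows, int dims, double overheadFactor) {
        return rawPayloadGb(rows, dims) * overheadFactor;
    }

    public static void main(String[] args) {
        // 1M rows of 1536-dim vectors: ~5.7GiB raw; with a 3x factor ~17GiB,
        // the same ballpark as the measured PGVector 1M row above
        System.out.printf("raw=%.1fGiB estimated=%.1fGiB%n",
                rawPayloadGb(1_000_000, 1536),
                estimatedFootprintGb(1_000_000, 1536, 3.0));
    }
}
```

This kind of arithmetic will not replace a benchmark, but it catches order-of-magnitude planning mistakes before you provision hardware.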
The Speed/Quality/Cost Triangle of Embedding Models
| Model | Dimensions | Speed (ms/text) | MTEB score | Price ($/M tokens) | Recommended for |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | 15-40 | 62.3 | $0.020 | High concurrency, cost-sensitive |
| text-embedding-3-large | 3072 | 25-60 | 64.6 | $0.130 | High-quality retrieval |
| text-embedding-ada-002 | 1536 | 20-45 | 61.0 | $0.100 | Legacy-system compatibility |
| bge-m3 (local) | 1024 | 8-25* | 63.8 | $0 (GPU cost) | Data must stay on-premises |
| nomic-embed-text (local) | 768 | 5-15* | 62.0 | $0 (GPU cost) | Low-cost local deployment |
*Local model speeds measured on a single A10 GPU
关键洞察: text-embedding-3-small的质量比ada-002高1.3分,成本却低80%。大多数场景应该直接迁移。
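把上表的定价套进一个月度成本估算,迁移收益会更直观(月Token量为假设值,价格以OpenAI实际账单为准):

```java
// 成本估算示意:按月Token量对比两个Embedding模型的费用
public class EmbeddingCostCalculator {

    /** 价格单位:美元/百万Token;monthlyTokensInMillions单位:百万Token */
    public static double monthlyCost(double monthlyTokensInMillions, double pricePerMillion) {
        return monthlyTokensInMillions * pricePerMillion;
    }

    public static void main(String[] args) {
        double monthlyTokens = 500; // 假设每月5亿Token(即500百万)
        double adaCost = monthlyCost(monthlyTokens, 0.100);   // ada-002
        double smallCost = monthlyCost(monthlyTokens, 0.020); // 3-small
        System.out.printf("ada-002: $%.2f, 3-small: $%.2f, 节省: %.0f%%%n",
                adaCost, smallCost, (1 - smallCost / adaCost) * 100);
    }
}
```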
基准报告HTML生成
package com.laozhang.benchmark.report;
import freemarker.template.Configuration;
import freemarker.template.Template;
import java.io.*;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
/**
* HTML基准报告生成器
* 使用FreeMarker模板引擎生成可视化报告
*/
public class BenchmarkReportGenerator {
private final Configuration freemarkerConfig;
public BenchmarkReportGenerator() throws IOException {
this.freemarkerConfig = new Configuration(Configuration.VERSION_2_3_32);
freemarkerConfig.setClassForTemplateLoading(getClass(), "/templates");
freemarkerConfig.setDefaultEncoding("UTF-8");
}
/**
* 生成HTML报告
*/
public void generateReport(List<BenchmarkResultDTO> results,
List<RegressionAlertDTO> regressions,
String outputPath) throws Exception {
Map<String, Object> model = buildTemplateModel(results, regressions);
Template template = freemarkerConfig.getTemplate("benchmark-report.html.ftl");
        // 显式指定UTF-8,避免在默认编码非UTF-8的系统上中文乱码(需Java 11+)
        try (Writer writer = new FileWriter(outputPath, java.nio.charset.StandardCharsets.UTF_8)) {
template.process(model, writer);
}
System.out.println("报告已生成: " + outputPath);
}
private Map<String, Object> buildTemplateModel(
List<BenchmarkResultDTO> results,
List<RegressionAlertDTO> regressions) {
Map<String, Object> model = new HashMap<>();
model.put("generatedAt",
LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")));
model.put("results", results);
model.put("regressions", regressions);
model.put("hasRegressions", !regressions.isEmpty());
model.put("summary", buildSummary(results));
// 按测试类分组
Map<String, List<BenchmarkResultDTO>> byClass = new LinkedHashMap<>();
for (BenchmarkResultDTO r : results) {
byClass.computeIfAbsent(r.getBenchmarkClass(), k -> new ArrayList<>()).add(r);
}
model.put("resultsByClass", byClass);
return model;
}
private Map<String, Object> buildSummary(List<BenchmarkResultDTO> results) {
Map<String, Object> summary = new HashMap<>();
summary.put("totalBenchmarks", results.size());
summary.put("avgScore", results.stream()
.mapToDouble(BenchmarkResultDTO::getScore).average().orElse(0));
return summary;
}
}
FreeMarker模板(benchmark-report.html.ftl)核心片段:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<title>性能基准测试报告 - ${generatedAt}</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<style>
body { font-family: -apple-system, sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; }
.metric-card { background: #f8f9fa; border-radius: 8px; padding: 16px; margin: 8px; display: inline-block; }
.regression-warning { background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; }
.regression-critical { background: #f8d7da; border-left: 4px solid #dc3545; padding: 12px; }
table { width: 100%; border-collapse: collapse; }
th, td { padding: 8px 12px; border: 1px solid #dee2e6; text-align: left; }
th { background: #343a40; color: white; }
</style>
</head>
<body>
<h1>性能基准测试报告</h1>
<p>生成时间:${generatedAt}</p>
<#if hasRegressions>
<div class="regression-warning">
<strong>⚠️ 发现 ${regressions?size} 处性能退化!</strong>
<ul>
<#list regressions as r>
<li>${r.method}: ${r.changePercent}% 退化 (${r.severity})</li>
</#list>
</ul>
</div>
</#if>
<h2>测试结果详情</h2>
<#list resultsByClass?keys as className>
<h3>${className?keep_after_last(".")}</h3>
<table>
<tr><th>方法</th><th>参数</th><th>均值(ms)</th><th>P95(ms)</th><th>P99(ms)</th><th>误差</th></tr>
<#list resultsByClass[className] as r>
<tr>
<td>${r.method}</td>
<td>${r.params!"-"}</td>
<td>${r.score?string["0.00"]}</td>
<td>${r.p95?string["0.00"]}</td>
<td>${r.p99?string["0.00"]}</td>
<td>±${r.scoreError?string["0.00"]}</td>
</tr>
</#list>
</table>
</#list>
</body>
</html>
常见的基准测试陷阱
陷阱1:JVM预热不充分
// 错误:没有预热
@Warmup(iterations = 0) // 千万别这么做!
@Benchmark
public void myBenchmark() { ... }
// 正确:给JIT足够的时间
@Warmup(iterations = 5, time = 10, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 20, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2) // 至少Fork 2次,排除JVM启动差异
实测影响: 不预热的结果可能比稳定值高3-8倍。
陷阱2:测试在共享环境跑
CI的共享Runner受其他任务CPU抢占影响,结果波动可达±30%。
解决方案: 专用benchmark服务器,隔离网络和CPU。
陷阱3:忽略网络延迟的波动
LLM API调用的延迟主要由网络+模型决定,你的代码时间占比不足5%。在测LLM时要:
- 多次采样(至少50次)
- 用百分位数而非均值
- 区分首Token延迟(TTFT, Time To First Token)和端到端完整响应延迟
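"用百分位数而非均值"这一步可以用下面的小工具实现(最近秩法的简化版,仅作示意;如果用JMH的Mode.SampleTime,框架会直接输出百分位):

```java
import java.util.Arrays;

// 从延迟采样中计算百分位数(最近秩法,简化实现)
public class LatencyPercentiles {

    /** percentile取值0~100,使用最近秩(nearest-rank)法 */
    public static double percentile(double[] samples, double percentile) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 模拟50次LLM调用延迟(ms),含长尾
        double[] latencies = new double[50];
        for (int i = 0; i < 50; i++) latencies[i] = 100 + i * 10; // 100~590ms
        System.out.println("P50 = " + percentile(latencies, 50));
        System.out.println("P95 = " + percentile(latencies, 95));
        System.out.println("P99 = " + percentile(latencies, 99));
    }
}
```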
陷阱4:对象创建在测试循环内
// 错误:每次测试都重新创建客户端(320ms开销!)
@Benchmark
public void wrongWay() {
EmbeddingModel model = new OpenAiEmbeddingModel(...); // 每次都创建!
model.embed("test");
}
// 正确:使用@State在Setup阶段初始化
@State(Scope.Benchmark)
public class BenchmarkState {
EmbeddingModel model;
@Setup(Level.Trial)
public void setup() {
model = new OpenAiEmbeddingModel(...); // 只初始化一次
}
}
陷阱5:忽略Dead Code Elimination
// 错误:JVM可能优化掉没有使用的结果
@Benchmark
public void deadCode() {
embeddingModel.embed("test"); // 结果没有被消费,可能被JIT消除
}
// 正确:使用Blackhole消费结果
@Benchmark
public void correct(Blackhole bh) {
bh.consume(embeddingModel.embed("test"));
}
陷阱6:基准测试数据过于理想
真实业务的查询文本是多样的,如果用固定文本测试:
- 缓存命中率虚高
- 结果分布失真
正确做法: 使用从生产日志采样的真实查询集合(脱敏后)。
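"从生产日志采样"在基准代码里可以落成一个可复现的查询采样器,下面是一个最小示意(文件路径为假设,实际接入JMH时在@State的@Setup里初始化):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Random;

// 从(脱敏后的)生产查询样本中随机取查询,避免固定文本导致缓存命中率虚高
public class QuerySampler {
    private final List<String> queries;
    private final Random random;

    public QuerySampler(List<String> queries, long seed) {
        if (queries.isEmpty()) throw new IllegalArgumentException("查询样本不能为空");
        this.queries = queries;
        this.random = new Random(seed); // 固定种子,保证多次基准运行可复现
    }

    /** 从文件加载,每行一条查询(路径仅为示例) */
    public static QuerySampler fromFile(Path path, long seed) throws IOException {
        return new QuerySampler(Files.readAllLines(path, StandardCharsets.UTF_8), seed);
    }

    public String next() {
        return queries.get(random.nextInt(queries.size()));
    }
}
```

固定随机种子是刻意的:查询分布要真实,但两次基准运行之间的查询序列必须一致,否则结果不可对比。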
完整的性能基线建立流程
建立基线的六个步骤:
- 第一步: 在稳定环境跑5次基准,取中位数作为初始基线
- 第二步: 设置合理的退化阈值(均值+10%,P99+20%)
- 第三步: 接入CI,每次main分支合并触发
- 第四步: 运行2-4周,用滚动均值消除自然抖动
- 第五步: 重大变更后(模型升级、架构调整)人工确认新基线
- 第六步: 定期审查告警记录,调整阈值避免"狼来了"
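上面第一、二步可以落成一小段工具代码(阈值直接取文中的均值+10%、P99+20%;数字仅作示意):

```java
import java.util.Arrays;

// 基线管理示意:取多次运行的中位数作为基线,并按阈值判断是否退化
public class BaselineChecker {

    /** 取中位数作为基线,消除单次运行的抖动 */
    public static double median(double[] runs) {
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    /** 均值超基线10%或P99超基线20%即判定退化 */
    public static boolean isRegression(double baselineMean, double currentMean,
                                       double baselineP99, double currentP99) {
        return currentMean > baselineMean * 1.10 || currentP99 > baselineP99 * 1.20;
    }

    public static void main(String[] args) {
        double[] fiveRuns = {118, 120, 122, 119, 121}; // 5次基准的均值(ms)
        double baseline = median(fiveRuns);
        System.out.println("初始基线: " + baseline + "ms");
        System.out.println("本次132.5ms是否退化: " + isRegression(baseline, 132.5, 300, 310));
    }
}
```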
FAQ
Q:JMH测试要跑多久?太慢了CI会超时。
A:做一个分层策略。快速基准(smoke benchmark)只跑关键路径,3-5分钟完成,每次PR触发。完整基准(full benchmark)每天定时跑,允许30分钟。CI配置参考:
# PR合并:只跑快速基准
- if: github.event_name == 'pull_request'
run: java -jar benchmarks.jar ".*SmokeTest.*" -wi 2 -i 3 -f 1
# 每日定时:完整基准
- if: github.event_name == 'schedule'
run: java -jar benchmarks.jar ".*" -wi 5 -i 10 -f 2
Q:LLM API有速率限制,基准测试怎么处理?
A:两个思路。一是使用Mock LLM Client(固定返回预设响应),测试除LLM外的所有环节性能。二是真实API调用,但控制并发数和采样频率,并在测试账号下运行(不影响生产quota)。
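第一个思路的Mock客户端可以很简单,下面是一个最小示意(注意:这里的类和embed方法是为演示自定义的简化版,并非Spring AI的EmbeddingModel接口;实际接入时用它替换真实的Embedding Bean):

```java
import java.util.Random;

// Mock Embedding客户端示意:返回确定性的伪向量,
// 用于在不调用真实API的情况下基准测试LLM之外的所有环节
public class MockEmbeddingClient {
    private final int dimensions;

    public MockEmbeddingClient(int dimensions) {
        this.dimensions = dimensions;
    }

    /** 用文本hash做种子,同一文本永远返回同一向量,保证测试可复现 */
    public float[] embed(String text) {
        Random r = new Random(text.hashCode());
        float[] vector = new float[dimensions];
        for (int i = 0; i < dimensions; i++) {
            vector[i] = r.nextFloat() * 2 - 1; // 分量落在[-1, 1)
        }
        return vector;
    }
}
```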
Q:向量数据库里的测试数据怎么准备?
A:使用benchmark-data-setup脚本生成符合真实分布的测试向量。不要用随机向量,要用真实的Embedding结果(来自生产数据的匿名化样本),否则索引特性与真实场景不符,测试结果会有偏差。
Q:基线应该用哪个分支/版本建立?
A:以main分支的最近稳定Release版本为基线。每次大版本发布后,人工Review基准结果,确认后标记为新基线。永远不要自动更新基线,必须人工确认。
Q:不同机器上跑的基准结果能对比吗?
A:不能直接对比绝对值。要做的是:在同一台机器上建立基线,在同一台机器上做后续对比。如果必须跨机器对比,要记录机器规格,通过归一化(相对某个固定参考值的比例)来比较。
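"归一化"具体做法可以很简单:两台机器各跑同一个固定参考基准,之后所有结果都除以本机参考值,再比较比值(示意):

```java
// 跨机器归一化示意:各机器以同一参考基准的本机得分为分母
public class ScoreNormalizer {

    /** 返回相对参考基准的比值,比值才可跨机器比较 */
    public static double normalize(double score, double referenceScore) {
        return score / referenceScore;
    }

    public static void main(String[] args) {
        // 机器A:参考基准100ms,被测方法250ms → 比值2.5
        // 机器B:参考基准80ms,被测方法200ms → 比值2.5(两台机器表现一致)
        System.out.println(normalize(250, 100));
        System.out.println(normalize(200, 80));
    }
}
```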
总结
陈明的团队在那次事故之后,花了两个月时间建立了完整的性能基准体系。六个月后,他们的第43次迭代发现了一个即将上线的P99退化问题(从120ms退化到480ms),在CI阶段被拦截,避免了又一次线上事故。
建立性能基准不是一次性的工作,是持续的工程实践:
- JMH微基准 → 精确测量核心组件
- 基线数据库 → 让历史可查询、可对比
- CI集成 → 每次变更自动验证
- 统计回归检测 → 消除噪音,发现真正的退化
- 可视化报告 → 让结果一目了然
性能基准的本质是:把"感觉没问题"变成"数据证明没问题"。
