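A/B Testing RAG Systems: Data-Driven Optimization of Retrieval Quality
"It clearly felt better, so why are users still complaining?"
In November 2025, Su Lei was asked a question at his company's code review meeting that left him speechless.
Su Lei is an AI engineer at a SaaS company, responsible for optimizing the internal knowledge-base Q&A system. Three weeks earlier he had switched the chunking strategy from "fixed 500-character splits" to "semantic recursive chunking" and raised the number of retrieved chunks from top-3 to top-5. Top-5 recall rose from 68% to 73%, a figure he measured on his own test set of 30 questions.
He shipped the change with full confidence.
Two weeks after launch, however, customer support was receiving more complaints, not fewer. Users said "the AI seems to have gotten dumber" and "it often gives irrelevant answers".
At the code review meeting his colleague Lao Zhou asked: "Did you run an A/B test before launching?"
Su Lei paused, then said: "I did test. On my own test set, recall went up by 5 percentage points."
Lao Zhou pressed on: "Where did your test set come from? How many questions does it have? Does it represent the distribution of real user queries?"
Su Lei had no answer. He had made up the 30 test questions off the top of his head. All of them were "textbook" technical questions; none covered the vague queries, multi-turn follow-ups, or dialect phrasing that real users produce.
His change really did help on his test set, but it made things worse for real queries: the new chunking strategy performed poorly on short documents, and a large share of their users happened to be searching short documents.
Su Lei spent the next two weeks building a proper A/B testing framework, and from then on every RAG optimization had to reach statistical significance before it shipped. This article is a complete record of that framework.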
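The bottom line (TL;DR)
| Aspect | Without A/B testing | With A/B testing |
|---|---|---|
| Basis for optimization decisions | Gut feeling + small-sample tests | Statistically significant user-behavior data |
| Risk of a failed optimization | High (roughly 40% of changes turn out worse after launch) | Low (failed changes are filtered out during the experiment) |
| Speed of problem discovery | Slow (you find out from user complaints) | Fast (metric changes are visible during the experiment) |
| Multivariate optimization | Hard (only one variable at a time) | Possible (multivariate experiments) |
| Launch confidence | Low | High |
Key takeaways:
- RAG optimizations must be validated with A/B tests; gut feeling is not reliable
- Core metrics: user satisfaction, source click rate, number of conversation turns (fewer is better)
- Traffic split: 50% control, 50% treatment (for a cautious small-traffic rollout, use 90/10 with 10% on treatment)
- Statistical significance: p < 0.05 with a sufficient sample (at least 200 sessions per group as a floor; the real requirement depends on the baseline rate and the effect size you want to detect)
- Look at absolute values, not relative ones: a real 1% improvement is worth more than a 20% swing that is statistical noise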
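Why RAG optimization so often "feels right but measures wrong"
Su Lei's problem has a name: overfitting on an internal test set.
He is not alone. Based on the author's conversations with several teams, roughly 60% of RAG engineers "optimize" against a small, hand-built test set. Such test sets share several problems:
Problem 1: The sample does not represent the real distribution
Engineers tend to write "standard, clear" test questions, whereas real user queries are often vague, contain typos, and include dialect.
Problem 2: The sample is too small
With 30 test questions, even a 5-point gain in accuracy amounts to a difference of 1.5 questions. That can easily be random fluctuation and is not statistically significant.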
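To put a rough number on Problem 2, here is a back-of-the-envelope margin of error for the 68% baseline from the opening story. It assumes a simple normal approximation, which is only indicative at n = 30:

```latex
\[
\mathrm{SE} = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
            = \sqrt{\frac{0.68 \times 0.32}{30}} \approx 0.085,
\qquad
1.96 \times \mathrm{SE} \approx 0.167
\]
```

Under that assumption the 95% interval around 68% spans roughly 17 percentage points in each direction, so a 5-point gain sits well inside the noise.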
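Problem 3: The test environment differs from production
Tests run against a static knowledge base, while the production knowledge base is updated continuously; tests are single-turn Q&A, while production has multi-turn conversations.
Problem 4: Metric selection bias
"Recall" and "accuracy" are engineering metrics, not user metrics. What users actually care about is: was my problem solved? An A/B test must use user-behavior metrics (click rate, satisfaction, follow-up rate).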
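Design principles for RAG A/B testing
Choosing the experiment unit
In a RAG A/B test the experiment unit should be the session, not the individual query.
Reason: queries from the same user are correlated. If one session sometimes uses strategy A and sometimes strategy B, the experience becomes inconsistent and the "conversation turns" metric can no longer be computed cleanly.
Correct:
User A's entire session uses strategy A
User B's entire session uses strategy B
Wrong:
User A's first question uses strategy A and the second question uses strategy B
Traffic allocation strategy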
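One common way to keep a whole session on one strategy is deterministic hash bucketing. The sketch below is a minimal illustration, not the article's production code: `SessionBucketing` and its method names are hypothetical, and MD5 is used only as a convenient stable hash. It also includes a quick check that the observed split is close to the configured one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of the article's framework. */
final class SessionBucketing {

    /** Maps a session ID to a stable bucket in [0, 99], salted by the experiment ID. */
    static int bucketOf(String sessionId, String experimentId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest((experimentId + ":" + sessionId).getBytes(StandardCharsets.UTF_8));
            int hash = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            // floorMod never returns a negative value, unlike Math.abs(hash) % 100,
            // which goes negative when hash == Integer.MIN_VALUE.
            return Math.floorMod(hash, 100);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** True when the session belongs to the control group. */
    static boolean isControl(String sessionId, String experimentId, int controlPercent) {
        return bucketOf(sessionId, experimentId) < controlPercent;
    }

    /** Share of n synthetic sessions that land in control: a sanity check of the split. */
    static double observedControlShare(String experimentId, int controlPercent, int n) {
        int control = 0;
        for (int i = 0; i < n; i++) {
            if (isControl("session-" + i, experimentId, controlPercent)) {
                control++;
            }
        }
        return (double) control / n;
    }
}
```

Because the bucket depends only on the session ID and the experiment ID, every query in a session gets the same answer without any shared state, and salting with the experiment ID keeps group membership uncorrelated across experiments.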
Experiment framework design
Database schema design
-- Experiment configuration table
CREATE TABLE ab_experiments (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    experiment_id VARCHAR(64) NOT NULL UNIQUE, -- e.g. "hyde_vs_baseline_20260501"
    name VARCHAR(255) NOT NULL,
    description TEXT,
    status ENUM('DRAFT', 'RUNNING', 'PAUSED', 'COMPLETED') NOT NULL DEFAULT 'DRAFT',
    traffic_split JSON NOT NULL, -- {"control": 50, "treatment": 50}
    config_a JSON NOT NULL, -- control group configuration
    config_b JSON NOT NULL, -- treatment group configuration
    start_time DATETIME,
    end_time DATETIME,
    created_by VARCHAR(64),
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_status (status),
    INDEX idx_start_time (start_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Experiment assignment table
CREATE TABLE experiment_assignments (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    experiment_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(64) NOT NULL,
    session_id VARCHAR(64) NOT NULL,
    group_name VARCHAR(32) NOT NULL, -- "CONTROL" or "TREATMENT" (the Java enum name)
    assigned_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uk_session_experiment (session_id, experiment_id),
    INDEX idx_user_experiment (user_id, experiment_id),
    INDEX idx_experiment_group (experiment_id, group_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Query event table (one row per query)
CREATE TABLE query_events (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    event_id VARCHAR(64) NOT NULL UNIQUE,
    experiment_id VARCHAR(64),
    group_name VARCHAR(32),
    session_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(64),
    query TEXT NOT NULL,
    answer TEXT,
    retrieved_docs JSON, -- list of retrieved documents
    latency_ms INT,
    -- User feedback (updated asynchronously)
    thumbs_up BOOLEAN, -- user gave a thumbs-up
    thumbs_down BOOLEAN, -- user gave a thumbs-down
    thumbs_down_reason TEXT, -- optional free-text reason recorded with a thumbs-down
    cited_sources BOOLEAN, -- user clicked a cited source
    follow_up_query TEXT, -- the user's follow-up (the previous answer did not meet the need)
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_session (session_id),
    INDEX idx_experiment_group (experiment_id, group_name),
    INDEX idx_created_at (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Session summary table (aggregated when a session ends)
CREATE TABLE session_summaries (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    session_id VARCHAR(64) NOT NULL UNIQUE,
    experiment_id VARCHAR(64),
    group_name VARCHAR(32),
    user_id VARCHAR(64),
    query_count INT NOT NULL DEFAULT 0, -- number of queries in the session
    thumbs_up_count INT NOT NULL DEFAULT 0,
    thumbs_down_count INT NOT NULL DEFAULT 0,
    source_click_count INT NOT NULL DEFAULT 0,
    follow_up_count INT NOT NULL DEFAULT 0, -- number of follow-ups (fewer is better)
    -- Derived metrics
    satisfaction_rate DECIMAL(5,4), -- thumbs_up / (thumbs_up + thumbs_down)
    source_click_rate DECIMAL(5,4), -- source_click / query_count
    follow_up_rate DECIMAL(5,4), -- follow_up / query_count
    session_duration_seconds INT,
    started_at DATETIME,
    ended_at DATETIME,
    INDEX idx_experiment_group (experiment_id, group_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Java implementation: a Feature-Flag-based A/B testing system
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.laozhang.rag</groupId>
    <artifactId>rag-ab-testing</artifactId>
    <version>1.0.0</version>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.5</version>
    </parent>
    <properties>
        <java.version>21</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-redis</artifactId>
        </dependency>
        <!-- The 1.0.0 GA release of Spring AI renamed the OpenAI starter;
             check that it is compatible with the Spring Boot version you use. -->
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
            <version>1.0.0</version>
        </dependency>
        <!-- Statistics -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-math3</artifactId>
            <version>3.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <!-- Spring Boot 3.x manages the MySQL driver under these coordinates -->
        <dependency>
            <groupId>com.mysql</groupId>
            <artifactId>mysql-connector-j</artifactId>
            <scope>runtime</scope>
        </dependency>
    </dependencies>
</project>
Experiment configuration model
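package com.laozhang.rag.experiment;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.time.LocalDateTime;
import java.util.Map;

/**
 * Experiment configuration.
 * Defines the two configurations compared in an A/B test.
 */
@Data
public class ExperimentConfig {

    private String experimentId;
    private String name;
    private ExperimentStatus status;

    /** Control group (A): the current production configuration. */
    private RagConfig controlConfig;

    /** Treatment group (B): the new configuration under test. */
    private RagConfig treatmentConfig;

    /** Traffic split: percentage of traffic sent to control (0-100). */
    private int controlTrafficPercent;

    private LocalDateTime startTime;
    private LocalDateTime endTime;

    /**
     * RAG strategy configuration.
     * @Builder is required by the RagConfig.builder() calls in the multivariate example;
     * @Builder.Default keeps the field defaults when the builder is used.
     */
    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class RagConfig {

        // Retrieval settings
        @Builder.Default
        @JsonProperty("top_k")
        private int topK = 5;

        @Builder.Default
        @JsonProperty("similarity_threshold")
        private double similarityThreshold = 0.65;

        // Retrieval strategy: vector / hybrid / hyde
        @Builder.Default
        @JsonProperty("retrieval_strategy")
        private String retrievalStrategy = "vector";

        // HyDE settings
        @Builder.Default
        @JsonProperty("hyde_enabled")
        private boolean hydeEnabled = false;

        // Chunking settings (these affect index construction and are normally
        // not switched at runtime inside an A/B test)
        @Builder.Default
        @JsonProperty("chunk_size")
        private int chunkSize = 512;

        @Builder.Default
        @JsonProperty("chunk_overlap")
        private int chunkOverlap = 50;

        // Reranker settings
        @Builder.Default
        @JsonProperty("reranker_enabled")
        private boolean rerankerEnabled = false;

        @Builder.Default
        @JsonProperty("reranker_top_k")
        private int rerankerTopK = 3;

        // Other optional settings
        private Map<String, Object> additionalConfig;
    }

    public enum ExperimentStatus {
        DRAFT, RUNNING, PAUSED, COMPLETED
    }
}
Experiment assignment service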
package com.laozhang.rag.experiment;

import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.util.Optional;

/**
 * Experiment assignment service.
 *
 * Responsibilities:
 * 1. Decide which experiment group a user belongs to
 * 2. Keep the same user in the same group for a given experiment (stable at session granularity)
 * 3. Support several experiments running at the same time
 */
@Slf4j
@Service
public class ExperimentAssignmentService {

    private final StringRedisTemplate redisTemplate;
    private final ExperimentRepository experimentRepo;
    private final AssignmentRepository assignmentRepo;

    // Redis key prefix
    private static final String ASSIGNMENT_CACHE_PREFIX = "exp:assignment:";
    private static final Duration ASSIGNMENT_TTL = Duration.ofDays(7);

    public ExperimentAssignmentService(
            StringRedisTemplate redisTemplate,
            ExperimentRepository experimentRepo,
            AssignmentRepository assignmentRepo) {
        this.redisTemplate = redisTemplate;
        this.experimentRepo = experimentRepo;
        this.assignmentRepo = assignmentRepo;
    }

    /**
     * Returns the user's group in the given experiment.
     *
     * @param experimentId experiment ID
     * @param sessionId    session ID (the smallest unit of assignment)
     * @param userId       user ID (optional, keeps assignment consistent across sessions)
     */
    public ExperimentGroup getAssignment(String experimentId, String sessionId, String userId) {
        // 1. Check the Redis cache first
        String cacheKey = ASSIGNMENT_CACHE_PREFIX + experimentId + ":" + sessionId;
        String cachedGroup = redisTemplate.opsForValue().get(cacheKey);
        if (cachedGroup != null) {
            return ExperimentGroup.valueOf(cachedGroup);
        }

        // 2. Check the database for an existing assignment
        Optional<ExperimentAssignment> existing = assignmentRepo
                .findByExperimentIdAndSessionId(experimentId, sessionId);
        if (existing.isPresent()) {
            ExperimentGroup group = ExperimentGroup.valueOf(existing.get().getGroupName());
            // Write back to the cache
            redisTemplate.opsForValue().set(cacheKey, group.name(), ASSIGNMENT_TTL);
            return group;
        }

        // 3. First visit: assign a group
        ExperimentConfig config = experimentRepo.findByExperimentId(experimentId)
                .orElseThrow(() -> new ExperimentNotFoundException(experimentId));
        if (config.getStatus() != ExperimentConfig.ExperimentStatus.RUNNING) {
            // Experiment is not running: fall back to control
            return ExperimentGroup.CONTROL;
        }
        ExperimentGroup group = allocateGroup(sessionId, userId, config);

        // 4. Persist the assignment. Two concurrent first requests for the same session
        //    can both reach this point; the unique key (session_id, experiment_id) rejects
        //    the second insert, which is harmless because allocation is deterministic.
        ExperimentAssignment assignment = new ExperimentAssignment();
        assignment.setExperimentId(experimentId);
        assignment.setSessionId(sessionId);
        assignment.setUserId(userId);
        assignment.setGroupName(group.name());
        assignmentRepo.save(assignment);

        // 5. Write to the Redis cache
        redisTemplate.opsForValue().set(cacheKey, group.name(), ASSIGNMENT_TTL);
        log.debug("Assigned session {} to group {} in experiment {}",
                sessionId, group, experimentId);
        return group;
    }

    /**
     * Allocates a group by hash-and-modulo: the same input always maps to the
     * same group (deterministic), and buckets are evenly distributed.
     */
    private ExperimentGroup allocateGroup(String sessionId, String userId,
                                          ExperimentConfig config) {
        // Hash the userId when present (consistent across sessions), otherwise the sessionId
        String hashInput = (userId != null && !userId.isBlank()) ? userId : sessionId;
        // floorMod avoids the negative bucket that Math.abs(hash) % 100 yields
        // when the hash equals Integer.MIN_VALUE
        int bucket = Math.floorMod(stableHash(hashInput + config.getExperimentId()), 100); // 0-99

        // The first controlTrafficPercent buckets belong to control
        if (bucket < config.getControlTrafficPercent()) {
            return ExperimentGroup.CONTROL;
        } else {
            return ExperimentGroup.TREATMENT;
        }
    }

    /**
     * Stable hash built from the first four bytes of an MD5 digest.
     * (The original named this murmurHash, but it has always been MD5-based.)
     */
    private int stableHash(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] bytes = md.digest(input.getBytes(StandardCharsets.UTF_8));
            int result = 0;
            for (int i = 0; i < 4; i++) {
                result = (result << 8) | (bytes[i] & 0xFF);
            }
            return result;
        } catch (NoSuchAlgorithmException e) {
            return input.hashCode();
        }
    }

    public enum ExperimentGroup {
        CONTROL, TREATMENT
    }
}
RAG strategy router
package com.laozhang.rag.experiment;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

/**
 * RAG request router.
 *
 * Routes each request to a RAG strategy according to its experiment group
 * and records a query event for later statistical analysis.
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class RagStrategyRouter {

    private final ExperimentAssignmentService assignmentService;
    private final ExperimentRepository experimentRepo;
    private final QueryEventRepository eventRepo;
    private final BaselineRagService baselineRag;
    private final HydeRagService hydeRag;
    private final HybridRagService hybridRag;

    /** Routes and executes a RAG query. */
    public RagResult query(String question, String sessionId, String userId) {
        long startTime = System.currentTimeMillis();

        // Load all running experiments
        List<ExperimentConfig> runningExperiments = experimentRepo.findByStatus(
                ExperimentConfig.ExperimentStatus.RUNNING
        );
        if (runningExperiments.isEmpty()) {
            // No running experiment: use the default strategy
            return executeWithDefaultStrategy(question);
        }

        // Take the first running experiment (production should avoid overlapping experiments)
        ExperimentConfig experiment = runningExperiments.get(0);

        // Resolve the user's group
        ExperimentAssignmentService.ExperimentGroup group = assignmentService.getAssignment(
                experiment.getExperimentId(), sessionId, userId
        );

        // Pick the configuration for that group
        ExperimentConfig.RagConfig ragConfig =
                group == ExperimentAssignmentService.ExperimentGroup.CONTROL
                        ? experiment.getControlConfig()
                        : experiment.getTreatmentConfig();

        // Execute RAG
        RagResult result = executeWithConfig(question, ragConfig);
        result.setLatencyMs(System.currentTimeMillis() - startTime);

        // Record the query event asynchronously
        recordQueryEvent(experiment.getExperimentId(), group.name(), sessionId,
                userId, question, result);
        return result;
    }

    /** Default strategy: the baseline path with RagConfig's default values. */
    private RagResult executeWithDefaultStrategy(String question) {
        return executeWithConfig(question, new ExperimentConfig.RagConfig());
    }

    private RagResult executeWithConfig(String question, ExperimentConfig.RagConfig config) {
        return switch (config.getRetrievalStrategy()) {
            case "hyde" -> hydeRag.query(question, config.getTopK());
            case "hybrid" -> hybridRag.query(question, config.getTopK());
            default -> baselineRag.query(question, config.getTopK(),
                    config.getSimilarityThreshold());
        };
    }

    private void recordQueryEvent(String experimentId, String groupName,
                                  String sessionId, String userId,
                                  String question, RagResult result) {
        QueryEvent event = new QueryEvent();
        event.setEventId(UUID.randomUUID().toString());
        event.setExperimentId(experimentId);
        event.setGroupName(groupName);
        event.setSessionId(sessionId);
        event.setUserId(userId);
        event.setQuery(question);
        event.setAnswer(result.getAnswer());
        event.setLatencyMs((int) result.getLatencyMs());
        event.setRetrievedDocs(result.getSources());

        // Save asynchronously so the main flow is not blocked; log failures
        // instead of letting them disappear inside the future
        CompletableFuture.runAsync(() -> eventRepo.save(event))
                .exceptionally(ex -> {
                    log.warn("Failed to save query event {}", event.getEventId(), ex);
                    return null;
                });
    }
}
Metric collection: recording real user feedback
package com.laozhang.rag.experiment;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Optional;

/**
 * User feedback collection service.
 *
 * Collects three kinds of signal:
 * 1. Explicit feedback: thumbs-up / thumbs-down
 * 2. Implicit feedback: whether a source link was clicked
 * 3. Behavioral feedback: whether the user asked a follow-up
 *    (the previous answer did not meet the need)
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class FeedbackCollector {

    private final QueryEventRepository eventRepo;
    private final SessionSummaryRepository summaryRepo;

    /** Records a thumbs-up. */
    public void recordThumbsUp(String eventId) {
        // Arguments mirror recordThumbsDown: (eventId, thumbsUp, thumbsDown, citedSources)
        eventRepo.updateFeedback(eventId, true, false, false);
        log.info("Thumbs up for event: {}", eventId);
    }

    /** Records a thumbs-down and, when given, the reason. */
    public void recordThumbsDown(String eventId, String reason) {
        eventRepo.updateFeedback(eventId, false, true, false);
        // Keep the reason for later analysis
        if (reason != null && !reason.isBlank()) {
            eventRepo.updateFeedbackReason(eventId, reason);
        }
        log.info("Thumbs down for event: {}, reason: {}", eventId, reason);
    }

    /**
     * Records a click on a source link.
     * This is a positive signal: the user found the answer worth checking.
     */
    public void recordSourceClick(String eventId, String sourceDocId) {
        eventRepo.updateSourceClick(eventId, true);
        log.debug("Source clicked for event: {}, doc: {}", eventId, sourceDocId);
    }

    /**
     * Records a follow-up.
     * A similar question asked again in the same session means the previous
     * answer did not meet the need, so this is a negative signal.
     */
    public void recordFollowUp(String sessionId, String currentEventId,
                               String followUpQuery) {
        // Find the previous event in the same session
        Optional<QueryEvent> previousEvent = eventRepo
                .findPreviousEventInSession(sessionId, currentEventId);
        previousEvent.ifPresent(prev -> {
            prev.setFollowUpQuery(followUpQuery);
            eventRepo.save(prev);
        });
    }

    /** Aggregates metrics when a session ends. */
    public void summarizeSession(String sessionId) {
        List<QueryEvent> events = eventRepo.findBySessionId(sessionId);
        if (events.isEmpty()) return;

        QueryEvent firstEvent = events.get(0);
        QueryEvent lastEvent = events.get(events.size() - 1);

        long thumbsUpCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getThumbsUp())).count();
        long thumbsDownCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getThumbsDown())).count();
        long sourceClickCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getCitedSources())).count();
        long followUpCount = events.stream().filter(e -> e.getFollowUpQuery() != null).count();

        long totalFeedback = thumbsUpCount + thumbsDownCount;
        // null means "no feedback in this session", which is different from 0% satisfaction
        Double satisfactionRate = totalFeedback > 0 ? (double) thumbsUpCount / totalFeedback : null;
        double sourceClickRate = (double) sourceClickCount / events.size();
        double followUpRate = (double) followUpCount / events.size();

        SessionSummary summary = new SessionSummary();
        summary.setSessionId(sessionId);
        summary.setExperimentId(firstEvent.getExperimentId());
        summary.setGroupName(firstEvent.getGroupName());
        summary.setUserId(firstEvent.getUserId());
        summary.setQueryCount(events.size());
        summary.setThumbsUpCount((int) thumbsUpCount);
        summary.setThumbsDownCount((int) thumbsDownCount);
        summary.setSourceClickCount((int) sourceClickCount);
        summary.setFollowUpCount((int) followUpCount);
        summary.setSatisfactionRate(satisfactionRate);
        summary.setSourceClickRate(sourceClickRate);
        summary.setFollowUpRate(followUpRate);
        // Assumption: findBySessionId returns events ordered by created_at, and both
        // entities expose these accessors (the entity classes are not shown in this article)
        summary.setStartedAt(firstEvent.getCreatedAt());
        summary.setEndedAt(lastEvent.getCreatedAt());
        summaryRepo.save(summary);
    }
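}
Statistical significance testing: how to tell whether B is really better than A
"The treatment group's satisfaction is 82% and the control group's is 80%. That is a 2-point improvement, so can we ship?"
Not necessarily.
That 2-point difference may be nothing more than random fluctuation. A statistical test is needed to decide whether the difference is significant.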
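The arithmetic behind that question can be sketched in plain Java. This is a hand-rolled two-proportion z-test for illustration only: the class name `TwoProportionZTest` is hypothetical, and the normal tail is computed with the Abramowitz-Stegun approximation of erf rather than with a statistics library.

```java
/** Hypothetical stand-alone helper for illustration. */
final class TwoProportionZTest {

    /** Abramowitz-Stegun 7.1.26 approximation of erf(x) for x >= 0 (absolute error about 1.5e-7). */
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        return 1.0 - poly * Math.exp(-x * x);
    }

    /** Two-sided p-value for the null hypothesis that both groups share one success rate. */
    static double pValue(long successA, long totalA, long successB, long totalB) {
        double pA = (double) successA / totalA;
        double pB = (double) successB / totalB;
        double pooled = (double) (successA + successB) / (totalA + totalB);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB));
        if (se == 0) {
            return 1.0; // all successes or all failures: no evidence of a difference
        }
        double z = Math.abs(pB - pA) / se;
        return 1.0 - erf(z / Math.sqrt(2)); // equals 2 * (1 - Phi(|z|))
    }
}
```

With 200 feedback events per group, `pValue(160, 200, 164, 200)` comes out at roughly 0.6, nowhere near significance. The same 80% versus 82% with 7,000 events per group, `pValue(5600, 7000, 5740, 7000)`, gives roughly 0.003. The observed rates are identical in both cases; only the sample size changes the verdict.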
package com.laozhang.rag.experiment;

import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.math3.stat.inference.ChiSquareTest;
import org.apache.commons.math3.stat.inference.TTest;
import org.springframework.stereotype.Service;

import java.util.Arrays;

/**
 * Statistical significance testing service.
 *
 * Proportion metrics (satisfaction, click rate): chi-square test on the 2x2 table,
 * which for two groups is equivalent to a two-sided two-proportion z-test.
 * Continuous metrics (latency, query count): two-sample t-test.
 *
 * Significance level: p < 0.05 (95% confidence).
 */
@Slf4j
@Service
public class StatisticalSignificanceService {

    private static final double ALPHA = 0.05;

    private final TTest tTest = new TTest();
    private final ChiSquareTest chiSquareTest = new ChiSquareTest();

    @Data
    public static class SignificanceResult {
        // Named "significant" (not "isSignificant") so that Lombok generates
        // isSignificant() and setSignificant(...)
        private boolean significant;
        private double pValue;
        private double controlMean;
        private double treatmentMean;
        private double relativeLift; // relative change in percent
        private double absoluteLift; // absolute change
        private int controlSampleSize;
        private int treatmentSampleSize;
        private String recommendation;
        private String statisticalNote;
    }

    /**
     * Tests a proportion metric (e.g. satisfaction, click rate).
     *
     * @param controlSuccess   successes in control (e.g. thumbs-up count)
     * @param controlTotal     total samples in control
     * @param treatmentSuccess successes in treatment
     * @param treatmentTotal   total samples in treatment
     */
    public SignificanceResult testProportions(
            long controlSuccess, long controlTotal,
            long treatmentSuccess, long treatmentTotal) {
        SignificanceResult result = new SignificanceResult();
        result.setControlSampleSize((int) controlTotal);
        result.setTreatmentSampleSize((int) treatmentTotal);
        if (controlTotal == 0 || treatmentTotal == 0) {
            result.setSignificant(false);
            result.setStatisticalNote("Sample size is 0; no statistical test can be run");
            return result;
        }

        double controlRate = (double) controlSuccess / controlTotal;
        double treatmentRate = (double) treatmentSuccess / treatmentTotal;
        result.setControlMean(controlRate);
        result.setTreatmentMean(treatmentRate);
        result.setAbsoluteLift(treatmentRate - controlRate);
        result.setRelativeLift(controlRate > 0 ? (treatmentRate - controlRate) / controlRate * 100 : 0);

        // Chi-square test of independence on the 2x2 contingency table
        long[][] observed = {
                {controlSuccess, controlTotal - controlSuccess},
                {treatmentSuccess, treatmentTotal - treatmentSuccess}
        };
        try {
            double pValue = chiSquareTest.chiSquareTest(observed);
            result.setPValue(pValue);
            result.setSignificant(pValue < ALPHA);
            result.setRecommendation(generateRecommendation(result));
            result.setStatisticalNote(generateStatisticalNote(result, true));
        } catch (Exception e) {
            result.setSignificant(false);
            result.setStatisticalNote("Statistical test failed: " + e.getMessage());
        }
        return result;
    }

    /**
     * Tests a continuous metric (e.g. latency, conversation turns) with a
     * two-sample, two-sided t-test that does not assume equal variances.
     */
    public SignificanceResult testContinuous(double[] controlValues, double[] treatmentValues) {
        SignificanceResult result = new SignificanceResult();
        result.setControlSampleSize(controlValues.length);
        result.setTreatmentSampleSize(treatmentValues.length);
        if (controlValues.length < 2 || treatmentValues.length < 2) {
            result.setSignificant(false);
            result.setStatisticalNote("Not enough samples; at least 2 per group are required");
            return result;
        }

        double controlMean = Arrays.stream(controlValues).average().orElse(0);
        double treatmentMean = Arrays.stream(treatmentValues).average().orElse(0);
        result.setControlMean(controlMean);
        result.setTreatmentMean(treatmentMean);
        result.setAbsoluteLift(treatmentMean - controlMean);
        result.setRelativeLift(controlMean != 0 ? (treatmentMean - controlMean) / controlMean * 100 : 0);
        try {
            double pValue = tTest.tTest(controlValues, treatmentValues);
            result.setPValue(pValue);
            result.setSignificant(pValue < ALPHA);
            result.setRecommendation(generateRecommendation(result));
            result.setStatisticalNote(generateStatisticalNote(result, false));
        } catch (Exception e) {
            result.setSignificant(false);
            result.setStatisticalNote("t-test failed: " + e.getMessage());
        }
        return result;
    }

    // The wording is direction-neutral on purpose: whether "higher" is good depends
    // on the metric (satisfaction: yes; follow-up rate and latency: no).
    private String generateRecommendation(SignificanceResult result) {
        if (!result.isSignificant()) {
            return String.format(
                    "Difference is not significant (p=%.3f >= 0.05); keep collecting data or drop the change",
                    result.getPValue()
            );
        }
        if (result.getAbsoluteLift() > 0) {
            return String.format(
                    "Treatment is significantly higher than control (p=%.3f, relative change +%.1f%%)",
                    result.getPValue(), result.getRelativeLift()
            );
        } else {
            return String.format(
                    "Treatment is significantly lower than control (p=%.3f, relative change -%.1f%%)",
                    result.getPValue(), Math.abs(result.getRelativeLift())
            );
        }
    }

    // Proportions are printed as percentages; continuous metrics keep their own unit.
    private String generateStatisticalNote(SignificanceResult result, boolean asPercentage) {
        double scale = asPercentage ? 100 : 1;
        String unit = asPercentage ? "%" : "";
        return String.format(
                "Control: %.2f%s (n=%d), treatment: %.2f%s (n=%d), absolute difference: %.2f%s, p-value: %.4f",
                result.getControlMean() * scale, unit, result.getControlSampleSize(),
                result.getTreatmentMean() * scale, unit, result.getTreatmentSampleSize(),
                result.getAbsoluteLift() * scale, unit, result.getPValue()
        );
    }

    /**
     * Computes the minimum sample size per group needed to reach significance,
     * so that the experiment duration can be planned in advance.
     *
     * @param baselineRate            baseline rate of the control group (0.75 for 75% satisfaction)
     * @param minimumDetectableEffect smallest effect to detect (0.01 for a 1-point lift)
     */
    public int calculateRequiredSampleSize(double baselineRate, double minimumDetectableEffect) {
        if (minimumDetectableEffect == 0) {
            throw new IllegalArgumentException("minimumDetectableEffect must be non-zero");
        }
        // Sample-size formula for a two-proportion z-test:
        // n = (Z_{alpha/2} + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1-p2)^2
        double zAlpha = 1.96; // alpha = 0.05, two-tailed
        double zBeta = 0.84;  // beta = 0.20, i.e. 80% power
        double p1 = baselineRate;
        double p2 = baselineRate + minimumDetectableEffect;
        double numerator = Math.pow(zAlpha + zBeta, 2) * (p1 * (1 - p1) + p2 * (1 - p2));
        double denominator = Math.pow(p1 - p2, 2);
        int requiredPerGroup = (int) Math.ceil(numerator / denominator);
        log.info("Required sample size per group: {} (baseline: {}, MDE: {})",
                requiredPerGroup, baselineRate, minimumDetectableEffect);
        return requiredPerGroup;
    }
}
Experiment result analysis service
package com.laozhang.rag.experiment;

import com.laozhang.rag.experiment.StatisticalSignificanceService.SignificanceResult;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

import java.time.Instant;
import java.util.List;

@Service
@RequiredArgsConstructor
public class ExperimentAnalysisService {

    private final SessionSummaryRepository summaryRepo;
    private final QueryEventRepository eventRepo;
    private final StatisticalSignificanceService statsService;

    /** Generates the full experiment analysis report. */
    public ExperimentReport generateReport(String experimentId) {
        // Load session data for both groups
        List<SessionSummary> controlSessions = summaryRepo
                .findByExperimentIdAndGroupName(experimentId, "CONTROL");
        List<SessionSummary> treatmentSessions = summaryRepo
                .findByExperimentIdAndGroupName(experimentId, "TREATMENT");

        ExperimentReport report = new ExperimentReport();
        report.setExperimentId(experimentId);
        report.setControlSessionCount(controlSessions.size());
        report.setTreatmentSessionCount(treatmentSessions.size());
        report.setGeneratedAt(Instant.now());

        // 1. Satisfaction comparison.
        //    Note: feedback events are pooled across sessions here, although traffic is
        //    randomized per session; events from one session are not fully independent.
        long controlThumbsUp = controlSessions.stream()
                .mapToLong(s -> s.getThumbsUpCount()).sum();
        long controlTotalFeedback = controlSessions.stream()
                .mapToLong(s -> s.getThumbsUpCount() + s.getThumbsDownCount()).sum();
        long treatmentThumbsUp = treatmentSessions.stream()
                .mapToLong(s -> s.getThumbsUpCount()).sum();
        long treatmentTotalFeedback = treatmentSessions.stream()
                .mapToLong(s -> s.getThumbsUpCount() + s.getThumbsDownCount()).sum();
        if (controlTotalFeedback > 0 && treatmentTotalFeedback > 0) {
            SignificanceResult satisfactionResult = statsService.testProportions(
                    controlThumbsUp, controlTotalFeedback,
                    treatmentThumbsUp, treatmentTotalFeedback
            );
            report.setSatisfactionAnalysis(satisfactionResult);
        }

        // 2. Source click rate comparison
        long controlSourceClicks = controlSessions.stream()
                .mapToLong(s -> s.getSourceClickCount()).sum();
        long controlTotalQueries = controlSessions.stream()
                .mapToLong(s -> s.getQueryCount()).sum();
        long treatmentSourceClicks = treatmentSessions.stream()
                .mapToLong(s -> s.getSourceClickCount()).sum();
        long treatmentTotalQueries = treatmentSessions.stream()
                .mapToLong(s -> s.getQueryCount()).sum();
        if (controlTotalQueries > 0 && treatmentTotalQueries > 0) {
            SignificanceResult sourceClickResult = statsService.testProportions(
                    controlSourceClicks, controlTotalQueries,
                    treatmentSourceClicks, treatmentTotalQueries
            );
            report.setSourceClickAnalysis(sourceClickResult);
        }

        // 3. Follow-up rate comparison (lower is better), one value per session
        double[] controlFollowUpRates = controlSessions.stream()
                .mapToDouble(s -> s.getFollowUpRate() != null ? s.getFollowUpRate() : 0.0)
                .toArray();
        double[] treatmentFollowUpRates = treatmentSessions.stream()
                .mapToDouble(s -> s.getFollowUpRate() != null ? s.getFollowUpRate() : 0.0)
                .toArray();
        if (controlFollowUpRates.length > 10 && treatmentFollowUpRates.length > 10) {
            SignificanceResult followUpResult = statsService.testContinuous(
                    controlFollowUpRates, treatmentFollowUpRates
            );
            report.setFollowUpAnalysis(followUpResult);
        }

        // 4. Overall recommendation
        report.setOverallRecommendation(generateOverallRecommendation(report));
        return report;
    }

    private String generateOverallRecommendation(ExperimentReport report) {
        int positiveSignals = 0;
        int negativeSignals = 0;
        int totalSignificantMetrics = 0;

        // Satisfaction: higher is better
        if (report.getSatisfactionAnalysis() != null &&
                report.getSatisfactionAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getSatisfactionAnalysis().getAbsoluteLift() > 0) positiveSignals++;
            else negativeSignals++;
        }

        // Source click rate: higher is better
        if (report.getSourceClickAnalysis() != null &&
                report.getSourceClickAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getSourceClickAnalysis().getAbsoluteLift() > 0) positiveSignals++;
            else negativeSignals++;
        }

        // Follow-up rate: lower is better (a negative lift is the good direction)
        if (report.getFollowUpAnalysis() != null &&
                report.getFollowUpAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getFollowUpAnalysis().getAbsoluteLift() < 0) positiveSignals++;
            else negativeSignals++;
        }

        if (totalSignificantMetrics == 0) {
            // The 75% baseline and 2-point MDE are example values; use your own
            return "Results have not reached statistical significance yet. Keep the experiment "
                    + "running until each group has collected at least "
                    + statsService.calculateRequiredSampleSize(0.75, 0.02) + " valid feedback events.";
        }
        if (positiveSignals > negativeSignals) {
            return String.format(
                    "Recommend shipping the treatment configuration: %d metrics significantly improved, %d significantly worse.",
                    positiveSignals, negativeSignals);
        } else if (negativeSignals > positiveSignals) {
            return String.format(
                    "Recommend keeping the control configuration: %d metrics significantly worse, %d significantly improved.",
                    negativeSignals, positiveSignals);
        } else {
            return "Results are mixed. Analyze further which user segments benefit more from the treatment.";
        }
    }
}
Multivariate testing: testing several dimensions at once
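Real-world optimizations usually involve several variables (chunking strategy + number of retrieved chunks + whether to use a reranker). A multivariate test can compare several combinations at the same time. Note that the four arms below are hand-picked combinations rather than a full factorial design, so the result of arm D cannot be attributed to any single variable.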
package com.laozhang.rag.experiment;

import org.springframework.stereotype.Service;

import java.util.Map;

@Service
public class MultiVariateExperiment {

    /** Configuration of the four arms of the multivariate test. */
    public static final Map<String, ExperimentConfig.RagConfig> EXPERIMENT_GROUPS = Map.of(
            "A_baseline", ExperimentConfig.RagConfig.builder()
                    .topK(5).rerankerEnabled(false).hydeEnabled(false).build(),
            "B_more_retrieval", ExperimentConfig.RagConfig.builder()
                    .topK(10).rerankerEnabled(false).hydeEnabled(false).build(),
            "C_reranker", ExperimentConfig.RagConfig.builder()
                    .topK(5).rerankerEnabled(true).rerankerTopK(3).hydeEnabled(false).build(),
            "D_full_stack", ExperimentConfig.RagConfig.builder()
                    .topK(10).rerankerEnabled(true).rerankerTopK(5).hydeEnabled(true).build()
    );

    /** Assigns users evenly to the four arms by user ID. */
    public String assignGroup(String userId) {
        // floorMod keeps the index in 0..3 even when hashCode() is Integer.MIN_VALUE,
        // where Math.abs(...) % 4 would produce a negative index
        int hash = Math.floorMod(userId.hashCode(), 4);
        String[] groups = {"A_baseline", "B_more_retrieval", "C_reranker", "D_full_stack"};
        return groups[hash];
    }
}
Experiment dashboard: monitoring experiment data in real time
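package com.laozhang.rag.experiment;

import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/v1/experiments")
@RequiredArgsConstructor
public class ExperimentDashboardController {

    private final ExperimentAnalysisService analysisService;
    private final ExperimentRepository experimentRepo;

    /**
     * Returns real-time experiment metrics.
     * (getRealtimeMetrics is not shown in the analysis service listing in this article.)
     */
    @GetMapping("/{experimentId}/metrics")
    public ResponseEntity<ExperimentMetrics> getMetrics(
            @PathVariable String experimentId,
            @RequestParam(defaultValue = "24h") String timeRange) {
        ExperimentMetrics metrics = analysisService.getRealtimeMetrics(experimentId, timeRange);
        return ResponseEntity.ok(metrics);
    }

    /** Returns the full analysis report. */
    @GetMapping("/{experimentId}/report")
    public ResponseEntity<ExperimentReport> getReport(@PathVariable String experimentId) {
        ExperimentReport report = analysisService.generateReport(experimentId);
        return ResponseEntity.ok(report);
    }

    /** Stops the experiment (ship or roll back). */
    @PostMapping("/{experimentId}/stop")
    public ResponseEntity<Void> stopExperiment(
            @PathVariable String experimentId,
            @RequestParam String decision, // "ship_treatment" or "keep_control"
            @RequestParam String reason) {
        experimentRepo.updateStatus(experimentId,
                ExperimentConfig.ExperimentStatus.COMPLETED,
                decision, reason);
        return ResponseEntity.ok().build();
    }
}
Production considerations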
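1. Experiment contamination
If the control and treatment groups share one vector index, some index changes (such as incremental updates) affect both groups at once and the results are no longer clean.
Solution: for variables that affect the index (such as the chunking strategy), build a separate index per variant instead of toggling the variable inside a shared-index A/B test. Runtime A/B switching is only suitable for parameters that do not touch the index (topK, the HyDE switch, the reranker switch).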
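A minimal sketch of the "separate index per variant" idea follows. `IndexRouter`, its `Group` enum, and the index names are all hypothetical; the assumption is that each index has been built offline with its own chunking strategy and that the retrieval layer can select an index by name.

```java
import java.util.Map;

/** Hypothetical sketch: each experiment group reads from its own pre-built vector index. */
final class IndexRouter {

    enum Group { CONTROL, TREATMENT }

    // Index names are placeholders; each index is built offline with its own chunking strategy.
    private final Map<Group, String> indexByGroup;

    IndexRouter(String controlIndex, String treatmentIndex) {
        this.indexByGroup = Map.of(Group.CONTROL, controlIndex, Group.TREATMENT, treatmentIndex);
    }

    /** Returns the index a query should search, so the two groups never share chunks. */
    String indexFor(Group group) {
        return indexByGroup.get(group);
    }
}
```

For example, `new IndexRouter("kb_fixed_500", "kb_semantic_recursive")` would keep the two chunking strategies from the opening story fully isolated, at the cost of keeping both indexes up to date during the experiment.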
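2. Novelty effect
Users may give higher ratings simply because a feature is new, but this effect fades over time.
Solution: run the experiment for at least 7 days so that the novelty effect can wear off.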
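3. Stopping the experiment before the sample is large enough
"The treatment group is ahead, stop the experiment and ship!" is a common mistake. Stopping before the planned sample size is reached makes the conclusion unreliable.
Solution: before the experiment starts, use the calculateRequiredSampleSize method to compute the sample size you need, then stick to the plan.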
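The planning step can be sketched as a small stand-alone helper. `ExperimentPlan` and its method names are hypothetical; `requiredPerGroup` repeats the formula used by `calculateRequiredSampleSize` (alpha = 0.05 two-sided, 80% power), and the stop guard simply combines the sample-size rule with the minimum run time from the novelty-effect point.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical planning helper that mirrors the article's sample-size formula. */
final class ExperimentPlan {

    /** Per-group sample size for a two-sided test at alpha = 0.05 with 80% power. */
    static int requiredPerGroup(double baselineRate, double minimumDetectableEffect) {
        double z = 1.96 + 0.84;
        double p1 = baselineRate;
        double p2 = baselineRate + minimumDetectableEffect;
        double variance = p1 * (1 - p1) + p2 * (1 - p2);
        return (int) Math.ceil(z * z * variance / Math.pow(p2 - p1, 2));
    }

    /** Days needed to collect the samples, given daily queries, group share and feedback rate. */
    static int daysNeeded(int requiredPerGroup, int dailyQueries, double groupShare, double feedbackRate) {
        double feedbackPerDay = dailyQueries * groupShare * feedbackRate;
        return (int) Math.ceil(requiredPerGroup / feedbackPerDay);
    }

    /** Stop only when both the planned sample size and the minimum run time are reached. */
    static boolean mayStop(Instant start, Instant now, long samplesPerGroup,
                           int requiredPerGroup, Duration minRun) {
        boolean longEnough = Duration.between(start, now).compareTo(minRun) >= 0;
        return longEnough && samplesPerGroup >= requiredPerGroup;
    }
}
```

With a 75% baseline, `requiredPerGroup(0.75, 0.02)` gives 7,147 samples per group, while `requiredPerGroup(0.75, 0.10)` gives 247. A rule of thumb of about 200 per group is therefore only enough to detect differences of around 10 points; detecting a 2-point lift on a system that collects 15 feedback events per group per day would take `daysNeeded(7147, 100, 0.5, 0.3)`, that is 477 days.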
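Frequently asked questions
Q1: My system has very little traffic (100 queries a day). Is A/B testing still worthwhile?
Yes, but the experiment has to run longer. With 100 queries a day split 50/50, and assuming 30% of queries receive satisfaction feedback (about 15 per group per day), collecting 200 feedback events per group takes about 14 days (200 / 15 ≈ 13.3). That is acceptable. Keep in mind that 200 samples per group can only reveal large differences; check what the sample-size formula gives for the effect you want to detect.
Q2: Should groups be assigned by user ID or by session ID?
For products with a login system, use the user ID (consistent across sessions). For anonymous systems, use the session ID. The key point is that the same "entity" always belongs to the same group within one experiment.
Q3: How should "dirty data" during the experiment (anomalies caused by system bugs) be handled?
Record every system event during the experiment (deployments, incidents, configuration changes) and exclude data from incident periods during analysis.
Q4: The p-value passes, but the improvement is tiny (0.3%). Is it worth shipping?
The p-value only tells you whether a difference is significant, not whether it matters. Judge it against the actual business value. A 0.3% satisfaction gain means 300 more satisfied users a day for a product with 100,000 daily active users, which may have commercial value. For a product with 100 daily active users it is less than one user a day, which is probably not worth the extra complexity.
Q5: Several metrics are significant but point in different directions (satisfaction is up, but so is latency). How do I decide?
Set a metric priority (typically: satisfaction > source click rate > follow-up rate > latency) and decide in that order. If a high-priority metric improves while a low-priority one degrades, shipping is still recommended, provided the latency regression stays below what users will tolerate.
Q6: How do we keep optimizing after the experiment ends?
Build a "culture of continuous experimentation": every launch has an A/B test plan, and every finished experiment leaves behind a written conclusion. Treat the experiment history as a team knowledge asset, so that newcomers can learn from past experiments what worked and what did not.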
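For Q4 above, one way to connect significance to business value is to look at a confidence interval for the absolute lift instead of the p-value alone. The sketch below uses the simple Wald interval; `LiftInterval` and the threshold parameter are hypothetical names, and the "minimum worthwhile lift" is something the team has to choose for itself.

```java
/** Hypothetical helper: Wald 95% confidence interval for the absolute lift between two rates. */
final class LiftInterval {

    /** Returns {lower, upper} for (treatmentRate - controlRate). */
    static double[] ci95(long controlSuccess, long controlTotal,
                         long treatmentSuccess, long treatmentTotal) {
        double pc = (double) controlSuccess / controlTotal;
        double pt = (double) treatmentSuccess / treatmentTotal;
        double se = Math.sqrt(pc * (1 - pc) / controlTotal + pt * (1 - pt) / treatmentTotal);
        double diff = pt - pc;
        return new double[] {diff - 1.96 * se, diff + 1.96 * se};
    }

    /** True only if even the pessimistic end of the interval clears the business threshold. */
    static boolean clearsThreshold(double[] ci, double minWorthwhileLift) {
        return ci[0] >= minWorthwhileLift;
    }
}
```

With 80% versus 82% satisfaction on 7,000 feedback events per group, the interval runs from about 0.7 to about 3.3 percentage points: the lift is clearly positive, yet a team that needs at least a full point to justify the added complexity could not be sure of getting it.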
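Summary
Su Lei's story mirrors what happens in many teams: optimize by feel, then discover after launch that the change made things worse.
Data-driven RAG optimization needs more than a tool for "running a test". It needs a complete experimentation culture in which every step, from experiment design, traffic allocation, metric collection and statistical testing through to documenting conclusions, is carried out rigorously.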
Letting the data speak is not a sign of weakness in an engineer; it is how an engineer takes responsibility for users.
