AI应用的AB实验平台:数据驱动的AI功能迭代
凭感觉优化3个月,不如数据驱动2周
刘芳是某招聘平台的AI产品经理,负责简历智能解析功能的质量迭代。
2025年11月,她收到了一批用户反馈:AI对软件工程师简历的技能提取不够精准,特别是对"有项目经验但没明确写技能名称"的候选人,系统经常漏标Python、Go等语言。
她把问题甩给了后端组的Java工程师赵凯。赵凯看了两天,把系统提示词改了改,加了一段:"仔细分析项目经历中隐含的技能,不只提取显式列出的技能标签"。
测试了几个样本,感觉好多了。上线。
一周后,新的反馈来了:技能提取变得"太宽泛",很多用户觉得被标注了自己不熟悉的技能。
赵凯又改回去一点。再上线。
这样来回了11次,历时3个月,他们完全不知道哪个版本是最好的。每次改动,都是凭感觉,凭几个测试样本,凭老板的直觉。
直到2026年2月,平台引入了AI AB实验系统。
第一个实验:新旧两个版本的提示词各承接50%流量,运行2周,以精确率与召回率的调和平均(F1 Score)作为评估指标,同时收集真实用户的反馈信号(技能确认率)。
实验结果:版本B(赵凯第7次修改的版本)F1 Score 0.847,显著高于当前线上版本(0.812),统计显著性p值0.003,远低于0.05。
这是赵凯第一次知道,他3个月前某个不经意的改动,才是最好的。之前11次迭代,有9次是在往回走。
这篇文章,就是那套AI AB实验平台的完整实现。
一、AI AB实验的特殊挑战
1.1 为什么AI实验比传统AB实验更难
传统AB实验评估的是点击率、转化率这类客观指标,计算简单,A/B一分,结果一目了然。
AI质量评估有3个独特难点:
难点1:输出没有标准答案
"帮我写一封求职信"——什么样的输出算好?很难用一个数字衡量。
难点2:评估本身需要AI介入
人工评估成本高,无法规模化。但自动化评估(用另一个LLM打分)引入了"评委偏见"问题。
难点3:效果的延迟性
LLM输出的质量,有时只有在用户后续行为(继续对话/放弃/反馈)中才能体现,不是一次请求就能判断好坏。
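正文反复以F1 Score作为主指标。作为参照,下面用一个极简的Java草图给出它的计算方式(以技能提取为例,TP/FP/FN数值为虚构示例,非正文代码):

```java
// F1 Score计算草图:TP=标对的技能数,FP=多标的,FN=漏标的
public class F1ScoreDemo {

    /** 精确率 = TP / (TP + FP) */
    public static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0 : (double) tp / (tp + fp);
    }

    /** 召回率 = TP / (TP + FN) */
    public static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0 : (double) tp / (tp + fn);
    }

    /** F1 = 2PR / (P + R),精确率与召回率的调和平均 */
    public static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0 ? 0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // 假设某份简历:AI标出10个技能,8个正确(TP=8, FP=2),另漏标3个(FN=3)
        System.out.printf("F1 = %.3f%n", f1(8, 2, 3)); // → F1 = 0.762
    }
}
```

提示词改得"太宽泛"时,FP上升、精确率下降;改得"太保守"时,FN上升、召回率下降。F1把两者压进一个数字,正是适合做实验主指标的原因。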
1.2 实验平台系统设计
二、数据模型设计
2.1 实验配置模型
@Entity
@Table(name = "ab_experiments")
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class AbExperiment {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(name = "experiment_key", unique = true, nullable = false, length = 128)
private String experimentKey; // 唯一标识,如:resume_parse_v2
@Column(name = "name", nullable = false)
private String name; // 可读名称
@Column(name = "description", columnDefinition = "TEXT")
private String description;
@Enumerated(EnumType.STRING)
@Column(name = "status")
private ExperimentStatus status; // DRAFT/RUNNING/PAUSED/COMPLETED
@Column(name = "traffic_percentage")
private Integer trafficPercentage; // 参与实验的总流量百分比,如10(10%用户参与实验)
@Column(name = "allocation_unit")
private String allocationUnit; // USER/SESSION/REQUEST
@Column(name = "started_at")
private Instant startedAt;
@Column(name = "ended_at")
private Instant endedAt;
@Column(name = "min_sample_size")
private Integer minSampleSize; // 最小样本量(用于提前停止判断)
@Column(name = "significance_level")
private Double significanceLevel; // 显著性水平,默认0.05
@OneToMany(mappedBy = "experiment", cascade = CascadeType.ALL, fetch = FetchType.EAGER)
private List<ExperimentVariant> variants;
@OneToMany(mappedBy = "experiment", cascade = CascadeType.ALL)
private List<ExperimentMetric> metrics;
public enum ExperimentStatus {
DRAFT, RUNNING, PAUSED, COMPLETED, ARCHIVED
}
}
@Entity
@Table(name = "experiment_variants")
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ExperimentVariant {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@ManyToOne
@JoinColumn(name = "experiment_id")
private AbExperiment experiment;
@Column(name = "variant_key", nullable = false, length = 64)
private String variantKey; // control/treatment_a/treatment_b
@Column(name = "name", nullable = false)
private String name;
@Column(name = "is_control")
private Boolean isControl; // 是否是对照组
@Column(name = "traffic_weight")
private Integer trafficWeight; // 流量权重,所有变体权重之和=100
// AI变体的配置:提示词/模型/参数
@Column(name = "config", columnDefinition = "JSON")
@Convert(converter = JsonNodeConverter.class)
private JsonNode config;
// 示例config:
// {
// "model": "gpt-4o-mini",
// "systemPrompt": "...",
// "temperature": 0.7,
// "maxTokens": 2048
// }
}
@Entity
@Table(name = "experiment_metrics")
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ExperimentMetric {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@ManyToOne
@JoinColumn(name = "experiment_id")
private AbExperiment experiment;
@Column(name = "metric_key", nullable = false)
private String metricKey; // f1_score/user_satisfaction/task_completion_rate
@Column(name = "metric_type")
private String metricType; // PRIMARY/GUARDRAIL/INFORMATIONAL
@Column(name = "direction")
private String direction; // HIGHER_BETTER/LOWER_BETTER
@Column(name = "minimum_detectable_effect")
private Double minimumDetectableEffect; // 最小可检测效果
}
2.2 实验数据记录模型
@Entity
@Table(name = "experiment_exposures", indexes = {
@Index(name = "idx_exp_user", columnList = "experiment_key,user_id"),
@Index(name = "idx_exp_session", columnList = "experiment_key,session_id")
})
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ExperimentExposure {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(name = "experiment_key", nullable = false)
private String experimentKey;
@Column(name = "variant_key", nullable = false)
private String variantKey;
@Column(name = "user_id")
private String userId;
@Column(name = "session_id")
private String sessionId;
@Column(name = "request_id")
private String requestId;
@Column(name = "exposed_at", nullable = false)
private Instant exposedAt;
}
@Entity
@Table(name = "experiment_metric_events", indexes = {
@Index(name = "idx_metric_exp", columnList = "experiment_key,metric_key,variant_key")
})
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ExperimentMetricEvent {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(name = "experiment_key", nullable = false)
private String experimentKey;
@Column(name = "variant_key", nullable = false)
private String variantKey;
@Column(name = "user_id")
private String userId;
@Column(name = "metric_key", nullable = false)
private String metricKey;
@Column(name = "metric_value", nullable = false)
private Double metricValue;
@Column(name = "metadata", columnDefinition = "JSON")
private String metadata;
@Column(name = "recorded_at", nullable = false)
private Instant recordedAt;
}
三、一致性哈希实现用户分组
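在进入实现之前,先用一个可独立运行的小草图验证这种分桶方式的两个关键性质:确定性(同一输入永远落入同一个桶)与近似均匀(各桶流量均衡)。逻辑与下文3.1中的computeBucket一致,仅为独立演示:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class BucketDemo {
    /** 与3.1的computeBucket相同的逻辑:MD5前4字节 → 0-9999 */
    public static int computeBucket(String input) {
        try {
            byte[] h = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            long v = ((long) (h[0] & 0xFF) << 24) | ((long) (h[1] & 0xFF) << 16)
                    | ((long) (h[2] & 0xFF) << 8) | (h[3] & 0xFF);
            return (int) (v % 10000);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 性质1:确定性——同一输入永远得到同一个桶号(粘性分组的基础)
        System.out.println(computeBucket("exp_a:user_1") == computeBucket("exp_a:user_1")); // true
        // 性质2:近似均匀——10万个用户落入10个等宽区间,每个区间约1万
        int[] hist = new int[10];
        for (int i = 0; i < 100_000; i++) {
            hist[computeBucket("exp_a:user_" + i) / 1000]++;
        }
        System.out.println(Arrays.toString(hist));
    }
}
```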
3.1 分组引擎
@Service
@RequiredArgsConstructor
@Slf4j
public class ExperimentAssignmentService {
private final AbExperimentRepository experimentRepository;
private final ExperimentExposureRepository exposureRepository;
private final RedisTemplate<String, String> redisTemplate;
/**
* 获取用户在指定实验中的变体
* 保证同一用户在同一实验中始终看到同一变体(粘性分组)
*/
public Optional<ExperimentAssignment> getAssignment(
String experimentKey, String userId, String sessionId) {
AbExperiment experiment = findRunningExperiment(experimentKey);
if (experiment == null) {
return Optional.empty();
}
// 确定分组单元
String allocationId = switch (experiment.getAllocationUnit()) {
case "USER" -> userId;
case "SESSION" -> sessionId;
default -> userId;
};
if (!StringUtils.hasText(allocationId)) {
return Optional.empty();
}
// 检查是否有缓存的分组结果
String cacheKey = String.format("exp:assign:%s:%s", experimentKey, allocationId);
String cachedVariant = redisTemplate.opsForValue().get(cacheKey);
if (StringUtils.hasText(cachedVariant)) {
return Optional.of(ExperimentAssignment.builder()
.experimentKey(experimentKey)
.variantKey(cachedVariant)
.fromCache(true)
.build());
}
// 计算分组
String variantKey = assignVariant(experiment, allocationId);
if (variantKey == null) {
// 用户不在实验流量内
return Optional.empty();
}
// 缓存分组结果(7天,保证分组稳定性)
redisTemplate.opsForValue().set(cacheKey, variantKey, Duration.ofDays(7));
// 异步记录曝光
recordExposureAsync(experimentKey, variantKey, userId, sessionId);
return Optional.of(ExperimentAssignment.builder()
.experimentKey(experimentKey)
.variantKey(variantKey)
.fromCache(false)
.build());
}
/**
* 一致性哈希实现:相同输入 → 相同输出,且分布均匀
*/
private String assignVariant(AbExperiment experiment, String allocationId) {
// Step 1:判断是否进入实验流量桶
// 使用MD5的前8位(16进制)作为哈希值,转为0-9999的整数
int bucket = computeBucket(experiment.getExperimentKey() + ":" + allocationId);
// bucket范围为0-9999,阈值 = trafficPercentage * 100
// 例如trafficPercentage=10时,只有bucket < 1000的用户(10%)参与实验
if (bucket >= experiment.getTrafficPercentage() * 100) {
return null;
}
// Step 2:在实验内部,按权重分配变体
// 用不同的盐值(:variant:)重新哈希,避免与Step 1的流量分桶相关
int inExpBucket = computeBucket(
experiment.getExperimentKey() + ":variant:" + allocationId);
// 按权重分配
List<ExperimentVariant> variants = experiment.getVariants();
int totalWeight = variants.stream()
.mapToInt(ExperimentVariant::getTrafficWeight).sum();
int normalizedBucket = inExpBucket % totalWeight;
int accumulated = 0;
for (ExperimentVariant variant : variants) {
accumulated += variant.getTrafficWeight();
if (normalizedBucket < accumulated) {
return variant.getVariantKey();
}
}
// 不应该到达这里
return variants.get(variants.size() - 1).getVariantKey();
}
/**
* 计算0-9999的均匀哈希值
*/
private int computeBucket(String input) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] hash = md.digest(input.getBytes(StandardCharsets.UTF_8));
// 取前4个字节,转为无符号整数,再取模10000
long value = ((long)(hash[0] & 0xFF) << 24) |
((long)(hash[1] & 0xFF) << 16) |
((long)(hash[2] & 0xFF) << 8) |
((long)(hash[3] & 0xFF));
return (int)(value % 10000);
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException(e);
}
}
private void recordExposureAsync(String experimentKey, String variantKey,
String userId, String sessionId) {
CompletableFuture.runAsync(() -> {
ExperimentExposure exposure = ExperimentExposure.builder()
.experimentKey(experimentKey)
.variantKey(variantKey)
.userId(userId)
.sessionId(sessionId)
.exposedAt(Instant.now())
.build();
exposureRepository.save(exposure);
});
}
@Data
@Builder
public static class ExperimentAssignment {
private String experimentKey;
private String variantKey;
private boolean fromCache;
}
}
四、Spring AI集成:实验驱动的AI调用
4.1 实验感知的ChatClient包装
@Service
@RequiredArgsConstructor
@Slf4j
public class ExperimentAwareChatService {
private final ChatClient chatClient;
private final ExperimentAssignmentService assignmentService;
private final AbExperimentRepository experimentRepository;
private final ExperimentMetricEventRepository metricEventRepository;
private final AutoEvaluationService autoEvaluationService; // triggerAutoEvaluation中用到
/**
* 执行AI调用,自动应用实验变体配置
*/
public ExperimentAiResponse chat(
String experimentKey,
String userId,
String sessionId,
String userMessage,
String defaultSystemPrompt) {
// 获取实验分组
Optional<ExperimentAssignmentService.ExperimentAssignment> assignment =
assignmentService.getAssignment(experimentKey, userId, sessionId);
// 确定使用的配置
String systemPrompt = defaultSystemPrompt;
String modelId = "gpt-4o-mini";
Double temperature = 0.7;
String variantKey = "control_default";
if (assignment.isPresent()) {
ExperimentVariant variant = getVariantConfig(
experimentKey, assignment.get().getVariantKey());
if (variant != null && variant.getConfig() != null) {
variantKey = variant.getVariantKey();
JsonNode config = variant.getConfig();
if (config.has("systemPrompt")) {
systemPrompt = config.get("systemPrompt").asText();
}
if (config.has("model")) {
modelId = config.get("model").asText();
}
if (config.has("temperature")) {
temperature = config.get("temperature").asDouble();
}
}
}
// 执行AI调用
long startTime = System.currentTimeMillis();
String requestId = UUID.randomUUID().toString();
final String finalSystemPrompt = systemPrompt;
final String finalModelId = modelId;
final Double finalTemperature = temperature;
ChatResponse response = chatClient.prompt()
.system(finalSystemPrompt)
.user(userMessage)
.options(OpenAiChatOptions.builder()
.model(finalModelId)
.temperature(finalTemperature)
.build())
.call()
.chatResponse();
long latencyMs = System.currentTimeMillis() - startTime;
String content = response.getResult().getOutput().getText();
// 异步触发自动评估
final String finalVariantKey = variantKey;
final String finalContent = content;
CompletableFuture.runAsync(() -> {
triggerAutoEvaluation(experimentKey, finalVariantKey, userId,
requestId, userMessage, finalContent);
});
return ExperimentAiResponse.builder()
.content(content)
.experimentKey(experimentKey)
.variantKey(variantKey)
.requestId(requestId)
.latencyMs(latencyMs)
.build();
}
/**
* 记录用户显式反馈(好评/差评/具体分数)
*/
public void recordUserFeedback(String experimentKey, String variantKey,
String userId, String requestId,
String feedbackType, double score) {
ExperimentMetricEvent event = ExperimentMetricEvent.builder()
.experimentKey(experimentKey)
.variantKey(variantKey)
.userId(userId)
.metricKey("user_" + feedbackType) // user_thumbsup/user_thumbsdown/user_rating
.metricValue(score)
.metadata(String.format("{\"request_id\": \"%s\"}", requestId))
.recordedAt(Instant.now())
.build();
metricEventRepository.save(event);
}
private void triggerAutoEvaluation(String experimentKey, String variantKey,
String userId, String requestId,
String input, String output) {
// 自动评估服务(LLM-as-Judge)
autoEvaluationService.evaluate(AutoEvaluationRequest.builder()
.experimentKey(experimentKey)
.variantKey(variantKey)
.userId(userId)
.requestId(requestId)
.input(input)
.output(output)
.build());
}
}
五、LLM-as-Judge:自动化AI质量评估
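5.1的服务里会引用一个parseEvaluationResult方法,正文未展示其实现。这里先补一个极简替身:用正则从评委输出中抽取overall分。注意这只是示意草图,生产中应使用Jackson等库做严格的JSON解析与容错(评委偶尔会在JSON外包一层markdown代码块):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JudgeOutputParser {
    private static final Pattern OVERALL = Pattern.compile("\"overall\"\\s*:\\s*(\\d+)");

    /** 从LLM评委输出中提取overall分;解析失败返回-1(生产中应记录并告警) */
    public static int extractOverall(String judgeOutput) {
        Matcher m = OVERALL.matcher(judgeOutput);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String sample = "{ \"scores\": {\"relevance\": 4}, \"overall\": 4, \"reasoning\": \"...\" }";
        System.out.println(extractOverall(sample)); // → 4
    }
}
```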
5.1 自动评估服务
@Service
@RequiredArgsConstructor
@Slf4j
public class AutoEvaluationService {
private final ChatClient evaluatorChatClient; // 使用独立的评估模型客户端
private final ExperimentMetricEventRepository metricEventRepository;
private final EvaluationCriteriaRepository criteriaRepository;
/**
* 使用LLM对AI输出进行多维度评估
* 注意:评估模型建议与被评估的模型不同(最好不同厂商),避免自我偏见
*/
public EvaluationResult evaluate(AutoEvaluationRequest request) {
// 获取该实验的评估标准
List<EvaluationCriteria> criteriaList = criteriaRepository
.findByExperimentKey(request.getExperimentKey());
if (criteriaList.isEmpty()) {
// 使用默认评估标准
criteriaList = getDefaultCriteria();
}
// 构建评估提示词
String evaluationPrompt = buildEvaluationPrompt(request, criteriaList);
String evaluationResult = evaluatorChatClient.prompt()
.system("""
你是一个严格、客观的AI输出质量评估专家。
对给定的输入和AI输出,按照指定维度进行评分(1-5分)。
评分要严格,4分表示优秀,5分表示极优秀,一般情况给3分。
只输出JSON,格式:
{
"scores": {
"relevance": 4,
"accuracy": 3,
"completeness": 4,
"clarity": 5
},
"overall": 4,
"strengths": ["优点1", "优点2"],
"weaknesses": ["缺点1"],
"reasoning": "简要说明评分理由"
}
""")
.user(evaluationPrompt)
.options(OpenAiChatOptions.builder()
.model("gpt-4o") // 评估用强模型
.temperature(0.1) // 低温,保证评分稳定性
.build())
.call()
.content();
// 解析评估结果
EvaluationResult result = parseEvaluationResult(evaluationResult, request);
// 保存各维度指标
saveMetricEvents(request, result);
return result;
}
private String buildEvaluationPrompt(AutoEvaluationRequest request,
List<EvaluationCriteria> criteriaList) {
StringBuilder sb = new StringBuilder();
sb.append("用户输入:\n").append(request.getInput()).append("\n\n");
sb.append("AI输出:\n").append(request.getOutput()).append("\n\n");
sb.append("评估维度:\n");
for (EvaluationCriteria criteria : criteriaList) {
sb.append(String.format("- %s:%s\n",
criteria.getDimension(), criteria.getDescription()));
}
return sb.toString();
}
private void saveMetricEvents(AutoEvaluationRequest request, EvaluationResult result) {
List<ExperimentMetricEvent> events = new ArrayList<>();
// 保存各维度分数
result.getScores().forEach((dimension, score) ->
events.add(ExperimentMetricEvent.builder()
.experimentKey(request.getExperimentKey())
.variantKey(request.getVariantKey())
.userId(request.getUserId())
.metricKey("auto_eval_" + dimension)
.metricValue(score.doubleValue())
.recordedAt(Instant.now())
.build())
);
// 保存综合分
events.add(ExperimentMetricEvent.builder()
.experimentKey(request.getExperimentKey())
.variantKey(request.getVariantKey())
.userId(request.getUserId())
.metricKey("auto_eval_overall")
.metricValue(result.getOverall().doubleValue())
.recordedAt(Instant.now())
.build());
metricEventRepository.saveAll(events);
}
private List<EvaluationCriteria> getDefaultCriteria() {
return List.of(
new EvaluationCriteria("relevance", "输出与用户请求的相关性"),
new EvaluationCriteria("accuracy", "输出内容的准确性和事实正确性"),
new EvaluationCriteria("completeness", "是否完整回答了用户的问题"),
new EvaluationCriteria("clarity", "表达是否清晰易懂")
);
}
}
六、统计显著性计算
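先用一个自包含的小例子过一遍Welch t统计量的计算步骤,公式与下文服务中的实现一致(评分数据为虚构示例):

```java
import java.util.List;

public class WelchTTestDemo {
    static double mean(List<Double> v) {
        return v.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    /** 样本方差(除以n-1) */
    static double variance(List<Double> v, double m) {
        return v.stream().mapToDouble(x -> (x - m) * (x - m)).sum() / (v.size() - 1);
    }

    /** Welch t统计量:(mean2 - mean1) / sqrt(var1/n1 + var2/n2),不假设两组等方差 */
    public static double tStatistic(List<Double> control, List<Double> treatment) {
        double m1 = mean(control), m2 = mean(treatment);
        double v1 = variance(control, m1), v2 = variance(treatment, m2);
        return (m2 - m1) / Math.sqrt(v1 / control.size() + v2 / treatment.size());
    }

    public static void main(String[] args) {
        // 示意数据:两组的auto_eval_overall评分(实际实验需要各组至少30个样本)
        List<Double> control = List.of(3.0, 3.5, 3.0, 4.0, 3.5, 3.0);
        List<Double> treatment = List.of(4.0, 4.5, 3.5, 4.0, 4.5, 4.0);
        System.out.printf("t = %.2f%n", tStatistic(control, treatment)); // 约3.3
    }
}
```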
6.1 T检验和卡方检验Java实现
@Service
@Slf4j
public class StatisticalSignificanceService {
/**
* 连续型指标的独立样本T检验
* 用于:评分、延迟等连续数值
*/
public TTestResult performTTest(
List<Double> controlValues,
List<Double> treatmentValues,
double significanceLevel) {
int n1 = controlValues.size();
int n2 = treatmentValues.size();
if (n1 < 30 || n2 < 30) {
return TTestResult.builder()
.significant(false)
.reason("样本量不足(需要各组至少30个样本)")
.sampleSizes(Map.of("control", n1, "treatment", n2))
.build();
}
double mean1 = mean(controlValues);
double mean2 = mean(treatmentValues);
double var1 = variance(controlValues, mean1);
double var2 = variance(treatmentValues, mean2);
// Welch's t-test(不假设等方差)
double tStatistic = (mean2 - mean1) / Math.sqrt(var1/n1 + var2/n2);
// Welch-Satterthwaite自由度计算
double df = Math.pow(var1/n1 + var2/n2, 2) /
(Math.pow(var1/n1, 2)/(n1-1) + Math.pow(var2/n2, 2)/(n2-1));
// 计算双尾p值(使用t分布的近似)
double pValue = calculatePValue(Math.abs(tStatistic), df);
double effectSize = cohensD(mean1, mean2, var1, var2, n1, n2);
double relativeImprovement = mean1 != 0 ? (mean2 - mean1) / mean1 * 100 : 0;
return TTestResult.builder()
.controlMean(mean1)
.treatmentMean(mean2)
.tStatistic(tStatistic)
.degreesOfFreedom(df)
.pValue(pValue)
.significant(pValue < significanceLevel)
.confidenceInterval95(calculateCI(mean1, mean2, var1, var2, n1, n2))
.effectSize(effectSize)
.relativeImprovementPercent(relativeImprovement)
.sampleSizes(Map.of("control", n1, "treatment", n2))
.build();
}
/**
* 卡方检验(用于比率类指标:好评率、任务完成率等)
*/
public ChiSquareResult performChiSquareTest(
int controlSuccess, int controlTotal,
int treatmentSuccess, int treatmentTotal,
double significanceLevel) {
// 四格表
int a = controlSuccess;
int b = controlTotal - controlSuccess;
int c = treatmentSuccess;
int d = treatmentTotal - treatmentSuccess;
int n = a + b + c + d;
// 检查样本量是否足够(四格的期望频数都要>5)
double minExpected = Math.min(
Math.min((double)(a + b) * (a + c), (double)(a + b) * (b + d)),
Math.min((double)(c + d) * (a + c), (double)(c + d) * (b + d))) / n;
if (minExpected < 5) {
return ChiSquareResult.builder()
.significant(false)
.reason("样本量不足,存在期望频数<5的格子")
.build();
}
// 计算卡方统计量(带Yates连续性修正)
double chiSquare = (double)n * Math.pow(Math.abs((double)a*d - (double)b*c) - (double)n/2, 2) /
((double)(a+b) * (c+d) * (a+c) * (b+d));
// 自由度 = (行数-1)(列数-1) = 1
double pValue = calculateChiSquarePValue(chiSquare, 1);
double controlRate = (double) controlSuccess / controlTotal;
double treatmentRate = (double) treatmentSuccess / treatmentTotal;
double relativeLift = controlRate != 0 ? (treatmentRate - controlRate) / controlRate * 100 : 0;
return ChiSquareResult.builder()
.controlRate(controlRate)
.treatmentRate(treatmentRate)
.chiSquareStatistic(chiSquare)
.pValue(pValue)
.significant(pValue < significanceLevel)
.relativeLiftPercent(relativeLift)
.build();
}
// 数学工具方法
private double mean(List<Double> values) {
return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
}
private double variance(List<Double> values, double mean) {
double sumSquaredDiffs = values.stream()
.mapToDouble(v -> Math.pow(v - mean, 2)).sum();
return sumSquaredDiffs / (values.size() - 1);
}
private double cohensD(double mean1, double mean2, double var1, double var2, int n1, int n2) {
double pooledStd = Math.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2));
return pooledStd != 0 ? Math.abs(mean2 - mean1) / pooledStd : 0;
}
/**
* 使用正态近似计算p值(适用于df > 30的情况)
*/
private double calculatePValue(double tStatistic, double df) {
// 正态近似:当df较大时,t分布接近正态分布
if (df > 30) {
// Altman-Bland近似:p ≈ exp(-0.717t - 0.416t^2)
// 该式直接给出双尾p值(t=1.96时约为0.05),无需再乘2
double t = Math.abs(tStatistic);
double p = Math.exp(-0.717 * t - 0.416 * t * t);
return Math.min(1.0, p);
}
// 对于小样本,使用简化版本(生产环境建议用Apache Commons Math)
return tStatistic > 2.0 ? 0.04 : 0.5; // 简化版本
}
private double calculateChiSquarePValue(double chiSquare, int df) {
// 使用近似方法(生产环境建议用Apache Commons Math)
if (df == 1) {
return chiSquare > 3.84 ? 0.04 : 0.5; // 3.84是df=1, p=0.05的临界值
}
return 0.5;
}
private double[] calculateCI(double mean1, double mean2,
double var1, double var2, int n1, int n2) {
double se = Math.sqrt(var1/n1 + var2/n2);
double diff = mean2 - mean1;
double z = 1.96; // 95%置信区间
return new double[]{diff - z*se, diff + z*se};
}
@Data
@Builder
public static class TTestResult {
private double controlMean;
private double treatmentMean;
private double tStatistic;
private double degreesOfFreedom;
private double pValue;
private boolean significant;
private double[] confidenceInterval95;
private double effectSize;
private double relativeImprovementPercent;
private Map<String, Integer> sampleSizes;
private String reason;
public String getSummary() {
if (!significant) {
return String.format("差异不显著 (p=%.3f),未达到设定的显著性水平,需要更多样本数据。", pValue);
}
String direction = treatmentMean > controlMean ? "提升" : "下降";
return String.format("实验组相对对照组%s %.1f%% (p=%.3f, Cohen's d=%.2f),统计显著。",
direction, Math.abs(relativeImprovementPercent), pValue, effectSize);
}
}
@Data
@Builder
public static class ChiSquareResult {
private double controlRate;
private double treatmentRate;
private double chiSquareStatistic;
private double pValue;
private boolean significant;
private double relativeLiftPercent;
private String reason;
}
}
七、实验报告与自动决策
7.1 实验分析服务
@Service
@RequiredArgsConstructor
@Slf4j
public class ExperimentAnalysisService {
private final ExperimentMetricEventRepository metricEventRepository;
private final ExperimentExposureRepository exposureRepository;
private final StatisticalSignificanceService statisticsService;
private final AbExperimentRepository experimentRepository;
private final ChatClient chatClient;
/**
* 生成完整的实验报告
*/
public ExperimentReport generateReport(String experimentKey) {
AbExperiment experiment = experimentRepository.findByExperimentKey(experimentKey)
.orElseThrow(() -> new IllegalArgumentException("Experiment not found: " + experimentKey));
List<ExperimentVariant> variants = experiment.getVariants();
ExperimentVariant control = variants.stream()
.filter(v -> Boolean.TRUE.equals(v.getIsControl()))
.findFirst()
.orElseThrow();
Map<String, VariantStats> variantStatsMap = new HashMap<>();
// 收集每个变体的统计数据
for (ExperimentVariant variant : variants) {
VariantStats stats = collectVariantStats(experimentKey, variant.getVariantKey());
variantStatsMap.put(variant.getVariantKey(), stats);
}
// 对每个实验组与对照组进行统计检验
VariantStats controlStats = variantStatsMap.get(control.getVariantKey());
Map<String, MetricComparisonResult> comparisons = new HashMap<>();
for (ExperimentVariant variant : variants) {
if (Boolean.TRUE.equals(variant.getIsControl())) continue;
VariantStats treatmentStats = variantStatsMap.get(variant.getVariantKey());
MetricComparisonResult comparison = compareVariants(
controlStats, treatmentStats, experiment.getSignificanceLevel());
comparisons.put(variant.getVariantKey(), comparison);
}
// AI生成实验总结
String aiSummary = generateAiSummary(experiment, variantStatsMap, comparisons);
// 决策建议
DecisionRecommendation recommendation = makeDecisionRecommendation(
comparisons, experiment);
return ExperimentReport.builder()
.experimentKey(experimentKey)
.experimentName(experiment.getName())
.status(experiment.getStatus())
.runDays(calculateRunDays(experiment))
.variantStats(variantStatsMap)
.comparisons(comparisons)
.aiSummary(aiSummary)
.recommendation(recommendation)
.generatedAt(Instant.now())
.build();
}
private VariantStats collectVariantStats(String experimentKey, String variantKey) {
// 曝光数
long exposureCount = exposureRepository
.countByExperimentKeyAndVariantKey(experimentKey, variantKey);
// 各指标平均值
Map<String, Double> metricAverages = metricEventRepository
.findAveragesByExperimentAndVariant(experimentKey, variantKey);
return VariantStats.builder()
.variantKey(variantKey)
.exposureCount(exposureCount)
.metricAverages(metricAverages)
.build();
}
/**
* 自动决策:如果达到统计显著性且效果为正,推荐发布最优变体
*/
public DecisionRecommendation makeDecisionRecommendation(
Map<String, MetricComparisonResult> comparisons,
AbExperiment experiment) {
// 找到所有显著为正的变体
List<Map.Entry<String, MetricComparisonResult>> winners = comparisons.entrySet().stream()
.filter(e -> e.getValue().isPrimaryMetricSignificant() &&
e.getValue().getPrimaryMetricImprovement() > 0 &&
!e.getValue().isGuardrailMetricViolated())
.sorted(Comparator.comparing(e -> -e.getValue().getPrimaryMetricImprovement()))
.collect(Collectors.toList());
if (winners.isEmpty()) {
return DecisionRecommendation.builder()
.action(DecisionAction.CONTINUE_EXPERIMENT)
.reason("尚未发现显著优于对照组的变体,建议继续运行收集更多数据")
.build();
}
Map.Entry<String, MetricComparisonResult> bestWinner = winners.get(0);
return DecisionRecommendation.builder()
.action(DecisionAction.SHIP_WINNER)
.winnerVariantKey(bestWinner.getKey())
.reason(String.format(
"变体 %s 在主指标上相对提升 %.1f%%(p=%.3f)," +
"且所有保护指标均未违反,建议全量发布",
bestWinner.getKey(),
bestWinner.getValue().getPrimaryMetricImprovement(),
bestWinner.getValue().getPrimaryMetricPValue()
))
.build();
}
private String generateAiSummary(AbExperiment experiment,
Map<String, VariantStats> variantStatsMap,
Map<String, MetricComparisonResult> comparisons) {
String dataContext = buildDataContext(experiment, variantStatsMap, comparisons);
return chatClient.prompt()
.system("你是一个AB实验专家,负责用通俗易懂的语言解读实验结果,供产品经理和工程师理解。")
.user(String.format("""
请为以下AB实验结果生成一份简洁的分析总结(200字以内):
%s
重点说明:
1. 实验整体结论(显著/不显著)
2. 最优变体是什么,改善了多少
3. 是否有意外发现或值得关注的指标
""", dataContext))
.call()
.content();
}
public enum DecisionAction {
SHIP_WINNER, // 发布获胜变体
CONTINUE_EXPERIMENT, // 继续实验
STOP_AND_ITERATE, // 停止并重新设计
MANUAL_REVIEW // 需要人工评审
}
}
7.2 实验管理REST API
@RestController
@RequestMapping("/api/experiments")
@RequiredArgsConstructor
public class ExperimentController {
private final AbExperimentRepository experimentRepository;
private final ExperimentAnalysisService analysisService;
private final ExperimentAssignmentService assignmentService;
private final CicdService cicdService; // shipWinner中用到;发布流水线集成服务(类型名按调用处推断)
@PostMapping
public ResponseEntity<AbExperiment> createExperiment(
@RequestBody @Valid CreateExperimentRequest request) {
AbExperiment experiment = AbExperiment.builder()
.experimentKey(request.getExperimentKey())
.name(request.getName())
.description(request.getDescription())
.status(AbExperiment.ExperimentStatus.DRAFT)
.trafficPercentage(request.getTrafficPercentage())
.allocationUnit(request.getAllocationUnit())
.minSampleSize(request.getMinSampleSize())
.significanceLevel(request.getSignificanceLevel() != null ?
request.getSignificanceLevel() : 0.05)
.build();
experiment = experimentRepository.save(experiment);
return ResponseEntity.status(HttpStatus.CREATED).body(experiment);
}
@PutMapping("/{experimentKey}/start")
public ResponseEntity<AbExperiment> startExperiment(
@PathVariable String experimentKey) {
AbExperiment experiment = experimentRepository
.findByExperimentKey(experimentKey)
.orElseThrow(() -> new ResponseStatusException(HttpStatus.NOT_FOUND));
experiment.setStatus(AbExperiment.ExperimentStatus.RUNNING);
experiment.setStartedAt(Instant.now());
experiment = experimentRepository.save(experiment);
return ResponseEntity.ok(experiment);
}
@GetMapping("/{experimentKey}/report")
public ResponseEntity<ExperimentReport> getReport(
@PathVariable String experimentKey) {
ExperimentReport report = analysisService.generateReport(experimentKey);
return ResponseEntity.ok(report);
}
@PostMapping("/{experimentKey}/ship")
public ResponseEntity<ShipResult> shipWinner(
@PathVariable String experimentKey,
@RequestBody ShipRequest request) {
// 标记实验完成,记录获胜变体
AbExperiment experiment = experimentRepository
.findByExperimentKey(experimentKey)
.orElseThrow();
experiment.setStatus(AbExperiment.ExperimentStatus.COMPLETED);
experiment.setEndedAt(Instant.now());
experimentRepository.save(experiment);
// 触发获胜变体的自动发布(集成CI/CD流程)
cicdService.deployVariant(experimentKey, request.getWinnerVariantKey());
return ResponseEntity.ok(ShipResult.builder()
.experimentKey(experimentKey)
.deployedVariant(request.getWinnerVariantKey())
.message("获胜变体已提交发布流水线")
.build());
}
// 供前端查询当前用户的分组(调试用)
@GetMapping("/{experimentKey}/assignment")
public ResponseEntity<Map<String, String>> getMyAssignment(
@PathVariable String experimentKey,
@RequestParam String userId) {
Optional<ExperimentAssignmentService.ExperimentAssignment> assignment =
assignmentService.getAssignment(experimentKey, userId, null);
return ResponseEntity.ok(Map.of(
"experimentKey", experimentKey,
"variantKey", assignment.map(a -> a.getVariantKey()).orElse("not_in_experiment")
));
}
}
八、多变量正交实验设计
8.1 正交实验配置
多变量实验(Multivariate Test, MVT)同时测试多个变量。关键是让各因子的取值组合均匀分布,这样既能独立估计每个因子的主效应,也能观察因子间的交互效应。
@Service
@RequiredArgsConstructor
public class MultivariateExperimentService {
private final ObjectMapper objectMapper; // config序列化用
/**
* 正交实验设计:确保各变量因子均匀分布
* 例如:同时测试提示词版本(2个)× 模型(2个)= 4种组合
*/
public List<ExperimentVariant> designOrthogonalVariants(
Map<String, List<String>> factors) {
List<List<String>> allCombinations = generateCombinations(
new ArrayList<>(factors.values()));
List<String> factorNames = new ArrayList<>(factors.keySet());
List<ExperimentVariant> variants = new ArrayList<>();
// 每个变体平分流量;100除不尽时余数会丢失,必要时把余数补给对照组
int weightPerVariant = 100 / allCombinations.size();
boolean isFirst = true;
for (List<String> combination : allCombinations) {
Map<String, String> config = new HashMap<>();
for (int i = 0; i < factorNames.size(); i++) {
config.put(factorNames.get(i), combination.get(i));
}
variants.add(ExperimentVariant.builder()
.variantKey(buildVariantKey(combination))
.isControl(isFirst)
.trafficWeight(weightPerVariant)
.config(objectMapper.valueToTree(config))
.build());
isFirst = false;
}
return variants;
}
private List<List<String>> generateCombinations(List<List<String>> factors) {
List<List<String>> result = new ArrayList<>();
result.add(new ArrayList<>());
for (List<String> factor : factors) {
List<List<String>> newResult = new ArrayList<>();
for (List<String> existing : result) {
for (String value : factor) {
List<String> newCombination = new ArrayList<>(existing);
newCombination.add(value);
newResult.add(newCombination);
}
}
result = newResult;
}
return result;
}
private String buildVariantKey(List<String> combination) {
return String.join("_", combination).toLowerCase().replace(" ", "_");
}
}
九、实验文化建设
9.1 实验记录与知识沉淀
好的实验文化不只是技术,还需要流程和习惯。以下是刘芳团队建立的实验流程规范:
实验设计清单(每次实验必填):
experiment:
  key: resume_skill_extraction_v3
  hypothesis: "更详细的提示词说明可以提升技能提取的F1 Score"
  background: "当前版本对隐性技能的漏标率为23%"
  primary_metric:
    name: f1_score
    expected_improvement: "+5%"
    minimum_detectable_effect: "+2%"
  guardrail_metrics:
    - name: api_latency_p99
      constraint: "不超过当前版本50ms"
    - name: cost_per_request
      constraint: "不超过当前版本10%"
  sample_size:
    calculation_method: "power_analysis"
    required_per_variant: 1000
    estimated_days_to_collect: 7
  rollback_plan: "如果p99延迟超过500ms,立即回滚"

实验后复盘模板:
## 实验复盘:{实验名称}
### 假设验证
- 假设:...
- 实际结果:...
- 是否验证:✅/❌
### 关键数据
| 指标 | 对照组 | 实验组 | 变化 | 显著? |
|------|--------|--------|------|--------|
| F1 Score | 0.812 | 0.847 | +4.3% | ✅ p<0.01 |
### 意外发现
(记录任何与假设无关但有价值的发现)
### 下一步
(基于实验结果,下一个要验证的假设是什么)

十、性能数据
| 指标 | 数值 |
|---|---|
| 分组决策延迟 | < 2ms(Redis缓存命中) |
| 分组决策延迟(冷启动) | < 10ms |
| 指标记录吞吐 | 10000条/秒(批量写入) |
| 统计计算时间 | < 500ms(10万样本) |
| 自动评估延迟 | 2-8秒(LLM调用) |
| 报告生成时间 | < 3秒 |
实际效果(刘芳团队6个月数据):
- 实验数量:23个(vs 之前0个)
- 有统计显著结论的实验:16个(69%)
- 通过实验发现并避免的质量下降:5次
- AI功能整体质量提升(综合F1 Score):+18.3%
FAQ
Q1:实验样本量怎么计算?
A:使用功效分析(Power Analysis)。关键参数:显著性水平α(通常0.05)、统计功效1-β(通常0.8)、预期效应量(你认为最小有意义的提升是多少)。可以基于Apache Commons Math的统计分布工具自行实现,也可以用在线计算器(如Evan Miller的Awesome A/B Tools)。
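以比率类指标为例,双样本比例检验的每组样本量可以用正态近似公式估算:n = (z_α/2 + z_β)² × (p1(1-p1) + p2(1-p2)) / (p2-p1)²。下面是一个计算草图(基线好评率、目标提升等数值均为示意):

```java
public class SampleSizeCalc {
    /**
     * 双样本比例检验的每组最小样本量(正态近似)
     * n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
     */
    public static long requiredPerVariant(double p1, double p2,
                                          double zAlphaHalf, double zBeta) {
        double variance = p1 * (1 - p1) + p2 * (1 - p2);
        double delta = p2 - p1;
        return (long) Math.ceil(Math.pow(zAlphaHalf + zBeta, 2) * variance / (delta * delta));
    }

    public static void main(String[] args) {
        // α=0.05(双尾z=1.96),功效0.8(z=0.84),基线好评率70%,希望检测到+5个百分点
        System.out.println(requiredPerVariant(0.70, 0.75, 1.96, 0.84)); // → 1247
    }
}
```

可以看到,想检测的效应越小,所需样本量越大(与平方成反比),这也是小流量场景下实验周期长的根本原因。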
Q2:实验运行多久合适?
A:至少运行7天(覆盖完整的周期性波动),且每个变体的样本量需要达到功效分析的要求。不要因为看到"好像有效果"就提前停止——这会导致假阳性率大幅升高(多重比较问题)。
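"偷看"导致假阳性膨胀可以用一个A/A模拟直观验证:两组数据来自同一分布,单次检验本应只有约5%的概率被误判为显著;如果每收集一批数据就检验一次、见显著就停,误判率会成倍升高。以下是模拟草图(用简化的z检验,仅为演示):

```java
import java.util.Random;

public class PeekingSimulation {
    /** 模拟一次A/A实验:分批收集样本,每批结束后做一次z检验(均值差/标准误) */
    public static boolean anyPeekSignificant(Random rnd, int batches, int batchSize) {
        double sumA = 0, sumB = 0, sumSqA = 0, sumSqB = 0;
        int n = 0;
        for (int b = 0; b < batches; b++) {
            for (int i = 0; i < batchSize; i++) {
                double a = rnd.nextGaussian(), bVal = rnd.nextGaussian();
                sumA += a; sumSqA += a * a;
                sumB += bVal; sumSqB += bVal * bVal;
                n++;
            }
            double meanA = sumA / n, meanB = sumB / n;
            // 总体方差近似(大样本下与样本方差差异可忽略)
            double varA = sumSqA / n - meanA * meanA;
            double varB = sumSqB / n - meanB * meanB;
            double se = Math.sqrt(varA / n + varB / n);
            if (se > 0 && Math.abs(meanA - meanB) / se > 1.96) {
                return true; // 任何一次"偷看"到显著就提前停止
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int trials = 2000, falsePositives = 0;
        for (int t = 0; t < trials; t++) {
            if (anyPeekSignificant(rnd, 10, 100)) falsePositives++;
        }
        // A/A实验单次检验假阳性约5%;偷看10次后会明显高于这个水平
        System.out.printf("偷看10次的假阳性率 ≈ %.1f%%%n", 100.0 * falsePositives / trials);
    }
}
```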
Q3:LLM-as-Judge评估会有偏差吗?
A:会。已知的偏差包括:位置偏见(更倾向于给第一个输出打高分)、冗长偏见(更倾向于给长输出打高分)、自我偏见(GPT-4给GPT-4的输出打高分)。缓解方法:随机化输出顺序、使用与被评估模型不同厂商的评估模型、结合人工采样校准。
Q4:一个功能同时跑多个实验会不会互相干扰?
A:会,这叫"实验干扰"(Experiment Interference)。解决方案:使用正交分层(Orthogonal Layering),把不同维度的实验放在独立的层(Layer)中,不同层的用户分组相互独立,互不影响。这是Google、Facebook等大厂的标准做法。
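分层的工程实现很简单:每一层用独立的盐值参与哈希,同一用户在不同层得到互不相关的桶号。与3.1的computeBucket对照,差别只在哈希输入多了层标识(以下为示意草图,层名为假设):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LayeredBucketing {
    /** 每层用独立的盐值做哈希,不同层的分桶近似相互独立 */
    public static int bucket(String layer, String userId) {
        try {
            byte[] h = MessageDigest.getInstance("MD5")
                    .digest((layer + ":" + userId).getBytes(StandardCharsets.UTF_8));
            long v = ((long) (h[0] & 0xFF) << 24) | ((long) (h[1] & 0xFF) << 16)
                    | ((long) (h[2] & 0xFF) << 8) | (h[3] & 0xFF);
            return (int) (v % 10000);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 同一用户在"提示词实验层"和"模型实验层"各自独立分桶,互不干扰
        System.out.println(bucket("prompt_layer", "user_123"));
        System.out.println(bucket("model_layer", "user_123"));
        // 同层内仍是确定性的(粘性分组)
        System.out.println(bucket("prompt_layer", "user_123") == bucket("prompt_layer", "user_123")); // true
    }
}
```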
Q5:实验数据量小(每天只有几百请求)怎么办?
A:样本量小时,需要运行更长时间;也可以在业务可接受的前提下放宽显著性水平α(代价是假阳性风险升高)。另一条路是贝叶斯实验方法:它对小样本更友好,输出的是"某变体更好的概率"而非"是否显著",更适合小流量场景。
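贝叶斯方法的核心输出可以用蒙特卡洛近似:对好评率这类伯努利指标,两组的后验分别是Beta(成功数+1, 失败数+1),反复从两个后验采样并统计P(B优于A)。下面是一个仅依赖标准库的草图(利用整数参数的Gamma分布可由指数分布求和得到;示例数据为虚构):

```java
import java.util.Random;

public class BayesianAB {
    /** 整数形参的Gamma(k,1)采样:k个独立指数分布之和 */
    static double gammaInt(Random rnd, int k) {
        double logProd = 0;
        for (int i = 0; i < k; i++) logProd += Math.log(rnd.nextDouble());
        return -logProd;
    }

    /** Beta(a,b)采样:X/(X+Y),其中X~Gamma(a), Y~Gamma(b) */
    static double betaSample(Random rnd, int a, int b) {
        double x = gammaInt(rnd, a), y = gammaInt(rnd, b);
        return x / (x + y);
    }

    /** P(变体B的真实好评率 > 变体A):Beta-Bernoulli后验的蒙特卡洛估计 */
    public static double probBBeatsA(int successA, int totalA,
                                     int successB, int totalB, int samples, long seed) {
        Random rnd = new Random(seed);
        int wins = 0;
        for (int i = 0; i < samples; i++) {
            double pA = betaSample(rnd, successA + 1, totalA - successA + 1);
            double pB = betaSample(rnd, successB + 1, totalB - successB + 1);
            if (pB > pA) wins++;
        }
        return (double) wins / samples;
    }

    public static void main(String[] args) {
        // 小流量示例:A组300次曝光210好评,B组300次曝光228好评
        double p = probBBeatsA(210, 300, 228, 300, 20_000, 42);
        System.out.printf("P(B优于A) ≈ %.2f%n", p);
    }
}
```

输出的"B优于A的概率"可以直接支撑业务决策(比如概率>95%才发布),不需要等到频率学派意义上的显著。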
