AI应用的灰度实验平台:如何科学地验证AI功能效果
一次价值280万的教训
2025年9月,某头部电商平台的产品经理陈晓强在复盘会上坐立不安。
他主导的AI智能客服升级项目,刚刚在董事会被"点名批评"。
起因是这样的:陈晓强的团队换用了新版RAG检索策略,内部测试时大家"感觉好多了",于是没做任何对照实验就直接全量上线。上线后的第一周:
- 用户投诉量从日均120条飙升到850条(增长608%)
- 平均对话轮次从3.2轮增加到7.6轮(用户越聊越烦)
- 客服问题解决率从78%下降到54%
- 7天内有2.3万名用户卸载了App(直接损失估算¥280万GMV)
问题出在哪?新的检索策略在信息密集的商品咨询场景效果变好了,但在售后投诉场景完全失效——用户情绪激动时,系统反而给出了大段的政策解释,让用户更加愤怒。
当董事会问"有没有实验数据"时,陈晓强只能回答:没有。
他们凭感觉做了一个价值280万的决定。
这篇文章,就是帮你避免成为下一个陈晓强。
AI实验平台的核心组件
一个完整的AI灰度实验平台由以下核心组件构成:
核心组件职责
| 组件 | 职责 | 关键技术 |
|---|---|---|
| 流量分配器 | 哈希分桶,确保用户稳定分配 | MurmurHash3 |
| 特征存储 | 存储用户/请求特征,支持多维分层 | Redis + MySQL |
| 事件收集器 | 低延迟采集业务指标 | Kafka |
| 指标计算引擎 | 实时/离线计算实验指标 | Flink/ClickHouse |
| 统计分析模块 | t检验/Mann-Whitney,判断显著性 | Apache Commons Math |
| 控制台 | 实验创建、监控、决策支持 | Spring Boot + Vue |
实验设计:什么叫一个好的AI实验
SMART实验原则
S - Specific(具体):实验假设必须具体可测量
- 差的假设:"新RAG策略更好"
- 好的假设:"新RAG策略在商品咨询场景下,问题解决率提升≥5%"
M - Measurable(可测量):必须有明确的主指标和护栏指标
主指标(North Star): 问题解决率
次要指标: 平均对话轮次、用户满意度评分
护栏指标(不能变差的): 响应延迟P99、错误率、用户投诉率
A - Achievable(可实现):实验规模和时长能检验出预期效果大小
R - Relevant(相关):变量只改一个,控制其他因素
T - Time-bound(有时限):提前确定实验时长
实验设计文档模板
import java.time.LocalDate;
import java.util.List;
/**
* 实验设计文档(每个实验必须有)
*/
public class ExperimentDesign {
// 实验基本信息
String experimentId = "EXP-2025-009";
String name = "RAG分块策略升级实验";
String owner = "陈晓强";
LocalDate startDate = LocalDate.of(2025, 10, 1);
LocalDate endDate = LocalDate.of(2025, 10, 14); // 至少2周
// 实验假设
String hypothesis = "将RAG分块从固定512字节改为语义分块," +
"商品咨询场景下问题解决率提升≥5%,且售后场景不下降";
// 流量配置
double controlTrafficRatio = 0.5; // 50% 对照组(旧策略)
double treatmentTrafficRatio = 0.5; // 50% 实验组(新策略)
// 指标定义
String primaryMetric = "resolution_rate"; // 主指标
List<String> secondaryMetrics = List.of(
"avg_turns", "satisfaction_score", "session_duration"
);
List<String> guardrailMetrics = List.of(
"p99_latency", "error_rate", "complaint_rate" // 护栏:这些不能变差
);
// 最小可检测效应(MDE)
double minimumDetectableEffect = 0.05; // 我们关心5%以上的提升
double statisticalPower = 0.80; // 80%的概率检测到真实效果
double significanceLevel = 0.05; // 95%置信水平
// 分层策略
String stratification = "按场景分层: 商品咨询/售后投诉/物流查询";
// 停止条件(何时提前终止)
String earlyStopConditions = "护栏指标任一劣化超过10%,立即停止";
}
流量分层:避免实验间的干扰
当多个实验同时进行时,如果流量分配不当,实验结果会互相污染。
分层哈希策略
关键原理:不同实验层使用不同的哈希盐(salt),使同一用户在不同实验层中的分桶相互独立,从而消除实验间干扰。
package com.laozhang.experiment.traffic;
import org.springframework.stereotype.Component;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import com.google.common.hash.Hashing;
/**
* 流量分层分配器
* 使用一致性哈希确保用户稳定分配,并实现多实验层隔离
*/
@Component
public class TrafficLayerAllocator {
/**
* 为用户在指定实验层分配桶号
*
* @param userId 用户唯一标识
* @param experimentId 实验ID(作为哈希盐)
* @param totalBuckets 总桶数(通常1000)
* @return 桶号 [0, totalBuckets)
*/
public int allocateBucket(String userId, String experimentId, int totalBuckets) {
// 组合键:userId + experimentId 确保不同实验层独立
String hashKey = userId + "::" + experimentId;
// 使用MurmurHash3,高性能且分布均匀
int hash = Hashing.murmur3_32_fixed()
.hashString(hashKey, StandardCharsets.UTF_8)
.asInt();
// 取绝对值后取模
return Math.abs(hash) % totalBuckets;
}
/**
* 根据桶号确定用户属于哪个实验组
*
* @param bucket 用户桶号
* @param variants 实验变体列表(含流量比例)
* @return 用户所属变体名称
*/
public String assignVariant(int bucket, List<ExperimentVariant> variants) {
int cumulative = 0;
for (ExperimentVariant variant : variants) {
cumulative += variant.bucketCount();
if (bucket < cumulative) {
return variant.name();
}
}
// 兜底:返回对照组
return variants.get(0).name();
}
/**
* 实验变体定义
*/
public record ExperimentVariant(
String name, // 变体名称(如 "control", "treatment")
int bucketCount, // 占用桶数(总和应等于totalBuckets)
Map<String, Object> config // 变体的具体配置
) {}
}
Java实验SDK:嵌入业务代码的实验框架
核心SDK设计
package com.laozhang.experiment.sdk;
import org.springframework.stereotype.Component;
import java.util.*;
/**
* 实验SDK核心类
* 提供简洁的API让业务代码嵌入实验逻辑
*
* 使用示例:
* String variant = experimentSdk.getVariant("rag-strategy-exp", userId);
* if ("treatment".equals(variant)) {
* return newRagStrategy.query(question);
* } else {
* return oldRagStrategy.query(question);
* }
*/
@Component
public class ExperimentSdk {
private final TrafficLayerAllocator allocator;
private final ExperimentConfigRepository configRepo;
private final EventCollector eventCollector;
private final ExperimentCache cache;
public ExperimentSdk(
TrafficLayerAllocator allocator,
ExperimentConfigRepository configRepo,
EventCollector eventCollector,
ExperimentCache cache) {
this.allocator = allocator;
this.configRepo = configRepo;
this.eventCollector = eventCollector;
this.cache = cache;
}
/**
* 获取用户在指定实验中的变体
* 核心方法,业务代码的入口
*/
public String getVariant(String experimentId, String userId) {
// 1. 检查实验是否存在且活跃
ExperimentConfig config = cache.getConfig(experimentId);
if (config == null || !config.isActive()) {
return "control"; // 实验不存在或未启动,返回对照组
}
// 2. 检查用户是否在白名单(QA测试用)
if (config.isWhitelistUser(userId)) {
return config.getWhitelistVariant(userId);
}
// 3. 用户定向实验(仅限特定用户群)
if (config.hasTargetingRules() && !config.matchTargeting(userId)) {
return "control";
}
// 4. 一致性哈希分桶
int bucket = allocator.allocateBucket(userId, experimentId, 1000);
String variant = allocator.assignVariant(bucket, config.getVariants());
// 5. 记录曝光事件
eventCollector.trackExposure(experimentId, userId, variant);
return variant;
}
/**
* 获取变体配置参数
* 适合需要传递配置值的场景(如:不同的temperature值)
*/
public <T> T getVariantConfig(String experimentId, String userId,
String configKey, T defaultValue) {
String variant = getVariant(experimentId, userId);
ExperimentConfig config = cache.getConfig(experimentId);
if (config == null) return defaultValue;
return config.getVariantConfig(variant, configKey, defaultValue);
}
/**
* 记录业务指标(实验的核心数据来源)
*/
public void trackMetric(String experimentId, String userId,
String metricName, double value) {
String variant = cache.getUserVariant(experimentId, userId);
if (variant == null) return;
MetricEvent event = new MetricEvent(
experimentId, userId, variant, metricName, value,
System.currentTimeMillis()
);
eventCollector.trackMetric(event);
}
/**
* 批量记录多个指标
*/
public void trackMetrics(String experimentId, String userId,
Map<String, Double> metrics) {
metrics.forEach((metricName, value) ->
trackMetric(experimentId, userId, metricName, value)
);
}
}
与Spring AI的集成示例
package com.laozhang.experiment.integration;
import com.laozhang.experiment.sdk.ExperimentSdk;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
/**
* 带实验能力的AI问答服务
* 演示如何将实验SDK无缝嵌入AI业务逻辑
*/
@Service
public class ExperimentalAiService {
private static final String RAG_EXPERIMENT_ID = "rag-chunking-strategy-v2";
private static final String MODEL_EXPERIMENT_ID = "model-selection-v1";
private final ExperimentSdk experimentSdk;
private final RagStrategyFactory ragStrategyFactory;
private final ChatClient chatClient;
public ExperimentalAiService(
ExperimentSdk experimentSdk,
RagStrategyFactory ragStrategyFactory,
ChatClient chatClient) {
this.experimentSdk = experimentSdk;
this.ragStrategyFactory = ragStrategyFactory;
this.chatClient = chatClient;
}
/**
* 带实验的问答接口
* 同时测试RAG策略和模型选择两个实验
*/
public AiResponse answer(String userId, String question, String scene) {
Instant start = Instant.now();
// === 实验1:RAG分块策略 ===
String ragVariant = experimentSdk.getVariant(RAG_EXPERIMENT_ID, userId);
RagStrategy ragStrategy = ragStrategyFactory.getStrategy(ragVariant);
// === 实验2:模型选择 ===
String modelVariant = experimentSdk.getVariant(MODEL_EXPERIMENT_ID, userId);
double temperature = experimentSdk.getVariantConfig(
MODEL_EXPERIMENT_ID, userId, "temperature", 0.7
);
// 执行检索
List<String> contexts = ragStrategy.retrieve(question, 5);
// 构建提示词并调用AI
String response = chatClient.prompt()
.system("你是智能客服助手,请基于以下背景信息回答用户问题。\n\n背景信息:\n" +
String.join("\n---\n", contexts))
.user(question)
.call()
.content();
// 计算指标
long latencyMs = Duration.between(start, Instant.now()).toMillis();
int turnCount = 1; // 实际场景中从会话上下文获取
// === 上报指标 ===
// 对两个实验都上报,系统会自动关联到对应变体
Map<String, Double> metrics = Map.of(
"latency_ms", (double) latencyMs,
"context_count", (double) contexts.size(),
"response_length", (double) response.length()
);
experimentSdk.trackMetrics(RAG_EXPERIMENT_ID, userId, metrics);
experimentSdk.trackMetrics(MODEL_EXPERIMENT_ID, userId, metrics);
return new AiResponse(response, ragVariant, modelVariant, latencyMs);
}
/**
* 记录用户明确反馈(问题是否解决)
* 这是最重要的业务指标
*/
public void recordFeedback(String userId, String sessionId, boolean resolved,
int satisfactionScore) {
// 记录主指标
experimentSdk.trackMetric(RAG_EXPERIMENT_ID, userId, "resolution_rate",
resolved ? 1.0 : 0.0);
experimentSdk.trackMetric(RAG_EXPERIMENT_ID, userId, "satisfaction_score",
satisfactionScore);
experimentSdk.trackMetric(MODEL_EXPERIMENT_ID, userId, "resolution_rate",
resolved ? 1.0 : 0.0);
experimentSdk.trackMetric(MODEL_EXPERIMENT_ID, userId, "satisfaction_score",
satisfactionScore);
}
public record AiResponse(String content, String ragVariant,
String modelVariant, long latencyMs) {}
}
指标体系:业务指标、质量指标、体验指标
AI应用的三层指标金字塔
指标采集实现
package com.laozhang.experiment.metrics;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;
import java.util.UUID;
/**
* 实验指标采集器
* 低延迟异步上报,不影响主业务链路
*/
@Component
public class EventCollector {
private static final Logger log = LoggerFactory.getLogger(EventCollector.class);
private static final String TOPIC_EXPERIMENT_EVENTS = "experiment.events";
private final KafkaTemplate<String, String> kafkaTemplate;
private final ObjectMapper objectMapper;
public EventCollector(KafkaTemplate<String, String> kafkaTemplate,
ObjectMapper objectMapper) {
this.kafkaTemplate = kafkaTemplate;
this.objectMapper = objectMapper;
}
/**
* 曝光事件:用户看到了某个变体
*/
public void trackExposure(String experimentId, String userId, String variant) {
sendEvent(ExperimentEvent.builder()
.eventId(UUID.randomUUID().toString())
.eventType("EXPOSURE")
.experimentId(experimentId)
.userId(userId)
.variant(variant)
.timestamp(System.currentTimeMillis())
.build());
}
/**
* 指标事件:业务指标数值
*/
public void trackMetric(MetricEvent event) {
sendEvent(ExperimentEvent.builder()
.eventId(UUID.randomUUID().toString())
.eventType("METRIC")
.experimentId(event.experimentId())
.userId(event.userId())
.variant(event.variant())
.metricName(event.metricName())
.metricValue(event.value())
.timestamp(event.timestamp())
.build());
}
/**
* 转化事件:用户完成了关键行为
*/
public void trackConversion(String experimentId, String userId,
String conversionType, Map<String, Object> properties) {
sendEvent(ExperimentEvent.builder()
.eventId(UUID.randomUUID().toString())
.eventType("CONVERSION")
.experimentId(experimentId)
.userId(userId)
.conversionType(conversionType)
.properties(properties)
.timestamp(System.currentTimeMillis())
.build());
}
private void sendEvent(ExperimentEvent event) {
try {
String json = objectMapper.writeValueAsString(event);
// 使用userId作为分区键,保证同一用户的事件有序
kafkaTemplate.send(TOPIC_EXPERIMENT_EVENTS, event.userId(), json);
} catch (Exception e) {
// 采集失败不影响主业务
log.warn("实验事件采集失败: {}", e.getMessage());
}
}
}
典型AI场景的指标清单
/**
* AI应用标准指标枚举
* 涵盖对话、推荐、内容生成三大场景
*/
public enum AiMetric {
// ===== 对话场景指标 =====
RESOLUTION_RATE("resolution_rate", "问题解决率", "对话"),
AVG_TURNS("avg_turns", "平均对话轮次", "对话"), // 越少越好
FIRST_RESPONSE_QUALITY("frq_score", "首轮回复质量", "对话"),
ESCALATION_RATE("escalation_rate", "转人工率", "对话"), // 越低越好
// ===== 质量指标 =====
HALLUCINATION_RATE("hallucination_rate", "幻觉率", "质量"), // 越低越好
RELEVANCE_SCORE("relevance_score", "相关性得分", "质量"),
CITATION_ACCURACY("citation_acc", "引用准确率", "质量"),
TOXICITY_RATE("toxicity_rate", "有害内容率", "质量"), // 护栏指标
// ===== 体验指标 =====
RESPONSE_LATENCY_P50("latency_p50", "P50延迟(ms)", "体验"),
RESPONSE_LATENCY_P99("latency_p99", "P99延迟(ms)", "体验"), // 护栏指标
SATISFACTION_SCORE("satisfaction", "用户满意度", "体验"),
THUMBS_UP_RATE("thumbs_up", "点赞率", "体验"),
THUMBS_DOWN_RATE("thumbs_down", "踩率", "体验"), // 护栏指标
// ===== 业务指标 =====
SESSION_GMV("session_gmv", "会话GMV贡献", "业务"),
RETENTION_7D("retention_7d", "7日留存率", "业务"),
NPS_SCORE("nps", "净推荐值", "业务");
private final String key;
private final String description;
private final String category;
AiMetric(String key, String description, String category) {
this.key = key;
this.description = description;
this.category = category;
}
}
统计显著性:何时可以拍板"A比B好"
为什么不能只看均值
假设A组的问题解决率是76%,B组是78%,能说B更好吗?
不能! 你需要问:这2%的差异是真实的提升,还是随机波动?
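把这个问题落到代码上:对两组解决率做一次双比例 z 检验即可。下面是一个自包含的示意(类名 `ZTestDemo` 与样本数字均为假设)。当每组只有1000次对话时,76% vs 78% 的差异并不显著:

```java
public class ZTestDemo {

    // 双比例 z 检验的 z 统计量:|z| > 1.96 时在 95% 置信水平下显著
    public static double zScore(int controlConv, int controlTotal,
                                int treatConv, int treatTotal) {
        double p1 = (double) controlConv / controlTotal;
        double p2 = (double) treatConv / treatTotal;
        // 合并比例与标准误
        double pooled = (double) (controlConv + treatConv) / (controlTotal + treatTotal);
        double se = Math.sqrt(pooled * (1 - pooled)
                * (1.0 / controlTotal + 1.0 / treatTotal));
        return (p2 - p1) / se;
    }

    public static void main(String[] args) {
        // 各1000次对话:A组解决率76%,B组78%
        double z = zScore(760, 1000, 780, 1000);
        System.out.printf("z = %.2f,%s%n", z,
                Math.abs(z) > 1.96 ? "显著" : "不显著,2%的差异可能只是噪声");
    }
}
```

同样的2%差异,若把每组样本量扩大到5000,z 值会超过1.96而变得显著——这正是样本量计算要回答的问题。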
t检验实现
package com.laozhang.experiment.stats;
import org.apache.commons.math3.stat.inference.TTest;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
import org.springframework.stereotype.Component;
import java.util.List;
/**
* 统计显著性检验引擎
* 基于Apache Commons Math实现t检验和Mann-Whitney U检验
*/
@Component
public class StatisticalSignificanceTester {
private static final double DEFAULT_ALPHA = 0.05; // 5%显著性水平(95%置信度)
private final TTest tTest = new TTest();
/**
* 两组独立样本t检验
* 适用于:连续指标(如满意度评分、延迟等)
*
* @param controlSamples 对照组样本数据
* @param treatmentSamples 实验组样本数据
* @param alpha 显著性水平(通常0.05)
* @return 检验结果
*/
public TestResult twoSampleTTest(
List<Double> controlSamples,
List<Double> treatmentSamples,
double alpha) {
if (controlSamples.size() < 30 || treatmentSamples.size() < 30) {
return TestResult.insufficient("样本量不足(至少需要30个样本)");
}
double[] control = controlSamples.stream().mapToDouble(Double::doubleValue).toArray();
double[] treatment = treatmentSamples.stream().mapToDouble(Double::doubleValue).toArray();
// 计算p值(双尾检验)
double pValue = tTest.tTest(control, treatment);
// 计算效应量(Cohen's d)
double cohenD = calculateCohensD(control, treatment);
// 计算置信区间
double[] confidenceInterval = calculateConfidenceInterval(control, treatment, alpha);
// 计算统计功效
double power = calculatePower(control.length, treatment.length, cohenD, alpha);
DescriptiveStatistics controlStats = new DescriptiveStatistics(control);
DescriptiveStatistics treatmentStats = new DescriptiveStatistics(treatment);
return TestResult.builder()
.isSignificant(pValue < alpha)
.pValue(pValue)
.alpha(alpha)
.controlMean(controlStats.getMean())
.treatmentMean(treatmentStats.getMean())
.relativeLift((treatmentStats.getMean() - controlStats.getMean()) /
controlStats.getMean())
.cohenD(cohenD)
.confidenceIntervalLow(confidenceInterval[0])
.confidenceIntervalHigh(confidenceInterval[1])
.statisticalPower(power)
.controlSampleSize(control.length)
.treatmentSampleSize(treatment.length)
.conclusion(buildConclusion(pValue, alpha, controlStats.getMean(),
treatmentStats.getMean()))
.build();
}
/**
* 比例检验(适用于:解决率、点赞率等转化类指标)
*/
public TestResult proportionTest(
int controlConversions, int controlTotal,
int treatmentConversions, int treatmentTotal,
double alpha) {
if (controlTotal < 100 || treatmentTotal < 100) {
return TestResult.insufficient("样本量不足(比例检验至少需要100个样本)");
}
double p1 = (double) controlConversions / controlTotal;
double p2 = (double) treatmentConversions / treatmentTotal;
// 合并比例
double pPooled = (double)(controlConversions + treatmentConversions) /
(controlTotal + treatmentTotal);
// Z统计量
double se = Math.sqrt(pPooled * (1 - pPooled) *
(1.0/controlTotal + 1.0/treatmentTotal));
double zScore = (p2 - p1) / se;
// 双尾p值
double pValue = 2 * (1 - normalCDF(Math.abs(zScore)));
double relativeLift = (p2 - p1) / p1;
return TestResult.builder()
.isSignificant(pValue < alpha)
.pValue(pValue)
.controlMean(p1)
.treatmentMean(p2)
.relativeLift(relativeLift)
.conclusion(buildProportionConclusion(pValue, alpha, p1, p2, relativeLift))
.build();
}
private String buildConclusion(double pValue, double alpha,
double controlMean, double treatmentMean) {
String direction = treatmentMean > controlMean ? "提升" : "下降";
double change = Math.abs(treatmentMean - controlMean);
if (pValue < alpha) {
return String.format("统计显著(p=%.4f < %.2f)。实验组相较对照组%s了%.4f," +
"结论可信,建议%s。",
pValue, alpha, direction, change,
treatmentMean > controlMean ? "全量推广" : "放弃该方案");
} else {
return String.format("统计不显著(p=%.4f >= %.2f)。" +
"当前数据不足以证明两组有差异,建议继续收集数据或增大流量。",
pValue, alpha);
}
}
private String buildProportionConclusion(double pValue, double alpha,
double p1, double p2, double lift) {
if (pValue < alpha) {
String direction = lift > 0 ? "提升" : "下降";
return String.format("统计显著(p=%.4f)。实验组转化率%.2f%%," +
"对照组%.2f%%,相对%s%.1f%%。",
pValue, p2 * 100, p1 * 100, direction, Math.abs(lift) * 100);
} else {
return String.format("统计不显著(p=%.4f >= %.2f)。两组无显著差异。",
pValue, alpha);
}
}
// 正态分布CDF近似
private double normalCDF(double z) {
return 0.5 * (1 + erf(z / Math.sqrt(2)));
}
private double erf(double x) {
double t = 1.0 / (1.0 + 0.5 * Math.abs(x));
double tau = t * Math.exp(-x*x - 1.26551223 + t*(1.00002368 + t*(0.37409196 +
t*(0.09678418 + t*(-0.18628806 + t*(0.27886807 + t*(-1.13520398 +
t*(1.48851587 + t*(-0.82215223 + t*0.17087294)))))))));
return x >= 0 ? 1 - tau : tau - 1;
}
// 计算Cohen's d效应量
private double calculateCohensD(double[] g1, double[] g2) {
DescriptiveStatistics s1 = new DescriptiveStatistics(g1);
DescriptiveStatistics s2 = new DescriptiveStatistics(g2);
double pooledStd = Math.sqrt((s1.getVariance() + s2.getVariance()) / 2);
return pooledStd > 0 ? (s2.getMean() - s1.getMean()) / pooledStd : 0;
}
private double[] calculateConfidenceInterval(double[] g1, double[] g2, double alpha) {
// 简化计算,实际可用Commons Math的完整实现
DescriptiveStatistics s1 = new DescriptiveStatistics(g1);
DescriptiveStatistics s2 = new DescriptiveStatistics(g2);
double diff = s2.getMean() - s1.getMean();
double se = Math.sqrt(s1.getVariance()/g1.length + s2.getVariance()/g2.length);
double z = 1.96; // 95%置信区间
return new double[]{diff - z * se, diff + z * se};
}
private double calculatePower(int n1, int n2, double d, double alpha) {
// 简化的功效计算(实际场景建议用专业功效分析工具)
double nHarmonic = 2.0 / (1.0/n1 + 1.0/n2);
double lambda = Math.abs(d) * Math.sqrt(nHarmonic / 2);
return Math.min(0.99, Math.max(0.01, normalCDF(lambda - 1.645)));
}
}
样本量计算器
/**
* 实验开始前:计算需要多少样本才能检测出效果
*/
@Component
public class SampleSizeCalculator {
/**
* 计算最小样本量
*
* @param baselineRate 基准转化率(如当前问题解决率0.78)
* @param minDetectableEffect 最小可检测效应(如希望检测5%提升,传入0.05)
* @param alpha 显著性水平(0.05)
* @param power 统计功效(0.80)
* @return 每组所需最小样本量
*/
public int calculateForProportion(double baselineRate, double minDetectableEffect,
double alpha, double power) {
double treatmentRate = baselineRate * (1 + minDetectableEffect);
// Z值
double zAlpha = 1.96; // alpha=0.05,双尾
double zBeta = 0.84; // power=0.80
// 样本量公式
double pBar = (baselineRate + treatmentRate) / 2;
double numerator = Math.pow(zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
zBeta * Math.sqrt(baselineRate * (1 - baselineRate) +
treatmentRate * (1 - treatmentRate)), 2);
double denominator = Math.pow(treatmentRate - baselineRate, 2);
int sampleSize = (int) Math.ceil(numerator / denominator);
System.out.printf("基准转化率: %.1f%%%n", baselineRate * 100);
System.out.printf("最小检测效应: %.1f%%%n", minDetectableEffect * 100);
System.out.printf("每组所需样本: %d%n", sampleSize);
System.out.printf("总样本量: %d(双组)%n", sampleSize * 2);
// 根据日均流量估算实验天数
// 假设日均10000用户,50%进入实验
int dailyTraffic = 10000;
double experimentTrafficRatio = 0.5;
int daysNeeded = (int) Math.ceil(
sampleSize / (dailyTraffic * experimentTrafficRatio / 2));
System.out.printf("预估实验时长: %d 天(日均%d用户,%d%%流量参与实验)%n",
daysNeeded, dailyTraffic, (int)(experimentTrafficRatio * 100));
return sampleSize;
}
}
使用示例:
基准解决率: 78.0%,检测5%相对提升,alpha=0.05,power=0.80
→ 每组所需样本: 1,652
→ 总样本量: 3,304
→ 预估实验时长: 1 天(10000日活,50%流量)
实验加速:如何缩短实验周期
方法1:增大实验流量比例
风险:影响更多用户,如果新版本有问题影响面更大。 建议:先10%流量跑1-2天,确认无问题再扩到50%。
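这个放量节奏可以固化成一张简单的时间表(一个示意草图,`RampSchedule` 为假设的类名,天数与比例按上文建议取值):

```java
public class RampSchedule {

    // 实验开始后第 day 天,实验组应占的流量比例
    public static double treatmentRatio(int day) {
        if (day <= 2) return 0.10;  // 前2天:10%流量,重点盯护栏指标
        return 0.50;                // 确认无问题后,进入正式的50/50对比期
    }

    public static void main(String[] args) {
        for (int day = 1; day <= 4; day++) {
            System.out.printf("第%d天: %.0f%%%n", day, treatmentRatio(day) * 100);
        }
    }
}
```

注意:放量阶段的数据只用于护栏监控,正式统计分析应只使用流量比例稳定后(50/50)的数据,否则不同时期的用户构成差异会污染结论。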
方法2:分层抽样(Stratified Sampling)
针对不同用户群单独分析,可以更早发现效果:
/**
* 分层分析:发现实验在不同用户群中的差异效果
*/
public Map<String, TestResult> stratifiedAnalysis(
String experimentId,
List<String> stratifications) {
Map<String, TestResult> results = new HashMap<>();
for (String stratum : stratifications) {
// 获取该层用户的实验数据
List<Double> controlData = dataService.getMetricByStratum(
experimentId, "control", stratum);
List<Double> treatmentData = dataService.getMetricByStratum(
experimentId, "treatment", stratum);
if (controlData.size() >= 30 && treatmentData.size() >= 30) {
TestResult result = tester.twoSampleTTest(controlData, treatmentData, 0.05);
results.put(stratum, result);
}
}
return results;
}
方法3:序贯检验(Sequential Testing)
不等实验结束,实时监控是否可以提前决策:
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
import org.springframework.stereotype.Component;
import java.util.List;
/**
* Always Valid Inference - 允许随时查看实验结果的统计方法
* 避免传统方法中"频繁查看导致假阳性"的问题
*/
@Component
public class SequentialTester {
/**
* 使用mSPRT(mixture Sequential Probability Ratio Test)
* 这是Optimizely等实验平台使用的方法
*/
public SequentialTestResult test(
List<Double> controlSamples,
List<Double> treatmentSamples,
double mde, // 最小检测效应
double alpha) { // 显著性水平
// 计算混合统计量(简化版)
// 完整实现参考论文: "Always Valid Inference" (Johari et al., 2022)
double variance = estimateVariance(controlSamples, treatmentSamples);
double n = Math.min(controlSamples.size(), treatmentSamples.size());
double controlMean = average(controlSamples);
double treatmentMean = average(treatmentSamples);
double diff = treatmentMean - controlMean;
// 混合统计量
double tau2 = variance * (1.0 / (n * mde * mde));
double mixtureLR = Math.sqrt(1 + n / (n + tau2)) *
Math.exp((n * n * diff * diff) / (2 * variance * (n + tau2)));
// 临界值(基于alpha)
double threshold = 1.0 / alpha;
boolean canDecide = mixtureLR > threshold;
String decision;
if (!canDecide) {
decision = "继续收集数据(置信度不足)";
} else if (diff > 0) {
decision = "实验组显著更好,可以全量推广";
} else {
decision = "实验组显著更差,建议立即停止";
}
return new SequentialTestResult(mixtureLR, threshold, canDecide,
diff, variance, decision);
}
private double average(List<Double> data) {
return data.stream().mapToDouble(Double::doubleValue).average().orElse(0);
}
private double estimateVariance(List<Double> g1, List<Double> g2) {
DescriptiveStatistics stats1 = new DescriptiveStatistics(
g1.stream().mapToDouble(Double::doubleValue).toArray());
DescriptiveStatistics stats2 = new DescriptiveStatistics(
g2.stream().mapToDouble(Double::doubleValue).toArray());
return (stats1.getVariance() + stats2.getVariance()) / 2;
}
}
实战:RAG分块策略的完整实验流程
下面是一个完整的端到端实验,从设计到决策:
package com.laozhang.experiment.example;
import org.springframework.stereotype.Service;
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
/**
* 完整的RAG分块策略实验
* 演示从设计、执行到决策的全流程
*/
@Service
public class RagChunkingExperiment {
private static final String EXPERIMENT_ID = "rag-chunking-v3";
private final ExperimentSdk experimentSdk;
private final StatisticalSignificanceTester tester;
private final SampleSizeCalculator calculator;
private final ExperimentReportService reportService;
// Step 1: 创建实验(通过管理API)
public void createExperiment() {
ExperimentConfig config = ExperimentConfig.builder()
.id(EXPERIMENT_ID)
.name("RAG语义分块 vs 固定大小分块")
.hypothesis("语义分块将问题解决率从78%提升至83%(+6.4%)")
.startDate(LocalDate.now())
.endDate(LocalDate.now().plusDays(14))
.variants(List.of(
new Variant("control", 500, // 50%流量
Map.of("chunk_strategy", "fixed", "chunk_size", "512")),
new Variant("treatment", 500, // 50%流量
Map.of("chunk_strategy", "semantic", "model", "text-embedding-3-small"))
))
.primaryMetric("resolution_rate")
.guardrailMetrics(List.of("latency_p99", "error_rate"))
.guardrailThresholds(Map.of("latency_p99", 3000.0, "error_rate", 0.02))
.build();
// 验证样本量是否足够
int requiredSamples = calculator.calculateForProportion(0.78, 0.064, 0.05, 0.80);
System.out.println("需要每组 " + requiredSamples + " 个样本");
// 输出: 需要每组 987 个样本 → 日均5000用户(每组2500),1天即可满足
}
// Step 2: 业务代码集成(参见上文ExperimentalAiService)
// Step 3: 每日健康检查
public void dailyHealthCheck() {
ExperimentHealth health = reportService.getHealth(EXPERIMENT_ID);
// 检查护栏指标
if (health.getGuardrailViolations().size() > 0) {
System.err.println("护栏指标告警!立即停止实验:" +
health.getGuardrailViolations());
experimentSdk.stopExperiment(EXPERIMENT_ID);
}
System.out.printf("当前进度:control=%d, treatment=%d%n",
health.getControlSamples(), health.getTreatmentSamples());
}
// Step 4: 实验结论与决策
public ExperimentDecision makeDecision() {
ExperimentData data = reportService.getData(EXPERIMENT_ID);
// 主指标检验(问题解决率 - 比例检验)
TestResult primaryResult = tester.proportionTest(
data.controlConversions(), data.controlTotal(),
data.treatmentConversions(), data.treatmentTotal(),
0.05
);
// 分层分析
Map<String, TestResult> stratifiedResults = stratifiedAnalysis(
EXPERIMENT_ID,
List.of("商品咨询", "售后投诉", "物流查询", "账号问题")
);
// 生成决策报告
return ExperimentDecision.builder()
.experimentId(EXPERIMENT_ID)
.primaryResult(primaryResult)
.stratifiedResults(stratifiedResults)
.recommendation(buildRecommendation(primaryResult, stratifiedResults))
.build();
}
private String buildRecommendation(TestResult primary,
Map<String, TestResult> stratified) {
if (!primary.isSignificant()) {
return "继续观察:主指标尚未达到统计显著性";
}
if (primary.getRelativeLift() > 0) {
// 检查是否所有场景都有提升
long negativeScenes = stratified.values().stream()
.filter(r -> r.isSignificant() && r.getRelativeLift() < 0)
.count();
if (negativeScenes > 0) {
return "谨慎推广:整体有提升,但部分场景有负效果,建议针对性配置";
}
return "建议全量:整体提升" +
String.format("%.1f%%", primary.getRelativeLift() * 100) +
",各场景均有正向效果";
} else {
return "建议放弃:实验组效果更差,不推荐推广";
}
}
}
实验结果示例
实验ID: rag-chunking-v3
实验周期: 2025-10-01 ~ 2025-10-14(14天)
样本量:
对照组(固定分块): 34,521 次对话
实验组(语义分块): 34,687 次对话
主指标 - 问题解决率:
对照组: 78.3%(27,030/34,521)
实验组: 83.7%(29,027/34,687)
相对提升: +6.9%
p值: 0.0003(远小于0.05)
结论: 统计显著,建议全量推广
分层分析:
商品咨询: +8.2%(p=0.0001)✓ 显著提升
售后投诉: +5.1%(p=0.0312)✓ 显著提升
物流查询: +4.3%(p=0.0891) 不显著(样本量不足)
账号问题: +7.6%(p=0.0008)✓ 显著提升
护栏指标:
P99延迟: 对照1,847ms → 实验2,103ms(+13.9%,在阈值3000ms内)✓
错误率: 对照0.3% → 实验0.3%(无变化)✓
最终决策: 建议全量推广语义分块策略
预期年化收益: 留存提升 → 预估GMV增加约 ¥850万/年
FAQ
Q1:实验期间发现护栏指标超标怎么办?
立即停止实验,回滚到对照组。先止损,再分析原因。建议在SDK中实现自动护栏监控,超出阈值自动停止。
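自动护栏监控的核心判断逻辑可以很简单(一个示意,`GuardrailCheck` 为假设的类名,"劣化超过10%"沿用上文实验设计中的停止条件):

```java
public class GuardrailCheck {

    // 针对"越小越好"的护栏指标(延迟、错误率、投诉率):
    // 实验组相对对照组劣化超过阈值即触发自动停止
    public static boolean shouldStop(double controlValue, double treatmentValue,
                                     double maxDegradation) {
        if (controlValue <= 0) return false; // 基线为0时无法计算相对劣化
        double degradation = (treatmentValue - controlValue) / controlValue;
        return degradation > maxDegradation;
    }

    public static void main(String[] args) {
        // 假设错误率从0.3%涨到0.8%,阈值10%
        boolean stop = shouldStop(0.003, 0.008, 0.10);
        System.out.println(stop ? "护栏触发:立即停止实验并回滚" : "护栏正常");
    }
}
```

实际接入时,这个判断应放在每日(甚至每小时)的定时任务里,对每个护栏指标逐一检查。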
Q2:实验组效果更好,但统计不显著,怎么办?
有两个选择:继续收集数据直到达到显著性,或接受"真实提升低于MDE、不值得推广"的结论。唯独不要在结果不显著时就拍板。
Q3:如何处理"幸存者偏差"?
分析基于曝光用户,不是所有用户。确保control和treatment的曝光用户在关键属性上无显著差异(使用AA测试验证)。
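AA测试可以直接复用AB实验的检验逻辑:两组用户都走旧策略,按同样的哈希规则分流,期望检验结果不显著。下面是一个自包含示意(`AaTestDemo` 与数字均为假设):

```java
public class AaTestDemo {

    // 双比例 z 检验:AA测试中 |z| 应当小于 1.96(95%置信水平)
    public static double zScore(int c1, int n1, int c2, int n2) {
        double p1 = (double) c1 / n1, p2 = (double) c2 / n2;
        double pooled = (double) (c1 + c2) / (n1 + n2);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2));
        return (p2 - p1) / se;
    }

    public static void main(String[] args) {
        // 两组都用旧策略:解决率 78.1% vs 77.5%
        double z = zScore(781, 1000, 775, 1000);
        System.out.println(Math.abs(z) < 1.96
                ? "AA测试通过:分流无系统性偏差"
                : "AA测试失败:检查哈希分桶或数据采集链路");
    }
}
```

如果AA测试本身就显著,说明分桶或埋点存在系统性偏差,后续的AB结论都不可信,应先修复分流。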
Q4:可以同时运行多少个实验?
理论上不限,但要注意:
- 实验间干扰(用分层哈希解决)
- 人员注意力有限(每个实验需要人跟进)
- 特别重要的功能建议单独实验,避免被其他实验稀释
Q5:实验结束后,数据保留多久?
建议永久保留(指标汇总数据)。原始事件日志可以保留1年。实验知识库是最宝贵的资产,每次实验的结论都要文档化。
总结
陈晓强事件之后,该团队花了2个月搭建了完整的实验平台,再也没有"凭感觉上线"的事情发生。
实验平台的核心价值:
- 把"感觉好多了"变成"置信度95%,提升6.9%"
- 在影响全量用户之前,发现并规避风险
- 沉淀实验知识,避免重蹈覆辙
技术选型建议:
- 初期(日活<10万):可以用LaunchDarkly/Unleash等现成工具
- 中期(日活10-100万):定制化Java SDK + ClickHouse
- 后期(日活>100万):完整的自研实验平台
任何AI功能上线,都应该先问:这个功能的实验假设是什么?怎么衡量成功?
