A/B Testing RAG Systems: Data-Driven Optimization of Your Retrieval Quality

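Lao Zhang · 2026/5/15 · About 21 min read · AI Engineering Practice · RAG, A/B testing, data-driven, experiment framework, Java, quality optimization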

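
"It clearly felt better, so why are users still complaining?"

In November 2025, Su Lei was asked a question at his company's code review meeting that left him speechless.

Su Lei is an AI engineer at a SaaS company, responsible for optimizing the internal knowledge-base Q&A system. Three weeks earlier he had changed the chunking strategy from "fixed 500-character splits" to "semantic recursive chunking" and raised the number of retrieved chunks from top-3 to top-5. Top-5 recall went from 68% to 73%, a number he measured on his own test set of 30 questions.

He shipped the change with confidence.

Two weeks after launch, however, customer support was receiving more complaints, not fewer. Users said "the AI seems to have gotten dumber" and "it often gives irrelevant answers."

At the code review meeting, his colleague Lao Zhou asked him: "Did you run an A/B test before shipping?"

Su Lei paused, then said: "I did test it. On my own test set, recall improved by 5 points."

Lao Zhou pressed on: "Where did your test set come from? How many questions does it have? Does it represent the real distribution of user queries?"

Su Lei had no answer. He had made up his 30 test questions off the top of his head. All of them were "textbook" technical questions, with none of the vague queries, multi-turn follow-ups, or dialect phrasing that real users produce.

His change did work on his test set, but it made things worse for real user queries: the new chunking strategy performed poorly on short documents, and a large share of their users were querying exactly those short documents.

Su Lei spent the next two weeks building a real A/B testing framework, so that every RAG optimization is run to statistical significance before it ships. This article is a complete record of that framework.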

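The Conclusions First (TL;DR)

Aspect	Without A/B testing	With A/B testing
Basis for optimization decisions	Gut feeling + small-sample tests	Statistically significant user-behavior data
Risk of a failed optimization	High (roughly 40% of changes actually make things worse after launch)	Low (failed changes are filtered out during the experiment)
Speed of problem discovery	Slow (you find out from user complaints)	Fast (metric shifts are visible during the experiment)
Multi-variable optimization	Hard (only one variable can be tested at a time)	Possible (multivariate experiments)
Confidence at launch	Low	High

Key takeaways:

RAG optimizations must be validated with A/B tests; gut feeling is not reliable
Core metrics: user satisfaction, citation click rate, conversation turns (fewer is better)
Traffic split: 50% control, 50% treatment (a 90/10 split can be used when the treatment should only receive a small share of traffic)
Statistical significance: p < 0.05 with a large enough sample (at least 200 sessions per group as a floor; small effects need far more)
Focus on absolute rather than relative change: a real 1% lift is worth more than a 20% lift that is statistical noise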

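Why RAG Optimization So Often "Feels Right but Measures Wrong"

Su Lei's problem has a name: overfitting on an internal test set.

He is not alone. Based on the author's conversations with several teams, roughly 60% of RAG engineers "optimize" against a small, hand-built test set. Such test sets share a few problems:

Problem 1: The sample does not represent the real distribution

When engineers write test questions, they tend to write "standard, clear" ones, whereas real user queries are often vague, contain typos, or use dialect.

Problem 2: The sample is too small

With 30 test questions, a 5% gain in accuracy amounts to a difference of 1.5 questions. Statistically, that can easily be random fluctuation; it is not significant.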
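
To put a number on "random fluctuation", here is a minimal sketch of the 95% margin of error of an accuracy measured on n questions, using the normal approximation. The class and method names are hypothetical, and the 70% accuracy is an assumed round figure near Su Lei's 68% to 73% range.

```java
/** Rough 95% margin of error for an accuracy measured on n test questions. */
final class TestSetNoise {

    /** Normal approximation: 1.96 * sqrt(p * (1 - p) / n). */
    static double marginOfError(double accuracy, int n) {
        return 1.96 * Math.sqrt(accuracy * (1 - accuracy) / n);
    }

    public static void main(String[] args) {
        // 30 questions (the size of the hand-built test set in the story)
        System.out.printf("n=30:   +/- %.1f points%n", 100 * marginOfError(0.70, 30));
        // 1000 questions, for comparison
        System.out.printf("n=1000: +/- %.1f points%n", 100 * marginOfError(0.70, 1000));
    }
}
```

At n=30 the margin is roughly 16 percentage points, so a 5-point gain sits well inside the noise. At n=1000 it shrinks to under 3 points.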

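Problem 3: The test environment differs from production

Tests run against a static knowledge base, while the production knowledge base is updated in real time. Tests are single-shot Q&A, while production has multi-turn conversations.

Problem 4: Biased choice of metrics

"Recall" and "accuracy" are engineering metrics, not user metrics. What users actually care about is: was my problem solved? An A/B test must use user-behavior metrics (click rate, satisfaction, follow-up rate).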

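Design Principles for RAG A/B Tests

Choosing the Experiment Unit

For a RAG system, the experiment unit should be the session, not the individual query.

The reason: a user's queries within a session are related to each other. If a single session sometimes uses strategy A and sometimes strategy B, the experience becomes inconsistent, and the "conversation turns" metric can no longer be attributed to either strategy.

Correct:
User A's entire session uses strategy A
User B's entire session uses strategy B

Incorrect:
User A's first question uses strategy A, and the second question uses strategy B

Traffic Allocation Strategy

Experiment Framework Design

Database Schema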

-- Experiment configuration table
CREATE TABLE ab_experiments (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    experiment_id   VARCHAR(64) NOT NULL UNIQUE,  -- e.g. "hyde_vs_baseline_20260501"
    name            VARCHAR(255) NOT NULL,
    description     TEXT,
    status          ENUM('DRAFT', 'RUNNING', 'PAUSED', 'COMPLETED') NOT NULL DEFAULT 'DRAFT',
    traffic_split   JSON NOT NULL,  -- {"control": 50, "treatment": 50}
    config_a        JSON NOT NULL,  -- control group configuration
    config_b        JSON NOT NULL,  -- treatment group configuration
    start_time      DATETIME,
    end_time        DATETIME,
    created_by      VARCHAR(64),
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    INDEX idx_status (status),
    INDEX idx_start_time (start_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Experiment assignment table (which group each session belongs to)
CREATE TABLE experiment_assignments (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    experiment_id   VARCHAR(64) NOT NULL,
    user_id         VARCHAR(64) NOT NULL,
    session_id      VARCHAR(64) NOT NULL,
    group_name      VARCHAR(32) NOT NULL,  -- "CONTROL" or "TREATMENT" (the Java enum name, as written by the service)
    assigned_at     DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    
    UNIQUE KEY uk_session_experiment (session_id, experiment_id),
    INDEX idx_user_experiment (user_id, experiment_id),
    INDEX idx_experiment_group (experiment_id, group_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Query event table (one row per query)
CREATE TABLE query_events (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    event_id        VARCHAR(64) NOT NULL UNIQUE,
    experiment_id   VARCHAR(64),
    group_name      VARCHAR(32),
    session_id      VARCHAR(64) NOT NULL,
    user_id         VARCHAR(64),
    query           TEXT NOT NULL,
    answer          TEXT,
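    retrieved_docs  JSON,           -- list of retrieved documents
    latency_ms      INT,
    
    -- User feedback (updated asynchronously)
    thumbs_up       BOOLEAN,        -- user gave a thumbs up
    thumbs_down     BOOLEAN,        -- user gave a thumbs down
    cited_sources   BOOLEAN,        -- user clicked a cited source
    follow_up_query TEXT,           -- the user's follow-up question (the previous answer did not meet the need)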
    
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    
    INDEX idx_session (session_id),
    INDEX idx_experiment_group (experiment_id, group_name),
    INDEX idx_created_at (created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Session summary table (aggregated once each session ends)
CREATE TABLE session_summaries (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    session_id      VARCHAR(64) NOT NULL UNIQUE,
    experiment_id   VARCHAR(64),
    group_name      VARCHAR(32),
    user_id         VARCHAR(64),
    
    query_count     INT NOT NULL DEFAULT 0,     -- number of queries in the session
    thumbs_up_count INT NOT NULL DEFAULT 0,
    thumbs_down_count INT NOT NULL DEFAULT 0,
    source_click_count INT NOT NULL DEFAULT 0,
    follow_up_count INT NOT NULL DEFAULT 0,     -- number of follow-ups (fewer is better)
    
    -- Derived metrics
    satisfaction_rate DECIMAL(5,4),             -- thumbs_up / (thumbs_up + thumbs_down); NULL when there is no feedback
    source_click_rate DECIMAL(5,4),             -- source_click / query_count
    follow_up_rate  DECIMAL(5,4),               -- follow_up / query_count
    
    session_duration_seconds INT,
    started_at      DATETIME,
    ended_at        DATETIME,
    
    INDEX idx_experiment_group (experiment_id, group_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Java Implementation: A Feature-Flag-Based A/B Testing System

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.laozhang.rag</groupId>
    <artifactId>rag-ab-testing</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.5</version>
    </parent>

    <properties>
        <java.version>21</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-redis</artifactId>
        </dependency>
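        <dependency>
            <groupId>org.springframework.ai</groupId>
            <!-- Spring AI 1.0.0 publishes the OpenAI starter under this id; the older
                 spring-ai-openai-spring-boot-starter id was used by milestone builds -->
            <artifactId>spring-ai-starter-model-openai</artifactId>
            <version>1.0.0</version>
        </dependency>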
        <!-- Statistical computation -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-math3</artifactId>
            <version>3.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
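        <dependency>
            <!-- Spring Boot 3.x manages the MySQL driver under com.mysql:mysql-connector-j;
                 the old mysql:mysql-connector-java coordinates no longer resolve without a version -->
            <groupId>com.mysql</groupId>
            <artifactId>mysql-connector-j</artifactId>
            <scope>runtime</scope>
        </dependency>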
    </dependencies>
</project>

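Experiment Configuration Model

package com.laozhang.rag.experiment;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.time.LocalDateTime;
import java.util.Map;

/**
 * Experiment configuration.
 * Defines the two configurations compared in an A/B test.
 */
@Data
public class ExperimentConfig {
    
    private String experimentId;
    private String name;
    private ExperimentStatus status;
    
    /**
     * Control group (group A): the current production configuration
     */
    private RagConfig controlConfig;
    
    /**
     * Treatment group (group B): the new configuration under test
     */
    private RagConfig treatmentConfig;
    
    /**
     * Traffic allocation: percentage of traffic sent to the control group (0-100)
     */
    private int controlTrafficPercent;
    
    private LocalDateTime startTime;
    private LocalDateTime endTime;
    
    /**
     * RAG strategy configuration.
     * @Builder provides RagConfig.builder(); @Builder.Default keeps the field defaults
     * when building, and the no-args constructor keeps Jackson deserialization working.
     */
    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class RagConfig {
        // Retrieval settings
        @Builder.Default
        @JsonProperty("top_k")
        private int topK = 5;
        
        @Builder.Default
        @JsonProperty("similarity_threshold")
        private double similarityThreshold = 0.65;
        
        // Retrieval strategy
        @Builder.Default
        @JsonProperty("retrieval_strategy")
        private String retrievalStrategy = "vector";  // vector / hybrid / hyde
        
        // HyDE settings
        @Builder.Default
        @JsonProperty("hyde_enabled")
        private boolean hydeEnabled = false;
        
        // Chunking settings (these affect index construction and are usually not switched live in an A/B test)
        @Builder.Default
        @JsonProperty("chunk_size")
        private int chunkSize = 512;
        
        @Builder.Default
        @JsonProperty("chunk_overlap")
        private int chunkOverlap = 50;
        
        // Reranker settings
        @Builder.Default
        @JsonProperty("reranker_enabled")
        private boolean rerankerEnabled = false;
        
        @Builder.Default
        @JsonProperty("reranker_top_k")
        private int rerankerTopK = 3;
        
        // Other optional settings
        private Map<String, Object> additionalConfig;
    }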
    
    public enum ExperimentStatus {
        DRAFT, RUNNING, PAUSED, COMPLETED
    }
}

Experiment Assignment Service

package com.laozhang.rag.experiment;

import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.security.MessageDigest;
import java.time.Duration;
import java.util.Optional;

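/**
 * Experiment assignment service.
 *
 * Responsibilities:
 * 1. Decide which experiment group a user belongs to
 * 2. Ensure the same user always lands in the same group within an experiment (stable at session granularity)
 * 3. Support multiple experiments running at the same time
 */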
@Slf4j
@Service
public class ExperimentAssignmentService {

    private final StringRedisTemplate redisTemplate;
    private final ExperimentRepository experimentRepo;
    private final AssignmentRepository assignmentRepo;
    
    // Key prefix used in Redis
    private static final String ASSIGNMENT_CACHE_PREFIX = "exp:assignment:";
    private static final Duration ASSIGNMENT_TTL = Duration.ofDays(7);

    public ExperimentAssignmentService(
            StringRedisTemplate redisTemplate,
            ExperimentRepository experimentRepo,
            AssignmentRepository assignmentRepo) {
        this.redisTemplate = redisTemplate;
        this.experimentRepo = experimentRepo;
        this.assignmentRepo = assignmentRepo;
    }

    /**
     * Returns the group a session belongs to in the given experiment.
     *
     * @param experimentId experiment ID
     * @param sessionId    session ID (the smallest unit of assignment)
     * @param userId       user ID (optional; keeps the assignment consistent across sessions)
     */
    public ExperimentGroup getAssignment(String experimentId, String sessionId, String userId) {
        // 1. Check the Redis cache first
        String cacheKey = ASSIGNMENT_CACHE_PREFIX + experimentId + ":" + sessionId;
        String cachedGroup = redisTemplate.opsForValue().get(cacheKey);
        
        if (cachedGroup != null) {
            return ExperimentGroup.valueOf(cachedGroup);
        }
        
        // 2. Check the database (an assignment may already exist)
        Optional<ExperimentAssignment> existing = assignmentRepo
            .findByExperimentIdAndSessionId(experimentId, sessionId);
        
        if (existing.isPresent()) {
            ExperimentGroup group = ExperimentGroup.valueOf(existing.get().getGroupName());
            // Write the result back to the cache
            redisTemplate.opsForValue().set(cacheKey, group.name(), ASSIGNMENT_TTL);
            return group;
        }
        
        // 3. First visit: assign a group
        ExperimentConfig config = experimentRepo.findByExperimentId(experimentId)
            .orElseThrow(() -> new ExperimentNotFoundException(experimentId));
        
        if (config.getStatus() != ExperimentConfig.ExperimentStatus.RUNNING) {
            // The experiment is not running: fall back to the control group
            return ExperimentGroup.CONTROL;
        }
        
        ExperimentGroup group = allocateGroup(sessionId, userId, config);
        
        // 4. Persist the assignment
        ExperimentAssignment assignment = new ExperimentAssignment();
        assignment.setExperimentId(experimentId);
        assignment.setSessionId(sessionId);
        assignment.setUserId(userId);
        assignment.setGroupName(group.name());
        try {
            assignmentRepo.save(assignment);
        } catch (org.springframework.dao.DataIntegrityViolationException e) {
            // A concurrent request for the same session already inserted the row
            // (uk_session_experiment). Allocation is deterministic, so both requests
            // normally computed the same group and the duplicate can be ignored.
            log.debug("Assignment for session {} already exists", sessionId);
        }
        
        // 5. Write to the Redis cache
        redisTemplate.opsForValue().set(cacheKey, group.name(), ASSIGNMENT_TTL);
        
        log.debug("Assigned session {} to group {} in experiment {}", 
                 sessionId, group, experimentId);
        
        return group;
    }

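    /**
     * Allocates an experiment group.
     *
     * Hash-and-modulo: the same input always yields the same group (deterministic),
     * and the assignments are spread evenly.
     */
    private ExperimentGroup allocateGroup(String sessionId, String userId, 
                                           ExperimentConfig config) {
        // Hash the userId when it is present (stable across sessions); otherwise fall back to the sessionId
        String hashInput = (userId != null && !userId.isBlank()) ? userId : sessionId;
        
        // floorMod keeps the bucket in 0-99 even for a negative hash
        // (Math.abs(Integer.MIN_VALUE) is still negative)
        int bucket = Math.floorMod(stableHash(hashInput + config.getExperimentId()), 100);
        
        // Apply the traffic split: the first controlTrafficPercent buckets belong to the control group
        if (bucket < config.getControlTrafficPercent()) {
            return ExperimentGroup.CONTROL;
        } else {
            return ExperimentGroup.TREATMENT;
        }
    }
    
    /**
     * Stable, evenly distributed hash built from the first 4 bytes of an MD5 digest.
     * (This is not MurmurHash; MD5 is used only for its even spread, not for security.)
     */
    private int stableHash(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] bytes = md.digest(input.getBytes(java.nio.charset.StandardCharsets.UTF_8));
            int result = 0;
            for (int i = 0; i < 4; i++) {
                result = (result << 8) | (bytes[i] & 0xFF);
            }
            return result;
        } catch (java.security.NoSuchAlgorithmException e) {
            // Fallback; String.hashCode() is also deterministic
            return input.hashCode();
        }
    }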
    
    public enum ExperimentGroup {
        CONTROL, TREATMENT
    }
}
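
As a sanity check on the hash-and-modulo allocation above, the following standalone sketch buckets synthetic session IDs and counts how many land in the control group. The class name, the `session-N` IDs, and the 10,000-session volume are assumptions made for illustration, and it reuses the MD5-based hashing idea instead of the service itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Standalone check that MD5-based bucketing is deterministic and roughly uniform. */
final class BucketDemo {

    /** Maps (unitId, experimentId) to a bucket in 0-99. */
    static int bucket(String unitId, String experimentId) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                .digest((unitId + experimentId).getBytes(StandardCharsets.UTF_8));
            // Pack the first 4 digest bytes into an int (may be negative)
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                  | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return Math.floorMod(h, 100);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Counts how many synthetic sessions fall below the control threshold. */
    static int countControl(int sessions, int controlPercent, String experimentId) {
        int control = 0;
        for (int i = 0; i < sessions; i++) {
            if (bucket("session-" + i, experimentId) < controlPercent) {
                control++;
            }
        }
        return control;
    }

    public static void main(String[] args) {
        int control = countControl(10_000, 50, "hyde_vs_baseline_20260501");
        System.out.println("control sessions out of 10000: " + control);
    }
}
```

Because the experiment ID is part of the hash input, the same user can land in different groups in different experiments, which keeps experiments from sharing the same split.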

RAG Strategy Router

package com.laozhang.rag.experiment;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * RAG request router.
 *
 * Routes each request to a RAG strategy according to its experiment group,
 * and records a query event for later statistical analysis.
 */
@Slf4j
@Service
@RequiredArgsConstructor  // constructor injection for the final fields below
public class RagStrategyRouter {

    private final ExperimentAssignmentService assignmentService;
    private final ExperimentRepository experimentRepo;
    private final QueryEventRepository eventRepo;
    private final BaselineRagService baselineRag;
    private final HydeRagService hydeRag;
    private final HybridRagService hybridRag;

    /**
     * Routes and executes a RAG query.
     */
    public RagResult query(String question, String sessionId, String userId) {
        long startTime = System.currentTimeMillis();
        
        // Load all running experiments
        List<ExperimentConfig> runningExperiments = experimentRepo.findByStatus(
            ExperimentConfig.ExperimentStatus.RUNNING
        );
        
        if (runningExperiments.isEmpty()) {
            // No experiment is running: use the default strategy
            return executeWithDefaultStrategy(question);
        }
        
        // Take the first running experiment (in production, avoid running several experiments at once)
        ExperimentConfig experiment = runningExperiments.get(0);
        
        // Look up the group for this session
        ExperimentAssignmentService.ExperimentGroup group = assignmentService.getAssignment(
            experiment.getExperimentId(), sessionId, userId
        );
        
        // Pick the configuration that matches the group
        ExperimentConfig.RagConfig ragConfig = group == ExperimentAssignmentService.ExperimentGroup.CONTROL
            ? experiment.getControlConfig()
            : experiment.getTreatmentConfig();
        
        // Run the RAG pipeline
        RagResult result = executeWithConfig(question, ragConfig);
        result.setLatencyMs(System.currentTimeMillis() - startTime);
        
        // Record the query event asynchronously
        recordQueryEvent(experiment.getExperimentId(), group.name(), sessionId, 
                        userId, question, result);
        
        return result;
    }

    /**
     * Default strategy when no experiment is running.
     * Assumption: the field defaults declared on RagConfig describe the production setup.
     */
    private RagResult executeWithDefaultStrategy(String question) {
        return executeWithConfig(question, new ExperimentConfig.RagConfig());
    }

    private RagResult executeWithConfig(String question, ExperimentConfig.RagConfig config) {
        return switch (config.getRetrievalStrategy()) {
            case "hyde" -> hydeRag.query(question, config.getTopK());
            case "hybrid" -> hybridRag.query(question, config.getTopK());
            default -> baselineRag.query(question, config.getTopK(), 
                                         config.getSimilarityThreshold());
        };
    }
    
    private void recordQueryEvent(String experimentId, String groupName, 
                                   String sessionId, String userId,
                                   String question, RagResult result) {
        QueryEvent event = new QueryEvent();
        event.setEventId(java.util.UUID.randomUUID().toString());
        event.setExperimentId(experimentId);
        event.setGroupName(groupName);
        event.setSessionId(sessionId);
        event.setUserId(userId);
        event.setQuery(question);
        event.setAnswer(result.getAnswer());
        event.setLatencyMs((int) result.getLatencyMs());
        event.setRetrievedDocs(result.getSources());
        
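        // Save asynchronously so the main request path is not blocked;
        // log failures so lost events do not silently skew the experiment data
        CompletableFuture.runAsync(() -> eventRepo.save(event))
            .exceptionally(ex -> {
                log.warn("Failed to record query event {}", event.getEventId(), ex);
                return null;
            });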
    }
}

Metrics Collection: Recording Real User Feedback

package com.laozhang.rag.experiment;

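import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Optional;

/**
 * User feedback collection service.
 *
 * Collects three kinds of signal:
 * 1. Explicit feedback: thumbs up / thumbs down
 * 2. Implicit feedback: whether the user clicked a source link
 * 3. Behavioral feedback: whether the user asked a follow-up (the previous answer did not meet the need)
 */
@Slf4j
@Service
@RequiredArgsConstructor  // constructor injection for the final fields below
public class FeedbackCollector {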

    private final QueryEventRepository eventRepo;
    private final SessionSummaryRepository summaryRepo;

    /**
     * Records a thumbs up
     */
    public void recordThumbsUp(String eventId) {
        eventRepo.updateFeedback(eventId, true, null, false);
        log.info("Thumbs up for event: {}", eventId);
    }

    /**
     * Records a thumbs down (and captures more detail)
     */
    public void recordThumbsDown(String eventId, String reason) {
        eventRepo.updateFeedback(eventId, false, true, false);
        
        // Store the reason for the thumbs down for later analysis
        if (reason != null && !reason.isBlank()) {
            eventRepo.updateFeedbackReason(eventId, reason);
        }
        
        log.info("Thumbs down for event: {}, reason: {}", eventId, reason);
    }

    /**
     * Records that the user clicked a source link.
     * This is a positive signal: the user found the answer worth checking against its source.
     */
    public void recordSourceClick(String eventId, String sourceDocId) {
        eventRepo.updateSourceClick(eventId, true);
        log.debug("Source clicked for event: {}, doc: {}", eventId, sourceDocId);
    }

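    /**
     * Records a follow-up.
     * When a user asks about a similar question again within the same session,
     * the previous answer did not meet the need. This is a negative signal.
     */
    public void recordFollowUp(String sessionId, String currentEventId, 
                                String followUpQuery) {
        // Find the previous event in the same session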
        Optional<QueryEvent> previousEvent = eventRepo
            .findPreviousEventInSession(sessionId, currentEventId);
        
        previousEvent.ifPresent(prev -> {
            prev.setFollowUpQuery(followUpQuery);
            eventRepo.save(prev);
        });
    }

    /**
     * Aggregates the metrics when a session ends
     */
    public void summarizeSession(String sessionId) {
        List<QueryEvent> events = eventRepo.findBySessionId(sessionId);
        if (events.isEmpty()) return;
        
        QueryEvent firstEvent = events.get(0);
        QueryEvent lastEvent = events.get(events.size() - 1);
        
        long thumbsUpCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getThumbsUp())).count();
        long thumbsDownCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getThumbsDown())).count();
        long sourceClickCount = events.stream().filter(e -> Boolean.TRUE.equals(e.getCitedSources())).count();
        long followUpCount = events.stream().filter(e -> e.getFollowUpQuery() != null).count();
        
        long totalFeedback = thumbsUpCount + thumbsDownCount;
        double satisfactionRate = totalFeedback > 0 ? (double) thumbsUpCount / totalFeedback : -1;
        double sourceClickRate = !events.isEmpty() ? (double) sourceClickCount / events.size() : 0;
        double followUpRate = !events.isEmpty() ? (double) followUpCount / events.size() : 0;
        
        SessionSummary summary = new SessionSummary();
        summary.setSessionId(sessionId);
        summary.setExperimentId(firstEvent.getExperimentId());
        summary.setGroupName(firstEvent.getGroupName());
        summary.setUserId(firstEvent.getUserId());
        summary.setQueryCount(events.size());
        summary.setThumbsUpCount((int) thumbsUpCount);
        summary.setThumbsDownCount((int) thumbsDownCount);
        summary.setSourceClickCount((int) sourceClickCount);
        summary.setFollowUpCount((int) followUpCount);
        summary.setSatisfactionRate(satisfactionRate >= 0 ? satisfactionRate : null);
        summary.setSourceClickRate(sourceClickRate);
        summary.setFollowUpRate(followUpRate);
        
        summaryRepo.save(summary);
    }
}
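
The derived rates in `summarizeSession` can also be written as a small pure function, which makes the "no feedback" case explicit. This is a minimal sketch with hypothetical `Event` and `SessionMetrics` records standing in for the `QueryEvent` and `SessionSummary` entities, which the article does not show.

```java
import java.util.List;
import java.util.OptionalDouble;

/** One query in a session; thumbsUp / thumbsDown stay null until the user reacts. */
record Event(Boolean thumbsUp, Boolean thumbsDown, boolean sourceClicked, boolean followedUp) {}

/** Derived per-session rates, mirroring the rate columns of session_summaries. */
record SessionMetrics(OptionalDouble satisfactionRate, double sourceClickRate, double followUpRate) {

    static SessionMetrics of(List<Event> events) {
        if (events.isEmpty()) {
            throw new IllegalArgumentException("a session needs at least one query");
        }
        long up = events.stream().filter(e -> Boolean.TRUE.equals(e.thumbsUp())).count();
        long down = events.stream().filter(e -> Boolean.TRUE.equals(e.thumbsDown())).count();
        long clicks = events.stream().filter(Event::sourceClicked).count();
        long followUps = events.stream().filter(Event::followedUp).count();

        // No explicit feedback means "unknown", not 0% satisfaction
        OptionalDouble satisfaction = (up + down) > 0
            ? OptionalDouble.of((double) up / (up + down))
            : OptionalDouble.empty();

        return new SessionMetrics(
            satisfaction,
            (double) clicks / events.size(),
            (double) followUps / events.size());
    }
}
```

Keeping the satisfaction rate empty rather than zero matters later: sessions without any thumbs feedback should drop out of the satisfaction comparison instead of dragging a group's average down.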

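Statistical Significance Testing: How to Tell Whether B Is Really Better Than A

"Satisfaction is 82% in the treatment group and 80% in the control group, a 2-point lift. Can we ship it?"

Not necessarily.

That 2-point difference may be nothing more than random fluctuation. A statistical test is needed to decide whether the difference is significant.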
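
To make this concrete, here is a minimal sketch of a pooled two-proportion Z-test using only the JDK. The class name is hypothetical, the group sizes (200 and 10,000 feedback events per group) are assumptions chosen for illustration, and the normal CDF uses a standard polynomial approximation (Abramowitz and Stegun 26.2.17) instead of a statistics library.

```java
/** Pooled two-proportion Z-test, JDK only. */
final class TwoProportionZTest {

    /** Standard normal CDF via the Abramowitz-Stegun 26.2.17 approximation. */
    static double normalCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
                    + t * (-1.821255978 + t * 1.330274429))));
        double upperTail = Math.exp(-0.5 * z * z) / Math.sqrt(2 * Math.PI) * poly;
        return z >= 0 ? 1.0 - upperTail : upperTail;
    }

    /** Two-sided p-value for H0: both groups share the same success rate. */
    static double pValue(long successA, long totalA, long successB, long totalB) {
        double pA = (double) successA / totalA;
        double pB = (double) successB / totalB;
        double pooled = (double) (successA + successB) / (totalA + totalB);
        // Standard error under H0; it is 0 (and the result NaN) if every outcome is identical
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB));
        double z = (pB - pA) / se;
        return 2 * (1 - normalCdf(Math.abs(z)));
    }

    public static void main(String[] args) {
        // 80% vs 82% with 200 feedback events per group
        System.out.println(pValue(160, 200, 164, 200));
        // The same rates with 10,000 feedback events per group
        System.out.println(pValue(8000, 10_000, 8200, 10_000));
    }
}
```

With 200 events per group the p-value is around 0.6, nowhere near 0.05. The same 80% vs 82% split only becomes significant once each group has thousands of events.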

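package com.laozhang.rag.experiment;

import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.math3.stat.inference.ChiSquareTest;
import org.apache.commons.math3.stat.inference.TTest;
import org.springframework.stereotype.Service;

/**
 * Statistical significance testing service.
 *
 * Proportion metrics (satisfaction, click rate): chi-square test or Z-test
 * Continuous metrics (latency, query count): t-test
 *
 * Significance level: p < 0.05 (95% confidence)
 */
@Slf4j
@Service
public class StatisticalSignificanceService {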

    private final TTest tTest = new TTest();
    private final ChiSquareTest chiSquareTest = new ChiSquareTest();

    @Data
    public static class SignificanceResult {
        private boolean isSignificant;
        private double pValue;
        private double controlMean;
        private double treatmentMean;
        private double relativeLift;    // relative lift, in percent
        private double absoluteLift;    // absolute lift
        private int controlSampleSize;
        private int treatmentSampleSize;
        private String recommendation;
        private String statisticalNote;
    }

    /**
     * Tests a proportion metric (e.g. satisfaction, click rate).
     *
     * Runs a chi-square test on the 2x2 table of successes and failures. For two groups
     * this is equivalent to the two-sided two-proportion Z-test.
     *
     * @param controlSuccess   successes in the control group (e.g. thumbs-up count)
     * @param controlTotal     total samples in the control group
     * @param treatmentSuccess successes in the treatment group
     * @param treatmentTotal   total samples in the treatment group
     */
    public SignificanceResult testProportions(
            long controlSuccess, long controlTotal,
            long treatmentSuccess, long treatmentTotal) {
        
        SignificanceResult result = new SignificanceResult();
        result.setControlSampleSize((int) controlTotal);
        result.setTreatmentSampleSize((int) treatmentTotal);
        
        if (controlTotal == 0 || treatmentTotal == 0) {
            // Lombok names the setter setSignificant for a boolean field called isSignificant
            result.setSignificant(false);
            result.setStatisticalNote("Sample size is 0; no statistical test can be run");
            return result;
        }
        
        double controlRate = (double) controlSuccess / controlTotal;
        double treatmentRate = (double) treatmentSuccess / treatmentTotal;
        
        result.setControlMean(controlRate);
        result.setTreatmentMean(treatmentRate);
        result.setAbsoluteLift(treatmentRate - controlRate);
        result.setRelativeLift(controlRate > 0 ? (treatmentRate - controlRate) / controlRate * 100 : 0);
        
        // 2x2 contingency table: rows are groups, columns are success / failure
        long[][] observed = {
            {controlSuccess, controlTotal - controlSuccess},
            {treatmentSuccess, treatmentTotal - treatmentSuccess}
        };
        
        try {
            double pValue = chiSquareTest.chiSquareTest(observed);
            result.setPValue(pValue);
            result.setSignificant(pValue < 0.05);
            
            // Build the recommendation
            result.setRecommendation(generateRecommendation(result));
            result.setStatisticalNote(generateStatisticalNote(result));
            
        } catch (Exception e) {
            result.setSignificant(false);
            result.setStatisticalNote("Statistical test failed: " + e.getMessage());
        }
        
        return result;
    }

    /**
     * Tests a continuous metric (e.g. latency, conversation turns).
     * Uses an independent two-sample t-test.
     */
    public SignificanceResult testContinuous(double[] controlValues, double[] treatmentValues) {
        SignificanceResult result = new SignificanceResult();
        result.setControlSampleSize(controlValues.length);
        result.setTreatmentSampleSize(treatmentValues.length);
        
        if (controlValues.length < 2 || treatmentValues.length < 2) {
            result.setSignificant(false);
            result.setStatisticalNote("Not enough samples; at least 2 per group are required");
            return result;
        }
        
        double controlMean = java.util.Arrays.stream(controlValues).average().orElse(0);
        double treatmentMean = java.util.Arrays.stream(treatmentValues).average().orElse(0);
        
        result.setControlMean(controlMean);
        result.setTreatmentMean(treatmentMean);
        result.setAbsoluteLift(treatmentMean - controlMean);
        result.setRelativeLift(controlMean != 0 ? (treatmentMean - controlMean) / controlMean * 100 : 0);
        
        try {
            double pValue = tTest.tTest(controlValues, treatmentValues);
            result.setPValue(pValue);
            result.setSignificant(pValue < 0.05);
            result.setRecommendation(generateRecommendation(result));
            result.setStatisticalNote(generateStatisticalNote(result));
        } catch (Exception e) {
            result.setSignificant(false);
            result.setStatisticalNote("t-test failed: " + e.getMessage());
        }
        
        return result;
    }
    
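    // Note: the wording below assumes "higher is better". For metrics where lower is
    // better (follow-up rate, latency), read the direction the other way around.
    private String generateRecommendation(SignificanceResult result) {
        if (!result.isSignificant()) {
            return String.format(
                "Difference is not significant (p=%.3f > 0.05); keep collecting data or drop this change",
                result.getPValue()
            );
        }
        
        if (result.getAbsoluteLift() > 0) {
            return String.format(
                "Treatment is significantly better than control (p=%.3f, relative lift %.1f%%); recommend shipping the treatment configuration",
                result.getPValue(), result.getRelativeLift()
            );
        } else {
            return String.format(
                "Treatment is significantly worse than control (p=%.3f, relative drop %.1f%%); recommend keeping the control configuration",
                result.getPValue(), Math.abs(result.getRelativeLift())
            );
        }
    }
    
    // Formats the means as percentages, which fits rate metrics (satisfaction, click rate, follow-up rate)
    private String generateStatisticalNote(SignificanceResult result) {
        return String.format(
            "Control: %.1f%% (n=%d), treatment: %.1f%% (n=%d), absolute difference: %.2f%%, p-value: %.4f",
            result.getControlMean() * 100, result.getControlSampleSize(),
            result.getTreatmentMean() * 100, result.getTreatmentSampleSize(),
            result.getAbsoluteLift() * 100, result.getPValue()
        );
    }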

    /**
     * Computes the minimum sample size needed to reach statistical significance.
     * Helps engineers plan the experiment duration in advance.
     *
     * @param baselineRate     baseline rate of the control group (0.75 if satisfaction is currently 75%)
     * @param minimumDetectableEffect minimum detectable effect (0.01 to detect a 1-point lift)
     */
    public int calculateRequiredSampleSize(double baselineRate, double minimumDetectableEffect) {
        // Sample size formula based on the Z-test
        // n = (Z_α/2 + Z_β)² * (p1*(1-p1) + p2*(1-p2)) / (p1-p2)²
        // α=0.05 (Z=1.96), β=0.2 (Z=0.84), power=80%
        
        double alpha = 0.05;
        double beta = 0.20;
        double zAlpha = 1.96;  // Z score for α=0.05 (two-tailed)
        double zBeta = 0.84;   // Z score for β=0.20 (power=80%)
        
        double p1 = baselineRate;
        double p2 = baselineRate + minimumDetectableEffect;
        
        double numerator = Math.pow(zAlpha + zBeta, 2) * (p1 * (1 - p1) + p2 * (1 - p2));
        double denominator = Math.pow(p1 - p2, 2);
        
        int requiredPerGroup = (int) Math.ceil(numerator / denominator);
        
        log.info("Required sample size per group: {} (baseline: {}, MDE: {})", 
                requiredPerGroup, baselineRate, minimumDetectableEffect);
        
        return requiredPerGroup;
    }
}
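
Plugging numbers into the sample-size formula shows how quickly the requirement grows as the effect shrinks. The sketch below repeats the same formula in a standalone class. The class name, the `daysNeeded` helper, and the figure of 1,000 feedback events per day are assumptions for illustration.

```java
/** Standalone version of the sample-size formula, plus a rough duration estimate. */
final class SampleSizePlanner {

    /** Samples needed per group at alpha = 0.05 (two-sided) and 80% power. */
    static int requiredPerGroup(double baselineRate, double minimumDetectableEffect) {
        double zAlpha = 1.96;
        double zBeta = 0.84;
        double p1 = baselineRate;
        double p2 = baselineRate + minimumDetectableEffect;
        double n = Math.pow(zAlpha + zBeta, 2) * (p1 * (1 - p1) + p2 * (1 - p2))
                 / Math.pow(p2 - p1, 2);
        return (int) Math.ceil(n);
    }

    /** Days needed when dailySamples are split evenly across all groups. */
    static int daysNeeded(int perGroup, int groups, int dailySamples) {
        return (int) Math.ceil((double) perGroup * groups / dailySamples);
    }

    public static void main(String[] args) {
        int fivePoints = requiredPerGroup(0.75, 0.05);  // detect 75% -> 80%
        int twoPoints = requiredPerGroup(0.75, 0.02);   // detect 75% -> 77%
        System.out.println("5-point lift: " + fivePoints + " per group");
        System.out.println("2-point lift: " + twoPoints + " per group");
        System.out.println("days at 1000 samples/day, 2 groups: " + daysNeeded(twoPoints, 2, 1000));
    }
}
```

Detecting a 5-point lift from a 75% baseline takes 1,090 samples per group, while a 2-point lift takes 7,147. At the assumed 1,000 feedback events per day split across two groups, the second experiment runs for about 15 days.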

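Experiment Result Analysis Service

package com.laozhang.rag.experiment;

import com.laozhang.rag.experiment.StatisticalSignificanceService.SignificanceResult;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

import java.time.Instant;
import java.util.List;

@Service
@RequiredArgsConstructor  // constructor injection for the final fields below
public class ExperimentAnalysisService {

    private final SessionSummaryRepository summaryRepo;
    private final QueryEventRepository eventRepo;
    private final StatisticalSignificanceService statsService;

    /**
     * Generates the full experiment analysis report
     */
    public ExperimentReport generateReport(String experimentId) {
        // Load the session data for both groups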
        List<SessionSummary> controlSessions = summaryRepo
            .findByExperimentIdAndGroupName(experimentId, "CONTROL");
        List<SessionSummary> treatmentSessions = summaryRepo
            .findByExperimentIdAndGroupName(experimentId, "TREATMENT");
        
        ExperimentReport report = new ExperimentReport();
        report.setExperimentId(experimentId);
        report.setControlSessionCount(controlSessions.size());
        report.setTreatmentSessionCount(treatmentSessions.size());
        report.setGeneratedAt(Instant.now());
        
        // 1. Satisfaction comparison (feedback counts pooled across sessions)
        long controlThumbsUp = controlSessions.stream()
            .mapToLong(s -> s.getThumbsUpCount()).sum();
        long controlTotalFeedback = controlSessions.stream()
            .mapToLong(s -> s.getThumbsUpCount() + s.getThumbsDownCount()).sum();
        
        long treatmentThumbsUp = treatmentSessions.stream()
            .mapToLong(s -> s.getThumbsUpCount()).sum();
        long treatmentTotalFeedback = treatmentSessions.stream()
            .mapToLong(s -> s.getThumbsUpCount() + s.getThumbsDownCount()).sum();
        
        if (controlTotalFeedback > 0 && treatmentTotalFeedback > 0) {
            SignificanceResult satisfactionResult = statsService.testProportions(
                controlThumbsUp, controlTotalFeedback,
                treatmentThumbsUp, treatmentTotalFeedback
            );
            report.setSatisfactionAnalysis(satisfactionResult);
        }
        
        // 2. Source click rate comparison
        long controlSourceClicks = controlSessions.stream()
            .mapToLong(s -> s.getSourceClickCount()).sum();
        long controlTotalQueries = controlSessions.stream()
            .mapToLong(s -> s.getQueryCount()).sum();
        
        long treatmentSourceClicks = treatmentSessions.stream()
            .mapToLong(s -> s.getSourceClickCount()).sum();
        long treatmentTotalQueries = treatmentSessions.stream()
            .mapToLong(s -> s.getQueryCount()).sum();
        
        if (controlTotalQueries > 0 && treatmentTotalQueries > 0) {
            SignificanceResult sourceClickResult = statsService.testProportions(
                controlSourceClicks, controlTotalQueries,
                treatmentSourceClicks, treatmentTotalQueries
            );
            report.setSourceClickAnalysis(sourceClickResult);
        }
        
        // 3. Follow-up rate comparison (lower is better)
        double[] controlFollowUpRates = controlSessions.stream()
            .mapToDouble(s -> s.getFollowUpRate() != null ? s.getFollowUpRate() : 0.0)
            .toArray();
        double[] treatmentFollowUpRates = treatmentSessions.stream()
            .mapToDouble(s -> s.getFollowUpRate() != null ? s.getFollowUpRate() : 0.0)
            .toArray();
        
        if (controlFollowUpRates.length > 10 && treatmentFollowUpRates.length > 10) {
            SignificanceResult followUpResult = statsService.testContinuous(
                controlFollowUpRates, treatmentFollowUpRates
            );
            report.setFollowUpAnalysis(followUpResult);
        }
        
        // 4. Build the overall recommendation
        report.setOverallRecommendation(generateOverallRecommendation(report));
        
        return report;
    }
    
    private String generateOverallRecommendation(ExperimentReport report) {
        int positiveSignals = 0;
        int negativeSignals = 0;
        int totalSignificantMetrics = 0;
        
        // Satisfaction: higher is better
        if (report.getSatisfactionAnalysis() != null && 
            report.getSatisfactionAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getSatisfactionAnalysis().getAbsoluteLift() > 0) positiveSignals++;
            else negativeSignals++;
        }
        
        // Source click rate: higher is better
        if (report.getSourceClickAnalysis() != null && 
            report.getSourceClickAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getSourceClickAnalysis().getAbsoluteLift() > 0) positiveSignals++;
            else negativeSignals++;
        }
        
        // Follow-up rate: lower is better (a negative lift is a positive signal)
        if (report.getFollowUpAnalysis() != null && 
            report.getFollowUpAnalysis().isSignificant()) {
            totalSignificantMetrics++;
            if (report.getFollowUpAnalysis().getAbsoluteLift() < 0) positiveSignals++;
            else negativeSignals++;
        }
        
        if (totalSignificantMetrics == 0) {
            // The 0.75 baseline and 0.02 minimum detectable effect are fixed planning values here
            return "The results have not reached statistical significance yet. Recommendation: keep the experiment running until each group has collected at least " +
                   statsService.calculateRequiredSampleSize(0.75, 0.02) + " valid feedback events.";
        }
        
        if (positiveSignals > negativeSignals) {
            return String.format("Recommend shipping the treatment configuration: %d metrics improved significantly, %d declined significantly.", 
                               positiveSignals, negativeSignals);
        } else if (negativeSignals > positiveSignals) {
            return String.format("Recommend keeping the control configuration: %d metrics declined significantly, %d improved significantly.", 
                               negativeSignals, positiveSignals);
        } else {
            return "The results are mixed. Recommend further analysis of which user segments benefit more from the treatment.";
        }
    }
}

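Multivariate Testing: Testing Several Dimensions at Once

Real optimizations usually involve several variables (chunking strategy + number of retrieved chunks + whether to use a reranker). A multivariate test lets you evaluate several combinations in the same experiment, at the cost of splitting traffic across more groups.

import org.springframework.stereotype.Service;

import java.util.Map;

@Service
public class MultiVariateExperiment {

    /**
     * Configurations for a four-group multivariate test
     */
    public static final Map<String, ExperimentConfig.RagConfig> EXPERIMENT_GROUPS = Map.of(
        "A_baseline", ragConfig(5, false, 3, false),
        "B_more_retrieval", ragConfig(10, false, 3, false),
        "C_reranker", ragConfig(5, true, 3, false),
        "D_full_stack", ragConfig(10, true, 5, true)
    );
    
    /**
     * Builds one group's configuration; every other field keeps its RagConfig default
     */
    private static ExperimentConfig.RagConfig ragConfig(int topK, boolean rerankerEnabled,
                                                        int rerankerTopK, boolean hydeEnabled) {
        ExperimentConfig.RagConfig config = new ExperimentConfig.RagConfig();
        config.setTopK(topK);
        config.setRerankerEnabled(rerankerEnabled);
        config.setRerankerTopK(rerankerTopK);
        config.setHydeEnabled(hydeEnabled);
        return config;
    }
    
    /**
     * Assigns users evenly to the 4 groups by user ID
     */
    public String assignGroup(String userId) {
        // floorMod never returns a negative index, whatever the sign of hashCode()
        int index = Math.floorMod(userId.hashCode(), 4);
        String[] groups = {"A_baseline", "B_more_retrieval", "C_reranker", "D_full_stack"};
        return groups[index];
    }
}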

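Experiment Dashboard: Monitoring Experiment Data in Real Time

import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/v1/experiments")
@RequiredArgsConstructor  // constructor injection for the final fields below
public class ExperimentDashboardController {

    private final ExperimentAnalysisService analysisService;
    private final ExperimentRepository experimentRepo;

    /**
     * Returns the real-time metrics of an experiment
     */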
    @GetMapping("/{experimentId}/metrics")
    public ResponseEntity<ExperimentMetrics> getMetrics(
            @PathVariable String experimentId,
            @RequestParam(defaultValue = "24h") String timeRange) {
        
        ExperimentMetrics metrics = analysisService.getRealtimeMetrics(experimentId, timeRange);
        return ResponseEntity.ok(metrics);
    }
    
    /**
     * Returns the full analysis report
     */
    @GetMapping("/{experimentId}/report")
    public ResponseEntity<ExperimentReport> getReport(@PathVariable String experimentId) {
        ExperimentReport report = analysisService.generateReport(experimentId);
        return ResponseEntity.ok(report);
    }
    
    /**
     * Stops the experiment (ship the treatment or roll back)
     */
    @PostMapping("/{experimentId}/stop")
    public ResponseEntity<Void> stopExperiment(
            @PathVariable String experimentId,
            @RequestParam String decision,  // "ship_treatment" or "keep_control"
            @RequestParam String reason) {
        
        experimentRepo.updateStatus(experimentId, 
            ExperimentConfig.ExperimentStatus.COMPLETED,
            decision, reason);
        
        return ResponseEntity.ok().build();
    }
}