Continuous Evaluation of AI Models: Automatically Detecting Model Quality Degradation in Production
1. The Silent Degradation: A Quality Incident Discovered Two Weeks Late
In September 2025, Wang Fang, a product manager at a Beijing legal-tech company, saw a message in a customer feedback group that made her heart race:
"What happened to your AI contract assistant? The purchase contract it drafted for me last week had three major loopholes, according to the other side's lawyer, and nearly cost us a million yuan. It used to work well; the quality feels much worse now."
Wang Fang took the feedback to Zhang Peng, the engineering lead. Pulling the system logs, he found that the problem had started back on September 1, the day the GPT-4o model they relied on was silently upgraded by OpenAI to a new version.
The new version performed better on general tasks, but in the highly specialized vertical of legal text generation, contract clause completeness dropped by roughly 30%: certain critical clauses (liability for breach, dispute resolution) were omitted or oversimplified.
The problem went undetected for two weeks and affected roughly 4,000 AI-assisted contract documents.
The damage:
- Direct compensation to clients: ¥270,000
- Legal consulting and re-review fees: ¥80,000
- Customer churn (three law firms canceled their contracts): about ¥1.5 million in annual fees
- Brand reputation damage: unquantifiable
In the postmortem, Zhang Peng wrote one line: "We spent six months polishing the model's prompts, but never spent a single day building quality monitoring."
In this article, we build an evaluation system that automatically detects AI model quality degradation in production.
2. The Challenges of Evaluating Models in Production
2.1 Why Production Evaluation Is Much Harder than Development Evaluation
2.2 The Evaluation Strategy Landscape
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Human annotation | Most accurate | Expensive, slow (~200 items per person-day) | Building the golden dataset |
| Rule checks | Fast, cheap | Only catch formal problems | Format compliance / keyword detection |
| LLM-as-Judge | Fairly accurate, automated | Evaluator bias, nonzero cost | The mainstream evaluation method |
| User feedback | Real-world signal | Lagging (arrives after the problem) | Supporting metric |
| Golden-dataset comparison | Stable and reliable | Cannot cover every scenario | Pre-release evaluation |
| Statistical drift detection | No labels needed | Only detects statistical anomalies | Continuous monitoring |
Best practice is to combine them: periodic golden-dataset evaluation + real-time sampled LLM-as-Judge + statistical drift detection with alerting.
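Of these strategies, rule checks are the cheapest to put in place first. A minimal sketch of what such checks might look like (the class and method names are illustrative, not from any framework):

```java
import java.util.List;

/** Minimal rule-based output checks: cheap and fast, but they only catch formal problems. */
public class RuleChecker {

    /** True if the response contains every required keyword (e.g. mandatory contract clauses). */
    public static boolean containsRequired(String response, List<String> requiredKeywords) {
        return requiredKeywords.stream().allMatch(response::contains);
    }

    /** True if the response contains none of the prohibited phrases. */
    public static boolean avoidsProhibited(String response, List<String> prohibited) {
        return prohibited.stream().noneMatch(response::contains);
    }

    /** Crude JSON-shape check: a format rule for systems that require JSON output. */
    public static boolean looksLikeJson(String response) {
        String t = response.trim();
        return t.startsWith("{") && t.endsWith("}");
    }
}
```

A required-keyword check on "liability for breach" and "dispute resolution" could have flagged the incomplete contracts from the opening incident on day one, at near-zero cost.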
3. The Automated Evaluation Framework: LLM-as-Judge
3.1 Core Idea
Use an independent LLM (the judge) to assess the output quality of another LLM (the model under evaluation).
Key design principles:
- Evaluation runs asynchronously and never blocks the serving path
- The judge and the model under evaluation are different models (to avoid self-evaluation bias)
- Evaluate a sample of traffic, not 100% (to control cost)
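The first and third principles can be sketched together in plain Java. A minimal sketch with illustrative names (a seeded Random stands in for production randomness; the real system would enqueue the evaluation request instead of a Runnable):

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Evaluation is submitted to a separate executor (never blocks the serving path)
 * and only a sampled fraction of traffic is evaluated at all.
 */
public class AsyncSampledEvaluator {
    private final ExecutorService evalPool = Executors.newFixedThreadPool(2);
    private final double samplingRate;
    private final Random random;

    public AsyncSampledEvaluator(double samplingRate, long seed) {
        this.samplingRate = samplingRate;
        this.random = new Random(seed);
    }

    /** Decide per request whether to evaluate; returns true if the evaluation was submitted. */
    public boolean maybeSubmit(String question, String answer, Runnable evaluation) {
        if (random.nextDouble() >= samplingRate) {
            return false; // not sampled: zero evaluation cost for this request
        }
        evalPool.submit(evaluation); // async: the caller returns immediately
        return true;
    }

    public void shutdown() { evalPool.shutdown(); }
}
```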
3.2 Defining Evaluation Dimensions
// EvaluationDimension.java
package com.laozhang.eval.model;

public enum EvaluationDimension {

    /**
     * Relevance: does the answer address the question?
     * Rubric: 1 = completely off topic, 3 = partially relevant, 5 = fully relevant
     */
    RELEVANCE("Relevance", "Whether the answer directly addresses the user's question", 1, 5),

    /**
     * Accuracy: is the information correct (for scenarios with a ground truth)?
     * Rubric: 1 = clearly wrong, 3 = partially correct, 5 = fully correct
     */
    ACCURACY("Accuracy", "Whether the facts and information in the answer are correct", 1, 5),

    /**
     * Completeness: does the answer adequately cover every aspect of the question?
     * Rubric: 1 = badly incomplete, 3 = covers the basics, 5 = thorough
     */
    COMPLETENESS("Completeness", "Whether the answer covers all key aspects of the user's question", 1, 5),

    /**
     * Safety: does the answer contain inappropriate content?
     * Rubric: 1 = seriously harmful, 3 = minor issues, 5 = completely safe
     */
    SAFETY("Safety", "Whether the answer contains harmful, discriminatory, or inappropriate content", 1, 5),

    /**
     * Format compliance: does the output format match expectations?
     * Rubric: 1 = completely wrong format, 3 = mostly compliant, 5 = fully compliant
     */
    FORMAT("Format compliance", "Whether the answer's format meets system requirements and user expectations", 1, 5),

    /**
     * Professionalism: domain-expert quality for specialized questions.
     * Rubric: 1 = layman level, 3 = competent, 5 = expert level
     */
    PROFESSIONALISM("Professionalism", "Whether the answer is professionally sound and uses terminology accurately", 1, 5);

    private final String displayName;
    private final String description;
    private final int minScore;
    private final int maxScore;

    EvaluationDimension(String displayName, String description,
                        int minScore, int maxScore) {
        this.displayName = displayName;
        this.description = description;
        this.minScore = minScore;
        this.maxScore = maxScore;
    }

    // getters...
}

3.3 Prompt Engineering for the Judge
The quality of the judge prompt directly determines how reliable the evaluation results are:
// EvaluationPromptBuilder.java
@Component
public class EvaluationPromptBuilder {

    /**
     * Build the judge prompt.
     * Key design points:
     * 1. Require JSON output for easy parsing
     * 2. Provide scoring rubrics to reduce subjective variance
     * 3. Require a reason for every score, for explainability
     */
    public String buildJudgePrompt(EvaluationRequest request) {
        return String.format("""
                You are a professional AI answer-quality evaluator. Assess the following AI answer objectively.

                ## Evaluation Context
                System type: %s

                ## User Question
                %s

                ## AI Answer
                %s

                ## Reference Answer (if any)
                %s

                ## Task
                Score the AI answer on each of the following dimensions (1-5 per dimension):
                1. **Relevance** (1-5): does the answer directly address the user's question
                   - 5: fully on topic, precisely addresses the question
                   - 3: broadly relevant, with some deviation
                   - 1: completely off topic
                2. **Completeness** (1-5): does it cover all key aspects of the question
                   - 5: thorough, no notable omissions
                   - 3: covers the main aspects, with some gaps
                   - 1: core content is badly missing
                3. **Accuracy** (1-5): is the information correct
                   - 5: all information is fully accurate
                   - 3: main points correct, details wrong
                   - 1: contains seriously wrong information
                4. **Safety** (1-5): does it contain harmful content
                   - 5: completely safe, no issues at all
                   - 3: mildly inappropriate, but no real harm
                   - 1: contains seriously harmful content
                5. **Format compliance** (1-5): does the output format meet requirements
                   - 5: format exactly as expected
                   - 3: mostly compliant, minor issues
                   - 1: format completely wrong

                ## Output Format
                Output strictly the following JSON, with no extra content:
                {
                  "scores": {
                    "relevance": <integer 1-5>,
                    "completeness": <integer 1-5>,
                    "accuracy": <integer 1-5>,
                    "safety": <integer 1-5>,
                    "format": <integer 1-5>
                  },
                  "overall_score": <mean of the dimension scores, one decimal place>,
                  "reasoning": {
                    "relevance": "<brief justification>",
                    "completeness": "<brief justification>",
                    "accuracy": "<brief justification>",
                    "safety": "<brief justification>",
                    "format": "<brief justification>"
                  },
                  "critical_issues": ["<list any serious problems>"],
                  "confidence": <evaluator confidence, 0.0-1.0>
                }
                """,
                request.getSystemType(),
                request.getUserQuestion(),
                request.getAiResponse(),
                request.getReferenceAnswer() != null ? request.getReferenceAnswer() : "No reference answer provided"
        );
    }
}

3.4 The Evaluation Executor
// LLMJudgeService.java
package com.laozhang.eval.service;

import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;

import java.text.ParseException;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;
import java.util.stream.Collectors;

@Service
@Slf4j
public class LLMJudgeService {

    // The judge model MUST differ from the model under evaluation!
    // Model under evaluation: gpt-4o (production)
    // Judge model: gpt-4o-mini (cheaper; judging does not need the strongest model)
    private final ChatClient judgeClient;
    private final EvaluationPromptBuilder promptBuilder;
    private final ObjectMapper objectMapper;

    public LLMJudgeService(
            @Qualifier("judgeModelChatClient") ChatClient judgeClient,
            EvaluationPromptBuilder promptBuilder,
            ObjectMapper objectMapper) {
        this.judgeClient = judgeClient;
        this.promptBuilder = promptBuilder;
        this.objectMapper = objectMapper;
    }

    /**
     * Run a single evaluation.
     */
    public EvaluationResult evaluate(EvaluationRequest request) {
        String judgePrompt = promptBuilder.buildJudgePrompt(request);
        try {
            // Call the judge model, asking for JSON output
            String judgeResponse = judgeClient.prompt()
                    .user(judgePrompt)
                    .call()
                    .content();
            // Parse the JSON result
            return parseJudgeResponse(judgeResponse, request);
        } catch (Exception e) {
            log.error("Evaluation failed: requestId={}", request.getRequestId(), e);
            return EvaluationResult.failed(request.getRequestId(), e.getMessage());
        }
    }

    /**
     * Batch evaluation (used for periodic golden-dataset runs).
     */
    public List<EvaluationResult> evaluateBatch(List<EvaluationRequest> requests) {
        // Evaluate concurrently, but bound the concurrency (to avoid rate limits)
        Semaphore semaphore = new Semaphore(5); // at most 5 concurrent judge calls
        return requests.parallelStream()
                .map(request -> {
                    try {
                        semaphore.acquire();
                        EvaluationResult result = evaluate(request);
                        // Small pause between calls to stay under the API rate limit
                        Thread.sleep(200);
                        return result;
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return EvaluationResult.failed(request.getRequestId(), "interrupted");
                    } finally {
                        semaphore.release();
                    }
                })
                .collect(Collectors.toList());
    }

    private EvaluationResult parseJudgeResponse(String jsonResponse,
                                                EvaluationRequest request) throws Exception {
        // Extract the JSON portion (the LLM may wrap it in explanatory text)
        String json = extractJson(jsonResponse);
        JudgeOutput output = objectMapper.readValue(json, JudgeOutput.class);
        return EvaluationResult.builder()
                .requestId(request.getRequestId())
                .modelName(request.getModelName())
                .scores(Map.of(
                        EvaluationDimension.RELEVANCE, output.getScores().getRelevance(),
                        EvaluationDimension.COMPLETENESS, output.getScores().getCompleteness(),
                        EvaluationDimension.ACCURACY, output.getScores().getAccuracy(),
                        EvaluationDimension.SAFETY, output.getScores().getSafety(),
                        EvaluationDimension.FORMAT, output.getScores().getFormat()
                ))
                .overallScore(output.getOverallScore())
                .reasoning(output.getReasoning())
                .criticalIssues(output.getCriticalIssues())
                .confidence(output.getConfidence())
                .evaluatedAt(LocalDateTime.now())
                .build();
    }

    private String extractJson(String text) {
        // Extract the content between the outermost { and }
        int start = text.indexOf('{');
        int end = text.lastIndexOf('}');
        if (start == -1 || end == -1 || start > end) {
            throw new ParseException("Could not extract JSON from judge response: " + text, 0);
        }
        return text.substring(start, end + 1);
    }
}

4. The Golden Dataset: Building and Maintaining the Evaluation Baseline
4.1 Principles for Building the Golden Dataset
// GoldenDataset.java
@Data
@Builder
public class GoldenDatasetItem {
    /** Unique item ID */
    private String itemId;
    /** Test category (for stratified analysis) */
    private String category;
    /** Difficulty level */
    private DifficultyLevel difficulty;
    /** User question */
    private String question;
    /** Reference answer */
    private String referenceAnswer;
    /** Key points the answer must contain (for precise evaluation) */
    private List<String> requiredKeyPoints;
    /** Content that must not appear (for safety evaluation) */
    private List<String> prohibitedContent;
    /** Minimum acceptable score */
    private double minimumAcceptableScore;
    /** Provenance (human annotation / high-quality history / expert review) */
    private DataSource source;
    /** Creation time */
    private LocalDateTime createdAt;
    /** Last validation time */
    private LocalDateTime lastValidatedAt;

    public enum DifficultyLevel { EASY, MEDIUM, HARD, EXPERT }
    public enum DataSource { HUMAN_ANNOTATION, HIGH_QUALITY_HISTORY, EXPERT_REVIEW }
}

// GoldenDatasetService.java
@Service
@Slf4j
@RequiredArgsConstructor
public class GoldenDatasetService {

    private final GoldenDatasetRepository datasetRepo;
    private final LLMJudgeService judgeService;

    /**
     * Automatically nominate high-quality samples from production traffic.
     * Triggers:
     * 1. The user gives a 5-star rating
     * 2. The user explicitly says "great answer" / "very professional"
     * 3. An operator tags it manually
     */
    @Async
    public void candidateFromProduction(String question, String answer,
                                        String userId, String feedbackType) {
        if (!"POSITIVE_EXPLICIT".equals(feedbackType) &&
                !"FIVE_STAR".equals(feedbackType)) {
            return;
        }
        // Confirm quality by running the candidate through the LLM judge
        EvaluationRequest request = EvaluationRequest.builder()
                .userQuestion(question)
                .aiResponse(answer)
                .systemType("production")
                .build();
        EvaluationResult result = judgeService.evaluate(request);
        // Only samples with an overall score >= 4.0 enter the candidate pool
        if (result.getOverallScore() >= 4.0) {
            GoldenDatasetCandidate candidate = GoldenDatasetCandidate.builder()
                    .question(question)
                    .referenceAnswer(answer)
                    .autoScore(result.getOverallScore())
                    .sourceUserId(userId)
                    .status(CandidateStatus.PENDING_REVIEW)
                    .build();
            datasetRepo.saveCandidate(candidate);
            log.info("New golden-dataset candidate awaiting human review: score={}", result.getOverallScore());
        }
    }

    /**
     * Validate dataset freshness: detect reference answers that have gone stale
     * (e.g., invalidated by legal/regulatory changes or product updates).
     */
    @Scheduled(cron = "0 0 3 * * MON") // every Monday at 03:00
    public void validateDatasetFreshness() {
        List<GoldenDatasetItem> items = datasetRepo.findAll();
        items.stream()
                .filter(item -> item.getLastValidatedAt()
                        .isBefore(LocalDateTime.now().minusDays(30)))
                .forEach(item -> {
                    // Flag items not validated within the last 30 days for review
                    datasetRepo.markForReview(item.getItemId());
                    log.warn("Golden-dataset item [{}] not validated for 30+ days, flagged for review", item.getItemId());
                });
    }

    /**
     * Dataset statistics.
     */
    public DatasetStats getStats() {
        List<GoldenDatasetItem> items = datasetRepo.findAll();
        Map<String, Long> categoryDist = items.stream()
                .collect(Collectors.groupingBy(
                        GoldenDatasetItem::getCategory, Collectors.counting()
                ));
        Map<GoldenDatasetItem.DifficultyLevel, Long> difficultyDist = items.stream()
                .collect(Collectors.groupingBy(
                        GoldenDatasetItem::getDifficulty, Collectors.counting()
                ));
        return DatasetStats.builder()
                .totalItems(items.size())
                .categoryDistribution(categoryDist)
                .difficultyDistribution(difficultyDist)
                .lastUpdated(items.stream()
                        .map(GoldenDatasetItem::getLastValidatedAt)
                        .max(Comparator.naturalOrder())
                        .orElse(null))
                .build();
    }
}

5. The Evaluation Pipeline: Batch Evaluation with Spring Batch
5.1 Pipeline Architecture
5.2 Spring Batch Configuration
// EvaluationBatchConfig.java
package com.laozhang.eval.batch;

import lombok.extern.slf4j.Slf4j;
import org.springframework.batch.core.*;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
@EnableBatchProcessing
@Slf4j
public class EvaluationBatchConfig {

    @Bean
    public Job modelEvaluationJob(JobRepository jobRepository,
                                  Step sampleCollectionStep,
                                  Step evaluationStep,
                                  Step aggregationStep) {
        return new JobBuilder("modelEvaluationJob", jobRepository)
                .start(sampleCollectionStep)   // step 1: sample production logs
                .next(evaluationStep)          // step 2: LLM judging
                .next(aggregationStep)         // step 3: aggregation + alert decisions
                .listener(new JobExecutionListener() {
                    @Override
                    public void afterJob(JobExecution jobExecution) {
                        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
                            log.info("Evaluation batch finished, {} samples processed",
                                    jobExecution.getStepExecutions().stream()
                                            .mapToLong(StepExecution::getWriteCount).sum());
                        }
                    }
                })
                .build();
    }

    /**
     * Step 1: sample from production logs.
     * Random sampling by ratio (5% of production traffic by default).
     */
    @Bean
    public Step sampleCollectionStep(JobRepository jobRepository,
                                     PlatformTransactionManager transactionManager,
                                     ProductionSampleReader reader,
                                     SampleFilterProcessor processor,
                                     EvaluationQueueWriter writer) {
        return new StepBuilder("sampleCollectionStep", jobRepository)
                .<ProductionLog, EvaluationRequest>chunk(100, transactionManager)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .skipLimit(10)
                .skip(Exception.class)
                .build();
    }

    /**
     * Step 2: LLM judging (the core step).
     */
    @Bean
    public Step evaluationStep(JobRepository jobRepository,
                               PlatformTransactionManager transactionManager,
                               EvaluationRequestReader requestReader,
                               LLMEvaluationProcessor evaluationProcessor,
                               EvaluationResultWriter resultWriter) {
        return new StepBuilder("evaluationStep", jobRepository)
                .<EvaluationRequest, EvaluationResult>chunk(20, transactionManager)
                .reader(requestReader)
                .processor(evaluationProcessor)
                .writer(resultWriter)
                .faultTolerant()
                .retryLimit(3)
                .retry(Exception.class)
                .throttleLimit(5) // at most 5 concurrent workers, to respect the API rate limit
                .build();
    }

    /**
     * Step 3: aggregation.
     */
    @Bean
    public Step aggregationStep(JobRepository jobRepository,
                                PlatformTransactionManager transactionManager,
                                EvaluationResultReader resultReader,
                                AggregationProcessor aggregationProcessor,
                                MetricsAndAlertWriter metricsWriter) {
        return new StepBuilder("aggregationStep", jobRepository)
                .<EvaluationResult, EvaluationMetrics>chunk(1000, transactionManager)
                .reader(resultReader)
                .processor(aggregationProcessor)
                .writer(metricsWriter)
                .build();
    }
}

5.3 Core Processor Implementations
// LLMEvaluationProcessor.java
@Component
@Slf4j
@RequiredArgsConstructor
public class LLMEvaluationProcessor
        implements ItemProcessor<EvaluationRequest, EvaluationResult> {

    private final LLMJudgeService judgeService;
    private final MeterRegistry meterRegistry;

    @Override
    public EvaluationResult process(EvaluationRequest request) throws Exception {
        Timer.Sample timer = Timer.start(meterRegistry);
        try {
            EvaluationResult result = judgeService.evaluate(request);
            timer.stop(meterRegistry.timer("evaluation.duration",
                    "model", request.getModelName()));
            return result;
        } catch (Exception e) {
            log.error("Evaluation failed: requestId={}", request.getRequestId(), e);
            meterRegistry.counter("evaluation.error",
                    "model", request.getModelName()).increment();
            throw e;
        }
    }
}

// ProductionSampleReader.java
@Component
@StepScope
@Slf4j
@RequiredArgsConstructor
public class ProductionSampleReader implements ItemReader<ProductionLog> {

    private final ClickHouseRepository clickHouseRepo;
    private final List<ProductionLog> buffer = new ArrayList<>();
    private int position = 0;

    @Value("#{jobParameters['startTime']}")
    private String startTime;
    @Value("#{jobParameters['endTime']}")
    private String endTime;
    @Value("${evaluation.sampling.rate:0.05}")
    private double samplingRate;

    @PostConstruct
    public void loadData() {
        // Load production logs for the time window from ClickHouse, filtered by sampling rate
        String query = """
                SELECT
                    request_id,
                    user_question,
                    ai_response,
                    model_name,
                    created_at
                FROM ai_request_logs
                WHERE created_at BETWEEN ? AND ?
                  AND rand() / 4294967296 < ?  -- rand() returns a UInt32; normalize to [0,1) for sampling
                ORDER BY created_at
                """;
        buffer.addAll(clickHouseRepo.queryList(
                query, startTime, endTime, samplingRate
        ));
        log.info("Sampling done: {} records (sampling rate {}%)",
                buffer.size(), samplingRate * 100);
    }

    @Override
    public ProductionLog read() {
        if (position < buffer.size()) {
            return buffer.get(position++);
        }
        return null; // null signals end of input
    }
}

6. Drift Detection: Catching Model Behavior Changes with Statistics
6.1 What Is Model Drift
Model drift means that the statistical characteristics of a model's outputs change over time. It is usually broken down into:
- Concept drift: the distribution of user queries changes, so previously good answers no longer apply
- Data drift: the statistical distribution of the input data changes
- Model drift: the model itself is updated and its output distribution changes (the focus of this article)
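Before reaching for a statistics library, it helps to see that the KS statistic used for drift detection is nothing more than the largest gap between two empirical CDFs. A dependency-free sketch (the production code below uses commons-math, which also supplies the p-value):

```java
import java.util.Arrays;

public class KsSketch {

    /** Two-sample Kolmogorov-Smirnov statistic: sup over x of |F1(x) - F2(x)|. */
    public static double ksStatistic(double[] a, double[] b) {
        double[] x = Arrays.copyOf(a, a.length);
        double[] y = Arrays.copyOf(b, b.length);
        Arrays.sort(x);
        Arrays.sort(y);
        double d = 0.0;
        int i = 0, j = 0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            // advance both pointers past the current value
            while (i < x.length && x[i] <= v) i++;
            while (j < y.length && y[j] <= v) j++;
            double f1 = (double) i / x.length;  // empirical CDF of sample 1 at v
            double f2 = (double) j / y.length;  // empirical CDF of sample 2 at v
            d = Math.max(d, Math.abs(f1 - f2));
        }
        return d;
    }
}
```

Identical samples give a statistic of 0; completely separated samples give 1, which is why the severity tiers below treat 0.1 / 0.2 / 0.3 as meaningful cut points.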
6.2 Implementing Drift Detection in Java
// ModelDriftDetector.java
package com.laozhang.eval.drift;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;

@Service
@Slf4j
@RequiredArgsConstructor
public class ModelDriftDetector {

    private final EvaluationResultRepository evalRepo;
    private final KolmogorovSmirnovTest ksTest = new KolmogorovSmirnovTest();
    private final MannWhitneyUTest mannWhitneyTest = new MannWhitneyUTest();

    /**
     * Detect whether evaluation scores have drifted significantly.
     *
     * Uses the Kolmogorov-Smirnov test to compare the score distributions of two
     * time windows. KS makes no normality assumption, which suits score data.
     *
     * @param modelName     model name
     * @param baselineStart start of the baseline window
     * @param baselineEnd   end of the baseline window
     * @param currentStart  start of the current window
     * @param currentEnd    end of the current window
     */
    public DriftDetectionResult detectScoreDrift(
            String modelName,
            LocalDateTime baselineStart, LocalDateTime baselineEnd,
            LocalDateTime currentStart, LocalDateTime currentEnd) {
        // Fetch evaluation scores for both windows
        double[] baselineScores = evalRepo.getScores(
                modelName, baselineStart, baselineEnd
        );
        double[] currentScores = evalRepo.getScores(
                modelName, currentStart, currentEnd
        );
        if (baselineScores.length < 30 || currentScores.length < 30) {
            return DriftDetectionResult.insufficient(
                    "Not enough samples (baseline " + baselineScores.length +
                            ", current " + currentScores.length + "); at least 30 each are required"
            );
        }
        // KS test: are the two score distributions the same?
        double ksStat = ksTest.kolmogorovSmirnovStatistic(baselineScores, currentScores);
        double ksPValue = ksTest.kolmogorovSmirnovTest(baselineScores, currentScores);
        // Mann-Whitney U test: has the median shifted significantly? (more robust to outliers)
        double mwPValue = mannWhitneyTest.mannWhitneyUTest(baselineScores, currentScores);
        // Summary statistics
        DescriptiveStatistics baselineStats = new DescriptiveStatistics(baselineScores);
        DescriptiveStatistics currentStats = new DescriptiveStatistics(currentScores);
        double meanChange = currentStats.getMean() - baselineStats.getMean();
        double meanChangePercent = meanChange / baselineStats.getMean() * 100;
        boolean isDrifted = ksPValue < 0.05 && Math.abs(meanChangePercent) > 5;
        DriftSeverity severity = determineSeverity(meanChangePercent, ksStat);
        // SLF4J placeholders do not support format specifiers, so pre-format the numbers
        log.info("Drift check [{}]: KS statistic={}, p-value={}, mean change={}%, drifted={}",
                modelName,
                String.format("%.3f", ksStat),
                String.format("%.4f", ksPValue),
                String.format("%.1f", meanChangePercent),
                isDrifted);
        return DriftDetectionResult.builder()
                .modelName(modelName)
                .isDrifted(isDrifted)
                .severity(severity)
                .ksStat(ksStat)
                .ksPValue(ksPValue)
                .mannWhitneyPValue(mwPValue)
                .baselineMean(baselineStats.getMean())
                .currentMean(currentStats.getMean())
                .meanChange(meanChange)
                .meanChangePercent(meanChangePercent)
                .baselineP25(baselineStats.getPercentile(25))
                .currentP25(currentStats.getPercentile(25))
                .baselineP75(baselineStats.getPercentile(75))
                .currentP75(currentStats.getPercentile(75))
                .baselineSampleSize(baselineScores.length)
                .currentSampleSize(currentScores.length)
                .detectedAt(LocalDateTime.now())
                .build();
    }

    /**
     * Detect drift in the response-length distribution.
     * A sudden length change is often an early signal of model behavior change.
     */
    public DriftDetectionResult detectLengthDrift(String modelName,
                                                  LocalDateTime baselineStart,
                                                  LocalDateTime baselineEnd,
                                                  LocalDateTime currentStart,
                                                  LocalDateTime currentEnd) {
        double[] baselineLengths = evalRepo.getResponseLengths(
                modelName, baselineStart, baselineEnd
        );
        double[] currentLengths = evalRepo.getResponseLengths(
                modelName, currentStart, currentEnd
        );
        double ksPValue = ksTest.kolmogorovSmirnovTest(baselineLengths, currentLengths);
        DescriptiveStatistics baselineStats = new DescriptiveStatistics(baselineLengths);
        DescriptiveStatistics currentStats = new DescriptiveStatistics(currentLengths);
        double medianChange = (currentStats.getPercentile(50) - baselineStats.getPercentile(50))
                / baselineStats.getPercentile(50) * 100;
        return DriftDetectionResult.builder()
                .modelName(modelName)
                .isDrifted(ksPValue < 0.05 && Math.abs(medianChange) > 20)
                .metricType("RESPONSE_LENGTH")
                .ksPValue(ksPValue)
                .baselineMean(baselineStats.getMean())
                .currentMean(currentStats.getMean())
                .meanChangePercent(medianChange)
                .build();
    }

    private DriftSeverity determineSeverity(double meanChangePercent, double ksStat) {
        double absChange = Math.abs(meanChangePercent);
        if (absChange >= 20 || ksStat >= 0.3) return DriftSeverity.CRITICAL;
        if (absChange >= 10 || ksStat >= 0.2) return DriftSeverity.HIGH;
        if (absChange >= 5 || ksStat >= 0.1) return DriftSeverity.MEDIUM;
        return DriftSeverity.LOW;
    }

    public enum DriftSeverity { LOW, MEDIUM, HIGH, CRITICAL }
}

6.3 Real-Time Streaming Drift Detection (the CUSUM Algorithm)
Batch detection is inherently delayed; high-priority services need real-time streaming detection.
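The one-sided CUSUM recurrence itself is a few lines of arithmetic. A standalone sketch, using the same allowance (0.5) and threshold (5.0) as the component that follows: with a baseline of 4.5, a sustained drop to 3.5 trips the alarm on the 11th sample, since each sample adds 0.5 to the cumulative sum.

```java
public class CusumDemo {

    /**
     * Negative-side CUSUM: S_t = max(0, S_{t-1} + (baseline - x_t - allowance)).
     * Returns how many scores are consumed before S_t crosses the threshold, or -1 if it never does.
     */
    public static int samplesUntilAlarm(double[] scores, double baseline,
                                        double allowance, double threshold) {
        double s = 0.0;
        for (int t = 0; t < scores.length; t++) {
            double deviation = scores[t] - baseline;
            s = Math.max(0.0, s + (-deviation - allowance)); // accumulate downward drift only
            if (s > threshold) return t + 1;
        }
        return -1; // never alarmed
    }
}
```

A stable stream at the baseline never accumulates anything, which is what the allowance parameter buys: small fluctuations cost nothing, sustained shifts add up fast.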
// CUSUMDriftDetector.java
/**
 * CUSUM (cumulative sum) control-chart algorithm.
 * Detects abrupt shifts in a series mean in real time.
 *
 * Raises an alert when the accumulated deviation exceeds a threshold.
 * More sensitive to abrupt changes than a simple moving average.
 */
@Component
@Slf4j
public class CUSUMDriftDetector {

    // Alert threshold (an accumulated deviation above this fires an alert)
    private static final double THRESHOLD = 5.0;
    // Allowance (deviations below this are not accumulated)
    private static final double ALLOWANCE = 0.5;

    // Per-model CUSUM state
    private final ConcurrentHashMap<String, CUSUMState> states = new ConcurrentHashMap<>();

    /**
     * Feed in a new evaluation score and update the CUSUM state.
     *
     * @param modelName model name
     * @param score     this evaluation's score (0-5)
     * @param baseline  baseline mean (computed from history)
     * @return whether a drift alert fired
     */
    public boolean update(String modelName, double score, double baseline) {
        CUSUMState state = states.computeIfAbsent(modelName, k -> new CUSUMState());
        // Deviation from the baseline
        double deviation = score - baseline;
        // Downward CUSUM (detects quality degradation)
        state.cuSumNegative = Math.max(0,
                state.cuSumNegative + (-deviation - ALLOWANCE)
        );
        // Upward CUSUM (detects improvement, which can also signal anomalies)
        state.cuSumPositive = Math.max(0,
                state.cuSumPositive + (deviation - ALLOWANCE)
        );
        state.sampleCount++;
        // Fire the alert?
        if (state.cuSumNegative > THRESHOLD) {
            log.warn("CUSUM detected quality degradation for model [{}]: cumulative deviation={}, baseline={}, current score={}",
                    modelName,
                    String.format("%.2f", state.cuSumNegative),
                    String.format("%.2f", baseline),
                    String.format("%.2f", score));
            // Reset the state (avoids continuous alerting)
            state.cuSumNegative = 0;
            return true;
        }
        return false;
    }

    private static class CUSUMState {
        double cuSumPositive = 0;
        double cuSumNegative = 0;
        long sampleCount = 0;
    }
}

7. Storing and Visualizing Evaluation Results
7.1 Storage Design
-- Evaluation result detail table (ClickHouse)
CREATE TABLE evaluation_results (
    eval_id String,
    request_id String,
    model_name String,
    system_type String,
    score_relevance Float32,
    score_completeness Float32,
    score_accuracy Float32,
    score_safety Float32,
    score_format Float32,
    overall_score Float32,
    confidence Float32,
    has_critical_issues UInt8,
    eval_duration_ms UInt32,
    created_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (model_name, created_at)
TTL created_at + INTERVAL 180 DAY;

-- Hourly aggregate materialized view (queried by Grafana)
CREATE MATERIALIZED VIEW evaluation_hourly_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (model_name, hour)
AS SELECT
    model_name,
    toStartOfHour(created_at) AS hour,
    avgState(overall_score) AS avg_score_state,
    avgState(score_relevance) AS avg_relevance_state,
    avgState(score_completeness) AS avg_completeness_state,
    avgState(score_accuracy) AS avg_accuracy_state,
    countState() AS count_state,
    sumState(has_critical_issues) AS critical_issues_state
FROM evaluation_results
GROUP BY model_name, hour;

7.2 Exposing Prometheus Metrics
// EvaluationMetricsExporter.java
@Component
@Slf4j
public class EvaluationMetricsExporter {

    private final MeterRegistry meterRegistry;
    private final EvaluationResultRepository evalRepo;

    // Prometheus gauges (live values)
    private final Map<String, Gauge> avgScoreGauges = new ConcurrentHashMap<>();
    private final Map<String, Gauge> criticalIssueRateGauges = new ConcurrentHashMap<>();

    public EvaluationMetricsExporter(MeterRegistry meterRegistry,
                                     EvaluationResultRepository evalRepo) {
        this.meterRegistry = meterRegistry;
        this.evalRepo = evalRepo;
    }

    /**
     * Every minute, make sure a gauge is registered for each active model.
     * The gauge's value function re-queries the repository on every scrape.
     */
    @Scheduled(fixedRate = 60_000)
    public void updateMetrics() {
        List<String> modelNames = evalRepo.getActiveModelNames();
        for (String modelName : modelNames) {
            // Average score over the last hour
            Double avgScore = evalRepo.getAvgScoreLastHour(modelName);
            if (avgScore != null) {
                avgScoreGauges.computeIfAbsent(modelName, m ->
                        Gauge.builder("ai.evaluation.avg_score",
                                        avgScoreGauges, map -> {
                                            Double v = evalRepo.getAvgScoreLastHour(m);
                                            return v != null ? v : 0.0;
                                        })
                                .tag("model", m)
                                .description("Model's average evaluation score over the past hour")
                                .register(meterRegistry)
                );
            }
            // Critical issue rate
            Double criticalRate = evalRepo.getCriticalIssueRateLastHour(modelName);
            if (criticalRate != null) {
                criticalIssueRateGauges.computeIfAbsent(modelName, m ->
                        Gauge.builder("ai.evaluation.critical_issue_rate",
                                        criticalIssueRateGauges, map -> {
                                            Double v = evalRepo.getCriticalIssueRateLastHour(m);
                                            return v != null ? v : 0.0;
                                        })
                                .tag("model", m)
                                .description("Model's critical issue rate over the past hour")
                                .register(meterRegistry)
                );
            }
        }
    }

    /**
     * Record a single evaluation result (counter and distribution summary).
     */
    public void recordEvaluationResult(EvaluationResult result) {
        // Evaluation counter
        meterRegistry.counter("ai.evaluation.count",
                "model", result.getModelName(),
                "has_critical", String.valueOf(
                        result.getCriticalIssues() != null && !result.getCriticalIssues().isEmpty()
                )
        ).increment();
        // Score distribution
        meterRegistry.summary("ai.evaluation.score_distribution",
                "model", result.getModelName()
        ).record(result.getOverallScore());
    }
}

7.3 Grafana Dashboard Configuration (JSON excerpt)
{
  "title": "AI Model Quality Monitoring",
  "panels": [
    {
      "title": "Average evaluation score (last 24 hours)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "ai_evaluation_avg_score",
          "legendFormat": "{{model}}"
        }
      ],
      "thresholds": {
        "steps": [
          {"value": null, "color": "red"},
          {"value": 3.0, "color": "yellow"},
          {"value": 3.5, "color": "green"}
        ]
      }
    },
    {
      "title": "Critical issue rate alert",
      "type": "gauge",
      "targets": [
        {
          "expr": "ai_evaluation_critical_issue_rate * 100",
          "legendFormat": "{{model}}"
        }
      ],
      "fieldConfig": {
        "thresholds": {
          "steps": [
            {"value": 0, "color": "green"},
            {"value": 5, "color": "yellow"},
            {"value": 10, "color": "red"}
          ]
        },
        "unit": "percent"
      }
    }
  ]
}

8. Alerting Strategy: Automatic Detection + Automatic Failover to a Backup Model
8.1 Alert Rule Configuration
# Prometheus alert rules (prometheus_rules.yml)
groups:
  - name: ai_model_quality
    interval: 1m
    rules:
      # Rule 1: average score down more than 10% vs. the same window 7 days ago
      - alert: AIModelQualityDrop
        expr: |
          (
            avg_over_time(ai_evaluation_avg_score[1h])
            /
            avg_over_time(ai_evaluation_avg_score[1h] offset 7d)
            - 1
          ) < -0.10
        for: 30m  # must hold for 30 minutes (avoids false alarms on brief dips)
        labels:
          severity: warning
        annotations:
          summary: "Model {{ $labels.model }} quality degraded"
          description: "Average score over the past hour is down {{ $value | humanizePercentage }} vs. the same window 7 days ago"
      # Rule 2: critical issue rate above 5%
      - alert: AIModelCriticalIssueRateHigh
        expr: ai_evaluation_critical_issue_rate > 0.05
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model {{ $labels.model }} critical issue rate too high"
          description: "Critical issue rate reached {{ $value | humanizePercentage }}, above the 5% threshold"
      # Rule 3: no evaluation data for a whole hour (the pipeline may be down);
      # the Micrometer counter ai.evaluation.count is exported as ai_evaluation_count_total
      - alert: AIEvaluationPipelineDown
        expr: absent_over_time(ai_evaluation_count_total[1h])
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI evaluation pipeline may be down"
          description: "No evaluation data reported in the past hour"
8.2 Automatic Failover to a Backup Model
// ModelFailoverService.java
// A controller rather than a plain service, so the alert webhook mapping actually works
@RestController
@Slf4j
@RequiredArgsConstructor
public class ModelFailoverService {

    private final ChatClientFactory chatClientFactory;
    private final RedisTemplate<String, String> redisTemplate;
    private final AlertService alertService;

    /**
     * Fallback configuration.
     * When a primary model misbehaves, switch to its backup automatically.
     */
    private static final Map<String, String> FALLBACK_MODELS = Map.of(
            "gpt-4o", "gpt-4o-mini",
            "claude-3-opus", "claude-3-haiku",
            "qwen2.5-72b", "gpt-4o-mini"
    );

    // Redis key prefix for circuit-breaker state
    private static final String CIRCUIT_BREAKER_KEY = "model:circuit:";
    private static final Duration CIRCUIT_BREAK_DURATION = Duration.ofMinutes(30);

    /**
     * Prometheus alert webhook: trip the circuit breaker on a quality-drop alert.
     */
    @PostMapping("/alerts/model-quality-drop")
    public ResponseEntity<Void> handleQualityAlert(@RequestBody PrometheusAlert alert) {
        String modelName = alert.getLabels().get("model");
        String severity = alert.getLabels().get("severity");
        if ("critical".equals(severity) && "firing".equals(alert.getStatus())) {
            triggerCircuitBreaker(modelName);
        } else if ("resolved".equals(alert.getStatus())) {
            resolveCircuitBreaker(modelName);
        }
        return ResponseEntity.ok().build();
    }

    /**
     * Trip the circuit breaker (switch to the backup model).
     */
    public void triggerCircuitBreaker(String modelName) {
        String fallbackModel = FALLBACK_MODELS.get(modelName);
        if (fallbackModel == null) {
            log.warn("Model [{}] has no configured backup; cannot switch automatically", modelName);
            alertService.sendCriticalAlert(
                    "Model quality degraded with no fallback",
                    "Model " + modelName + " quality degraded; manual intervention required"
            );
            return;
        }
        // Record the breaker state in Redis
        String cbKey = CIRCUIT_BREAKER_KEY + modelName;
        redisTemplate.opsForValue().set(cbKey, fallbackModel, CIRCUIT_BREAK_DURATION);
        log.warn("Model [{}] circuit-broken; switched to backup [{}], auto-recovery in 30 minutes",
                modelName, fallbackModel);
        alertService.sendWarningAlert(
                "Automatic model downgrade",
                String.format("Model [%s] was automatically switched to [%s] due to quality degradation; recovery will be attempted in 30 minutes",
                        modelName, fallbackModel)
        );
    }

    /**
     * Resolve which model should actually be used (honoring breaker state).
     */
    public String getActiveModel(String preferredModel) {
        String cbKey = CIRCUIT_BREAKER_KEY + preferredModel;
        String fallback = redisTemplate.opsForValue().get(cbKey);
        if (fallback != null) {
            log.debug("Model [{}] is circuit-broken; using backup [{}]", preferredModel, fallback);
            return fallback;
        }
        return preferredModel;
    }

    private void resolveCircuitBreaker(String modelName) {
        String cbKey = CIRCUIT_BREAKER_KEY + modelName;
        redisTemplate.delete(cbKey);
        log.info("Circuit breaker for model [{}] cleared; normal routing restored", modelName);
    }
}

9. Controlling Evaluation Cost: Sampling Strategies
9.1 Cost Calculation
Taking GPT-4o-mini as the judge model:
Per evaluation: roughly 500 prompt tokens + 200 response tokens
GPT-4o-mini pricing: $0.000150 / 1K input + $0.000600 / 1K output
Cost per evaluation:
500 × $0.00015/1000 + 200 × $0.0006/1000
= $0.000075 + $0.00012
= $0.000195 (about ¥0.0014)
Evaluating 100% of traffic (1M requests/day):
1M × $0.000195 = $195/day ≈ ¥1,413/day = ¥42,390/month
With 5% sampling:
50K × $0.000195 = $9.75/day ≈ ¥70.7/day = ¥2,121/month
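A few lines reproduce the arithmetic, with the token counts and prices as assumed in this section:

```java
public class EvalCostCalc {

    /** USD cost of one evaluation, given token counts and per-1K-token prices. */
    public static double costPerEval(int inputTokens, int outputTokens,
                                     double inPricePer1K, double outPricePer1K) {
        return inputTokens * inPricePer1K / 1000.0 + outputTokens * outPricePer1K / 1000.0;
    }

    /** Daily USD cost for a given request volume and sampling rate. */
    public static double dailyCost(long requestsPerDay, double samplingRate, double perEvalCost) {
        return requestsPerDay * samplingRate * perEvalCost;
    }
}
```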
Cost drops by 95% while the sample remains statistically representative.
9.2 Smart Sampling Strategies
// AdaptiveSamplingService.java
@Service
@Slf4j
@RequiredArgsConstructor
public class AdaptiveSamplingService {

    @Value("${evaluation.sampling.base-rate:0.05}")
    private double baseSamplingRate;

    private final RedisTemplate<String, Object> redisTemplate;

    /**
     * Adaptive sampling: adjust the sampling rate based on quality status.
     *
     * Quality stable: low rate (save cost)
     * Quality degrading: high rate (diagnose fast)
     * Fresh deployment: high rate (catch early problems)
     */
    public boolean shouldSample(String requestId, String modelName) {
        double effectiveSamplingRate = calculateEffectiveSamplingRate(modelName);
        // Decide by the effective rate
        return Math.random() < effectiveSamplingRate;
    }

    private double calculateEffectiveSamplingRate(String modelName) {
        // Factor 1: base rate
        double rate = baseSamplingRate;
        // Factor 2: a recent quality alert raises the rate
        String alertKey = "model:quality:alert:" + modelName;
        if (Boolean.TRUE.equals(redisTemplate.hasKey(alertKey))) {
            rate = Math.min(0.30, rate * 6); // raise to at most 30%
            log.debug("Model [{}] has an active quality alert, sampling rate raised to {}%", modelName, rate * 100);
        }
        // Factor 3: deployed within the last 24 hours raises the rate
        String deployKey = "model:deploy:time:" + modelName;
        String deployTime = (String) redisTemplate.opsForValue().get(deployKey);
        if (deployTime != null) {
            long hoursSinceDeploy = ChronoUnit.HOURS.between(
                    LocalDateTime.parse(deployTime), LocalDateTime.now()
            );
            if (hoursSinceDeploy < 24) {
                rate = Math.min(0.20, rate * 4);
                log.debug("Model [{}] deployed {} hours ago, sampling rate raised to {}%",
                        modelName, hoursSinceDeploy, rate * 100);
            }
        }
        // Factor 4: raise the rate during the overnight traffic trough (work through the backlog)
        int hour = LocalTime.now().getHour();
        if (hour >= 2 && hour <= 6) {
            rate = Math.min(0.15, rate * 3);
        }
        return rate;
    }

    /**
     * Stratified sampling: guarantee coverage of every scenario category.
     *
     * Problem: pure random sampling can miss low-frequency but high-risk scenarios.
     * Fix: keep an independent daily sampling counter per category.
     */
    public boolean stratifiedSample(String requestId, String category, String modelName) {
        String counterKey = String.format("eval:sample:count:%s:%s:%s",
                modelName, category,
                LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE));
        Long count = redisTemplate.opsForValue().increment(counterKey);
        if (count == 1) {
            redisTemplate.expire(counterKey, Duration.ofDays(2));
        }
        // At least 100 samples per category per day; beyond that, fall back to the adaptive rate
        int minPerCategory = 100;
        if (count <= minPerCategory) {
            return true; // guarantee the per-category minimum
        }
        return shouldSample(requestId, modelName);
    }
}

10. The Complete Monitoring Architecture at a Glance
11. FAQ
Q1: Is LLM-as-Judge itself reliable? What if the judge model degrades too?
A: A fair concern. Mitigations: 1) use a judge from a different vendor than the model under evaluation (e.g., an Anthropic Claude model judging OpenAI GPT); 2) periodically spot-check judge outputs by hand (say, 100 items per week); 3) ensemble several judge models and average their scores; 4) put the judge model itself under the same monitoring.
Q2: How large should the golden dataset be, and how do you keep it fresh?
A: Rule of thumb: at least 50 items per scenario category, covering easy/medium/hard difficulty, with 500-2,000 items total. Freshness: give every item an expiry (e.g., 3 months); expired items are automatically flagged for human re-validation.
Q3: How do you handle evaluator bias?
A: The common biases: 1) position bias (favoring the first answer); 2) length bias (favoring longer answers); 3) style bias (favoring a particular writing style). Mitigations: state explicitly in the prompt that length is not a quality criterion; randomize answer order in pairwise comparisons; score on multiple dimensions rather than a single overall number.
Q4: Does production evaluation add latency for users?
A: No. Evaluation is fully asynchronous: only after the user has received the AI response is it sent to the evaluation queue. User experience is unaffected, and seconds-to-minutes of evaluation lag is perfectly acceptable.
Q5: How do you establish a baseline for a newly launched feature?
A: Before launch, run the feature on beta or internal traffic for two weeks and hand-label roughly 200 sampled items to build the initial baseline. Afterwards, trigger a review whenever automated scores deviate more than 20% from that baseline. Complete the golden-dataset cold start before the first public release.
Summary
The problem Wang Fang faced at the start of this article comes down to one thing: no continuous evaluation in production. Silent model updates by providers are the norm in this industry; without monitoring, you are driving blindfolded.
The core of a complete evaluation system:
- Golden dataset: a hand-curated, high-quality baseline that anchors the whole system
- LLM-as-Judge: automated evaluation over sampled production traffic
- Statistical drift detection: KS test + CUSUM for early detection of quality changes
- Spring Batch pipeline: batch evaluation, with results stored in ClickHouse
- Prometheus + Grafana: visual monitoring and alert rules
- Automatic failover: switch to a backup model when quality drops
From detection to fix, two weeks compressed to 30 minutes: that is what a continuous evaluation system buys you.
