Building an LLM Evaluation System: RAGAs, BLEU, and LLM-as-Judge in Practice

Lao Zhang, 2026/4/30, about 8 min read

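Who this is for: developers with 1-5 years of Java experience who want to move toward AI engineering
Reading time: about 17 minutes
What you'll get: ① a complete mental framework for evaluating AI systems ② when to use each of RAGAs, BLEU, and LLM-as-Judge ③ how to integrate an automated evaluation pipeline into a Spring Boot project

"Lao Zhang, how do we tell whether our RAG system is any good?"

Xiao Lin came to me with this question a few days ago. His team built an enterprise knowledge-base Q&A system. It has been in use for two months, and the product manager keeps asking: Are our AI answers accurate? Is this version better or worse than the last one? How do we quantify that?

"How are you evaluating it now?" I asked.

"Well... a few of us ask it a few questions by hand, and if the answers feel okay, we ship."

I sighed.

"You know, in AI engineering, having no evaluation system is like driving without a navigation system. You don't know where you're headed, and you don't know how far off course you are."

In this post I'll walk through three mainstream LLM evaluation methods and show how to put them into practice in a Java project.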

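When to use each of the three methods

First, the big picture:

Method	Cost	Speed	Accuracy	Best suited for
BLEU/ROUGE	Very low	Very fast	Medium (requires a reference answer)	Translation, summarization, tasks with a standard answer
RAGAs	Low (requires an LLM)	Fast	High (RAG-specific)	Any RAG system
LLM-as-Judge	Medium (requires LLM calls)	Medium	Highest	Open-ended generation, no reference answer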

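Method 1: BLEU/ROUGE (basic text matching)

BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are the most traditional text-evaluation metrics.

Core idea: match n-grams (runs of n consecutive words) between the generated text and a reference answer, and measure the overlap. BLEU is precision-oriented (how much of the output appears in the reference); ROUGE is recall-oriented (how much of the reference appears in the output).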

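import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.springframework.stereotype.Component;

@Component
public class TextSimilarityEvaluator {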

    /**
     * Computes a sentence-level BLEU-4 score: the geometric mean of the
     * 1-gram to 4-gram precisions, multiplied by a brevity penalty.
     */
    public double calculateBleu(String hypothesis, String reference) {
        hypothesis = hypothesis.trim().toLowerCase();
        reference = reference.trim().toLowerCase();

        double[] precisions = new double[4];
        for (int n = 1; n <= 4; n++) {
            precisions[n - 1] = calculateNgramPrecision(hypothesis, reference, n);
        }

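        // Geometric mean with uniform weights (standard BLEU formula).
        // The 1e-10 floor avoids log(0) when an n-gram order has no matches.
        double logSum = 0;
        for (double p : precisions) {
            logSum += Math.log(Math.max(p, 1e-10));
        }
        double geometricMean = Math.exp(logSum / 4);

        // Brevity penalty, computed on token counts rather than character counts
        int hypLen = hypothesis.isEmpty() ? 0 : hypothesis.split("\\s+").length;
        int refLen = reference.isEmpty() ? 0 : reference.split("\\s+").length;
        if (hypLen == 0) {
            return 0.0;
        }
        double bp = hypLen >= refLen
                ? 1.0
                : Math.exp(1 - (double) refLen / hypLen);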

        return bp * geometricMean;
    }

    /**
     * Computes ROUGE-L: an F1 score based on the longest common subsequence (LCS) of tokens.
     */
    public double calculateRougeL(String hypothesis, String reference) {
        // Normalize the same way as calculateBleu so the two metrics are comparable
        String[] hypWords = hypothesis.trim().toLowerCase().split("\\s+");
        String[] refWords = reference.trim().toLowerCase().split("\\s+");

        int lcsLength = longestCommonSubsequence(hypWords, refWords);

        double precision = (double) lcsLength / hypWords.length;
        double recall = (double) lcsLength / refWords.length;

        if (precision + recall == 0) return 0;

        // F1: harmonic mean of LCS precision and LCS recall
        return 2 * precision * recall / (precision + recall);
    }

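    /**
     * Clipped n-gram precision: each hypothesis n-gram counts as a match
     * at most as many times as it appears in the reference.
     */
    private double calculateNgramPrecision(String hypothesis, String reference, int n) {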
        Map<String, Integer> hypNgrams = getNgramCounts(hypothesis, n);
        Map<String, Integer> refNgrams = getNgramCounts(reference, n);

        int matches = 0;
        int total = 0;

        for (Map.Entry<String, Integer> entry : hypNgrams.entrySet()) {
            total += entry.getValue();
            int refCount = refNgrams.getOrDefault(entry.getKey(), 0);
            matches += Math.min(entry.getValue(), refCount);
        }

        return total == 0 ? 0 : (double) matches / total;
    }

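    private Map<String, Integer> getNgramCounts(String text, int n) {
        // Whitespace tokenization: fine for space-delimited languages; text without
        // spaces between words (e.g. Chinese) needs a proper tokenizer first.
        String[] words = text.split("\\s+");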
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i <= words.length - n; i++) {
            String ngram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(ngram, 1, Integer::sum);
        }
        return counts;
    }

    private int longestCommonSubsequence(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }
}
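
To see why calculateNgramPrecision clips each match count with Math.min, consider a degenerate hypothesis that just repeats one common word. The standalone sketch below compares unclipped and clipped unigram precision on the classic "the the the..." example from the BLEU literature. The class name ClippedPrecisionDemo and its helpers are illustrative and not part of the evaluator above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative demo; names are hypothetical and not part of TextSimilarityEvaluator.
class ClippedPrecisionDemo {

    // Count how often each whitespace-separated token occurs.
    static Map<String, Integer> unigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.trim().toLowerCase().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Unigram precision; when clip is true, each token matches at most
    // as many times as it occurs in the reference.
    static double unigramPrecision(String hypothesis, String reference, boolean clip) {
        Map<String, Integer> hyp = unigramCounts(hypothesis);
        Map<String, Integer> ref = unigramCounts(reference);
        int matches = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : hyp.entrySet()) {
            int hypCount = e.getValue();
            int refCount = ref.getOrDefault(e.getKey(), 0);
            total += hypCount;
            if (clip) {
                matches += Math.min(hypCount, refCount);
            } else if (refCount > 0) {
                matches += hypCount;
            }
        }
        return total == 0 ? 0.0 : (double) matches / total;
    }

    public static void main(String[] args) {
        String hypothesis = "the the the the the the the";
        String reference = "the cat is on the mat";
        // Unclipped: all 7 tokens "match" because "the" occurs in the reference (7/7).
        System.out.println(unigramPrecision(hypothesis, reference, false));
        // Clipped: "the" occurs only twice in the reference, so only 2 of 7 count (2/7).
        System.out.println(unigramPrecision(hypothesis, reference, true));
    }
}
```

Without clipping, a model could score perfectly by repeating a frequent word; with clipping, the same output is penalized heavily.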

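Method 2: RAGAs (RAG-specific evaluation)

RAGAs is an evaluation framework designed specifically for RAG systems. It scores a system along four dimensions: faithfulness, answer relevancy, context precision, and context recall.

A prompt-based implementation of three of these metrics (faithfulness, answer relevance, context recall) with Java + Spring AI: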

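import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;

@Service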
@RequiredArgsConstructor
@Slf4j
public class RAGAsEvaluator {

    private final ChatClient evalChatClient;

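    /**
     * Faithfulness: can every statement in the answer be grounded in the retrieved documents?
     * Score range: 0-1, higher is better.
     */
    public double evaluateFaithfulness(String answer, List<String> contexts) {
        String contextText = String.join("\n---\n", contexts);

        String prompt = """
                [Task] Evaluate the faithfulness of the "AI answer" below: is every statement in the answer supported by the "reference documents"?

                [Reference documents]
                %s

                [AI answer]
                %s

                [Scoring steps]
                1. List all statements in the answer (each sentence counts as one statement)
                2. For each statement, decide whether the reference documents support it (yes/no)
                3. faithfulness = supported statements / total statements

                Return JSON only: {"statements_total": 5, "statements_supported": 4, "faithfulness": 0.8}
                """.formatted(contextText, answer);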

        String response = evalChatClient.prompt().user(prompt).call().content();
        return parseDoubleField(response, "faithfulness");
    }

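    /**
     * Answer relevance: does the generated answer actually address the user's question?
     */
    public double evaluateAnswerRelevance(String question, String answer) {
        String prompt = """
                [Task] Evaluate how relevant the "AI answer" below is to the "user question".

                [User question] %s
                [AI answer] %s

                [Scoring dimensions]
                1. Does the answer respond directly to the question? (0-1)
                2. Is the answer complete, with no key points missing? (0-1)
                3. Is the answer concise, without large amounts of irrelevant content? (0-1)

                Report the average of the three as relevance_score.
                Return JSON only: {"relevance_score": 0.85, "reason": "one-sentence justification"}
                """.formatted(question, answer);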

        String response = evalChatClient.prompt().user(prompt).call().content();
        return parseDoubleField(response, "relevance_score");
    }

    /**
     * Context recall: was all the key information in the ground-truth answer actually retrieved?
     */
    public double evaluateContextRecall(String groundTruth, List<String> contexts) {
        String contextText = String.join("\n---\n", contexts);

        String prompt = """
                [Task] Evaluate whether the "retrieved documents" cover all the key information in the "ground truth".

                [Ground truth] %s

                [Retrieved documents] %s

                List each key point in the ground truth and decide whether it can be found in the retrieved documents.
                Return JSON only: {"key_points_total": 4, "key_points_covered": 3, "context_recall": 0.75}
                """.formatted(groundTruth, contextText);

        String response = evalChatClient.prompt().user(prompt).call().content();
        return parseDoubleField(response, "context_recall");
    }

    /**
     * Computes all RAGAs metrics for one sample.
     */
    public RAGAsResult evaluateAll(EvalSample sample) {
        double faithfulness = evaluateFaithfulness(sample.getAnswer(), sample.getContexts());
        double answerRelevance = evaluateAnswerRelevance(sample.getQuestion(), sample.getAnswer());
        double contextRecall = evaluateContextRecall(sample.getGroundTruth(), sample.getContexts());

        // Overall score: unweighted mean of the three metrics (a convention of this example)
        double overallScore = (faithfulness + answerRelevance + contextRecall) / 3;

        return RAGAsResult.builder()
                .faithfulness(faithfulness)
                .answerRelevance(answerRelevance)
                .contextRecall(contextRecall)
                .overallScore(overallScore)
                .build();
    }

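    private double parseDoubleField(String json, String field) {
        try {
            // Strip Markdown code fences the model may wrap around the JSON
            String cleaned = json.replaceAll("```json\\s*|```", "").trim();
            JsonNode node = new ObjectMapper().readTree(cleaned);
            JsonNode value = node.get(field);
            if (value == null || !value.isNumber()) {
                log.warn("RAGAs metric field '{}' is missing or not numeric in: {}", field, cleaned);
                return 0.0;
            }
            return value.asDouble();
        } catch (Exception e) {
            // Note: a parse failure is reported as 0.0, which lowers the averages
            log.warn("Failed to parse RAGAs metric: {}", e.getMessage());
            return 0.0;
        }
    }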
}
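
parseDoubleField above returns 0.0 when the model's reply cannot be parsed, which is indistinguishable from a genuine score of 0 and silently lowers the averages. One possible alternative is to return an empty result on failure and let the caller skip that sample. The sketch below uses only the JDK (a regular expression instead of Jackson) so that it stays self-contained. ScoreExtractor and extractScore are hypothetical names, and the regex only handles flat numeric fields like the ones requested in the prompts above.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch, not a general JSON parser.
class ScoreExtractor {

    // Finds "field": <number> anywhere in the reply, tolerating code fences
    // or extra prose around the JSON object.
    static OptionalDouble extractScore(String reply, String field) {
        if (reply == null) {
            return OptionalDouble.empty();
        }
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(field) + "\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)");
        Matcher m = p.matcher(reply);
        if (!m.find()) {
            return OptionalDouble.empty();
        }
        double value = Double.parseDouble(m.group(1));
        // Scores are expected in [0, 1]; treat anything else as a failed parse.
        if (value < 0.0 || value > 1.0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(value);
    }

    public static void main(String[] args) {
        String reply = "Here is the result: {\"faithfulness\": 0.8} Hope this helps.";
        System.out.println(extractScore(reply, "faithfulness"));
        System.out.println(extractScore("Sorry, I cannot evaluate this.", "faithfulness"));
    }
}
```

With this shape, the caller can count failed parses separately and average only the samples that produced a valid score.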

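Method 3: LLM-as-Judge (AI grading AI)

For open-ended generation tasks with no reference answer, having an LLM assign the score is currently the closest automated approximation to human evaluation.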

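import java.util.HashMap;
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Component;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;

@Component
@RequiredArgsConstructor
@Slf4j
public class LlmJudgeEvaluator {

    private final ChatClient judgeChatClient;  // LLM used as the judge (usually stronger than the model under evaluation)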

    /**
     * Single-answer grading: scores one answer.
     */
    public JudgeResult evaluate(String question, String answer, EvalCriteria criteria) {
        String prompt = buildJudgePrompt(question, answer, criteria);

        String response = judgeChatClient.prompt()
                .system("You are a strict, objective expert in assessing the quality of AI answers.")
                .user(prompt)
                .call()
                .content();

        return parseJudgeResult(response);
    }

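    /**
     * Pairwise comparison (A/B test): decides which of two answers is better.
     */
    public ComparisonResult compare(String question,
                                     String answerA, String answerB) {
        String prompt = """
                [Task] Compare the two AI answers below and decide which one is better.

                [User question] %s

                [Answer A] %s

                [Answer B] %s

                [Evaluation dimensions]
                1. Accuracy: are the facts correct?
                2. Completeness: does it fully answer the question?
                3. Conciseness: is it brief and free of padding?
                4. Usefulness: how much practical value does it have?

                Return JSON only ("winner" must be one of "A", "B", "TIE"):
                {"winner": "B", "confidence": 0.8,
                 "score_a": 7.5, "score_b": 8.0,
                 "reason": "B is more complete; accuracy is comparable"}
                """.formatted(question, answerA, answerB);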

        String response = judgeChatClient.prompt().user(prompt).call().content();

        try {
            String cleaned = response.replaceAll("```json\\s*|```", "").trim();
            ObjectMapper mapper = new ObjectMapper();
            JsonNode node = mapper.readTree(cleaned);

            return ComparisonResult.builder()
                    .winner(node.get("winner").asText())
                    .confidence(node.get("confidence").asDouble())
                    .scoreA(node.get("score_a").asDouble())
                    .scoreB(node.get("score_b").asDouble())
                    .reason(node.get("reason").asText())
                    .build();
        } catch (Exception e) {
            log.error("Failed to parse comparison result", e);
            return ComparisonResult.error();
        }
    }

    private String buildJudgePrompt(String question, String answer, EvalCriteria criteria) {
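        return """
                [Evaluation task] Score the AI answer below against the given criteria (0-10)

                [User question] %s
                [AI answer] %s

                [Scoring criteria] %s

                Score each criterion separately, then give an overall score.
                Return JSON only: {"scores": {"accuracy": 8, "completeness": 7, "clarity": 9},
                             "overall": 8.0, "feedback": "one sentence each on strengths and weaknesses"}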
                """.formatted(question, answer, criteria.getDescription());
    }

    private JudgeResult parseJudgeResult(String json) {
        try {
            String cleaned = json.replaceAll("```json\\s*|```", "").trim();
            ObjectMapper mapper = new ObjectMapper();
            JsonNode node = mapper.readTree(cleaned);

            Map<String, Double> scores = new HashMap<>();
            node.get("scores").fields().forEachRemaining(e ->
                    scores.put(e.getKey(), e.getValue().asDouble()));

            return JudgeResult.builder()
                    .scores(scores)
                    .overall(node.get("overall").asDouble())
                    .feedback(node.get("feedback").asText())
                    .build();
        } catch (Exception e) {
            log.error("Failed to parse judge result", e);
            return JudgeResult.error();
        }
    }
}
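
Pairwise LLM judges are known to be sensitive to the order in which the two answers are presented. A common mitigation, sketched below, is to ask twice with the positions swapped and accept a winner only when both runs agree. PairwiseJudge and debiasedWinner are hypothetical names introduced for this example; with the class above, the judge would be a thin wrapper around compare(...) that returns the winner string.

```java
// Hypothetical abstraction over a pairwise judge: returns "A", "B" or "TIE",
// where "A" means the first answer passed in and "B" the second.
interface PairwiseJudge {
    String winner(String question, String first, String second);
}

class PositionSwapComparison {

    // Runs the judge in both orders and maps the swapped verdict back to the
    // original labels. Disagreement between the two runs is reported as a tie.
    // Assumes the judge never returns null.
    static String debiasedWinner(PairwiseJudge judge, String question,
                                 String answerA, String answerB) {
        String forward = judge.winner(question, answerA, answerB);
        String swapped = judge.winner(question, answerB, answerA);

        // In the swapped run, position "A" held answerB and vice versa.
        String swappedMapped;
        if ("A".equals(swapped)) {
            swappedMapped = "B";
        } else if ("B".equals(swapped)) {
            swappedMapped = "A";
        } else {
            swappedMapped = "TIE";
        }
        return forward.equals(swappedMapped) ? forward : "TIE";
    }

    public static void main(String[] args) {
        // A deliberately biased stub judge that always prefers the first position.
        PairwiseJudge alwaysFirst = (q, first, second) -> "A";
        System.out.println(debiasedWinner(alwaysFirst, "q", "x", "y"));

        // A stub judge that prefers the longer answer regardless of position.
        PairwiseJudge longer = (q, first, second) ->
                first.length() > second.length() ? "A"
                        : second.length() > first.length() ? "B" : "TIE";
        System.out.println(debiasedWinner(longer, "q", "short", "a longer answer"));
    }
}
```

The cost is two judge calls per comparison, so this is best reserved for A/B decisions that actually gate a release.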

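Automated evaluation pipeline

Combine the evaluators into an automated pipeline so that every iteration is scored without manual effort. The version below runs RAGAs and LLM-as-Judge over a fixed test set as a nightly scheduled job; the same entry point can also be triggered from a CI/CD stage: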

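import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;

@Service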
@RequiredArgsConstructor
@Slf4j
public class EvaluationPipeline {

    private final RAGAsEvaluator ragAsEvaluator;
    private final LlmJudgeEvaluator llmJudge;
    private final EvalDatasetRepository datasetRepository;
    private final EvalResultRepository resultRepository;

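    /**
     * Runs the full evaluation pipeline.
     * Note: @Scheduled only fires if scheduling is enabled (e.g. @EnableScheduling on a configuration class).
     */
    @Scheduled(cron = "0 0 2 * * ?")  // runs automatically every day at 02:00
    public EvalReport runFullEvaluation() {
        log.info("Starting automated evaluation...");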

        List<EvalSample> testDataset = datasetRepository.findAll();
        List<EvalResult> results = new ArrayList<>();

        for (EvalSample sample : testDataset) {
            try {
                EvalResult result = evaluateSample(sample);
                results.add(result);
                resultRepository.save(result);
            } catch (Exception e) {
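                log.error("Sample evaluation failed, sampleId={}", sample.getId(), e);
            }
        }

        EvalReport report = generateReport(results);
        // SLF4J only understands "{}" placeholders, so format the decimals explicitly
        log.info("Evaluation finished: samples={}, avg RAGAs={}, avg Judge score={}",
                results.size(),
                String.format("%.2f", report.getAvgRagas()),
                report.getAvgJudgeScore());

        // Send an alert if scores dropped by more than the threshold
        checkRegressionAlert(report);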

        return report;
    }

    private EvalResult evaluateSample(EvalSample sample) {
        RAGAsResult ragas = ragAsEvaluator.evaluateAll(sample);
        JudgeResult judge = llmJudge.evaluate(
                sample.getQuestion(),
                sample.getAnswer(),
                EvalCriteria.STANDARD
        );

        return EvalResult.builder()
                .sampleId(sample.getId())
                .ragas(ragas)
                .judgeResult(judge)
                .evaluatedAt(LocalDateTime.now())
                .build();
    }

    // generateReport(...) and checkRegressionAlert(...) are not shown in this article.
}
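
checkRegressionAlert is referenced in the pipeline but its body is not shown. Here is a minimal sketch of what such a check could look like, assuming the previous run's average is available as a baseline and a fixed absolute drop is the alert condition. RegressionCheck, hasRegressed, and the 0.05 threshold are illustrative choices, not values taken from the pipeline above.

```java
import java.util.List;

// Illustrative only: the class, method names, and threshold are assumptions.
class RegressionCheck {

    // Maximum tolerated drop in the average score between two runs.
    static final double MAX_DROP = 0.05;

    // Arithmetic mean of per-sample scores; 0.0 for an empty list.
    static double average(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // True when the current average fell below the baseline by more than MAX_DROP.
    static boolean hasRegressed(double baselineAvg, double currentAvg) {
        return baselineAvg - currentAvg > MAX_DROP;
    }

    public static void main(String[] args) {
        double baseline = average(List.of(0.85, 0.90, 0.80));
        double current = average(List.of(0.70, 0.80, 0.75));
        if (hasRegressed(baseline, current)) {
            System.out.println("ALERT: average score dropped from "
                    + String.format("%.2f", baseline) + " to "
                    + String.format("%.2f", current));
        }
    }
}
```

Because LLM-based scores vary somewhat from run to run, the threshold should be wide enough that ordinary noise does not trigger alerts.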

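With this system in place, when Xiao Lin's product manager asks "are the AI's answers accurate?", he can finally answer with a chart: not "it feels okay", but "RAGAs faithfulness is 0.87, up 0.04 from last week."

That is the kind of answer an engineer should give.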