Post #2086: RAG Evaluation in Java - RAGAS Core Metrics from Theory to Code
Audience: engineers evaluating RAG system quality | Reading time: ~20 minutes | Core value: a deep look at how the four core RAGAS metrics are computed, plus a Java implementation of a RAG evaluation framework that integrates with CI/CD
A reader asked me recently: his RAG system's answers "feel" decent, but he cannot quantify that, so he has no way to tell whether his changes actually help. The problem is typical: without quantitative evaluation, optimization is walking in the dark.
RAGAS (Retrieval-Augmented Generation Assessment) is currently the most mature RAG evaluation framework, originally implemented in Python. This article reimplements its core ideas in Java so they can be dropped straight into a Java-stack project.
The Four Core RAGAS Metrics
The four metrics are complementary: Faithfulness guards against hallucination, Answer Relevancy against off-topic answers, Context Recall against missing information, and Context Precision against retrieval noise.
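Before diving into the implementations, here is a toy sketch of the three ratio-style scores on made-up counts (Answer Relevancy is not a simple ratio: it averages embedding similarities, as shown in its own section). All numbers are illustrative; the real inputs come from the LLM-based claim extraction and verification steps below.

```java
// Toy illustration of the three ratio-style RAGAS scores; all counts are made up.
public class RagasRatios {
    /** Faithfulness: supported answer claims / total answer claims */
    static double faithfulness(int supported, int total) {
        return total == 0 ? 1.0 : (double) supported / total;
    }
    /** Context Recall: reference claims covered by the context / total reference claims */
    static double contextRecall(int covered, int total) {
        return total == 0 ? 1.0 : (double) covered / total;
    }
    /** Context Precision: useful retrieved chunks / total retrieved chunks */
    static double contextPrecision(int relevant, int total) {
        return total == 0 ? 0.0 : (double) relevant / total;
    }

    public static void main(String[] args) {
        System.out.println(faithfulness(8, 10));    // 0.8: two claims lack support, hallucination risk
        System.out.println(contextRecall(6, 8));    // 0.75: retrieval missed a quarter of the reference facts
        System.out.println(contextPrecision(3, 5)); // 0.6: two of five chunks are noise
    }
}
```

The zero-denominator conventions mirror the evaluators below: an answer with no claims is trivially faithful, while an empty retrieval set gets zero precision.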
Evaluation Dataset Format
/**
* Evaluation sample
* Note: referenceAnswer (the reference answer) is optional
* Without one, some metrics cannot be computed (Context Recall requires a reference answer)
*/
@Data
@Builder
public class EvaluationSample {
/**
* The user's question
*/
private String question;
/**
* The contexts the RAG system actually retrieved (multiple chunks)
*/
private List<String> contexts;
/**
* The answer generated by the RAG system
*/
private String answer;
/**
* Reference answer (ground truth), used to compute Context Recall
* Can be human-annotated or extracted from authoritative documents
*/
private String referenceAnswer;
/**
* Sample ID (for tracing)
*/
private String sampleId;
/**
* Label (for grouping by topic)
*/
private String category;
}
Metric 1: Faithfulness
Core question: can every claim in the answer be supported by the retrieved context?
/**
* Faithfulness evaluation
*
* How it is computed:
* 1. Decompose the answer into atomic claims
* 2. For each claim, check whether the context supports it
* 3. score = supported claims / total claims
*/
@Service
@RequiredArgsConstructor
@Slf4j
public class FaithfulnessEvaluator {
private final ChatLanguageModel llm;
private final ObjectMapper objectMapper;
public FaithfulnessScore evaluate(EvaluationSample sample) {
if (sample.getAnswer() == null || sample.getContexts() == null ||
sample.getContexts().isEmpty()) {
return FaithfulnessScore.empty();
}
// Step 1: decompose the answer into atomic claims
List<String> claims = extractAtomicClaims(sample.getAnswer());
if (claims.isEmpty()) {
return new FaithfulnessScore(1.0, List.of(), List.of(), 0);
}
// Step 2: check whether each claim is supported by the context
String contextStr = String.join("\n---\n", sample.getContexts());
List<ClaimVerification> verifications = verifyClaims(claims, contextStr);
// Step 3: compute the score
long supportedCount = verifications.stream()
.filter(ClaimVerification::isSupported)
.count();
double score = (double) supportedCount / claims.size();
List<String> supportedClaims = verifications.stream()
.filter(ClaimVerification::isSupported)
.map(ClaimVerification::claim)
.toList();
List<String> unsupportedClaims = verifications.stream()
.filter(v -> !v.isSupported())
.map(ClaimVerification::claim)
.toList();
// SLF4J's {} placeholders don't support format specifiers like {:.3f}; pre-format the value
log.debug("Faithfulness: {}/{} claims supported, score={}",
supportedCount, claims.size(), String.format("%.3f", score));
return new FaithfulnessScore(score, supportedClaims, unsupportedClaims, claims.size());
}
/**
* Decompose the answer into atomic claims
* For example: "Java is an object-oriented language developed by Sun"
* → ["Java is an object-oriented language", "Java was developed by Sun"]
*/
private List<String> extractAtomicClaims(String answer) {
String prompt = String.format("""
Decompose the following answer into independent atomic claims, each containing exactly one fact.
Answer:
%s
Requirements:
1. Each claim is an independent, verifiable fact
2. Drop transition words and qualifiers
3. One claim per line
4. At most 10 claims
5. Output only the list of claims, nothing else
Claims:
""", answer);
String response = llm.generate(prompt).trim();
return Arrays.stream(response.split("\n"))
.map(String::trim)
.filter(s -> !s.isEmpty())
.filter(s -> !s.startsWith("#")) // drop comment lines
.limit(10)
.toList();
}
/**
* Verify in one batch whether each claim is supported by the context
* A single request verifies all claims, reducing API calls
*/
private List<ClaimVerification> verifyClaims(List<String> claims, String context) {
String claimsJson = IntStream.range(0, claims.size())
.mapToObj(i -> String.format(" {\"id\": %d, \"claim\": \"%s\"}",
i, claims.get(i).replace("\"", "\\\"").replace("\n", " ")))
.collect(Collectors.joining(",\n"));
String prompt = String.format("""
Given the following context, determine whether each claim is supported by it.
Context:
%s
Claims:
[
%s
]
Output a JSON array where each element has an id and a verdict ("supported" or "not_supported"):
[{"id": 0, "verdict": "supported"}, ...]
Output only the JSON, nothing else:
""", context, claimsJson);
try {
String response = llm.generate(prompt).trim();
String json = extractJsonArray(response);
List<Map<String, Object>> results = objectMapper.readValue(json,
new TypeReference<>() {});
return results.stream()
.map(r -> {
int id = ((Number) r.get("id")).intValue();
String verdict = (String) r.get("verdict");
return new ClaimVerification(
id < claims.size() ? claims.get(id) : "unknown",
"supported".equals(verdict)
);
})
.toList();
} catch (Exception e) {
log.warn("Failed to parse claim verification output: {}", e.getMessage());
// On parse failure, be conservative: treat every claim as unsupported
return claims.stream()
.map(c -> new ClaimVerification(c, false))
.toList();
}
}
private String extractJsonArray(String text) {
int start = text.indexOf('[');
int end = text.lastIndexOf(']');
return start >= 0 && end > start ? text.substring(start, end + 1) : "[]";
}
public record ClaimVerification(String claim, boolean isSupported) {}
public record FaithfulnessScore(
double score,
List<String> supportedClaims,
List<String> unsupportedClaims,
int totalClaims
) {
public static FaithfulnessScore empty() {
return new FaithfulnessScore(1.0, List.of(), List.of(), 0);
}
}
}
Metric 2: Answer Relevancy
Core question: looking only at the answer, could you reconstruct the original question?
/**
* Answer Relevancy evaluation
*
* A distinctive computation:
* 1. Ask the LLM to reverse-generate N questions from the answer
* 2. Average the semantic similarity between the N generated questions and the original question
*
* Intuition:
* - If the answer is on topic, the reverse-generated questions will closely resemble the original
* - If the answer is off topic, they will land far from the original
*/
@Service
@RequiredArgsConstructor
@Slf4j
public class AnswerRelevancyEvaluator {
private final ChatLanguageModel llm;
private final EmbeddingModel embeddingModel;
private static final int NUM_GENERATED_QUESTIONS = 3;
public AnswerRelevancyScore evaluate(EvaluationSample sample) {
if (sample.getAnswer() == null || sample.getQuestion() == null) {
return AnswerRelevancyScore.empty();
}
// Step 1: reverse-generate questions from the answer
List<String> generatedQuestions = generateQuestionsFromAnswer(
sample.getAnswer(), NUM_GENERATED_QUESTIONS);
if (generatedQuestions.isEmpty()) {
return AnswerRelevancyScore.empty();
}
// Step 2: cosine similarity between each generated question and the original
// Note: langchain4j's EmbeddingModel.embed() returns Response<Embedding>; unwrap it to float[]
float[] originalEmbedding = embeddingModel.embed(sample.getQuestion()).content().vector();
List<Double> similarities = generatedQuestions.stream()
.map(q -> {
float[] genEmbedding = embeddingModel.embed(q).content().vector();
return cosineSimilarity(originalEmbedding, genEmbedding);
})
.toList();
// Step 3: average the similarities for the final score
double avgSimilarity = similarities.stream()
.mapToDouble(Double::doubleValue)
.average()
.orElse(0.0);
log.debug("Answer Relevancy: generatedQ={}, similarities={}, avg={}",
generatedQuestions.size(), similarities, String.format("%.3f", avgSimilarity));
return new AnswerRelevancyScore(avgSimilarity, generatedQuestions, similarities);
}
private List<String> generateQuestionsFromAnswer(String answer, int count) {
String prompt = String.format("""
Given the following answer, generate %d questions that could have led to it.
Answer:
%s
Requirements:
1. One question per line
2. Questions should sound natural, like something a real user would ask
3. Questions should differ from each other (not mere rephrasings)
4. Output only the list of questions, no numbering or explanations
Questions:
""", count, answer);
String response = llm.generate(prompt).trim();
return Arrays.stream(response.split("\n"))
.map(String::trim)
.filter(s -> !s.isEmpty() && (s.endsWith("?") || s.endsWith("?"))) // parentheses required: && binds tighter than ||; accepts full-width and ASCII question marks
.limit(count)
.toList();
}
private double cosineSimilarity(float[] a, float[] b) {
double dot = 0, normA = 0, normB = 0;
for (int i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
if (normA == 0 || normB == 0) return 0;
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
public record AnswerRelevancyScore(
double score,
List<String> generatedQuestions,
List<Double> questionSimilarities
) {
public static AnswerRelevancyScore empty() {
return new AnswerRelevancyScore(0.0, List.of(), List.of());
}
}
}
Metric 3: Context Recall
/**
* Context Recall evaluation (requires a reference answer)
*
* Core question: can every claim in the reference answer be supported by the retrieved context?
* score = reference claims covered by the context / total reference claims
*/
@Service
@RequiredArgsConstructor
@Slf4j
public class ContextRecallEvaluator {
private final ChatLanguageModel llm;
private final ObjectMapper objectMapper;
public ContextRecallScore evaluate(EvaluationSample sample) {
if (sample.getReferenceAnswer() == null || sample.getContexts() == null ||
sample.getContexts().isEmpty()) {
log.debug("Context Recall requires a reference answer, skipping: sampleId={}", sample.getSampleId());
return ContextRecallScore.noReference();
}
// Step 1: decompose the reference answer into claims
List<String> referenceClaims = extractClaims(sample.getReferenceAnswer());
if (referenceClaims.isEmpty()) {
return new ContextRecallScore(1.0, List.of(), List.of(), true);
}
// Step 2: check whether each reference claim appears in the context
String contextStr = String.join("\n---\n", sample.getContexts());
List<ClaimCoverage> coverages = checkCoverage(referenceClaims, contextStr);
long coveredCount = coverages.stream().filter(ClaimCoverage::isCovered).count();
double score = (double) coveredCount / referenceClaims.size();
return new ContextRecallScore(
score,
coverages.stream().filter(ClaimCoverage::isCovered)
.map(ClaimCoverage::claim).toList(),
coverages.stream().filter(c -> !c.isCovered())
.map(ClaimCoverage::claim).toList(),
true
);
}
private List<String> extractClaims(String text) {
String prompt = String.format("""
Decompose the following text into independent factual claims (one per line, at most 8):
%s
Output only the list of claims:
""", text);
return Arrays.stream(llm.generate(prompt).trim().split("\n"))
.map(String::trim)
.filter(s -> !s.isEmpty())
.limit(8)
.toList();
}
private List<ClaimCoverage> checkCoverage(List<String> claims, String context) {
String claimsText = IntStream.range(0, claims.size())
.mapToObj(i -> (i + 1) + ". " + claims.get(i))
.collect(Collectors.joining("\n"));
String prompt = String.format("""
Determine whether each of the following claims is supported by the context.
Context:
%s
Claims:
%s
Output JSON (verdict: "covered" or "not_covered"):
[{"index": 1, "verdict": "covered"}, ...]
Output only the JSON:
""", context, claimsText);
try {
String response = llm.generate(prompt).trim();
String json = extractJsonArray(response);
List<Map<String, Object>> results = objectMapper.readValue(json,
new TypeReference<>() {});
return results.stream()
.map(r -> {
int idx = ((Number) r.get("index")).intValue() - 1;
String verdict = (String) r.get("verdict");
return new ClaimCoverage(
idx >= 0 && idx < claims.size() ? claims.get(idx) : "",
"covered".equals(verdict)
);
})
.toList();
} catch (Exception e) {
log.warn("Failed to parse Context Recall output: {}", e.getMessage());
return claims.stream().map(c -> new ClaimCoverage(c, false)).toList();
}
}
private String extractJsonArray(String text) {
int start = text.indexOf('[');
int end = text.lastIndexOf(']');
return start >= 0 && end > start ? text.substring(start, end + 1) : "[]";
}
public record ClaimCoverage(String claim, boolean isCovered) {}
public record ContextRecallScore(
double score,
List<String> coveredClaims,
List<String> uncoveredClaims,
boolean hasReference
) {
public static ContextRecallScore noReference() {
return new ContextRecallScore(-1.0, List.of(), List.of(), false);
}
}
}
Metric 4: Context Precision
/**
* Context Precision evaluation
*
* Core question: of all the retrieved chunks, how many are actually needed to answer the question?
* score = useful chunks / total chunks
*
* Measures the retrieval "noise ratio"; a low score means retrieval pulled in too much irrelevant content
*/
@Service
@RequiredArgsConstructor
@Slf4j
public class ContextPrecisionEvaluator {
private final ChatLanguageModel llm;
private final ObjectMapper objectMapper;
public ContextPrecisionScore evaluate(EvaluationSample sample) {
if (sample.getContexts() == null || sample.getContexts().isEmpty()) {
return ContextPrecisionScore.empty();
}
// Check whether each chunk is useful for answering the question
List<ChunkRelevance> relevances = evaluateChunkRelevance(
sample.getQuestion(),
sample.getContexts(),
sample.getAnswer()
);
long relevantCount = relevances.stream().filter(ChunkRelevance::isRelevant).count();
double score = (double) relevantCount / sample.getContexts().size();
// Also compute Average Precision (AP), which takes ranking into account
double ap = calculateAveragePrecision(relevances);
log.debug("Context Precision: {}/{} chunks relevant, score={}, AP={}",
relevantCount, sample.getContexts().size(), String.format("%.3f", score), String.format("%.3f", ap));
return new ContextPrecisionScore(score, ap, relevances);
}
private List<ChunkRelevance> evaluateChunkRelevance(
String question, List<String> contexts, String answer) {
// Judge the relevance of all chunks in one batched request
StringBuilder prompt = new StringBuilder();
prompt.append(String.format("Question: %s\n\n", question));
if (answer != null) {
prompt.append(String.format("Actual answer: %s\n\n", answer));
}
prompt.append("Determine whether each of the following context fragments is useful for answering the question:\n\n");
for (int i = 0; i < contexts.size(); i++) {
prompt.append(String.format("Fragment %d:\n%s\n\n",
i + 1,
contexts.get(i).substring(0, Math.min(300, contexts.get(i).length()))));
}
prompt.append("Output JSON (verdict: \"relevant\" or \"not_relevant\"):\n");
prompt.append("[{\"index\": 1, \"verdict\": \"relevant\"}, ...]\nOutput only the JSON:");
try {
String response = llm.generate(prompt.toString()).trim();
String json = extractJsonArray(response);
List<Map<String, Object>> results = objectMapper.readValue(json,
new TypeReference<>() {});
return results.stream()
.map(r -> {
int idx = ((Number) r.get("index")).intValue() - 1;
String verdict = (String) r.get("verdict");
return new ChunkRelevance(
idx >= 0 && idx < contexts.size() ? contexts.get(idx) : "",
idx,
"relevant".equals(verdict)
);
})
.toList();
} catch (Exception e) {
log.warn("Failed to parse Context Precision output: {}", e.getMessage());
return IntStream.range(0, contexts.size())
.mapToObj(i -> new ChunkRelevance(contexts.get(i), i, false))
.toList();
}
}
/**
* Compute Average Precision (AP)
* Takes the rank of relevant chunks into account: relevant chunks ranked earlier are worth more
*/
private double calculateAveragePrecision(List<ChunkRelevance> relevances) {
if (relevances.isEmpty()) return 0.0;
int relevantSoFar = 0;
double sumPrecision = 0.0;
int totalRelevant = (int) relevances.stream().filter(ChunkRelevance::isRelevant).count();
if (totalRelevant == 0) return 0.0;
for (int i = 0; i < relevances.size(); i++) {
if (relevances.get(i).isRelevant()) {
relevantSoFar++;
double precision = (double) relevantSoFar / (i + 1);
sumPrecision += precision;
}
}
return sumPrecision / totalRelevant;
}
private String extractJsonArray(String text) {
int start = text.indexOf('[');
int end = text.lastIndexOf(']');
return start >= 0 && end > start ? text.substring(start, end + 1) : "[]";
}
public record ChunkRelevance(String chunk, int position, boolean isRelevant) {}
public record ContextPrecisionScore(
double score, double averagePrecision, List<ChunkRelevance> chunkRelevances
) {
public static ContextPrecisionScore empty() {
return new ContextPrecisionScore(0.0, 0.0, List.of());
}
}
}
Assembling the Evaluation Framework
/**
* Combined RAG evaluation framework
* Aggregates the four metrics and produces an evaluation report
*/
@Service
@RequiredArgsConstructor
@Slf4j
public class RagEvaluationFramework {
private final FaithfulnessEvaluator faithfulnessEvaluator;
private final AnswerRelevancyEvaluator answerRelevancyEvaluator;
private final ContextRecallEvaluator contextRecallEvaluator;
private final ContextPrecisionEvaluator contextPrecisionEvaluator;
// Quality gate thresholds
private static final double MIN_FAITHFULNESS = 0.80;
private static final double MIN_ANSWER_RELEVANCY = 0.75;
private static final double MIN_CONTEXT_RECALL = 0.70;
private static final double MIN_CONTEXT_PRECISION = 0.65;
/**
* Run the full evaluation on a single sample
*/
public SampleEvaluationResult evaluate(EvaluationSample sample) {
log.info("Evaluating sample: sampleId={}, question={}",
sample.getSampleId(),
sample.getQuestion().substring(0, Math.min(50, sample.getQuestion().length())));
FaithfulnessEvaluator.FaithfulnessScore faithfulness =
faithfulnessEvaluator.evaluate(sample);
AnswerRelevancyEvaluator.AnswerRelevancyScore answerRelevancy =
answerRelevancyEvaluator.evaluate(sample);
ContextRecallEvaluator.ContextRecallScore contextRecall =
contextRecallEvaluator.evaluate(sample);
ContextPrecisionEvaluator.ContextPrecisionScore contextPrecision =
contextPrecisionEvaluator.evaluate(sample);
return new SampleEvaluationResult(
sample.getSampleId(),
sample.getQuestion(),
faithfulness,
answerRelevancy,
contextRecall,
contextPrecision
);
}
/**
* Evaluate a batch of samples and produce a dataset-level report
*/
public DatasetEvaluationReport evaluateDataset(List<EvaluationSample> samples) {
log.info("Starting batch evaluation: {} samples", samples.size());
List<SampleEvaluationResult> results = samples.stream()
.map(this::evaluate)
.toList();
// Aggregate the metrics
DoubleSummaryStatistics faithStats = results.stream()
.mapToDouble(r -> r.faithfulness().score())
.summaryStatistics();
DoubleSummaryStatistics relevancyStats = results.stream()
.mapToDouble(r -> r.answerRelevancy().score())
.summaryStatistics();
DoubleSummaryStatistics recallStats = results.stream()
.filter(r -> r.contextRecall().hasReference())
.mapToDouble(r -> r.contextRecall().score())
.summaryStatistics();
DoubleSummaryStatistics precisionStats = results.stream()
.mapToDouble(r -> r.contextPrecision().score())
.summaryStatistics();
AggregatedMetrics metrics = new AggregatedMetrics(
faithStats.getAverage(),
relevancyStats.getAverage(),
recallStats.getCount() > 0 ? recallStats.getAverage() : -1,
precisionStats.getAverage()
);
// Quality gate check
boolean passesGate = meetsQualityGate(metrics);
List<String> failedGates = getFailedGates(metrics);
// SLF4J placeholders don't support format specifiers; pre-format the values
log.info("Evaluation complete: faithfulness={}, relevancy={}, recall={}, precision={}, pass={}",
String.format("%.3f", metrics.avgFaithfulness()), String.format("%.3f", metrics.avgAnswerRelevancy()),
String.format("%.3f", metrics.avgContextRecall()), String.format("%.3f", metrics.avgContextPrecision()), passesGate);
return new DatasetEvaluationReport(results, metrics, passesGate, failedGates);
}
private boolean meetsQualityGate(AggregatedMetrics metrics) {
return metrics.avgFaithfulness() >= MIN_FAITHFULNESS &&
metrics.avgAnswerRelevancy() >= MIN_ANSWER_RELEVANCY &&
(metrics.avgContextRecall() < 0 ||
metrics.avgContextRecall() >= MIN_CONTEXT_RECALL) &&
metrics.avgContextPrecision() >= MIN_CONTEXT_PRECISION;
}
private List<String> getFailedGates(AggregatedMetrics metrics) {
List<String> failed = new ArrayList<>();
if (metrics.avgFaithfulness() < MIN_FAITHFULNESS) {
failed.add(String.format("Faithfulness=%.3f < %.2f (hallucination risk)",
metrics.avgFaithfulness(), MIN_FAITHFULNESS));
}
if (metrics.avgAnswerRelevancy() < MIN_ANSWER_RELEVANCY) {
failed.add(String.format("AnswerRelevancy=%.3f < %.2f (answers off topic)",
metrics.avgAnswerRelevancy(), MIN_ANSWER_RELEVANCY));
}
if (metrics.avgContextRecall() >= 0 &&
metrics.avgContextRecall() < MIN_CONTEXT_RECALL) {
failed.add(String.format("ContextRecall=%.3f < %.2f (retrieval missing information)",
metrics.avgContextRecall(), MIN_CONTEXT_RECALL));
}
if (metrics.avgContextPrecision() < MIN_CONTEXT_PRECISION) {
failed.add(String.format("ContextPrecision=%.3f < %.2f (too much retrieval noise)",
metrics.avgContextPrecision(), MIN_CONTEXT_PRECISION));
}
return failed;
}
public record AggregatedMetrics(
double avgFaithfulness, double avgAnswerRelevancy,
double avgContextRecall, double avgContextPrecision
) {}
public record SampleEvaluationResult(
String sampleId, String question,
FaithfulnessEvaluator.FaithfulnessScore faithfulness,
AnswerRelevancyEvaluator.AnswerRelevancyScore answerRelevancy,
ContextRecallEvaluator.ContextRecallScore contextRecall,
ContextPrecisionEvaluator.ContextPrecisionScore contextPrecision
) {}
public record DatasetEvaluationReport(
List<SampleEvaluationResult> sampleResults,
AggregatedMetrics metrics,
boolean passesQualityGate,
List<String> failedGates
) {}
}
CI/CD Integration
/**
* JUnit integration (runs in CI)
*/
@SpringBootTest
class RagQualityGateTest {
@Autowired
private RagEvaluationFramework evaluationFramework;
@Autowired
private YourRagService ragService; // the RAG service under test
@Test
@DisplayName("RAG system quality gate test")
void ragQualityGatePasses() {
// Load the test set
List<EvaluationSample> testSamples = loadTestDataset();
// Run the RAG queries to fill in answer and contexts
List<EvaluationSample> samplesWithAnswers = testSamples.stream()
.map(sample -> {
RagResponse response = ragService.query(sample.getQuestion());
return EvaluationSample.builder()
.sampleId(sample.getSampleId())
.question(sample.getQuestion())
.answer(response.getAnswer())
.contexts(response.getRetrievedContexts())
.referenceAnswer(sample.getReferenceAnswer())
.build();
})
.toList();
// Evaluate
RagEvaluationFramework.DatasetEvaluationReport report =
evaluationFramework.evaluateDataset(samplesWithAnswers);
// Assert
if (!report.passesQualityGate()) {
String failMessage = "RAG quality gate failed:\n" +
String.join("\n", report.failedGates());
Assertions.fail(failMessage);
}
}
private List<EvaluationSample> loadTestDataset() {
// Load the test set from a JSON file (typically 50-200 samples)
// ...
return List.of();
}
}
The core idea of RAGAS is to use an LLM to evaluate an LLM, which removes the need for large-scale human annotation. But for exactly that reason, the evaluation itself carries error: roughly a 5-15% misjudgment rate. In practice, if a metric moves by less than 5%, that is not necessarily a real quality change; confirm it with human spot checks.
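One practical consequence: a CI gate should not fail a build on a metric delta that sits inside the judge's noise band. A minimal sketch of that idea follows; `MetricDelta`, its method names, and the 0.05 band are all hypothetical, and the band should be calibrated against your own human spot-check data.

```java
// Hypothetical helper (not part of RAGAS): classify a metric delta between two
// evaluation runs, treating deltas inside a noise band as "no real change".
public class MetricDelta {
    public enum Verdict { IMPROVED, REGRESSED, WITHIN_NOISE }

    static Verdict compare(double baseline, double current, double noiseBand) {
        double delta = current - baseline;
        // Inside the band: flag for human review instead of gating the build
        if (Math.abs(delta) <= noiseBand) return Verdict.WITHIN_NOISE;
        return delta > 0 ? Verdict.IMPROVED : Verdict.REGRESSED;
    }

    public static void main(String[] args) {
        System.out.println(compare(0.82, 0.79, 0.05)); // WITHIN_NOISE: likely judge noise, not a regression
        System.out.println(compare(0.82, 0.70, 0.05)); // REGRESSED: beyond the band, investigate
    }
}
```

A refinement worth considering is evaluating each sample several times and comparing means, which shrinks the band you need.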
