Post #1715: Golden Path Testing - Building a Benchmark Suite for AI Systems
One afternoon the team was chewing on a question: how do we know whether today's AI service is better or worse than last month's? Nobody could say. We had plenty of metrics: response time, error rate, token consumption. But those are infrastructure metrics, not quality metrics. Whether the AI's answers are actually good, whether a prompt optimization helped, whether a model upgrade had side effects: all of that ran on gut feeling.
Golden Path testing exists to replace that gut feeling.
1. What Is Golden Path Testing
The idea of Golden Path testing comes from the "happy path" of software testing, but it carries more weight: it is a carefully curated set of test cases representing your most important business scenarios, and the pass rate and quality scores of that set become your system's quality baseline.
It is called "Golden" because these cases are your gold standard: the things your system absolutely must do well. You run them before every release, and every run is compared against historical data.
How it differs from ordinary regression testing:
| Dimension | Ordinary regression tests | Golden Path tests |
|---|---|---|
| Goal | Verify nothing regressed | Establish a quality baseline |
| Granularity | Feature level | Business-value level |
| Volume | Can be large | Curated, typically 50-200 cases |
| Evaluation | Pass/fail | Multi-dimensional quality scores |
| Trend analysis | Not a concern | The core value |
2. Design Principles for Golden Path Cases
Principle 1: cover key user journeys, not code paths
The wrong way: pick one case per API endpoint.
The right way: identify "the 10 things users most often do with our AI features" and pick a few representative cases for each.
Principle 2: include boundary and exceptional scenarios
Not just the happy path; also include:
- Extremely short inputs (1-10 characters)
- Extremely long inputs (near the max token limit)
- Mixed-language inputs
- Inputs with special characters and emoji
- Highly ambiguous inputs
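To make the list concrete, here is a small sketch of what such boundary inputs can look like as test data. The contents, helper names, and the token-to-character heuristic are all illustrative assumptions, not from a real suite:

```java
import java.util.List;

// Illustrative boundary-case inputs for Golden Path cases.
public class EdgeCaseInputs {

    // Roughly approximate "near the token limit" by repeating a sentence.
    // Heuristic only: ~4 English characters per token.
    static String nearLimitInput(int approxTokenLimit) {
        String sentence = "The battery drains quickly under heavy use. ";
        StringBuilder sb = new StringBuilder();
        while (sb.length() < approxTokenLimit * 4) {
            sb.append(sentence);
        }
        return sb.toString();
    }

    static List<String> edgeCases() {
        return List.of(
            "ok",                               // extremely short (1-10 characters)
            nearLimitInput(4000),               // near the max-token limit
            "The camera is great, 但是电池不行",  // mixed-language
            "Best phone ever!!! 🤯🔥💯",          // special characters / emoji
            "It's fine, I guess... or is it?"   // high ambiguity
        );
    }
}
```

Generating the long input programmatically keeps the YAML case library readable; only the short, human-authored inputs need to live in the suite file.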
Principle 3: every case has explicit quality expectations
Each Golden Path case defines not just the input but also:
- A minimum acceptable quality score
- Key output constraints (structure, content)
- A reference output (as a comparison baseline)
3. Benchmark Framework Architecture
The pipeline is straightforward: the case library feeds a benchmark engine, which calls the LLM service, runs a set of evaluators over each output, persists the results to a history store, and hands them to a trend analyzer.
4. Core Implementation
// Benchmark test case definition
@Data
@Builder
public class GoldenPathTestCase {
    private String id;
    private String category; // case category
    private int priority; // 1-5; lower means more important
    private String description;
    // Input
    private String systemPrompt;
    private String userInput;
    private Map<String, Object> context;
    // Quality requirements
    private double minimumQualityScore; // minimum acceptable quality score, 0-1
    private List<QualityDimension> evaluationDimensions;
    private String referenceOutput; // reference output (optional)
    // Constraints
    private List<OutputConstraint> constraints;
}
// Quality evaluation dimensions
public enum QualityDimension {
    ACCURACY,     // factual accuracy
    COMPLETENESS, // completeness
    FORMAT,       // format correctness
    RELEVANCE,    // relevance
    SAFETY,       // safety (no harmful content)
    CONSISTENCY   // consistency
}
// Evaluation result
@Data
@Builder
public class EvaluationResult {
    private String testCaseId;
    private double overallScore; // overall quality score, 0-1
    private Map<QualityDimension, Double> dimensionScores;
    private boolean passedConstraints; // whether the case passed
    private List<String> failedConstraints;
    private String actualOutput;
    private long latencyMs;
    private int tokenCount;
    private String runId; // ID of this run
    private Instant timestamp;
}
5. Multi-Dimensional Evaluator Implementation
// Structure evaluator: checks whether the output format meets the requirements
@Component
public class StructureEvaluator implements Evaluator {
    private final ObjectMapper mapper = new ObjectMapper();

    // Structural members of the ConstraintType enum (the original listing filtered
    // on a separate STRUCTURE category, which doesn't exist on ConstraintType)
    private static final Set<ConstraintType> STRUCTURAL = EnumSet.of(
            ConstraintType.VALID_JSON, ConstraintType.CONTAINS_FIELD,
            ConstraintType.FIELD_ENUM, ConstraintType.NO_CODE_BLOCK);

    @Override
    public DimensionScore evaluate(String output, GoldenPathTestCase testCase) {
        List<OutputConstraint> structureConstraints = testCase.getConstraints().stream()
                .filter(c -> STRUCTURAL.contains(c.getType()))
                .collect(Collectors.toList());
        if (structureConstraints.isEmpty()) {
            return DimensionScore.notApplicable(QualityDimension.FORMAT);
        }
        int passed = 0;
        List<String> issues = new ArrayList<>();
        for (OutputConstraint constraint : structureConstraints) {
            boolean ok = checkConstraint(output, constraint);
            if (ok) passed++;
            else issues.add(constraint.getDescription());
        }
        double score = (double) passed / structureConstraints.size();
        return DimensionScore.of(QualityDimension.FORMAT, score, issues);
    }

    private boolean checkConstraint(String output, OutputConstraint constraint) {
        return switch (constraint.getType()) {
            case VALID_JSON -> isValidJson(output);
            case CONTAINS_FIELD -> containsField(output, constraint.getParam());
            case FIELD_ENUM -> fieldMatchesEnum(output, constraint);
            case NO_CODE_BLOCK -> !output.contains("```");
            default -> true;
        };
    }

    // Helpers (not shown in the original listing); getField()/getValues() are
    // assumed accessors on OutputConstraint, mirroring the YAML schema below.
    private boolean isValidJson(String output) {
        try {
            mapper.readTree(output);
            return true;
        } catch (JsonProcessingException e) {
            return false;
        }
    }

    private boolean containsField(String output, String field) {
        try {
            return mapper.readTree(output).has(field);
        } catch (JsonProcessingException e) {
            return false;
        }
    }

    private boolean fieldMatchesEnum(String output, OutputConstraint constraint) {
        try {
            JsonNode node = mapper.readTree(output).get(constraint.getField());
            return node != null && constraint.getValues().contains(node.asText());
        } catch (JsonProcessingException e) {
            return false;
        }
    }
}
// Reference-comparison evaluator (LLM-as-Judge)
@Component
public class ReferenceEvaluator implements Evaluator {
    private final LlmClient judgeClient;

    private static final String JUDGE_PROMPT = """
            You are an objective evaluator of AI output quality.
            Compare the two outputs below and assess the quality of the "actual output" relative to the "reference output".
            Reference output (the ideal answer):
            %s
            Actual output (under evaluation):
            %s
            Score each of the following dimensions from 0 to 10 and briefly justify your scores:
            - Accuracy: is the information correct?
            - Completeness: are the key points covered?
            - Expression: is the language clear and precise?
            Return strictly the following JSON, with nothing else:
            {
              "accuracy": <0-10>,
              "completeness": <0-10>,
              "expression": <0-10>,
              "reasoning": "<brief justification>"
            }
            """;

    @Override
    public DimensionScore evaluate(String output, GoldenPathTestCase testCase) {
        if (testCase.getReferenceOutput() == null || testCase.getReferenceOutput().isBlank()) {
            return DimensionScore.notApplicable(QualityDimension.ACCURACY);
        }
        String judgePrompt = String.format(
                JUDGE_PROMPT,
                testCase.getReferenceOutput(),
                output
        );
        try {
            String judgeResponse = judgeClient.complete(
                    "You are a professional evaluator of AI output quality. Output JSON only.",
                    judgePrompt
            );
            JudgeResult judgeResult = parseJudgeResult(judgeResponse);
            double normalizedScore = (judgeResult.getAccuracy()
                    + judgeResult.getCompleteness()
                    + judgeResult.getExpression()) / 30.0; // normalize to 0-1
            return DimensionScore.of(
                    QualityDimension.ACCURACY,
                    normalizedScore,
                    List.of(judgeResult.getReasoning())
            );
        } catch (Exception e) {
            // On judge failure, fall back to a neutral score rather than failing the case
            return DimensionScore.of(QualityDimension.ACCURACY, 0.5,
                    List.of("Evaluation failed: " + e.getMessage()));
        }
    }
}
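The parseJudgeResult call above is not shown in the listing. As a sketch, a dependency-free parser can pull the three numeric fields out of the judge's reply with a regex, which also tolerates judges that wrap their JSON in Markdown fences (a production version would strip the fences and parse with Jackson; the class and method names here are our own fill-in):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal, dependency-free extraction of the judge's scores.
public class JudgeResultParser {
    // Matches e.g.  "accuracy": 8  or  "completeness": 7.5
    private static final Pattern FIELD =
            Pattern.compile("\"(accuracy|completeness|expression)\"\\s*:\\s*(\\d+(?:\\.\\d+)?)");

    /** Returns [accuracy, completeness, expression]; throws if any field is missing. */
    public static double[] parseScores(String judgeResponse) {
        double[] scores = new double[3];
        boolean[] seen = new boolean[3];
        Matcher m = FIELD.matcher(judgeResponse);
        while (m.find()) {
            int idx = switch (m.group(1)) {
                case "accuracy" -> 0;
                case "completeness" -> 1;
                default -> 2;
            };
            scores[idx] = Double.parseDouble(m.group(2));
            seen[idx] = true;
        }
        if (!seen[0] || !seen[1] || !seen[2]) {
            throw new IllegalArgumentException("Judge response missing score fields");
        }
        return scores;
    }
}
```

Returning the three raw scores keeps the sketch decoupled from the JudgeResult type used in the listing above.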
// Safety evaluator
@Component
public class SafetyEvaluator implements Evaluator {
    private final List<Pattern> harmfulPatterns = List.of(
            Pattern.compile("(?i)how (to|do I) (make|build).*(bomb|weapon|drugs)"),
            Pattern.compile("(?i)(scam|fraud|money laundering).*(tutorial|method|steps)"),
            Pattern.compile("(?i)leaked personal information|private data")
    );

    @Override
    public DimensionScore evaluate(String output, GoldenPathTestCase testCase) {
        List<String> issues = harmfulPatterns.stream()
                .filter(p -> p.matcher(output).find())
                .map(p -> "Potentially harmful content matched pattern: " + p.pattern())
                .collect(Collectors.toList());
        double score = issues.isEmpty() ? 1.0 : 0.0;
        return DimensionScore.of(QualityDimension.SAFETY, score, issues);
    }
}
6. The Benchmark Engine and Historical Comparison
@Service
public class BenchmarkEngine {
    private final List<Evaluator> evaluators;
    private final BenchmarkHistoryStore historyStore;
    private final LlmService llmService;

    public BenchmarkRunResult runSuite(String suiteName,
                                       List<GoldenPathTestCase> testCases) {
        String runId = UUID.randomUUID().toString();
        Instant startTime = Instant.now();
        // Sort by priority; the most important cases run first
        List<GoldenPathTestCase> sortedCases = testCases.stream()
                .sorted(Comparator.comparingInt(GoldenPathTestCase::getPriority))
                .collect(Collectors.toList());
        List<EvaluationResult> results = sortedCases.parallelStream()
                .map(tc -> evaluateTestCase(tc, runId))
                .collect(Collectors.toList());
        BenchmarkRunResult runResult = buildRunResult(runId, suiteName, results, startTime);
        // Compare against the previous baseline run
        // (look it up BEFORE saving, or we would compare the run against itself)
        Optional<BenchmarkRunResult> previousRun = historyStore.findLastRun(suiteName);
        previousRun.ifPresent(prev -> runResult.setComparison(compare(runResult, prev)));
        // Persist this run
        historyStore.save(runResult);
        return runResult;
    }
    private EvaluationResult evaluateTestCase(GoldenPathTestCase testCase, String runId) {
        long startMs = System.currentTimeMillis();
        try {
            LlmResponse response = llmService.complete(
                    testCase.getSystemPrompt(),
                    testCase.getUserInput()
            );
            String output = response.getContent();
            long latency = System.currentTimeMillis() - startMs;
            // Run every evaluator
            Map<QualityDimension, Double> dimensionScores = new HashMap<>();
            List<String> allIssues = new ArrayList<>();
            for (Evaluator evaluator : evaluators) {
                DimensionScore ds = evaluator.evaluate(output, testCase);
                if (ds.isApplicable()) {
                    dimensionScores.put(ds.getDimension(), ds.getScore());
                    allIssues.addAll(ds.getIssues());
                }
            }
            // Weighted overall score
            double overallScore = calculateWeightedScore(
                    dimensionScores,
                    testCase.getEvaluationDimensions()
            );
            boolean passed = overallScore >= testCase.getMinimumQualityScore();
            return EvaluationResult.builder()
                    .testCaseId(testCase.getId())
                    .overallScore(overallScore)
                    .dimensionScores(dimensionScores)
                    .passedConstraints(passed)
                    .actualOutput(output)
                    .latencyMs(latency)
                    .tokenCount(response.getTokenCount())
                    .runId(runId)
                    .timestamp(Instant.now())
                    .build();
        } catch (Exception e) {
            return EvaluationResult.builder()
                    .testCaseId(testCase.getId())
                    .overallScore(0.0)
                    .passedConstraints(false)
                    .failedConstraints(List.of("Execution failed: " + e.getMessage()))
                    .latencyMs(System.currentTimeMillis() - startMs)
                    .runId(runId)
                    .timestamp(Instant.now())
                    .build();
        }
    }
    // Compute the delta against the previous run
    private BenchmarkComparison compare(BenchmarkRunResult current,
                                        BenchmarkRunResult previous) {
        double scoreDelta = current.getAverageScore() - previous.getAverageScore();
        double passRateDelta = current.getPassRate() - previous.getPassRate();
        List<RegressionItem> regressions = new ArrayList<>();
        for (EvaluationResult cr : current.getResults()) {
            previous.getResults().stream()
                    .filter(pr -> pr.getTestCaseId().equals(cr.getTestCaseId()))
                    .findFirst()
                    .ifPresent(pr -> {
                        double delta = cr.getOverallScore() - pr.getOverallScore();
                        if (delta < -0.1) { // a drop of more than 0.1 counts as a regression
                            regressions.add(RegressionItem.builder()
                                    .testCaseId(cr.getTestCaseId())
                                    .previousScore(pr.getOverallScore())
                                    .currentScore(cr.getOverallScore())
                                    .delta(delta)
                                    .build());
                        }
                    });
        }
        return BenchmarkComparison.builder()
                .scoreDelta(scoreDelta)
                .passRateDelta(passRateDelta)
                .regressions(regressions)
                .improved(scoreDelta > 0.05)  // a gain of more than 0.05 is a real improvement
                .degraded(scoreDelta < -0.05) // a drop of more than 0.05 is degradation
                .build();
    }
}
7. Designing the Golden Path Case Library
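The engine in the previous section delegates to calculateWeightedScore without showing it. Here is a standalone sketch of one plausible aggregation; the weights and the string-keyed map are illustrative assumptions (the real code would key on QualityDimension and tune weights per product):

```java
import java.util.Map;

// Sketch of the weighted aggregation used by the engine: each dimension present in
// the result contributes with a per-dimension weight; the sum is renormalized so
// missing dimensions don't drag the score down.
public class WeightedScore {
    // Illustrative weights only.
    static final Map<String, Double> WEIGHTS = Map.of(
            "ACCURACY", 0.4, "COMPLETENESS", 0.2, "FORMAT", 0.15,
            "RELEVANCE", 0.15, "SAFETY", 0.05, "CONSISTENCY", 0.05
    );

    /** Weighted average of the dimension scores present; 0 when empty. */
    static double calculate(Map<String, Double> dimensionScores) {
        double weighted = 0.0, totalWeight = 0.0;
        for (Map.Entry<String, Double> e : dimensionScores.entrySet()) {
            double w = WEIGHTS.getOrDefault(e.getKey(), 0.1); // default for unknown dims
            weighted += w * e.getValue();
            totalWeight += w;
        }
        return totalWeight == 0 ? 0.0 : weighted / totalWeight;
    }
}
```

Renormalizing by the total weight of the dimensions actually evaluated means a case that only checks FORMAT and ACCURACY is still scored on a 0-1 scale.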
How do we store these cases in a real project? Our recommendation: YAML files, committed to the code repository:
# golden-path-suite.yaml
suite:
  id: "sentiment-analysis-golden-path"
  name: "Sentiment analysis Golden Path suite"
  version: "1.3.0"
test_cases:
  - id: "GP-SENT-001"
    category: "product_review"
    priority: 1
    description: "Typical positive product review - core scenario"
    system_prompt: "You are a sentiment analysis expert. Analyze the sentiment of the following text and return the result as JSON."
    user_input: "This phone really exceeded my expectations. The camera is stunning and the battery life is great. Worth buying!"
    minimum_quality_score: 0.85
    evaluation_dimensions:
      - FORMAT
      - ACCURACY
      - SAFETY
    reference_output: |
      {
        "sentiment": "positive",
        "score": 0.92,
        "keywords": ["exceeded my expectations", "stunning", "worth buying"],
        "reasoning": "The user employs strongly positive vocabulary and expresses clear satisfaction"
      }
    constraints:
      - type: "VALID_JSON"
        description: "Output must be valid JSON"
      - type: "CONTAINS_FIELD"
        param: "sentiment"
        description: "Must contain a sentiment field"
      - type: "FIELD_ENUM"
        field: "sentiment"
        values: ["positive", "negative", "neutral"]
        description: "sentiment must be one of the predefined enum values"
  - id: "GP-SENT-002"
    category: "edge_case"
    priority: 2
    description: "Extremely short text, 1-5 characters"
    user_input: "Not bad"
    minimum_quality_score: 0.7
    evaluation_dimensions:
      - FORMAT
    constraints:
      - type: "VALID_JSON"
      - type: "CONTAINS_FIELD"
        param: "sentiment"
  - id: "GP-SENT-003"
    category: "edge_case"
    priority: 2
    description: "Mixed-sentiment text"
    user_input: "Product quality is good, but shipping was way too slow. Overall just okay."
    minimum_quality_score: 0.75
    evaluation_dimensions:
      - FORMAT
      - ACCURACY
    reference_output: |
      {
        "sentiment": "neutral",
        "score": 0.5,
        "reasoning": "Contains both positive and negative elements; overall sentiment is neutral"
      }
Loading and running it:
@Test
@Tag("golden-path")
void runGoldenPathSuite() throws Exception {
    List<GoldenPathTestCase> testCases = GoldenPathLoader.loadFromYaml(
            "classpath:golden-path/sentiment-analysis-golden-path.yaml"
    );
    BenchmarkRunResult result = benchmarkEngine.runSuite(
            "sentiment-analysis-golden-path",
            testCases
    );
    // Print the report
    System.out.println(generateReport(result));
    // CI gate: the overall pass rate must not drop below 85%
    assertThat(result.getPassRate())
            .as("Golden Path suite pass rate")
            .isGreaterThanOrEqualTo(0.85);
    // No severe regressions allowed (any case whose score dropped by more than 20%)
    if (result.getComparison() != null) {
        List<RegressionItem> severeRegressions = result.getComparison()
                .getRegressions().stream()
                .filter(r -> r.getDelta() < -0.20)
                .collect(Collectors.toList());
        assertThat(severeRegressions)
                .as("Severely regressed cases (score drop > 20%)")
                .isEmpty();
    }
}
8. Trend Analysis and Alerting
The quality trend is where Golden Path delivers its core value:
@Service
public class QualityTrendAnalyzer {
    private final BenchmarkHistoryStore historyStore; // field was missing from the original listing

    public TrendReport analyzeMonthlyTrend(String suiteName) {
        List<BenchmarkRunResult> last30Days = historyStore.findByDateRange(
                suiteName,
                Instant.now().minus(30, ChronoUnit.DAYS),
                Instant.now()
        );
        // Average score per day
        Map<LocalDate, Double> dailyScores = last30Days.stream()
                .collect(Collectors.groupingBy(
                        r -> r.getStartTime().atZone(ZoneId.systemDefault()).toLocalDate(),
                        Collectors.averagingDouble(BenchmarkRunResult::getAverageScore)
                ));
        // Linear regression over the daily scores
        double trend = calculateLinearTrend(dailyScores);
        // Flag anomalies (a single-day score drop of more than 15%)
        List<AnomalyEvent> anomalies = detectAnomalies(dailyScores, 0.15);
        return TrendReport.builder()
                .suiteName(suiteName)
                .dailyScores(dailyScores)
                .overallTrend(trend > 0.01 ? "IMPROVING" : trend < -0.01 ? "DEGRADING" : "STABLE")
                .trendValue(trend)
                .anomalies(anomalies)
                .recommendation(generateRecommendation(trend, anomalies))
                .build();
    }
}
9. Pitfalls: Lessons We Learned
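The calculateLinearTrend helper in the analyzer above is left out of the listing. Here is a dependency-free least-squares sketch; the class name and signature are our own fill-in:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.TreeMap;

// Least-squares slope over daily scores: a positive slope means quality is trending
// up. Day indices are taken relative to the earliest date in the map.
public class LinearTrend {
    static double slope(Map<LocalDate, Double> dailyScores) {
        if (dailyScores.size() < 2) return 0.0;
        TreeMap<LocalDate, Double> sorted = new TreeMap<>(dailyScores);
        LocalDate first = sorted.firstKey();
        double n = sorted.size(), sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (Map.Entry<LocalDate, Double> e : sorted.entrySet()) {
            double x = ChronoUnit.DAYS.between(first, e.getKey());
            double y = e.getValue();
            sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
        }
        double denom = n * sumXX - sumX * sumX;
        return denom == 0 ? 0.0 : (n * sumXY - sumX * sumY) / denom;
    }
}
```

The returned value is "score change per day", which is what the IMPROVING/DEGRADING thresholds above compare against.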
Pitfall 1: too many cases, a runaway suite
At first we assumed more cases meant better coverage. The result was a two-hour run that nobody wanted to wait for, and Golden Path became decoration. Once we capped the suite at 100 cases and kept a run under 20 minutes, it finally got used.
Pitfall 2: reference output rot
A reference output defined six months ago was the best answer at the time, but the business requirements moved on and it no longer fits. Reference outputs need periodic review, at least once a quarter.
Pitfall 3: score inflation
As the prompts kept improving, the pass rate climbed, but the gate threshold was never raised, so the gate became meaningless. The threshold has to move up as quality improves.
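One way to keep the gate honest is to derive the threshold from recent history so it ratchets up with quality. A sketch, not the exact policy we used; the margin and floor are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: suggest a CI gate a small margin below the median of recent pass rates,
// never dropping below the original floor. A real policy would also cap how fast
// the gate is allowed to rise, to avoid locking in a lucky run.
public class AdaptiveGate {
    static double suggestGate(List<Double> recentPassRates, double floor) {
        if (recentPassRates.isEmpty()) return floor;
        List<Double> sorted = new ArrayList<>(recentPassRates);
        Collections.sort(sorted);
        double median = sorted.get(sorted.size() / 2);
        return Math.max(floor, median - 0.05); // 0.05 margin is illustrative
    }
}
```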
Pitfall 4: LLM-as-Judge bias
Using an LLM as the evaluator introduces its own bias: it may favor answers that resemble its own style. Evaluation results need periodic human spot checks to calibrate the judge's reliability.
Summary
The value of Golden Path testing is not in finding specific bugs; it is in establishing a quantifiable quality baseline. Once you can say "this week's sentiment analysis quality is 3.5% higher than last week's and 18% higher than at launch", the team's perception of quality goes from fuzzy to precise.
That precision is the foundation for every improvement that follows.
