The Complete Guide to Testing Spring AI: From Unit Tests to Production Load Tests
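A hard lesson: "all tests passed, crashed on launch"
In February 2025, an engineer named Zhao Lei posted a message in my 知识星球 (Knowledge Planet) community that stuck with me.
He wrote: "Lao Zhang, my AI customer-service system passed 100% of its unit tests and all of its integration tests, then crashed on its first day in production. Why?"
I asked him a few questions.
"In your unit tests, does ChatClient actually call OpenAI?"
"No, it's all mocked."
"And your integration tests?"
"Mocked too. I didn't want to pay for API calls."
"So what did you actually test?"
After a pause he said: "...My business logic, but not the AI part."
His suite had really tested only one assumption: "if the AI responds the way I expect, the business logic is correct." But real AI responses differ from mocked ones: the format differs, the timing differs, and now and then the content is something nobody anticipated.
After launch, the first real response came back as JSON wrapped in a Markdown code fence tagged json. His parsing code crashed, because the mock in his tests returned bare JSON.
Testing an AI application is fundamentally different from testing an ordinary web application. This article is the most complete guide to testing Spring AI that I have seen.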
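The crash described above comes down to one missing normalization step before parsing. Here is a minimal sketch of that step, assuming the reply is either bare JSON or JSON wrapped in a Markdown code fence. `CodeFenceStripper` and `strip` are hypothetical names, not part of Spring AI, and Zhao Lei's actual parser is not shown in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: unwraps a model reply that may be wrapped in a Markdown code fence. */
final class CodeFenceStripper {

    // Opening fence, optional language tag, then the body captured lazily up to the closing fence.
    private static final Pattern FENCED =
            Pattern.compile("```[a-zA-Z]*\\s*(.*?)\\s*```", Pattern.DOTALL);

    private CodeFenceStripper() {}

    /** Returns the fenced body if a fence is present, otherwise the trimmed input. */
    static String strip(String raw) {
        if (raw == null) {
            return "";
        }
        Matcher m = FENCED.matcher(raw);
        return m.find() ? m.group(1) : raw.trim();
    }
}
```

Running every model reply through a step like this before handing it to the JSON parser means the fenced and the bare form take the same code path, which is exactly the case the mocked tests never exercised.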
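The short version (TL;DR)
| Test level | Tool | Calls the real AI? | Cost | Speed | What it covers |
|---|---|---|---|---|---|
| Unit tests | Mockito | No | Zero | <1s | Business logic |
| Integration tests | WireMock | No (record and replay) | Very low | <5s | API contract |
| System tests | TestContainers + real AI | Yes | Medium | Minutes | End-to-end flow |
| Load tests | Gatling | Yes, or mocked | High | Hours | Performance baseline |
| Chaos tests | Chaos Monkey | Simulated failures | Low | Minutes | Fault tolerance |
Core principles of the testing strategy:
- Unit tests: test the business logic, not the AI
- Integration tests: test the API contract (using recorded real responses)
- System tests: call the real AI, but only on critical paths (to control cost)
- Semantic assertions: use LLM-as-Judge to evaluate the quality of AI output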
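What makes testing AI applications hard
Challenge 1: non-deterministic output
Ordinary code: add(1, 2) always returns 3. AI code: the same input can produce a different output on every call, so a traditional assertion such as assertEquals("expected output", response) no longer works.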
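One way around exact-match assertions is to assert on structure rather than wording. A minimal sketch, assuming the response is the contract-analysis JSON used in this article's examples. `StructuralCheck` is a hypothetical helper, and the set of allowed levels is inferred from the values that appear in those examples (LOW, MEDIUM, HIGH, UNKNOWN).

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: checks a property of the response instead of its exact text. */
final class StructuralCheck {

    // Assumption: these are the only risk levels the application accepts.
    private static final Set<String> ALLOWED_LEVELS = Set.of("LOW", "MEDIUM", "HIGH", "UNKNOWN");

    // Finds "riskLevel": "<VALUE>" regardless of surrounding whitespace.
    private static final Pattern RISK_LEVEL =
            Pattern.compile("\"riskLevel\"\\s*:\\s*\"([A-Z]+)\"");

    private StructuralCheck() {}

    /** True when the response carries a riskLevel field with one of the allowed values. */
    static boolean hasValidRiskLevel(String response) {
        if (response == null) {
            return false;
        }
        Matcher m = RISK_LEVEL.matcher(response);
        return m.find() && ALLOWED_LEVELS.contains(m.group(1));
    }
}
```

A check like this stays stable when the model rephrases its summary, and still fails when the field is missing or carries a value the business logic cannot handle.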
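Challenge 2: API cost
Each call to OpenAI GPT-4o costs roughly $0.002 to $0.01. If CI/CD runs the suite 100 times a day and each run makes dozens of real calls, that adds up to hundreds of dollars a month.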
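The arithmetic behind that estimate is worth writing down, because the per-call price alone looks harmless. A back-of-the-envelope sketch follows. The number of AI calls per CI run is my assumption for illustration (the figures above give only the per-call price and the runs per day), and `CiCostEstimate` is a hypothetical name.

```java
/** Hypothetical helper: estimates the monthly API bill of a test suite that calls a real model. */
final class CiCostEstimate {

    private CiCostEstimate() {}

    /** Monthly cost = runs per day x calls per run x price per call x days. */
    static double monthlyCostUsd(int runsPerDay, int callsPerRun, double usdPerCall, int days) {
        return runsPerDay * (double) callsPerRun * usdPerCall * days;
    }
}
```

With 100 runs a day, an assumed 20 AI calls per run, and $0.005 per call, `monthlyCostUsd(100, 20, 0.005, 30)` comes to about 300 USD a month. A single call per run would cost only about 15 USD, so it is the number of calls per run that drives the bill.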
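Challenge 3: slow feedback
An AI API call typically takes 1 to 10 seconds. If every test calls the real AI, the suite can take tens of minutes to run.
Challenge 4: prompts are code
A small change to a prompt can cause a large drop in output quality, yet an ordinary diff tool shows only that the text changed, not what the change did to the output.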
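Since a text diff cannot say whether a prompt change matters, one cheap guard is to fingerprint each prompt and re-run the quality tests whenever the fingerprint changes. This is a sketch of that idea under my own assumptions, not something Spring AI provides. `PromptFingerprint` is a hypothetical name, and where the last known fingerprint is stored is left to the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical helper: a stable fingerprint for a prompt template. */
final class PromptFingerprint {

    private PromptFingerprint() {}

    /** Returns the SHA-256 of the prompt text as 64 lowercase hex characters. */
    static String of(String promptTemplate) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Fixed charset so the fingerprint is the same on every machine.
            byte[] hash = md.digest(promptTemplate.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A test can compare the current fingerprint with the one recorded the last time the prompt passed its quality evaluation. A mismatch does not mean the prompt got worse, only that it has not been evaluated since it changed.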
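Test level 1: unit tests and the right way to mock ChatClient
1.1 How not to mock (anti-example)
// Wrong: over-mocked, so nothing meaningful is tested
@Test
void testAskQuestion_WRONG() {
    when(chatClient.prompt().user("question").call().content())
            .thenReturn("perfect answer");
    String result = aiService.ask("question");
    assertEquals("perfect answer", result); // What does this test actually prove?
}
1.2 A sound mocking strategy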
package com.laozhang.test.unit;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.springframework.ai.chat.client.ChatClient;
import static org.assertj.core.api.Assertions.*;
import static org.mockito.ArgumentMatchers.*;
import static org.mockito.Mockito.*;
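/**
* Example unit test for an AI-backed service.
* Principle: test only the business logic and mock the AI's output.
* Cover: normal response handling, fallback on malformed output, empty content, business-rule validation.
*/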
@ExtendWith(MockitoExtension.class)
class ContractRiskAnalysisServiceTest {
@Mock private ChatClient chatClient;
@Mock private ChatClient.ChatClientRequestSpec requestSpec;
@Mock private ChatClient.CallResponseSpec callResponseSpec;
// Collaborator required by the service constructor; RiskRuleRepository is assumed to be a project class.
@Mock private RiskRuleRepository riskRuleRepository;
private ContractRiskAnalysisService analysisService;
@BeforeEach
void setUp() {
analysisService = new ContractRiskAnalysisService(chatClient, riskRuleRepository);
}
/**
* Case: the AI returns well-formed JSON and it is parsed correctly.
*/
@Test
void testAnalyzeContract_ValidJsonResponse() {
String mockAiResponse = """
{
"riskLevel": "HIGH",
"riskItems": [
{
"clause": "第5条违约金条款",
"risk": "违约金比例超出法定上限",
"suggestion": "建议将违约金比例降至20%以内"
}
],
"summary": "本合同存在1处高风险条款,建议修改后再签署"
}
""";
setupChatClientMock(mockAiResponse);
ContractAnalysisResult result = analysisService.analyze("合同文本内容...");
assertThat(result).isNotNull();
assertThat(result.getRiskLevel()).isEqualTo(RiskLevel.HIGH);
assertThat(result.getRiskItems()).hasSize(1);
assertThat(result.getSummary()).contains("高风险条款");
}
/**
* Case: the AI returns JSON wrapped in a code fence (very common with real models!).
* This is the root cause of Zhao Lei's production crash.
*/
@Test
void testAnalyzeContract_JsonWrappedInCodeBlock() {
String mockAiResponse = """
```json
{
"riskLevel": "MEDIUM",
"riskItems": [],
"summary": "未发现明显风险"
}
```
""";
setupChatClientMock(mockAiResponse);
ContractAnalysisResult result = analysisService.analyze("合同文本...");
assertThat(result).isNotNull();
assertThat(result.getRiskLevel()).isEqualTo(RiskLevel.MEDIUM);
}
/**
* Case: the AI returns empty content, and the service falls back to a default result.
*/
@Test
void testAnalyzeContract_EmptyResponse_FallbackToDefault() {
setupChatClientMock("");
ContractAnalysisResult result = analysisService.analyze("合同文本...");
assertThat(result).isNotNull();
assertThat(result.getRiskLevel()).isEqualTo(RiskLevel.UNKNOWN);
assertThat(result.getSummary()).contains("分析失败");
}
/**
* Case: the AI returns invalid JSON, and the service degrades gracefully instead of throwing.
*/
@Test
void testAnalyzeContract_InvalidJson_GracefulDegradation() {
setupChatClientMock("这不是一个JSON格式的响应,AI说了一堆废话");
assertThatNoException().isThrownBy(
() -> analysisService.analyze("合同文本..."));
ContractAnalysisResult result = analysisService.analyze("合同文本...");
assertThat(result.getRiskLevel()).isEqualTo(RiskLevel.UNKNOWN);
}
/**
* Case: empty input must raise a business exception without calling the AI.
*/
@Test
void testAnalyzeContract_EmptyInput_ThrowsException() {
assertThatThrownBy(() -> analysisService.analyze(""))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("合同内容不能为空");
verifyNoInteractions(chatClient);
}
/**
* Case: a contract longer than the maximum length is split into chunks.
*/
@Test
void testAnalyzeContract_LongContract_SplitIntoChunks() {
String longContract = "A".repeat(100_000);
setupChatClientMock("{\"riskLevel\":\"LOW\",\"riskItems\":[],\"summary\":\"无风险\"}");
analysisService.analyze(longContract);
verify(chatClient, atLeast(2)).prompt();
}
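private void setupChatClientMock(String responseContent) {
// lenient(): not every test touches every stub, and MockitoExtension's strict stubs
// would otherwise fail the test with UnnecessaryStubbingException.
lenient().when(chatClient.prompt()).thenReturn(requestSpec);
// Stub both user(...) overloads, since the service may pass a String or a Consumer.
lenient().when(requestSpec.user(anyString())).thenReturn(requestSpec);
lenient().when(requestSpec.user(any(java.util.function.Consumer.class))).thenReturn(requestSpec);
lenient().when(requestSpec.system(anyString())).thenReturn(requestSpec);
// anyList() selects the List overload; a bare any() is ambiguous between the advisors(...) overloads.
lenient().when(requestSpec.advisors(anyList())).thenReturn(requestSpec);
lenient().when(requestSpec.call()).thenReturn(callResponseSpec);
lenient().when(callResponseSpec.content()).thenReturn(responseContent);
}
}
Test level 2: integration tests with WireMock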
package com.laozhang.test.integration;
import com.github.tomakehurst.wiremock.WireMockServer;
import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import static com.github.tomakehurst.wiremock.client.WireMock.*;
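/**
* Base class for AI integration tests.
* WireMock intercepts calls to the OpenAI API and returns recorded real responses.
* No API fees are incurred, yet the HTTP communication under test is real.
*/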
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public abstract class BaseAiIntegrationTest {
static WireMockServer wireMockServer;
@BeforeAll
static void startWireMock() {
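wireMockServer = new WireMockServer(
WireMockConfiguration.wireMockConfig()
// A free port avoids clashes on shared CI agents; the base URL is built from wireMockServer.port() anyway.
.dynamicPort()
.withRootDirectory("src/test/resources/wiremock")
);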
wireMockServer.start();
}
@AfterAll
static void stopWireMock() {
wireMockServer.stop();
}
@DynamicPropertySource
static void configureOpenAiBaseUrl(DynamicPropertyRegistry registry) {
registry.add("spring.ai.openai.base-url",
() -> "http://localhost:" + wireMockServer.port());
registry.add("spring.ai.openai.api-key", () -> "test-key-wiremock");
}
protected void stubContractAnalysisSuccess() {
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.withHeader("Content-Type", containing("application/json"))
.withRequestBody(containing("合同"))
.willReturn(aResponse()
.withStatus(200)
.withHeader("Content-Type", "application/json")
.withBodyFile("openai/contract-analysis-success.json")
.withFixedDelay(200))
);
}
protected void stubOpenAiServerError() {
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.willReturn(aResponse()
.withStatus(500)
.withBody("{\"error\":{\"message\":\"服务器内部错误\"}}"))
);
}
protected void stubOpenAiRateLimit() {
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.willReturn(aResponse()
.withStatus(429)
.withHeader("Retry-After", "60")
.withBody("{\"error\":{\"message\":\"Rate limit exceeded\"}}"))
);
}
protected void stubOpenAiTimeout() {
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.willReturn(aResponse()
.withStatus(200)
.withFixedDelay(30000))
);
}
}
WireMock recording file
src/test/resources/wiremock/__files/openai/contract-analysis-success.json:
{
"id": "chatcmpl-test-123",
"object": "chat.completion",
"created": 1706745600,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"riskLevel\":\"HIGH\",\"riskItems\":[{\"clause\":\"第5条违约金条款\",\"risk\":\"违约金比例超出法定上限(30%)\",\"suggestion\":\"建议将违约金比例降至20%以内\"}],\"summary\":\"本合同存在1处高风险条款\"}"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 856,
"completion_tokens": 123,
"total_tokens": 979
}
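}
Integration test cases
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import static com.github.tomakehurst.wiremock.client.WireMock.*;
import static com.github.tomakehurst.wiremock.stubbing.Scenario.STARTED;
import static org.assertj.core.api.Assertions.assertThat;
class ContractAnalysisIntegrationTest extends BaseAiIntegrationTest {
@BeforeEach
void resetWireMock() {
// The server is shared by all tests in the class: clear stubs, scenarios and the
// request journal so each verify(...) counts only the current test's calls.
wireMockServer.resetAll();
}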
@Autowired
private ContractAnalysisController controller;
@Test
void testAnalyzeContract_SuccessPath() {
stubContractAnalysisSuccess();
var request = new ContractAnalysisRequest("合同测试内容...");
var response = controller.analyze(request);
assertThat(response.getStatusCode().is2xxSuccessful()).isTrue();
assertThat(response.getBody().getRiskLevel()).isEqualTo("HIGH");
wireMockServer.verify(1, postRequestedFor(urlEqualTo("/v1/chat/completions")));
}
@Test
void testAnalyzeContract_AiServerError_ReturnsServiceUnavailable() {
stubOpenAiServerError();
var response = controller.analyze(new ContractAnalysisRequest("合同内容..."));
assertThat(response.getStatusCode().value()).isEqualTo(503);
}
@Test
void testAnalyzeContract_RateLimit_RetryAndSucceed() {
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.inScenario("RateLimit")
.whenScenarioStateIs(STARTED)
.willReturn(aResponse().withStatus(429))
.willSetStateTo("AfterRateLimit")
);
wireMockServer.stubFor(
post(urlEqualTo("/v1/chat/completions"))
.inScenario("RateLimit")
.whenScenarioStateIs("AfterRateLimit")
.willReturn(aResponse()
.withStatus(200)
.withBodyFile("openai/contract-analysis-success.json"))
);
var response = controller.analyze(new ContractAnalysisRequest("合同..."));
assertThat(response.getStatusCode().is2xxSuccessful()).isTrue();
wireMockServer.verify(2, postRequestedFor(urlEqualTo("/v1/chat/completions")));
}
}
Test level 3: record-and-replay tests
package com.laozhang.test.recorder;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Component;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
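/**
* Records and replays AI responses.
* RECORD: call the real AI and save the response to a file.
* REPLAY: read the response from a file without calling the AI.
* PASSTHROUGH: call the AI directly and store nothing.
* Controlled by the environment variable AI_TEST_MODE=RECORD|REPLAY|PASSTHROUGH.
* Assumes the application defines a ChatClient bean; Spring AI auto-configures only ChatClient.Builder.
*/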
@Slf4j
@Component
public class AiResponseRecorder {
private static final String TEST_MODE_ENV = "AI_TEST_MODE";
private static final Path RECORDINGS_DIR = Paths.get("src/test/resources/ai-recordings");
private final ChatClient chatClient;
private final ObjectMapper objectMapper;
private final Mode mode;
private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
public AiResponseRecorder(ChatClient chatClient, ObjectMapper objectMapper) {
this.chatClient = chatClient;
this.objectMapper = objectMapper;
String envMode = System.getenv(TEST_MODE_ENV);
this.mode = envMode != null ? Mode.valueOf(envMode) : Mode.PASSTHROUGH;
log.info("AI test mode: {}", this.mode);
}
public String call(String systemPrompt, String userMessage) {
String cacheKey = buildCacheKey(systemPrompt, userMessage);
return switch (mode) {
case RECORD -> {
String response = realAiCall(systemPrompt, userMessage);
saveRecording(cacheKey, systemPrompt, userMessage, response);
yield response;
}
case REPLAY -> loadRecording(cacheKey);
default -> realAiCall(systemPrompt, userMessage);
};
}
private String realAiCall(String systemPrompt, String userMessage) {
return chatClient.prompt()
.system(systemPrompt)
.user(userMessage)
.call()
.content();
}
private void saveRecording(String key, String systemPrompt,
String userMessage, String response) {
try {
Files.createDirectories(RECORDINGS_DIR);
Path filePath = RECORDINGS_DIR.resolve(key + ".json");
Map<String, String> recording = Map.of(
"key", key,
"systemPrompt", systemPrompt,
"userMessage", userMessage,
"response", response,
"recordedAt", java.time.Instant.now().toString()
);
objectMapper.writerWithDefaultPrettyPrinter()
.writeValue(filePath.toFile(), recording);
log.info("Recorded AI response: {}", filePath);
} catch (IOException e) {
log.error("Failed to save recording", e);
}
}
@SuppressWarnings("unchecked")
private String loadRecording(String key) {
return cache.computeIfAbsent(key, k -> {
Path filePath = RECORDINGS_DIR.resolve(k + ".json");
if (!Files.exists(filePath)) {
// RecordingNotFoundException is not defined in this listing, so a standard exception is used here.
throw new IllegalStateException(
"Recording not found: " + filePath +
"\nRun once with AI_TEST_MODE=RECORD to capture a real response first");
}
try {
Map<String, String> recording = objectMapper.readValue(
filePath.toFile(), Map.class);
return recording.get("response");
} catch (IOException e) {
throw new RuntimeException("Failed to load recording", e);
}
});
}
private String buildCacheKey(String systemPrompt, String userMessage) {
try {
MessageDigest md = MessageDigest.getInstance("SHA-256");
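// Fix the charset so the key is identical on every machine, whatever the platform default.
byte[] hash = md.digest((systemPrompt + "|||" + userMessage)
.getBytes(java.nio.charset.StandardCharsets.UTF_8));
return HexFormat.of().formatHex(hash).substring(0, 16);
} catch (Exception e) {
throw new RuntimeException("Failed to build cache key", e);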
}
}
public enum Mode { RECORD, REPLAY, PASSTHROUGH }
}
Test level 4: semantic assertions (LLM-as-Judge)
package com.laozhang.test.semantic;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Component;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
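/**
* Semantic assertion helper.
* Uses a judge LLM to evaluate the output quality of another LLM.
* Solves the problem that non-deterministic output cannot be checked with string assertions.
*/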
@Slf4j
@Component
@RequiredArgsConstructor
public class SemanticAssert {
private final ChatClient judgeClient;
public void assertContainsSemantic(String response, String expectation) {
SemanticJudgeResult result = judge(response, expectation);
if (!result.isPassed()) {
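throw new AssertionError(String.format(
"Semantic assertion failed!\nExpected: %s\nActual response: %s\nJudge's reasoning: %s",
// A record exposes its component as reason(), not getReason().
expectation, response, result.reason()));
}
log.debug("Semantic assertion passed: {}", expectation);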
}
public void assertQualityScore(String response, String criteria, int minScore) {
int score = scoreQuality(response, criteria);
if (score < minScore) {
throw new AssertionError(String.format(
"AI output quality below threshold!\nCriteria: %s\nActual score: %d (minimum required: %d)\nResponse: %s",
criteria, score, minScore, response));
}
log.debug("Quality score: {}/10, criteria: {}", score, criteria);
}
public void assertNotContains(String response, String forbiddenContent) {
String judgePrompt = String.format("""
判断以下文本是否包含 "%s" 相关的内容。
文本:%s
只回答 YES 或 NO。
""", forbiddenContent, response);
String judgment = judgeClient.prompt()
.user(judgePrompt)
.call()
.content()
.trim()
.toUpperCase();
if (judgment.startsWith("YES")) {
throw new AssertionError(String.format(
"AI response must not contain: %s\nActual response: %s", forbiddenContent, response));
}
}
private SemanticJudgeResult judge(String response, String expectation) {
String judgePrompt = String.format("""
评判标准:响应是否满足以下期望
期望:%s
待评判的响应:%s
严格按以下JSON格式回答,不要加任何其他内容:
{"passed": true或false, "reason": "判断理由(50字以内)"}
""", expectation, response);
String judgeResponse = judgeClient.prompt()
.system("你是专业的AI响应质量评判员,只输出JSON格式结果。")
.user(judgePrompt)
.call()
.content();
try {
return parseJudgeResult(judgeResponse);
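} catch (Exception e) {
// Fail closed: a verdict that cannot be parsed must not count as a pass.
return new SemanticJudgeResult(false, "Could not parse the judge's verdict: " + judgeResponse);
}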
}
private int scoreQuality(String response, String criteria) {
String scorePrompt = String.format("""
按以下标准为AI响应打分(1-10分):
评分标准:%s
待评分响应:%s
只输出一个1到10之间的整数,不要其他内容。
""", criteria, response);
String scoreStr = judgeClient.prompt()
.user(scorePrompt)
.call()
.content()
.trim();
try {
return Integer.parseInt(scoreStr);
} catch (NumberFormatException e) {
return 5;
}
}
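private static final com.fasterxml.jackson.databind.ObjectMapper JUDGE_MAPPER =
new com.fasterxml.jackson.databind.ObjectMapper();
private SemanticJudgeResult parseJudgeResult(String json) {
// Strip an optional Markdown code fence, then parse with Jackson rather than substring matching,
// so whitespace or key order in the judge's JSON cannot change the verdict.
String cleanJson = json.replace("```json", "").replace("```", "").trim();
try {
com.fasterxml.jackson.databind.JsonNode node = JUDGE_MAPPER.readTree(cleanJson);
boolean passed = node.path("passed").asBoolean(false);
String reason = node.path("reason").asText("no reason given");
return new SemanticJudgeResult(passed, reason);
} catch (com.fasterxml.jackson.core.JsonProcessingException e) {
// An unparseable verdict counts as a failure, never as a pass.
return new SemanticJudgeResult(false, "unparseable judge output: " + cleanJson);
}
}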
record SemanticJudgeResult(boolean isPassed, String reason) {}
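}
Semantic assertion usage example
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
@SpringBootTest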
class ContractAnalysisSemanticTest {
@Autowired
private ContractAnalysisService analysisService;
@Autowired
private SemanticAssert semanticAssert;
@Test
void testAnalyze_HighRiskContract_ShouldIdentifyMajorRisks() {
String highRiskContract = """
甲方因违约,需支付合同金额50%作为违约金。
所有纠纷提交海牙国际仲裁中心仲裁。
乙方权利可以单方面终止合同,甲方无任何救济权利。
""";
String response = analysisService.analyzeToString(highRiskContract);
semanticAssert.assertContainsSemantic(response,
"识别出违约金条款存在风险,且指出违约金比例过高");
semanticAssert.assertContainsSemantic(response,
"指出仲裁条款存在问题(跨境仲裁对中国甲方不利)");
semanticAssert.assertContainsSemantic(response,
"建议修改合同而不是直接签署");
semanticAssert.assertQualityScore(response,
"法律专业性、风险识别完整性、建议可操作性", 7);
semanticAssert.assertNotContains(response, "歧视性或偏见性内容");
}
}
Test level 5: load testing with Gatling
// src/test/scala/simulations/AiServiceSimulation.scala
package simulations
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
class AiServiceSimulation extends Simulation {
val httpProtocol = http
.baseUrl("http://localhost:8080")
.acceptHeader("application/json")
.contentTypeHeader("application/json")
.header("Authorization", "Bearer test-token")
val contractData = csv("data/contracts.csv").circular
val contractAnalysisScenario = scenario("Contract analysis load test")
.feed(contractData)
.exec(
http("Submit contract analysis")
.post("/api/contracts/analyze")
.body(StringBody("""{"content": "${contractContent}"}"""))
.check(status.is(202))
.check(jsonPath("$.taskId").saveAs("taskId"))
)
.pause(1, 3)
.exec(
doWhileDuring(session => {
val status = session("taskStatus").asOption[String]
status.forall(s => s != "COMPLETED" && s != "FAILED")
}, 60.seconds)(
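exec(
http("Poll task status")
.get("/api/tasks/${taskId}/status")
.check(status.is(200))
.check(jsonPath("$.status").saveAs("taskStatus"))
).pause(1.second) // wait between polls so the loop does not hammer the status endpoint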
)
val constantLoadScenario = scenario("Constant load")
.feed(contractData)
.exec(
http("Synchronous chat endpoint")
.post("/api/ai/chat")
.body(StringBody("""{"message": "${question}"}"""))
.check(status.is(200))
.check(responseTimeInMillis.lte(5000))
)
setUp(
contractAnalysisScenario.inject(rampUsers(100).during(5.minutes)),
constantLoadScenario.inject(constantUsersPerSec(50).during(10.minutes))
).protocols(httpProtocol)
.assertions(
global.responseTime.percentile(99).lt(10000),
global.successfulRequests.percent.gt(99),
global.requestsPerSec.gt(50)
)
}
Test level 6: chaos testing
package com.laozhang.test.chaos;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.test.context.bean.override.mockito.MockitoBean;
import static org.assertj.core.api.Assertions.*;
import static org.mockito.Mockito.*;
@SpringBootTest
class AiChaosTest {
@Autowired
private ContractAnalysisService analysisService;
@MockitoBean
private ChatClient chatClient;
@Test
void testAiProviderDown_SystemDegrades_GracefullyReturnsDefault() {
when(chatClient.prompt()).thenThrow(
new RuntimeException("Connection refused: OpenAI is down"));
assertThatNoException().isThrownBy(() -> {
ContractAnalysisResult result = analysisService.analyze("合同内容...");
assertThat(result).isNotNull();
assertThat(result.isDegraded()).isTrue();
});
}
@Test
void testAiProviderTimeout_RetryWorks_EventuallySucceeds() {
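// prompt() declares no checked exceptions, so Mockito would reject java.util.concurrent.TimeoutException here.
// ResourceAccessException is the unchecked exception Spring's HTTP clients raise on I/O errors such as timeouts.
when(chatClient.prompt())
.thenThrow(new org.springframework.web.client.ResourceAccessException("Request timeout"))
.thenThrow(new org.springframework.web.client.ResourceAccessException("Request timeout"))
.thenReturn(buildMockChatClientSpec("正常的分析结果"));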
ContractAnalysisResult result = analysisService.analyze("合同内容...");
assertThat(result.isDegraded()).isFalse();
verify(chatClient, times(3)).prompt();
}
@Test
void testCircuitBreakerOpen_SkipsAiCall_ReturnsCached() {
when(chatClient.prompt()).thenThrow(new RuntimeException("AI service error"));
for (int i = 0; i < 10; i++) {
try { analysisService.analyze("合同..."); } catch (Exception ignored) {}
}
reset(chatClient);
analysisService.analyze("合同...");
verifyNoInteractions(chatClient);
}
@Test
void testLargeDocument_MemoryPressure_ProcessesCorrectly() {
String largeDocument = "合同条款 ".repeat(500_000 / 4);
when(chatClient.prompt()).thenAnswer(inv -> buildMockChatClientSpec("分析结果"));
assertThatNoException().isThrownBy(
() -> analysisService.analyze(largeDocument));
}
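// Helper used by the tests above: a request spec whose fluent methods return itself
// and whose call() yields the given content.
private ChatClient.ChatClientRequestSpec buildMockChatClientSpec(String content) {
ChatClient.CallResponseSpec callSpec = mock(ChatClient.CallResponseSpec.class);
when(callSpec.content()).thenReturn(content);
ChatClient.ChatClientRequestSpec spec = mock(ChatClient.ChatClientRequestSpec.class, RETURNS_SELF);
when(spec.call()).thenReturn(callSpec);
return spec;
}
}
Test data management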
package com.laozhang.test.data;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.Data;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Component;
import java.util.List;
import java.util.Map;
/**
* Manages test data for AI tests.
* Keeps test data out of the code so that product managers and QA can help maintain it.
*/
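@Component
public class AiTestDataManager {
private final ObjectMapper objectMapper;
private Map<String, TestCase> testCases;
// Assumptions: jackson-dataformat-yaml is on the test classpath, and the cases live in
// ai-test-cases/contract-analysis.yml, keyed by case name.
public AiTestDataManager() throws java.io.IOException {
this.objectMapper = new ObjectMapper(new com.fasterxml.jackson.dataformat.yaml.YAMLFactory());
try (java.io.InputStream in =
new ClassPathResource("ai-test-cases/contract-analysis.yml").getInputStream()) {
this.testCases = objectMapper.readValue(in,
objectMapper.getTypeFactory().constructMapType(
java.util.LinkedHashMap.class, String.class, TestCase.class));
}
}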
public TestCase getTestCase(String caseName) {
if (!testCases.containsKey(caseName)) {
throw new IllegalArgumentException("测试用例不存在: " + caseName);
}
return testCases.get(caseName);
}
public List<TestCase> getTestCasesByTag(String tag) {
return testCases.values().stream()
.filter(tc -> tc.getTags().contains(tag))
.toList();
}
@Data
public static class TestCase {
private String name;
private String description;
private List<String> tags;
private String input;
private ExpectedOutput expectedOutput;
private QualityCriteria qualityCriteria;
}
@Data
public static class ExpectedOutput {
private String riskLevel;
private List<String> mustContain;
private List<String> mustNotContain;
private int minQualityScore;
}
@Data
public static class QualityCriteria {
private String completeness;
private String accuracy;
private String professionalism;
}
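}
Test case YAML file, src/test/resources/ai-test-cases/contract-analysis.yml:
high-risk-penalty-clause:
  name: "High-risk penalty clause"
  description: "Checks that the AI flags a penalty clause exceeding the statutory cap"
  tags: [high-risk, penalty, regression]
  # The input and the match strings stay in Chinese, the language the model is asked to answer in.
  input: |
    第五条 违约责任
    如甲方违约,须赔偿合同金额的50%作为违约金。
  expectedOutput:
    riskLevel: HIGH
    mustContain:
      - "违约金比例过高"
      - "建议修改"
    mustNotContain:
      - "合同合规"
    minQualityScore: 7
  qualityCriteria:
    completeness: "Does it identify that the penalty rate exceeds the legal limit?"
    accuracy: "Is the risk level rated HIGH?"
    professionalism: "Does it give concrete suggestions for revision?"
CI/CD integration strategy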
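# .github/workflows/test.yml
name: AI Application Tests
# workflow_dispatch is needed for the manually triggered load-test job to ever run
on: [push, pull_request, workflow_dispatch]
jobs:
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: '21' }  # assumed JDK; the code above needs 17 or newer
      # Only classes named *UnitTest are picked up by this filter
      - run: mvn test -Dtest="**/*UnitTest"
        env:
          AI_TEST_MODE: REPLAY
  integration-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: '21' }
      - run: mvn test -Dtest="**/*IntegrationTest"
        env:
          AI_TEST_MODE: REPLAY
  semantic-test:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: '21' }
      - run: mvn test -Dtest="**/*SemanticTest"
        env:
          AI_TEST_MODE: PASSTHROUGH
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  load-test:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: '21' }
      - run: mvn gatling:test
FAQ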
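Q1: Semantic assertions rely on a judge LLM, which adds API cost. Is it worth it?
A: Yes. The judge can be a cheap model (GPT-4o-mini), at roughly $0.001 per judgment. Compared with the cost of a production incident (customer complaints, emergency fixes, reputational damage), it is the most cost-effective investment you can make.
Q2: Do WireMock recordings go stale over time?
A: Yes. OpenAI upgrades its models from time to time, and the output format can shift slightly. Re-record the key scenarios every quarter, add a freshness check in CI (warn when a recording is older than 90 days), and put recording files through code review in version control.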
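The 90-day freshness check can be a few lines. A minimal sketch, assuming each recording carries a `recordedAt` timestamp like the one AiResponseRecorder writes with Instant.now().toString(). `RecordingFreshness` is a hypothetical name, and how CI reads the timestamp out of each file is left out.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether a recorded AI response is too old to trust. */
final class RecordingFreshness {

    // Threshold taken from the 90-day recommendation above.
    static final Duration MAX_AGE = Duration.ofDays(90);

    private RecordingFreshness() {}

    /** True when the recording is strictly older than MAX_AGE at the given moment. */
    static boolean isStale(Instant recordedAt, Instant now) {
        return Duration.between(recordedAt, now).compareTo(MAX_AGE) > 0;
    }
}
```

Passing `now` in as a parameter keeps the check deterministic in tests. In CI it would be called with Instant.now() and the value parsed from the file via Instant.parse(...).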
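Q3: Should load tests use the real AI or a mock?
A: It depends on the goal. To measure the AI service's throughput and latency, use the real AI; to measure your own service (queueing, concurrency control), use a mock. Do both if you can: mock AI for business-layer performance, real AI for end-to-end performance.
Q4: How do I test the effect of prompt engineering (A/B testing different prompts)?
A: Build a prompt evaluation suite: prepare 50 to 100 representative inputs, generate outputs with both prompts, score them automatically with semantic assertions, and compare the average scores. Automate the process so that every prompt change triggers the comparison, which guards against "prompt regression."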
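The comparison step of such a suite can be sketched as below. The scorer is passed in as a function so the sketch stays independent of any model client. In a real suite it would generate an output with the given prompt and have the judge score it from 1 to 10. `PromptAbComparison` and the `tolerance` parameter are my own illustrative names.

```java
import java.util.List;
import java.util.function.ToIntBiFunction;

/** Hypothetical helper: compares two prompts by their average quality score over fixed inputs. */
final class PromptAbComparison {

    private PromptAbComparison() {}

    /** Average of scorer(prompt, input) over all inputs; 0.0 when the input list is empty. */
    static double averageScore(String prompt, List<String> inputs,
                               ToIntBiFunction<String, String> scorer) {
        return inputs.stream()
                .mapToInt(input -> scorer.applyAsInt(prompt, input))
                .average()
                .orElse(0.0);
    }

    /** True when the candidate's average falls below the baseline's by more than the tolerance. */
    static boolean candidateRegresses(String baseline, String candidate, List<String> inputs,
                                      ToIntBiFunction<String, String> scorer, double tolerance) {
        return averageScore(candidate, inputs, scorer)
                < averageScore(baseline, inputs, scorer) - tolerance;
    }
}
```

The tolerance matters because judge scores are themselves noisy. Without it, a difference of a tenth of a point between two runs of the same prompt would be reported as a regression.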
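Q5: The AI model in the test environment differs from the one in production. Is that a problem?
A: There will always be subtle differences. Use a cheap model (GPT-4o-mini) to test basic logic and a stronger model (GPT-4o) in production; also maintain a small set of "golden path" tests that run against the production-grade model and must pass on every major release.
Q6: How do I track AI output quality over time?
A: Build a quality benchmark: pick 100 fixed test cases, run them automatically after each deployment, store the quality scores in a time-series database (InfluxDB), and chart the trend in Grafana. Trigger an alert when the score drops by more than 5%.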
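The alert rule itself is simple arithmetic. A minimal sketch, assuming "drops by more than 5%" means a relative drop against the previous baseline score. `QualityTrendAlert` is a hypothetical name, and storing and fetching the scores (InfluxDB, Grafana) is outside the sketch.

```java
/** Hypothetical helper: flags a benchmark run whose score fell too far below the baseline. */
final class QualityTrendAlert {

    // Assumption: the 5% threshold is relative to the baseline score.
    static final double MAX_RELATIVE_DROP = 0.05;

    private QualityTrendAlert() {}

    /** True when the current score is more than 5% lower than the baseline. */
    static boolean shouldAlert(double baselineScore, double currentScore) {
        if (baselineScore <= 0) {
            return false; // no meaningful baseline yet
        }
        double relativeDrop = (baselineScore - currentScore) / baselineScore;
        return relativeDrop > MAX_RELATIVE_DROP;
    }
}
```

For example, a fall from an average of 8.0 to 7.5 is a 6.25% drop and triggers the alert, while 8.0 to 7.8 (2.5%) does not.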
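Summary
Zhao Lei's lesson is a sharp one: every test passed, and the system crashed on launch. The root cause was treating the mock as if it were the real AI, and so testing a world that did not exist.
Actionable checklist:
An AI system without thorough tests is like a sports car without seat belts: fast, but dangerous.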
