The Testing Pyramid for AI Engineering: How to Split Unit, Integration, and End-to-End Tests

Lao Zhang | 2026/4/30 | about 9 min read


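Audience: teams building AI applications | Reading time: about 15 minutes | Key takeaway: build a practical testing system for AI applications, so that the quality of the AI system is actually verified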

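Six months ago, our team learned a painful lesson.

A colleague changed a RAG prompt, thought the change looked good, tried a few examples by hand locally, saw nothing wrong, and merged it to production. Three days later, users started complaining that the AI's answers no longer addressed their questions.

The investigation showed that the prompt change had altered the retrieval strategy, and recall for one category of questions had dropped by half. With tests in place, this would have been caught during development.

After that, we got serious about building a testing system for our AI application.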

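Why testing AI applications is harder than testing traditional ones

The logic of traditional application testing: given input A, assert that the output equals B. The behavior is deterministic, so assertions are easy to write.

The logic of AI application testing: given input A, the output should "resemble B", "contain key point X", and "not contain Y". The output is probabilistic and may differ on every run, so assertions are hard to write.

This difference means AI testing needs a different approach:

Test behavior and quality metrics, not exact wording
Test the statistical pass rate, not a single run (for example, 90 passes out of 100 runs counts as a pass)
Use an LLM to evaluate LLM output (LLM-as-Judge)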
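
The "statistical pass rate" idea can be sketched as a small helper. This is a minimal illustration, not the team's actual harness: `count_passes` and `meets_pass_rate` are hypothetical names, and the seeded random check stands in for a real model call.

```python
import random
from typing import Callable


def count_passes(check: Callable[[], bool], runs: int) -> int:
    """Run a non-deterministic check `runs` times and count how often it passes."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return sum(1 for _ in range(runs) if check())


def meets_pass_rate(check: Callable[[], bool], runs: int, min_passes: int) -> bool:
    """Pass if at least `min_passes` of `runs` attempts succeed.

    Comparing integer counts avoids float rounding surprises
    (for example, 2/3 is slightly below 0.67).
    """
    return count_passes(check, runs) >= min_passes


if __name__ == "__main__":
    # Stand-in for a real model call: a seeded check that succeeds about 95% of the time.
    rng = random.Random(0)
    print(meets_pass_rate(lambda: rng.random() < 0.95, runs=100, min_passes=90))
```

Expressing the threshold as a count ("at least 90 of 100") rather than a ratio keeps the assertion exact and makes the intent obvious in a failure message.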

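The structure of the testing pyramid

The traditional testing pyramid: many unit tests at the bottom, integration tests in the middle, and a small number of E2E tests at the top.

The pyramid for AI applications has the same shape, but what each layer tests is different:

        /\
       /  \
      / E2E\          End-to-end: overall conversation quality, user scenarios
     /------\
    / Integ. \        Integration: RAG retrieval quality, tool-calling chains
   /----------\
  /    Unit    \      Unit: utility functions, prompt rendering, parsers
 /--------------\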

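Layer 1: unit tests

Unit tests cover the logic that does not involve the model itself. That logic is deterministic, so exact assertions work.

Testing prompt template rendering: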

import pytest
from jinja2 import Template

# Prompt templates (in the real project these live in config files)
SYSTEM_PROMPT_TEMPLATE = """You are a {{ role }} specializing in {{ domain }}.

{% if constraints %}
Notes:
{% for c in constraints %}
- {{ c }}
{% endfor %}
{% endif %}

Current date: {{ current_date }}
User background: {{ user_context | default('unknown') }}"""

RAG_PROMPT_TEMPLATE = """Answer the question using the reference material below:

{% for doc in documents %}
## Source {{ loop.index }}: {{ doc.title }}
{{ doc.content }}

{% endfor %}
Question: {{ question }}

Requirements:
- Answer only from the reference material above
- If the material is not sufficient to answer, say so
- Keep the answer concise, no more than {{ max_words }} words"""


class TestPromptRendering:

    def test_system_prompt_renders_correctly(self):
        template = Template(SYSTEM_PROMPT_TEMPLATE)
        result = template.render(
            role="technical consultant",
            domain="Python development",
            constraints=["Do not recommend deprecated libraries", "Code must be commented"],
            current_date="2026-04-23",
            user_context="Intermediate Python developer"
        )

        assert "technical consultant" in result
        assert "Python development" in result
        assert "Do not recommend deprecated libraries" in result
        assert "2026-04-23" in result
        assert "Intermediate Python developer" in result

    def test_system_prompt_handles_empty_constraints(self):
        template = Template(SYSTEM_PROMPT_TEMPLATE)
        result = template.render(
            role="assistant",
            domain="general topics",
            constraints=[],  # empty list
            current_date="2026-04-23"
        )
        # With no constraints, the "Notes:" section must not be rendered
        assert "Notes:" not in result

    def test_rag_prompt_renders_multiple_docs(self):
        template = Template(RAG_PROMPT_TEMPLATE)
        docs = [
            {"title": "Python basics", "content": "Python is an interpreted language"},
            {"title": "Advanced features", "content": "A decorator is syntactic sugar"}
        ]
        result = template.render(
            documents=docs,
            question="What is a Python decorator",
            max_words=200
        )

        assert "Source 1" in result
        assert "Source 2" in result
        assert "Python basics" in result
        assert "Advanced features" in result
        assert "What is a Python decorator" in result
        assert "200" in result

    def test_rag_prompt_max_words_boundary(self):
        template = Template(RAG_PROMPT_TEMPLATE)
        # Boundary case: max_words is 0 (0 is falsy, but it must still be rendered)
        result = template.render(
            documents=[{"title": "test", "content": "test content"}],
            question="test",
            max_words=0
        )
        assert "0 words" in result

Testing utility functions (deterministic logic):

from app.utils.chunking import split_text_into_chunks, count_tokens
from app.utils.retrieval import rerank_by_score, filter_by_threshold

class TestChunkingUtils:

    def test_split_respects_max_tokens(self):
        # Build a long text
        long_text = "This is a test sentence. " * 200  # 5,000 characters

        chunks = split_text_into_chunks(long_text, max_tokens=200, overlap=20)

        for chunk in chunks:
            token_count = count_tokens(chunk)
            assert token_count <= 200, f"Chunk exceeds max_tokens: {token_count}"

    def test_split_preserves_all_content(self):
        """With overlap=0, every chunk should be a verbatim slice of the original text."""
        text = "First paragraph. Second paragraph. Third paragraph. Fourth paragraph. Fifth paragraph."
        chunks = split_text_into_chunks(text, max_tokens=50, overlap=0)

        # The splitter must not invent or alter content
        assert chunks, "Splitter returned no chunks"
        for chunk in chunks:
            assert chunk in text

    def test_rerank_preserves_all_results(self):
        docs = [
            {"content": "A", "score": 0.7},
            {"content": "B", "score": 0.9},
            {"content": "C", "score": 0.5}
        ]
        reranked = rerank_by_score(docs)

        assert len(reranked) == 3
        assert reranked[0]["content"] == "B"  # highest score first
        assert reranked[-1]["content"] == "C"  # lowest score last

    def test_filter_removes_below_threshold(self):
        docs = [
            {"content": "high", "score": 0.85},
            {"content": "mid", "score": 0.6},
            {"content": "low", "score": 0.3}
        ]
        filtered = filter_by_threshold(docs, threshold=0.7)

        assert len(filtered) == 1
        assert filtered[0]["content"] == "high"


class TestOutputParsers:
    """Tests for the parsers that process AI output."""

    def test_json_extractor_handles_markdown_wrapper(self):
        """Models often wrap JSON in a markdown code block; the parser must handle that."""
        from app.parsers import extract_json_from_response

        ai_output = (
            "Sure, here is the data you asked for:\n\n"
            "```json\n"
            '{"name": "Zhang San", "age": 30, "skills": ["Python", "Java"]}\n'
            "```\n\n"
            "Hope this helps."
        )

        result = extract_json_from_response(ai_output)
        assert result["name"] == "Zhang San"
        assert result["age"] == 30
        assert "Python" in result["skills"]

    def test_json_extractor_handles_raw_json(self):
        """Bare JSON output must be handled as well."""
        from app.parsers import extract_json_from_response

        ai_output = '{"status": "ok", "count": 5}'
        result = extract_json_from_response(ai_output)
        assert result["status"] == "ok"
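
The article does not show `extract_json_from_response` itself. As a rough sketch of what such a parser might look like (an assumption about its behavior, not the project's actual `app.parsers` code), the version below first looks for a fenced block and then falls back to scanning for the first decodable JSON value:

```python
import json
import re
from typing import Any

# Build the fence marker programmatically so this listing stays readable.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_json_from_response(text: str) -> Any:
    """Extract the first JSON value from a model response.

    Hypothetical implementation: prefers the body of a fenced code block,
    then falls back to the first '{' or '[' that starts valid JSON.
    """
    match = _FENCED_BLOCK.search(text)
    candidate = (match.group(1) if match else text).strip()

    # Fast path: the candidate is already clean JSON.
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    # Slow path: JSON embedded in surrounding prose.
    decoder = json.JSONDecoder()
    for index, char in enumerate(candidate):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(candidate, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in response")
```

Raising `ValueError` on failure, rather than returning `None`, lets the calling code decide whether to retry the model call or surface the error.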

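Layer 2: integration tests

Integration tests cover the logic that involves real calls to the LLM and its surrounding services, but the scope has to stay controlled. Use a small model or mocks, and focus on system behavior rather than output quality.

Testing RAG retrieval quality: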

import pytest
import pytest_asyncio
from app.rag import RAGPipeline

# These tests hit a real vector store (but only a dedicated test collection).
# Async tests and fixtures need the pytest-asyncio plugin; in its default strict
# mode they must be marked explicitly, as done here.
@pytest.mark.integration
@pytest.mark.asyncio
class TestRAGRetrieval:

    @pytest_asyncio.fixture(autouse=True)
    async def setup_test_kb(self, qdrant_client, embedding_service):
        """Insert known content into the test collection."""
        self.rag = RAGPipeline(
            qdrant_client=qdrant_client,
            embedding_service=embedding_service,
            collection_name="test_collection"  # test-only, never production data
        )

        # Index known documents
        test_docs = [
            {"id": "doc1", "content": "Python's GIL stands for Global Interpreter Lock; it limits parallel execution of threads"},
            {"id": "doc2", "content": "Django is a full-featured Python web framework with a built-in ORM and admin site"},
            {"id": "doc3", "content": "FastAPI is a high-performance asynchronous Python web framework that generates API docs automatically"},
            {"id": "doc4", "content": "Redis is an in-memory database, commonly used for caching and message queues"},
        ]
        await self.rag.index_documents(test_docs)
        yield
        # Clean up the test data
        await qdrant_client.delete_collection("test_collection")

    async def test_retrieves_relevant_doc_for_query(self):
        """A relevant query should retrieve the right document."""
        results = await self.rag.search("What limits multithreading in Python", top_k=3)

        assert len(results) > 0
        # doc1 (about the GIL) should rank first
        top_result = results[0]
        assert "GIL" in top_result['content'] or "Global Interpreter Lock" in top_result['content']

    async def test_retrieves_correct_doc_for_framework_query(self):
        """A framework-related query."""
        results = await self.rag.search("asynchronous web framework", top_k=2)

        # FastAPI should be among the results
        contents = " ".join([r['content'] for r in results])
        assert "FastAPI" in contents

    async def test_irrelevant_query_returns_low_scores(self):
        """An unrelated query should only get low similarity scores."""
        results = await self.rag.search("What is the weather like today", top_k=3)

        if results:
            # Every result should score below the threshold
            for r in results:
                assert r['score'] < 0.7, f"Irrelevant query got a high score: {r['score']}"

    async def test_top_k_limit_respected(self):
        """The top_k limit must be respected."""
        results = await self.rag.search("Python", top_k=2)
        assert len(results) <= 2
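
The incident at the start of this article was a drop in recall, which the spot checks above would only catch by luck. One way to guard against it is to compute recall@k over a small labeled query set and assert a floor. The sketch below assumes retrieval results can be reduced to document IDs; `recall_at_k` and `mean_recall_at_k` are illustrative names, not part of the project's code.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(cases: List[Tuple[Sequence[str], Iterable[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not cases:
        raise ValueError("cases must not be empty")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases) / len(cases)


if __name__ == "__main__":
    # Hypothetical labeled results, reusing the document IDs from the fixture above.
    labeled = [
        (["doc1", "doc3", "doc2"], {"doc1"}),          # GIL question
        (["doc2", "doc4", "doc3"], {"doc2", "doc3"}),  # web framework question
    ]
    print(mean_recall_at_k(labeled, k=2))
```

In a test, the retrieved IDs would come from `self.rag.search(...)`, and the assertion would compare the mean against a baseline recorded before the prompt or retrieval change.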

Testing the tool-calling chain (function calling):

from unittest.mock import AsyncMock, patch
import pytest

# create_test_agent is assumed to be a project helper (for example, defined in
# conftest.py) that builds an agent wired to a cheap model.

@pytest.mark.integration
@pytest.mark.asyncio
class TestToolCalling:
    """Tests for the tool-routing logic."""

    async def test_weather_tool_triggered_for_weather_query(self):
        """A weather question should trigger the weather tool."""

        agent = await create_test_agent()

        # Mock tool execution so that only the routing decision is tested
        with patch.object(agent, 'execute_tool', new_callable=AsyncMock) as mock_tool:
            mock_tool.return_value = {"temperature": 25, "weather": "sunny"}

            await agent.chat("What is the weather like in Beijing today")

            # Verify that the weather tool was called
            assert mock_tool.called
            call_args = mock_tool.call_args
            assert call_args[0][0] == "get_weather"  # first positional argument is the tool name

    async def test_search_tool_triggered_for_search_query(self):
        """A factual question should trigger the search tool."""

        agent = await create_test_agent()

        with patch.object(agent, 'execute_tool', new_callable=AsyncMock) as mock_tool:
            mock_tool.return_value = {"results": ["relevant info 1", "relevant info 2"]}

            await agent.chat("Where were the 2024 Olympic Games held")

            assert mock_tool.called
            call_args = mock_tool.call_args
            assert call_args[0][0] in ["web_search", "search"]

Layer 3: end-to-end tests

E2E tests focus on overall conversation quality, which calls for LLM-as-Judge evaluation.

import anthropic
import pytest
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    __test__ = False  # tell pytest this is data, not a test class to collect

    name: str
    input: str
    evaluator: Callable[[str], bool]    # evaluation function
    description: str

# Evaluators
def contains_code_example(response: str) -> bool:
    """The answer should contain a code example."""
    return "```" in response or "def " in response or "class " in response

def does_not_hallucinate_api(response: str) -> bool:
    """The answer must not cite APIs that do not exist (the list of known fake APIs is maintained by hand)."""
    hallucinated_apis = [
        "openai.ChatCompletion.acreate_v2",  # does not exist
        "langchain.agents.create_react_agent_v3",  # does not exist
    ]
    return not any(fake in response for fake in hallucinated_apis)

def is_concise_answer(response: str, max_chars: int = 1000) -> bool:
    """The answer has a reasonable length and is not overly verbose."""
    return len(response) <= max_chars

def uses_llm_as_judge(question: str, response: str, criteria: str) -> bool:
    """Ask an LLM whether the answer meets a given criterion."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    judge_prompt = f"""Evaluate whether the following AI answer meets the requirement.

User question: {question}
AI answer: {response}

Evaluation criterion: {criteria}

Answer only YES or NO."""

    result = client.messages.create(
        model="claude-3-5-haiku-20241022",  # a cheap, small model is enough for judging
        max_tokens=10,
        messages=[{"role": "user", "content": judge_prompt}]
    )

    return "YES" in result.content[0].text.upper()


# Test case set
E2E_TEST_CASES = [
    TestCase(
        name="code_question_gets_code",
        input="Write a Python function that reads a CSV file",
        evaluator=contains_code_example,
        description="Coding questions should get a code example"
    ),
    TestCase(
        name="factual_question_accuracy",
        input="What is Python's GIL",
        evaluator=lambda r: uses_llm_as_judge(
            "What is Python's GIL",
            r,
            "The answer should state that GIL means Global Interpreter Lock and explain its effect on multithreading"
        ),
        description="Factual questions should get an accurate answer"
    ),
    TestCase(
        name="concise_simple_question",
        input="What is a variable",
        evaluator=lambda r: is_concise_answer(r, 500),
        description="Simple questions do not need a very long answer"
    ),
    TestCase(
        name="no_hallucination",
        input="Send a request using the OpenAI API",
        evaluator=does_not_hallucinate_api,
        description="The answer must not cite APIs that do not exist"
    )
]


@pytest.mark.e2e
@pytest.mark.asyncio
class TestEndToEnd:

    @pytest.mark.parametrize("test_case", E2E_TEST_CASES, ids=[t.name for t in E2E_TEST_CASES])
    async def test_response_quality(self, test_case: TestCase, ai_assistant):
        """Parametrized E2E test; each test case checks one quality dimension."""

        # Run 3 times and take the majority result (to absorb output randomness)
        runs = 3
        pass_count = 0
        for _ in range(runs):
            response = await ai_assistant.chat(test_case.input)
            if test_case.evaluator(response):
                pass_count += 1

        # At least 2 of 3 runs must pass. Compare counts, not ratios:
        # 2/3 is about 0.6667, which would fail a `>= 0.67` check.
        assert pass_count >= 2, (
            f"Test [{test_case.name}] failed: pass rate {pass_count / runs:.0%}\n"
            f"Description: {test_case.description}\n"
            f"Input: {test_case.input}"
        )
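
The `"YES" in ...` substring check in `uses_llm_as_judge` is permissive: a reply such as "Yes and no" counts as a pass. A stricter parser can be sketched as follows; `parse_verdict` is a hypothetical helper, and treating an ambiguous reply as an error (rather than as a failure) is one possible design choice, not the article's.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Map a judge reply to True (YES) or False (NO).

    Raises ValueError when the reply contains both words or neither,
    so that a confused judge is not silently counted as a pass or a fail.
    """
    words = re.findall(r"[A-Za-z]+", raw.upper())
    has_yes = "YES" in words
    has_no = "NO" in words
    if has_yes == has_no:
        raise ValueError(f"ambiguous judge verdict: {raw!r}")
    return has_yes
```

Matching whole words also avoids false positives from words that merely contain "YES" or "NO", such as "NOT" or "YESTERDAY".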

Running the three layers in CI/CD

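# .github/workflows/ai-tests.yml
# Note: the setup-python and install steps assume a requirements.txt at the
# repository root; adjust them to your project's packaging.
name: AI Application Tests

on:
  push:
    branches: [main, dev]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: |
          pytest tests/unit/ -v --tb=short
          # Unit tests: fast, run all of them, no real AI API calls
  
  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit-tests  # integration tests run only after unit tests pass
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run integration tests
        run: |
          pytest tests/integration/ -v --tb=short -m integration
          # Integration tests: medium speed, real vector store, AI API calls with a cheap model
  
  e2e-tests:
    name: E2E Quality Tests
    runs-on: ubuntu-latest
    needs: integration-tests
    # E2E tests run only on pushes to main, not on every PR (too expensive)
    if: github.ref == 'refs/heads/main'
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # The LLM judge uses the Anthropic client; the secret name is an assumption
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run E2E tests
        run: |
          pytest tests/e2e/ -v --tb=short -m e2e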

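Controlling test cost

AI testing has a problem that traditional testing does not: cost. Every test run that calls the API costs money.

A few strategies for keeping the cost down:

Unit tests never call the AI API: mock out every LLM call and test pure logic only.
Integration tests use a small model: gpt-4o-mini or claude-haiku is good enough and much cheaper.
E2E tests do not run on PRs: run them only on merges to main, or on a schedule (once a day).
Record HTTP responses with VCR: make the real call once, then replay the recording. This suits CI environments well.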

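# Record and replay HTTP responses with pytest-recording
import pytest

# Record the cassette once with `pytest --record-mode=once`; later runs replay it
# (pytest-recording's default record mode does not make new network calls).
# llm_client is assumed to be a fixture provided by the project's conftest.py.
@pytest.mark.vcr
@pytest.mark.asyncio
async def test_with_recorded_llm_response(llm_client):
    result = await llm_client.chat("Hello")
    assert len(result) > 0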
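
To see why the E2E layer is the one worth rationing, a back-of-the-envelope estimate helps. The function below is a rough sketch; every number in the example call is a made-up placeholder, so substitute your own suite size, token counts, and your provider's current prices.

```python
def estimate_suite_cost(
    cases: int,
    runs_per_case: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Estimate the API cost in USD of one full test-suite run.

    Token counts are per call; prices are per million tokens.
    """
    calls = cases * runs_per_case
    cost_in = calls * input_tokens / 1_000_000 * usd_per_mtok_in
    cost_out = calls * output_tokens / 1_000_000 * usd_per_mtok_out
    return cost_in + cost_out


if __name__ == "__main__":
    # Placeholder values only: 200 cases, 3 runs each, hypothetical prices.
    per_run = estimate_suite_cost(200, 3, 2_000, 500, 1.0, 5.0)
    print(f"per run: ${per_run:.2f}")
```

Multiplying the per-run figure by the number of pushes per day shows quickly whether a layer can run on every PR or should move to a merge or nightly trigger.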

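With this testing system in place, a prompt change no longer ships after "trying a few examples by hand". It has to pass the full test suite. Had these tests existed back then, the quiet degradation in RAG quality would have been caught directly in CI.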