Prompt Version Management: Don't Leave Prompts Running Naked in Your Code

Lao Zhang · 2026/4/30 · about 9 minutes

I once did something really dumb: I wrote a prompt directly into Java code as a string constant.

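private static final String ANALYSIS_PROMPT = 
    "You are a contract analysis expert. Analyze the following contract text and extract the key information..." +
    "Notes: 1. Focus only on the main clauses 2. Output in JSON format...";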

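That prompt was maintained for about three months and went through more revisions than I can count. Every change meant editing the string directly and shipping a release.

Then one day something broke: the AI's analysis results no longer matched earlier ones, and some fields that used to be extracted correctly were now failing.

I dug in and found that a month earlier a colleague had "casually" changed one sentence in the prompt, from "output in JSON format" to "return the result in JSON format". That small change in wording subtly altered the format of the model's output, and the downstream parser broke.

Worse, the change was buried in a large PR with no separate description, no tests, and no record. Roll it back? The same PR touched other code as well, so reverting it would not undo just this one change.

After that, I started taking prompt version management seriously.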

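A Prompt Is "Software", Not Configuration

This is the key premise for understanding this article.

Many people treat prompts as configuration: put them in a properties file or a database and read them out when needed. That is not wrong, but it is not enough.

A prompt needs the same full development lifecycle management as application code:

Version control: every change gets an explicit version number, along with who changed it, what changed, and why
Evaluation: before a new prompt version goes live, it is evaluated against a test set to show that the new version is better than the old one
Gray release (canary rollout): a new prompt does not go to everyone at once; a share of users get it first while you watch the results
Rollback: when a new version misbehaves, you can switch back to the old one immediately, without a new code release
A/B testing: two prompt versions run side by side, and data decides which one is better

Miss any one of these five and your prompt management is incomplete.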

Designing Database Storage for Prompt Versions

Database Schema

-- Prompt definition table (prompt metadata)
CREATE TABLE prompt_definition (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    prompt_key VARCHAR(128) NOT NULL UNIQUE,   -- unique identifier, e.g. 'contract_analysis' (UNIQUE already creates an index)
    description VARCHAR(500),                   -- what the prompt does
    owner VARCHAR(64),                          -- owner
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL
);

-- Prompt version table (every change creates a new version)
CREATE TABLE prompt_version (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    prompt_key VARCHAR(128) NOT NULL,
    version VARCHAR(32) NOT NULL,              -- version number, e.g. 'v1.0.0', 'v1.1.0'
    content TEXT NOT NULL,                      -- prompt body
    variables JSON,                            -- template variable definitions [{name, description, required}]
    model_constraints JSON,                    -- applicable model constraints {min_context, recommended_models}
    change_notes TEXT,                         -- what changed and why
    created_by VARCHAR(64),
    created_at DATETIME NOT NULL,
    evaluation_status VARCHAR(16) DEFAULT 'PENDING', -- PENDING/EVALUATED/APPROVED/REJECTED
    evaluation_score DECIMAL(5,2),             -- evaluation score
    is_active BOOLEAN DEFAULT FALSE,           -- whether this is the currently active version
    UNIQUE KEY uk_key_version (prompt_key, version),  -- also serves lookups by prompt_key alone
    INDEX idx_key_active (prompt_key, is_active)      -- supports finding the active version of a key
);

-- Prompt deployment history table
CREATE TABLE prompt_deployment (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    prompt_key VARCHAR(128) NOT NULL,
    from_version VARCHAR(32),                  -- version switched from
    to_version VARCHAR(32) NOT NULL,           -- version switched to
    deployment_type VARCHAR(16) NOT NULL,      -- FULL/GRAY/ROLLBACK
    gray_percent INT,                          -- gray percentage (used when the type is GRAY)
    operator VARCHAR(64),
    reason TEXT,
    deployed_at DATETIME NOT NULL,
    INDEX idx_prompt_key (prompt_key),
    INDEX idx_deployed_at (deployed_at)
);

-- Prompt evaluation result table (used for A/B comparison)
CREATE TABLE prompt_evaluation_result (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    prompt_key VARCHAR(128) NOT NULL,
    version VARCHAR(32) NOT NULL,
    test_case_id VARCHAR(128),
    input_data JSON,                           -- test input
    expected_output TEXT,                      -- expected output
    actual_output TEXT,                        -- actual output
    score DECIMAL(5,2),                        -- score (manual or automatic)
    evaluator VARCHAR(64),
    evaluated_at DATETIME NOT NULL,
    INDEX idx_key_version (prompt_key, version)
);

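Prompt Version Service

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.time.Duration;
import java.util.Date;

@Service
@Slf4j
public class PromptVersionService {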

    @Autowired
    private PromptVersionMapper versionMapper;
    
    @Autowired
    private PromptDeploymentMapper deploymentMapper;
    
    @Autowired
    private StringRedisTemplate redisTemplate;
    
    @Autowired
    private ObjectMapper objectMapper;

    private static final String ACTIVE_VERSION_CACHE = "prompt:active:";
    private static final String GRAY_CONFIG_CACHE = "prompt:gray:";

    /**
     * Returns the prompt content, choosing between the gray version and the active version per user.
     */
    public String getPrompt(String promptKey, String userId) {
        // Check whether a gray release is configured
        GrayConfig grayConfig = getGrayConfig(promptKey);
        
        if (grayConfig != null && isInGrayGroup(userId, promptKey, grayConfig.getGrayPercent())) {
            // Serve the gray version
            PromptVersion grayVersion = versionMapper.findByKeyAndVersion(promptKey, grayConfig.getGrayVersion());
            if (grayVersion != null) {
                return grayVersion.getContent();
            }
        }
        
        // Otherwise serve the currently active version
        return getActivePrompt(promptKey);
    }
    
    /**
     * Returns the currently active prompt (cached).
     */
    public String getActivePrompt(String promptKey) {
        String cacheKey = ACTIVE_VERSION_CACHE + promptKey;
        String cached = redisTemplate.opsForValue().get(cacheKey);
        
        if (cached != null) {
            return cached;
        }
        
        PromptVersion activeVersion = versionMapper.findActiveByKey(promptKey);
        if (activeVersion == null) {
            throw new PromptNotFoundException("No active prompt version for key: " + promptKey);
        }
        
        redisTemplate.opsForValue().set(cacheKey, activeVersion.getContent(), Duration.ofMinutes(5));
        return activeVersion.getContent();
    }
    
    /**
     * Publishes a new version (full switch-over).
     */
    @Transactional
    public void publish(String promptKey, String newVersion, String operator, String reason) {
        PromptVersion newV = versionMapper.findByKeyAndVersion(promptKey, newVersion);
        if (newV == null) {
            throw new PromptVersionNotFoundException(promptKey, newVersion);
        }
        
        if (newV.getEvaluationStatus() != EvaluationStatus.APPROVED) {
            throw new PromptNotApprovedException("Version " + newVersion + " has not been approved");
        }
        
        // Look up the current active version
        PromptVersion currentActive = versionMapper.findActiveByKey(promptKey);
        String fromVersion = currentActive != null ? currentActive.getVersion() : null;
        
        // Switch versions
        if (currentActive != null) {
            versionMapper.setInactive(promptKey, currentActive.getVersion());
        }
        versionMapper.setActive(promptKey, newVersion);
        
        // Record the deployment history
        deploymentMapper.insert(PromptDeployment.builder()
            .promptKey(promptKey)
            .fromVersion(fromVersion)
            .toVersion(newVersion)
            .deploymentType(DeploymentType.FULL)
            .operator(operator)
            .reason(reason)
            .deployedAt(new Date())
            .build());
        
        // Evict the cached active version
        clearCache(promptKey);
        
        log.info("Prompt {} published: {} -> {} by {}", promptKey, fromVersion, newVersion, operator);
    }
    
    /**
     * Rolls back to a specified version.
     */
    @Transactional
    public void rollback(String promptKey, String targetVersion, String operator, String reason) {
        // No evaluation-status check here (a historical version was already evaluated),
        // but the target must exist, otherwise the key would be left with no active version
        PromptVersion target = versionMapper.findByKeyAndVersion(promptKey, targetVersion);
        if (target == null) {
            throw new PromptVersionNotFoundException(promptKey, targetVersion);
        }
        
        PromptVersion currentActive = versionMapper.findActiveByKey(promptKey);
        String fromVersion = currentActive != null ? currentActive.getVersion() : null;
        
        if (currentActive != null) {
            versionMapper.setInactive(promptKey, currentActive.getVersion());
        }
        versionMapper.setActive(promptKey, targetVersion);
        
        deploymentMapper.insert(PromptDeployment.builder()
            .promptKey(promptKey)
            .fromVersion(fromVersion)
            .toVersion(targetVersion)
            .deploymentType(DeploymentType.ROLLBACK)
            .operator(operator)
            .reason(reason)
            .deployedAt(new Date())
            .build());
        
        clearCache(promptKey);
        
        log.info("Prompt {} rolled back: {} -> {} by {}", promptKey, fromVersion, targetVersion, operator);
    }
    
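    /**
     * Starts a gray release (the new version is served to only a share of users).
     */
    public void startGray(String promptKey, String grayVersion, int grayPercent, 
                           String operator) {
        if (grayPercent < 0 || grayPercent > 100) {
            throw new IllegalArgumentException("grayPercent must be between 0 and 100: " + grayPercent);
        }
        
        GrayConfig config = GrayConfig.builder()
            .grayVersion(grayVersion)
            .grayPercent(grayPercent)
            .startTime(System.currentTimeMillis())
            .operator(operator)
            .build();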
        
        try {
            redisTemplate.opsForValue().set(GRAY_CONFIG_CACHE + promptKey,
                objectMapper.writeValueAsString(config), Duration.ofDays(30));
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Failed to serialize gray config", e);
        }
        
        deploymentMapper.insert(PromptDeployment.builder()
            .promptKey(promptKey)
            .toVersion(grayVersion)
            .deploymentType(DeploymentType.GRAY)
            .grayPercent(grayPercent)
            .operator(operator)
            .deployedAt(new Date())
            .build());
        
        log.info("Gray started for prompt {}: version={}, percent={}%", promptKey, grayVersion, grayPercent);
    }
    
    private boolean isInGrayGroup(String userId, String promptKey, int percent) {
        // floorMod keeps the bucket in [0, 100); Math.abs(Integer.MIN_VALUE) is still negative
        int bucket = Math.floorMod((userId + promptKey).hashCode(), 100);
        return bucket < percent;
    }
    
    private GrayConfig getGrayConfig(String promptKey) {
        String cached = redisTemplate.opsForValue().get(GRAY_CONFIG_CACHE + promptKey);
        if (cached == null) {
            return null;
        }
        try {
            return objectMapper.readValue(cached, GrayConfig.class);
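        } catch (JsonProcessingException e) {
            // Treat an unreadable config as "no gray release", but leave a trace in the logs
            log.warn("Failed to parse gray config for prompt {}", promptKey, e);
            return null;
        }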
    }
    
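    private void clearCache(String promptKey) {
        // Note: publish() and rollback() call this before their transaction commits, so a concurrent
        // read may re-cache the old version until the 5-minute TTL expires
        redisTemplate.delete(ACTIVE_VERSION_CACHE + promptKey);
    }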
}
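
The percentage bucketing behind isInGrayGroup is easy to try in isolation. Below is a minimal, self-contained sketch; the class name GrayBucketing and its method names are hypothetical, and it assumes String.hashCode is a good enough hash for this purpose. It uses Math.floorMod rather than Math.abs(...) % 100, because Math.abs(Integer.MIN_VALUE) is still negative and would put that user in every gray group.

```java
import java.util.stream.IntStream;

/** Hypothetical helper illustrating deterministic percentage bucketing. */
final class GrayBucketing {

    private GrayBucketing() {}

    /** Maps a (userId, promptKey) pair to a stable bucket in [0, 100). */
    static int bucketOf(String userId, String promptKey) {
        // floorMod always returns a non-negative result for a positive modulus
        return Math.floorMod((userId + promptKey).hashCode(), 100);
    }

    /** A user is in the gray group when their bucket falls below the percentage. */
    static boolean isInGrayGroup(String userId, String promptKey, int percent) {
        return bucketOf(userId, promptKey) < percent;
    }

    public static void main(String[] args) {
        // Count how many of 10,000 synthetic users land in a 10% gray group
        long inGray = IntStream.range(0, 10_000)
            .filter(i -> isInGrayGroup("user-" + i, "contract_analysis", 10))
            .count();
        System.out.println("users in the 10% gray group out of 10000: " + inGray);
    }
}
```

Because the bucket depends only on the user ID and the prompt key, the same user keeps seeing the same version for the whole gray period, and raising the percentage only ever adds users to the gray group.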

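Rendering Prompts with Template Variables

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
@Slf4j
public class PromptRenderer {

    // Matches {{variableName}} placeholders
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{([^{}]+)}}");

    @Autowired
    private PromptVersionService promptVersionService;

    /**
     * Renders a prompt by substituting its template variables.
     */
    public String render(String promptKey, String userId, Map<String, Object> variables) {
        String template = promptVersionService.getPrompt(promptKey, userId);
        return renderTemplate(template, variables);
    }
    
    /**
     * Supports template variables in the {{variableName}} format.
     */
    private String renderTemplate(String template, Map<String, Object> variables) {
        if (variables == null || variables.isEmpty()) {
            return template;
        }
        
        // Single pass over the template, so "{{...}}" inside a substituted value is never re-expanded
        List<String> unreplaced = new ArrayList<>();
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            if (variables.containsKey(name)) {
                Object value = variables.get(name);
                matcher.appendReplacement(result, Matcher.quoteReplacement(value != null ? value.toString() : ""));
            } else {
                // Leave the placeholder in place and remember its name
                unreplaced.add(name);
                matcher.appendReplacement(result, Matcher.quoteReplacement(matcher.group()));
            }
        }
        matcher.appendTail(result);
        
        // Warn about placeholders that had no matching variable
        if (!unreplaced.isEmpty()) {
            log.warn("Template has unreplaced variables: {}", unreplaced);
        }
        
        return result.toString();
    }
}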
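
The variables column in prompt_version stores definitions of the form {name, description, required}, which makes it possible to check the caller's values before rendering. Here is a rough sketch of such a check, assuming the JSON has already been deserialized; TemplateVariableValidator, VariableSpec, and missingRequired are hypothetical names and not part of the service above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical validator for the variable definitions stored with a prompt version. */
final class TemplateVariableValidator {

    /** Mirrors one entry of the variables column: {name, description, required}. */
    static final class VariableSpec {
        final String name;
        final String description;
        final boolean required;

        VariableSpec(String name, String description, boolean required) {
            this.name = name;
            this.description = description;
            this.required = required;
        }
    }

    private TemplateVariableValidator() {}

    /** Returns the names of required variables that are absent or null in the supplied values. */
    static List<String> missingRequired(List<VariableSpec> specs, Map<String, Object> values) {
        List<String> missing = new ArrayList<>();
        for (VariableSpec spec : specs) {
            if (spec.required && (values == null || values.get(spec.name) == null)) {
                missing.add(spec.name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<VariableSpec> specs = Arrays.asList(
            new VariableSpec("contractText", "Full text of the contract", true),
            new VariableSpec("language", "Preferred output language", false));
        Map<String, Object> values = new HashMap<>();
        values.put("language", "en");
        System.out.println("missing required variables: " + missingRequired(specs, values));
    }
}
```

Failing fast on a missing required variable is usually cheaper than sending the model a prompt that still contains a raw placeholder.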

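Prompt Evaluation Framework

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
@Slf4j
public class PromptEvaluationService {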

    @Autowired
    private UnifiedModelService modelService;
    
    @Autowired
    private PromptEvaluationResultMapper resultMapper;
    
    @Autowired
    private PromptVersionMapper versionMapper;

    /**
     * Evaluates a prompt version against a test set.
     */
    public EvaluationReport evaluate(String promptKey, String version, 
                                      List<EvaluationTestCase> testCases) {
        PromptVersion promptVersion = versionMapper.findByKeyAndVersion(promptKey, version);
        if (promptVersion == null) {
            throw new PromptVersionNotFoundException(promptKey, version);
        }
        
        List<EvaluationItem> results = new ArrayList<>();
        int passCount = 0;
        
        for (EvaluationTestCase testCase : testCases) {
            // Run the test case with this version of the prompt
            String renderedPrompt = renderPrompt(promptVersion.getContent(), testCase.getInputVariables());
            
            UnifiedChatRequest request = UnifiedChatRequest.builder()
                .systemPrompt(renderedPrompt)
                .messages(testCase.getMessages())
                .modelConfig(UnifiedChatRequest.UnifiedModelConfig.builder()
                    .temperature(0.0)  // temperature 0 during evaluation makes results more repeatable
                    .maxTokens(2048)
                    .build())
                .build();
            
            UnifiedChatResponse response = modelService.chat(request);
            String actualOutput = response.getContent();
            
            // Score the output (automatic scoring first; otherwise mark it for manual scoring)
            EvaluationScore score = scoreOutput(testCase, actualOutput);
            
            EvaluationItem item = EvaluationItem.builder()
                .testCaseId(testCase.getId())
                .inputData(testCase.getInputVariables())
                .expectedOutput(testCase.getExpectedOutput())
                .actualOutput(actualOutput)
                .score(score.getScore())
                .passed(score.isPassed())
                .comments(score.getComments())
                .build();
            
            results.add(item);
            
            if (score.isPassed()) {
                passCount++;
            }
            
            // Persist the result of this single test case
            resultMapper.insert(buildResultPO(promptKey, version, testCase, actualOutput, score));
        }
        
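        // Guard against an empty test set, which would otherwise produce NaN
        double passRate = testCases.isEmpty() ? 0.0 : (double) passCount / testCases.size();
        
        // Update the evaluation status of the version
        if (passRate >= 0.9) {  // a pass rate of 90% or more is auto-APPROVED
            versionMapper.updateEvaluationStatus(promptKey, version, 
                EvaluationStatus.APPROVED, passRate * 100);
            // SLF4J only understands {} placeholders, so format the number first
            log.info("Prompt {}/{} auto-approved with pass rate {}%", 
                     promptKey, version, String.format("%.1f", passRate * 100));
        } else {
            versionMapper.updateEvaluationStatus(promptKey, version,
                EvaluationStatus.EVALUATED, passRate * 100);
        }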
        
        return EvaluationReport.builder()
            .promptKey(promptKey)
            .version(version)
            .totalCases(testCases.size())
            .passCount(passCount)
            .passRate(passRate)
            .items(results)
            .build();
    }
    
    /**
     * Runs an A/B comparison evaluation of two versions.
     */
    public ABComparisonReport compareVersions(String promptKey, String versionA, 
                                               String versionB, List<EvaluationTestCase> testCases) {
        EvaluationReport reportA = evaluate(promptKey, versionA, testCases);
        EvaluationReport reportB = evaluate(promptKey, versionB, testCases);
        
        return ABComparisonReport.builder()
            .promptKey(promptKey)
            .versionA(versionA)
            .versionB(versionB)
            .passRateA(reportA.getPassRate())
            .passRateB(reportB.getPassRate())
            .winner(reportA.getPassRate() >= reportB.getPassRate() ? versionA : versionB)
            .improvement((reportB.getPassRate() - reportA.getPassRate()) * 100)
            .recommendation(makeRecommendation(reportA, reportB))
            .build();
    }
    
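    private EvaluationScore scoreOutput(EvaluationTestCase testCase, String actualOutput) {
        if (testCase.getExpectedOutput() == null) {
            // No expected output: mark the case as needing manual scoring
            return EvaluationScore.builder()
                .score(0)
                .passed(false)
                .comments("Requires manual evaluation")
                .build();
        }
        
        EvaluationMethod method = testCase.getEvaluationMethod();
        if (method == null || actualOutput == null) {
            // Switching on a null enum would throw a NullPointerException
            return EvaluationScore.builder()
                .score(0)
                .passed(false)
                .comments("Missing evaluation method or model output")
                .build();
        }
        
        switch (method) {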
            case EXACT_MATCH:
                boolean exactMatch = actualOutput.trim().equals(testCase.getExpectedOutput().trim());
                return EvaluationScore.builder()
                    .score(exactMatch ? 100 : 0)
                    .passed(exactMatch)
                    .build();
                
            case CONTAINS_KEYWORDS:
                List<String> keywords = testCase.getRequiredKeywords();
                long matchedCount = keywords.stream()
                    .filter(kw -> actualOutput.contains(kw))
                    .count();
                double score = (double) matchedCount / keywords.size() * 100;
                return EvaluationScore.builder()
                    .score(score)
                    .passed(score >= 80)
                    .build();
                
            case JSON_FIELD_CHECK:
                return scoreJsonOutput(actualOutput, testCase.getRequiredJsonFields());
                
            default:
                return EvaluationScore.builder()
                    .score(0)
                    .passed(false)
                    .comments("Unknown evaluation method")
                    .build();
        }
    }
}
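
compareVersions calls a makeRecommendation helper that is not shown here. Purely as an illustration, the sketch below shows one way a pass-rate comparison could be turned into a recommendation; the class AbRecommendation, the minGainPoints threshold, and the three result strings are all assumptions of mine, not the actual helper.

```java
/** Hypothetical decision rule for comparing two prompt versions by pass rate. */
final class AbRecommendation {

    private AbRecommendation() {}

    /** Improvement of B over A in percentage points (pass rates are fractions in [0, 1]). */
    static double improvementPoints(double passRateA, double passRateB) {
        return (passRateB - passRateA) * 100;
    }

    /**
     * Recommends a version only when the gap is at least minGainPoints;
     * smaller differences are treated as noise, which matters for small test sets.
     */
    static String recommend(double passRateA, double passRateB, double minGainPoints) {
        double gain = improvementPoints(passRateA, passRateB);
        if (gain >= minGainPoints) {
            return "SWITCH_TO_B";
        }
        if (gain <= -minGainPoints) {
            return "KEEP_A";
        }
        return "NO_CLEAR_WINNER";
    }

    public static void main(String[] args) {
        System.out.println(recommend(0.80, 0.90, 5.0));
        System.out.println(recommend(0.90, 0.91, 5.0));
    }
}
```

With a test set of only a few dozen cases, one flipped case moves the pass rate by several points, so some threshold like this is worth having before declaring a winner.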

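Best Practices for Prompt Version Management

Version Naming Convention

We use semantic version numbers:

v1.0.0: initial version
v1.0.1: minor fixes (typos, wording tweaks that do not affect the output format)
v1.1.0: enhancements (new extracted fields, adjusted output structure)
v2.0.0: major rewrite (significant change to the output format or core logic; downstream consumers must adapt)

Every new version must fill in change_notes, explaining what changed and why.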
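
Since the version column is a plain string, comparing versions as text goes wrong as soon as a component reaches two digits ("v1.10.0" sorts before "v1.9.0"). The following is a small sketch of parsing and comparing the vMAJOR.MINOR.PATCH format numerically; PromptSemVer and isBreakingComparedTo are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type for prompt versions of the form vMAJOR.MINOR.PATCH. */
final class PromptSemVer implements Comparable<PromptSemVer> {

    private static final Pattern FORMAT = Pattern.compile("v(\\d+)\\.(\\d+)\\.(\\d+)");

    final int major;
    final int minor;
    final int patch;

    private PromptSemVer(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    /** Parses text such as "v1.2.3"; rejects anything that does not match the format. */
    static PromptSemVer parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a vMAJOR.MINOR.PATCH version: " + text);
        }
        return new PromptSemVer(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)));
    }

    /** A major bump signals an output change that downstream consumers must adapt to. */
    boolean isBreakingComparedTo(PromptSemVer previous) {
        return major > previous.major;
    }

    @Override
    public int compareTo(PromptSemVer other) {
        if (major != other.major) {
            return Integer.compare(major, other.major);
        }
        if (minor != other.minor) {
            return Integer.compare(minor, other.minor);
        }
        return Integer.compare(patch, other.patch);
    }

    public static void main(String[] args) {
        PromptSemVer older = parse("v1.9.0");
        PromptSemVer newer = parse("v1.10.0");
        System.out.println("v1.10.0 is newer than v1.9.0: " + (newer.compareTo(older) > 0));
        System.out.println("v2.0.0 breaks v1.10.0: " + parse("v2.0.0").isBreakingComparedTo(newer));
    }
}
```

A check like isBreakingComparedTo could gate publishing, for example by requiring an extra sign-off from downstream owners whenever the major number goes up.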

Maintaining the Evaluation Test Set

Every prompt needs a corresponding test set, and the test set must be maintained continuously:

// Example test case definition
import lombok.Data;

import java.util.List;
import java.util.Map;

@Data
public class EvaluationTestCase {
    private String id;
    private String description;     // which scenario this case tests
    private Map<String, Object> inputVariables;  // prompt variables
    private List<UnifiedChatRequest.UnifiedMessage> messages;  // conversation messages
    private String expectedOutput;  // expected output (may be null, meaning manual judgment is needed)
    private EvaluationMethod evaluationMethod;  // how the output is evaluated
    private List<String> requiredKeywords;  // keywords that must appear (CONTAINS_KEYWORDS)
    private List<String> requiredJsonFields;  // JSON fields that must be present (JSON_FIELD_CHECK)
    private String category;        // category (normal / boundary / error scenario)
}

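Test cases should cover normal scenarios, boundary scenarios (very long input, special characters), and error scenarios (incomplete input, malformed input).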
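
That coverage requirement can be checked mechanically before an evaluation run. A minimal sketch follows, assuming the category is modeled as an enum rather than the free-form String used in EvaluationTestCase; TestSetCoverage, Category, and missingCategories are hypothetical names.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical check that a test set covers every scenario category. */
final class TestSetCoverage {

    /** Assumed categories: normal, boundary, and error scenarios. */
    enum Category { NORMAL, BOUNDARY, ERROR }

    private TestSetCoverage() {}

    /** Returns the categories for which the test set contains no case at all. */
    static Set<Category> missingCategories(List<Category> caseCategories) {
        Set<Category> missing = EnumSet.allOf(Category.class);
        missing.removeAll(caseCategories);
        return missing;
    }

    public static void main(String[] args) {
        List<Category> present = Arrays.asList(Category.NORMAL, Category.NORMAL, Category.BOUNDARY);
        System.out.println("categories with no test case: " + missingCategories(present));
    }
}
```

Refusing to evaluate a version whose test set has an empty category keeps a 90% pass rate from meaning "90% of the easy cases".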

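Relationship to Code Version Control

Prompt versions live in the database, but each prompt's definition (its prompt_key and description) should also have a corresponding constant in code: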

public class PromptKeys {
    public static final String CONTRACT_ANALYSIS = "contract_analysis";
    public static final String SMART_REPLY = "smart_reply";
    public static final String REPORT_SUMMARY = "report_summary";
}

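The benefit: the constants in code make it clear which prompts exist, while the actual content and version switching do not depend on a code release.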

The Value of Engineering Your Prompt Management

Compare "prompts running naked in code" with "a prompt version management system":

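Scenario	Naked in code	Version management
Adjusting a prompt	Change code + release, a few hours to a day	Create a version + publish, a few minutes
Rolling back after a problem	Change code + emergency release	Click the rollback button, effective immediately
Validating a new version	The effect is known only after full rollout	Gray release to 10% first, go full once there is data
Tracing history	Dig through Git history, where the change may be buried in an unrelated commit	Complete history in the database
Comparing versions	Nearly impossible	Supported by the A/B testing framework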

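When I first built this system, it felt like a bit of a hassle. But after using it, I am glad to have it every time a prompt changes, because I can return to any historical version at any time and no longer worry that a single change will ruin the experience for every user.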