Post 1775: Prompt Caching in Depth: Putting the Anthropic and OpenAI Caching Mechanisms to Engineering Use

Lao Zhang · 2026/4/30 · about 9 min read

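There is one kind of cost-optimization opportunity that a lot of people overlook: prompt caching.

In LLM billing, input tokens are usually cheaper per token than output tokens, but there are usually far more of them. That is especially true for workloads with a very long system prompt, large reference documents, or multi-turn conversations: the repeated input is billed in full on every call, and that is where there is real room to save.

I ran a test on a RAG application: the system prompt plus retrieved documents came to about 8,000 tokens, and each user question to about 200 tokens. If every call is computed from scratch, roughly 97.5% of the input cost (8,000 of 8,200 tokens) comes from the fixed content. If the fixed content is cached, that 97.5% is billed at the discounted cache-read rate instead of at full price on every call.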
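The share of fixed content is easy to check with a few lines of arithmetic. The sketch below uses the token counts from the test above; the class and method names are hypothetical, and the 10% cache-read rate passed in `main` is only an illustrative input.

```java
// Hypothetical helper: share of input tokens that is fixed, and the effective
// input cost when the fixed part is billed at a discounted cache-read rate.
final class FixedShareDemo {

    /** Fraction of input tokens that belongs to the fixed prefix. */
    static double fixedShare(int fixedTokens, int dynamicTokens) {
        return (double) fixedTokens / (fixedTokens + dynamicTokens);
    }

    /**
     * Input cost relative to an uncached call (1.0 = full price),
     * assuming the fixed part is read from cache at cacheReadRate.
     */
    static double relativeCost(int fixedTokens, int dynamicTokens, double cacheReadRate) {
        double billed = fixedTokens * cacheReadRate + dynamicTokens;
        return billed / (fixedTokens + dynamicTokens);
    }

    public static void main(String[] args) {
        System.out.printf("fixed share: %.1f%%%n", fixedShare(8000, 200) * 100);
        System.out.printf("relative cost at a 10%% read rate: %.1f%%%n",
                relativeCost(8000, 200, 0.10) * 100);
    }
}
```

With 8,000 fixed and 200 dynamic tokens, a 10% read rate bills 800 + 200 token-equivalents out of 8,200, so the input cost of a cache-hit call drops to roughly an eighth of the uncached cost.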

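In this post I dig into the caching mechanisms of Anthropic and OpenAI and how to use them well in engineering practice.

Comparing the two caching mechanisms

Let's get the basics straight first; the two mechanisms differ noticeably.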

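Anthropic Prompt Caching (Claude)

Content to be cached must be marked explicitly with the cache_control field
On a cache hit, cached tokens are billed at 10% of the base input price (a 90% discount)
The first write to the cache costs extra (about 1.25x the base price), but later hits cost only 10%
Cache time-to-live (TTL): 5 minutes by default, refreshed each time the cached content is used; it can be extended to 1 hour at a higher cache-write price
Up to 4 cache breakpoints are supported per request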

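OpenAI Prompt Caching

Caching is automatic; no extra markup is needed and it is transparent to developers
The prompt prefix must reach 1024 tokens before anything is cached
On a cache hit, the cached part of the input is billed at 50% of the normal input price (the exact discount can differ by model)
Cache TTL: roughly 5-10 minutes, not configurable
Hit condition: the request's prompt prefix must be exactly identical to that of an earlier request (a byte-for-byte match)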
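To see how the multipliers listed above play out over a burst of requests, here is a minimal sketch. It is a simplified model, not a billing calculator: it only prices the shared prefix, assumes every request after the first lands inside the TTL, and uses hypothetical class and method names.

```java
// Hypothetical cost model built from the multipliers listed above.
// Costs are in units of "one uncached pass over the fixed prefix".
final class CacheCostModel {

    static final double ANTHROPIC_WRITE = 1.25; // first request writes the cache
    static final double ANTHROPIC_READ = 0.10;  // later requests read it
    static final double OPENAI_READ = 0.50;     // cached prefix billed at 50%

    /** Prefix cost of n back-to-back requests with no caching at all. */
    static double uncached(int n) {
        return n;
    }

    /** Anthropic: one cache write, then n - 1 cache reads (all within the TTL). */
    static double anthropic(int n) {
        return n == 0 ? 0.0 : ANTHROPIC_WRITE + (n - 1) * ANTHROPIC_READ;
    }

    /** OpenAI: first request at full price, then n - 1 requests at the cached rate. */
    static double openai(int n) {
        return n == 0 ? 0.0 : 1.0 + (n - 1) * OPENAI_READ;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 10, 100}) {
            System.out.printf("n=%d uncached=%.2f anthropic=%.2f openai=%.2f%n",
                    n, uncached(n), anthropic(n), openai(n));
        }
    }
}
```

Under these assumptions a single Anthropic request costs more than an uncached one (1.25 vs 1.0), but two requests within the TTL already cost less (1.35 vs 2.0).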

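Using Anthropic caching in practice

Anthropic's caching has to be marked explicitly. That gives you more flexibility, but it also takes more engineering effort.

Basic usage: caching the system prompt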

@Service
@Slf4j
public class AnthropicCachedChatService {
    
    @Autowired
    private AnthropicClient client;
    
    /**
     * Chat call with caching.
     * The system prompt is fixed, so it gets cached.
     */
    public ChatResponse chatWithCachedSystem(String systemPrompt, List<Message> history, String userMessage) {
        
        // Mark the system block for caching
        List<Map<String, Object>> systemContent = List.of(
            Map.of(
                "type", "text",
                "text", systemPrompt,
                "cache_control", Map.of("type", "ephemeral")  // mark as cacheable
            )
        );
        
        // Build the message list
        List<Map<String, Object>> messages = new ArrayList<>();
        
        // Add the history messages
        for (Message msg : history) {
            messages.add(Map.of(
                "role", msg.getRole(),
                "content", msg.getContent()
            ));
        }
        
        // Add the current user message
        messages.add(Map.of("role", "user", "content", userMessage));
        
        Map<String, Object> requestBody = Map.of(
            "model", "claude-3-5-sonnet-20241022",
            "max_tokens", 4096,
            "system", systemContent,
            "messages", messages
        );
        
        AnthropicResponse response = client.messages(requestBody);
        
        // Log cache usage
        logCacheStats(response.getUsage());
        
        return ChatResponse.from(response);
    }
    
    private void logCacheStats(TokenUsage usage) {
        log.info("Token usage - input: {}, output: {}, cache_creation: {}, cache_read: {}",
            usage.getInputTokens(),
            usage.getOutputTokens(),
            usage.getCacheCreationInputTokens(),
            usage.getCacheReadInputTokens()
        );
        
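        // Estimate the saving from cache reads.
        // input_tokens excludes cached tokens, so total input is the sum of all three counters.
        if (usage.getCacheReadInputTokens() > 0) {
            long totalInput = (long) usage.getInputTokens()
                + usage.getCacheCreationInputTokens()
                + usage.getCacheReadInputTokens();
            double cachedPercent = (double) usage.getCacheReadInputTokens() / totalInput * 100;
            // Cache reads are billed at 10%, so each cached token saves 90%
            // (the 25% write premium on cache_creation tokens is not counted here)
            log.info("Cache saved about {}% of input cost", String.format("%.1f", cachedPercent * 0.9));
        }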
    }
}

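Advanced usage: caching retrieved documents in a RAG scenario

RAG is the scenario best suited to caching. The system prompt plus the retrieved documents form the fixed part, and the user question is the variable part.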

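@Service
@Slf4j
public class RAGWithCacheService {
    
    @Autowired
    private AnthropicClient client;
    
    // Claude Sonnet models do not cache prefixes shorter than 1024 tokens,
    // so smaller document blocks are not worth marking
    private static final int CACHE_THRESHOLD_TOKENS = 1024;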
    
    /**
     * RAG query that caches the retrieved documents
     */
    public RAGResponse queryWithCache(String question, List<Document> retrievedDocs) {
        // Build the content blocks with cache markers
        List<Map<String, Object>> systemContent = buildCachedSystemContent(retrievedDocs);
        
        Map<String, Object> request = Map.of(
            "model", "claude-3-5-sonnet-20241022",
            "max_tokens", 2048,
            "system", systemContent,
            "messages", List.of(
                Map.of("role", "user", "content", question)
            )
        );
        
        AnthropicResponse response = client.messages(request);
        
        return RAGResponse.builder()
            .answer(extractAnswer(response))
            .cacheStats(buildCacheStats(response.getUsage()))
            .build();
    }
    
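    /**
     * Build the system content with cache markers.
     * 
     * Core trick: put unchanging content first and mark it for caching; put changing content after it.
     */
    private List<Map<String, Object>> buildCachedSystemContent(List<Document> docs) {
        List<Map<String, Object>> content = new ArrayList<>();
        
        // 1. Base instructions (completely fixed, the best candidate for caching).
        //    A block this short is only cached as part of a prefix that reaches the model's minimum cacheable length.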
        content.add(Map.of(
            "type", "text",
            "text", getBaseSystemPrompt(),
            "cache_control", Map.of("type", "ephemeral")
        ));
        
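        // 2. Retrieved documents (may differ per query, but similar queries often retrieve the same ones)
        // Only mark them for caching when they exceed the threshold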
        String docsText = buildDocsText(docs);
        int docTokenCount = estimateTokenCount(docsText);
        
        if (docTokenCount >= CACHE_THRESHOLD_TOKENS) {
            content.add(Map.of(
                "type", "text",
                "text", docsText,
                "cache_control", Map.of("type", "ephemeral")
            ));
        } else {
            // Too few documents to be worth caching; add them unmarked
            content.add(Map.of("type", "text", "text", docsText));
        }
        
        return content;
    }
    
    private String getBaseSystemPrompt() {
        return """
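            You are a professional question-answering assistant. Answer the user's question based on the reference documents provided.
            
            Requirements:
            1. Answer only from the provided documents; do not use outside knowledge
            2. If the documents contain no relevant information, tell the user so explicitly
            3. When citing a specific passage, name its source
            4. Keep answers concise and accurate; avoid redundancy
            5. Quote numbers and dates exactly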
            """;
    }
    
    private int estimateTokenCount(String text) {
        // Rough estimate: about 1.5 Chinese characters per token, about 4 other characters per token
        long chineseChars = text.chars().filter(c -> c >= 0x4E00 && c <= 0x9FFF).count();
        long otherChars = text.length() - chineseChars;
        return (int) (chineseChars / 1.5 + otherChars / 4.0);
    }
}

Multi-turn conversations: caching history messages

In long conversations, the history messages can be cached too.

@Service
public class LongConversationCacheService {
    
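    /**
     * Multi-turn chat with caching applied to the history messages.
     * 
     * Key point: a cache breakpoint only helps if everything up to it is identical across requests.
     * This example places breakpoints at fixed positions in the history (after every 10th message),
     * so each breakpoint keeps pointing at the same prefix as the conversation grows.
     */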
    public ChatResponse multiTurnChat(String sessionId, String newMessage) {
        List<Message> history = sessionStore.getHistory(sessionId);
        
        List<Map<String, Object>> messages = new ArrayList<>();
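        int historySize = history.size();
        // At most 4 breakpoints are allowed per request and the system prompt uses one,
        // so only the 3 most recent message boundaries are marked.
        // lastBoundary is the largest i with i % 10 == 9 that is not the final history message.
        int lastBoundary = ((historySize - 1) / 10) * 10 - 1;
        
        for (int i = 0; i < historySize; i++) {
            Message msg = history.get(i);
            Map<String, Object> msgMap = new LinkedHashMap<>();
            msgMap.put("role", msg.getRole());
            
            // Put a breakpoint at stable positions: after every 10th message
            if (i % 10 == 9 && i <= lastBoundary && i > lastBoundary - 30) {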
                msgMap.put("content", List.of(
                    Map.of(
                        "type", "text",
                        "text", msg.getContent(),
                        "cache_control", Map.of("type", "ephemeral")
                    )
                ));
            } else {
                msgMap.put("content", msg.getContent());
            }
            
            messages.add(msgMap);
        }
        
        // The new message itself is not marked for caching
        messages.add(Map.of("role", "user", "content", newMessage));
        
        AnthropicResponse response = client.messages(Map.of(
            "model", "claude-3-5-sonnet-20241022",
            "max_tokens", 2048,
            "system", getCachedSystemPrompt(),
            "messages", messages
        ));
        
        // Save to the session history
        sessionStore.addMessage(sessionId, new Message("user", newMessage));
        sessionStore.addMessage(sessionId, new Message("assistant", extractAnswer(response)));
        
        return ChatResponse.from(response);
    }
}
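The breakpoint rule (a breakpoint after every 10th history message, never on the latest one) has to coexist with the limit of 4 breakpoints per request. The sketch below isolates that selection logic so it can be tested on its own; `BreakpointPlanner` and its parameters are hypothetical names, and reserving one breakpoint for the system prompt is an assumption about how the request is built.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: choose which history indices get a cache breakpoint.
final class BreakpointPlanner {

    /**
     * Returns the indices (ascending) of history messages to mark.
     * Candidates are the last message of every full block of `interval` messages,
     * excluding the final history message; only the newest `maxBreakpoints` are kept.
     */
    static List<Integer> plan(int historySize, int interval, int maxBreakpoints) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = interval - 1; i < historySize - 1; i += interval) {
            candidates.add(i);
        }
        int from = Math.max(0, candidates.size() - maxBreakpoints);
        return new ArrayList<>(candidates.subList(from, candidates.size()));
    }

    public static void main(String[] args) {
        // 3 message breakpoints leaves 1 of the 4 for the system prompt
        System.out.println(plan(45, 10, 3));
    }
}
```

For a 45-message history the candidates are indices 9, 19, 29 and 39, and only the newest three are kept, which keeps the request within the limit once the system prompt's breakpoint is counted.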

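Engineering with OpenAI caching

OpenAI's caching is automatic, but you need to understand its rules to make it take effect.

Core rule: the prompt prefix must be identical byte for byte in order to hit the cache

That means: if your system prompt, or the beginning of the prompt, is fixed and the end is dynamic, caching works. Conversely, if the dynamic content sits at the beginning, caching is completely useless.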

@Service
public class OpenAICacheOptimizedService {
    
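    /**
     * Cache-friendly prompt construction.
     * 
     * Key principle: fixed content first, dynamic content last.
     */
    public String buildCacheFriendlyPrompt(String systemPrompt, String context, String question) {
        // ✅ Correct: the fixed system prompt comes first and the dynamic question last
        // system message: [fixed system prompt] - this part gets cached
        // user message: [context] + [question] - if the context is also fixed, put it first
        
        // If the context is large and identical across requests, build the user message as:
        // "Background information:\n" + context + "\n\nUser question: " + question
        // so that the context part can be cached as well
        
        return "Background information:\n" + context + "\n\nUser question: " + question;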
    }
    
    /**
     * Detect whether the cache took effect.
     * OpenAI reports cache information in the usage object of the response body.
     */
    public CacheInfo detectCacheHit(OpenAIResponse response) {
        TokenUsage usage = response.getUsage();
        
        // The API returns prompt_tokens_details.cached_tokens; treat a missing value as 0
        // (mixing Integer and int in one ternary would unbox a null and throw)
        Integer rawCached = usage.getPromptTokensDetails() != null 
            ? usage.getPromptTokensDetails().getCachedTokens() 
            : null;
        int cachedTokens = rawCached != null ? rawCached : 0;
        
        boolean cacheHit = cachedTokens > 0;
        
        return CacheInfo.builder()
            .hit(cacheHit)
            .cachedTokens(cachedTokens)
            .totalInputTokens(usage.getPromptTokens())
            .cacheHitRate(usage.getPromptTokens() > 0 
                ? (double) cachedTokens / usage.getPromptTokens() 
                : 0.0)
            .build();
    }
    
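    /**
     * Conversation design for a high cache hit rate.
     * 
     * Anti-pattern: inserting a timestamp into the prompt on every call (the cache then never hits)
     * Correct pattern: put the time and other dynamic information at the end of the messages
     */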
    public List<Map<String, String>> buildCacheFriendlyMessages(
            String baseSystemPrompt,
            List<ConversationTurn> history,
            String newQuestion) {
        
        List<Map<String, String>> messages = new ArrayList<>();
        
        // Keep the system prompt completely fixed; do not insert any dynamic content
        messages.add(Map.of(
            "role", "system",
            "content", baseSystemPrompt  // never add the time, user name, or other dynamic data here
        ));
        
        // History messages (the earlier ones never change, which helps caching)
        for (ConversationTurn turn : history) {
            messages.add(Map.of("role", "user", "content", turn.getUserMessage()));
            messages.add(Map.of("role", "assistant", "content", turn.getAssistantMessage()));
        }
        
        // The new question goes last (dynamic content)
        messages.add(Map.of("role", "user", "content", newQuestion));
        
        return messages;
    }
}
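When a cache refuses to hit, a quick way to check the prefix rule is to diff two consecutive prompts and see where they first diverge. The helper below is a hypothetical debugging sketch: it compares characters rather than tokens, so it only approximates what the provider sees, but a divergence in the first few characters is a reliable sign that dynamic content has leaked into the prefix.

```java
// Hypothetical debugging helper: how much of two prompts is shared as a prefix.
final class PrefixCheck {

    /** Length, in chars, of the longest common prefix of a and b. */
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String fixed = "You are a helpful assistant. Answer using the reference documents.\n";

        // Anti-pattern: a timestamp at the very start changes the prefix on every call
        String bad1 = "[2026-04-30 10:00:01] " + fixed + "Question: A";
        String bad2 = "[2026-04-30 10:00:02] " + fixed + "Question: B";

        // Cache-friendly: fixed content first, dynamic content last
        String good1 = fixed + "Question: A";
        String good2 = fixed + "Question: B";

        System.out.println("bad shared prefix:  " + commonPrefixLength(bad1, bad2));
        System.out.println("good shared prefix: " + commonPrefixLength(good1, good2));
    }
}
```

In the anti-pattern pair the shared prefix ends inside the timestamp, while in the cache-friendly pair it covers the whole fixed block.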

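Monitoring and optimizing the cache hit rate

Turning caching on is not enough; you need to know how well it is working before you can keep optimizing.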

@Service
@Slf4j
@RequiredArgsConstructor
public class CacheEfficiencyTracker {
    
    private final MeterRegistry meterRegistry;
    
    /**
     * Record Anthropic cache metrics
     */
    public void trackAnthropicCache(String featureCode, AnthropicUsage usage) {
        int inputTokens = usage.getInputTokens();
        int cacheCreation = usage.getCacheCreationInputTokens();
        int cacheRead = usage.getCacheReadInputTokens();
        
        // Cache hit rate: input_tokens excludes cached tokens, so total input is the sum of all three
        long totalInput = (long) inputTokens + cacheCreation + cacheRead;
        double hitRate = totalInput > 0 
            ? (double) cacheRead / totalInput
            : 0.0;
        
        // A gauge registered with a boxed value only holds a weak reference and is not updated
        // by later calls, so the per-request hit rate is recorded in a distribution summary
        meterRegistry.summary("ai.cache.hit_rate",
            Tags.of("provider", "anthropic", "feature", featureCode)
        ).record(hitRate);
        
        // Token-equivalents saved (a cache read is 90% cheaper than regular input)
        double savedTokens = cacheRead * 0.9;
        meterRegistry.counter("ai.cache.saved_tokens",
            Tags.of("provider", "anthropic", "feature", featureCode)
        ).increment(savedTokens);
        
        log.debug("Anthropic cache: feature={}, input={}, cacheCreate={}, cacheRead={}, hitRate={}%",
            featureCode, inputTokens, cacheCreation, cacheRead,
            String.format("%.1f", hitRate * 100));
    }
    
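    /**
     * Compute the cost savings from caching (for the monthly report)
     */
    public CacheSavingsReport generateMonthlySavingsReport(String featureCode) {
        // Load the month's cache statistics from the database
        MonthlyStats stats = statsRepository.getMonthlyStats(featureCode);
        
        // Anthropic billing rules:
        // cache write: base price * 1.25
        // cache read:  base price * 0.10
        // regular input (the baseline if nothing were cached): base price * 1.0
        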
        BigDecimal normalCost = new BigDecimal(stats.getTotalInputTokens())
            .multiply(ANTHROPIC_INPUT_PRICE_PER_K)
            .divide(new BigDecimal(1000), 6, RoundingMode.HALF_UP);
        
        BigDecimal actualCost = new BigDecimal(stats.getNonCacheInputTokens())
            .multiply(ANTHROPIC_INPUT_PRICE_PER_K)
            .divide(new BigDecimal(1000), 6, RoundingMode.HALF_UP)
            .add(
                new BigDecimal(stats.getCacheCreationTokens())
                    .multiply(ANTHROPIC_INPUT_PRICE_PER_K.multiply(new BigDecimal("1.25")))
                    .divide(new BigDecimal(1000), 6, RoundingMode.HALF_UP)
            )
            .add(
                new BigDecimal(stats.getCacheReadTokens())
                    .multiply(ANTHROPIC_INPUT_PRICE_PER_K.multiply(new BigDecimal("0.1")))
                    .divide(new BigDecimal(1000), 6, RoundingMode.HALF_UP)
            );
        
        BigDecimal savings = normalCost.subtract(actualCost);
        
        return CacheSavingsReport.builder()
            .featureCode(featureCode)
            .normalCostUsd(normalCost)
            .actualCostUsd(actualCost)
            .savingsUsd(savings)
            .savingsPercent(normalCost.signum() == 0
                ? BigDecimal.ZERO  // no traffic this month; avoid dividing by zero
                : savings.divide(normalCost, 4, RoundingMode.HALF_UP)
                    .multiply(new BigDecimal(100)))
            .cacheHitRate(stats.getCacheHitRate())
            .build();
    }
}
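For a quick look at the hit rate without a metrics backend, the same counters can be accumulated in memory. This is a minimal sketch with hypothetical names; the three arguments mirror Anthropic's input, cache-creation and cache-read usage counters, and the hit rate is token-weighted rather than per-request.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory accumulator; field names mirror Anthropic's usage counters.
final class CacheHitAccumulator {

    private final LongAdder uncachedInput = new LongAdder();
    private final LongAdder cacheCreation = new LongAdder();
    private final LongAdder cacheRead = new LongAdder();

    /** Add the usage counters of one response. */
    void record(long inputTokens, long cacheCreationTokens, long cacheReadTokens) {
        uncachedInput.add(inputTokens);
        cacheCreation.add(cacheCreationTokens);
        cacheRead.add(cacheReadTokens);
    }

    /** Token-weighted hit rate: cache reads divided by all input tokens seen so far. */
    double hitRate() {
        long total = uncachedInput.sum() + cacheCreation.sum() + cacheRead.sum();
        return total == 0 ? 0.0 : (double) cacheRead.sum() / total;
    }

    public static void main(String[] args) {
        CacheHitAccumulator acc = new CacheHitAccumulator();
        acc.record(200, 8000, 0);   // first request writes the cache
        acc.record(200, 0, 8000);   // second request reads it
        System.out.printf("hit rate: %.3f%n", acc.hitRate());
    }
}
```

`LongAdder` is used so that concurrent request threads can record usage without contending on a single lock.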

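A few points that are easy to get wrong

The Anthropic cache TTL. The default of 5 minutes is very short. If the gap between a user's requests exceeds 5 minutes (say the user thinks for a while before sending the next question), the cache has expired and the next request has to write it again (and pay the extra 25% write fee). In that situation caching is not necessarily worth it, so do the math.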
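"Doing the math" here comes down to one number: how often a request finds the cache still warm. The sketch below is a simplified model under stated assumptions: every miss pays the 1.25x write price, every hit pays the 0.10x read price, and only the cached prefix is priced. The class and method names are hypothetical.

```java
// Hypothetical break-even model for Anthropic's 1.25x write / 0.10x read pricing.
final class CacheBreakEven {

    static final double WRITE = 1.25;
    static final double READ = 0.10;

    /**
     * Expected cost of the cached prefix per request, relative to 1.0 for an uncached call,
     * when a request finds the cache still warm with probability pHit.
     */
    static double expectedCost(double pHit) {
        return pHit * READ + (1 - pHit) * WRITE;
    }

    /** Hit probability at which caching costs exactly as much as not caching. */
    static double breakEvenHitProbability() {
        return (WRITE - 1.0) / (WRITE - READ);
    }

    public static void main(String[] args) {
        System.out.printf("break-even hit probability: %.3f%n", breakEvenHitProbability());
        System.out.printf("expected cost at 50%% hits: %.3f%n", expectedCost(0.5));
    }
}
```

Solving p * 0.10 + (1 - p) * 1.25 = 1 gives p = 0.25 / 1.15, a little under 22%: in this model, caching pays off once more than about one request in five lands on a warm cache.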

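OpenAI's 1024-token threshold. If your system prompt (together with any other fixed content at the start of the prompt) is under 1024 tokens, OpenAI will not cache it at all. I have seen people pad the system prompt with masses of whitespace to reach the threshold. That is a mistake: whitespace is tokenized and billed too, so it just wastes money. The right approach is to flesh out the system prompt with useful content so that it passes 1024 tokens naturally.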

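Cache hit rate depends heavily on request frequency. If a feature is called only a few times a day, the cache will almost never hit, because no request with the same prefix arrives within the TTL. Caching suits high-frequency workloads with a fixed prefix. When you assess the value of caching, work out the request frequency first.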
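The link between request frequency and hit rate can be made rough-and-ready quantitative. The sketch below assumes that requests sharing a prefix arrive as a Poisson process and that every request resets the TTL; both are modelling assumptions, not provider guarantees, and the names are hypothetical.

```java
// Hypothetical estimate, assuming requests with the same prefix arrive as a Poisson process.
final class WarmCacheEstimate {

    /**
     * Probability that the next request arrives within the TTL of the previous one,
     * i.e. finds the cache still warm: 1 - exp(-rate * ttl).
     */
    static double warmProbability(double requestsPerHour, double ttlMinutes) {
        double ratePerMinute = requestsPerHour / 60.0;
        return 1.0 - Math.exp(-ratePerMinute * ttlMinutes);
    }

    public static void main(String[] args) {
        double ttl = 5.0;
        for (double perHour : new double[] {0.2, 2, 12, 60}) {
            System.out.printf("%.1f req/h -> warm probability %.3f%n",
                    perHour, warmProbability(perHour, ttl));
        }
    }
}
```

Under this model a feature called a handful of times per day sits at a warm probability of a few percent with a 5-minute TTL, while one called every minute is warm almost all the time.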

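Caches are not shared across tenants' accounts. Both OpenAI and Anthropic isolate caches per account (at the organization level, according to their documentation) rather than sharing them globally. If you have 100 tenants that each send the same request through their own credentials, their caches are independent and are not reused across tenants. Take this into account when designing a multi-tenant architecture.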

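Which scenarios benefit most from caching

To wrap up, here are the scenarios ordered from highest to lowest caching benefit:

RAG question answering: large fixed reference documents plus a dynamic question. The benefit is very high, and the larger the documents, the more it shows.
Conversational systems: a fixed persona and a long system prompt. Every message reuses a large amount of cached content.
Few-shot learning: a fixed set of examples plus new content to process. The more examples, the greater the benefit.
Document processing: analyze the whole document first (caching its content), then answer multiple questions about it.
Report generation: a fixed description of the data format plus dynamic data. The benefit is significant for daily batch generation.

Caching is low-hanging fruit: the code changes are small, but the cost savings can be substantial. If your AI application has an obviously fixed prefix, go and check your cache hit rate today.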