Cache Architecture for AI Applications: Multi-Level Caching That Makes AI Responses 10x Faster

Lao Zhang · 2026/4/30 · about 20 min read

date: 2026-09-19 tags: [caching, Redis, Caffeine, multi-level cache, Spring AI, Java]

Opening story: the lesson of burning 4.7 million tokens a day

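Wang Lei is the technical lead at a SaaS company. His team built an enterprise knowledge Q&A product on top of GPT-4o.

The token bill for the first month after launch made his vision go dark: 4.7 million tokens a day, at a cost of ¥2,800 a day.

"We have 300 enterprise customers and burn 4.7 million tokens a day. That is roughly 16,000 tokens per customer per day, which feels like a lot." Wang Lei started digging.

He pulled the logs and found a striking pattern:

"What is your refund policy?": asked 43 times a day, and the AI recomputed the answer every time
"How do I export a data report?": asked 61 times a day, also recomputed every time
"Which file formats can I upload?": asked 38 times a day, and the answer had never changed

About 45% of each day's requests were identical or near-identical questions, and every one of them triggered a fresh GPT-4o call.

At the prices of the time, that was more than ¥1,260 of wasted token spend per day, or about ¥460,000 a year.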

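Wang Lei spent two weeks designing and building a multi-level cache: an L1 local cache (Caffeine), an L2 distributed cache (Redis), and a semantic-similarity cache (vector matching).

The numbers from the first week after launch:

Token consumption down 68%, from 4.7 million to 1.5 million a day
P50 response time down from 1.2 s to 80 ms (for requests served from L1)
P99 response time down from 6.8 s to 210 ms (for requests served from L2)
Monthly AI spend down from ¥84,000 to ¥27,000, a saving of ¥57,000

This article is his complete solution.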

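1. What makes caching for AI applications different: semantically similar ≠ identical

1.1 The assumption behind traditional caching breaks down for AI

Traditional cache logic: same key → same value.

// Traditional cache: exact match
String key = "product:detail:" + productId;
String value = redis.get(key);

The problem in AI applications: users phrase the same question in different words.

"What is your refund policy?"
"How do I request a refund?"
"What is the refund process?"
"I want a refund, what do I do?"
"Please explain the refund rules"

With exact string matching, these five questions produce a cache hit rate of 0, yet their answers are essentially the same.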
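The zero hit rate follows directly from how an exact-match key is derived. The sketch below hashes five paraphrases of the refund question with SHA-256 and counts the distinct keys; the class and method names are illustrative and not part of the production code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.List;
import java.util.Set;

class ExactKeyDemo {

    // Derive a cache key the way an exact-match cache would: hash the raw text.
    static String key(String question) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(question.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Count how many distinct cache keys a list of questions produces.
    static int distinctKeys(List<String> questions) {
        Set<String> keys = new HashSet<>();
        for (String q : questions) {
            keys.add(key(q));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        List<String> paraphrases = List.of(
            "What is your refund policy?",
            "How do I request a refund?",
            "What is the refund process?",
            "I want a refund, what do I do?",
            "Please explain the refund rules");
        // Five phrasings give five different keys, so none can reuse another's entry.
        System.out.println(distinctKeys(paraphrases) + " distinct keys for "
            + paraphrases.size() + " questions");
    }
}
```

Only byte-identical questions share a key, which is why the semantic layer described later is needed on top of exact matching.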

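1.2 Three cache granularities compared

Cache granularity	Hit rate	Implementation complexity	Best suited for
Exact match	Low (5-15%)	Simple	Templated questions, fixed API parameters
Semantic similarity (vectors)	Medium (40-60%)	Fairly complex	Natural-language Q&A, FAQ-style workloads
Prompt prefix caching	High (proportional to the shared prefix)	Depends on model provider support	Long system prompts, long contexts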

1.3 Overall cache architecture
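Based on the components described in the rest of this article, the read path is L1 → L2 → semantic cache → LLM, with results written back to the faster tiers. The sketch below models that order with in-memory stand-ins: plain maps replace Caffeine and Redis, and two functions replace the vector lookup and the model call. All names are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class TieredLookupSketch {

    // The answer plus the tier that served it.
    record Result(String answer, String servedBy) {}

    private final Map<String, String> l1 = new HashMap<>();   // stand-in for Caffeine
    private final Map<String, String> l2 = new HashMap<>();   // stand-in for Redis
    private final Function<String, Optional<String>> semantic; // stand-in for vector matching
    private final Function<String, String> llm;                // stand-in for the model call

    TieredLookupSketch(Function<String, Optional<String>> semantic,
                       Function<String, String> llm) {
        this.semantic = semantic;
        this.llm = llm;
    }

    Result answer(String question) {
        // 1. Fastest tier first
        String hit = l1.get(question);
        if (hit != null) {
            return new Result(hit, "L1");
        }
        // 2. Shared tier; backfill L1 on a hit
        hit = l2.get(question);
        if (hit != null) {
            l1.put(question, hit);
            return new Result(hit, "L2");
        }
        // 3. Semantic tier; store under the exact key so the next lookup is cheap
        Optional<String> similar = semantic.apply(question);
        if (similar.isPresent()) {
            store(question, similar.get());
            return new Result(similar.get(), "semantic");
        }
        // 4. Everything missed: call the model and populate the exact-match tiers
        String fresh = llm.apply(question);
        store(question, fresh);
        return new Result(fresh, "LLM");
    }

    private void store(String question, String answer) {
        l1.put(question, answer);
        l2.put(question, answer);
    }
}
```

Calling `answer("x")` twice on a fresh instance reports `LLM` for the first call and `L1` for the second.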

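2. L1 local cache: caching AI responses with Caffeine

2.1 Caffeine vs Guava Cache vs HashMap

Why Caffeine:

Performance: about 30-40% faster than Guava Cache, close to ConcurrentHashMap
Algorithm: W-TinyLFU eviction, with hit rates that can be 30% or more above plain LRU, depending on the workload
Features: asynchronous loading, soft/weak references, statistics
Spring Boot integration: works out of the box

2.2 Complete Spring Boot Caffeine configuration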

@Configuration
@EnableCaching
@Slf4j
public class CacheConfig {
    
    /**
     * L1 local cache for AI responses.
     * Policy:
     * - at most 10,000 entries (guards against running out of memory)
     * - entries expire 30 minutes after being written (AI content changes fairly often)
     * - soft-referenced values, so the JVM can reclaim them under memory pressure
     */
    @Bean
    @Primary  // section 3.2 defines a second CacheManager (Redis); @Cacheable needs one default
    public CacheManager caffeineCacheManager() {
        CaffeineCacheManager cacheManager = new CaffeineCacheManager();
        
        // Each business area gets its own policy. registerCustomCache
        // (Spring Framework 5.2.8+) registers a pre-built native cache under a name.
        
        // AI answers: 10,000 entries, 30-minute TTL
        cacheManager.registerCustomCache("aiAnswers", Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(30, TimeUnit.MINUTES)
            .expireAfterAccess(10, TimeUnit.MINUTES)  // also evict after 10 idle minutes
            .softValues()                              // reclaimable under memory pressure
            .recordStats()                             // record hit rate and other statistics
            .removalListener((key, value, cause) ->
                log.debug("Cache evicted: key={}, cause={}", key, cause))
            .build());
        
        // Embedding vectors: largest capacity, long TTL (the vector for a given text is stable)
        cacheManager.registerCustomCache("embeddings", Caffeine.newBuilder()
            .maximumSize(50_000)
            .expireAfterWrite(2, TimeUnit.HOURS)
            .recordStats()
            .build());
        
        // Intent recognition: small capacity, short TTL (intent may shift over time)
        cacheManager.registerCustomCache("intents", Caffeine.newBuilder()
            .maximumSize(5_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats()
            .build());
        
        return cacheManager;
    }
    
    // No separate metrics bean is defined here. With spring-boot-starter-actuator on the
    // classpath, Spring Boot should bind the caches registered above to Micrometer at
    // startup, provided recordStats() is enabled.
}

2.3 Using cache annotations in the service layer

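@Service
@Slf4j
@RequiredArgsConstructor  // Lombok: constructor injection for the final fields
public class AiAnswerService {
    
    private final ChatClient chatClient;
    private final EmbeddingService embeddingService;
    
    /**
     * Get an AI answer, with the L1 local cache in front.
     * Key strategy: SHA-256 the question so keys stay short.
     */
    @Cacheable(
        cacheNames = "aiAnswers",
        key = "T(com.laozhang.ai.util.HashUtils).sha256(#question + ':' + #knowledgeBaseId)",
        // do not cache missing or near-empty answers
        unless = "#result == null || #result.answer == null || #result.answer.length() < 10"
    )
    public AiAnswer getAnswer(String question, String knowledgeBaseId) {
        log.info("Cache miss, calling LLM for question: {}", 
            question.substring(0, Math.min(50, question.length())));
        
        // Call the LLM to generate the answer
        return generateAnswer(question, knowledgeBaseId);
    }
    
    /**
     * Explicit invalidation (call when a knowledge base is updated).
     * allEntries = true clears the entries of every knowledge base, not only this one:
     * the keys are hashes, so they cannot be selected by knowledgeBaseId.
     */
    @CacheEvict(
        cacheNames = "aiAnswers",
        allEntries = true
    )
    public void invalidateKnowledgeBaseCache(String knowledgeBaseId) {
        log.info("Cache invalidated for knowledge base: {}", knowledgeBaseId);
    }
    
    /**
     * Embedding cache: the vector for a given text is stable, so it can be cached for long.
     * The key is the text itself. String.hashCode() is only 32 bits, and two different
     * texts with the same hash code would silently share one vector.
     */
    @Cacheable(
        cacheNames = "embeddings",
        key = "#text",
        unless = "#result == null"
    )
    public float[] getEmbedding(String text) {
        return embeddingService.embed(text);
    }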
    
    private AiAnswer generateAnswer(String question, String knowledgeBaseId) {
        // 实际的LLM调用逻辑
        String response = chatClient.prompt()
            .system("你是企业知识助手，只回答知识库中有的问题。")
            .user(question)
            .call()
            .content();
        
        return AiAnswer.builder()
            .question(question)
            .answer(response)
            .knowledgeBaseId(knowledgeBaseId)
            .generatedAt(Instant.now())
            .build();
    }
}

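// Hash utility
public class HashUtils {
    public static String sha256(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            // Keep the first 16 hex characters (64 bits) to shorten keys
            return HexFormat.of().formatHex(hash).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform must provide SHA-256, so this should not happen.
            // Fail loudly instead of falling back to the 32-bit hashCode(), which collides easily.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}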

2.4 Monitoring Caffeine cache statistics

@Component
@Slf4j
@RequiredArgsConstructor
public class CaffeineCacheMonitor {
    
    private final CacheManager cacheManager;
    
    @Scheduled(fixedDelay = 60_000)  // log once a minute; requires @EnableScheduling
    public void reportCacheStats() {
        if (cacheManager instanceof CaffeineCacheManager cm) {
            cm.getCacheNames().forEach(cacheName -> {
                CaffeineCache cache = (CaffeineCache) cm.getCache(cacheName);
                if (cache != null) {
                    CacheStats stats = cache.getNativeCache().stats();
                    // SLF4J only substitutes "{}", so format the percentage first
                    log.info("Cache[{}] - hitRate: {}%, requests: {}, " +
                             "evictions: {}, size: {}",
                        cacheName,
                        String.format("%.2f", stats.hitRate() * 100),
                        stats.requestCount(),
                        stats.evictionCount(),
                        cache.getNativeCache().estimatedSize()
                    );
                }
            });
        }
    }
}

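3. L2 distributed cache: caching AI results in Redis

3.1 Choosing a serialization format for AI responses

An AI response object usually contains the text, the token counts, the list of source documents, and the generation time.

Serialization options compared:

Option	Serialization speed	Deserialization speed	Size	Readability
JDK serialization	Slow	Slow	Large	Not readable
Jackson JSON	Medium	Medium	Medium	Good
Kryo	Very fast	Very fast	Small	Not readable
Protobuf	Fast	Fast	Very small	Not readable

Recommended for AI workloads: Jackson JSON (readable, easy to operate, acceptable size) or Kryo (for high-performance scenarios).

3.2 Complete Redis cache configuration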

@Configuration
public class RedisCacheConfig {
    
    @Bean
    public RedisTemplate<String, Object> aiRedisTemplate(RedisConnectionFactory factory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(factory);
        
        // Key序列化：String
        template.setKeySerializer(new StringRedisSerializer());
        template.setHashKeySerializer(new StringRedisSerializer());
        
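        // Value serialization: Jackson JSON
        Jackson2JsonRedisSerializer<Object> valueSerializer = 
            new Jackson2JsonRedisSerializer<>(Object.class);
        
        ObjectMapper objectMapper = new ObjectMapper();
        objectMapper.setVisibility(PropertyAccessor.ALL, JsonAutoDetect.Visibility.ANY);
        // Record type information so values deserialize back to the right class.
        // Restrict the allowed types: LaissezFaireSubTypeValidator accepts any class name
        // found in the payload, which is unsafe if anyone else can write to Redis.
        // "com.laozhang.ai." is assumed to be the application's base package.
        objectMapper.activateDefaultTyping(
            BasicPolymorphicTypeValidator.builder()
                .allowIfSubType("com.laozhang.ai.")
                .allowIfSubType("java.util.")
                .allowIfSubType("java.time.")
                .build(),
            ObjectMapper.DefaultTyping.NON_FINAL,
            JsonTypeInfo.As.PROPERTY
        );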
        objectMapper.registerModule(new JavaTimeModule());
        objectMapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
        
        valueSerializer.setObjectMapper(objectMapper);
        template.setValueSerializer(valueSerializer);
        template.setHashValueSerializer(valueSerializer);
        
        template.afterPropertiesSet();
        return template;
    }
    
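    @Bean
    public RedisCacheManager redisCacheManager(RedisConnectionFactory factory) {
        // The default mapper inside GenericJackson2JsonRedisSerializer has no java.time
        // support, so an AiAnswer with an Instant field would fail to serialize.
        // Supply a mapper with JavaTimeModule and an allow-list for typed values
        // ("com.laozhang.ai." is assumed to be the application's base package).
        ObjectMapper cacheMapper = new ObjectMapper()
            .registerModule(new JavaTimeModule())
            .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
        cacheMapper.activateDefaultTyping(
            BasicPolymorphicTypeValidator.builder()
                .allowIfSubType("com.laozhang.ai.")
                .allowIfSubType("java.util.")
                .allowIfSubType("java.time.")
                .build(),
            ObjectMapper.DefaultTyping.NON_FINAL,
            JsonTypeInfo.As.PROPERTY);

        RedisCacheConfiguration defaultConfig = RedisCacheConfiguration.defaultCacheConfig()
            .entryTtl(Duration.ofMinutes(60))           // default TTL: 60 minutes
            .serializeKeysWith(
                RedisSerializationContext.SerializationPair
                    .fromSerializer(new StringRedisSerializer()))
            .serializeValuesWith(
                RedisSerializationContext.SerializationPair
                    .fromSerializer(new GenericJackson2JsonRedisSerializer(cacheMapper)))
            .disableCachingNullValues();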
        
        // 不同业务不同TTL
        Map<String, RedisCacheConfiguration> cacheConfigs = Map.of(
            "aiAnswers", defaultConfig.entryTtl(Duration.ofHours(2)),
            "embeddings", defaultConfig.entryTtl(Duration.ofHours(24)),
            "ragResults", defaultConfig.entryTtl(Duration.ofMinutes(30)),
            "userIntents", defaultConfig.entryTtl(Duration.ofMinutes(10))
        );
        
        return RedisCacheManager.builder(factory)
            .cacheDefaults(defaultConfig)
            .withInitialCacheConfigurations(cacheConfigs)
            .transactionAware()
            .build();
    }
}

3.3 Wiring the two cache levels together

@Service
@Slf4j
public class TwoLevelCacheService {
    
    private final Cache<String, AiAnswer> localCache;    // L1: Caffeine
    private final RedisTemplate<String, Object> redisTemplate;  // L2: Redis
    
    private static final String REDIS_KEY_PREFIX = "ai:answer:";
    private static final Duration REDIS_TTL = Duration.ofHours(2);
    
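    @SuppressWarnings("unchecked")
    public TwoLevelCacheService(CacheManager caffeineCacheManager,
                                RedisTemplate<String, Object> redisTemplate) {
        // Unwrap the native Caffeine cache. Spring exposes it as Cache<Object, Object>,
        // so an unchecked cast is needed to narrow the type.
        CaffeineCache springCache = (CaffeineCache) caffeineCacheManager.getCache("aiAnswers");
        this.localCache = (Cache<String, AiAnswer>) (Cache<?, ?>) springCache.getNativeCache();
        this.redisTemplate = redisTemplate;
    }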
    
    /**
     * 两级缓存读取
     * 读顺序：L1 → L2 → LLM
     */
    public AiAnswer getOrLoad(String question, String kbId, 
                               Supplier<AiAnswer> loader) {
        String cacheKey = buildKey(question, kbId);
        
        // Step 1: 查L1本地缓存
        AiAnswer localResult = localCache.getIfPresent(cacheKey);
        if (localResult != null) {
            log.debug("L1 cache hit: {}", cacheKey);
            recordCacheHit("l1");
            return localResult;
        }
        
        // Step 2: 查L2 Redis缓存
        AiAnswer redisResult = (AiAnswer) redisTemplate.opsForValue()
            .get(REDIS_KEY_PREFIX + cacheKey);
        if (redisResult != null) {
            log.debug("L2 cache hit: {}", cacheKey);
            // 回填L1缓存，防止下次还要查Redis
            localCache.put(cacheKey, redisResult);
            recordCacheHit("l2");
            return redisResult;
        }
        
        // Step 3: 回源LLM计算
        log.info("Cache miss, loading from LLM: {}", cacheKey);
        AiAnswer result = loader.get();
        
        // 写入两级缓存
        localCache.put(cacheKey, result);
        redisTemplate.opsForValue().set(
            REDIS_KEY_PREFIX + cacheKey, result, REDIS_TTL);
        
        recordCacheHit("miss");
        return result;
    }
    
    /**
     * 缓存失效（两级同时失效）
     */
    public void invalidate(String question, String kbId) {
        String cacheKey = buildKey(question, kbId);
        localCache.invalidate(cacheKey);
        redisTemplate.delete(REDIS_KEY_PREFIX + cacheKey);
    }
    
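    /**
     * Bulk invalidation (when a knowledge base is updated)
     */
    public void invalidateByPattern(String kbId) {
        // L1: clear everything (Caffeine has no delete-by-pattern). This only clears the
        // local node; other instances keep their L1 entries until they expire unless the
        // invalidation is also broadcast, for example over Redis pub/sub.
        localCache.invalidateAll();
        
        // L2: delete by pattern. SCAN iterates incrementally, whereas KEYS blocks Redis
        // while it walks the whole keyspace. (Uses the Spring Data Redis 3.x API.)
        String pattern = REDIS_KEY_PREFIX + "*:" + kbId;
        ScanOptions options = ScanOptions.scanOptions().match(pattern).count(500).build();
        List<String> keys = new ArrayList<>();
        try (Cursor<String> cursor = redisTemplate.scan(options)) {
            cursor.forEachRemaining(keys::add);
        }
        if (!keys.isEmpty()) {
            redisTemplate.delete(keys);
            log.info("Invalidated {} Redis cache entries for kb: {}", keys.size(), kbId);
        }
    }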
    
    private String buildKey(String question, String kbId) {
        return HashUtils.sha256(question) + ":" + kbId;
    }
    
    private void recordCacheHit(String level) {
        // 上报Micrometer指标
        Metrics.counter("cache.hit", "level", level).increment();
    }
}

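4. Semantic cache: deciding hits by vector similarity

4.1 The core idea

Traditional cache: "What is your refund policy?" → SHA256 → Redis lookup → Miss

Semantic cache:

Convert the question into a vector
Run a nearest-neighbor search over the vectors of already-cached questions
If a cached question has similarity > 0.92, return that question's answer directly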
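The three steps can be sketched with toy vectors as below. The three-dimensional vectors stand in for real embeddings, which typically have hundreds or thousands of dimensions; the class name and sample data are illustrative assumptions.

```java
import java.util.Map;
import java.util.Optional;

class SemanticMatchSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Linear nearest-neighbor search: return the key of the most similar cached
    // vector, but only if its similarity reaches the threshold.
    static Optional<String> bestMatch(float[] query, Map<String, float[]> cached,
                                      double threshold) {
        String bestKey = null;
        double bestScore = -1;
        for (Map.Entry<String, float[]> entry : cached.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestKey = entry.getKey();
            }
        }
        return bestKey != null && bestScore >= threshold
            ? Optional.of(bestKey)
            : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, float[]> cached = Map.of(
            "refund-policy", new float[] {1f, 0f, 0f},
            "export-report", new float[] {0f, 1f, 0f});
        // Close to the refund vector: similarity is about 0.99, above the 0.92 threshold
        System.out.println(bestMatch(new float[] {0.9f, 0.1f, 0f}, cached, 0.92));
        // Halfway between both: similarity is about 0.71, below the threshold
        System.out.println(bestMatch(new float[] {0.6f, 0.6f, 0f}, cached, 0.92));
    }
}
```

A linear scan is fine for a few thousand entries; the FAQ at the end discusses when to move to an indexed vector store.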

4.2 Complete implementation

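@Service
@Slf4j
@RequiredArgsConstructor
public class SemanticCacheService {
    
    private final EmbeddingModel embeddingModel;
    private final RedisTemplate<String, Object> redisTemplate;
    private final MeterRegistry meterRegistry;
    
    // Both hashes are shared by every knowledge base. In a multi-tenant deployment,
    // append the knowledge-base id to these keys so that a question can never match
    // an answer cached for another tenant.
    private static final String SEMANTIC_CACHE_KEY = "ai:semantic:cache";
    private static final String SEMANTIC_VECTOR_KEY = "ai:semantic:vectors";
    private static final float SIMILARITY_THRESHOLD = 0.92f;
    private static final int MAX_CACHE_SIZE = 10_000;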
    
    /**
     * 语义缓存查询
     * 返回：如果找到相似问题，返回缓存的答案；否则返回空
     */
    public Optional<AiAnswer> findSemanticMatch(String question) {
        long start = System.currentTimeMillis();
        
        try {
            // Step 1: embed the current question
            float[] queryVector = embedQuestion(question);
            
            // Step 2: 从Redis中取出所有缓存的向量（使用HGETALL）
            Map<Object, Object> vectorMap = redisTemplate.opsForHash()
                .entries(SEMANTIC_VECTOR_KEY);
            
            if (vectorMap.isEmpty()) {
                return Optional.empty();
            }
            
            // Step 3: 计算余弦相似度，找最相似的问题
            String bestMatchKey = null;
            float bestScore = 0f;
            
            for (Map.Entry<Object, Object> entry : vectorMap.entrySet()) {
                float[] cachedVector = (float[]) entry.getValue();
                float score = cosineSimilarity(queryVector, cachedVector);
                
                if (score > bestScore) {
                    bestScore = score;
                    bestMatchKey = (String) entry.getKey();
                }
            }
            
            // Step 4: 如果相似度超过阈值，取出对应答案
            if (bestScore >= SIMILARITY_THRESHOLD && bestMatchKey != null) {
                AiAnswer cachedAnswer = (AiAnswer) redisTemplate.opsForHash()
                    .get(SEMANTIC_CACHE_KEY, bestMatchKey);
                
                if (cachedAnswer != null) {
                    // SLF4J only substitutes "{}", so format the score first
                    log.info("Semantic cache hit! score={}, matched: {}", 
                        String.format("%.3f", bestScore), bestMatchKey);
                    meterRegistry.counter("semantic.cache.hit",
                        "score_range", scoreRange(bestScore)).increment();
                    
                    long elapsed = System.currentTimeMillis() - start;
                    log.debug("Semantic cache lookup took {}ms", elapsed);
                    
                    return Optional.of(cachedAnswer);
                }
            }
            
            return Optional.empty();
            
        } catch (Exception e) {
            log.error("Semantic cache lookup failed", e);
            return Optional.empty();  // 缓存失败不影响主流程
        }
    }
    
    /**
     * 向语义缓存写入新的问答对
     */
    public void putSemanticCache(String question, AiAnswer answer) {
        try {
            float[] vector = embedQuestion(question);
            String key = HashUtils.sha256(question);
            
            // 存储向量
            redisTemplate.opsForHash().put(SEMANTIC_VECTOR_KEY, key, vector);
            
            // 存储答案
            redisTemplate.opsForHash().put(SEMANTIC_CACHE_KEY, key, answer);
            
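            // The TTL applies to the whole hash, not to individual fields, and it is reset
            // on every write. While new questions keep arriving, old entries do not expire
            // on their own and are only removed by trimming.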
            redisTemplate.expire(SEMANTIC_VECTOR_KEY, Duration.ofHours(24));
            redisTemplate.expire(SEMANTIC_CACHE_KEY, Duration.ofHours(24));
            
            // 检查缓存大小，超过上限时清理最旧的
            Long size = redisTemplate.opsForHash().size(SEMANTIC_VECTOR_KEY);
            if (size != null && size > MAX_CACHE_SIZE) {
                trimSemanticCache();
            }
            
        } catch (Exception e) {
            log.error("Failed to put semantic cache", e);
        }
    }
    
    /**
     * 余弦相似度计算
     */
    private float cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions mismatch");
        }
        
        float dotProduct = 0f;
        float normA = 0f;
        float normB = 0f;
        
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        
        if (normA == 0f || normB == 0f) return 0f;
        return dotProduct / (float) (Math.sqrt(normA) * Math.sqrt(normB));
    }
    
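    // No @Cacheable here: Spring's proxy-based caching does not intercept private methods
    // or calls made from inside the same class, so the annotation would have no effect.
    // To cache embeddings, call a public @Cacheable method on another bean
    // (such as AiAnswerService.getEmbedding in section 2.3).
    private float[] embedQuestion(String text) {
        EmbeddingResponse response = embeddingModel.embedForResponse(List.of(text));
        return response.getResults().get(0).getOutput();
    }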
    
    private String scoreRange(float score) {
        if (score >= 0.98) return "0.98+";
        if (score >= 0.95) return "0.95-0.98";
        return "0.92-0.95";
    }
    
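    private void trimSemanticCache() {
        // Not implemented yet, so the cache can grow past MAX_CACHE_SIZE.
        // Simple option: evict a random 20% of the entries.
        // For production, prefer LRU: record access times and delete the oldest entries.
        log.info("Trimming semantic cache (current size exceeds {})", MAX_CACHE_SIZE);
        // TODO: implement eviction
    }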
}

// Entry-point service that combines the three cache levels
@Service
@Slf4j
@RequiredArgsConstructor
public class SmartAiCacheService {
    
    private final TwoLevelCacheService twoLevelCache;
    private final SemanticCacheService semanticCache;
    private final AiAnswerGenerator answerGenerator;
    
    public AiAnswer getAnswer(String question, String kbId) {
        // 1. Exact-match lookup in L1, then L2. getOrLoad runs the loader only when
        //    both miss, and stores whatever the loader returns in both levels.
        return twoLevelCache.getOrLoad(question, kbId, () ->
            // 2. Semantic cache. A semantic hit is handed back to getOrLoad, which also
            //    stores it under this question's exact key, so the next identical
            //    question skips the vector search.
            semanticCache.findSemanticMatch(question)
                .orElseGet(() -> {
                    // 3. Everything missed: call the LLM
                    AiAnswer freshAnswer = answerGenerator.generate(question, kbId);
                    // 4. Register the new pair in the semantic cache
                    //    (getOrLoad takes care of L1 and L2)
                    semanticCache.putSemanticCache(question, freshAnswer);
                    return freshAnswer;
                }));
    }
}

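5. Prompt caching: prefix caching supported natively by model providers

5.1 What prompt prefix caching is

OpenAI, Anthropic, and Google all support prompt caching:

How it works: when several requests share the same system prompt (or message prefix), the model computes that prefix once and later requests reuse its KV cache, which lowers the cost of input tokens.

Support as of 2026 (pricing changes often, so check each provider's current price list):

Provider	Supported	Discount	Minimum prefix length
OpenAI (GPT-4o)	Yes	-50% on cached input tokens	1024 tokens
Anthropic (Claude 3.5)	Yes	-90% on cached input tokens	1024 tokens
DeepSeek	Yes	-75% on cached input tokens	64 tokens
Qwen (Tongyi Qianwen)	Yes	-50% on cached input tokens	512 tokens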
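To see what a discount means for one request, the sketch below converts a request into the number of full-price input tokens it costs as much as. It assumes the whole shared prefix hits the cache and that prefixes shorter than the provider minimum get no discount; the class name, method name, and sample numbers are illustrative.

```java
class PromptCacheCost {

    /**
     * Full-price input tokens that cost the same as this request.
     *
     * @param inputTokens  total input tokens in the request
     * @param cachedPrefix tokens in the shared prefix (assumed to hit the cache)
     * @param discount     fraction taken off cached tokens, e.g. 0.5 for 50% off
     * @param minPrefix    minimum prefix length the provider requires for caching
     */
    static double effectiveInputTokens(int inputTokens, int cachedPrefix,
                                       double discount, int minPrefix) {
        if (cachedPrefix < minPrefix) {
            return inputTokens;  // prefix too short: nothing is cached
        }
        int uncached = inputTokens - cachedPrefix;
        return uncached + cachedPrefix * (1.0 - discount);
    }

    public static void main(String[] args) {
        // 3000 input tokens, 2000 of them in a stable system prompt, 50% discount:
        // 1000 + 2000 * 0.5 = 2000 full-price tokens, a one-third reduction.
        System.out.println(effectiveInputTokens(3000, 2000, 0.5, 1024));
        // A 500-token prefix is below a 1024-token minimum, so the full 3000 is billed.
        System.out.println(effectiveInputTokens(3000, 500, 0.5, 1024));
    }
}
```

The discount only touches the cached share of the input, so a request dominated by a long, stable prefix benefits far more than one dominated by fresh user text.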

5.2 Using prompt caching with Spring AI

@Service
@Slf4j
@RequiredArgsConstructor
public class PromptCacheAwareService {
    
    private final ChatClient chatClient;
    
    // System prompt (keep it as long and as stable as possible)
    private static final String SYSTEM_PROMPT = """
        你是"智能客服助手"，服务于XX科技公司。
        
        公司介绍：
        XX科技成立于2015年，专注于企业SaaS服务...
        [这里放入大量的公司背景信息、产品说明、服务政策...]
        [越长越好，达到1024+ tokens才能触发缓存折扣]
        
        服务规范：
        1. 始终保持专业、友善的语气
        2. 只回答与公司产品相关的问题
        3. 遇到投诉，先致歉再解决
        4. 无法回答时，提供人工客服联系方式
        
        产品说明：
        [详细的产品功能说明...]
        
        退款政策：
        [详细的退款规则...]
        
        常见问题解答：
        [FAQ内容...]
        """;
    
    /**
     * 使用Prompt缓存的对话
     * 关键：每次请求保持System Prompt完全一致
     */
    public String chat(String userMessage, List<ChatMessage> history) {
        // 构建消息列表，保持System Prompt在最前面
        List<Message> messages = new ArrayList<>();
        messages.add(new SystemMessage(SYSTEM_PROMPT));  // 始终相同 → 触发缓存
        
        // 添加历史记录
        history.forEach(h -> {
            if ("user".equals(h.getRole())) {
                messages.add(new UserMessage(h.getContent()));
            } else {
                messages.add(new AssistantMessage(h.getContent()));
            }
        });
        
        // 添加当前用户消息
        messages.add(new UserMessage(userMessage));
        
        // Call the API (OpenAI detects the shared prefix automatically)
        ChatResponse response = chatClient.prompt()
            .messages(messages)
            .call()
            .chatResponse();
        
        // Log token usage. The portable Usage type does not expose the cached-token count;
        // it has to be read from the provider's raw usage data
        // (for OpenAI, the prompt_tokens_details.cached_tokens field).
        if (response.getMetadata() != null && response.getMetadata().getUsage() != null) {
            Usage usage = response.getMetadata().getUsage();
            log.info("Token usage - prompt: {}, completion: {}",
                usage.getPromptTokens(),
                usage.getGenerationTokens()
            );
        }
        
        return response.getResult().getOutput().getContent();
    }
    
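    /**
     * Long-document Q&A with the document placed in the cacheable prefix
     */
    public String queryWithDocumentCache(String question, String documentContent) {
        // Keep the stable part (system prompt + document) first, so that repeated
        // questions about the same document share a prefix. OpenAI caches such a prefix
        // automatically. Anthropic needs the region marked with cache_control, which this
        // call does not set; check whether your Spring AI version exposes that option.
        String systemWithDoc = SYSTEM_PROMPT + "\n\n参考文档：\n" + documentContent;
        
        // With the prefix cached, the document tokens are billed at the discounted
        // cached rate (up to about 90% off on Anthropic).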
        return chatClient.prompt()
            .system(systemWithDoc)
            .user(question)
            .call()
            .content();
    }
}

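6. Cache penetration protection: a Bloom filter in front of the cache

6.1 What cache penetration looks like in an AI application

Malicious request: "dkasj829fjks293" (gibberish that will never have an answer)
Every such request reaches the LLM → wasted tokens, and possibly API rate limiting

6.2 Implementation with a Redisson Bloom filter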

@Component
@Slf4j
@RequiredArgsConstructor
public class BloomFilterCacheProtection {
    
    private final RedissonClient redissonClient;
    private RBloomFilter<String> bloomFilter;
    
    @PostConstruct
    public void initBloomFilter() {
        bloomFilter = redissonClient.getBloomFilter("ai:questions:bloom");
        
        // Initialize: 10 million expected elements, 0.1% false-positive rate (0.001)
        if (!bloomFilter.isExists()) {
            bloomFilter.tryInit(10_000_000L, 0.001);
            log.info("Bloom filter initialized");
        }
        
        // 从Redis中加载已有的问题ID
        preloadFromCache();
    }
    
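    private static final Pattern HAS_TEXT = Pattern.compile("[\\u4e00-\\u9fa5a-zA-Z]");
    
    /**
     * Cheap sanity checks that reject obviously invalid input.
     * Returns false: reject without touching the cache or the LLM.
     * Returns true: continue with the normal lookup.
     *
     * The Bloom filter is deliberately not used to reject here. A question nobody has
     * asked before is not in the filter yet, so gating on contains() would block every
     * first-time question.
     */
    public boolean mightHaveAnswer(String question) {
        // Questions shorter than 3 characters are rejected outright
        if (question == null || question.trim().length() < 3) {
            return false;
        }
        
        // Requests made up only of digits and special characters are rejected outright
        if (!HAS_TEXT.matcher(question).find()) {
            log.warn("Suspicious request blocked: {}", question);
            return false;
        }
        
        return true;
    }
    
    /**
     * Bloom-filter check: has this exact question been answered before?
     * Returns false: definitely never seen, so the exact-match caches cannot hold it.
     * Returns true: probably seen (false positives are possible); query the cache.
     */
    public boolean mayBeCached(String question) {
        return bloomFilter.contains(HashUtils.sha256(question));
    }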
    
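    /**
     * Add a question to the Bloom filter once it has been processed
     */
    public void addToBloomFilter(String question) {
        bloomFilter.add(HashUtils.sha256(question));
    }
    
    /**
     * Warm-up: load previously answered questions into the Bloom filter
     */
    private void preloadFromCache() {
        // Placeholder: nothing is loaded here. In a real project, read the historical
        // questions from the database and add each one to the filter.
        log.info("Bloom filter preload step finished");
    }
}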
}

// Integrating the protection layer in a service
@Service
@RequiredArgsConstructor
public class ProtectedAiService {
    
    private final SmartAiCacheService cacheService;
    private final BloomFilterCacheProtection bloomFilter;
    
    public AiAnswer safeGetAnswer(String question, String kbId) {
        // Reject invalid requests before they reach the cache or the LLM
        if (!bloomFilter.mightHaveAnswer(question)) {
            return AiAnswer.invalid("问题格式不正确");
        }
        
        AiAnswer answer = cacheService.getAnswer(question, kbId);
        
        // Record successfully answered questions in the Bloom filter
        if (answer != null && answer.isValid()) {
            bloomFilter.addToBloomFilter(question);
        }
        
        return answer;
    }
}

7. Preventing cache avalanche: randomized TTLs + hot entries that never expire

7.1 Randomized expiry

@Configuration
public class AntiAvalancheConfig {
    
    private static final Random RANDOM = new Random();
    
    /**
     * TTL with random jitter: the base TTL plus or minus up to 30%
     */
    public static Duration withJitter(Duration baseTtl) {
        long baseSeconds = baseTtl.toSeconds();
        // Random offset in [-30%, +30%] of the base TTL
        long jitter = (long) (baseSeconds * 0.3 * (RANDOM.nextDouble() * 2 - 1));
        // Safety floor at half the base TTL
        long finalSeconds = Math.max(baseSeconds / 2, baseSeconds + jitter);
        return Duration.ofSeconds(finalSeconds);
    }
}

@Service
@RequiredArgsConstructor
public class JitterAwareCacheService {
    
    private final RedisTemplate<String, Object> redisTemplate;
    
    private static final Duration BASE_TTL = Duration.ofHours(2);
    
    public void cacheWithJitter(String key, Object value) {
        Duration jitteredTtl = AntiAvalancheConfig.withJitter(BASE_TTL);
        redisTemplate.opsForValue().set(key, value, jitteredTtl);
    }
}

7.2 Identifying hot entries and exempting them from expiry

@Service
@Slf4j
@RequiredArgsConstructor
public class HotspotCacheManager {
    
    private final RedisTemplate<String, Object> redisTemplate;
    
    // Hotspot threshold: 100 or more accesses within an hour
    private static final int HOTSPOT_THRESHOLD = 100;
    private static final String ACCESS_COUNT_KEY = "ai:cache:access:count";
    
    /**
     * 记录缓存访问，判断是否是热点
     */
    public void recordAccess(String cacheKey) {
        String countKey = ACCESS_COUNT_KEY + ":" + cacheKey;
        Long count = redisTemplate.opsForValue().increment(countKey);
        
        // 计数器1小时后自动清零
        if (count != null && count == 1) {
            redisTemplate.expire(countKey, Duration.ofHours(1));
        }
        
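        // Promote to hotspot when the threshold is first reached. Using == rather than >=
        // runs the promotion once per counting window instead of on every later access.
        if (count != null && count == HOTSPOT_THRESHOLD) {
            markAsHotspot(cacheKey);
        }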
    }
    
    private void markAsHotspot(String cacheKey) {
        String valueKey = "ai:answer:" + cacheKey;
        
        // 检查Key是否存在
        Boolean exists = redisTemplate.hasKey(valueKey);
        if (Boolean.TRUE.equals(exists)) {
            // 热点数据：TTL设为-1（永不过期）
            redisTemplate.persist(valueKey);
            
            // 同时加入热点集合，方便管理
            redisTemplate.opsForSet().add("ai:hotspot:keys", cacheKey);
            
            log.info("Key promoted to hotspot (no expiry): {}", cacheKey);
        }
    }
    
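    /**
     * When a knowledge base is updated, explicitly invalidate its hotspot entries.
     * Cache keys have the form "<hash>:<kbId>" (see TwoLevelCacheService.buildKey),
     * so only the keys that belong to this knowledge base are removed.
     */
    public void invalidateHotspot(String kbId) {
        Set<Object> hotspotKeys = redisTemplate.opsForSet().members("ai:hotspot:keys");
        if (hotspotKeys == null) return;
        
        String suffix = ":" + kbId;
        int count = 0;
        for (Object key : hotspotKeys) {
            if (!String.valueOf(key).endsWith(suffix)) {
                continue;  // belongs to another knowledge base
            }
            String fullKey = "ai:answer:" + key;
            redisTemplate.delete(fullKey);
            redisTemplate.opsForSet().remove("ai:hotspot:keys", key);
            count++;
        }
        
        log.info("Invalidated {} hotspot cache entries for kb: {}", count, kbId);
    }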
}

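8. Cache warming: preloading frequent questions at startup

8.1 Why cache warming is needed

Right after startup every cache is empty. Without warming:

The first batch of user requests all go to the LLM and are slow
The LLM's QPS limit may be hit immediately
User experience suffers during the cold-start period

8.2 Warming at startup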

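@Component
@Slf4j
@RequiredArgsConstructor  // Lombok: constructor for the final fields; @Value fields are injected afterwards
public class AiCacheWarmer implements ApplicationRunner {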
    
    private final SmartAiCacheService cacheService;
    private final FrequentQuestionRepository questionRepository;
    private final MeterRegistry meterRegistry;
    
    @Value("${ai.cache.warm.enabled:true}")
    private boolean warmEnabled;
    
    @Value("${ai.cache.warm.top-n:200}")
    private int topN;
    
    @Override
    public void run(ApplicationArguments args) throws Exception {
        if (!warmEnabled) {
            log.info("Cache warming is disabled");
            return;
        }
        
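        log.info("Starting cache warming, loading top {} questions...", topN);
        long startTime = System.currentTimeMillis();
        
        // Load the most frequently asked questions from the database
        List<FrequentQuestion> questions = questionRepository.findTopN(topN);
        
        // Warm concurrently on virtual threads (Java 21+)
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<CompletableFuture<Void>> futures = questions.stream()
                .map(q -> CompletableFuture.runAsync(() -> {
                    try {
                        cacheService.getAnswer(q.getQuestion(), q.getKbId());
                        String text = q.getQuestion();
                        // Math.min guards against questions shorter than 20 characters
                        log.debug("Warmed cache for: {}",
                            text.substring(0, Math.min(20, text.length())));
                    } catch (Exception e) {
                        log.warn("Failed to warm cache for question: {}", q.getId(), e);
                    }
                }, executor))
                .collect(Collectors.toList());
            
            // Wait up to 2 minutes. A timeout must not propagate: an exception thrown
            // from an ApplicationRunner aborts application startup.
            try {
                CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                    .get(2, TimeUnit.MINUTES);
            } catch (TimeoutException e) {
                log.warn("Cache warming did not finish within 2 minutes; continuing startup");
                executor.shutdownNow();
            }
        }
        
        long elapsed = System.currentTimeMillis() - startTime;
        log.info("Cache warming completed: {} questions in {}ms", questions.size(), elapsed);
        
        // Record warming metrics. A counter is used for the count because a gauge only
        // keeps a weak reference to its value and would lose a boxed Integer.
        meterRegistry.counter("cache.warm.count").increment(questions.size());
        meterRegistry.timer("cache.warm.duration").record(elapsed, TimeUnit.MILLISECONDS);
    }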
}

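// Table of frequently asked questions
@Entity
@Table(name = "frequent_questions",
       indexes = @Index(name = "idx_fq_rank_score", columnList = "rank_score"))
@Data
public class FrequentQuestion {
    @Id
    private String id;
    private String question;
    private String kbId;
    private int accessCount;  // accesses over the past 7 days
    private Instant lastAccessedAt;
    
    // Combined ranking score. In JPA, indexes are declared on @Table (see above);
    // @Index cannot be placed on a field.
    @Column(name = "rank_score")
    private int rankScore;
}

// Repository
@Repository
public interface FrequentQuestionRepository extends JpaRepository<FrequentQuestion, String> {
    
    // JPQL has no portable LIMIT clause, so the limit is applied through a Pageable
    List<FrequentQuestion> findAllByOrderByAccessCountDesc(Pageable pageable);
    
    default List<FrequentQuestion> findTopN(int n) {
        return findAllByOrderByAccessCountDesc(PageRequest.of(0, n));
    }
    
    // Refresh access statistics (run nightly). Modifying queries must run in a transaction.
    @Modifying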
    @Query("""
        UPDATE FrequentQuestion q 
        SET q.accessCount = (
            SELECT COUNT(*) FROM AuditLog a 
            WHERE a.questionId = q.id 
            AND a.createdAt > :since
        )
        WHERE q.id IN :ids
        """)
    void updateAccessCounts(@Param("ids") List<String> ids, 
                            @Param("since") Instant since);
}
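Firing every warm-up request at once can itself trip the provider's rate limit, which is one of the problems section 8.1 sets out to avoid. A minimal sketch of capping the number of in-flight calls with a Semaphore follows. It uses a plain fixed thread pool so it also runs on older JDKs; `BoundedWarmup` and `warm` are illustrative names, not part of the code above.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class BoundedWarmup {

    // Runs the task for every item with at most maxInFlight tasks executing at once.
    // Returns the highest concurrency actually observed.
    static <T> int warm(List<T> items, int maxInFlight, Consumer<T> task) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, Math.min(items.size(), 32)));
        try {
            for (T item : items) {
                pool.submit(() -> {
                    permits.acquireUninterruptibly();  // wait for a free slot
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        task.accept(item);
                    } catch (RuntimeException e) {
                        // one failed warm-up call should not stop the others
                    } finally {
                        running.decrementAndGet();
                        permits.release();
                    }
                });
            }
        } finally {
            pool.shutdown();
            try {
                pool.awaitTermination(2, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                pool.shutdownNow();
            }
        }
        return peak.get();
    }
}
```

In the warmer, the task would be the call to `cacheService.getAnswer(...)`, and `maxInFlight` would be chosen to stay below the provider's QPS limit.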

9. Cache monitoring: Grafana panels for hit rate, memory, and eviction rate

9.1 Instrumenting with Micrometer

@Aspect
@Component
@Slf4j
@RequiredArgsConstructor
public class CacheMetricsAspect {
    
    private final MeterRegistry meterRegistry;
    
    @Around("@annotation(org.springframework.cache.annotation.Cacheable)")
    public Object monitorCacheableMethod(ProceedingJoinPoint joinPoint) throws Throwable {
        String methodName = joinPoint.getSignature().getName();
        Timer.Sample sample = Timer.start(meterRegistry);
        
        Object result = joinPoint.proceed();
        
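        // Whether the call was a cache hit. CacheHitTracker is not shown in this article;
        // it stands for a custom tracker fed by a wrapping CacheManager.
        boolean cacheHit = CacheHitTracker.isLastOperationHit();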
        
        sample.stop(meterRegistry.timer("ai.cache.operation",
            "method", methodName,
            "result", cacheHit ? "hit" : "miss"
        ));
        
        meterRegistry.counter("ai.cache.requests",
            "method", methodName,
            "result", cacheHit ? "hit" : "miss"
        ).increment();
        
        return result;
    }
}

@Component
@RequiredArgsConstructor
public class CacheMetricsExporter {
    
    private final Cache<String, AiAnswer> caffeineCache;
    private final MeterRegistry meterRegistry;
    
    // Register each gauge once. The gauge functions call stats() on every scrape:
    // CacheStats is an immutable snapshot, so capturing a single snapshot would
    // freeze the reported values.
    @PostConstruct
    public void registerMetrics() {
        Gauge.builder("caffeine.cache.hit.rate", caffeineCache, c -> c.stats().hitRate())
            .description("Caffeine cache hit rate")
            .register(meterRegistry);
        
        Gauge.builder("caffeine.cache.size", caffeineCache, Cache::estimatedSize)
            .description("Estimated cache size")
            .register(meterRegistry);
        
        Gauge.builder("caffeine.cache.eviction.count", caffeineCache,
                c -> c.stats().evictionCount())
            .description("Cache eviction count")
            .register(meterRegistry);
    }
}

9.2 Prometheus + Grafana dashboard configuration

# prometheus.yml - scrape configuration
scrape_configs:
  - job_name: 'ai-services'
    static_configs:
      - targets: 
          - 'embedding-service:8080'
          - 'rag-service:8080'
          - 'conversation-service:8080'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s

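Key Grafana panel queries (PromQL):

# L1 hit rate. cache_hit_total is the "cache.hit" counter from TwoLevelCacheService;
# its "level" tag is l1, l2 or miss.
sum(rate(cache_hit_total{level="l1"}[5m]))
/ sum(rate(cache_hit_total[5m])) * 100

# L2 Redis hit rate
sum(rate(cache_hit_total{level="l2"}[5m]))
/ sum(rate(cache_hit_total[5m])) * 100

# Semantic cache hit rate. Requires a semantic.cache.lookup counter incremented on
# every lookup; the service code above only counts hits.
sum(rate(semantic_cache_hit_total[5m]))
/ sum(rate(semantic_cache_lookup_total[5m])) * 100

# Tokens saved by caching. Requires the application to publish this counter.
increase(tokens_saved_by_cache_total[1d])

# Redis memory usage in MiB
redis_memory_used_bytes{instance="redis:6379"} / 1024 / 1024

# Caffeine eviction rate (a high value suggests the cache is too small)
rate(caffeine_cache_eviction_count[5m])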

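10. Production numbers: what the three cache levels delivered

10.1 Hit rate per cache level (measured by Wang Lei's team)

Steady-state data two weeks after launch:

Cache level	Hit rate	Average response time	Tokens saved per day
L1 Caffeine	32%	8ms	1.5 million
L2 Redis	21%	45ms	0.99 million
Semantic cache	15%	280ms	0.7 million
Prompt prefix caching	Applies to every request	n/a	0.6 million (input tokens)
All miss (LLM)	32%	2,100ms	0

Cumulative effect: 68% of requests are served from cache, token consumption is down 68%, and P50 response time is down 93%.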
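The per-level figures can be combined into one expected mean latency by weighting each level's response time by its share of traffic. A small sketch of that arithmetic, using the four rows of the table that have a hit rate and a response time (class and method names are illustrative):

```java
class BlendedLatency {

    // Expected latency = sum over tiers of (share of traffic x latency of that tier).
    static double expectedMillis(double[] shares, double[] latenciesMs) {
        if (shares.length != latenciesMs.length) {
            throw new IllegalArgumentException("shares and latencies must align");
        }
        double total = 0;
        for (int i = 0; i < shares.length; i++) {
            total += shares[i] * latenciesMs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // L1, L2, semantic cache, LLM (all miss)
        double[] shares = {0.32, 0.21, 0.15, 0.32};
        double[] latencies = {8, 45, 280, 2100};
        // 2.56 + 9.45 + 42 + 672, roughly 726 ms on average
        System.out.printf("%.0f ms%n", expectedMillis(shares, latencies));
    }
}
```

The mean is dominated by the 32% of requests that still reach the LLM (672 of the roughly 726 ms), which is why the median improves much more than the mean.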

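10.2 Cost calculation

Before optimization:
- Daily tokens: 4.7 million, assuming input:output = 7:3 (3.29M input, 1.41M output)
- GPT-4o input at $15/1M tokens → $49.35/day
- GPT-4o output at $60/1M tokens → $84.60/day
- Daily total: about $134 ≈ ¥938 (at ¥7 per USD)

After optimization:
- L1/L2 hits (53%): zero token consumption
- Semantic cache hits (15%): embedding tokens only (very cheap, ignored here)
- Handled by the LLM (32%): about 1.5 million tokens/day
- Prompt prefix caching: input token cost reduced by about 50%
- Daily total: $49.35 × 32% × 50% + $84.60 × 32% ≈ $7.90 + $27.07 ≈ $35 ≈ ¥245

Savings: about ¥693/day ≈ ¥20,800/month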

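FAQ

Q1: How was the semantic cache threshold of 0.92 chosen?

A: It has to be tuned for the business scenario. A higher threshold means fewer wrong matches but a lower hit rate; a lower threshold raises the hit rate but may return unrelated answers. Run an A/B test: start at 0.95, lower it step by step, and watch user feedback and complaint rates. Wang Lei's team settled on 0.92; for scenarios that demand very high accuracy, such as legal advice, 0.97 or higher is advisable.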
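Before running a live A/B test, the same trade-off can be explored offline on a labeled sample. The sketch below assumes a set of question pairs that a reviewer has marked as sharing an answer or not, and reports the hit rate and the share of wrong hits for a given threshold. The record names and sample data are hypothetical.

```java
import java.util.List;

class ThresholdSweep {

    // One labeled sample: similarity between a new question and its nearest cached
    // question, and whether a reviewer judged the cached answer correct for it.
    record Pair(double similarity, boolean sameAnswer) {}

    // hitRate: share of samples served from cache at this threshold.
    // wrongHitRate: share of those hits that returned an answer judged wrong.
    record Outcome(double threshold, double hitRate, double wrongHitRate) {}

    static Outcome evaluate(List<Pair> pairs, double threshold) {
        int hits = 0;
        int wrong = 0;
        for (Pair p : pairs) {
            if (p.similarity() >= threshold) {
                hits++;
                if (!p.sameAnswer()) {
                    wrong++;
                }
            }
        }
        double hitRate = pairs.isEmpty() ? 0 : (double) hits / pairs.size();
        double wrongHitRate = hits == 0 ? 0 : (double) wrong / hits;
        return new Outcome(threshold, hitRate, wrongHitRate);
    }

    public static void main(String[] args) {
        List<Pair> sample = List.of(
            new Pair(0.99, true), new Pair(0.96, true),
            new Pair(0.93, false), new Pair(0.90, true));
        for (double t : new double[] {0.97, 0.95, 0.92}) {
            System.out.println(evaluate(sample, t));
        }
    }
}
```

Lowering the threshold from 0.95 to 0.92 on this toy sample raises the hit rate from 50% to 75% but makes one hit in three wrong, which is the trade-off the answer above describes.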

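Q2: Does performance degrade when the semantic cache holds many entries?

A: A Redis Hash holding 100,000 vectors (1536 float dimensions each, about 6 KB per vector) takes roughly 600 MB of memory, and a linear scan that computes similarity takes about 50-200 ms depending on the hardware. At that scale, use Milvus or pgvector as the semantic cache index: both support HNSW indexes, which can keep a query over 100,000 vectors within about 5 ms.

Q3: Does prompt caching require code changes?

A: For OpenAI, no code changes are needed; it takes effect automatically as long as requests share the same system prompt prefix. For Anthropic Claude, the cached region has to be marked explicitly with a cache_control field on the message. Whether the Spring AI Anthropic adapter exposes this depends on the version, so check the documentation for the release you use.

Q4: Can cached content go stale and give users outdated answers?

A: When knowledge base content is updated, the related caches must be invalidated explicitly (for example with @CacheEvict). It is best to tie the knowledge base update and the cache invalidation together as one transactional operation: publish a KnowledgeBaseUpdatedEvent and let a listener invalidate the corresponding caches.

Q5: What about Bloom filter false positives?

A: A Bloom filter only produces false positives (reporting something absent as present), never false negatives. The worst case is that an invalid request passes the Bloom filter and continues through the cache and LLM path. That wastes one lookup but does not affect correctness. A false-positive rate of 0.001 (0.1%) is sufficient for production.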
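The one-sided error is easy to check with a toy filter. The sketch below is a minimal Bloom filter over a BitSet using double hashing; it is an illustration of the property only, not Redisson's implementation, and the class name and hash choices are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class TinyBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Set the bit at every position derived for this item.
    void add(String item) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(item, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(String item) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th position from two base hashes.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

Every item that was added sets all of its bits, so `mightContain` can never return false for it. An item that was never added returns true only when other items happen to have set all of its bits, which is the false-positive case.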

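Summary

Core principles of a multi-level cache architecture:

Tier by hit rate: the faster and more expensive the storage, the hotter the data it should hold
Semantic caching is specific to AI: traditional cache frameworks cannot do it, so it has to be built
Invalidation is the hard part: better to invalidate once too often than to show users wrong data
Monitoring comes first: without hit-rate data, cache tuning is guesswork

Wang Lei's team cut its monthly bill from ¥84,000 to ¥27,000. The ¥57,000 saved covers three months of cloud server costs. With caching, every day you start earlier is a day less of burning money.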