RAG Cache Layer Design: Cutting LLM Costs by 70% with Semantic Caching
The same question, asked 500 times, paid for every single time
It was Xiao Wang's third month on the AI team. At the end of the month he picked up the AI cost report from finance, saw the number, and his hand shook.
"LLM API charges this month: ¥28,600."
The company is a mid-sized manufacturer that had rolled out an internal HR knowledge base to answer questions about leave, expense reimbursement, social insurance, and the like. Usage was high right from launch.
Xiao Wang pulled up the query logs for analysis, and his heart sank.
The top 10 most frequent questions over the past month included:
- "如何申请年假?" (How do I apply for annual leave?) — asked 847 times
- "报销流程是什么?" (What is the reimbursement process?) — asked 723 times
- "产假有多少天?" (How many days of maternity leave are there?) — asked 612 times
- "公积金怎么提取?" (How do I withdraw from my housing provident fund?) — asked 589 times
- "请病假需要什么材料?" (What documents do I need for sick leave?) — asked 541 times
The top 5 questions alone accounted for 3,312 queries.
Yet the answers to these 5 questions are essentially fixed — HR policy is a handful of rules, and the answer doesn't change depending on who asks.
Each of those 3,312 queries called the LLM, at roughly ¥0.02 per call.
Just these 5 questions cost about ¥66. Had the first answer been cached, the remaining 3,311 queries would have cost nothing.
Worse, that's only the top 5. Across the whole log, repeated queries made up 61% of total volume — queries whose answers were identical, yet each one re-ran the LLM.
Xiao Wang ran the numbers: caching all of that 61% would bring the monthly cost down to roughly ¥11,100, a 61% saving.
But there's a catch — a traditional key-value cache (using the raw question text as the key) simply doesn't work here.
Users don't type "如何申请年假?" verbatim every time. They say:
- "我想请假,怎么办?" (I want to take leave, what do I do?)
- "年假如何申请呀" (How do I apply for annual leave, please?)
- "公司年假申请流程是啥" (What's the company's annual-leave application process?)
- "怎么申请年假" (How to apply for annual leave)
These four phrasings mean exactly the same thing, but the strings differ, so a traditional K-V cache misses on all of them.
Semantic caching solves exactly this: it uses vector similarity to decide whether a question has been asked before, instead of exact string matching.
After Xiao Wang deployed a semantic cache, the hit rate reached 68% and the monthly cost dropped to ¥9,800 — a saving of about 66%.
TL;DR
| Approach | Hit condition | Hit rate | Best for |
|---|---|---|---|
| No cache | N/A | 0% | Every question is unique |
| K-V cache (exact match) | Identical strings | 5-15% | Fixed-format queries |
| Semantic cache (vector similarity) | Similarity above a threshold | 40-70% | Natural-language Q&A |
This article's approach: a three-tier cache architecture
- L1: in-JVM memory cache (Caffeine, <1ms, recent hot questions)
- L2: Redis cache (<5ms, high-frequency Q&A pairs)
- L3: vector semantic cache (<50ms, semantically similar Q&A pairs)
Why a Plain K-V Cache Falls Short
A short code snippet makes the problem concrete:

```java
// Traditional K-V cache logic
public String getCachedAnswer(String question) {
    // Use the raw question text as the key
    String key = "rag:cache:" + question.hashCode();
    String cached = redis.get(key);
    if (cached != null) {
        return cached; // cache hit
    }
    return null; // cache miss
}

// What actually happens:
getCachedAnswer("如何申请年假?");            // MISS → call LLM → cache answer A
getCachedAnswer("如何申请年假?");            // HIT → return answer A ✓
getCachedAnswer("年假如何申请呀");           // MISS → call LLM again (wasted!)
getCachedAnswer("我想请年假,流程是什么");    // MISS → call LLM again (wasted!)
getCachedAnswer("公司年假怎么请");           // MISS → call LLM again (wasted!)
```

A traditional cache is completely blind to small variations in wording, and natural language is full of them.
How Semantic Caching Works
The core idea (a minimal sketch follows the list):
- When a user asks a question, first convert it into a vector
- Search the cache store for the nearest neighbor of that vector
- If the nearest neighbor's similarity exceeds a threshold, return the cached answer directly
- Otherwise call the LLM, then store the new question-answer pair in the cache
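To make the flow concrete before the full Spring implementation below, here is a minimal, self-contained sketch. It is illustrative only: `embed` and `llm` are injected as plain functions (placeholders for a real embedding model and LLM client), and the nearest-neighbor search is a brute-force loop rather than a vector index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal sketch of the four-step flow above; not the production implementation.
class MiniSemanticCache {
    record Entry(String question, float[] vector, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, float[]> embed;
    private final Function<String, String> llm;
    private final double threshold;

    MiniSemanticCache(Function<String, float[]> embed, Function<String, String> llm, double threshold) {
        this.embed = embed;
        this.llm = llm;
        this.threshold = threshold;
    }

    String answer(String question) {
        float[] v = embed.apply(question);                    // 1. embed the question
        Entry best = null;
        double bestSim = -1;
        for (Entry e : entries) {                             // 2. brute-force nearest neighbor
            double sim = cosine(v, e.vector());
            if (sim > bestSim) { bestSim = sim; best = e; }
        }
        if (best != null && bestSim >= threshold) {           // 3. similar enough: cache hit
            return best.answer();
        }
        String answer = llm.apply(question);                  // 4. miss: call the LLM, then cache
        entries.add(new Entry(question, v, answer));
        return answer;
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```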
Maven Dependencies and Configuration

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.laozhang.ai</groupId>
<artifactId>semantic-cache-demo</artifactId>
<version>1.0.0</version>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.2.5</version>
</parent>
<properties>
<java.version>17</java.version>
<spring-ai.version>1.0.0-M6</spring-ai.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring AI -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
<version>${spring-ai.version}</version>
</dependency>
<!-- Spring AI Redis Vector Store (for the semantic cache) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-redis-store-spring-boot-starter</artifactId>
<version>${spring-ai.version}</version>
</dependency>
<!-- Redis -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<!-- Caffeine: L1 in-memory cache -->
<dependency>
<groupId>com.github.ben-manes.caffeine</groupId>
<artifactId>caffeine</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-cache</artifactId>
</dependency>
<!-- Spring Data JPA (for cache statistics) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
</dependency>
<!-- Micrometer -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>spring-milestones</id>
<url>https://repo.spring.io/milestone</url>
</repository>
</repositories>
</project>
```

Note: on Spring Boot 3.x the Redis connection keys live under `spring.data.redis`, not `spring.redis`.

```yaml
spring:
  application:
    name: semantic-cache-demo
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      base-url: ${OPENAI_BASE_URL:https://api.openai.com}
      chat:
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        options:
          model: text-embedding-3-small
    # Redis vector store (used for the semantic cache)
    vectorstore:
      redis:
        index: semantic-cache-index
        prefix: "rag:"
        initialize-schema: true
  data:
    redis:
      host: localhost
      port: 6379
      lettuce:
        pool:
          max-active: 20
          max-idle: 10
          min-idle: 5
  datasource:
    url: jdbc:postgresql://localhost:5432/semantic_cache
    username: postgres
    password: postgres
  jpa:
    hibernate:
      ddl-auto: update
  # Caffeine L1 cache configuration
  cache:
    type: caffeine
    caffeine:
      spec: maximumSize=1000,expireAfterWrite=10m

# Semantic cache configuration
semantic-cache:
  # Cache-hit threshold (0-1); higher is stricter
  # 0.92: only near-identical questions hit (safe, but low hit rate)
  # 0.85: semantically similar questions hit (recommended for most systems)
  # 0.75: loose matching (high hit rate, but risks wrong hits)
  similarity-threshold: 0.88
  # TTL for the L2 Redis cache
  redis-ttl-hours: 24
  # Max entries in the L1 in-memory cache
  l1-max-size: 200
  # L1 cache expiry (minutes)
  l1-expire-minutes: 10
  # Whether cache warmup is enabled
  warmup-enabled: true
  # Number of historical Q&A pairs loaded from the database at warmup
  warmup-size: 100
  # Cost accounting: average cost per LLM call (¥)
  cost-per-llm-call: 0.02

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus

logging:
  level:
    com.laozhang.ai: DEBUG
```

Core Semantic-Cache Implementation

```java
package com.laozhang.ai.cache.service;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import jakarta.annotation.PostConstruct;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.TimeUnit;
/**
 * Three-tier semantic cache service.
 * L1: Caffeine in-memory cache (hot questions, <1ms)
 * L2: Redis KV cache (high-frequency questions, <5ms)
 * L3: Redis vector cache (semantically similar questions, <50ms)
 */
@Service
@Slf4j
public class SemanticCacheService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
private final org.springframework.data.redis.core.RedisTemplate<String, String> redisTemplate;
private final MeterRegistry meterRegistry;
@Value("${semantic-cache.similarity-threshold:0.88}")
private double similarityThreshold;
@Value("${semantic-cache.redis-ttl-hours:24}")
private long redisTtlHours;
@Value("${semantic-cache.l1-max-size:200}")
private int l1MaxSize;
@Value("${semantic-cache.l1-expire-minutes:10}")
private long l1ExpireMinutes;
@Value("${semantic-cache.cost-per-llm-call:0.02}")
private double costPerLlmCall;
    // L1: Caffeine in-memory cache
    private Cache<String, CacheEntry> l1Cache;
    // L2: Redis KV cache (key = hash of the normalized question, value = answer JSON)
    private static final String L2_KEY_PREFIX = "rag:l2:";
    // L3: vector index name (defined in the VectorStore configuration)
    // Cost accounting. Note: volatile ++ is not atomic; good enough for a rough
    // dashboard, but use AtomicLong/LongAdder if you need exact counts.
    private volatile long totalQueries = 0;
    private volatile long cacheHits = 0;
    private volatile double totalSavedCost = 0;
    // Micrometer counters
    private Counter l1HitCounter;
    private Counter l2HitCounter;
    private Counter l3HitCounter;
    private Counter cacheMissCounter;
public SemanticCacheService(
EmbeddingModel embeddingModel,
VectorStore vectorStore,
org.springframework.data.redis.core.RedisTemplate<String, String> redisTemplate,
MeterRegistry meterRegistry) {
this.embeddingModel = embeddingModel;
this.vectorStore = vectorStore;
this.redisTemplate = redisTemplate;
this.meterRegistry = meterRegistry;
}
    @PostConstruct
    public void init() {
        // Initialize the L1 cache
        this.l1Cache = Caffeine.newBuilder()
                .maximumSize(l1MaxSize)
                .expireAfterWrite(l1ExpireMinutes, TimeUnit.MINUTES)
                .recordStats() // enable statistics
                .build();
        // Register Micrometer counters
        this.l1HitCounter = Counter.builder("semantic.cache.hit")
                .tag("level", "l1")
                .description("L1 in-memory cache hits")
                .register(meterRegistry);
        this.l2HitCounter = Counter.builder("semantic.cache.hit")
                .tag("level", "l2")
                .description("L2 Redis cache hits")
                .register(meterRegistry);
        this.l3HitCounter = Counter.builder("semantic.cache.hit")
                .tag("level", "l3")
                .description("L3 semantic cache hits")
                .register(meterRegistry);
        this.cacheMissCounter = Counter.builder("semantic.cache.miss")
                .description("Cache misses")
                .register(meterRegistry);
        log.info("SemanticCacheService initialized, similarity threshold={}", similarityThreshold);
    }
    /**
     * Query the cache (three tiers).
     *
     * @param question the user's question
     * @return the cached answer, or Optional.empty() on a miss
     */
public Optional<CacheEntry> get(String question) {
totalQueries++;
Timer.Sample sample = Timer.start(meterRegistry);
        try {
            // === L1: in-memory lookup (fastest) ===
            String exactKey = buildExactKey(question);
            CacheEntry l1Result = l1Cache.getIfPresent(exactKey);
            if (l1Result != null) {
                l1HitCounter.increment();
                cacheHits++;
                totalSavedCost += costPerLlmCall;
                log.debug("L1 cache hit, question={}", truncate(question, 50));
                return Optional.of(l1Result.withHitLevel("L1"));
            }
            // === L2: Redis exact match (fast) ===
            String l2Key = L2_KEY_PREFIX + exactKey;
            String l2Json = redisTemplate.opsForValue().get(l2Key);
            if (l2Json != null) {
                CacheEntry entry = parseEntry(l2Json);
                if (entry != null) {
                    // Backfill L1
                    l1Cache.put(exactKey, entry);
                    l2HitCounter.increment();
                    cacheHits++;
                    totalSavedCost += costPerLlmCall;
                    log.debug("L2 cache hit, question={}", truncate(question, 50));
                    return Optional.of(entry.withHitLevel("L2"));
                }
            }
            // === L3: vector semantic cache (slower, but covers similar phrasings) ===
            Optional<CacheEntry> l3Result = semanticSearch(question);
            if (l3Result.isPresent()) {
                CacheEntry entry = l3Result.get();
                // Backfill L1 and L2 (so the next identical query hits faster)
                l1Cache.put(exactKey, entry);
                saveToL2(exactKey, entry);
                l3HitCounter.increment();
                cacheHits++;
                totalSavedCost += costPerLlmCall;
                // SLF4J has no printf-style specifiers like {:.3f}; pre-format the score
                log.debug("L3 semantic cache hit, question={}, matched question={}, similarity={}",
                        truncate(question, 50),
                        truncate(entry.getOriginalQuestion(), 50),
                        String.format("%.3f", entry.getSimilarityScore()));
                return Optional.of(entry.withHitLevel("L3"));
            }
            // All tiers missed
            cacheMissCounter.increment();
            log.debug("Cache MISS, question={}", truncate(question, 50));
            return Optional.empty();
        } finally {
            sample.stop(meterRegistry.timer("semantic.cache.query.duration"));
        }
}
    /**
     * Store a question-answer pair in the cache.
     *
     * @param question the question
     * @param answer   the answer
     * @param contexts the retrieved contexts (used to judge cache validity)
     */
public void put(String question, String answer, List<String> contexts) {
String exactKey = buildExactKey(question);
CacheEntry entry = CacheEntry.builder()
.question(question)
.answer(answer)
.contexts(contexts)
.createdAt(LocalDateTime.now())
.hitCount(0)
.build();
        // Store in L1
        l1Cache.put(exactKey, entry);
        // Store in L2 (with TTL)
        saveToL2(exactKey, entry);
        // Store in L3 (vector semantic cache)
        saveToL3(question, answer, contexts);
        log.debug("Answer cached, question={}", truncate(question, 50));
}
    /**
     * L3 semantic search.
     */
    private Optional<CacheEntry> semanticSearch(String question) {
        try {
            SearchRequest request = SearchRequest.builder()
                    .query(question)
                    .topK(1)
                    .similarityThreshold(similarityThreshold)
                    .build();
            List<Document> results = vectorStore.similaritySearch(request);
            if (results.isEmpty()) {
                return Optional.empty();
            }
            Document topResult = results.get(0);
            // Go through Number: the boxed type in metadata may be Float or Double
            double distance = ((Number) topResult.getMetadata()
                    .getOrDefault("distance", 0.0)).doubleValue();
            // Spring AI's Redis VectorStore reports cosine distance; convert to similarity:
            // cosine_similarity = 1 - cosine_distance
            double cosineSimilarity = 1 - distance;
            if (cosineSimilarity < similarityThreshold) {
                return Optional.empty();
            }
            // Rebuild the cache entry from the document metadata
            Map<String, Object> metadata = topResult.getMetadata();
            String cachedAnswer = (String) metadata.get("answer");
            String originalQuestion = (String) metadata.get("question");
            if (cachedAnswer == null) {
                return Optional.empty();
            }
            CacheEntry entry = CacheEntry.builder()
                    .question(originalQuestion)
                    .originalQuestion(originalQuestion) // exposed for hit logging
                    .answer(cachedAnswer)
                    .similarityScore(cosineSimilarity)
                    .createdAt(LocalDateTime.now())
                    .build();
            return Optional.of(entry);
        } catch (Exception e) {
            log.error("L3 semantic search failed", e);
            return Optional.empty();
        }
    }
    /**
     * Write to the L2 Redis KV cache.
     */
    private void saveToL2(String key, CacheEntry entry) {
        try {
            String json = serializeEntry(entry);
            redisTemplate.opsForValue().set(
                    L2_KEY_PREFIX + key,
                    json,
                    Duration.ofHours(redisTtlHours)
            );
        } catch (Exception e) {
            log.warn("L2 cache write failed", e);
        }
    }
    /**
     * Write to the L3 vector semantic cache.
     */
    private void saveToL3(String question, String answer, List<String> contexts) {
        try {
            Map<String, Object> metadata = new HashMap<>();
            metadata.put("question", question);
            metadata.put("answer", answer);
            metadata.put("contexts_count", contexts != null ? contexts.size() : 0);
            metadata.put("cached_at", LocalDateTime.now().toString());
            Document doc = new Document(
                    UUID.randomUUID().toString(),
                    question, // the question text is what gets embedded
                    metadata
            );
            vectorStore.add(List.of(doc));
        } catch (Exception e) {
            log.warn("L3 semantic cache write failed", e);
        }
    }
    /**
     * Build the exact-match key.
     * Normalizes the question (trim, lowercase, collapse whitespace) before hashing.
     * Note: String.hashCode() can collide; use a SHA-256 digest for stricter keys.
     */
    private String buildExactKey(String question) {
        String normalized = question.trim().toLowerCase()
                .replaceAll("\\s+", " ");
        return String.valueOf(normalized.hashCode());
    }
    /**
     * Cache statistics.
     */
public CacheStats getStats() {
double hitRate = totalQueries > 0 ? (double) cacheHits / totalQueries : 0;
return CacheStats.builder()
.totalQueries(totalQueries)
.cacheHits(cacheHits)
.cacheMisses(totalQueries - cacheHits)
.hitRate(hitRate)
.totalSavedCost(totalSavedCost)
.l1Size(l1Cache.estimatedSize())
.l1HitRate(l1Cache.stats().hitRate())
.build();
}
    /**
     * Proactively invalidate the cache (call this when the knowledge base is updated).
     *
     * @param topic the affected topic (related entries in the vector cache should be purged)
     */
    public void invalidateByTopic(String topic) {
        log.info("Invalidating cache, topic={}", topic);
        // Clear all of L1 (full flush: simple and safe)
        l1Cache.invalidateAll();
        // L2: let Redis TTLs expire naturally, or delete by topic prefix (requires key design support)
        // L3: delete related documents (requires a vector store that supports metadata-filtered deletes)
        log.info("L1 cache cleared; L2 and L3 entries will expire via TTL");
    }
    // ==================== Utility methods ====================
private String truncate(String s, int maxLen) {
if (s == null) return "null";
return s.length() > maxLen ? s.substring(0, maxLen) + "..." : s;
}
    // Shared mapper with JavaTimeModule registered, so CacheEntry.createdAt
    // (LocalDateTime) serializes; a bare ObjectMapper would throw and these
    // methods would silently return null.
    private static final com.fasterxml.jackson.databind.ObjectMapper MAPPER =
            new com.fasterxml.jackson.databind.ObjectMapper()
                    .registerModule(new com.fasterxml.jackson.datatype.jsr310.JavaTimeModule());
    private String serializeEntry(CacheEntry entry) {
        try {
            return MAPPER.writeValueAsString(entry);
        } catch (Exception e) {
            return null;
        }
    }
    private CacheEntry parseEntry(String json) {
        try {
            return MAPPER.readValue(json, CacheEntry.class);
        } catch (Exception e) {
            return null;
        }
    }
    // ==================== Data models ====================
@lombok.Data
@lombok.Builder
@lombok.NoArgsConstructor
@lombok.AllArgsConstructor
public static class CacheEntry {
private String question;
private String answer;
private List<String> contexts;
private String originalQuestion;
private double similarityScore;
private String hitLevel;
private int hitCount;
private LocalDateTime createdAt;
        // Deliberately mutates the shared cached instance: hitCount accumulates across hits
        public CacheEntry withHitLevel(String level) {
            this.hitLevel = level;
            this.hitCount++;
            return this;
        }
}
@lombok.Data
@lombok.Builder
@lombok.NoArgsConstructor
@lombok.AllArgsConstructor
public static class CacheStats {
private long totalQueries;
private long cacheHits;
private long cacheMisses;
private double hitRate;
private double totalSavedCost;
private long l1Size;
private double l1HitRate;
}
}
```

Tuning the Similarity Threshold
The threshold is the single most important parameter of a semantic cache; set it too high or too low and you have problems:

```java
package com.laozhang.ai.cache.service;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import java.util.*;
/**
 * Similarity-threshold tuning tool.
 * Finds the best threshold by analyzing historical question pairs.
 */
@Service
@RequiredArgsConstructor
@Slf4j
public class ThresholdTuningService {
    private final org.springframework.ai.embedding.EmbeddingModel embeddingModel;
    /**
     * Evaluate a set of candidate thresholds.
     *
     * Approach:
     * 1. Prepare "positive pairs": semantically identical questions (should hit the cache)
     * 2. Prepare "negative pairs": semantically different questions (should not hit)
     * 3. For each threshold, compute the positive hit rate and the negative false-hit rate
     * 4. Pick the threshold with the highest F1
     */
public ThresholdAnalysis analyzeThresholds(
List<SimilarPair> positivePairs,
List<DifferentPair> negativePairs) {
double[] thresholdsToTest = {0.75, 0.80, 0.83, 0.85, 0.88, 0.90, 0.92, 0.95};
List<ThresholdResult> results = new ArrayList<>();
        // Compute all pair similarities once up front
        List<Double> positiveSimilarities = computeSimilarities(positivePairs);
        List<Double> negativeSimilarities = computeSimilaritiesForDifferent(negativePairs);
        for (double threshold : thresholdsToTest) {
            // Positive hit rate (true positive rate / recall)
            long truePositives = positiveSimilarities.stream()
                    .filter(sim -> sim >= threshold)
                    .count();
            double recall = (double) truePositives / positivePairs.size();
            // Negative false-hit rate (false positive rate)
            long falsePositives = negativeSimilarities.stream()
                    .filter(sim -> sim >= threshold)
                    .count();
            double fpr = (double) falsePositives / negativePairs.size();
            // Precision = TP / (TP + FP)
            double precision = (truePositives + falsePositives) > 0
                    ? (double) truePositives / (truePositives + falsePositives)
                    : 1.0;
            // F1 = 2 * precision * recall / (precision + recall)
            double f1 = (precision + recall) > 0
                    ? 2 * precision * recall / (precision + recall)
                    : 0;
            results.add(ThresholdResult.builder()
                    .threshold(threshold)
                    .recall(recall)
                    .precision(precision)
                    .fpr(fpr)
                    .f1(f1)
                    .build());
            // SLF4J has no {:.2f} specifier; pre-format the whole line
            log.info(String.format(
                    "threshold=%.2f: recall=%.2f, precision=%.2f, FPR=%.2f, F1=%.2f",
                    threshold, recall, precision, fpr, f1));
        }
        // Pick the best threshold (highest F1)
        ThresholdResult best = results.stream()
                .max(Comparator.comparingDouble(ThresholdResult::getF1))
                .orElse(results.get(0));
        log.info(String.format("Recommended threshold: %.2f (F1=%.3f)",
                best.getThreshold(), best.getF1()));
        return ThresholdAnalysis.builder()
                .results(results)
                .recommendedThreshold(best.getThreshold())
                .bestF1(best.getF1())
                .build();
    }
    /**
     * Cosine similarity for each positive (similar) pair.
     * Note: in Spring AI 1.0.x, EmbeddingModel.embed(String) returns float[].
     */
    private List<Double> computeSimilarities(List<SimilarPair> pairs) {
        return pairs.stream()
                .map(pair -> {
                    float[] v1 = embeddingModel.embed(pair.getQuestion1());
                    float[] v2 = embeddingModel.embed(pair.getQuestion2());
                    return cosineSimilarity(v1, v2);
                })
                .toList();
    }
    private List<Double> computeSimilaritiesForDifferent(List<DifferentPair> pairs) {
        return pairs.stream()
                .map(pair -> {
                    float[] v1 = embeddingModel.embed(pair.getQuestion1());
                    float[] v2 = embeddingModel.embed(pair.getQuestion2());
                    return cosineSimilarity(v1, v2);
                })
                .toList();
    }
    /**
     * Cosine similarity of two vectors.
     */
    private double cosineSimilarity(float[] v1, float[] v2) {
        double dot = 0, norm1 = 0, norm2 = 0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        if (norm1 == 0 || norm2 == 0) return 0;
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
@lombok.Data @lombok.Builder @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class SimilarPair {
private String question1;
private String question2;
}
@lombok.Data @lombok.Builder @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class DifferentPair {
private String question1;
private String question2;
}
@lombok.Data @lombok.Builder @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class ThresholdResult {
private double threshold;
private double recall;
private double precision;
private double fpr;
private double f1;
}
@lombok.Data @lombok.Builder @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class ThresholdAnalysis {
private List<ThresholdResult> results;
private double recommendedThreshold;
private double bestF1;
}
}
```

Threshold analysis (measured on Xiao Wang's system):
| Threshold | Cache hit rate | Wrong-hit rate | Suitable for |
|---|---|---|---|
| 0.75 | 74% | 12% | Not recommended; too many wrong hits |
| 0.80 | 68% | 6% | Lenient scenarios with high error tolerance |
| 0.85 | 61% | 2.8% | Recommended: typical business systems |
| 0.88 | 56% | 1.1% | Recommended: accuracy-sensitive systems |
| 0.90 | 48% | 0.4% | Strict scenarios, e.g. legal/medical |
| 0.95 | 31% | 0.05% | Extremely strict; almost exact matching |
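For illustration, a hedged sketch of how the tuning service might be driven — the pairs below are made up, and `thresholdTuningService` is assumed to be injected (e.g. in a test or a CommandLineRunner); in practice you would mine the pairs from query logs:

```java
// Hypothetical driver for ThresholdTuningService; pair contents are illustrative.
var positives = List.of(
        new ThresholdTuningService.SimilarPair("如何申请年假?", "年假如何申请呀"),
        new ThresholdTuningService.SimilarPair("报销流程是什么?", "怎么走报销流程"));
var negatives = List.of(
        new ThresholdTuningService.DifferentPair("如何申请年假?", "公积金怎么提取?"),
        new ThresholdTuningService.DifferentPair("产假有多少天?", "请病假需要什么材料?"));

ThresholdTuningService.ThresholdAnalysis analysis =
        thresholdTuningService.analyzeThresholds(positives, negatives);
System.out.println("Recommended threshold: " + analysis.getRecommendedThreshold());
```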
Cache Warmup: Preloading Hot Questions from Historical Data

```java
package com.laozhang.ai.cache.service;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Service;
import java.util.List;
/**
 * Cache warmup service.
 * On application startup, preloads historical high-frequency Q&A pairs into the cache.
 */
@Service
@RequiredArgsConstructor
@Slf4j
public class CacheWarmupService {
private final SemanticCacheService semanticCacheService;
@Value("${semantic-cache.warmup-enabled:true}")
private boolean warmupEnabled;
@Value("${semantic-cache.warmup-size:100}")
private int warmupSize;
    /**
     * Run warmup once the application is fully started.
     */
    @EventListener(ApplicationReadyEvent.class)
    public void warmup() {
        if (!warmupEnabled) {
            log.info("Cache warmup disabled, skipping");
            return;
        }
        log.info("Starting cache warmup, loading top {} historical Q&A pairs...", warmupSize);
long start = System.currentTimeMillis();
        try {
            // Load historical high-frequency Q&A pairs from the database
            List<HistoricalQaPair> topQaPairs = loadTopQaPairs(warmupSize);
            int loaded = 0;
            for (HistoricalQaPair pair : topQaPairs) {
                try {
                    semanticCacheService.put(
                            pair.getQuestion(),
                            pair.getAnswer(),
                            pair.getContexts()
                    );
                    loaded++;
                } catch (Exception e) {
                    log.warn("Warmup failed for one entry, question={}", pair.getQuestion(), e);
                }
            }
            long elapsed = System.currentTimeMillis() - start;
            log.info("Cache warmup finished: {} entries loaded in {}ms", loaded, elapsed);
        } catch (Exception e) {
            log.error("Cache warmup failed (startup continues regardless)", e);
        }
}
    /**
     * Load historical high-frequency Q&A pairs from the database.
     * In a real project, inject a Repository and read from it (see the sketch below).
     */
    private List<HistoricalQaPair> loadTopQaPairs(int limit) {
        // Load the most-asked Q&A pairs from the query log, e.g.:
        // SELECT question, answer, contexts FROM rag_query_log
        // WHERE created_at > NOW() - INTERVAL '30 days'
        // ORDER BY query_count DESC LIMIT ?
        return List.of(); // placeholder
    }
@lombok.Data
@lombok.Builder
@lombok.NoArgsConstructor
@lombok.AllArgsConstructor
public static class HistoricalQaPair {
private String question;
private String answer;
private List<String> contexts;
private long queryCount;
}
}
```

Integrating the Cache into the RAG Q&A Flow

```java
package com.laozhang.ai.cache.api;
import com.laozhang.ai.cache.service.SemanticCacheService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.*;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
/**
 * RAG Q&A API with semantic caching.
 */
@RestController
@RequestMapping("/api/qa")
@RequiredArgsConstructor
@Slf4j
public class CachedRagController {
private final SemanticCacheService semanticCacheService;
private final VectorStore knowledgeVectorStore;
private final ChatClient chatClient;
    /**
     * Q&A endpoint (with three-tier caching).
     */
@PostMapping("/ask")
public QaResponse ask(@RequestBody QaRequest request) {
String question = request.getQuestion();
long start = System.currentTimeMillis();
        // ===== 1. Check the cache =====
Optional<SemanticCacheService.CacheEntry> cached =
semanticCacheService.get(question);
if (cached.isPresent()) {
SemanticCacheService.CacheEntry entry = cached.get();
log.info("缓存命中({}),question={}", entry.getHitLevel(),
question.substring(0, Math.min(50, question.length())));
return QaResponse.builder()
.answer(entry.getAnswer())
.fromCache(true)
.cacheLevel(entry.getHitLevel())
.cachedQuestion(entry.getOriginalQuestion())
.similarityScore(entry.getSimilarityScore())
.processingTimeMs(System.currentTimeMillis() - start)
.build();
}
        // ===== 2. Cache miss: run the full RAG pipeline =====
        log.debug("Cache MISS, running RAG retrieval, question={}", question);
        // Retrieve relevant documents
List<Document> docs = knowledgeVectorStore.similaritySearch(
SearchRequest.builder()
.query(question)
.topK(5)
.similarityThreshold(0.5)
.build()
);
List<String> contexts = docs.stream()
.map(Document::getFormattedContent)
.collect(Collectors.toList());
        // Build the context block
        String contextText = buildContextText(contexts);
        // Call the LLM
        String answer = chatClient.prompt()
                .system("""
                        You are an internal enterprise knowledge-base assistant.
                        Answer the question using the reference material below; if the
                        material is insufficient, say so explicitly.
                        Reference material:
                        """ + contextText)
                .user(question)
                .call()
                .content();
        // ===== 3. Store in the cache =====
semanticCacheService.put(question, answer, contexts);
return QaResponse.builder()
.answer(answer)
.fromCache(false)
.cacheLevel("MISS")
.processingTimeMs(System.currentTimeMillis() - start)
.build();
}
    /**
     * Cache statistics endpoint.
     */
@GetMapping("/cache/stats")
public SemanticCacheService.CacheStats getCacheStats() {
return semanticCacheService.getStats();
}
    /**
     * Manually invalidate the cache (call when the knowledge base is updated).
     */
@PostMapping("/cache/invalidate")
public void invalidateCache(@RequestParam String topic) {
semanticCacheService.invalidateByTopic(topic);
}
private String buildContextText(List<String> contexts) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < contexts.size(); i++) {
sb.append("[").append(i + 1).append("] ").append(contexts.get(i)).append("\n\n");
}
return sb.toString();
}
@lombok.Data @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class QaRequest {
private String question;
}
@lombok.Data @lombok.Builder @lombok.NoArgsConstructor @lombok.AllArgsConstructor
public static class QaResponse {
private String answer;
private boolean fromCache;
private String cacheLevel;
private String cachedQuestion;
private double similarityScore;
private long processingTimeMs;
}
}
```

Monitoring Cache Effectiveness
The core metrics to watch:

```java
package com.laozhang.ai.cache.monitor;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;
import com.laozhang.ai.cache.service.SemanticCacheService;
import jakarta.annotation.PostConstruct;
/**
 * Registers cache gauges with Micrometer.
 */
@Component
@RequiredArgsConstructor
public class CacheMetricsRegistrar {
private final SemanticCacheService cacheService;
private final MeterRegistry meterRegistry;
    @PostConstruct
    public void registerMetrics() {
        // Overall hit rate
        Gauge.builder("semantic.cache.hit.rate",
                        cacheService, svc -> svc.getStats().getHitRate())
                .description("Semantic cache hit rate")
                .register(meterRegistry);
        // Cost saved
        Gauge.builder("semantic.cache.saved.cost",
                        cacheService, svc -> svc.getStats().getTotalSavedCost())
                .description("Cumulative LLM cost saved (¥)")
                .register(meterRegistry);
        // L1 cache size
        Gauge.builder("semantic.cache.l1.size",
                        cacheService, svc -> svc.getStats().getL1Size())
                .description("L1 in-memory cache entry count")
                .register(meterRegistry);
        // L1 hit rate
        Gauge.builder("semantic.cache.l1.hit.rate",
                        cacheService, svc -> svc.getStats().getL1HitRate())
                .description("L1 in-memory cache hit rate")
                .register(meterRegistry);
    }
}
```

Real Numbers: Cost Changes After an Enterprise Knowledge Base Adopted Semantic Caching
After rolling out the three-tier semantic cache, Xiao Wang compiled a report from production data:
Before (traditional exact-match K-V cache):
| Metric | Value |
|---|---|
| Monthly query volume | 48,000 |
| Cache hit rate | 8% |
| LLM calls | 44,160/month |
| Average monthly cost | ¥28,600 |
| Average response time | 2.3s |
After (three-tier semantic cache, threshold 0.88):
| Metric | Value |
|---|---|
| Monthly query volume | 51,000 (usage grew with the better experience) |
| Cache hit rate | 68% |
| LLM calls | 16,320/month |
| Average monthly cost | ¥9,800 |
| Average response time | 0.8s (cache hits at <50ms pull the mean down) |
Net effect: cost down 66%, response time down 65%.
Hit-rate breakdown:
- L1 hits (memory): 14% (super-hot questions)
- L2 hits (Redis exact): 18% (same question asked repeatedly)
- L3 hits (semantic): 36% (similar phrasings)
- Cache MISS: 32% (genuinely new questions)
Production Considerations
1. Cache pollution
If a wrong answer gets cached, the semantic cache will "spread" it to similar questions. Mitigation:

```java
// Record a version and source on every cache entry, and force-invalidate the
// affected topic whenever the knowledge base changes.
@PostMapping("/knowledge/update")
public void updateKnowledge(@RequestBody KnowledgeUpdateRequest req) {
    // 1. Update the knowledge base
    knowledgeService.update(req);
    // 2. Force-invalidate the related cache entries
    semanticCacheService.invalidateByTopic(req.getTopic());
    log.info("Knowledge base updated, related cache cleared, topic={}", req.getTopic());
}
```

2. Redis memory management
The vector cache uses far more memory than a KV cache (a 1536-dimension float vector is 1536 × 4 bytes ≈ 6KB per entry). Set a memory cap:

```
# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru  # LRU eviction
```

3. Vector index maintenance
As cache entries accumulate, vector search slows down. Periodically purge low-quality and stale entries:

```java
@Scheduled(cron = "0 0 4 * * ?") // run cleanup daily at 4am
public void cleanupExpiredCache() {
    // Delete vector-cache entries older than 30 days.
    // Implement against your vector store's API, e.g. a metadata filter
    // on the cached_at field, if the store supports filtered deletes.
    log.info("Running cache cleanup job");
}
```

FAQ
Q1: Can the semantic cache wrongly match unrelated questions?
Yes, which is exactly why the threshold matters so much. A threshold of 0.88 means only questions with cosine similarity above 0.88 hit the cache; in our measurements the wrong-hit rate was about 1.1%. For high-stakes domains (legal, medical), raise it to 0.92 or higher.
Q2: What if a cached answer goes stale?
The TTL (24 hours by default) bounds how long an entry lives. For questions like "what does Article X of the Companies Act say", the answer almost never changes, so the TTL can be long (7 or even 30 days). For time-sensitive questions like "what is the current exchange rate", don't cache at all (or keep the TTL under an hour) — see the sketch below.
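A hedged sketch of that idea — per-category TTLs instead of the single redis-ttl-hours default (the category names and durations are hypothetical):

```java
import java.time.Duration;

// Hypothetical per-category TTL policy; categories and durations are illustrative.
enum QuestionCategory { POLICY, GENERAL, TIME_SENSITIVE }

class TtlPolicy {
    static Duration ttlFor(QuestionCategory category) {
        return switch (category) {
            case POLICY -> Duration.ofDays(7);        // HR/legal policy: rarely changes
            case GENERAL -> Duration.ofHours(24);     // default, matches redis-ttl-hours
            case TIME_SENSITIVE -> Duration.ZERO;     // e.g. exchange rates: treat as "do not cache"
        };
    }
}
```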
Q3: Why store vectors in Redis instead of a dedicated vector database?
Redis Stack ships with vector search (RediSearch). For a cache workload (under roughly 500k entries) its performance is more than enough, and operations stay simple: no extra Milvus/Qdrant deployment. Dedicated vector databases are for retrieval workloads (document corpora in the millions).
Q4: With multiple instances, the L1 in-memory caches diverge. How is that handled?
L1 is per-instance by design, so yes, they diverge — but it isn't a problem. An L1 miss falls through to L2 or L3, which are shared (Redis-backed), so global consistency is guaranteed by L2/L3. L1 is purely local acceleration, and brief staleness there is acceptable.
Q5: How do I verify the semantic cache isn't producing wrong hits?
Sample the L3 hits, compare the incoming question with the cached question, and verify by hand that they really are semantically equivalent. Also track the thumbs-down rate on cached answers: if it is clearly higher than on non-cached answers, the threshold is too loose.
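A hedged fragment of that sampling idea, meant to be dropped into the controller's cache-hit branch (`entry` and `question` come from that scope; `reviewLog` is a hypothetical dedicated logger):

```java
// Log ~1% of L3 hits to a review channel for manual semantic-equivalence checks.
if ("L3".equals(entry.getHitLevel())
        && java.util.concurrent.ThreadLocalRandom.current().nextDouble() < 0.01) {
    reviewLog.info("L3 sample: incoming='{}' cached='{}' similarity={}",
            question, entry.getOriginalQuestion(), entry.getSimilarityScore());
}
```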
Q6: Embedding calls cost money too — is semantic caching still worth it?
Every query needs an embedding (~100 tokens, roughly $0.00002). Even on a cache miss, that embedding fee is already spent, but it is on the order of 100× cheaper than a full RAG call (which can consume 2,000+ tokens). At a 68% hit rate, every 100 queries save about 68 LLM calls, so the embedding cost is negligible.
Summary
Semantic caching is the most direct and effective cost optimization for RAG.
Actionable checklist:
- Put a three-tier cache (Caffeine L1 → Redis exact-match L2 → Redis vector L3) in front of your RAG pipeline
- Start the similarity threshold at 0.85-0.88, then tune it on positive/negative question pairs mined from your logs
- Warm the cache at startup with your top historical Q&A pairs
- Expose hit-rate and saved-cost metrics (Micrometer/Prometheus) and watch the L1/L2/L3 breakdown
- Invalidate affected entries whenever the knowledge base changes, and match TTLs to how fast answers go stale