AI应用的成本工程:把月账单从10万降到2万的系统方法
那封让CEO红了眼的账单邮件
2025年10月的最后一天,张磊盯着手机上的一封邮件,头皮发麻。
邮件是财务同事转来的10月AI服务账单汇总:¥103,247.36。
他是一家B2B SaaS初创公司的CTO,公司刚完成A轮融资800万。照这个烧法,一年下来光AI成本就要120多万,吃掉融资的15%以上。CEO王总在飞书上直接发了一个截图过来:"张磊,这个月你们AI花了多少钱?这不是做生意的节奏。"
张磊坐下来仔细分析账单:
OpenAI API调用费用:
- GPT-4 (gpt-4-turbo): ¥64,300(占62%)
- GPT-3.5-turbo: ¥8,200(占8%)
- text-embedding-ada-002: ¥12,400(占12%)

存储和推理服务:
- 向量数据库 (Pinecone): ¥11,000(占11%)
- 其他服务: ¥7,347(占7%)

总计:¥103,247

问题很清楚:GPT-4的使用量太大,而且大量请求根本不需要GPT-4级别的能力。
经过6周的系统性成本优化,11月账单降到¥19,843.52,降幅80.8%。
这篇文章,把张磊团队总结的6个降本策略完整告诉你。
AI成本全景:Token费用的构成分析
在优化之前,必须先搞清楚钱花在哪里。
成本构成图谱
从10月账单看,成本大体分三块:模型推理(GPT-4/GPT-3.5的Input/Output Token,约占70%)、向量化(Embedding,约12%)、向量存储与其他服务(约18%)。模型推理是绝对大头,也是优化的主战场。
主流模型价格对比(2025年12月)
| 模型 | Input (每1M Token) | Output (每1M Token) | 适用场景 |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 复杂推理、创意写作 |
| GPT-4o-mini | $0.15 | $0.60 | 简单问答、分类 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 代码、分析 |
| Claude 3 Haiku | $0.25 | $1.25 | 简单任务 |
| Qwen2.5-72B(API) | $0.56 | $2.25 | 中文任务 |
| Qwen2.5-7B(本地) | $0 | $0 | 超高频简单任务 |
| DeepSeek-V3 | $0.27 | $1.10 | 高性价比通用 |
关键发现:GPT-4o的价格是GPT-4o-mini的33倍(Input)和25倍(Output)!很多任务根本不需要GPT-4的能力。
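拿一个典型请求按上表价格算笔账:假设单次调用消耗1,000个Input Token、500个Output Token,GPT-4o的成本是 1,000×$5/1M + 500×$15/1M = $0.0125,而GPT-4o-mini只要 1,000×$0.15/1M + 500×$0.60/1M = $0.00045,相差约28倍。也就是说,只要把其中一半本来用mini就够的请求切换过去,推理账单就能砍掉近一半。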
成本分析工具:精细化成本统计
在实施任何优化之前,先建立成本可观测性——按功能、用户、租户维度分析成本。
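下面的切面代码依赖一个自定义注解 @TrackAiCost,原文没有给出定义。一个最小的示意定义如下,只包含切面里实际用到的 feature 属性:

```java
package com.laozhang.cost.tracking;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * 标记需要统计成本的AI调用方法
 * feature 用于把成本归集到具体功能模块(如 knowledge_qa)
 */
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface TrackAiCost {
    String feature();
}
```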
package com.laozhang.cost.tracking;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
/**
* AI成本追踪切面
* 自动拦截所有AI调用,记录Token使用和费用
*/
@Aspect
@Component
public class AiCostTrackingAspect {
private static final double GPT4O_INPUT_PRICE = 5.0 / 1_000_000; // 每Token价格(美元)
private static final double GPT4O_OUTPUT_PRICE = 15.0 / 1_000_000;
private static final double MINI_INPUT_PRICE = 0.15 / 1_000_000;
private static final double MINI_OUTPUT_PRICE = 0.60 / 1_000_000;
private static final double EMBED_PRICE = 0.02 / 1_000_000;
private final MeterRegistry meterRegistry;
private final CostRepository costRepository;
// 实时成本缓存(按功能模块)
private final ConcurrentHashMap<String, AtomicLong> costByFeature = new ConcurrentHashMap<>();
public AiCostTrackingAspect(MeterRegistry meterRegistry, CostRepository costRepository) {
this.meterRegistry = meterRegistry;
this.costRepository = costRepository;
}
/**
* 拦截所有AI服务调用,记录成本
*/
@Around("@annotation(trackCost)")
public Object trackAiCost(ProceedingJoinPoint joinPoint, TrackAiCost trackCost)
throws Throwable {
long startMs = System.currentTimeMillis();
Object result = joinPoint.proceed();
long latencyMs = System.currentTimeMillis() - startMs;
// 从结果中提取Token使用信息
if (result instanceof AiResponse aiResponse) {
TokenUsage usage = aiResponse.getTokenUsage();
String model = aiResponse.getModel();
String feature = trackCost.feature();
String userId = extractUserId(joinPoint);
// 计算费用
double cost = calculateCost(model, usage.inputTokens(), usage.outputTokens());
// 记录到各个维度
recordCostMetrics(feature, userId, model, cost, usage, latencyMs);
// 异步持久化
costRepository.saveAsync(new CostRecord(
feature, userId, model, cost,
usage.inputTokens(), usage.outputTokens(),
latencyMs, System.currentTimeMillis()
));
}
return result;
}
private double calculateCost(String model, int inputTokens, int outputTokens) {
return switch (model) {
case "gpt-4o" -> inputTokens * GPT4O_INPUT_PRICE + outputTokens * GPT4O_OUTPUT_PRICE;
case "gpt-4o-mini" -> inputTokens * MINI_INPUT_PRICE + outputTokens * MINI_OUTPUT_PRICE;
case "text-embedding-ada-002" -> inputTokens * EMBED_PRICE;
default -> 0.0;
};
}
private void recordCostMetrics(String feature, String userId, String model,
double cost, TokenUsage usage, long latencyMs) {
// Prometheus指标
meterRegistry.counter("ai.cost.dollars",
"feature", feature, "model", model).increment(cost);
meterRegistry.counter("ai.tokens.input",
"feature", feature, "model", model).increment(usage.inputTokens());
meterRegistry.counter("ai.tokens.output",
"feature", feature, "model", model).increment(usage.outputTokens());
meterRegistry.timer("ai.latency",
"feature", feature, "model", model).record(
java.time.Duration.ofMillis(latencyMs));
// 功能级别成本累加
costByFeature.computeIfAbsent(feature, k -> new AtomicLong(0))
.addAndGet((long)(cost * 1_000_000)); // 存微美元,避免浮点精度问题
}
private String extractUserId(ProceedingJoinPoint joinPoint) {
// 从ThreadLocal、Spring Security上下文等获取
return UserContext.getCurrentUserId();
}
}

成本分析报告生成
package com.laozhang.cost.analysis;
import org.springframework.stereotype.Service;
import java.time.*;
import java.util.*;
import java.util.stream.Collectors;
/**
* 成本分析报告服务
* 生成按维度细分的成本报告,帮助识别优化机会
*/
@Service
public class CostAnalysisService {
private final CostRepository costRepository;
public CostAnalysisService(CostRepository costRepository) {
this.costRepository = costRepository;
}
/**
* 生成月度成本分析报告
*/
public CostReport generateMonthlyReport(YearMonth month) {
LocalDateTime start = month.atDay(1).atStartOfDay();
LocalDateTime end = month.atEndOfMonth().atTime(23, 59, 59);
List<CostRecord> records = costRepository.findByDateRange(start, end);
// 按功能维度汇总
Map<String, DoubleSummaryStatistics> byFeature = records.stream()
.collect(Collectors.groupingBy(
CostRecord::feature,
Collectors.summarizingDouble(CostRecord::cost)
));
// 按模型维度汇总
Map<String, DoubleSummaryStatistics> byModel = records.stream()
.collect(Collectors.groupingBy(
CostRecord::model,
Collectors.summarizingDouble(CostRecord::cost)
));
// 按天维度汇总(识别成本异常)
Map<LocalDate, Double> byDay = records.stream()
.collect(Collectors.groupingBy(
r -> r.timestamp().toLocalDate(),
Collectors.summingDouble(CostRecord::cost)
));
// 找出成本最高的Top 10 功能
List<FeatureCost> topFeatures = byFeature.entrySet().stream()
.map(e -> new FeatureCost(e.getKey(), e.getValue().getSum()))
.sorted(Comparator.comparingDouble(FeatureCost::totalCost).reversed())
.limit(10)
.toList();
// 计算每次请求的平均成本
double avgCostPerRequest = records.stream()
.mapToDouble(CostRecord::cost)
.average()
.orElse(0);
// 识别优化机会
List<OptimizationOpportunity> opportunities =
identifyOpportunities(records, byFeature, byModel);
double totalCost = records.stream().mapToDouble(CostRecord::cost).sum();
return new CostReport(month, totalCost, records.size(), avgCostPerRequest,
byFeature, byModel, byDay, topFeatures, opportunities);
}
/**
* 自动识别成本优化机会
*/
private List<OptimizationOpportunity> identifyOpportunities(
List<CostRecord> records,
Map<String, DoubleSummaryStatistics> byFeature,
Map<String, DoubleSummaryStatistics> byModel) {
List<OptimizationOpportunity> opportunities = new ArrayList<>();
// 机会1:GPT-4被用于简单任务
long gpt4SimpleTaskCount = records.stream()
.filter(r -> r.model().contains("gpt-4"))
.filter(r -> r.outputTokens() < 100) // 输出很短=简单任务
.filter(r -> r.inputTokens() < 500) // 输入也短
.count();
if (gpt4SimpleTaskCount > 1000) {
double wastedCost = gpt4SimpleTaskCount * 0.005; // 估算浪费
opportunities.add(new OptimizationOpportunity(
"模型降级",
String.format("发现%d次用GPT-4处理简单任务,切换到mini可节省约$%.0f",
gpt4SimpleTaskCount, wastedCost),
wastedCost,
"高"
));
}
// 机会2:高重复率请求(适合语义缓存)
long totalEmbeddingCalls = records.stream()
.filter(r -> r.model().contains("embedding"))
.count();
if (totalEmbeddingCalls > 5000) {
opportunities.add(new OptimizationOpportunity(
"语义缓存",
"Embedding调用量大,添加缓存预计减少60%调用",
totalEmbeddingCalls * 0.00002 * 0.6,
"高"
));
}
// 机会3:Prompt过长
double avgInputTokens = records.stream()
.mapToInt(CostRecord::inputTokens)
.average()
.orElse(0);
if (avgInputTokens > 2000) {
// 粗略估算:压缩Prompt后总成本约可降低25%
double estimatedSaving = records.stream().mapToDouble(CostRecord::cost).sum() * 0.25;
opportunities.add(new OptimizationOpportunity(
"Prompt压缩",
String.format("平均Input Token %.0f,高于合理水平2000,压缩Prompt可节省约$%.0f/月",
avgInputTokens, estimatedSaving),
estimatedSaving,
"中"
));
}
return opportunities;
}
public record FeatureCost(String feature, double totalCost) {}
public record OptimizationOpportunity(
String strategy, String description,
double estimatedSaving, String priority) {}
}

降本策略1:Prompt压缩(节省30-50% Input Token)
原理
Input Token的价格通常是Output Token的1/3,但很多应用的System Prompt冗余巨大。
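举一个简化的算例(数字为假设):一个带8个few-shot示例的System Prompt约1,200 Token,精简到2个示例、删掉重复说明后约500 Token。按GPT-4o的Input单价$5/1M,每次调用省约$0.0035;月请求量10万次时,仅这一项就省约$350,而且这个节省叠加在后续所有策略之上。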
package com.laozhang.cost.optimization;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Component;
import java.util.*;
import java.util.regex.Pattern;
/**
* Prompt压缩器
* 在不影响效果的前提下,最大化减少Input Token
*/
@Component
public class PromptCompressor {
private static final Pattern MULTI_SPACE = Pattern.compile("[ \\t]{2,}"); // 只折叠行内空白,保留换行
private static final Pattern MULTI_NEWLINE = Pattern.compile("\n{3,}");
/**
* 策略1:去除冗余空白
* 平均节省:3-8% Token
*/
public String removeRedundantWhitespace(String prompt) {
return MULTI_NEWLINE.matcher(
MULTI_SPACE.matcher(prompt).replaceAll(" ")
).replaceAll("\n\n");
}
/**
* 策略2:压缩系统提示词中的示例
* 平均节省:15-25% Token
*/
public String compressExamples(String systemPrompt, int maxExamples) {
// 找到示例部分并截取
String[] lines = systemPrompt.split("\n");
List<String> kept = new ArrayList<>();
int exampleCount = 0;
boolean inExample = false;
for (String line : lines) {
if (line.contains("示例") || line.contains("Example")) {
inExample = true;
exampleCount++;
}
if (!inExample || exampleCount <= maxExamples) {
kept.add(line);
}
}
return String.join("\n", kept);
}
/**
* 策略3:LLMLingua风格的Token级压缩
* 使用小模型对Prompt做关键词提取式压缩
* 平均节省:30-50% Token(需要牺牲少量效果)
*/
public String compressContext(String longContext, String question,
double compressionRatio) {
// 实现关键句子提取
// 1. 将上下文分句
String[] sentences = longContext.split("[。!?\\.!?]");
// 2. 计算每句与问题的相关度(TF-IDF简化版)
Set<String> questionWords = tokenize(question);
List<SentenceScore> scored = new ArrayList<>();
for (int i = 0; i < sentences.length; i++) {
String sent = sentences[i].trim();
if (sent.isEmpty()) continue;
Set<String> sentWords = tokenize(sent);
Set<String> intersection = new HashSet<>(sentWords);
intersection.retainAll(questionWords);
double score = questionWords.isEmpty() ? 0 :
(double) intersection.size() / questionWords.size();
// 位置权重:开头和结尾的句子更重要
double posWeight = (i < 3 || i >= sentences.length - 3) ? 1.5 : 1.0;
scored.add(new SentenceScore(sent, score * posWeight, i));
}
// 3. 按分数排序,保留 compressionRatio 比例的句子
int keepCount = (int) Math.ceil(sentences.length * compressionRatio);
List<SentenceScore> topSentences = scored.stream()
.sorted(Comparator.comparingDouble(SentenceScore::score).reversed())
.limit(keepCount)
.sorted(Comparator.comparingInt(SentenceScore::originalIndex)) // 保持原始顺序
.toList();
return String.join("。", topSentences.stream()
.map(SentenceScore::sentence)
.toList());
}
/**
* 策略4:RAG上下文窗口优化
* 只传入最相关的N个片段,而不是全部检索结果
*/
public String optimizeRagContext(List<String> retrievedChunks,
String question,
int maxTokens) {
StringBuilder context = new StringBuilder();
int estimatedTokens = 0;
// 按相关度排序(假设retrievedChunks已按相关度排序)
for (String chunk : retrievedChunks) {
int chunkTokens = estimateTokenCount(chunk);
if (estimatedTokens + chunkTokens > maxTokens) break;
context.append(chunk).append("\n---\n");
estimatedTokens += chunkTokens;
}
return context.toString();
}
// 简单的Token数量估算(中文约1.5个字符/Token,英文约4个字符/Token)
public int estimateTokenCount(String text) {
int chineseChars = 0;
int otherChars = 0;
for (char c : text.toCharArray()) {
if (c >= 0x4E00 && c <= 0x9FFF) {
chineseChars++;
} else {
otherChars++;
}
}
return (int)(chineseChars / 1.5 + otherChars / 4.0);
}
private Set<String> tokenize(String text) {
Set<String> words = new HashSet<>();
for (String w : text.split("[\\s,。!?,\\.!?]+")) {
if (w.length() >= 2) words.add(w);
}
return words;
}
private record SentenceScore(String sentence, double score, int originalIndex) {}
}

降本策略2:模型路由(简单问题用便宜模型)
这是张磊团队最大的降本点,也是最复杂的策略。
package com.laozhang.cost.routing;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.regex.Pattern;
/**
* 智能模型路由器
* 根据任务复杂度自动选择性价比最优的模型
*
* 路由逻辑:
* GPT-4o → 复杂推理、代码生成、创意写作
* GPT-4o-mini → 简单问答、分类、摘要、翻译
* 本地Qwen → 模板填充、格式转换、简单提取
*/
@Service
public class ModelRouter {
private final ChatClient gpt4oClient;
private final ChatClient miniClient;
private final ChatClient localModelClient;
private final TaskComplexityClassifier classifier;
private final CostTracker costTracker;
public ModelRouter(
ChatClient gpt4oClient,
ChatClient miniClient,
ChatClient localModelClient,
TaskComplexityClassifier classifier,
CostTracker costTracker) {
this.gpt4oClient = gpt4oClient;
this.miniClient = miniClient;
this.localModelClient = localModelClient;
this.classifier = classifier;
this.costTracker = costTracker;
}
/**
* 路由并执行
*/
public RoutedResponse route(String systemPrompt, String userMessage,
String featureTag) {
// 1. 分类任务复杂度
TaskComplexity complexity = classifier.classify(userMessage, systemPrompt);
// 2. 选择模型
String selectedModel = selectModel(complexity);
ChatClient selectedClient = getClient(selectedModel);
// 3. 执行推理
long start = System.currentTimeMillis();
String response = selectedClient.prompt()
.system(systemPrompt)
.user(userMessage)
.call()
.content();
long latencyMs = System.currentTimeMillis() - start;
// 4. 记录路由决策和成本
costTracker.recordRouting(featureTag, selectedModel, complexity,
estimateTokens(userMessage), estimateTokens(response), latencyMs);
return new RoutedResponse(response, selectedModel, complexity, latencyMs);
}
private String selectModel(TaskComplexity complexity) {
return switch (complexity) {
case SIMPLE -> "local-qwen"; // 免费本地模型
case MODERATE -> "gpt-4o-mini"; // 便宜小模型
case COMPLEX -> "gpt-4o"; // 强力大模型
};
}
private ChatClient getClient(String model) {
return switch (model) {
case "local-qwen" -> localModelClient;
case "gpt-4o-mini" -> miniClient;
default -> gpt4oClient;
};
}
private int estimateTokens(String text) {
return text.length() / 3; // 粗略估算
}
public record RoutedResponse(String content, String model,
TaskComplexity complexity, long latencyMs) {}
}

package com.laozhang.cost.routing;
import org.springframework.stereotype.Component;
import java.util.*;
import java.util.regex.Pattern;
/**
* 任务复杂度分类器
* 快速判断一个任务需要多强的模型
*/
@Component
public class TaskComplexityClassifier {
// 复杂任务的关键词
private static final List<String> COMPLEX_KEYWORDS = List.of(
"分析", "设计", "架构", "优化", "比较", "评估", "生成代码",
"写一篇", "论文", "方案", "复杂", "详细说明"
);
// 简单任务的关键词
private static final List<String> SIMPLE_KEYWORDS = List.of(
"翻译", "总结", "提取", "格式化", "转换",
"是否", "多少", "什么时候", "列举"
);
// 简单任务的结构模式
private static final Pattern YES_NO_PATTERN = Pattern.compile(
"是[不否]|有没有|能不能|可不可以|对不对");
/**
* 三维评分:关键词 + 长度 + 结构
*/
public TaskComplexity classify(String userMessage, String systemPrompt) {
int complexScore = 0;
int simpleScore = 0;
// 维度1:关键词匹配
for (String keyword : COMPLEX_KEYWORDS) {
if (userMessage.contains(keyword)) complexScore += 2;
}
for (String keyword : SIMPLE_KEYWORDS) {
if (userMessage.contains(keyword)) simpleScore += 2;
}
// 维度2:消息长度(长消息通常更复杂)
int msgLen = userMessage.length();
if (msgLen > 500) complexScore += 3;
else if (msgLen > 200) complexScore += 1;
else if (msgLen < 50) simpleScore += 2;
// 维度3:结构特征
if (YES_NO_PATTERN.matcher(userMessage).find()) simpleScore += 3;
if (userMessage.contains("```") || userMessage.contains("代码")) complexScore += 2;
if (userMessage.split("[,。?!\n]").length > 5) complexScore += 1;
// 维度4:System Prompt类型提示
if (systemPrompt.contains("代码") || systemPrompt.contains("分析")) {
complexScore += 1;
}
if (systemPrompt.contains("摘要") || systemPrompt.contains("翻译")) {
simpleScore += 1;
}
// 综合判断
int diff = complexScore - simpleScore;
if (diff >= 3) return TaskComplexity.COMPLEX;
if (diff <= -2) return TaskComplexity.SIMPLE;
return TaskComplexity.MODERATE;
}
}

模型路由实测效果(优化2周后):
路由分布(日均50,000次调用):
- 本地Qwen(SIMPLE):18,500次(37%),成本 $0
- GPT-4o-mini(MODERATE):24,000次(48%),成本 $12.5/天
- GPT-4o(COMPLEX):7,500次(15%),成本 $18.8/天

优化前:100% GPT-4o → $62.5/天
优化后:分层路由 → $31.3/天
节省:$31.2/天(-50%)

降本策略3:语义缓存(减少60%重复API调用)
传统的精确缓存(完全相同的问题才命中)命中率很低。语义缓存通过向量相似度来判断"意思相同"的问题。
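下面的 SemanticCacheService 依赖一个 VectorIndexService 做近似最近邻检索,原文没有给出实现。生产环境可以直接用Redis、Milvus等向量索引;这里先给一个基于内存、暴力余弦相似度的最小示意(接口按后面代码的用法反推,属于假设实现,SimilarEntry 即下文缓存服务里定义的record):

```java
package com.laozhang.cost.cache;

import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * 向量索引的最小内存实现(示意)
 * 暴力遍历计算余弦相似度,缓存条目在万级以内时够用
 */
@Component
public class VectorIndexService {

    private final Map<String, float[]> index = new ConcurrentHashMap<>();

    /** 注册一条缓存问题的向量 */
    public void add(String cacheKey, float[] embedding) {
        index.put(cacheKey, embedding);
    }

    /** 返回与查询向量最相似的topK条记录 */
    public List<SemanticCacheService.SimilarEntry> search(float[] query, int topK) {
        List<SemanticCacheService.SimilarEntry> all = new ArrayList<>();
        index.forEach((key, vec) ->
            all.add(new SemanticCacheService.SimilarEntry(key, cosine(query, vec))));
        all.sort((a, b) -> Double.compare(b.similarity(), a.similarity()));
        return all.subList(0, Math.min(topK, all.size()));
    }

    private double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```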
package com.laozhang.cost.cache;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.CompletableFuture;
/**
* 语义缓存服务
* 基于向量相似度实现"相似问题"的缓存命中
* 可将重复API调用减少60%
*/
@Service
public class SemanticCacheService {
private static final double SIMILARITY_THRESHOLD = 0.92; // 相似度阈值
private static final int MAX_CACHE_SIZE = 10_000; // 最大缓存条目
private static final Duration TTL = Duration.ofHours(24);
private final EmbeddingModel embeddingModel;
private final RedisTemplate<String, CacheEntry> redisTemplate;
private final VectorIndexService vectorIndex; // 向量索引(用于快速相似搜索)
// 缓存统计
private long hits = 0;
private long misses = 0;
public SemanticCacheService(EmbeddingModel embeddingModel,
RedisTemplate<String, CacheEntry> redisTemplate,
VectorIndexService vectorIndex) {
this.embeddingModel = embeddingModel;
this.redisTemplate = redisTemplate;
this.vectorIndex = vectorIndex;
}
/**
* 查询缓存
* @return 缓存命中时返回缓存结果,否则返回 empty
*/
public Optional<String> get(String question) {
try {
// 1. 计算问题的向量表示
float[] queryEmbedding = embeddingModel.embed(question);
// 2. 在向量索引中搜索最相似的缓存问题
List<SimilarEntry> similar = vectorIndex.search(queryEmbedding, 1);
if (!similar.isEmpty()) {
SimilarEntry top = similar.get(0);
if (top.similarity() >= SIMILARITY_THRESHOLD) {
// 缓存命中!
CacheEntry entry = redisTemplate.opsForValue()
.get("semantic_cache:" + top.cacheKey());
if (entry != null) {
hits++;
// 更新访问时间
redisTemplate.expire("semantic_cache:" + top.cacheKey(), TTL);
return Optional.of(entry.response());
}
}
}
misses++;
return Optional.empty();
} catch (Exception e) {
misses++;
return Optional.empty();
}
}
/**
* 写入缓存
*/
public void put(String question, String response) {
try {
float[] embedding = embeddingModel.embed(question);
String cacheKey = generateCacheKey(question);
// 存储到Redis
CacheEntry entry = new CacheEntry(question, response,
System.currentTimeMillis());
redisTemplate.opsForValue().set(
"semantic_cache:" + cacheKey, entry, TTL);
// 在向量索引中注册
vectorIndex.add(cacheKey, embedding);
} catch (Exception e) {
// 缓存写入失败不影响主流程
}
}
/**
* 带缓存的AI调用(包装器模式)
*/
public String callWithCache(String question, java.util.function.Supplier<String> aiCall) {
// 先查缓存
Optional<String> cached = get(question);
if (cached.isPresent()) {
return cached.get();
}
// 缓存未命中,调用AI
String response = aiCall.get();
// 异步写入缓存
CompletableFuture.runAsync(() -> put(question, response));
return response;
}
public double getHitRate() {
long total = hits + misses;
return total > 0 ? (double) hits / total : 0;
}
private String generateCacheKey(String question) {
return Integer.toHexString(question.hashCode()) +
Long.toHexString(System.currentTimeMillis());
}
public record CacheEntry(String question, String response, long createdAt) {}
public record SimilarEntry(String cacheKey, double similarity) {}
}

Spring AI集成示例
@Service
public class CachedAiService {
private final ChatClient chatClient;
private final SemanticCacheService cache;
private final PromptCompressor compressor;
public CachedAiService(ChatClient chatClient, SemanticCacheService cache,
PromptCompressor compressor) {
this.chatClient = chatClient;
this.cache = cache;
this.compressor = compressor;
}
@TrackAiCost(feature = "knowledge_qa")
public String answerQuestion(String userId, String question) {
// 1. 先查语义缓存
Optional<String> cached = cache.get(question);
if (cached.isPresent()) {
return cached.get(); // 缓存命中,0成本!
}
// 2. Prompt压缩
String compressedQuestion = compressor.removeRedundantWhitespace(question);
// 3. 调用AI
String response = chatClient.prompt()
.system("你是专业的知识助手。请简洁、准确地回答问题。")
.user(compressedQuestion)
.call()
.content();
// 4. 写入缓存
cache.put(question, response);
return response;
}
}

语义缓存实测数据(运行2周后):
总查询量: 287,543次
缓存命中: 168,120次(58.5%命中率)
API实际调用: 119,423次
节省API调用: 168,120次
成本节省估算:
每次API调用平均成本:$0.003
节省:168,120 × $0.003 ≈ $504.4/月(约¥3,630)

降本策略4:批量请求(利用Batch API 50%折扣)
OpenAI的Batch API对非实时任务提供50%折扣,但需要异步处理。
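下面的 BatchRequestService 依赖一个 BatchApiClient 封装,原文没有给出实现。按OpenAI Batch API的大致流程(把请求写成JSONL文件上传 → 创建batch任务 → 轮询状态 → 下载结果),它的接口大致如下;方法签名和两个record都是按后文代码的用法反推的假设:

```java
package com.laozhang.cost.batch;

import java.util.List;

/**
 * OpenAI Batch API的薄封装(示意接口)
 * 真实实现的流程:请求写成JSONL上传 → 创建batch任务 → 轮询状态 → 下载结果文件
 */
public interface BatchApiClient {

    /** 提交一批请求,返回batch任务ID */
    String submit(List<BatchRequestService.BatchRequest> batch);

    /** 查询batch任务状态 */
    BatchJobStatus getStatus(String batchJobId);

    /** 任务完成后获取全部结果 */
    List<BatchResult> getResults(String batchJobId);
}

/** batch任务状态(status取值按OpenAI返回约定,此处为假设) */
record BatchJobStatus(String status) {
    public boolean isCompleted() { return "completed".equals(status); }
    public boolean isFailed() { return "failed".equals(status) || "expired".equals(status); }
}

/** 单条请求的结果 */
record BatchResult(String requestId, String response) {}
```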
package com.laozhang.cost.batch;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.concurrent.*;
/**
* AI批量请求服务
* 将非实时任务积累后批量提交,享受50%折扣
* 适合:内容审核、批量翻译、数据标注、离线报告生成
*/
@Service
public class BatchRequestService {
private static final int BATCH_SIZE = 1000; // 每批最大请求数
private static final int FLUSH_INTERVAL_MS = 300_000; // 5分钟强制提交
private final BlockingQueue<BatchRequest> pendingQueue =
new LinkedBlockingQueue<>(10_000);
private final Map<String, CompletableFuture<String>> pendingFutures =
new ConcurrentHashMap<>();
private final BatchApiClient batchApiClient;
public BatchRequestService(BatchApiClient batchApiClient) {
this.batchApiClient = batchApiClient;
}
/**
* 提交批量请求(异步,最长24小时内返回)
* 适合:内容审核、离线分析、批量翻译
*/
public CompletableFuture<String> submitAsync(String requestId, String prompt) {
CompletableFuture<String> future = new CompletableFuture<>();
pendingFutures.put(requestId, future);
boolean offered = pendingQueue.offer(new BatchRequest(requestId, prompt));
if (!offered) {
// 队列满了,直接走实时API
pendingFutures.remove(requestId);
future.completeExceptionally(
new RuntimeException("批量队列已满,请使用实时API"));
}
return future;
}
/**
* 定时批量提交
*/
@Scheduled(fixedDelay = FLUSH_INTERVAL_MS)
public void flushBatch() {
List<BatchRequest> batch = new ArrayList<>();
pendingQueue.drainTo(batch, BATCH_SIZE);
if (batch.isEmpty()) return;
System.out.printf("提交批量请求:%d条%n", batch.size());
try {
// 提交到OpenAI Batch API
String batchJobId = batchApiClient.submit(batch);
// 异步轮询结果
pollBatchResults(batchJobId, batch);
} catch (Exception e) {
// 批量提交失败,降级到实时API
batch.forEach(req -> {
CompletableFuture<String> future = pendingFutures.remove(req.requestId());
if (future != null) {
future.completeExceptionally(e);
}
});
}
}
private void pollBatchResults(String batchJobId, List<BatchRequest> batch) {
// 轮询批量任务状态(通常几分钟到几小时)
CompletableFuture.runAsync(() -> {
while (true) {
try {
Thread.sleep(60_000); // 每分钟检查一次
BatchJobStatus status = batchApiClient.getStatus(batchJobId);
if (status.isCompleted()) {
List<BatchResult> results = batchApiClient.getResults(batchJobId);
results.forEach(result -> {
CompletableFuture<String> future =
pendingFutures.remove(result.requestId());
if (future != null) {
future.complete(result.response());
}
});
break;
} else if (status.isFailed()) {
batch.forEach(req -> {
CompletableFuture<String> future = pendingFutures.remove(req.requestId());
if (future != null) {
future.completeExceptionally(new RuntimeException("批量任务失败"));
}
});
break;
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
});
}
public record BatchRequest(String requestId, String prompt) {}
}

降本策略5:本地模型补充(零成本处理部分请求)
用Ollama在服务器上部署本地模型,处理不需要GPT-4能力的请求。
package com.laozhang.cost.local;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.ollama.api.OllamaApi;
import org.springframework.ai.ollama.api.OllamaOptions;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
* 本地Ollama模型配置
* 部署在内网服务器,零API成本
*/
@Configuration
public class LocalModelConfig {
/**
* Qwen2.5-7B 本地模型
* 适合:简单问答、分类、格式转换、中文任务
*/
@Bean("localChatClient")
public ChatClient localChatClient() {
OllamaApi ollamaApi = new OllamaApi("http://local-ai-server:11434");
OllamaOptions options = OllamaOptions.create()
.withModel("qwen2.5:7b")
.withTemperature(0.7)
.withNumCtx(4096); // 上下文窗口
return ChatClient.create(new OllamaChatModel(ollamaApi, options));
}
/**
* 本地Embedding模型
* 替代text-embedding-ada-002,向量化成本降为零
*/
@Bean("localEmbeddingModel")
public org.springframework.ai.embedding.EmbeddingModel localEmbeddingModel() {
OllamaApi ollamaApi = new OllamaApi("http://local-ai-server:11434");
return new org.springframework.ai.ollama.OllamaEmbeddingModel(
ollamaApi,
OllamaOptions.create().withModel("nomic-embed-text")
);
}
}

Ollama部署脚本:
# 服务器端部署(16GB内存的普通服务器即可)
curl -fsSL https://ollama.ai/install.sh | sh
# 拉取模型
ollama pull qwen2.5:7b # 约4.7GB,适合简单中文任务
ollama pull nomic-embed-text # 约274MB,用于向量化
# 启动服务(监听所有网络接口)
OLLAMA_HOST=0.0.0.0 ollama serve
# 验证
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5:7b","prompt":"你好","stream":false}'本地模型可处理的任务(实测效果):
| 任务类型 | 本地Qwen效果 | GPT-4o-mini效果 | 推荐选择 |
|---|---|---|---|
| 中文意图分类 | 91.3% | 93.8% | 本地(差距小,成本差100倍) |
| 简单信息提取 | 88.7% | 92.1% | 本地 |
| 格式转换(JSON) | 96.2% | 97.5% | 本地 |
| 多轮对话 | 82.4% | 91.0% | mini(差距明显) |
| 复杂推理 | 74.1% | 89.3% | GPT-4o(差距很大) |
| 代码生成 | 78.6% | 91.7% | GPT-4o |
降本策略6:Embedding优化(批量嵌入+缓存)
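本策略的核心服务(见下)依赖一个 EmbeddingCacheRepository 做持久化缓存,原文未给出实现。一个基于Redis的最小示意如下,key取文本的SHA-256,value存向量的JSON;类名和存储方式都是假设,可按需换成数据库或本地文件:

```java
package com.laozhang.cost.embedding;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Repository;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Optional;

/**
 * Embedding持久化缓存(示意实现)
 * 同一段文本只向量化一次,后续直接复用
 */
@Repository
public class EmbeddingCacheRepository {

    private final StringRedisTemplate redis;
    private final ObjectMapper mapper = new ObjectMapper();

    public EmbeddingCacheRepository(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public Optional<float[]> get(String text) {
        try {
            String json = redis.opsForValue().get(key(text));
            return json == null ? Optional.empty()
                    : Optional.of(mapper.readValue(json, float[].class));
        } catch (Exception e) {
            return Optional.empty(); // 缓存异常不影响主流程
        }
    }

    public void put(String text, float[] embedding) {
        try {
            redis.opsForValue().set(key(text), mapper.writeValueAsString(embedding));
        } catch (Exception e) {
            // 写缓存失败可忽略
        }
    }

    private String key(String text) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
        return "emb_cache:" + HexFormat.of().formatHex(hash);
    }
}
```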
package com.laozhang.cost.embedding;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.concurrent.*;
/**
* Embedding优化服务
* 批量嵌入 + 持久化缓存,显著降低向量化成本
*/
@Service
public class OptimizedEmbeddingService {
private static final int BATCH_SIZE = 100; // 每批提交的文本条数,可按API限制调整
private final EmbeddingModel cloudEmbeddingModel; // OpenAI Embedding
private final EmbeddingModel localEmbeddingModel; // 本地Nomic Embedding
private final EmbeddingCacheRepository cacheRepo;
public OptimizedEmbeddingService(
EmbeddingModel cloudEmbeddingModel,
EmbeddingModel localEmbeddingModel,
EmbeddingCacheRepository cacheRepo) {
this.cloudEmbeddingModel = cloudEmbeddingModel;
this.localEmbeddingModel = localEmbeddingModel;
this.cacheRepo = cacheRepo;
}
/**
* 单个文本嵌入(带缓存)
*/
public float[] embed(String text, boolean useLocal) {
// 检查缓存
Optional<float[]> cached = cacheRepo.get(text);
if (cached.isPresent()) {
return cached.get();
}
// 调用嵌入模型
EmbeddingModel model = useLocal ? localEmbeddingModel : cloudEmbeddingModel;
float[] embedding = model.embed(text);
// 持久化缓存
cacheRepo.put(text, embedding);
return embedding;
}
/**
* 批量嵌入(单次API调用处理多个文本,节省延迟和费用)
*/
public Map<String, float[]> embedBatch(List<String> texts, boolean useLocal) {
Map<String, float[]> results = new LinkedHashMap<>();
// 先从缓存获取
List<String> cacheMisses = new ArrayList<>();
for (String text : texts) {
Optional<float[]> cached = cacheRepo.get(text);
if (cached.isPresent()) {
results.put(text, cached.get());
} else {
cacheMisses.add(text);
}
}
if (cacheMisses.isEmpty()) return results;
// 批量调用API(分批,每批BATCH_SIZE条)
EmbeddingModel model = useLocal ? localEmbeddingModel : cloudEmbeddingModel;
for (int i = 0; i < cacheMisses.size(); i += BATCH_SIZE) {
List<String> batch = cacheMisses.subList(
i, Math.min(i + BATCH_SIZE, cacheMisses.size()));
List<float[]> embeddings = model.embed(batch);
for (int j = 0; j < batch.size(); j++) {
String text = batch.get(j);
float[] embedding = embeddings.get(j);
results.put(text, embedding);
cacheRepo.put(text, embedding); // 持久化
}
}
return results;
}
/**
* 文档入库时的嵌入优化(知识库建设场景)
*/
public void indexDocuments(List<String> documents) {
System.out.println("开始批量索引 " + documents.size() + " 个文档...");
// 使用本地模型(零成本),效果接近云端
Map<String, float[]> embeddings = embedBatch(documents, true);
// 存入向量数据库
// vectorDb.upsertBatch(embeddings);
System.out.println("索引完成,节省API调用:" + documents.size() + " 次");
}
}

ROI分析:张磊团队的完整成本优化效果
各策略贡献分析:
| 优化策略 | 实施成本 | 月均节省 | ROI | 实施难度 |
|---|---|---|---|---|
| Prompt压缩 | 1天工程 | ¥8,000 | 高 | 低 |
| 模型路由 | 1周工程 | ¥28,000 | 高 | 中 |
| 语义缓存 | 3天工程 | ¥18,000 | 高 | 中 |
| 批量API | 2天工程 | ¥5,000 | 高 | 低 |
| 本地模型 | 1天配置 | ¥12,000 | 极高 | 低 |
| Embedding优化 | 2天工程 | ¥9,000 | 高 | 低 |
FAQ
Q1:模型路由会不会导致用户体验下降?
会有轻微影响,但可以用A/B测试量化。张磊团队测试发现:使用mini模型处理简单问题,用户满意度仅下降0.3分(4.7→4.4,5分制),属于可接受范围。
Q2:语义缓存的相似度阈值如何调整?
0.92是一个平衡点。提高到0.95减少误命中但命中率降低,降低到0.88提高命中率但可能返回不相关的缓存。建议:用一周的真实问题对跑离线评估,找到适合你业务的阈值。
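离线评估的思路可以很简单:拿一批人工标注过"是否同义"的问题对和它们的向量相似度,在不同阈值下统计命中精度和召回,选一个精度可接受、召回尽量高的点。下面是一个示意(类名、字段均为假设):

```java
import java.util.List;

/**
 * 语义缓存相似度阈值的离线评估(示意)
 * similarity:问题对的向量余弦相似度
 * sameMeaning:人工标注,true表示两个问题的答案可以互相复用
 */
public class ThresholdSweep {

    public record LabeledPair(double similarity, boolean sameMeaning) {}

    public static void sweep(List<LabeledPair> pairs) {
        long totalSame = pairs.stream().filter(LabeledPair::sameMeaning).count();
        for (double t = 0.85; t <= 0.98; t += 0.01) {
            double threshold = t;
            long hits = pairs.stream().filter(p -> p.similarity() >= threshold).count();
            long correct = pairs.stream()
                    .filter(p -> p.similarity() >= threshold && p.sameMeaning()).count();
            double precision = hits == 0 ? 1.0 : (double) correct / hits;      // 误命中越少越高
            double recall = totalSame == 0 ? 0 : (double) correct / totalSame; // 命中率上限
            System.out.printf("阈值=%.2f  命中精度=%.1f%%  召回=%.1f%%%n",
                    threshold, precision * 100, recall * 100);
        }
    }
}
```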
Q3:本地模型会不会有数据安全风险?
反而更安全!数据不离开内网,特别适合金融、医疗、法律等对数据合规有要求的场景。
Q4:这些策略哪个应该最先做?
优先级建议:
- Prompt压缩(1天,立竿见影)
- 本地模型部署(半天,长期受益)
- 语义缓存(3天,通常效果最大)
- 模型路由(1周,需要精细调优)
- 批量API(适合有大量离线任务时)
Q5:优化后效果会随时间衰减吗?
会的。随着业务扩张,成本会增长,需要持续监控。建议每月生成成本分析报告,设置成本告警(比如日成本超过阈值自动通知)。
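成本告警可以直接复用前文的 CostRepository:每天定时汇总前一天的花费,超过预算就通知。一个最小示意如下(日预算数值、通知方式是假设,findByDateRange 沿用前文签名,import路径按实际工程调整):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.time.LocalDate;
import java.time.LocalDateTime;

/**
 * 每日成本告警(示意)
 * CostRepository / CostRecord 即前文成本追踪代码里的同名类型
 */
@Component
public class DailyCostAlert {

    private static final double DAILY_BUDGET_USD = 50.0; // 日预算,按自己的账单水平调整

    private final CostRepository costRepository;

    public DailyCostAlert(CostRepository costRepository) {
        this.costRepository = costRepository;
    }

    @Scheduled(cron = "0 0 9 * * *") // 每天上午9点检查前一天的花费
    public void checkYesterdayCost() {
        LocalDate yesterday = LocalDate.now().minusDays(1);
        LocalDateTime start = yesterday.atStartOfDay();
        LocalDateTime end = yesterday.plusDays(1).atStartOfDay();

        double total = costRepository.findByDateRange(start, end).stream()
                .mapToDouble(CostRecord::cost)
                .sum();

        if (total > DAILY_BUDGET_USD) {
            // 实际项目里换成飞书/钉钉Webhook、邮件等通知渠道
            System.err.printf("AI成本告警:%s 花费 $%.2f,超过日预算 $%.2f%n",
                    yesterday, total, DAILY_BUDGET_USD);
        }
    }
}
```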
总结
张磊团队的6个策略,没有一个是技术上的"黑科技",都是工程上的"基本功":
- Prompt压缩:不写废话,模型也不喜欢啰嗦的指令
- 模型路由:不同的活用不同的工具,99%的问题不需要核武器
- 语义缓存:用户的问题总是在重复,缓存是最便宜的"AI"
- 批量请求:时间换金钱,非实时任务不要急着实时
- 本地模型:一次投入,永久受益,让内网服务器真正发挥价值
- Embedding优化:向量化是基础设施,应该做好缓存和批量
把这6个策略全部落地,月账单降到原来的20%是完全可以实现的。
