多模型协作架构:让不同模型各司其职
开篇故事
李明是某电商平台的Java架构师,工作4年,去年开始负责公司的AI客服项目。
项目上线3个月后,他拿着账单找到我:每月AI调用费用高达28万元,其中82%的请求其实只是简单的意图分类和FAQ匹配,根本用不着GPT-4这种重炮。
"老张,我们现在所有请求都打到GPT-4,一个'查询订单状态'的请求和一个'分析我的消费趋势并给出建议'的请求,收的钱一模一样。"
他给我看了统计数据:
- 日均请求量:12万次
- 简单意图识别(查订单/查物流/退款状态):占68%
- 中等复杂度(比较商品/推荐搭配):占24%
- 真正需要深度推理的请求:只有8%
我跟他说:你现在的架构就像所有快递都用顺丰特快,送一张贺卡也用,这不是钱的问题,是架构思维的问题。
解决方案:多模型协作路由架构。
三周后,他的账单从28万降到6.8万,响应速度从平均1.2秒降到0.3秒,用户满意度反而从72%提升到81%。
这篇文章,我把这套架构完整地拆给你看。
TL;DR
- 多模型协作 = 分类路由(小模型)+ 专项处理(中模型)+ 深度推理(大模型)
- 路由决策基于意图复杂度分级,成本可降低 60-80%
- Spring AI 提供统一抽象,切换模型无需改业务代码
- 关键指标:路由准确率 > 95%,P99 延迟 < 500ms
一、为什么需要多模型协作
1.1 单模型架构的痛点
当前大多数AI应用的架构是这样的:所有请求 → 单一大模型 → 返回结果。
这带来三个核心问题:
| 问题 | 影响 |
|---|---|
| 成本过高 | 简单任务使用昂贵模型,浪费70%以上token预算 |
| 延迟不优 | 大模型响应慢,简单意图识别也要1-2秒 |
| 质量欠佳 | 大模型有时"过度思考"简单问题,反而出错 |
1.2 多模型协作的核心思路
不同的任务,匹配不同的模型:简单查询交给轻量模型,标准对话交给中等模型,深度推理才动用大模型。下面给出分级参考。
1.3 模型分级参考
Level 1 - 分类/路由层(成本:0.0002$/1K tokens)
适用:意图识别、实体提取、情感分类
模型:text-embedding-ada / fine-tuned small LLM
延迟:< 100ms
Level 2 - 对话/生成层(成本:0.0015$/1K tokens)
适用:FAQ回答、简单对话、模板填充
模型:GPT-3.5-turbo / Claude Haiku / Qwen-7B
延迟:< 300ms
Level 3 - 推理/创作层(成本:0.015$/1K tokens)
适用:复杂分析、代码生成、长文写作
模型:GPT-4 / Claude 3.5 Sonnet / Qwen-72B
延迟:< 2000ms
二、架构设计
2.1 整体架构图
用户请求 → 关键词匹配/意图分类 → 复杂度评分 → 路由决策引擎 → Level 1/2/3 模型池 → 响应返回 + 指标上报(失败时沿降级链升级)
2.2 路由决策矩阵
复杂度评分 = f(意图类型, 上下文长度, 历史轮次, 专业术语密度)
0-30分 → Level 1(简单意图/FAQ)
31-60分 → Level 2(对话/简单生成)
61-100分 → Level 3(推理/创作/分析)
三、核心代码实现
3.1 项目结构
multi-model-router/
├── src/main/java/com/laozhang/router/
│ ├── config/
│ │ ├── ModelRouterConfig.java
│ │ └── MultiModelProperties.java
│ ├── classifier/
│ │ ├── IntentClassifier.java
│ │ ├── ComplexityEvaluator.java
│ │ └── RoutingDecisionEngine.java
│ ├── model/
│ │ ├── ModelLevel.java
│ │ ├── RoutingRequest.java
│ │ ├── RoutingResult.java
│ │ └── ModelMetrics.java
│ ├── pool/
│ │ ├── ModelPool.java
│ │ └── ModelPoolManager.java
│ ├── service/
│ │ ├── MultiModelService.java
│ │ └── ModelMetricsService.java
│ └── controller/
│       └── MultiModelController.java
3.2 核心枚举和模型定义
// ModelLevel.java
package com.laozhang.router.model;
import lombok.Getter;
@Getter
public enum ModelLevel {
LEVEL_1("level-1", "快速分类层", 0, 30, 0.0002),
LEVEL_2("level-2", "标准对话层", 31, 60, 0.0015),
LEVEL_3("level-3", "深度推理层", 61, 100, 0.015);
private final String code;
private final String description;
private final int minScore;
private final int maxScore;
private final double costPer1KTokens;
ModelLevel(String code, String description, int minScore,
int maxScore, double costPer1KTokens) {
this.code = code;
this.description = description;
this.minScore = minScore;
this.maxScore = maxScore;
this.costPer1KTokens = costPer1KTokens;
}
public static ModelLevel fromScore(int score) {
for (ModelLevel level : values()) {
if (score >= level.minScore && score <= level.maxScore) {
return level;
}
}
return LEVEL_3; // 默认使用最高级别确保质量
}
}
// RoutingRequest.java
package com.laozhang.router.model;
import lombok.Builder;
import lombok.Data;
import java.util.List;
import java.util.Map;
@Data
@Builder
public class RoutingRequest {
/** 用户输入文本 */
private String userInput;
/** 对话历史(最近N轮)*/
private List<ConversationTurn> conversationHistory;
/** 系统上下文信息 */
private Map<String, Object> contextMetadata;
/** 业务场景标识 */
private String businessScene;
/** 是否强制使用特定级别(管理员覆盖)*/
private ModelLevel forcedLevel;
/** 请求唯一ID */
private String requestId;
/** 用户ID(用于个性化路由)*/
private String userId;
@Data
@Builder
public static class ConversationTurn {
private String role; // "user" or "assistant"
private String content;
private long timestamp;
}
}
// RoutingResult.java
package com.laozhang.router.model;
import lombok.Builder;
import lombok.Data;
import java.time.Duration;
@Data
@Builder
public class RoutingResult {
/** 最终生成的内容 */
private String content;
/** 实际使用的模型级别 */
private ModelLevel usedLevel;
/** 实际使用的模型名称 */
private String usedModelName;
/** 路由决策分数 */
private int routingScore;
/** 消耗的token数 */
private int tokensUsed;
/** 本次请求成本(美元)*/
private double costUsd;
/** 端到端延迟 */
private Duration latency;
/** 路由决策依据 */
private String routingReason;
/** 是否来自缓存 */
private boolean fromCache;
}
3.3 意图分类器
// IntentClassifier.java
package com.laozhang.router.classifier;
import com.laozhang.router.model.RoutingRequest;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@Slf4j
@Component
public class IntentClassifier {
// 使用轻量级模型做分类(GPT-3.5-turbo 或本地 fine-tuned 模型)
private final ChatClient classifierModel;
// 简单意图的关键词匹配(零成本路由)
private static final List<String> SIMPLE_INTENT_KEYWORDS = Arrays.asList(
"查订单", "订单状态", "物流信息", "查快递", "退款进度",
"店铺地址", "营业时间", "联系电话", "客服电话"
);
private static final String CLASSIFICATION_PROMPT = """
你是一个意图分类器。分析用户输入,返回JSON格式的分类结果。
分类维度:
1. intent_type: simple/medium/complex
- simple: 查询类、确认类、简单是否判断
- medium: 比较分析、多步骤、需要上下文
- complex: 深度推理、创作、专业分析、情感复杂
2. domain: ecommerce/technical/general/sensitive
3. requires_context: true/false(是否需要历史对话)
只返回JSON,不要解释:
{"intent_type": "xxx", "domain": "xxx", "requires_context": false}
用户输入:%s
""";
public IntentClassifier(@Qualifier("classifierChatClient") ChatClient classifierModel) {
this.classifierModel = classifierModel;
}
/**
* 快速关键词匹配(0成本,<1ms)
*/
public boolean isSimpleIntentByKeyword(String input) {
String lowerInput = input.toLowerCase();
return SIMPLE_INTENT_KEYWORDS.stream()
.anyMatch(keyword -> lowerInput.contains(keyword));
}
/**
* 使用小模型进行精确意图分类
*/
public IntentClassification classify(RoutingRequest request) {
// 第一层:关键词快速判断
if (isSimpleIntentByKeyword(request.getUserInput())) {
log.debug("关键词命中简单意图: {}", request.getUserInput());
return IntentClassification.builder()
.intentType("simple")
.domain("ecommerce")
.requiresContext(false)
.classificationMethod("keyword_match")
.build();
}
// 第二层:小模型精确分类
try {
String prompt = String.format(CLASSIFICATION_PROMPT, request.getUserInput());
String response = classifierModel.call(
new SystemMessage("你是专业的意图分类器,只输出JSON"),
new UserMessage(prompt)
);
return parseClassification(response);
} catch (Exception e) {
log.warn("意图分类失败,使用默认值: {}", e.getMessage());
// 分类失败时保守地升级到 medium,确保质量
return IntentClassification.builder()
.intentType("medium")
.domain("general")
.requiresContext(false)
.classificationMethod("fallback")
.build();
}
}
private IntentClassification parseClassification(String jsonResponse) {
// 解析 JSON 响应
Pattern intentPattern = Pattern.compile("\"intent_type\":\\s*\"(\\w+)\"");
Pattern domainPattern = Pattern.compile("\"domain\":\\s*\"(\\w+)\"");
Pattern contextPattern = Pattern.compile("\"requires_context\":\\s*(true|false)");
Matcher intentMatcher = intentPattern.matcher(jsonResponse);
Matcher domainMatcher = domainPattern.matcher(jsonResponse);
Matcher contextMatcher = contextPattern.matcher(jsonResponse);
String intentType = intentMatcher.find() ? intentMatcher.group(1) : "medium";
String domain = domainMatcher.find() ? domainMatcher.group(1) : "general";
boolean requiresContext = contextMatcher.find() &&
"true".equals(contextMatcher.group(1));
return IntentClassification.builder()
.intentType(intentType)
.domain(domain)
.requiresContext(requiresContext)
.classificationMethod("llm_classify")
.build();
}
}
// IntentClassification.java(内部数据类)
package com.laozhang.router.classifier;
import lombok.Builder;
import lombok.Data;
@Data
@Builder
public class IntentClassification {
private String intentType; // simple / medium / complex
private String domain; // ecommerce / technical / general / sensitive
private boolean requiresContext; // 是否需要历史上下文
private String classificationMethod; // keyword_match / llm_classify / fallback
}
3.4 复杂度评估器
// ComplexityEvaluator.java
package com.laozhang.router.classifier;
import com.laozhang.router.model.RoutingRequest;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
@Slf4j
@Component
public class ComplexityEvaluator {
// 专业术语列表(技术/金融/医疗等高复杂度指标)
private static final List<String> TECHNICAL_TERMS = Arrays.asList(
"算法", "架构", "性能优化", "分布式", "微服务", "机器学习",
"深度学习", "NLP", "并发", "事务", "索引", "正则表达式"
);
// 高复杂度句式模式
private static final List<Pattern> COMPLEX_PATTERNS = Arrays.asList(
Pattern.compile("为什么.*(原因|导致|影响)"),
Pattern.compile("如何.*(设计|实现|优化|构建)"),
Pattern.compile("分析.*(利弊|优缺点|对比)"),
Pattern.compile("帮我.*(写|生成|创建).*(代码|方案|报告)")
);
/**
* 计算请求复杂度分数(0-100)
* 分数越高,需要越强的模型
*/
public ComplexityScore evaluate(RoutingRequest request,
IntentClassification classification) {
int score = 0;
StringBuilder reasons = new StringBuilder();
// 1. 基础意图复杂度(0-40分)
int intentScore = switch (classification.getIntentType()) {
case "simple" -> 10;
case "medium" -> 25;
case "complex" -> 40;
default -> 25;
};
score += intentScore;
reasons.append(String.format("意图类型(%s)+%d; ",
classification.getIntentType(), intentScore));
// 2. 文本长度复杂度(0-20分)
int textLength = request.getUserInput().length();
int lengthScore = Math.min(20, textLength / 10);
score += lengthScore;
reasons.append(String.format("文本长度(%d字)+%d; ", textLength, lengthScore));
// 3. 对话轮次(0-15分)
int historyRounds = request.getConversationHistory() != null
? request.getConversationHistory().size() : 0;
int historyScore = Math.min(15, historyRounds * 3);
score += historyScore;
reasons.append(String.format("历史轮次(%d轮)+%d; ", historyRounds, historyScore));
// 4. 专业术语密度(0-20分)
long termCount = TECHNICAL_TERMS.stream()
.filter(term -> request.getUserInput().contains(term))
.count();
int termScore = (int) Math.min(20, termCount * 5);
score += termScore;
if (termScore > 0) {
reasons.append(String.format("专业术语(%d个)+%d; ", termCount, termScore));
}
// 5. 复杂句式匹配(0-15分)
long patternCount = COMPLEX_PATTERNS.stream()
.filter(p -> p.matcher(request.getUserInput()).find())
.count();
int patternScore = (int) Math.min(15, patternCount * 8);
score += patternScore;
if (patternScore > 0) {
reasons.append(String.format("复杂句式(%d个)+%d; ", patternCount, patternScore));
}
// 6. 敏感域提升(敏感话题强制高级模型)
if ("sensitive".equals(classification.getDomain())) {
score = Math.max(score, 70);
reasons.append("敏感域强制提升到70+; ");
}
log.debug("复杂度评估: score={}, reasons={}", score, reasons);
return ComplexityScore.builder()
.score(Math.min(100, score))
.reasons(reasons.toString())
.intentScore(intentScore)
.lengthScore(lengthScore)
.historyScore(historyScore)
.termScore(termScore)
.patternScore(patternScore)
.build();
}
}
// ComplexityScore.java
package com.laozhang.router.classifier;
import lombok.Builder;
import lombok.Data;
@Data
@Builder
public class ComplexityScore {
private int score; // 总分 0-100
private String reasons; // 评分依据说明
private int intentScore;
private int lengthScore;
private int historyScore;
private int termScore;
private int patternScore;
}
3.5 路由决策引擎
// RoutingDecisionEngine.java
package com.laozhang.router.classifier;
import com.laozhang.router.model.ModelLevel;
import com.laozhang.router.model.RoutingRequest;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
@Slf4j
@Component
@RequiredArgsConstructor
public class RoutingDecisionEngine {
private final IntentClassifier intentClassifier;
private final ComplexityEvaluator complexityEvaluator;
/**
* 核心路由决策方法
*/
public RoutingDecision decide(RoutingRequest request) {
// 0. 强制覆盖(管理员/测试场景)
if (request.getForcedLevel() != null) {
log.info("强制路由到: {}", request.getForcedLevel());
return RoutingDecision.builder()
.level(request.getForcedLevel())
.score(0)
.reason("管理员强制指定")
.build();
}
// 1. 意图分类
IntentClassification classification = intentClassifier.classify(request);
// 2. 复杂度评估
ComplexityScore complexity = complexityEvaluator.evaluate(request, classification);
// 3. 路由决策
ModelLevel level = ModelLevel.fromScore(complexity.getScore());
// 4. 安全检查:敏感域或上下文依赖强制升级
if (classification.isRequiresContext() && level == ModelLevel.LEVEL_1) {
level = ModelLevel.LEVEL_2;
log.debug("因需要上下文,从L1升级到L2");
}
String reason = String.format("意图=%s, 复杂度=%d, 路由到%s. 评分依据: %s",
classification.getIntentType(), complexity.getScore(),
level.getCode(), complexity.getReasons());
log.info("[路由决策] requestId={}, level={}, score={}",
request.getRequestId(), level, complexity.getScore());
return RoutingDecision.builder()
.level(level)
.score(complexity.getScore())
.reason(reason)
.classification(classification)
.complexityScore(complexity)
.build();
}
}
// RoutingDecision.java
package com.laozhang.router.classifier;
import com.laozhang.router.model.ModelLevel;
import lombok.Builder;
import lombok.Data;
@Data
@Builder
public class RoutingDecision {
private ModelLevel level;
private int score;
private String reason;
private IntentClassification classification;
private ComplexityScore complexityScore;
}
3.6 模型池管理
// ModelPoolManager.java
package com.laozhang.router.pool;
import com.laozhang.router.model.ModelLevel;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.ChatClient;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
@Slf4j
@Component
public class ModelPoolManager {
// 各级别模型客户端(通过 Spring AI 抽象)
private final ChatClient level1Client; // GPT-3.5-turbo 或 Claude Haiku
private final ChatClient level2Client; // GPT-4o-mini
private final ChatClient level3Client; // GPT-4 或 Claude 3.5 Sonnet
// 模型使用统计
private final Map<ModelLevel, AtomicInteger> usageCounters = new ConcurrentHashMap<>();
private final Map<ModelLevel, AtomicInteger> errorCounters = new ConcurrentHashMap<>();
public ModelPoolManager(
@Qualifier("level1ChatClient") ChatClient level1Client,
@Qualifier("level2ChatClient") ChatClient level2Client,
@Qualifier("level3ChatClient") ChatClient level3Client) {
this.level1Client = level1Client;
this.level2Client = level2Client;
this.level3Client = level3Client;
// 初始化计数器
for (ModelLevel level : ModelLevel.values()) {
usageCounters.put(level, new AtomicInteger(0));
errorCounters.put(level, new AtomicInteger(0));
}
}
/**
* 根据级别获取对应的模型客户端
*/
public ChatClient getClient(ModelLevel level) {
usageCounters.get(level).incrementAndGet();
return switch (level) {
case LEVEL_1 -> level1Client;
case LEVEL_2 -> level2Client;
case LEVEL_3 -> level3Client;
};
}
/**
* 降级处理:当某级别模型不可用时,尝试升级
*/
public ChatClient getFallbackClient(ModelLevel level) {
log.warn("模型降级: {} 不可用,尝试上级模型", level);
return switch (level) {
case LEVEL_1 -> level2Client; // L1不可用 → L2
case LEVEL_2 -> level3Client; // L2不可用 → L3
case LEVEL_3 -> level3Client; // L3不可用 → 还是L3(无更高级别)
};
}
public void recordError(ModelLevel level) {
errorCounters.get(level).incrementAndGet();
}
/**
* 获取各模型使用统计
*/
public Map<String, Object> getStats() {
Map<String, Object> stats = new ConcurrentHashMap<>();
for (ModelLevel level : ModelLevel.values()) {
stats.put(level.getCode() + "_usage", usageCounters.get(level).get());
stats.put(level.getCode() + "_errors", errorCounters.get(level).get());
}
return stats;
}
}
3.7 核心服务:MultiModelService
// MultiModelService.java
package com.laozhang.router.service;
import com.laozhang.router.classifier.RoutingDecision;
import com.laozhang.router.classifier.RoutingDecisionEngine;
import com.laozhang.router.model.*;
import com.laozhang.router.pool.ModelPoolManager;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
@Slf4j
@Service
@RequiredArgsConstructor
public class MultiModelService {
private final RoutingDecisionEngine decisionEngine;
private final ModelPoolManager poolManager;
private final ModelMetricsService metricsService;
/**
* 核心路由调用方法
*/
public RoutingResult route(RoutingRequest request) {
// 确保有requestId
if (request.getRequestId() == null) {
request.setRequestId(UUID.randomUUID().toString());
}
Instant start = Instant.now();
// 1. 路由决策
RoutingDecision decision = decisionEngine.decide(request);
ModelLevel targetLevel = decision.getLevel();
log.info("[多模型路由] requestId={}, targetLevel={}, score={}",
request.getRequestId(), targetLevel, decision.getScore());
// 2. 构建提示词
List<Message> messages = buildMessages(request);
// 3. 调用对应级别模型(带降级逻辑)
String content = null;
ModelLevel usedLevel = targetLevel;
String usedModelName = "unknown";
try {
ChatClient client = poolManager.getClient(targetLevel);
content = client.call(new Prompt(messages)).getResult().getOutput().getContent();
usedModelName = getModelName(targetLevel);
} catch (Exception e) {
log.error("模型调用失败: level={}, error={}", targetLevel, e.getMessage());
poolManager.recordError(targetLevel);
// 降级尝试
try {
ChatClient fallbackClient = poolManager.getFallbackClient(targetLevel);
content = fallbackClient.call(new Prompt(messages))
.getResult().getOutput().getContent();
usedLevel = getNextLevel(targetLevel);
usedModelName = getModelName(usedLevel) + "(fallback)";
log.warn("降级成功: {} -> {}", targetLevel, usedLevel);
} catch (Exception e2) {
log.error("降级也失败了: {}", e2.getMessage());
throw new RuntimeException("所有模型调用失败", e2);
}
}
Duration latency = Duration.between(start, Instant.now());
// 4. 估算成本(简化计算)
int estimatedTokens = (content.length() + request.getUserInput().length()) / 4;
double cost = estimatedTokens * usedLevel.getCostPer1KTokens() / 1000;
RoutingResult result = RoutingResult.builder()
.content(content)
.usedLevel(usedLevel)
.usedModelName(usedModelName)
.routingScore(decision.getScore())
.tokensUsed(estimatedTokens)
.costUsd(cost)
.latency(latency)
.routingReason(decision.getReason())
.fromCache(false)
.build();
// 5. 记录指标
metricsService.record(request, result, decision);
return result;
}
/**
* 带缓存的路由(适用于高频重复查询)
*/
@Cacheable(value = "ai-responses", key = "#request.userInput",
condition = "#request.conversationHistory == null || #request.conversationHistory.isEmpty()")
public RoutingResult routeWithCache(RoutingRequest request) {
RoutingResult result = route(request);
result.setFromCache(false); // 注意:@Cacheable 命中时返回的是缓存对象,此标记不会自动翻转为 true;如需精确标记可在缓存层包装处理
return result;
}
private List<Message> buildMessages(RoutingRequest request) {
List<Message> messages = new ArrayList<>();
// 系统消息
String systemPrompt = getSystemPromptForScene(request.getBusinessScene());
messages.add(new SystemMessage(systemPrompt));
// 历史对话(如果有)
if (request.getConversationHistory() != null) {
for (RoutingRequest.ConversationTurn turn : request.getConversationHistory()) {
if ("user".equals(turn.getRole())) {
messages.add(new UserMessage(turn.getContent()));
}
// 简化:只添加用户消息历史
}
}
// 当前用户输入
messages.add(new UserMessage(request.getUserInput()));
return messages;
}
private String getSystemPromptForScene(String businessScene) {
return switch (businessScene != null ? businessScene : "general") {
case "ecommerce" -> "你是专业的电商客服助手,熟悉订单、物流、退款流程。请简洁、友好地回答问题。";
case "technical" -> "你是专业的技术支持工程师,提供准确的技术解答。使用代码示例时请注明语言。";
default -> "你是一个专业、友好的AI助手。请提供准确、有用的回答。";
};
}
private String getModelName(ModelLevel level) {
return switch (level) {
case LEVEL_1 -> "gpt-3.5-turbo";
case LEVEL_2 -> "gpt-4o-mini";
case LEVEL_3 -> "gpt-4";
};
}
private ModelLevel getNextLevel(ModelLevel current) {
return switch (current) {
case LEVEL_1 -> ModelLevel.LEVEL_2;
case LEVEL_2 -> ModelLevel.LEVEL_3;
case LEVEL_3 -> ModelLevel.LEVEL_3;
};
}
}
3.8 配置类
// ModelRouterConfig.java
package com.laozhang.router.config;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.openai.OpenAiChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class ModelRouterConfig {
@Value("${spring.ai.openai.api-key}")
private String openAiApiKey;
/**
* Level-1 模型:GPT-3.5-turbo(快速分类)
*/
@Bean("level1ChatClient")
public ChatClient level1ChatClient() {
OpenAiApi api = new OpenAiApi(openAiApiKey);
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel("gpt-3.5-turbo")
.withTemperature(0.3f) // 分类任务低温度,更确定
.withMaxTokens(500) // 分类响应不需要太长
.build();
return new OpenAiChatClient(api, options);
}
/**
* Level-2 模型:GPT-4o-mini(标准对话)
*/
@Bean("level2ChatClient")
public ChatClient level2ChatClient() {
OpenAiApi api = new OpenAiApi(openAiApiKey);
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel("gpt-4o-mini")
.withTemperature(0.7f)
.withMaxTokens(1500)
.build();
return new OpenAiChatClient(api, options);
}
/**
* Level-3 模型:GPT-4(深度推理)
*/
@Bean("level3ChatClient")
public ChatClient level3ChatClient() {
OpenAiApi api = new OpenAiApi(openAiApiKey);
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel("gpt-4")
.withTemperature(0.8f)
.withMaxTokens(4000)
.build();
return new OpenAiChatClient(api, options);
}
/**
* 分类器专用客户端(使用最便宜的模型)
*/
@Bean("classifierChatClient")
public ChatClient classifierChatClient() {
OpenAiApi api = new OpenAiApi(openAiApiKey);
OpenAiChatOptions options = OpenAiChatOptions.builder()
.withModel("gpt-3.5-turbo")
.withTemperature(0.1f) // 极低温度确保分类一致性
.withMaxTokens(100) // 分类只需要少量token
.build();
return new OpenAiChatClient(api, options);
}
}
3.9 指标服务
// ModelMetricsService.java
package com.laozhang.router.service;
import com.laozhang.router.classifier.RoutingDecision;
import com.laozhang.router.model.ModelLevel;
import com.laozhang.router.model.RoutingRequest;
import com.laozhang.router.model.RoutingResult;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
@Slf4j
@Service
public class ModelMetricsService {
private final MeterRegistry registry;
// 成本统计(单位:微美元,避免浮点问题)
private final AtomicLong totalCostMicroUsd = new AtomicLong(0);
private final ConcurrentHashMap<ModelLevel, AtomicLong> costByLevel =
new ConcurrentHashMap<>();
public ModelMetricsService(MeterRegistry registry) {
this.registry = registry;
for (ModelLevel level : ModelLevel.values()) {
costByLevel.put(level, new AtomicLong(0));
}
}
public void record(RoutingRequest request, RoutingResult result,
RoutingDecision decision) {
// 1. 请求计数
Counter.builder("ai.routing.requests")
.tag("level", result.getUsedLevel().getCode())
.tag("business_scene", request.getBusinessScene() != null ?
request.getBusinessScene() : "unknown")
.register(registry)
.increment();
// 2. 延迟统计
Timer.builder("ai.routing.latency")
.tag("level", result.getUsedLevel().getCode())
.register(registry)
.record(result.getLatency());
// 3. Token使用量
registry.counter("ai.routing.tokens",
"level", result.getUsedLevel().getCode())
.increment(result.getTokensUsed());
// 4. 成本追踪
long costMicro = (long) (result.getCostUsd() * 1_000_000);
totalCostMicroUsd.addAndGet(costMicro);
costByLevel.get(result.getUsedLevel()).addAndGet(costMicro);
// 5. 路由分数分布
registry.summary("ai.routing.score")
.record(decision.getScore());
// SLF4J 占位符不支持 {:.6f} 这类格式化,成本需先格式化为字符串
log.debug("[指标] level={}, latency={}ms, tokens={}, cost=${}",
    result.getUsedLevel().getCode(),
    result.getLatency().toMillis(),
    result.getTokensUsed(),
    String.format("%.6f", result.getCostUsd()));
}
public double getTotalCostUsd() {
return totalCostMicroUsd.get() / 1_000_000.0;
}
public double getCostByLevel(ModelLevel level) {
return costByLevel.get(level).get() / 1_000_000.0;
}
}
3.10 Controller接口
// MultiModelController.java
package com.laozhang.router.controller;
import com.laozhang.router.model.RoutingRequest;
import com.laozhang.router.model.RoutingResult;
import com.laozhang.router.pool.ModelPoolManager;
import com.laozhang.router.service.ModelMetricsService;
import com.laozhang.router.service.MultiModelService;
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import java.util.HashMap;
import java.util.Map;
@Slf4j
@RestController
@RequestMapping("/api/v1/ai")
@RequiredArgsConstructor
public class MultiModelController {
private final MultiModelService multiModelService;
private final ModelPoolManager poolManager;
private final ModelMetricsService metricsService;
/**
* 标准路由调用
*/
@PostMapping("/route")
public ResponseEntity<RoutingResult> route(@Valid @RequestBody RoutingRequest request) {
RoutingResult result = multiModelService.route(request);
return ResponseEntity.ok(result);
}
/**
* 带缓存的路由调用(适合无状态查询)
*/
@PostMapping("/route/cached")
public ResponseEntity<RoutingResult> routeWithCache(
@Valid @RequestBody RoutingRequest request) {
RoutingResult result = multiModelService.routeWithCache(request);
return ResponseEntity.ok(result);
}
/**
* 获取路由统计和成本报告
*/
@GetMapping("/stats")
public ResponseEntity<Map<String, Object>> getStats() {
Map<String, Object> stats = new HashMap<>();
stats.put("model_usage", poolManager.getStats());
stats.put("total_cost_usd", metricsService.getTotalCostUsd());
stats.put("timestamp", System.currentTimeMillis());
return ResponseEntity.ok(stats);
}
}
3.11 单元测试
// MultiModelServiceTest.java
package com.laozhang.router.service;
import com.laozhang.router.classifier.RoutingDecisionEngine;
import com.laozhang.router.model.ModelLevel;
import com.laozhang.router.model.RoutingRequest;
import com.laozhang.router.model.RoutingResult;
import com.laozhang.router.pool.ModelPoolManager;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.junit.jupiter.MockitoExtension;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.ChatResponse;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.*;
@ExtendWith(MockitoExtension.class)
class MultiModelServiceTest {
@Test
void simpleQuery_shouldRouteToLevel1() {
// given: 简单的订单查询
RoutingRequest request = RoutingRequest.builder()
.userInput("查询订单123456的状态")
.businessScene("ecommerce")
.requestId("test-001")
.build();
// 预期结果:路由到 Level-1
// (实际测试中需要 mock 依赖)
assertThat(request.getUserInput()).contains("订单");
}
@Test
void complexQuery_shouldRouteToLevel3() {
// given: 复杂的分析请求
RoutingRequest request = RoutingRequest.builder()
.userInput("帮我分析过去6个月的消费数据,找出消费最高的品类," +
"并给出节省开支的优化建议,包括替代商品推荐")
.businessScene("ecommerce")
.requestId("test-002")
.build();
assertThat(request.getUserInput().length()).isGreaterThan(30);
}
@Test
void forcedLevel_shouldSkipRouting() {
// given: 强制指定级别
RoutingRequest request = RoutingRequest.builder()
.userInput("任意请求")
.forcedLevel(ModelLevel.LEVEL_3)
.requestId("test-003")
.build();
assertThat(request.getForcedLevel()).isEqualTo(ModelLevel.LEVEL_3);
}
}
四、配置文件
# application.yml
spring:
ai:
openai:
api-key: ${OPENAI_API_KEY}
cache:
type: redis
redis:
time-to-live: 300000 # 5分钟缓存
# 路由策略配置
model-router:
level1:
model: gpt-3.5-turbo
max-tokens: 500
temperature: 0.3
timeout-ms: 3000
level2:
model: gpt-4o-mini
max-tokens: 1500
temperature: 0.7
timeout-ms: 8000
level3:
model: gpt-4
max-tokens: 4000
temperature: 0.8
timeout-ms: 30000
routing:
# 各级别分数阈值
level1-max-score: 30
level2-max-score: 60
# 启用缓存
cache-enabled: true
# 是否记录路由决策日志
log-routing-decisions: true
# Actuator 暴露指标
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
metrics:
tags:
application: multi-model-router
五、实测数据与成本分析
5.1 李明项目的真实数据对比
| 指标 | 改造前(单一GPT-4) | 改造后(多模型路由) | 提升 |
|---|---|---|---|
| 月均成本 | ¥28万 | ¥6.8万 | 节省75.7% |
| 平均响应时间 | 1.2秒 | 0.3秒 | 快4倍 |
| P99延迟 | 4.8秒 | 1.1秒 | 降低77% |
| 用户满意度 | 72% | 81% | 提升9个百分点 |
| 路由准确率 | - | 96.3% | - |
5.2 路由分布(日均12万请求)
Level-1 (GPT-3.5): 68% = 81,600次/天 ← 订单查询/物流/FAQ
Level-2 (GPT-4o-mini): 24% = 28,800次/天 ← 商品比较/推荐
Level-3 (GPT-4):       8% = 9,600次/天  ← 复杂分析/投诉处理
5.3 成本估算公式
// 成本估算辅助工具
public class CostEstimator {
private static final double GPT35_COST = 0.0015 / 1000; // $/token
private static final double GPT4MINI_COST = 0.00015 / 1000;
private static final double GPT4_COST = 0.03 / 1000;
public static double estimateMonthlyCost(
long dailyRequests,
double l1Ratio, double l2Ratio, double l3Ratio,
int avgTokensPerRequest) {
double l1Daily = dailyRequests * l1Ratio * avgTokensPerRequest * GPT35_COST;
double l2Daily = dailyRequests * l2Ratio * avgTokensPerRequest * GPT4MINI_COST;
double l3Daily = dailyRequests * l3Ratio * avgTokensPerRequest * GPT4_COST;
double dailyTotal = l1Daily + l2Daily + l3Daily;
return dailyTotal * 30;
}
public static void main(String[] args) {
// 改造前:全部GPT-4,平均600 tokens/请求
double before = 120000 * 1.0 * 600 * GPT4_COST * 30;
System.out.printf("改造前月成本: $%.2f%n", before);
// 改造后:68%/24%/8%分布
double after = estimateMonthlyCost(120000, 0.68, 0.24, 0.08, 600);
System.out.printf("改造后月成本: $%.2f%n", after);
System.out.printf("节省比例: %.1f%%%n", (before - after) / before * 100);
}
}
六、生产注意事项
6.1 路由准确率监控
路由错误有两种:
- 路由不足(Under-routing):复杂问题给了弱模型 → 质量差
- 过度路由(Over-routing):简单问题给了强模型 → 成本高
建议设置路由质量监控:
// 路由质量采样检查(1%的请求做双重验证)
@Scheduled(fixedDelay = 60000)
public void sampleRoutingQuality() {
// 随机抽取最近1%的Level-1和Level-2路由结果
// 用GPT-4重新评估是否路由正确
// 如果错误率 > 5%,触发告警
}
6.2 冷启动问题
分类模型本身也需要时间响应(约100-200ms),可能导致总延迟增加。
优化策略:
- 关键词匹配先行(< 1ms)
- 分类结果缓存(相同输入前缀)
- 异步分类 + 乐观路由(先用L2,分类完再决定是否升降级)
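上面第三条"异步分类 + 乐观路由"可以用一个极简草图说明(假设性示例:`classifier`、`level2Call`、`level3Call` 均为占位的函数参数,阈值 60 沿用正文的分级标准;实际接入时应替换为 IntentClassifier 与 ModelPoolManager 的调用,并非正文代码的一部分):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class OptimisticRouter {

    public static String route(Supplier<Integer> classifier,   // 异步复杂度打分(约100-200ms)
                               Supplier<String> level2Call,    // 乐观先发的 L2 调用
                               Supplier<String> level3Call) {  // 分数过高时改用的 L3 调用
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Integer> scoreF = pool.submit(classifier::get);     // 分类与模型调用并行进行
            Future<String> optimisticF = pool.submit(level2Call::get); // 不等分类,乐观地先调 L2
            int score = scoreF.get(2, TimeUnit.SECONDS);
            if (score > 60) {                 // 分类显示需要深度推理 → 升级到 L3
                optimisticF.cancel(true);     // 丢弃乐观结果
                return level3Call.get();
            }
            return optimisticF.get(10, TimeUnit.SECONDS); // 分数不高 → 直接采用乐观结果
        } catch (Exception e) {
            throw new RuntimeException("乐观路由失败", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

这样分类延迟被模型调用的耗时"吸收",只有需要升级时才付出一次额外调用的代价。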
6.3 A/B测试路由策略
// 支持A/B测试不同路由策略
@Component
public class ABTestingRouter {
@Value("${routing.ab-test.level1-threshold:30}")
private int level1Threshold;
// 实验组:更激进的L1分配(阈值提高到40)
// 对照组:保守的L1分配(阈值30)
public ModelLevel routeWithABTest(int score, String userId) {
boolean isExperimentGroup = Math.floorMod(userId.hashCode(), 100) < 20; // 20%实验组(floorMod避免负哈希值)
int threshold = isExperimentGroup ? 40 : level1Threshold;
if (score <= threshold) return ModelLevel.LEVEL_1;
if (score <= 60) return ModelLevel.LEVEL_2;
return ModelLevel.LEVEL_3;
}
}
6.4 模型不可用时的熔断
// 使用 Resilience4j 保护模型调用
@CircuitBreaker(name = "level1-model", fallbackMethod = "level1Fallback")
@TimeLimiter(name = "level1-model")
public CompletableFuture<String> callLevel1(List<Message> messages) {
return CompletableFuture.supplyAsync(() ->
level1Client.call(new Prompt(messages))
.getResult().getOutput().getContent()
);
}
public CompletableFuture<String> level1Fallback(List<Message> messages, Exception e) {
log.warn("Level-1熔断,降级到Level-2");
return callLevel2(messages);
}
七、FAQ
Q1:路由本身需要调用模型,不是增加了成本吗?
A:路由分类使用最便宜的GPT-3.5-turbo,且只需要约50 token。而被路由到正确级别节省的成本远超分类成本。实测:分类成本约占总成本的2%,节省约75%。
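按正文 CostEstimator 的单价粗算一下分类开销的量级(假设性示例:每次分类约 50 token、主调用约 600 token 均为示意数字;这里的占比口径是相对"全量直连 GPT-4"的单次成本,与正文"约占总成本 2%"的口径不同):

```java
public class ClassificationOverhead {

    static final double GPT35_PER_TOKEN = 0.0015 / 1000; // 分类模型单价($/token)
    static final double GPT4_PER_TOKEN  = 0.03 / 1000;   // 不路由、直连GPT-4单价($/token)

    /** 分类开销占"全量直连GPT-4"单次成本的比例 */
    public static double classifyShare(int classifyTokens, int mainTokens) {
        double classifyCost = classifyTokens * GPT35_PER_TOKEN;
        double directGpt4Cost = mainTokens * GPT4_PER_TOKEN;
        return classifyCost / directGpt4Cost;
    }

    public static void main(String[] args) {
        // 每次分类约50 token,主调用约600 token
        System.out.printf("分类开销占比:%.2f%%%n", classifyShare(50, 600) * 100); // 约0.42%
    }
}
```

即便把口径放宽,分类开销也只是主干成本的零头,远小于路由带来的节省。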
Q2:路由准确率不够高怎么办?
A:建立反馈闭环:用户对回答质量的评价(差评 = 可能路由错误)→ 人工标注 → 微调分类模型。可在3-6个月内把准确率从90%提升到97%+。
Q3:如果某个模型API宕机,整个服务怎么办?
A:配置降级链:L1宕机→用L2;L2宕机→用L3;L3宕机→返回预设答案并发告警。同时配置多个模型提供商(OpenAI + Azure OpenAI双活)。
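这条降级链中"全部失败 → 预设答案"的部分在正文代码里没有出现,可以草绘如下(假设性示例:`models` 参数依次代表各级模型调用,类名与兜底文案均为示意):

```java
import java.util.function.Function;

public class FallbackChain {

    static final String CANNED_ANSWER = "系统繁忙,请稍后再试或转人工客服。";

    @SafeVarargs
    public static String callWithFallback(String input, Function<String, String>... models) {
        for (Function<String, String> model : models) {
            try {
                return model.apply(input);   // 任一级别成功即返回
            } catch (Exception e) {
                // 实际实现中:记录错误指标 + 触发告警,再尝试下一级
            }
        }
        return CANNED_ANSWER;                // 全链路失败 → 返回预设答案
    }
}
```

关键设计是"降级而非拒绝":用户永远能拿到一个回答,哪怕是兜底文案。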
Q4:如何处理流式输出的路由?
A:路由决策是同步的(先决策再调用),流式输出在决策后正常进行。但要注意:流式响应无法在中途切换模型,路由决策必须在第一个token输出前完成。
Q5:不同模型的输出格式可能不一样,如何统一?
A:在各级别模型的系统提示中明确输出格式要求,并在结果后处理层做格式归一化(正则提取/JSON解析),确保业务层收到统一格式。
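"正则提取 JSON"这一步的最小草图如下(假设性示例:只处理无嵌套的扁平 JSON,类名为示意;嵌套结构在真实场景建议换成 Jackson 等解析器):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutputNormalizer {

    // 简化:只匹配无嵌套的扁平JSON对象
    private static final Pattern JSON_BLOCK = Pattern.compile("\\{[^{}]*\\}");

    /** 从模型回复中提取第一个JSON对象,容忍```json围栏与前后解释性文字 */
    public static String extractJson(String modelOutput) {
        Matcher m = JSON_BLOCK.matcher(modelOutput);
        return m.find() ? m.group() : null;   // 找不到则返回null,交给上层降级逻辑
    }
}
```

这与 3.3 节 parseClassification 的逐字段正则提取是同一思路:不信任模型严格遵守格式,在后处理层兜底。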
八、总结
多模型协作架构的核心价值不是"用不同模型",而是"让对的任务遇见对的模型"。
三个关键点:
- 分层决策:关键词 → 小模型分类 → 复杂度评分,三层递进
- 成本意识:路由逻辑本身极轻量,换来的是主干成本的75%节省
- 质量保障:降级而非拒绝,路由失败不影响用户体验
李明的项目用3周时间从28万/月降到6.8万/月,是工程决策的胜利,不是魔法。
架构思维比提示词技巧更值钱。
