Part 2194: GPT-4V and Claude Vision API Engineering in Practice - A Complete Image-Text Understanding System
Audience: Java engineers who need to integrate multimodal APIs in production | Reading time: about 16 minutes | Core value: complete engineering implementations of both the GPT-4V and Claude Vision APIs, with a difference comparison and a switchover scheme
Late last year our project shipped an image understanding feature. The client's requirement was clear: upload an image, have the AI describe what is in it, and have it answer questions about the image.
The technology selection took a long argument. The boss wanted GPT-4V because he had seen a demo and the results looked good. But one of our customers is a bank whose data cannot leave the country, so we also needed a domestic or self-hostable option. The final decision: support both APIs and switch between them through configuration.
That decision later doubled our adaptation effort, because the API design philosophies of GPT-4V and Claude Vision differ far more than I had expected.
This article lays out the complete engineering implementation of both APIs, and how to build a unified abstraction over them.
1. Design Differences Between the Two APIs
Before writing any code, let's get the differences straight, so the design decisions that follow have a basis.
Message structure differences
GPT-4V messages use a "content array" model, where images and text are peer content elements:
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,...", "detail": "high"}},
{"type": "text", "text": "这张图里有什么?"}
]
}

Claude Vision's message structure is more semantic; the image carries an explicit source field:
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "..."
}
},
{"type": "text", "text": "这张图里有什么?"}
]
}

Image URL support differences
GPT-4V: a public URL can be passed in directly; the model fetches it itself.
Claude Vision (via the API): the official docs say URLs are supported, but in real projects we ran into assorted 302-redirect problems, so I recommend always sending Base64; a download-and-encode sketch follows.
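The redirect handling is easy to own yourself. Below is a minimal download-and-encode sketch using only the JDK's HttpClient; the ImageInput builder and the 10 MB cap are assumptions of ours, not provider limits:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch: fetch a remote image ourselves (following redirects) and hand the
// providers Base64, which both accept reliably
private ImageInput toBase64Input(String url) throws IOException, InterruptedException {
    HttpClient http = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL) // absorb the 302s ourselves
            .build();
    HttpResponse<byte[]> resp = http.send(
            HttpRequest.newBuilder(URI.create(url)).GET().build(),
            HttpResponse.BodyHandlers.ofByteArray());
    if (resp.statusCode() != 200) {
        throw new IOException("Image download failed, HTTP " + resp.statusCode());
    }
    byte[] bytes = resp.body();
    if (bytes.length > 10 * 1024 * 1024) { // assumed in-house cap, not an API limit
        throw new IOException("Image too large: " + bytes.length + " bytes");
    }
    String mimeType = resp.headers().firstValue("Content-Type").orElse("image/jpeg");
    return ImageInput.builder() // assumes ImageInput has a Lombok-style builder
            .mimeType(mimeType)
            .base64Data(Base64.getEncoder().encodeToString(bytes))
            .build();
}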
System prompt handling differences
GPT-4V: the system message is separate from user messages, and visual content may only appear in user/assistant messages.
Claude Vision: supports a system prompt, but the system prompt cannot contain images, the same restriction as GPT-4V.
Multi-image support
GPT-4V: a single request can carry multiple images; just put several image_url elements in the same user message.
Claude Vision: also supported, but the total image payload per request is limited (roughly 5 MB of Base64 data for claude-3); a pre-flight size check is sketched below.
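Base64 inflates raw bytes by roughly 4/3, so it is worth estimating the payload before sending. A minimal pre-flight check under the 5 MB figure above; the getRawBytes() accessor is hypothetical:

// Sketch: fail fast when a request's Base64 payload would exceed the limit.
// Treat MAX_BASE64_BYTES as configuration, not an authoritative constant.
private static final long MAX_BASE64_BYTES = 5L * 1024 * 1024;

private void assertPayloadFits(VisionRequest request) {
    long total = 0;
    for (ImageInput image : request.getImages()) {
        if (image.getBase64Data() != null) {
            total += image.getBase64Data().length();            // already encoded
        } else if (image.getRawBytes() != null) {               // hypothetical accessor
            total += (image.getRawBytes().length * 4L + 2) / 3; // Base64 expansion of ~4/3
        }
    }
    if (total > MAX_BASE64_BYTES) {
        throw new VisionServiceException(
                "Image payload of " + total + " bytes exceeds the " + MAX_BASE64_BYTES + " byte limit", null);
    }
}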
2. The Complete GPT-4V Implementation
Use the official OpenAI Java SDK; it is more robust than hand-rolling an HTTP client:
<dependency>
<groupId>com.openai</groupId>
<artifactId>openai-java</artifactId>
<version>0.9.0</version>
</dependency>

@Service
@ConditionalOnProperty(name = "vision.provider", havingValue = "openai")
public class OpenAIVisionService implements VisionService {
private static final Logger log = LoggerFactory.getLogger(OpenAIVisionService.class);
private final OpenAIClient openAIClient;
@Value("${spring.ai.openai.api-key}")
private String apiKey;
@Value("${vision.openai.model:gpt-4o}")
private String model;
@Value("${vision.openai.detail:auto}")
private String detail; // low, high, auto
public OpenAIVisionService() {
this.openAIClient = OpenAIOkHttpClient.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.build();
}
@Override
public VisionResponse analyzeImage(VisionRequest request) {
long startTime = System.currentTimeMillis();
try {
// Build the content list
List<ChatCompletionContentPart> contentParts = new ArrayList<>();
// Add the images
for (ImageInput image : request.getImages()) {
String imageUrl = buildImageUrl(image);
contentParts.add(ChatCompletionContentPart.ofImageUrl(
ChatCompletionContentPartImage.builder()
.imageUrl(ImageURL.builder()
.url(imageUrl)
.detail(ImageURL.Detail.of(detail))
.build())
.build()
));
}
// Add the text prompt
contentParts.add(ChatCompletionContentPart.ofText(
ChatCompletionContentPartText.builder()
.text(request.getPrompt())
.build()
));
// Build the request; add the system message first (if any) so it precedes
// the user message in the payload, instead of rebuilding params afterwards
ChatCompletionCreateParams.Builder paramsBuilder = ChatCompletionCreateParams.builder()
        .model(ChatModel.of(model))
        .maxTokens(request.getMaxTokens() != null ? request.getMaxTokens() : 2000)
        .temperature(request.getTemperature() != null ? request.getTemperature() : 0.3);
if (request.getSystemPrompt() != null) {
    paramsBuilder.addMessage(ChatCompletionSystemMessageParam.builder()
            .content(request.getSystemPrompt())
            .build());
}
paramsBuilder.addMessage(ChatCompletionUserMessageParam.builder()
        .content(ChatCompletionUserMessageParam.Content.ofArrayOfContentParts(contentParts))
        .build());
ChatCompletionCreateParams params = paramsBuilder.build();
ChatCompletion completion = openAIClient.chat().completions().create(params);
String content = completion.choices().get(0).message().content()
.orElse("");
long elapsed = System.currentTimeMillis() - startTime;
log.info("GPT-4V响应完成, 耗时: {}ms, tokens: input={}, output={}",
elapsed,
completion.usage().map(u -> u.promptTokens()).orElse(0L),
completion.usage().map(u -> u.completionTokens()).orElse(0L));
return VisionResponse.builder()
.content(content)
.provider("openai")
.model(model)
.promptTokens(completion.usage().map(u -> (int)u.promptTokens()).orElse(0))
.completionTokens(completion.usage().map(u -> (int)u.completionTokens()).orElse(0))
.latencyMs(elapsed)
.build();
} catch (Exception e) {
log.error("GPT-4V调用失败", e);
throw new VisionServiceException("OpenAI Vision API调用失败: " + e.getMessage(), e);
}
}
private String buildImageUrl(ImageInput image) {
if (image.getUrl() != null) {
return image.getUrl(); // use the URL as-is
}
// Base64 mode
return String.format("data:%s;base64,%s",
image.getMimeType(), image.getBase64Data());
}
@Override
public Flux<String> analyzeImageStream(VisionRequest request) {
// Streaming variant
List<ChatCompletionContentPart> contentParts = buildContentParts(request);
ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
        .model(ChatModel.of(model))
        .addMessage(ChatCompletionUserMessageParam.builder()
                .content(ChatCompletionUserMessageParam.Content
                        .ofArrayOfContentParts(contentParts))
                .build())
        .build();
// createStreaming() sets the stream flag itself; the sync client returns a
// blocking StreamResponse, so run it on a boundedElastic thread
return Flux.<String>create(sink -> {
    try (StreamResponse<ChatCompletionChunk> stream =
            openAIClient.chat().completions().createStreaming(params)) {
        stream.stream().forEach(chunk -> chunk.choices().stream()
                .findFirst()
                .flatMap(choice -> choice.delta().content())
                .ifPresent(sink::next));
        sink.complete();
    } catch (Exception e) {
        sink.error(e);
    }
}).subscribeOn(Schedulers.boundedElastic());
}
private List<ChatCompletionContentPart> buildContentParts(VisionRequest request) {
List<ChatCompletionContentPart> parts = new ArrayList<>();
for (ImageInput image : request.getImages()) {
parts.add(ChatCompletionContentPart.ofImageUrl(
ChatCompletionContentPartImage.builder()
.imageUrl(ImageURL.builder()
.url(buildImageUrl(image))
.detail(ImageURL.Detail.of(detail))
.build())
.build()
));
}
parts.add(ChatCompletionContentPart.ofText(
ChatCompletionContentPartText.builder()
.text(request.getPrompt())
.build()
));
return parts;
}
}

3. The Complete Claude Vision Implementation
<dependency>
<groupId>com.anthropic</groupId>
<artifactId>anthropic-java</artifactId>
<version>0.8.0</version>
</dependency>

@Service
@ConditionalOnProperty(name = "vision.provider", havingValue = "claude")
public class ClaudeVisionService implements VisionService {
private static final Logger log = LoggerFactory.getLogger(ClaudeVisionService.class);
private final AnthropicClient anthropicClient;
@Value("${vision.claude.model:claude-3-5-sonnet-20241022}")
private String model;
public ClaudeVisionService() {
this.anthropicClient = AnthropicOkHttpClient.builder()
.apiKey(System.getenv("ANTHROPIC_API_KEY"))
.build();
}
@Override
public VisionResponse analyzeImage(VisionRequest request) {
long startTime = System.currentTimeMillis();
try {
List<ContentBlockParam> contentBlocks = new ArrayList<>();
// Add the images (Claude recommends images come before the text, a detail worth knowing)
for (ImageInput image : request.getImages()) {
ImageBlockParam imageBlock = buildImageBlock(image);
contentBlocks.add(ContentBlockParam.ofImage(imageBlock));
}
// Add the text
contentBlocks.add(ContentBlockParam.ofText(
TextBlockParam.builder()
.text(request.getPrompt())
.build()
));
// Build the message
MessageCreateParams.Builder paramsBuilder = MessageCreateParams.builder()
.model(Model.of(model))
.maxTokens(request.getMaxTokens() != null ? request.getMaxTokens() : 2000)
.addMessage(MessageParam.builder()
.role(MessageParam.Role.USER)
.content(MessageParam.Content.ofBlockParams(contentBlocks))
.build());
// Claude's system prompt is a top-level field, not a message
if (request.getSystemPrompt() != null) {
paramsBuilder.system(request.getSystemPrompt());
}
Message response = anthropicClient.messages().create(paramsBuilder.build());
// Extract the text content; ContentBlock is a union type in this SDK,
// so unwrap the text variant instead of using instanceof
String content = response.content().stream()
        .map(ContentBlock::text)
        .flatMap(Optional::stream)
        .map(TextBlock::text)
        .collect(Collectors.joining("\n"));
long elapsed = System.currentTimeMillis() - startTime;
log.info("Claude Vision响应完成, 耗时: {}ms, tokens: input={}, output={}",
elapsed, response.usage().inputTokens(), response.usage().outputTokens());
return VisionResponse.builder()
.content(content)
.provider("claude")
.model(model)
.promptTokens((int) response.usage().inputTokens())
.completionTokens((int) response.usage().outputTokens())
.latencyMs(elapsed)
.build();
} catch (Exception e) {
log.error("Claude Vision调用失败", e);
throw new VisionServiceException("Claude Vision API调用失败: " + e.getMessage(), e);
}
}
private ImageBlockParam buildImageBlock(ImageInput image) {
if (image.getBase64Data() != null) {
// Base64 mode
return ImageBlockParam.builder()
.source(Base64ImageSource.builder()
.mediaType(Base64ImageSource.MediaType.of(image.getMimeType()))
.data(image.getBase64Data())
.build())
.build();
} else if (image.getUrl() != null) {
// URL mode (the URL must be publicly reachable)
return ImageBlockParam.builder()
        .source(UrlImageSource.builder()
                .url(image.getUrl())
                .build())
        .build();
}
throw new IllegalArgumentException("ImageInput must contain base64Data or url");
}
@Override
public Flux<String> analyzeImageStream(VisionRequest request) {
List<ContentBlockParam> contentBlocks = buildContentBlocks(request);
// The Java SDK streams through createStreaming() on the regular
// MessageCreateParams; there is no separate stream-params type
MessageCreateParams params = MessageCreateParams.builder()
        .model(Model.of(model))
        .maxTokens(request.getMaxTokens() != null ? request.getMaxTokens() : 2000)
        .addMessage(MessageParam.builder()
                .role(MessageParam.Role.USER)
                .content(MessageParam.Content.ofBlockParams(contentBlocks))
                .build())
        .build();
return Flux.<String>create(sink -> {
    try (StreamResponse<RawMessageStreamEvent> stream =
            anthropicClient.messages().createStreaming(params)) {
        stream.stream().forEach(event -> event.contentBlockDelta()
                .flatMap(delta -> delta.delta().text())
                .map(TextDelta::text)
                .ifPresent(sink::next));
        sink.complete();
    } catch (Exception e) {
        sink.error(e);
    }
}).subscribeOn(Schedulers.boundedElastic());
}
private List<ContentBlockParam> buildContentBlocks(VisionRequest request) {
List<ContentBlockParam> blocks = new ArrayList<>();
for (ImageInput image : request.getImages()) {
blocks.add(ContentBlockParam.ofImage(buildImageBlock(image)));
}
blocks.add(ContentBlockParam.ofText(
TextBlockParam.builder().text(request.getPrompt()).build()));
return blocks;
}
}

4. The Unified Abstraction Layer and Provider Routing
With both implementations in place, we need a unified interface so upstream callers can switch providers without noticing:
// The unified interface
public interface VisionService {
VisionResponse analyzeImage(VisionRequest request);
Flux<String> analyzeImageStream(VisionRequest request);
}
// The unified request object
@Builder
@Data
public class VisionRequest {
private List<ImageInput> images;
private String prompt;
private String systemPrompt;
private Integer maxTokens;
private Double temperature;
private Map<String, Object> metadata; // business metadata passed through untouched
}
// The unified response object
@Builder
@Data
public class VisionResponse {
private String content;
private String provider;
private String model;
private int promptTokens;
private int completionTokens;
private long latencyMs;
private Map<String, Object> rawMetadata;
}
// Router: switches by configuration, and can also drive runtime A/B tests.
// Caveat: with @ConditionalOnProperty as written above, only one provider bean
// exists per configuration; to A/B at runtime, enable both beans (for example,
// guard each implementation with its own boolean flag instead)
@Service
@Primary
public class RoutingVisionService implements VisionService {
private final Map<String, VisionService> services;
@Value("${vision.default-provider:openai}")
private String defaultProvider;
public RoutingVisionService(List<VisionService> serviceList) {
    this.services = serviceList.stream()
            .collect(Collectors.toMap(this::extractProviderName, s -> s));
}

// Derive the provider key from the bean's @ConditionalOnProperty havingValue
// ("openai", "claude"); fall back to the class name
private String extractProviderName(VisionService s) {
    ConditionalOnProperty anno =
            AnnotationUtils.findAnnotation(s.getClass(), ConditionalOnProperty.class);
    return anno != null ? anno.havingValue() : s.getClass().getSimpleName();
}
@Override
public VisionResponse analyzeImage(VisionRequest request) {
String provider = determineProvider(request);
VisionService service = services.get(provider);
if (service == null) {
throw new IllegalStateException("未找到Vision服务提供商: " + provider);
}
return service.analyzeImage(request);
}
private String determineProvider(VisionRequest request) {
// Prefer the provider named in the request metadata
if (request.getMetadata() != null && request.getMetadata().containsKey("provider")) {
return (String) request.getMetadata().get("provider");
}
return defaultProvider;
}
}

5. Key Differences Between the Two APIs, Summarized
After a few months of running both in production, here are the differences you absolutely need to watch:
Difference 1: image and text ordering
Claude's official guidance is to put images before the text (both orders work, but that is the documented recommendation). GPT-4V has no such convention.
Difference 2: token accounting
GPT-4V bills image tokens by patch count (covered in detail in the previous article). Claude counts image tokens differently: the official docs give an approximate formula, roughly (width * height) / 750, though the actual figure depends on the model's internal processing. A quick estimator is sketched below.
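For capacity planning, the formula is easy to turn into code. A back-of-the-envelope estimator; treat the output as an estimate only:

// Sketch: estimate Claude image tokens from the documented approximation
// tokens ≈ (width * height) / 750; actual billing can differ
private int estimateClaudeImageTokens(int widthPx, int heightPx) {
    return (int) Math.ceil(widthPx * (long) heightPx / 750.0);
}

// Example: a 1092x1092 image is about 1092 * 1092 / 750 ≈ 1590 tokens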
Difference 3: error code systems
GPT-4V's image-related errors mostly surface as invalid_request_error with fairly vague messages; Claude's errors are more specific, for example invalid_request_error: Image exceeds maximum file size. A normalization sketch follows.
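Because the granularity differs, it pays to fold both providers' failures into one in-house taxonomy before surfacing them to callers. A minimal sketch; matching on message strings is crude but pragmatic, and every category name here is our own convention:

// Sketch: normalize both providers' errors into one internal taxonomy so
// callers never branch on provider. Category names are our own convention.
public enum VisionErrorCategory { PAYLOAD_TOO_LARGE, RATE_LIMITED, INVALID_REQUEST, UNKNOWN }

private VisionErrorCategory categorize(Exception e) {
    String msg = e.getMessage() == null ? "" : e.getMessage().toLowerCase();
    if (msg.contains("exceeds maximum file size") || msg.contains("too large")) {
        return VisionErrorCategory.PAYLOAD_TOO_LARGE; // Claude is explicit about this
    }
    if (msg.contains("rate limit") || msg.contains("429")) {
        return VisionErrorCategory.RATE_LIMITED;
    }
    if (msg.contains("invalid_request_error")) {
        return VisionErrorCategory.INVALID_REQUEST;   // GPT-4V's catch-all
    }
    return VisionErrorCategory.UNKNOWN;
}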
Difference 4: images in multi-turn conversations
Both support images in multi-turn conversations, but Claude applies a stricter limit to the total size of images across the message history. In practice, I recommend attaching images only on the first turn and referring to them in plain text on later turns, as in the pruning sketch below.
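A sketch of that policy; ConversationMessage is a hypothetical history type with getImages()/setImages() accessors, not one of the DTOs defined above:

// Sketch: keep images only in the first image-bearing turn of the history;
// later turns go out as text only
private List<ConversationMessage> pruneHistoryImages(List<ConversationMessage> history) {
    boolean firstImageTurnSeen = false;
    List<ConversationMessage> pruned = new ArrayList<>();
    for (ConversationMessage msg : history) {
        if (msg.getImages() != null && !msg.getImages().isEmpty()) {
            if (firstImageTurnSeen) {
                msg.setImages(List.of()); // strip images from later turns
            }
            firstImageTurnSeen = true;
        }
        pruned.add(msg);
    }
    return pruned;
}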
