Post 1945: The Evolution of Multimodal Model API Design, from Text to Image, Audio, and Video Interface Conventions

Lao Zhang, 2026/4/30, about 9 minutes

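I first integrated an image-understanding API at the end of 2023. GPT-4V had just come out and the interface was crude: you stuffed the image into content as base64. Two years on, multimodal API design has become a good deal more complex, but also far more mature. This post walks through that evolution and the things Java engineers need to watch for when integrating these APIs in practice.

How the interface design evolved

The earliest multimodal APIs were blunt and simple. A request to OpenAI's first Vision API looked roughly like this: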

{
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQ..."
          }
        },
        {
          "type": "text",
          "text": "What is in this image?"
        }
      ]
    }
  ]
}

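The problems are obvious: there are no standardized image-processing options; base64 is embedded directly in the request, so a large image makes the request body very large; and there is no caching, so the image has to be re-sent on every call.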
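To put a number on the request-size problem: base64 encodes every 3 bytes as 4 characters, so an inline image grows by about a third before any JSON overhead is counted. A minimal sketch follows; the class and method names are illustrative and do not come from any vendor SDK.

```java
import java.util.Base64;

// Illustrative helper, not part of any vendor SDK
final class InlineImageSize {

    // Number of base64 characters produced for a payload of rawBytes bytes (with padding)
    static long base64Length(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    // Builds the data URL form used in the request above
    static String toDataUrl(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        long raw = 3L * 1024 * 1024; // a 3 MB image
        System.out.println(raw + " raw bytes become " + base64Length(raw) + " base64 characters");
    }
}
```

A 3 MB image therefore adds roughly 4 MB to every request that carries it inline, which is why URL references and upload-once file APIs matter.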

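Today's designs are a clear improvement. They support URL references (the model server fetches the image itself), OpenAI offers a detail parameter that controls how finely the image is analyzed, and Claude's image blocks carry an explicit media type declaration.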
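The two shapes mentioned here can be put side by side. This sketch builds them as plain maps that a JSON library could serialize; the field names follow the OpenAI chat format and the Anthropic Messages format, while the class and method names are made up for illustration.

```java
import java.util.Map;

// Illustrative builders for the image part of a message; not an SDK API
final class ImageParts {

    // OpenAI style: the image is referenced by URL, with an optional detail level
    static Map<String, Object> openAiImageUrl(String url, String detail) {
        return Map.of(
            "type", "image_url",
            "image_url", Map.of("url", url, "detail", detail));
    }

    // Claude style: base64 data with an explicitly declared media type
    static Map<String, Object> claudeBase64Image(String mediaType, String base64Data) {
        return Map.of(
            "type", "image",
            "source", Map.of(
                "type", "base64",
                "media_type", mediaType,
                "data", base64Data));
    }
}
```

The difference in nesting is small, but it is exactly the kind of detail a cross-vendor abstraction layer has to hide.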

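The bigger changes are on the audio and video side. API designs there differ widely, vendors have not converged on a common convention, and it is currently the messiest area.

A standard implementation of the image-understanding API

Let's start with the most common case, image understanding, with an implementation aimed at production use: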

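import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeType;
import org.springframework.util.MimeTypeUtils;

@Service
public class VisionService {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder bean; build the client once
    public VisionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    /**
     * Analyze an image (accepts an http(s) URL or a local file path)
     */
    public String analyzeImage(String imageSource, String instruction) {
        MimeType mimeType = detectMimeType(imageSource);

        if (imageSource.startsWith("http://") || imageSource.startsWith("https://")) {
            // URL mode: the image is passed by reference and fetched on the model side
            URL url = toUrl(imageSource);
            return chatClient.prompt()
                .user(u -> u.text(instruction).media(mimeType, url))
                .call()
                .content();
        }

        // Local file: pass the raw bytes; the model integration encodes them for the request
        Resource image = new ByteArrayResource(readBytes(imageSource));
        return chatClient.prompt()
            .user(u -> u.text(instruction).media(mimeType, image))
            .call()
            .content();
    }

    private URL toUrl(String source) {
        try {
            return URI.create(source).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid image URL: " + source, e);
        }
    }

    private byte[] readBytes(String filePath) {
        try {
            return Files.readAllBytes(Path.of(filePath));
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read image: " + filePath, e);
        }
    }

    // Guess the MIME type from the file extension; falls back to JPEG when unknown
    private MimeType detectMimeType(String source) {
        String lower = source.toLowerCase();
        if (lower.endsWith(".png")) return MimeTypeUtils.IMAGE_PNG;
        if (lower.endsWith(".gif")) return MimeTypeUtils.IMAGE_GIF;
        if (lower.endsWith(".webp")) return MimeType.valueOf("image/webp");
        return MimeTypeUtils.IMAGE_JPEG;
    }
}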

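Practical advice on the image detail parameter:

OpenAI's Vision API supports two detail levels, low and high, and defaults to auto. On the wire it is a field inside image_url. Whether Spring AI's fluent API forwards it depends on the version you run, so verify that before relying on it. The snippet shows the wire-level shape next to a ChatClient call that does not itself send detail: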

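public String analyzeWithDetail(String imageUrl, String question, String detail)
        throws MalformedURLException {
    // detail "low": the image is processed at 512x512; cheaper, good for an overview
    // detail "high": high-resolution tiling; more expensive, good for fine detail
    OpenAiChatOptions options = OpenAiChatOptions.builder()
        .model("gpt-4o")
        .build();

    // Wire-level shape of the image part, shown for reference only:
    // this map is NOT attached to the request below
    Map<String, Object> imageContent = Map.of(
        "type", "image_url",
        "image_url", Map.of(
            "url", imageUrl,
            "detail", detail  // "low", "high", or "auto"
        )
    );

    // toURL() throws a checked exception, so resolve the URL outside the lambda
    URL url = URI.create(imageUrl).toURL();

    // The fluent call sends the image without a detail value, so the API default (auto) applies
    return chatClient.prompt()
        .options(options)
        .user(spec -> spec.text(question).media(MimeTypeUtils.IMAGE_JPEG, url))
        .call()
        .content();
}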

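Batch image processing

Many scenarios involve multiple images: product image review, receipt recognition, document scanning. Here is how to send several images in a single request: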

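@Service
public class BatchVisionService {

    // Per-request cap. The author's figures at the time of writing were roughly
    // 10 images for GPT-4o and 20 media blocks for Claude; limits differ by model
    // and change over time, so check the provider documentation.
    private static final int MAX_IMAGES_PER_REQUEST = 10;

    private final ChatClient chatClient;

    public BatchVisionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    /**
     * Compare several images in one request
     */
    public String compareImages(List<String> imagePaths, String comparisonInstruction) {
        // Validate before loading anything from disk
        if (imagePaths.size() > MAX_IMAGES_PER_REQUEST) {
            throw new IllegalArgumentException(
                "At most " + MAX_IMAGES_PER_REQUEST + " images per request");
        }

        // loadImageAsMedia is a helper that is not shown in this article
        Media[] media = imagePaths.stream()
            .map(this::loadImageAsMedia)
            .toArray(Media[]::new);

        return chatClient.prompt()
            .user(spec -> spec.text(comparisonInstruction).media(media))
            .call()
            .content();
    }

    /**
     * Analyze a list of images, one request per image
     */
    public List<ImageAnalysisResult> batchAnalyze(
            List<String> imagePaths,
            String analysisPrompt) {

        List<ImageAnalysisResult> results = new ArrayList<>();

        // Work through the list in groups of 3; each image is still sent in its own request
        for (List<String> batch : partition(imagePaths, 3)) {
            for (String imagePath : batch) {
                try {
                    // analyzeSingle is a helper that is not shown in this article
                    String result = analyzeSingle(imagePath, analysisPrompt);
                    results.add(ImageAnalysisResult.success(imagePath, result));
                } catch (Exception e) {
                    results.add(ImageAnalysisResult.failure(imagePath, e.getMessage()));
                }
                try {
                    // Simple throttle to avoid hitting the rate limit
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    // Stop early and keep the interrupt flag set for the caller
                    Thread.currentThread().interrupt();
                    return results;
                }
            }
        }

        return results;
    }

    // Splits the list into consecutive sublists of at most `size` elements
    private <T> List<List<T>> partition(List<T> list, int size) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < list.size(); i += size) {
            partitions.add(list.subList(i, Math.min(i + size, list.size())));
        }
        return partitions;
    }
}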

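Audio API integration

Audio is where vendor API designs currently differ the most. The capabilities fall into three groups: speech recognition (STT), text-to-speech (TTS), and end-to-end voice conversation (speech-to-speech).

OpenAI Whisper API (speech recognition):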

@Service
public class AudioTranscriptionService {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    // The transcription model is built on OpenAiAudioApi, not on the chat OpenAiApi
    public AudioTranscriptionService(OpenAiAudioApi openAiAudioApi) {
        this.transcriptionModel = new OpenAiAudioTranscriptionModel(openAiAudioApi,
            OpenAiAudioTranscriptionOptions.builder()
                .model(OpenAiAudioApi.WhisperModel.WHISPER_1.getValue())
                .language("zh")      // naming the language (Chinese here) can improve accuracy
                .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.JSON)
                .temperature(0.0f)   // use temperature 0 for transcription
                .build());
    }

    /**
     * Transcribe an audio file
     */
    public TranscriptionResult transcribe(MultipartFile audioFile) throws IOException {
        validateAudioFile(audioFile);

        byte[] audioBytes = audioFile.getBytes();
        String fileName = audioFile.getOriginalFilename();

        // saveTempFile is a helper that is not shown in this article. It should keep the
        // original extension, and the temp file should be deleted once the call returns.
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
            new FileSystemResource(saveTempFile(audioBytes, fileName)),
            OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1")
                .language("zh")
                .build()
        );

        AudioTranscriptionResponse response = transcriptionModel.call(prompt);

        // Duration is deliberately not set: the upload size in bytes is not a usable
        // estimate. Read it from a verbose_json response or from the audio metadata.
        return TranscriptionResult.builder()
            .text(response.getResult().getOutput())
            .build();
    }

    /**
     * Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
     * Maximum file size: 25MB
     */
    public void validateAudioFile(MultipartFile file) {
        long maxSize = 25 * 1024 * 1024L;
        if (file.getSize() > maxSize) {
            throw new IllegalArgumentException("Audio file must not exceed 25MB");
        }

        List<String> supportedFormats = Arrays.asList(
            "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"
        );
        // getExtension returns null when the original filename is null
        String extension = FilenameUtils.getExtension(file.getOriginalFilename());
        if (extension == null || !supportedFormats.contains(extension.toLowerCase())) {
            throw new IllegalArgumentException("Unsupported audio format: " + extension);
        }
    }
}

TTS (text-to-speech):

@Service
public class TextToSpeechService {

    // Class names below follow the Spring AI 1.0 OpenAI module; check them against your version
    @Autowired
    private OpenAiAudioSpeechModel speechModel;

    /**
     * Generate speech audio
     * voice options: alloy, echo, fable, onyx, nova, shimmer
     */
    public byte[] generateSpeech(String text, String voice, String format) {
        // Enum.valueOf expects the constant name, so "nova" is upper-cased to NOVA
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
            .model("tts-1-hd")
            .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(voice.toUpperCase()))
            .responseFormat(
                OpenAiAudioApi.SpeechRequest.AudioResponseFormat.valueOf(format.toUpperCase()))
            .speed(1.0f)  // 0.25 - 4.0
            .build();

        return speechModel.call(new SpeechPrompt(text, options)).getResult().getOutput();
    }

    /**
     * Streaming TTS, suited to real-time playback
     */
    public Flux<byte[]> streamSpeech(String text) {
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
            .model("tts-1")
            .voice(OpenAiAudioApi.SpeechRequest.Voice.NOVA)
            .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
            .build();

        return speechModel.stream(new SpeechPrompt(text, options))
            .map(response -> response.getResult().getOutput());
    }
}

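The engineering state of video understanding

Video understanding is the most complex part of multimodal work. API maturity varies widely across vendors, and it is also the hardest part to implement.

From an engineering standpoint there are two routes for handling video:

Frame extraction: sample the video into a sequence of still images, then process them with image understanding. This is currently the most portable approach, but it loses temporal information, and cost climbs as the frame count grows.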

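@Service
public class VideoAnalysisService {

    // Assumed to expose analyzeMultipleImages(List<byte[]>, String);
    // that method is not shown in this article
    private final VisionService visionService;

    public VideoAnalysisService(VisionService visionService) {
        this.visionService = visionService;
    }

    /**
     * Analyze a video by extracting frames
     * Frames are grabbed with JavaCV (FFmpeg bindings for Java)
     */
    public String analyzeVideoByFrames(String videoPath, String question) throws Exception {
        List<BufferedImage> frames = extractKeyFrames(videoPath, 10); // at most 10 frames

        // imageToBytes is a helper that is not shown in this article
        List<byte[]> frameBytes = frames.stream()
            .map(frame -> imageToBytes(frame, "jpg"))
            .collect(Collectors.toList());

        // Tell the model that the images are an ordered frame sequence
        StringBuilder prompt = new StringBuilder();
        prompt.append("The following ").append(frames.size())
            .append(" frames were extracted from a video in chronological order:\n");
        prompt.append(question);

        // Send the frames as one multi-image request
        return visionService.analyzeMultipleImages(frameBytes, prompt.toString());
    }

    /**
     * Extract frames at a fixed step using FFmpeg
     * Dependency: org.bytedeco:javacv-platform
     */
    private List<BufferedImage> extractKeyFrames(String videoPath, int maxFrames)
            throws Exception {
        List<BufferedImage> frames = new ArrayList<>();

        try (FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(videoPath)) {
            grabber.start();

            int totalFrames = grabber.getLengthInFrames();
            int step = Math.max(1, totalFrames / maxFrames);
            Java2DFrameConverter converter = new Java2DFrameConverter();

            for (int i = 0; i < totalFrames && frames.size() < maxFrames; i += step) {
                grabber.setFrameNumber(i);
                Frame frame = grabber.grabImage();
                if (frame != null) {
                    BufferedImage image = converter.getBufferedImage(frame);
                    if (image != null) {
                        // The converter can reuse its internal buffer between calls,
                        // so store a copy instead of the returned instance
                        frames.add(copyOf(image));
                    }
                }
            }

            grabber.stop();
        }

        return frames;
    }

    // Draws the source into a fresh RGB image so later grabs cannot overwrite it
    private static BufferedImage copyOf(BufferedImage src) {
        BufferedImage copy = new BufferedImage(
            src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = copy.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return copy;
    }
}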
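One detail of the step-based loop above is worth spelling out: with step = totalFrames / maxFrames, a 105-frame clip capped at 10 frames samples indices 0, 10, ..., 90 and never looks at the last 14 frames. A possible alternative, sketched here with illustrative names, is to take the midpoint of each of maxFrames equal segments so the samples cover the whole clip.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper for choosing which frame numbers to grab
final class FrameSampler {

    // Returns up to maxFrames distinct indices in [0, totalFrames),
    // one at the midpoint of each equal-length segment of the video
    static List<Integer> sampleIndices(int totalFrames, int maxFrames) {
        List<Integer> indices = new ArrayList<>();
        if (totalFrames <= 0 || maxFrames <= 0) {
            return indices;
        }
        int count = Math.min(totalFrames, maxFrames);
        for (int k = 0; k < count; k++) {
            // Midpoint of the k-th segment, computed in long to avoid overflow
            long index = ((2L * k + 1) * totalFrames) / (2L * count);
            indices.add((int) index);
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(sampleIndices(100, 10)); // [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
    }
}
```

Each returned index can be passed to grabber.setFrameNumber(...) in place of the loop counter. This still samples blindly; scene-change detection would be a further refinement and is not covered here.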

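Native video API: Google's Gemini 1.5 Pro accepts video files directly and is currently the most complete native video-understanding API. You upload the video through the File API and then reference it in the request: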

@Service
public class GeminiVideoService {

    private static final String GEMINI_API_KEY = System.getenv("GOOGLE_API_KEY");
    private final RestTemplate restTemplate = new RestTemplate();

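    /**
     * Use Gemini's native video understanding
     * Supports videos up to about 1 hour long
     */
    public String analyzeVideo(String videoFilePath, String question) throws IOException {
        // Step 1: upload the video to the Google File API
        String fileUri = uploadVideoFile(videoFilePath);

        // Step 2: wait until the uploaded video has finished processing
        // (waitForVideoProcessing is a helper that is not shown in this article)
        waitForVideoProcessing(fileUri);

        // Step 3: send the analysis request
        return queryVideoContent(fileUri, question);
    }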

    private String uploadVideoFile(String videoFilePath) throws IOException {
        // Reads the whole file into memory: fine for short clips, too heavy for large videos
        byte[] videoBytes = Files.readAllBytes(Path.of(videoFilePath));
        String mimeType = "video/mp4";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.valueOf(mimeType));
        headers.set("X-Goog-Upload-Protocol", "raw");
        headers.set("X-Goog-Upload-Command", "upload, finalize");

        HttpEntity<byte[]> entity = new HttpEntity<>(videoBytes, headers);

        ResponseEntity<Map> response = restTemplate.exchange(
            "https://generativelanguage.googleapis.com/upload/v1beta/files?key=" + GEMINI_API_KEY,
            HttpMethod.POST, entity, Map.class);

        return ((Map<?, ?>) response.getBody().get("file")).get("uri").toString();
    }

    private String queryVideoContent(String fileUri, String question) {
        Map<String, Object> requestBody = Map.of(
            "contents", List.of(Map.of(
                "parts", List.of(
                    Map.of("file_data", Map.of(
                        "mime_type", "video/mp4",
                        "file_uri", fileUri
                    )),
                    Map.of("text", question)
                )
            ))
        );

        ResponseEntity<Map> response = restTemplate.postForEntity(
            "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro:generateContent?key=" + GEMINI_API_KEY,
            requestBody, Map.class);

        // Parse the response (extractTextFromGeminiResponse is a helper not shown in this article)
        return extractTextFromGeminiResponse(response.getBody());
    }
}
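The waiting step is left undefined in the service above. A minimal sketch of the polling logic follows, assuming the caller supplies a function that fetches the file's current state (the File API reports states such as PROCESSING, ACTIVE, and FAILED). The class name, method name, and the supplier are hypothetical; the HTTP call that reads the state is deliberately left out.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Illustrative polling helper; the state lookup itself is supplied by the caller
final class FileStatePoller {

    // Polls until the state is ACTIVE. Throws IllegalStateException if the state
    // becomes FAILED, the timeout elapses, or the thread is interrupted.
    static void awaitActive(Supplier<String> stateSupplier, Duration interval, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String state = stateSupplier.get();
            if ("ACTIVE".equals(state)) {
                return;
            }
            if ("FAILED".equals(state)) {
                throw new IllegalStateException("Video processing failed");
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Timed out waiting for video, last state: " + state);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for video", e);
            }
        }
    }
}
```

In the service, the supplier would wrap a GET on the uploaded file's metadata and return its state field. An interval of a few seconds with a timeout of several minutes is a plausible starting point for long videos, not a measured recommendation.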

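A unified handling layer for multimodal content

Real projects often need to handle mixed content: a single request may contain text, images, and even documents. It helps to build a unified content-building layer: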

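@Service
public class MultimodalRequestBuilder {

    private final ChatClient chatClient;

    public MultimodalRequestBuilder(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    /**
     * Unified multimodal request builder
     */
    public ChatClient.ChatClientRequestSpec build(MultimodalRequest request) {
        // Assemble all text first: text(...) on the user spec replaces any text
        // set earlier rather than appending to it
        StringBuilder text = new StringBuilder();
        if (request.getText() != null) {
            text.append(request.getText());
        }
        // Documents (PDF etc.) are added as already-extracted text
        for (MultimodalRequest.DocumentItem doc : request.getDocuments()) {
            text.append("\n[Document: ").append(doc.getName()).append("]\n")
                .append(doc.getExtractedText());
        }

        ChatClient.ChatClientRequestSpec prompt = chatClient.prompt();
        if (request.getSystemPrompt() != null) {
            prompt = prompt.system(request.getSystemPrompt());
        }

        return prompt.user(u -> {
            if (text.length() > 0) {
                u.text(text.toString());
            }
            // Attach images, using each item's declared MIME type
            for (MultimodalRequest.ImageItem image : request.getImages()) {
                MimeType mimeType = MimeTypeUtils.parseMimeType(image.getMimeType());
                if (image.isUrl()) {
                    u.media(mimeType, toUrl(image.getSource()));
                } else {
                    byte[] bytes = Base64.getDecoder().decode(image.getSource());
                    u.media(mimeType, new ByteArrayResource(bytes));
                }
            }
        });
    }

    // Wraps the checked exception so the method can be called inside a lambda
    private static URL toUrl(String source) {
        try {
            return URI.create(source).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid image URL: " + source, e);
        }
    }
}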

@Data
@Builder
public class MultimodalRequest {
    private String systemPrompt;
    private String text;
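    // @Builder ignores field initializers unless they are marked @Builder.Default
    @Builder.Default
    private List<ImageItem> images = new ArrayList<>();
    @Builder.Default
    private List<DocumentItem> documents = new ArrayList<>();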
    private String model;

    @Data
    @AllArgsConstructor
    public static class ImageItem {
        private String source; // URL or base64
        private String mimeType;
        private boolean isUrl;
    }

    @Data
    @AllArgsConstructor
    public static class DocumentItem {
        private String name;
        private String extractedText;
    }
}

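Common pitfalls and caveats

Pitfall 1: image size limits

Image size limits differ between models. In low-detail mode GPT-4o processes a single image at 512×512; in high-detail mode the image is cut into multiple tiles, and each tile consumes additional tokens. A 4K image can consume 1000+ tokens in high-detail mode, so keep an eye on cost.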
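The 1000+ figure can be reproduced from the tile rule OpenAI has published for GPT-4o-class models: fit the image inside 2048×2048, scale it so the shortest side is 768, then charge 85 base tokens plus 170 per 512×512 tile. The sketch below encodes that rule under two assumptions: the constants still match the model you use (other models are priced differently), and images already smaller than the targets are not upscaled. The class and method names are illustrative.

```java
// Illustrative estimator; the constants are assumptions tied to GPT-4o-class pricing
final class VisionTokenEstimator {

    static int estimateTokens(int width, int height, String detail) {
        if ("low".equals(detail)) {
            return 85; // flat cost in low-detail mode
        }
        double w = width;
        double h = height;

        // Step 1: shrink to fit inside a 2048 x 2048 square
        double fit = Math.min(1.0, 2048.0 / Math.max(w, h));
        w *= fit;
        h *= fit;

        // Step 2: shrink so the shortest side is at most 768
        double shrink = Math.min(1.0, 768.0 / Math.min(w, h));
        w *= shrink;
        h *= shrink;

        // Step 3: count the 512 x 512 tiles needed to cover the image
        int tiles = (int) Math.ceil(w / 512.0) * (int) Math.ceil(h / 512.0);
        return 85 + 170 * tiles;
    }

    public static void main(String[] args) {
        // 3840x2160 -> 2048x1152 -> about 1365x768 -> 3 x 2 tiles
        System.out.println(estimateTokens(3840, 2160, "high")); // 1105
    }
}
```

For a 4K frame this gives 6 tiles and 1105 tokens, against 85 in low-detail mode, so choosing the detail level per use case is the cheapest optimization available.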

Compress images before sending:

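public byte[] resizeIfNeeded(byte[] imageBytes, int maxWidth, int maxHeight) throws IOException {
    BufferedImage original = ImageIO.read(new ByteArrayInputStream(imageBytes));
    if (original == null) {
        // ImageIO.read returns null when no registered reader understands the data
        throw new IOException("Unsupported or corrupt image data");
    }

    if (original.getWidth() <= maxWidth && original.getHeight() <= maxHeight) {
        return imageBytes;
    }

    // Scale proportionally so the result fits inside maxWidth x maxHeight
    double scale = Math.min(
        (double) maxWidth / original.getWidth(),
        (double) maxHeight / original.getHeight()
    );

    int newWidth = Math.max(1, (int) (original.getWidth() * scale));
    int newHeight = Math.max(1, (int) (original.getHeight() * scale));

    BufferedImage resized = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
    Graphics2D g2d = resized.createGraphics();
    // JPEG has no alpha channel: paint white first so transparent areas do not turn black
    g2d.setColor(Color.WHITE);
    g2d.fillRect(0, 0, newWidth, newHeight);
    g2d.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
        RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    g2d.drawImage(original, 0, 0, newWidth, newHeight, null);
    g2d.dispose();

    // The output is always JPEG, whatever the input format was
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ImageIO.write(resized, "jpg", baos);
    return baos.toByteArray();
}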

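Pitfall 2: Claude does not support image URLs (in some scenarios)

In some configurations Claude's API cannot load images from a URL, and you must send base64. If your code targets multiple models, you need to handle this difference.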
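One way to absorb this difference is to decide in a single place whether an image goes out by reference or inline. The sketch below assumes the caller knows whether the target provider accepts URLs and supplies its own download function; ImageInputResolver, ImageInput, and both parameters are hypothetical names, not part of any SDK.

```java
import java.util.Base64;
import java.util.function.Function;

// Illustrative adapter that hides the URL-versus-base64 difference between providers
final class ImageInputResolver {

    // byUrl == true: value is the URL; otherwise value is the base64-encoded image
    record ImageInput(boolean byUrl, String value, String mimeType) {}

    static ImageInput resolve(String imageUrl,
                              String mimeType,
                              boolean providerAcceptsUrls,
                              Function<String, byte[]> downloader) {
        if (providerAcceptsUrls) {
            // Cheapest path: let the provider fetch the image itself
            return new ImageInput(true, imageUrl, mimeType);
        }
        // Fallback: download on our side and send the bytes inline
        byte[] bytes = downloader.apply(imageUrl);
        return new ImageInput(false, Base64.getEncoder().encodeToString(bytes), mimeType);
    }
}
```

Passing the downloader in as a function keeps the decision logic testable without network access, and gives one place to add size checks or caching later.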

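Pitfall 3: language hint for audio transcription

When no language is specified, Whisper auto-detects it, but accuracy is slightly lower and it sometimes misidentifies Chinese as another language. For Chinese audio, explicitly passing language: "zh" is strongly recommended.

Pitfall 4: timeouts for multimodal requests

Requests that include images are slower than text-only requests, especially with high-resolution images. Raise timeouts accordingly; at least 90 seconds is recommended.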
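As a concrete reference for where that number goes, here is a sketch using the JDK's java.net.http client rather than Spring's HTTP clients, purely to stay dependency-free. The 10-second connect timeout, the class name, and the URL in the usage note are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Illustrative HTTP setup with a longer response timeout for image-bearing requests
final class MultimodalHttp {

    // Time allowed for the full response; 90 seconds follows the advice above
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(90);

    // Connecting should still fail fast; only the response wait is extended
    static HttpClient newClient() {
        return HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }

    static HttpRequest jsonPost(String url, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(REQUEST_TIMEOUT)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }
}
```

A call such as MultimodalHttp.jsonPost("https://api.example.com/v1/chat/completions", body) then carries the 90-second limit per request. Keeping the connect timeout short means an unreachable host is still detected quickly.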

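Where interface conventions are heading

The overall trends:

Standardization will keep advancing. OpenAI's multimodal API format has become the de facto industry reference, and most other vendors are building compatibility layers for it. The multimodal abstraction layers in frameworks such as Spring AI and LangChain4j will keep improving, which will make cross-model multimodal applications easier to build.

Video understanding will be the next breakout area. The "long context + video" combination that vendors are all working on will make video-analysis applications feasible. The main engineering challenges are cost control and latency optimization.

Real-time audio/video conversation (end-to-end voice mode) is still early and its APIs are unstable. Betting core business on it now is not advisable, but it is worth tracking.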