Multimodal AI Applications: Teaching Java Apps to Understand Text, Images, and Audio
Screenshots failed the moment they were uploaded: notes from the multimodal retrofit that lifted user satisfaction 35%
In November 2025, a support ticket left Zhang Hao, tech lead of a SaaS product, speechless for a full minute.
The complaint read: "I hit an error and sent a screenshot. The AI assistant said it couldn't see images and asked me to type out the error message. I typed it character by character and asked again. The AI said it couldn't access the screenshot. I asked a third time, and it started analyzing a problem that never happened on my machine. I ended up uninstalling your software."
Below that complaint were 47 "me too" replies.
Zhang Hao pulled the product's user behavior data: in support-ticket scenarios, 73% of users tried to upload a screenshot. But their AI assistant was text-only, and the frontend blocked every image with a single prompt: "Please describe your problem."
The retrofit took 3 weeks to build. In its first month in production:
- Ticket resolution rate rose from 41% to 67%
- User satisfaction (CSAT) rose from 3.2 to 4.3 (out of 5)
- Average turns to resolve a ticket dropped from 7.2 to 3.8
The difference: when a user uploads a screenshot, the AI reads the error message at a glance, with no need for the user to describe it over and over.
This article is the complete implementation of that retrofit.
1. The Multimodal AI Landscape: Your Options in 2026
1.1 Capability comparison of mainstream multimodal models
| Model | Image understanding | Document OCR | Audio input | Image generation | Video understanding | Price (per image) |
|---|---|---|---|---|---|---|
| GPT-4o | Excellent | Excellent | Excellent (Whisper) | DALL-E 3 | Frame sampling | $0.00765/image* |
| Claude 3.5 Sonnet | Excellent | Excellent | No native support | Not supported | Limited | $0.024/image* |
| Gemini 1.5 Pro | Excellent | Excellent | Supported | Imagen | Native | $0.00263/image* |
| Qwen-VL-Max | Good | Good | Not supported | Not supported | Limited | ¥0.008/image |

*Image prices assume 1024x1024 in low-detail mode; actual billing is token-based, and high-detail images consume more tokens.
1.2 Where Spring AI's multimodal support stands
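Spring AI routes all of the modalities in this article through one abstraction: a `UserMessage` can carry a list of `Media` objects alongside its text (the code in the following sections uses exactly this), image generation goes through `ImageModel`, and transcription through `OpenAiAudioTranscriptionModel`. Under the hood the OpenAI client inlines local image bytes into the request as a base64 "data URL". A JDK-only sketch of that encoding step (the class name is illustrative):

```java
import java.util.Base64;

public class ImagePayloadSketch {

    // Vision APIs accept local images inline as a base64 "data URL";
    // this is what the client builds from the bytes behind a Media object.
    public static String toDataUrl(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        byte[] fakePngBytes = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(toDataUrl(fakePngBytes, "image/png"));
    }
}
```

URL-based images skip this step entirely: the URL goes into the payload as-is and the provider fetches it, which is why the URL variant of image analysis later in this article needs no download.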
2. Image Understanding: Handling Image Input with Spring AI
2.1 Basic image understanding (local file + URL)
@Service
@RequiredArgsConstructor
@Slf4j
public class ImageUnderstandingService {
private final ChatClient chatClient;
/**
* Understand a local image file.
*/
public ImageAnalysisResult analyzeLocalImage(
Path imagePath, String question) throws IOException {
// Read the image file
byte[] imageBytes = Files.readAllBytes(imagePath);
String mimeType = detectMimeType(imagePath);
// Build the media object
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(mimeType),
new ByteArrayResource(imageBytes)
);
// Build the message carrying the image
UserMessage userMessage = new UserMessage(
question,
List.of(imageMedia)
);
long startTime = System.currentTimeMillis();
ChatResponse response = chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(1024)
.build())
.call()
.chatResponse();
String analysis = response.getResult().getOutput().getText();
long latencyMs = System.currentTimeMillis() - startTime;
return ImageAnalysisResult.builder()
.analysis(analysis)
.imageSize(imageBytes.length)
.mimeType(mimeType)
.latencyMs(latencyMs)
.tokensUsed(response.getMetadata().getUsage().getTotalTokens())
.build();
}
/**
* Understand an image by URL (passed straight to the model, no download needed).
*/
public ImageAnalysisResult analyzeImageFromUrl(String imageUrl, String question) {
// For URL images, pass the URL by reference instead of downloading
// (UrlResource's constructor throws a checked MalformedURLException)
Media imageMedia;
try {
    imageMedia = new Media(MimeTypeUtils.IMAGE_JPEG, new UrlResource(imageUrl));
} catch (MalformedURLException e) {
    throw new IllegalArgumentException("Invalid image URL: " + imageUrl, e);
}
UserMessage userMessage = new UserMessage(question, List.of(imageMedia));
String analysis = chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.build())
.call()
.content();
return ImageAnalysisResult.builder()
.analysis(analysis)
.imageUrl(imageUrl)
.build();
}
/**
* Batch image analysis (multiple images, one request).
*/
public String analyzeMultipleImages(List<Path> imagePaths, String instruction)
throws IOException {
List<Media> mediaList = new ArrayList<>();
for (Path imagePath : imagePaths) {
byte[] imageBytes = Files.readAllBytes(imagePath);
String mimeType = detectMimeType(imagePath);
mediaList.add(new Media(
MimeTypeUtils.parseMimeType(mimeType),
new ByteArrayResource(imageBytes)
));
}
// Note: GPT-4o accepts up to 20 images per request, Claude 3 up to 5
UserMessage userMessage = new UserMessage(instruction, mediaList);
return chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(2048)
.build())
.call()
.content();
}
/**
* Technical screenshot analysis (prompt tuned for engineers' screenshots).
*/
public TechScreenshotAnalysis analyzeTechScreenshot(
MultipartFile screenshotFile, String userContext) throws IOException {
byte[] imageBytes = screenshotFile.getBytes();
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(screenshotFile.getContentType())),
new ByteArrayResource(imageBytes)
);
String systemPrompt = """
You are a technical support expert skilled at analyzing problems in software screenshots.
When analyzing a screenshot:
1. Identify error messages, stack traces, or warnings in the screenshot
2. Understand the UI state and operation context
3. Extract key technical details (error codes, version numbers, URLs, etc.)
4. Give likely causes and solution steps
Output JSON:
{
"screenshot_type": "error_dialog/log_output/ui_state/code_editor",
"detected_issues": ["issue 1", "issue 2"],
"key_info": {"error_code": "xxx", "version": "xxx"},
"root_cause": "likely cause",
"solution_steps": ["step 1", "step 2"],
"confidence": 0.85
}
""";
String userMessage = String.format(
"User note: %s\n\nPlease analyze this screenshot:",
userContext != null ? userContext : "Please help me analyze this problem"
);
UserMessage message = new UserMessage(userMessage, List.of(imageMedia));
String response = chatClient.prompt()
.system(systemPrompt)
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.temperature(0.1)
.build())
.call()
.content();
return parseTechScreenshotAnalysis(response);
}
private String detectMimeType(Path path) {
String filename = path.getFileName().toString().toLowerCase();
if (filename.endsWith(".jpg") || filename.endsWith(".jpeg")) return "image/jpeg";
if (filename.endsWith(".png")) return "image/png";
if (filename.endsWith(".gif")) return "image/gif";
if (filename.endsWith(".webp")) return "image/webp";
return "image/jpeg"; // default
}
@Data
@Builder
public static class ImageAnalysisResult {
private String analysis;
private String imageUrl;
private long imageSize;
private String mimeType;
private long latencyMs;
private long tokensUsed;
}
@Data
@Builder
public static class TechScreenshotAnalysis {
private String screenshotType;
private List<String> detectedIssues;
private Map<String, String> keyInfo;
private String rootCause;
private List<String> solutionSteps;
private double confidence;
}
}
2.2 Image Upload Controller (Spring MVC Integration)
@RestController
@RequestMapping("/api/multimodal")
@RequiredArgsConstructor
@Slf4j
public class MultimodalController {
private final ImageUnderstandingService imageService;
private final AudioTranscriptionService audioService;
private final DocumentUnderstandingService documentService;
private final ContentSafetyService safetyService;
@PostMapping("/image/analyze")
public ResponseEntity<ImageAnalysisResponse> analyzeImage(
@RequestParam("file") MultipartFile file,
@RequestParam(value = "question", defaultValue = "Please describe the content of this image")
String question,
@RequestParam(value = "userId") String userId) {
// File size check (images capped at 20MB)
if (file.getSize() > 20 * 1024 * 1024) {
return ResponseEntity.badRequest()
.body(ImageAnalysisResponse.error("Image must not exceed 20MB"));
}
// File type check
String contentType = file.getContentType();
if (!isImageContentType(contentType)) {
return ResponseEntity.badRequest()
.body(ImageAnalysisResponse.error("Unsupported file type: " + contentType));
}
try {
// Safety check (filter harmful images)
ContentSafetyResult safety = safetyService.checkImage(file.getBytes());
if (!safety.isSafe()) {
log.warn("Unsafe image detected from user: {}, reason: {}",
userId, safety.getReason());
return ResponseEntity.status(HttpStatus.UNPROCESSABLE_ENTITY)
.body(ImageAnalysisResponse.error("Image content violates usage policy"));
}
// Analyze the image
ImageUnderstandingService.ImageAnalysisResult result;
if (isTechScreenshot(question)) {
ImageUnderstandingService.TechScreenshotAnalysis analysis =
imageService.analyzeTechScreenshot(file, question);
return ResponseEntity.ok(ImageAnalysisResponse.fromTechAnalysis(analysis));
} else {
result = imageService.analyzeLocalImage(
saveToTempFile(file), question);
return ResponseEntity.ok(ImageAnalysisResponse.fromResult(result));
}
} catch (IOException e) {
log.error("Image analysis failed for user: {}", userId, e);
return ResponseEntity.internalServerError()
.body(ImageAnalysisResponse.error("Image processing failed, please retry"));
}
}
@PostMapping("/image/compare")
public ResponseEntity<String> compareImages(
@RequestParam("before") MultipartFile beforeImage,
@RequestParam("after") MultipartFile afterImage,
@RequestParam(value = "context", defaultValue = "Please compare the differences between these two images")
String context) throws IOException {
String result = imageService.analyzeMultipleImages(
List.of(saveToTempFile(beforeImage), saveToTempFile(afterImage)),
context + "\nFocus on the concrete differences between before (first image) and after (second image)."
);
return ResponseEntity.ok(result);
}
private boolean isImageContentType(String contentType) {
return contentType != null &&
(contentType.startsWith("image/jpeg") ||
contentType.startsWith("image/png") ||
contentType.startsWith("image/gif") ||
contentType.startsWith("image/webp"));
}
// Keyword heuristic; the Chinese terms match error-related words in users' questions
private boolean isTechScreenshot(String question) {
String lower = question.toLowerCase();
return lower.contains("错误") || lower.contains("error") ||
lower.contains("bug") || lower.contains("异常") ||
lower.contains("报错") || lower.contains("日志");
}
private Path saveToTempFile(MultipartFile file) throws IOException {
Path tempFile = Files.createTempFile("multimodal_",
getExtension(file.getOriginalFilename()));
file.transferTo(tempFile);
tempFile.toFile().deleteOnExit();
return tempFile;
}
private String getExtension(String filename) {
if (filename == null) return ".jpg";
int dotIndex = filename.lastIndexOf('.');
return dotIndex >= 0 ? filename.substring(dotIndex) : ".jpg";
}
}
3. Document Understanding: OCR + Comprehension for PDF and Image Documents
3.1 Intelligent PDF parsing
@Service
@RequiredArgsConstructor
@Slf4j
public class DocumentUnderstandingService {
private final ChatClient chatClient;
private final PDFBoxDocumentLoader pdfLoader;
/**
* Intelligently parse a PDF (handles both text-based and scanned PDFs).
*/
public DocumentAnalysisResult analyzePdf(Path pdfPath, DocumentAnalysisConfig config)
throws IOException {
// Try text extraction first (works for digitally created PDFs)
String extractedText = extractTextFromPdf(pdfPath);
if (extractedText.length() > 100) {
// Text-based PDF: analyze the extracted text directly
return analyzeTextDocument(extractedText, config);
} else {
// Scanned PDF: convert pages to images and use vision
return analyzeScannedPdf(pdfPath, config);
}
}
private String extractTextFromPdf(Path pdfPath) throws IOException {
// PDFBox 2.x API (PDFBox 3.x uses Loader.loadPDF instead)
try (PDDocument document = PDDocument.load(pdfPath.toFile())) {
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(document);
}
}
private DocumentAnalysisResult analyzeTextDocument(
String text, DocumentAnalysisConfig config) {
// Chunk the document if it is too long
List<String> chunks = chunkText(text, 4000); // ~4000 tokens per chunk
if (chunks.size() == 1) {
// Short document: analyze in one shot
String analysis = chatClient.prompt()
.system(config.getAnalysisPrompt())
.user("请分析以下文档:\n\n" + chunks.get(0))
.call()
.content();
return DocumentAnalysisResult.builder()
.content(analysis)
.pageCount(1)
.processingMethod("TEXT_EXTRACTION")
.build();
} else {
// Long document: summarize each chunk first, then synthesize
return analyzeWithMapReduce(chunks, config);
}
}
private DocumentAnalysisResult analyzeWithMapReduce(
List<String> chunks, DocumentAnalysisConfig config) {
// Map phase: summarize each chunk independently
List<String> chunkSummaries = chunks.parallelStream()
.map(chunk -> {
return chatClient.prompt()
.system("Extract the key information from this passage, keeping important figures and conclusions")
.user(chunk)
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini") // use mini for chunk summaries to cut cost
.maxTokens(500)
.build())
.call()
.content();
})
.collect(Collectors.toList());
// Reduce phase: synthesize all the summaries
String combinedSummaries = String.join("\n\n---\n\n", chunkSummaries);
String finalAnalysis = chatClient.prompt()
.system(config.getAnalysisPrompt())
.user("Below are summaries of each part of the document; please synthesize an overall analysis:\n\n" + combinedSummaries)
.options(OpenAiChatOptions.builder()
.model("gpt-4o") // strong model for the final synthesis
.build())
.call()
.content();
return DocumentAnalysisResult.builder()
.content(finalAnalysis)
.chunkCount(chunks.size())
.processingMethod("MAP_REDUCE")
.build();
}
private DocumentAnalysisResult analyzeScannedPdf(
Path pdfPath, DocumentAnalysisConfig config) throws IOException {
// Convert PDF pages to images
List<byte[]> pageImages = convertPdfToImages(pdfPath);
if (pageImages.isEmpty()) {
return DocumentAnalysisResult.builder()
.content("Unable to parse this PDF document")
.processingMethod("FAILED")
.build();
}
// Cap at 10 pages (cost control)
List<byte[]> pagesToProcess = pageImages.subList(
0, Math.min(10, pageImages.size()));
List<Media> pageMedia = pagesToProcess.stream()
.map(bytes -> new Media(MimeTypeUtils.IMAGE_PNG, new ByteArrayResource(bytes)))
.collect(Collectors.toList());
UserMessage message = new UserMessage(
config.getAnalysisPrompt() + "\n\nThese are the document's page images; please extract and analyze the content:",
pageMedia
);
String analysis = chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(4096)
.build())
.call()
.content();
return DocumentAnalysisResult.builder()
.content(analysis)
.pageCount(pageImages.size())
.processedPageCount(pagesToProcess.size())
.processingMethod("OCR_VISION")
.build();
}
private List<byte[]> convertPdfToImages(Path pdfPath) throws IOException {
List<byte[]> images = new ArrayList<>();
try (PDDocument document = PDDocument.load(pdfPath.toFile())) {
PDFRenderer renderer = new PDFRenderer(document);
for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
BufferedImage image = renderer.renderImageWithDPI(pageIndex, 150); // 150 DPI balances quality and size
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(image, "PNG", baos);
images.add(baos.toByteArray());
}
}
return images;
}
private List<String> chunkText(String text, int maxTokens) {
// Naive character-count chunking; production code should count tokens (e.g. with a tiktoken port)
int maxChars = maxTokens * 4; // rough estimate: 1 token ≈ 4 characters (English; Chinese runs closer to 1-2)
List<String> chunks = new ArrayList<>();
int start = 0;
while (start < text.length()) {
int end = Math.min(start + maxChars, text.length());
// Try to split on a paragraph boundary
if (end < text.length()) {
int lastNewline = text.lastIndexOf("\n\n", end);
if (lastNewline > start + maxChars / 2) {
end = lastNewline;
}
}
chunks.add(text.substring(start, end));
start = end;
}
return chunks;
}
@Data
@Builder
public static class DocumentAnalysisConfig {
private String analysisPrompt;
private boolean extractStructuredData;
private List<String> fieldsToExtract;
public static DocumentAnalysisConfig defaultConfig() {
return DocumentAnalysisConfig.builder()
.analysisPrompt("Analyze the document, extract the key information, and provide a summary")
.build();
}
}
@Data
@Builder
public static class DocumentAnalysisResult {
private String content;
private int pageCount;
private int processedPageCount;
private int chunkCount;
private String processingMethod;
private Map<String, Object> extractedFields;
}
}
4. Speech to Text: Whisper API Integration
4.1 Audio transcription with Spring AI
@Service
@RequiredArgsConstructor
@Slf4j
public class AudioTranscriptionService {
private final OpenAiAudioTranscriptionModel transcriptionModel;
private final ChatClient chatClient;
/**
* Transcribe an audio file (mp3/mp4/wav/m4a/webm, among others).
*/
public TranscriptionResult transcribeAudio(
MultipartFile audioFile, String language) throws IOException {
// File size check (the Whisper API caps uploads at 25MB)
if (audioFile.getSize() > 25 * 1024 * 1024) {
throw new IllegalArgumentException("Audio file must not exceed 25MB");
}
long startTime = System.currentTimeMillis();
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
.model("whisper-1")
.language(language) // may be null (auto-detect)
.responseFormat(OpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.temperature(0.0f) // low temperature for accuracy
.build();
AudioTranscriptionResponse response = transcriptionModel.call(
new AudioTranscriptionPrompt(
new ByteArrayResource(audioFile.getBytes()) {
@Override
public String getFilename() {
return audioFile.getOriginalFilename();
}
},
options
)
);
String transcribedText = response.getResult().getOutput();
long latencyMs = System.currentTimeMillis() - startTime;
log.info("Audio transcription completed: {} chars in {}ms",
transcribedText.length(), latencyMs);
return TranscriptionResult.builder()
.text(transcribedText)
.language(extractLanguage(response))
.durationSeconds(extractDuration(response))
.latencyMs(latencyMs)
.build();
}
/**
* Real-time voice capture (suited to long conversations).
* Buffers the stream and transcribes in segments of up to 30 seconds.
*/
public Flux<String> transcribeAudioStream(Flux<byte[]> audioChunks, String language) {
return audioChunks
.buffer(Duration.ofSeconds(30)) // process every 30 seconds
.flatMap(chunks -> {
byte[] combined = combineChunks(chunks);
try {
TranscriptionResult result = transcribeBytes(combined, language);
return Flux.just(result.getText());
} catch (Exception e) {
log.error("Stream transcription failed", e);
return Flux.empty();
}
});
}
/**
* Combined voice + image input (e.g. a recording describing a problem shown in a picture).
*/
public String processVoiceWithImage(
MultipartFile audioFile, MultipartFile imageFile,
String language) throws IOException {
// Transcribe the audio first
TranscriptionResult transcription = transcribeAudio(audioFile, language);
log.info("Voice transcription: {}", transcription.getText());
// Then analyze together with the image
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(Objects.requireNonNull(imageFile.getContentType())),
new ByteArrayResource(imageFile.getBytes())
);
String combinedQuestion = String.format(
"The user's spoken question (transcribed): \"%s\"\n\nAnswer the question above based on the image content.",
transcription.getText()
);
UserMessage message = new UserMessage(combinedQuestion, List.of(imageMedia));
return chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.build())
.call()
.content();
}
private TranscriptionResult transcribeBytes(byte[] audioBytes, String language)
throws IOException {
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
.model("whisper-1")
.language(language)
.build();
AudioTranscriptionResponse response = transcriptionModel.call(
new AudioTranscriptionPrompt(
new ByteArrayResource(audioBytes) {
@Override
public String getFilename() { return "chunk.webm"; }
},
options
)
);
return TranscriptionResult.builder()
.text(response.getResult().getOutput())
.build();
}
private byte[] combineChunks(List<byte[]> chunks) {
int totalSize = chunks.stream().mapToInt(c -> c.length).sum();
ByteBuffer buffer = ByteBuffer.allocate(totalSize);
chunks.forEach(buffer::put);
return buffer.array();
}
private String extractLanguage(AudioTranscriptionResponse response) {
try {
return response.getResult().getMetadata().get("language", String.class);
} catch (Exception e) {
return "unknown";
}
}
private Double extractDuration(AudioTranscriptionResponse response) {
try {
return response.getResult().getMetadata().get("duration", Double.class);
} catch (Exception e) {
return null;
}
}
@Data
@Builder
public static class TranscriptionResult {
private String text;
private String language;
private Double durationSeconds;
private long latencyMs;
}
}
5. Image Generation: DALL-E Integration
5.1 Image generation service
@Service
@RequiredArgsConstructor
@Slf4j
public class ImageGenerationService {
private final ImageModel imageModel;
private final ChatClient chatClient;
/**
* Generate an image from a text description (DALL-E 3).
*/
public ImageGenerationResult generateImage(
String prompt, ImageGenerationConfig config) {
// Optionally auto-optimize the prompt
String optimizedPrompt = config.isAutoOptimize() ?
optimizeImagePrompt(prompt) : prompt;
long startTime = System.currentTimeMillis();
ImageOptions options = OpenAiImageOptions.builder()
.model("dall-e-3")
.quality(config.getQuality()) // standard/hd
.size(config.getSize()) // 1024x1024/1792x1024/1024x1792
.style(config.getStyle()) // vivid/natural
.n(1) // DALL-E 3 generates one image per request
.build();
ImageResponse response = imageModel.call(
new ImagePrompt(optimizedPrompt, options));
Image image = response.getResult().getOutput();
long latencyMs = System.currentTimeMillis() - startTime;
return ImageGenerationResult.builder()
.imageUrl(image.getUrl())
.revisedPrompt(image.getRevisedPrompt()) // DALL-E may rewrite the prompt
.originalPrompt(prompt)
.optimizedPrompt(optimizedPrompt)
.latencyMs(latencyMs)
.build();
}
/**
* Flowchart/architecture diagram generation (turn a text description into a visual).
* Note: this produces an illustrative image; for precise diagrams use Mermaid/PlantUML.
*/
public ImageGenerationResult generateDiagram(String description) {
String diagramPrompt = String.format("""
Create a clean, professional technical diagram showing: %s
Style: minimalist, white background, clear labels,
professional colors (blue, gray, white),
arrows showing data flow, boxes for components.
Make it look like a software architecture diagram.
""", description);
return generateImage(diagramPrompt, ImageGenerationConfig.builder()
.quality("standard")
.size("1792x1024") // landscape suits architecture diagrams
.style("natural")
.autoOptimize(false)
.build());
}
/**
* Use AI to optimize the image-generation prompt.
*/
private String optimizeImagePrompt(String userPrompt) {
return chatClient.prompt()
.system("""
You are a DALL-E 3 prompt engineering expert.
Rewrite the user's description into a better DALL-E prompt.
Guidelines:
1. Add concrete style cues (photorealistic/digital art/etc.)
2. Specify lighting, composition, and detail
3. Avoid content DALL-E does not support
4. Return only the optimized prompt, with no explanation
""")
.user(userPrompt)
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini")
.maxTokens(200)
.build())
.call()
.content();
}
@Data
@Builder
public static class ImageGenerationConfig {
private String quality;
private String size;
private String style;
private boolean autoOptimize;
public static ImageGenerationConfig defaultConfig() {
return ImageGenerationConfig.builder()
.quality("standard")
.size("1024x1024")
.style("vivid")
.autoOptimize(false)
.build();
}
}
@Data
@Builder
public static class ImageGenerationResult {
private String imageUrl;
private String revisedPrompt;
private String originalPrompt;
private String optimizedPrompt;
private long latencyMs;
}
}
6. Multimodal RAG: A Mixed Text-and-Image Knowledge Base
6.1 Embedding strategy for mixed text and image content
@Service
@RequiredArgsConstructor
@Slf4j
public class MultimodalRagService {
private final EmbeddingModel textEmbeddingModel;
private final VectorStore vectorStore;
private final ChatClient chatClient;
private final ImageUnderstandingService imageService;
/**
* Add mixed text/image content to the knowledge base.
* Strategy: describe the image with a vision model first, then embed the description as text.
*/
public void addImageToKnowledgeBase(
Path imagePath, String category, Map<String, String> metadata)
throws IOException {
// Step 1: generate a detailed textual description of the image with GPT-4o
String imageDescription = chatClient.prompt()
.messages(new UserMessage(
"Describe this image in detail, covering its content, any text, chart data, and technical details. " +
"Produce a thorough textual description for knowledge-base retrieval.",
List.of(new Media(
MimeTypeUtils.parseMimeType(detectMimeType(imagePath)),
new ByteArrayResource(Files.readAllBytes(imagePath))
))
))
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(1000)
.build())
.call()
.content();
// Step 2: create a document storing the image path together with its description
Map<String, Object> docMetadata = new HashMap<>(metadata);
docMetadata.put("content_type", "image");
docMetadata.put("image_path", imagePath.toString());
docMetadata.put("category", category);
Document document = new Document(imageDescription, docMetadata);
// Step 3: embed as text (standard text embedding model)
vectorStore.add(List.of(document));
log.info("Image {} added to knowledge base with description: {}...",
imagePath.getFileName(),
imageDescription.substring(0, Math.min(100, imageDescription.length())));
}
/**
* Multimodal RAG query: search the knowledge base by text, image, or both.
*/
public MultimodalRagResponse query(
String textQuery, Optional<MultipartFile> imageQuery) throws IOException {
String searchQuery = textQuery;
// If an image is supplied, extract its content first to enrich the search query
if (imageQuery.isPresent()) {
String imageContent = chatClient.prompt()
.messages(new UserMessage(
"Extract the key information, text, and main concepts from this image for knowledge-base retrieval",
List.of(new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(imageQuery.get().getContentType())),
new ByteArrayResource(imageQuery.get().getBytes())
))
))
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(200)
.build())
.call()
.content();
searchQuery = textQuery + "\nImage content: " + imageContent;
}
// Search the vector store
List<Document> relevantDocs = vectorStore.similaritySearch(
SearchRequest.query(searchQuery).withTopK(5));
if (relevantDocs.isEmpty()) {
return MultimodalRagResponse.builder()
.answer("No relevant knowledge-base content found")
.sources(List.of())
.build();
}
// Build the RAG context
String context = buildRagContext(relevantDocs);
// Final answer (pass the image along to the LLM if present)
String answer;
if (imageQuery.isPresent()) {
UserMessage finalMessage = new UserMessage(
String.format("Answer the user's question based on the following knowledge-base content:\n\n%s\n\nUser question: %s",
context, textQuery),
List.of(new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(imageQuery.get().getContentType())),
new ByteArrayResource(imageQuery.get().getBytes())
))
);
answer = chatClient.prompt().messages(finalMessage).call().content();
} else {
answer = chatClient.prompt()
.system("Answer based on the provided knowledge-base content; say so if it contains no relevant information")
.user(String.format("Knowledge base content:\n%s\n\nQuestion: %s", context, textQuery))
.call()
.content();
}
return MultimodalRagResponse.builder()
.answer(answer)
.sources(relevantDocs.stream()
.map(d -> DocumentSource.builder()
.contentType((String) d.getMetadata().get("content_type"))
.imagePath((String) d.getMetadata().get("image_path"))
.category((String) d.getMetadata().get("category"))
.relevanceScore(d.getScore())
.build())
.collect(Collectors.toList()))
.build();
}
private String buildRagContext(List<Document> docs) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < docs.size(); i++) {
sb.append(String.format("\n--- Reference %d (%s) ---\n%s\n",
    i + 1,
    docs.get(i).getMetadata().getOrDefault("category", "uncategorized"),
docs.get(i).getText()
));
}
return sb.toString();
}
private String detectMimeType(Path path) {
String filename = path.getFileName().toString().toLowerCase();
if (filename.endsWith(".png")) return "image/png";
if (filename.endsWith(".gif")) return "image/gif";
return "image/jpeg";
}
@Data
@Builder
public static class MultimodalRagResponse {
private String answer;
private List<DocumentSource> sources;
}
@Data
@Builder
public static class DocumentSource {
private String contentType;
private String imagePath;
private String category;
private Float relevanceScore;
}
}
7. Video Understanding: Frame Extraction + Multi-frame Analysis
7.1 Video analysis service
@Service
@RequiredArgsConstructor
@Slf4j
public class VideoUnderstandingService {
private final ChatClient chatClient;
/**
* Video analysis: extract key frames and analyze them together.
* Note: GPT-4o can take multiple frames in one request, but every frame consumes tokens.
*/
public VideoAnalysisResult analyzeVideo(
Path videoPath, VideoAnalysisConfig config) throws IOException {
// Extract key frames with FFmpeg (one frame every N seconds)
List<byte[]> keyFrames = extractKeyFrames(videoPath, config.getFrameIntervalSeconds());
if (keyFrames.isEmpty()) {
return VideoAnalysisResult.builder()
.summary("Unable to extract video frames")
.build();
}
// Cap the number of frames analyzed (limit token spend)
List<byte[]> selectedFrames = keyFrames.subList(
0, Math.min(config.getMaxFrames(), keyFrames.size()));
log.info("Analyzing video with {} frames (total {} frames extracted)",
selectedFrames.size(), keyFrames.size());
// Build the multi-frame analysis message
List<Media> frameMedia = selectedFrames.stream()
.map(frameBytes -> new Media(MimeTypeUtils.IMAGE_JPEG,
new ByteArrayResource(frameBytes)))
.collect(Collectors.toList());
String instructionWithContext = String.format(
"""
These are %d key frames from the video (in chronological order), one every %d seconds.
%s
Analyze the content of these frames, paying attention to changes and progression over time.
""",
selectedFrames.size(),
config.getFrameIntervalSeconds(),
config.getAnalysisInstruction()
);
UserMessage message = new UserMessage(instructionWithContext, frameMedia);
String analysis = chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(2048)
.build())
.call()
.content();
return VideoAnalysisResult.builder()
.summary(analysis)
.totalFramesExtracted(keyFrames.size())
.framesAnalyzed(selectedFrames.size())
.videoPath(videoPath.toString())
.build();
}
/**
* Extract key frames with FFmpeg.
* Requires FFmpeg to be installed on the host.
*/
private List<byte[]> extractKeyFrames(Path videoPath, int intervalSeconds)
throws IOException {
List<byte[]> frames = new ArrayList<>();
Path tempDir = Files.createTempDirectory("video_frames_");
try {
// Invoke FFmpeg to extract frames
ProcessBuilder pb = new ProcessBuilder(
"ffmpeg",
"-i", videoPath.toString(),
"-vf", String.format("fps=1/%d,scale=1280:-1", intervalSeconds),
"-q:v", "3", // JPEG quality
tempDir.toString() + "/frame_%04d.jpg"
);
pb.redirectErrorStream(true);
Process process = pb.start();
int exitCode = process.waitFor();
if (exitCode != 0) {
log.error("FFmpeg failed with exit code: {}", exitCode);
return frames;
}
// Read the extracted frames (try-with-resources closes the directory stream)
try (var framePaths = Files.list(tempDir)) {
    framePaths.sorted().forEach(framePath -> {
        try {
            frames.add(Files.readAllBytes(framePath));
        } catch (IOException e) {
            log.warn("Failed to read frame: {}", framePath);
        }
    });
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} finally {
// Clean up temp files
FileUtils.deleteDirectory(tempDir.toFile());
}
return frames;
}
@Data
@Builder
public static class VideoAnalysisConfig {
private int frameIntervalSeconds; // interval between frames (seconds)
private int maxFrames; // maximum frames to analyze
private String analysisInstruction;
public static VideoAnalysisConfig defaultConfig() {
return VideoAnalysisConfig.builder()
.frameIntervalSeconds(5)
.maxFrames(20)
.analysisInstruction("Describe the video's main content and how it develops")
.build();
}
}
@Data
@Builder
public static class VideoAnalysisResult {
private String summary;
private int totalFramesExtracted;
private int framesAnalyzed;
private String videoPath;
}
}
8. Content Safety: Detecting Harmful Image Content
8.1 Content safety service
@Service
@RequiredArgsConstructor
@Slf4j
public class ContentSafetyService {
private final ChatClient chatClient;
private final MeterRegistry meterRegistry;
private final ObjectMapper objectMapper; // needed by parseSafetyResult below
/**
* Check whether the image content is safe.
* Uses the model's content understanding for classification.
*/
public ContentSafetyResult checkImage(byte[] imageBytes) {
Media imageMedia = new Media(
MimeTypeUtils.IMAGE_JPEG,
new ByteArrayResource(imageBytes)
);
String response = chatClient.prompt()
.system("""
You are a content-safety review system. Check whether the image contains any of the following unsafe content:
1. Explicit violence or gore
2. Pornographic or sexually suggestive content
3. Hate symbols or hateful content
4. Dangerous activities or illegal behavior
Note: technical screenshots, documents, and ordinary everyday photos are all safe.
Flag as unsafe only content that clearly violates the rules.
Output JSON: {"safe": true/false, "category": "safe/violence/sexual/hate/other", "confidence": 0.9}
""")
.messages(new UserMessage("Please review this image", List.of(imageMedia)))
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini") // mini keeps moderation cheap
.temperature(0.0)
.maxTokens(100)
.build())
.call()
.content();
ContentSafetyResult result = parseSafetyResult(response);
// Record metrics
meterRegistry.counter("content_safety.checks",
"result", result.isSafe() ? "safe" : "unsafe",
"category", result.getCategory()).increment();
if (!result.isSafe()) {
log.warn("Unsafe content detected: category={}, confidence={}",
result.getCategory(), result.getConfidence());
}
return result;
}
private ContentSafetyResult parseSafetyResult(String response) {
try {
JsonNode node = objectMapper.readTree(response);
return ContentSafetyResult.builder()
.safe(node.path("safe").asBoolean(true))
.category(node.path("category").asText("unknown"))
.confidence(node.path("confidence").asDouble(0.5))
.reason(node.path("safe").asBoolean(true) ? null : "Content violates safety policy")
.build();
} catch (Exception e) {
log.error("Failed to parse safety result: {}", response);
return ContentSafetyResult.builder().safe(true).build(); // fail open when parsing fails
}
}
@Data
@Builder
public static class ContentSafetyResult {
private boolean safe;
private String category;
private double confidence;
private String reason;
}
}
9. Multimodal Cost Control Strategies
9.1 Computing image token consumption
GPT-4o's image token billing rules:
@Service
public class MultimodalCostCalculator {
/**
* Compute an image's token consumption (OpenAI GPT-4o rules).
* Low detail: flat 85 tokens.
* High detail: computed from the image dimensions.
*/
public int calculateImageTokens(int width, int height, String detail) {
if ("low".equals(detail)) {
return 85; // low detail is a flat 85 tokens
}
// High-detail calculation
// Step 1: scale so the longest side is at most 2048
if (Math.max(width, height) > 2048) {
double scale = 2048.0 / Math.max(width, height);
width = (int)(width * scale);
height = (int)(height * scale);
}
// Step 2: scale so the shortest side is at most 768
if (Math.min(width, height) > 768) {
double scale = 768.0 / Math.min(width, height);
width = (int)(width * scale);
height = (int)(height * scale);
}
// Step 3: count the 512x512 tiles
int tilesX = (int) Math.ceil((double) width / 512);
int tilesY = (int) Math.ceil((double) height / 512);
int tiles = tilesX * tilesY;
// each tile costs 170 tokens, plus an 85-token base
return tiles * 170 + 85;
}
/**
* Cost-reduction advice for an image.
*/
public ImageCostAdvice adviseCostReduction(int width, int height) {
int lowDetailTokens = 85;
int highDetailTokens = calculateImageTokens(width, height, "high");
// Cost difference
double costRatio = (double) highDetailTokens / lowDetailTokens;
if (costRatio > 5) {
return ImageCostAdvice.builder()
    .recommendation("Use low-detail mode to save " +
        String.format("%.0f%%", (1 - 1.0/costRatio) * 100) + " of the cost")
    .lowDetailTokens(lowDetailTokens)
    .highDetailTokens(highDetailTokens)
    .useCase("Low detail is usually sufficient for screenshot analysis and text recognition")
    .build();
}
return ImageCostAdvice.builder()
.recommendation("Image size is moderate; the high-detail cost is acceptable")
.lowDetailTokens(lowDetailTokens)
.highDetailTokens(highDetailTokens)
.build();
}
@Data
@Builder
public static class ImageCostAdvice {
private String recommendation;
private int lowDetailTokens;
private int highDetailTokens;
private String useCase;
}
}
9.2 Multimodal cost comparison
| Input type | Typical token cost | Estimated cost (GPT-4o) | Typical use |
|---|---|---|---|
| Plain text (500 characters) | ~200 tokens | $0.0005 | Most scenarios |
| Small image (512x512, low detail) | 85 tokens | $0.0002 | Icons / simple screenshots |
| Medium image (1024x1024, high detail) | 765 tokens | $0.002 | Detailed screenshot analysis |
| Tall document screenshot (e.g. 1792x4096, high detail) | 1445 tokens | $0.004 | Large document screenshots |
| Audio (1 minute, Whisper) | flat $0.006 | $0.006 | Speech transcription |
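The medium-image figure in the table can be reproduced from the tiling rule in 9.1; a quick standalone check of the same scaling steps:

```java
public class ImageTokenCheck {

    // GPT-4o high-detail image tokens: scale the longest side to <=2048,
    // then the shortest side to <=768, then count 512px tiles at 170 tokens
    // each plus an 85-token base.
    public static int highDetailTokens(int width, int height) {
        if (Math.max(width, height) > 2048) {
            double s = 2048.0 / Math.max(width, height);
            width = (int) (width * s);
            height = (int) (height * s);
        }
        if (Math.min(width, height) > 768) {
            double s = 768.0 / Math.min(width, height);
            width = (int) (width * s);
            height = (int) (height * s);
        }
        int tiles = ((width + 511) / 512) * ((height + 511) / 512);
        return tiles * 170 + 85;
    }

    public static void main(String[] args) {
        // 1024x1024 scales to 768x768 -> 2x2 tiles -> 4*170+85
        System.out.println(highDetailTokens(1024, 1024));
    }
}
```

Note that because the shortest side is capped at 768, a square image can never cost more than 765 tokens in high detail; only elongated images (long document screenshots) reach higher tile counts.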
Core strategies:
- Screenshot analysis defaults to `low` detail; upgrade to `high` only when the user asks for a closer look
- When batch-processing images, first run a quick check for text in each image; text-free images get simpler prompts
- For PDFs, try text extraction first (free); only scanned documents need the vision path
- Audio over 10 minutes can be split into segments and processed in parallel; total cost stays the same
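The audio point above boils down to a simple size-based fan-out. A real implementation should cut on silence or container-frame boundaries with ffmpeg so every segment remains a decodable file; this sketch only shows the splitting shape:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AudioSegmenter {

    // Split a byte payload into fixed-size segments so each stays under an
    // API size limit; segments can then be transcribed in parallel.
    public static List<byte[]> split(byte[] data, int maxSegmentBytes) {
        List<byte[]> segments = new ArrayList<>();
        for (int start = 0; start < data.length; start += maxSegmentBytes) {
            int end = Math.min(start + maxSegmentBytes, data.length);
            segments.add(Arrays.copyOfRange(data, start, end));
        }
        return segments;
    }
}
```

Each segment can then go through `transcribeBytes` from section 4 concurrently and the results concatenated in order.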
FAQ
Q1: Does Spring AI support every image format?
A: The supported set depends on the underlying model. OpenAI GPT-4o accepts JPEG, PNG, GIF, and WEBP, as does Claude 3. TIFF, BMP, and similar formats are not accepted and must be converted on the backend first. A good default is to normalize to JPEG (lossy, smaller files) or PNG (lossless, better for screenshots).
Q2: How accurate is image understanding?
A: For English text in screenshots, accuracy is close to 100%; Chinese text runs around 95-98%; handwriting around 85%; complex tables around 90%. The biggest challenge is low-resolution images (narrower than 300px), where it is best to prompt the user to upload a sharper screenshot.
Q3: How should oversized user uploads be handled?
A: Compress images on the frontend (Canvas API): scale to at most 2048px and keep files under 2MB, which both cuts token consumption and speeds up the upload. Keep a backend safety net that automatically compresses anything over the limit before sending it to the API.
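The backend safety net can be a simple `Graphics2D` downscale before the bytes go to the API (a sketch; the 2048px cap matches the model-side resize, so larger uploads only waste bandwidth and tokens):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ImageDownscaler {

    // Scale the image so its longest side is at most maxDim, preserving aspect ratio.
    public static BufferedImage capLongestSide(BufferedImage src, int maxDim) {
        int longest = Math.max(src.getWidth(), src.getHeight());
        if (longest <= maxDim) {
            return src; // already small enough
        }
        double scale = (double) maxDim / longest;
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }
}
```

Pair this with a JPEG re-encode at quality ~0.8 to land under the 2MB target for typical screenshots.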
Q4: Does Whisper support Chinese?
A: Yes, and Chinese recognition accuracy is quite high (roughly 5-10% WER). Specifying language: "zh" in the request improves accuracy and avoids confusion on mixed Chinese-English audio. Accented Mandarin degrades results slightly, but overall Whisper still beats most open-source alternatives.
Q5: How should multimodal features degrade?
A: Give every multimodal feature a text-only fallback path. For example: if image analysis fails, ask the user to describe the image; if speech recognition fails, show a text input box; if PDF parsing fails, ask the user to paste the text. Design the degradation at the product level rather than only catching exceptions at the technical level.
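Those degradation paths can be centralized in one wrapper so every multimodal call carries its own text fallback (a minimal sketch; `Supplier` stands in for any of the service calls in this article, and the hint strings are the product-level fallback messages):

```java
import java.util.function.Supplier;

public class MultimodalFallback {

    // Run the multimodal call; on any failure return a degradation message
    // that tells the user which text-only path to take instead.
    public static String callWithFallback(Supplier<String> multimodalCall,
                                          String fallbackHint) {
        try {
            return multimodalCall.get();
        } catch (RuntimeException e) {
            return fallbackHint;
        }
    }
}
```

Usage: `callWithFallback(() -> imageService.analyzeLocalImage(path, q).getAnalysis(), "Image analysis is unavailable right now; please describe what the screenshot shows.")`.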
