Multimodal AI Applications: Teaching Java Apps to Understand Text, Images, and Audio
Screenshots failed the moment they were uploaded: notes from the multimodal retrofit that lifted user satisfaction 35%
In November 2025, a support ticket left Zhang Hao, tech lead of a SaaS product, speechless for a full minute.
The complaint read: "I hit an error and sent a screenshot. The AI assistant said it couldn't see images and asked me to type out the error message. I typed it character by character and asked again. The AI said it couldn't access the screenshot. I asked a third time, and it started analyzing a problem that never happened on my machine. I ended up uninstalling your software."
Below that complaint were 47 "me too" replies.
Zhang Hao pulled the product's user behavior data: in support-ticket scenarios, 73% of users tried to upload a screenshot. But their AI assistant was text-only, and the frontend blocked every image with a single prompt: "Please describe your problem."
The retrofit took 3 weeks to build. In its first month in production:
- Ticket resolution rate rose from 41% to 67%
- User satisfaction (CSAT) rose from 3.2 to 4.3 (out of 5)
- Average turns to resolve a ticket dropped from 7.2 to 3.8
The difference: when a user uploads a screenshot, the AI reads the error message at a glance, with no need for the user to describe it over and over.
This article is the complete implementation of that retrofit.
1. The Multimodal AI Landscape: Your Options in 2026
1.1 Capability comparison of mainstream multimodal models
| Model | Image understanding | Document OCR | Audio input | Image generation | Video understanding | Price (per image) |
|---|---|---|---|---|---|---|
| GPT-4o | Excellent | Excellent | Excellent (Whisper) | DALL-E 3 | Frame sampling | $0.00765/image* |
| Claude 3.5 Sonnet | Excellent | Excellent | No native support | Not supported | Limited | $0.024/image* |
| Gemini 1.5 Pro | Excellent | Excellent | Supported | Imagen | Native | $0.00263/image* |
| Qwen-VL-Max | Good | Good | Not supported | Not supported | Limited | ¥0.008/image |

*Image prices assume 1024x1024 in low-detail mode; actual billing is token-based, and high-detail images consume more tokens.
1.2 Where Spring AI's multimodal support stands
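Spring AI routes all of the modalities in this article through one abstraction: a `UserMessage` can carry a list of `Media` objects alongside its text (the code in the following sections uses exactly this), image generation goes through `ImageModel`, and transcription through `OpenAiAudioTranscriptionModel`. Under the hood the OpenAI client inlines local image bytes into the request as a base64 "data URL". A JDK-only sketch of that encoding step (the class name is illustrative):

```java
import java.util.Base64;

public class ImagePayloadSketch {

    // Vision APIs accept local images inline as a base64 "data URL";
    // this is what the client builds from the bytes behind a Media object.
    public static String toDataUrl(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        byte[] fakePngBytes = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(toDataUrl(fakePngBytes, "image/png"));
    }
}
```

URL-based images skip this step entirely: the URL goes into the payload as-is and the provider fetches it, which is why the URL variant of image analysis later in this article needs no download.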
2. Image Understanding: Handling Image Input with Spring AI
2.1 Basic image understanding (local file + URL)
@Service
@RequiredArgsConstructor
@Slf4j
public class ImageUnderstandingService {
private final ChatClient chatClient;
/**
* Understand a local image file.
*/
public ImageAnalysisResult analyzeLocalImage(
Path imagePath, String question) throws IOException {
// Read the image file
byte[] imageBytes = Files.readAllBytes(imagePath);
String mimeType = detectMimeType(imagePath);
// Build the media object
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(mimeType),
new ByteArrayResource(imageBytes)
);
// Build the message carrying the image
UserMessage userMessage = new UserMessage(
question,
List.of(imageMedia)
);
long startTime = System.currentTimeMillis();
ChatResponse response = chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(1024)
.build())
.call()
.chatResponse();
String analysis = response.getResult().getOutput().getText();
long latencyMs = System.currentTimeMillis() - startTime;
return ImageAnalysisResult.builder()
.analysis(analysis)
.imageSize(imageBytes.length)
.mimeType(mimeType)
.latencyMs(latencyMs)
.tokensUsed(response.getMetadata().getUsage().getTotalTokens())
.build();
}
/**
* Understand an image by URL (passed straight to the model, no download needed).
*/
public ImageAnalysisResult analyzeImageFromUrl(String imageUrl, String question) {
// For URL images, pass the URL by reference instead of downloading
// (UrlResource's constructor throws a checked MalformedURLException)
Media imageMedia;
try {
    imageMedia = new Media(MimeTypeUtils.IMAGE_JPEG, new UrlResource(imageUrl));
} catch (MalformedURLException e) {
    throw new IllegalArgumentException("Invalid image URL: " + imageUrl, e);
}
UserMessage userMessage = new UserMessage(question, List.of(imageMedia));
String analysis = chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.build())
.call()
.content();
return ImageAnalysisResult.builder()
.analysis(analysis)
.imageUrl(imageUrl)
.build();
}
/**
* Batch image analysis (multiple images, one request).
*/
public String analyzeMultipleImages(List<Path> imagePaths, String instruction)
throws IOException {
List<Media> mediaList = new ArrayList<>();
for (Path imagePath : imagePaths) {
byte[] imageBytes = Files.readAllBytes(imagePath);
String mimeType = detectMimeType(imagePath);
mediaList.add(new Media(
MimeTypeUtils.parseMimeType(mimeType),
new ByteArrayResource(imageBytes)
));
}
// Note: GPT-4o accepts up to 20 images per request, Claude 3 up to 5
UserMessage userMessage = new UserMessage(instruction, mediaList);
return chatClient.prompt()
.messages(userMessage)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(2048)
.build())
.call()
.content();
}
/**
* Technical screenshot analysis (prompt tuned for engineers' screenshots).
*/
public TechScreenshotAnalysis analyzeTechScreenshot(
MultipartFile screenshotFile, String userContext) throws IOException {
byte[] imageBytes = screenshotFile.getBytes();
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(screenshotFile.getContentType())),
new ByteArrayResource(imageBytes)
);
String systemPrompt = """
You are a technical support expert skilled at analyzing problems in software screenshots.
When analyzing a screenshot:
1. Identify error messages, stack traces, or warnings in the screenshot
2. Understand the UI state and operation context
3. Extract key technical details (error codes, version numbers, URLs, etc.)
4. Give likely causes and solution steps
Output JSON:
{
"screenshot_type": "error_dialog/log_output/ui_state/code_editor",
"detected_issues": ["issue 1", "issue 2"],
"key_info": {"error_code": "xxx", "version": "xxx"},
"root_cause": "likely cause",
"solution_steps": ["step 1", "step 2"],
"confidence": 0.85
}
""";
String userMessage = String.format(
"User note: %s\n\nPlease analyze this screenshot:",
userContext != null ? userContext : "Please help me analyze this problem"
);
UserMessage message = new UserMessage(userMessage, List.of(imageMedia));
String response = chatClient.prompt()
.system(systemPrompt)
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.temperature(0.1)
.build())
.call()
.content();
return parseTechScreenshotAnalysis(response);
}
private String detectMimeType(Path path) {
String filename = path.getFileName().toString().toLowerCase();
if (filename.endsWith(".jpg") || filename.endsWith(".jpeg")) return "image/jpeg";
if (filename.endsWith(".png")) return "image/png";
if (filename.endsWith(".gif")) return "image/gif";
if (filename.endsWith(".webp")) return "image/webp";
return "image/jpeg"; // default
}
@Data
@Builder
public static class ImageAnalysisResult {
private String analysis;
private String imageUrl;
private long imageSize;
private String mimeType;
private long latencyMs;
private long tokensUsed;
}
@Data
@Builder
public static class TechScreenshotAnalysis {
private String screenshotType;
private List<String> detectedIssues;
private Map<String, String> keyInfo;
private String rootCause;
private List<String> solutionSteps;
private double confidence;
}
}
2.2 Image Upload Controller (Spring MVC Integration)
@RestController
@RequestMapping("/api/multimodal")
@RequiredArgsConstructor
@Slf4j
public class MultimodalController {
private final ImageUnderstandingService imageService;
private final AudioTranscriptionService audioService;
private final DocumentUnderstandingService documentService;
private final ContentSafetyService safetyService;
@PostMapping("/image/analyze")
public ResponseEntity<ImageAnalysisResponse> analyzeImage(
@RequestParam("file") MultipartFile file,
@RequestParam(value = "question", defaultValue = "Please describe the content of this image")
String question,
@RequestParam(value = "userId") String userId) {
// File size check (images capped at 20MB)
if (file.getSize() > 20 * 1024 * 1024) {
return ResponseEntity.badRequest()
.body(ImageAnalysisResponse.error("Image must not exceed 20MB"));
}
// File type check
String contentType = file.getContentType();
if (!isImageContentType(contentType)) {
return ResponseEntity.badRequest()
.body(ImageAnalysisResponse.error("Unsupported file type: " + contentType));
}
try {
// Safety check (filter harmful images)
ContentSafetyResult safety = safetyService.checkImage(file.getBytes());
if (!safety.isSafe()) {
log.warn("Unsafe image detected from user: {}, reason: {}",
userId, safety.getReason());
return ResponseEntity.status(HttpStatus.UNPROCESSABLE_ENTITY)
.body(ImageAnalysisResponse.error("Image content violates usage policy"));
}
// Analyze the image
ImageUnderstandingService.ImageAnalysisResult result;
if (isTechScreenshot(question)) {
ImageUnderstandingService.TechScreenshotAnalysis analysis =
imageService.analyzeTechScreenshot(file, question);
return ResponseEntity.ok(ImageAnalysisResponse.fromTechAnalysis(analysis));
} else {
result = imageService.analyzeLocalImage(
saveToTempFile(file), question);
return ResponseEntity.ok(ImageAnalysisResponse.fromResult(result));
}
} catch (IOException e) {
log.error("Image analysis failed for user: {}", userId, e);
return ResponseEntity.internalServerError()
.body(ImageAnalysisResponse.error("Image processing failed, please retry"));
}
}
@PostMapping("/image/compare")
public ResponseEntity<String> compareImages(
@RequestParam("before") MultipartFile beforeImage,
@RequestParam("after") MultipartFile afterImage,
@RequestParam(value = "context", defaultValue = "Please compare the differences between these two images")
String context) throws IOException {
String result = imageService.analyzeMultipleImages(
List.of(saveToTempFile(beforeImage), saveToTempFile(afterImage)),
context + "\nFocus on the concrete differences between before (first image) and after (second image)."
);
return ResponseEntity.ok(result);
}
private boolean isImageContentType(String contentType) {
return contentType != null &&
(contentType.startsWith("image/jpeg") ||
contentType.startsWith("image/png") ||
contentType.startsWith("image/gif") ||
contentType.startsWith("image/webp"));
}
// Keyword heuristic; the Chinese terms match error-related words in users' questions
private boolean isTechScreenshot(String question) {
String lower = question.toLowerCase();
return lower.contains("错误") || lower.contains("error") ||
lower.contains("bug") || lower.contains("异常") ||
lower.contains("报错") || lower.contains("日志");
}
private Path saveToTempFile(MultipartFile file) throws IOException {
Path tempFile = Files.createTempFile("multimodal_",
getExtension(file.getOriginalFilename()));
file.transferTo(tempFile);
tempFile.toFile().deleteOnExit();
return tempFile;
}
private String getExtension(String filename) {
if (filename == null) return ".jpg";
int dotIndex = filename.lastIndexOf('.');
return dotIndex >= 0 ? filename.substring(dotIndex) : ".jpg";
}
}
3. Document Understanding: OCR + Comprehension for PDF and Image Documents
3.1 Intelligent PDF parsing
@Service
@RequiredArgsConstructor
@Slf4j
public class DocumentUnderstandingService {
private final ChatClient chatClient;
private final PDFBoxDocumentLoader pdfLoader;
/**
* Intelligently parse a PDF (handles both text-based and scanned PDFs).
*/
public DocumentAnalysisResult analyzePdf(Path pdfPath, DocumentAnalysisConfig config)
throws IOException {
// Try text extraction first (works for digitally created PDFs)
String extractedText = extractTextFromPdf(pdfPath);
if (extractedText.length() > 100) {
// Text-based PDF: analyze the extracted text directly
return analyzeTextDocument(extractedText, config);
} else {
// Scanned PDF: convert pages to images and use vision
return analyzeScannedPdf(pdfPath, config);
}
}
private String extractTextFromPdf(Path pdfPath) throws IOException {
// PDFBox 2.x API (PDFBox 3.x uses Loader.loadPDF instead)
try (PDDocument document = PDDocument.load(pdfPath.toFile())) {
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(document);
}
}
private DocumentAnalysisResult analyzeTextDocument(
String text, DocumentAnalysisConfig config) {
// Chunk the document if it is too long
List<String> chunks = chunkText(text, 4000); // ~4000 tokens per chunk
if (chunks.size() == 1) {
// Short document: analyze in one shot
String analysis = chatClient.prompt()
.system(config.getAnalysisPrompt())
.user("请分析以下文档:\n\n" + chunks.get(0))
.call()
.content();
return DocumentAnalysisResult.builder()
.content(analysis)
.pageCount(1)
.processingMethod("TEXT_EXTRACTION")
.build();
} else {
// Long document: summarize each chunk first, then synthesize
return analyzeWithMapReduce(chunks, config);
}
}
private DocumentAnalysisResult analyzeWithMapReduce(
List<String> chunks, DocumentAnalysisConfig config) {
// Map phase: summarize each chunk independently
List<String> chunkSummaries = chunks.parallelStream()
.map(chunk -> {
return chatClient.prompt()
.system("Extract the key information from this passage, keeping important figures and conclusions")
.user(chunk)
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini") // use mini for chunk summaries to cut cost
.maxTokens(500)
.build())
.call()
.content();
})
.collect(Collectors.toList());
// Reduce phase: synthesize all the summaries
String combinedSummaries = String.join("\n\n---\n\n", chunkSummaries);
String finalAnalysis = chatClient.prompt()
.system(config.getAnalysisPrompt())
.user("Below are summaries of each part of the document; please synthesize an overall analysis:\n\n" + combinedSummaries)
.options(OpenAiChatOptions.builder()
.model("gpt-4o") // strong model for the final synthesis
.build())
.call()
.content();
return DocumentAnalysisResult.builder()
.content(finalAnalysis)
.chunkCount(chunks.size())
.processingMethod("MAP_REDUCE")
.build();
}
private DocumentAnalysisResult analyzeScannedPdf(
Path pdfPath, DocumentAnalysisConfig config) throws IOException {
// Convert PDF pages to images
List<byte[]> pageImages = convertPdfToImages(pdfPath);
if (pageImages.isEmpty()) {
return DocumentAnalysisResult.builder()
.content("Unable to parse this PDF document")
.processingMethod("FAILED")
.build();
}
// Cap at 10 pages (cost control)
List<byte[]> pagesToProcess = pageImages.subList(
0, Math.min(10, pageImages.size()));
List<Media> pageMedia = pagesToProcess.stream()
.map(bytes -> new Media(MimeTypeUtils.IMAGE_PNG, new ByteArrayResource(bytes)))
.collect(Collectors.toList());
UserMessage message = new UserMessage(
config.getAnalysisPrompt() + "\n\nThese are the document's page images; please extract and analyze the content:",
pageMedia
);
String analysis = chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(4096)
.build())
.call()
.content();
return DocumentAnalysisResult.builder()
.content(analysis)
.pageCount(pageImages.size())
.processedPageCount(pagesToProcess.size())
.processingMethod("OCR_VISION")
.build();
}
private List<byte[]> convertPdfToImages(Path pdfPath) throws IOException {
List<byte[]> images = new ArrayList<>();
try (PDDocument document = PDDocument.load(pdfPath.toFile())) {
PDFRenderer renderer = new PDFRenderer(document);
for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
BufferedImage image = renderer.renderImageWithDPI(pageIndex, 150); // 150 DPI balances quality and size
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(image, "PNG", baos);
images.add(baos.toByteArray());
}
}
return images;
}
private List<String> chunkText(String text, int maxTokens) {
// Naive character-count chunking; production code should count tokens (e.g. with a tiktoken port)
int maxChars = maxTokens * 4; // rough estimate: 1 token ≈ 4 characters (English; Chinese runs closer to 1-2)
List<String> chunks = new ArrayList<>();
int start = 0;
while (start < text.length()) {
int end = Math.min(start + maxChars, text.length());
// Try to split on a paragraph boundary
if (end < text.length()) {
int lastNewline = text.lastIndexOf("\n\n", end);
if (lastNewline > start + maxChars / 2) {
end = lastNewline;
}
}
chunks.add(text.substring(start, end));
start = end;
}
return chunks;
}
@Data
@Builder
public static class DocumentAnalysisConfig {
private String analysisPrompt;
private boolean extractStructuredData;
private List<String> fieldsToExtract;
public static DocumentAnalysisConfig defaultConfig() {
return DocumentAnalysisConfig.builder()
.analysisPrompt("Analyze the document, extract the key information, and provide a summary")
.build();
}
}
@Data
@Builder
public static class DocumentAnalysisResult {
private String content;
private int pageCount;
private int processedPageCount;
private int chunkCount;
private String processingMethod;
private Map<String, Object> extractedFields;
}
}
4. Speech to Text: Whisper API Integration
4.1 Audio transcription with Spring AI
@Service
@RequiredArgsConstructor
@Slf4j
public class AudioTranscriptionService {
private final OpenAiAudioTranscriptionModel transcriptionModel;
private final ChatClient chatClient;
/**
* Transcribe an audio file (mp3/mp4/wav/m4a/webm, among others).
*/
public TranscriptionResult transcribeAudio(
MultipartFile audioFile, String language) throws IOException {
// File size check (the Whisper API caps uploads at 25MB)
if (audioFile.getSize() > 25 * 1024 * 1024) {
throw new IllegalArgumentException("Audio file must not exceed 25MB");
}
long startTime = System.currentTimeMillis();
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
.model("whisper-1")
.language(language) // may be null (auto-detect)
.responseFormat(OpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
.temperature(0.0f) // low temperature for accuracy
.build();
AudioTranscriptionResponse response = transcriptionModel.call(
new AudioTranscriptionPrompt(
new ByteArrayResource(audioFile.getBytes()) {
@Override
public String getFilename() {
return audioFile.getOriginalFilename();
}
},
options
)
);
String transcribedText = response.getResult().getOutput();
long latencyMs = System.currentTimeMillis() - startTime;
log.info("Audio transcription completed: {} chars in {}ms",
transcribedText.length(), latencyMs);
return TranscriptionResult.builder()
.text(transcribedText)
.language(extractLanguage(response))
.durationSeconds(extractDuration(response))
.latencyMs(latencyMs)
.build();
}
/**
* Real-time voice capture (suited to long conversations).
* Buffers the stream and transcribes in segments of up to 30 seconds.
*/
public Flux<String> transcribeAudioStream(Flux<byte[]> audioChunks, String language) {
return audioChunks
.buffer(Duration.ofSeconds(30)) // process every 30 seconds
.flatMap(chunks -> {
byte[] combined = combineChunks(chunks);
try {
TranscriptionResult result = transcribeBytes(combined, language);
return Flux.just(result.getText());
} catch (Exception e) {
log.error("Stream transcription failed", e);
return Flux.empty();
}
});
}
/**
* Combined voice + image input (e.g. a recording describing a problem shown in a picture).
*/
public String processVoiceWithImage(
MultipartFile audioFile, MultipartFile imageFile,
String language) throws IOException {
// Transcribe the audio first
TranscriptionResult transcription = transcribeAudio(audioFile, language);
log.info("Voice transcription: {}", transcription.getText());
// Then analyze together with the image
Media imageMedia = new Media(
MimeTypeUtils.parseMimeType(Objects.requireNonNull(imageFile.getContentType())),
new ByteArrayResource(imageFile.getBytes())
);
String combinedQuestion = String.format(
"The user's spoken question (transcribed): \"%s\"\n\nAnswer the question above based on the image content.",
transcription.getText()
);
UserMessage message = new UserMessage(combinedQuestion, List.of(imageMedia));
return chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.build())
.call()
.content();
}
private TranscriptionResult transcribeBytes(byte[] audioBytes, String language)
throws IOException {
OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
.model("whisper-1")
.language(language)
.build();
AudioTranscriptionResponse response = transcriptionModel.call(
new AudioTranscriptionPrompt(
new ByteArrayResource(audioBytes) {
@Override
public String getFilename() { return "chunk.webm"; }
},
options
)
);
return TranscriptionResult.builder()
.text(response.getResult().getOutput())
.build();
}
private byte[] combineChunks(List<byte[]> chunks) {
int totalSize = chunks.stream().mapToInt(c -> c.length).sum();
ByteBuffer buffer = ByteBuffer.allocate(totalSize);
chunks.forEach(buffer::put);
return buffer.array();
}
private String extractLanguage(AudioTranscriptionResponse response) {
try {
return response.getResult().getMetadata().get("language", String.class);
} catch (Exception e) {
return "unknown";
}
}
private Double extractDuration(AudioTranscriptionResponse response) {
try {
return response.getResult().getMetadata().get("duration", Double.class);
} catch (Exception e) {
return null;
}
}
@Data
@Builder
public static class TranscriptionResult {
private String text;
private String language;
private Double durationSeconds;
private long latencyMs;
}
}
5. Image Generation: DALL-E Integration
5.1 Image generation service
@Service
@RequiredArgsConstructor
@Slf4j
public class ImageGenerationService {
private final ImageModel imageModel;
private final ChatClient chatClient;
/**
* Generate an image from a text description (DALL-E 3).
*/
public ImageGenerationResult generateImage(
String prompt, ImageGenerationConfig config) {
// Optionally auto-optimize the prompt
String optimizedPrompt = config.isAutoOptimize() ?
optimizeImagePrompt(prompt) : prompt;
long startTime = System.currentTimeMillis();
ImageOptions options = OpenAiImageOptions.builder()
.model("dall-e-3")
.quality(config.getQuality()) // standard/hd
.size(config.getSize()) // 1024x1024/1792x1024/1024x1792
.style(config.getStyle()) // vivid/natural
.n(1) // DALL-E 3 generates one image per request
.build();
ImageResponse response = imageModel.call(
new ImagePrompt(optimizedPrompt, options));
Image image = response.getResult().getOutput();
long latencyMs = System.currentTimeMillis() - startTime;
return ImageGenerationResult.builder()
.imageUrl(image.getUrl())
.revisedPrompt(image.getRevisedPrompt()) // DALL-E may rewrite the prompt
.originalPrompt(prompt)
.optimizedPrompt(optimizedPrompt)
.latencyMs(latencyMs)
.build();
}
/**
* Flowchart/architecture diagram generation (turn a text description into a visual).
* Note: this produces an illustrative image; for precise diagrams use Mermaid/PlantUML.
*/
public ImageGenerationResult generateDiagram(String description) {
String diagramPrompt = String.format("""
Create a clean, professional technical diagram showing: %s
Style: minimalist, white background, clear labels,
professional colors (blue, gray, white),
arrows showing data flow, boxes for components.
Make it look like a software architecture diagram.
""", description);
return generateImage(diagramPrompt, ImageGenerationConfig.builder()
.quality("standard")
.size("1792x1024") // landscape suits architecture diagrams
.style("natural")
.autoOptimize(false)
.build());
}
/**
* Use AI to optimize the image-generation prompt.
*/
private String optimizeImagePrompt(String userPrompt) {
return chatClient.prompt()
.system("""
You are a DALL-E 3 prompt engineering expert.
Rewrite the user's description into a better DALL-E prompt.
Guidelines:
1. Add concrete style cues (photorealistic/digital art/etc.)
2. Specify lighting, composition, and detail
3. Avoid content DALL-E does not support
4. Return only the optimized prompt, with no explanation
""")
.user(userPrompt)
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini")
.maxTokens(200)
.build())
.call()
.content();
}
@Data
@Builder
public static class ImageGenerationConfig {
private String quality;
private String size;
private String style;
private boolean autoOptimize;
public static ImageGenerationConfig defaultConfig() {
return ImageGenerationConfig.builder()
.quality("standard")
.size("1024x1024")
.style("vivid")
.autoOptimize(false)
.build();
}
}
@Data
@Builder
public static class ImageGenerationResult {
private String imageUrl;
private String revisedPrompt;
private String originalPrompt;
private String optimizedPrompt;
private long latencyMs;
}
}
6. Multimodal RAG: A Mixed Text-and-Image Knowledge Base
6.1 Embedding strategy for mixed text and image content
@Service
@RequiredArgsConstructor
@Slf4j
public class MultimodalRagService {
private final EmbeddingModel textEmbeddingModel;
private final VectorStore vectorStore;
private final ChatClient chatClient;
private final ImageUnderstandingService imageService;
/**
* Add mixed text/image content to the knowledge base.
* Strategy: describe the image with a vision model first, then embed the description as text.
*/
public void addImageToKnowledgeBase(
Path imagePath, String category, Map<String, String> metadata)
throws IOException {
// Step 1: generate a detailed textual description of the image with GPT-4o
String imageDescription = chatClient.prompt()
.messages(new UserMessage(
"Describe this image in detail, covering its content, any text, chart data, and technical details. " +
"Produce a thorough textual description for knowledge-base retrieval.",
List.of(new Media(
MimeTypeUtils.parseMimeType(detectMimeType(imagePath)),
new ByteArrayResource(Files.readAllBytes(imagePath))
))
))
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(1000)
.build())
.call()
.content();
// Step 2: create a document storing the image path together with its description
Map<String, Object> docMetadata = new HashMap<>(metadata);
docMetadata.put("content_type", "image");
docMetadata.put("image_path", imagePath.toString());
docMetadata.put("category", category);
Document document = new Document(imageDescription, docMetadata);
// Step 3: embed as text (standard text embedding model)
vectorStore.add(List.of(document));
log.info("Image {} added to knowledge base with description: {}...",
imagePath.getFileName(),
imageDescription.substring(0, Math.min(100, imageDescription.length())));
}
/**
* Multimodal RAG query: search the knowledge base by text, image, or both.
*/
public MultimodalRagResponse query(
String textQuery, Optional<MultipartFile> imageQuery) throws IOException {
String searchQuery = textQuery;
// If an image is supplied, extract its content first to enrich the search query
if (imageQuery.isPresent()) {
String imageContent = chatClient.prompt()
.messages(new UserMessage(
"Extract the key information, text, and main concepts from this image for knowledge-base retrieval",
List.of(new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(imageQuery.get().getContentType())),
new ByteArrayResource(imageQuery.get().getBytes())
))
))
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(200)
.build())
.call()
.content();
searchQuery = textQuery + "\nImage content: " + imageContent;
}
// Search the vector store
List<Document> relevantDocs = vectorStore.similaritySearch(
SearchRequest.query(searchQuery).withTopK(5));
if (relevantDocs.isEmpty()) {
return MultimodalRagResponse.builder()
.answer("No relevant knowledge-base content found")
.sources(List.of())
.build();
}
// Build the RAG context
String context = buildRagContext(relevantDocs);
// Final answer (pass the image along to the LLM if present)
String answer;
if (imageQuery.isPresent()) {
UserMessage finalMessage = new UserMessage(
String.format("Answer the user's question based on the following knowledge-base content:\n\n%s\n\nUser question: %s",
context, textQuery),
List.of(new Media(
MimeTypeUtils.parseMimeType(
Objects.requireNonNull(imageQuery.get().getContentType())),
new ByteArrayResource(imageQuery.get().getBytes())
))
);
answer = chatClient.prompt().messages(finalMessage).call().content();
} else {
answer = chatClient.prompt()
.system("Answer based on the provided knowledge-base content; say so if it contains no relevant information")
.user(String.format("Knowledge base content:\n%s\n\nQuestion: %s", context, textQuery))
.call()
.content();
}
return MultimodalRagResponse.builder()
.answer(answer)
.sources(relevantDocs.stream()
.map(d -> DocumentSource.builder()
.contentType((String) d.getMetadata().get("content_type"))
.imagePath((String) d.getMetadata().get("image_path"))
.category((String) d.getMetadata().get("category"))
.relevanceScore(d.getScore())
.build())
.collect(Collectors.toList()))
.build();
}
private String buildRagContext(List<Document> docs) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < docs.size(); i++) {
sb.append(String.format("\n--- Reference %d (%s) ---\n%s\n",
    i + 1,
    docs.get(i).getMetadata().getOrDefault("category", "uncategorized"),
docs.get(i).getText()
));
}
return sb.toString();
}
private String detectMimeType(Path path) {
String filename = path.getFileName().toString().toLowerCase();
if (filename.endsWith(".png")) return "image/png";
if (filename.endsWith(".gif")) return "image/gif";
return "image/jpeg";
}
@Data
@Builder
public static class MultimodalRagResponse {
private String answer;
private List<DocumentSource> sources;
}
@Data
@Builder
public static class DocumentSource {
private String contentType;
private String imagePath;
private String category;
private Float relevanceScore;
}
}
7. Video Understanding: Frame Extraction + Multi-frame Analysis
7.1 Video analysis service
@Service
@RequiredArgsConstructor
@Slf4j
public class VideoUnderstandingService {
private final ChatClient chatClient;
/**
* Video analysis: extract key frames and analyze them together.
* Note: GPT-4o can take multiple frames in one request, but every frame consumes tokens.
*/
public VideoAnalysisResult analyzeVideo(
Path videoPath, VideoAnalysisConfig config) throws IOException {
// Extract key frames with FFmpeg (one frame every N seconds)
List<byte[]> keyFrames = extractKeyFrames(videoPath, config.getFrameIntervalSeconds());
if (keyFrames.isEmpty()) {
return VideoAnalysisResult.builder()
.summary("Unable to extract video frames")
.build();
}
// Cap the number of frames analyzed (limit token spend)
List<byte[]> selectedFrames = keyFrames.subList(
0, Math.min(config.getMaxFrames(), keyFrames.size()));
log.info("Analyzing video with {} frames (total {} frames extracted)",
selectedFrames.size(), keyFrames.size());
// Build the multi-frame analysis message
List<Media> frameMedia = selectedFrames.stream()
.map(frameBytes -> new Media(MimeTypeUtils.IMAGE_JPEG,
new ByteArrayResource(frameBytes)))
.collect(Collectors.toList());
String instructionWithContext = String.format(
"""
These are %d key frames from the video (in chronological order), one every %d seconds.
%s
Analyze the content of these frames, paying attention to changes and progression over time.
""",
selectedFrames.size(),
config.getFrameIntervalSeconds(),
config.getAnalysisInstruction()
);
UserMessage message = new UserMessage(instructionWithContext, frameMedia);
String analysis = chatClient.prompt()
.messages(message)
.options(OpenAiChatOptions.builder()
.model("gpt-4o")
.maxTokens(2048)
.build())
.call()
.content();
return VideoAnalysisResult.builder()
.summary(analysis)
.totalFramesExtracted(keyFrames.size())
.framesAnalyzed(selectedFrames.size())
.videoPath(videoPath.toString())
.build();
}
/**
* Extract key frames with FFmpeg.
* Requires FFmpeg to be installed on the host.
*/
private List<byte[]> extractKeyFrames(Path videoPath, int intervalSeconds)
throws IOException {
List<byte[]> frames = new ArrayList<>();
Path tempDir = Files.createTempDirectory("video_frames_");
try {
// Invoke FFmpeg to extract frames
ProcessBuilder pb = new ProcessBuilder(
"ffmpeg",
"-i", videoPath.toString(),
"-vf", String.format("fps=1/%d,scale=1280:-1", intervalSeconds),
"-q:v", "3", // JPEG quality
tempDir.toString() + "/frame_%04d.jpg"
);
pb.redirectErrorStream(true);
Process process = pb.start();
int exitCode = process.waitFor();
if (exitCode != 0) {
log.error("FFmpeg failed with exit code: {}", exitCode);
return frames;
}
// Read the extracted frames (try-with-resources closes the directory stream)
try (var framePaths = Files.list(tempDir)) {
    framePaths.sorted().forEach(framePath -> {
        try {
            frames.add(Files.readAllBytes(framePath));
        } catch (IOException e) {
            log.warn("Failed to read frame: {}", framePath);
        }
    });
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} finally {
// Clean up temp files
FileUtils.deleteDirectory(tempDir.toFile());
}
return frames;
}
@Data
@Builder
public static class VideoAnalysisConfig {
private int frameIntervalSeconds; // interval between frames (seconds)
private int maxFrames; // maximum frames to analyze
private String analysisInstruction;
public static VideoAnalysisConfig defaultConfig() {
return VideoAnalysisConfig.builder()
.frameIntervalSeconds(5)
.maxFrames(20)
.analysisInstruction("Describe the video's main content and how it develops")
.build();
}
}
@Data
@Builder
public static class VideoAnalysisResult {
private String summary;
private int totalFramesExtracted;
private int framesAnalyzed;
private String videoPath;
}
}
8. Content Safety: Detecting Harmful Image Content
8.1 Content safety service
@Service
@RequiredArgsConstructor
@Slf4j
public class ContentSafetyService {
private final ChatClient chatClient;
private final MeterRegistry meterRegistry;
private final ObjectMapper objectMapper; // needed by parseSafetyResult below
/**
* Check whether the image content is safe.
* Uses the model's content understanding for classification.
*/
public ContentSafetyResult checkImage(byte[] imageBytes) {
Media imageMedia = new Media(
MimeTypeUtils.IMAGE_JPEG,
new ByteArrayResource(imageBytes)
);
String response = chatClient.prompt()
.system("""
You are a content-safety review system. Check whether the image contains any of the following unsafe content:
1. Explicit violence or gore
2. Pornographic or sexually suggestive content
3. Hate symbols or hateful content
4. Dangerous activities or illegal behavior
Note: technical screenshots, documents, and ordinary everyday photos are all safe.
Flag as unsafe only content that clearly violates the rules.
Output JSON: {"safe": true/false, "category": "safe/violence/sexual/hate/other", "confidence": 0.9}
""")
.messages(new UserMessage("Please review this image", List.of(imageMedia)))
.options(OpenAiChatOptions.builder()
.model("gpt-4o-mini") // mini keeps moderation cheap
.temperature(0.0)
.maxTokens(100)
.build())
.call()
.content();
ContentSafetyResult result = parseSafetyResult(response);
// Record metrics
meterRegistry.counter("content_safety.checks",
"result", result.isSafe() ? "safe" : "unsafe",
"category", result.getCategory()).increment();
if (!result.isSafe()) {
log.warn("Unsafe content detected: category={}, confidence={}",
result.getCategory(), result.getConfidence());
}
return result;
}
private ContentSafetyResult parseSafetyResult(String response) {
try {
JsonNode node = objectMapper.readTree(response);
return ContentSafetyResult.builder()
.safe(node.path("safe").asBoolean(true))
.category(node.path("category").asText("unknown"))
.confidence(node.path("confidence").asDouble(0.5))
.reason(node.path("safe").asBoolean(true) ? null : "Content violates safety policy")
.build();
} catch (Exception e) {
log.error("Failed to parse safety result: {}", response);
return ContentSafetyResult.builder().safe(true).build(); // fail open when parsing fails
}
}
@Data
@Builder
public static class ContentSafetyResult {
private boolean safe;
private String category;
private double confidence;
private String reason;
}
}
9. Multimodal Cost Control Strategies
9.1 Computing image token consumption
GPT-4o's image token billing rules:
@Service
public class MultimodalCostCalculator {
/**
* Compute an image's token consumption (OpenAI GPT-4o rules).
* Low detail: flat 85 tokens.
* High detail: computed from the image dimensions.
*/
public int calculateImageTokens(int width, int height, String detail) {
if ("low".equals(detail)) {
return 85; // low detail is a flat 85 tokens
}
// High-detail calculation
// Step 1: scale so the longest side is at most 2048
if (Math.max(width, height) > 2048) {
double scale = 2048.0 / Math.max(width, height);
width = (int)(width * scale);
height = (int)(height * scale);
}
// Step 2: scale so the shortest side is at most 768
if (Math.min(width, height) > 768) {
double scale = 768.0 / Math.min(width, height);
width = (int)(width * scale);
height = (int)(height * scale);
}
// Step 3: count the 512x512 tiles
int tilesX = (int) Math.ceil((double) width / 512);
int tilesY = (int) Math.ceil((double) height / 512);
int tiles = tilesX * tilesY;
// each tile costs 170 tokens, plus an 85-token base
return tiles * 170 + 85;
}
/**
* Cost-reduction advice for an image.
*/
public ImageCostAdvice adviseCostReduction(int width, int height) {
int lowDetailTokens = 85;
int highDetailTokens = calculateImageTokens(width, height, "high");
// Cost difference
double costRatio = (double) highDetailTokens / lowDetailTokens;
if (costRatio > 5) {
return ImageCostAdvice.builder()
    .recommendation("Use low-detail mode to save " +
        String.format("%.0f%%", (1 - 1.0/costRatio) * 100) + " of the cost")
    .lowDetailTokens(lowDetailTokens)
    .highDetailTokens(highDetailTokens)
    .useCase("Low detail is usually sufficient for screenshot analysis and text recognition")
    .build();
}
return ImageCostAdvice.builder()
.recommendation("Image size is moderate; the high-detail cost is acceptable")
.lowDetailTokens(lowDetailTokens)
.highDetailTokens(highDetailTokens)
.build();
}
@Data
@Builder
public static class ImageCostAdvice {
private String recommendation;
private int lowDetailTokens;
private int highDetailTokens;
private String useCase;
}
}
9.2 Multimodal cost comparison
| Input type | Typical token cost | Estimated cost (GPT-4o) | Typical use |
|---|---|---|---|
| Plain text (500 characters) | ~200 tokens | $0.0005 | Most scenarios |
| Small image (512x512, low detail) | 85 tokens | $0.0002 | Icons / simple screenshots |
| Medium image (1024x1024, high detail) | 765 tokens | $0.002 | Detailed screenshot analysis |
| Tall document screenshot (e.g. 1792x4096, high detail) | 1445 tokens | $0.004 | Large document screenshots |
| Audio (1 minute, Whisper) | flat $0.006 | $0.006 | Speech transcription |
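The medium-image figure in the table can be reproduced from the tiling rule in 9.1; a quick standalone check of the same scaling steps:

```java
public class ImageTokenCheck {

    // GPT-4o high-detail image tokens: scale the longest side to <=2048,
    // then the shortest side to <=768, then count 512px tiles at 170 tokens
    // each plus an 85-token base.
    public static int highDetailTokens(int width, int height) {
        if (Math.max(width, height) > 2048) {
            double s = 2048.0 / Math.max(width, height);
            width = (int) (width * s);
            height = (int) (height * s);
        }
        if (Math.min(width, height) > 768) {
            double s = 768.0 / Math.min(width, height);
            width = (int) (width * s);
            height = (int) (height * s);
        }
        int tiles = ((width + 511) / 512) * ((height + 511) / 512);
        return tiles * 170 + 85;
    }

    public static void main(String[] args) {
        // 1024x1024 scales to 768x768 -> 2x2 tiles -> 4*170+85
        System.out.println(highDetailTokens(1024, 1024));
    }
}
```

Note that because the shortest side is capped at 768, a square image can never cost more than 765 tokens in high detail; only elongated images (long document screenshots) reach higher tile counts.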
Core strategies:
- Screenshot analysis defaults to `low` detail; upgrade to `high` only when the user asks for a closer look
- When batch-processing images, first run a quick check for text in each image; text-free images get simpler prompts
- For PDFs, try text extraction first (free); only scanned documents need the vision path
- Audio over 10 minutes can be split into segments and processed in parallel; total cost stays the same
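The audio point above boils down to a simple size-based fan-out. A real implementation should cut on silence or container-frame boundaries with ffmpeg so every segment remains a decodable file; this sketch only shows the splitting shape:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AudioSegmenter {

    // Split a byte payload into fixed-size segments so each stays under an
    // API size limit; segments can then be transcribed in parallel.
    public static List<byte[]> split(byte[] data, int maxSegmentBytes) {
        List<byte[]> segments = new ArrayList<>();
        for (int start = 0; start < data.length; start += maxSegmentBytes) {
            int end = Math.min(start + maxSegmentBytes, data.length);
            segments.add(Arrays.copyOfRange(data, start, end));
        }
        return segments;
    }
}
```

Each segment can then go through `transcribeBytes` from section 4 concurrently and the results concatenated in order.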
FAQ
Q1: Does Spring AI support every image format?
A: The supported set depends on the underlying model. OpenAI GPT-4o accepts JPEG, PNG, GIF, and WEBP, as does Claude 3. TIFF, BMP, and similar formats are not accepted and must be converted on the backend first. A good default is to normalize to JPEG (lossy, smaller files) or PNG (lossless, better for screenshots).
Q2: How accurate is image understanding?
A: For English text in screenshots, accuracy is close to 100%; Chinese text runs around 95-98%; handwriting around 85%; complex tables around 90%. The biggest challenge is low-resolution images (narrower than 300px), where it is best to prompt the user to upload a sharper screenshot.
Q3: How should oversized user uploads be handled?
A: Compress images on the frontend (Canvas API): scale to at most 2048px and keep files under 2MB, which both cuts token consumption and speeds up the upload. Keep a backend safety net that automatically compresses anything over the limit before sending it to the API.
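The backend safety net can be a simple `Graphics2D` downscale before the bytes go to the API (a sketch; the 2048px cap matches the model-side resize, so larger uploads only waste bandwidth and tokens):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ImageDownscaler {

    // Scale the image so its longest side is at most maxDim, preserving aspect ratio.
    public static BufferedImage capLongestSide(BufferedImage src, int maxDim) {
        int longest = Math.max(src.getWidth(), src.getHeight());
        if (longest <= maxDim) {
            return src; // already small enough
        }
        double scale = (double) maxDim / longest;
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }
}
```

Pair this with a JPEG re-encode at quality ~0.8 to land under the 2MB target for typical screenshots.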
Q4: Does Whisper support Chinese?
A: Yes, and Chinese recognition accuracy is quite high (roughly 5-10% WER). Specifying language: "zh" in the request improves accuracy and avoids confusion on mixed Chinese-English audio. Accented Mandarin degrades results slightly, but overall Whisper still beats most open-source alternatives.
Q5: How should multimodal features degrade?
A: Give every multimodal feature a text-only fallback path. For example: if image analysis fails, ask the user to describe the image; if speech recognition fails, show a text input box; if PDF parsing fails, ask the user to paste the text. Design the degradation at the product level rather than only catching exceptions at the technical level.
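Those degradation paths can be centralized in one wrapper so every multimodal call carries its own text fallback (a minimal sketch; `Supplier` stands in for any of the service calls in this article, and the hint strings are the product-level fallback messages):

```java
import java.util.function.Supplier;

public class MultimodalFallback {

    // Run the multimodal call; on any failure return a degradation message
    // that tells the user which text-only path to take instead.
    public static String callWithFallback(Supplier<String> multimodalCall,
                                          String fallbackHint) {
        try {
            return multimodalCall.get();
        } catch (RuntimeException e) {
            return fallbackHint;
        }
    }
}
```

Usage: `callWithFallback(() -> imageService.analyzeLocalImage(path, q).getAnalysis(), "Image analysis is unavailable right now; please describe what the screenshot shows.")`.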
