No. 1846: Automating Meeting Minutes with Speech-to-Text, Key-Point Extraction, and Action-Item Identification

Lao Zhang · 2026/4/30 · about 9 min read

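While consulting for a company, I noticed something odd: the weekly meeting took one hour, but writing up the minutes afterwards took a whole afternoon.

One person's job was specifically to "prepare the meeting minutes." She replayed the recording, transcribed it passage by passage, picked the key points out of a dense wall of text, and then worked out who had committed to what. The whole process was highly mechanical and also error-prone: who said which sentence, who owns a given action item, which day the deadline falls on. It takes only a moment of inattention to get any of these wrong or to miss them entirely.

This task has a very high automation ceiling and is an excellent fit for AI: the input is audio, and the output is structured information.

Today I will break down a complete meeting-minutes automation system, covering the full pipeline from recording upload to structured output.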

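System Overview

The core flow of the whole system: audio preprocessing, then speech-to-text with speaker identification, then AI structured analysis, and finally syncing action items to Jira.

Step 1: Audio Preprocessing

Raw meeting recordings are often of poor quality: background noise, microphone switching, several people talking at once. They need preprocessing before being sent to speech recognition.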

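import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class AudioPreprocessor {

    // Files above 25 MB are split before upload
    private static final long MAX_UPLOAD_BYTES = 25L * 1024 * 1024;
    // Each segment is 10 minutes long
    private static final int SEGMENT_SECONDS = 10 * 60;

    private final Path tempDirectory;

    public AudioPreprocessor(@Value("${audio.temp.dir:/tmp/meeting-audio}") String tempDir) {
        this.tempDirectory = Paths.get(tempDir);
        try {
            Files.createDirectories(this.tempDirectory);
        } catch (IOException e) {
            throw new RuntimeException("Could not create temp directory", e);
        }
    }

    /**
     * Preprocess an audio file: format conversion + noise reduction + segmentation.
     */
    public ProcessedAudio preprocess(Path inputFile) throws IOException {
        log.info("Preprocessing audio: {}", inputFile.getFileName());

        // 1. Convert to WAV (16 kHz mono, a good input format for Whisper)
        Path wavFile = convertToWav(inputFile);

        // 2. Check the file size; anything above 25 MB must be split
        long fileSizeBytes = Files.size(wavFile);
        if (fileSizeBytes > MAX_UPLOAD_BYTES) {
            log.info("Large file ({} MB), splitting into segments", fileSizeBytes / (1024 * 1024));
            List<Path> segments = splitAudio(wavFile, SEGMENT_SECONDS);
            return ProcessedAudio.builder()
                .originalFile(inputFile)
                .segments(segments)
                .isSegmented(true)
                .build();
        }

        return ProcessedAudio.builder()
            .originalFile(inputFile)
            .segments(List.of(wavFile))
            .isSegmented(false)
            .build();
    }

    private Path convertToWav(Path inputFile) throws IOException {
        Path outputFile = tempDirectory.resolve(
            inputFile.getFileName().toString().replaceAll("\\.[^.]+$", "") + "_processed.wav"
        );

        // Convert with FFmpeg
        runCommand(List.of(
            "ffmpeg", "-i", inputFile.toString(),
            "-ar", "16000",          // 16 kHz sample rate
            "-ac", "1",              // mono
            "-acodec", "pcm_s16le",  // 16-bit PCM
            "-af", "highpass=f=200,lowpass=f=3000,afftdn=nf=-25",  // band-limit + denoise
            "-y",                    // overwrite the output file
            outputFile.toString()
        ));

        return outputFile;
    }

    private List<Path> splitAudio(Path audioFile, int segmentSeconds) throws IOException {
        List<Path> segments = new ArrayList<>();
        String baseName = audioFile.getFileName().toString().replaceAll("\\.[^.]+$", "");

        // Get the total duration of the audio
        long totalSeconds = getAudioDuration(audioFile);
        int segmentCount = (int) Math.ceil((double) totalSeconds / segmentSeconds);

        for (int i = 0; i < segmentCount; i++) {
            Path segmentFile = tempDirectory.resolve(
                String.format("%s_seg%03d.wav", baseName, i)
            );

            // A failed or interrupted split now raises instead of silently
            // adding a missing or truncated segment to the list
            runCommand(List.of(
                "ffmpeg", "-i", audioFile.toString(),
                "-ss", String.valueOf(i * segmentSeconds),
                "-t", String.valueOf(segmentSeconds),
                "-y", segmentFile.toString()
            ));

            segments.add(segmentFile);
        }

        return segments;
    }

    // Not shown in the original article: an assumed implementation that
    // reads the duration (in seconds) with ffprobe.
    private long getAudioDuration(Path audioFile) throws IOException {
        String output = runCommand(List.of(
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            audioFile.toString()
        ));
        try {
            return (long) Math.ceil(Double.parseDouble(output.trim()));
        } catch (NumberFormatException e) {
            throw new IOException("Could not parse audio duration: " + output, e);
        }
    }

    // Runs an external command and returns its combined stdout/stderr.
    private String runCommand(List<String> command) throws IOException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        // Drain the output before waiting, so a full pipe buffer cannot block the child process
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        try {
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException(command.get(0) + " failed (exit " + exitCode + "): " + output);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Audio processing was interrupted", e);
        }
        return output;
    }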
}
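
The choice of 10-minute segments against a 25 MB threshold can be checked with simple arithmetic. The following is a minimal sketch, assuming the 16 kHz / mono / 16-bit PCM settings used in convertToWav and ignoring the small WAV header; the class and method names are illustrative and not part of the original service.

```java
// Back-of-the-envelope size estimate for uncompressed PCM audio.
// Assumes 16 kHz, mono, 16-bit samples; the WAV header is ignored.
final class WavSizeEstimator {
    static final int SAMPLE_RATE_HZ = 16_000;
    static final int BYTES_PER_SAMPLE = 2;   // 16-bit PCM
    static final int CHANNELS = 1;           // mono
    static final long SIZE_LIMIT_BYTES = 25L * 1024 * 1024;

    // Bytes of audio data produced per second of recording.
    static long bytesPerSecond() {
        return (long) SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;
    }

    // Estimated data size for a recording of the given length.
    static long estimateDataBytes(long durationSeconds) {
        return durationSeconds * bytesPerSecond();
    }

    // Longest whole number of seconds that still fits under the limit.
    static long maxSecondsUnderLimit() {
        return SIZE_LIMIT_BYTES / bytesPerSecond();
    }
}
```

Under these assumptions one second of audio is 32,000 bytes, so a 10-minute segment is about 19.2 MB and the 25 MB threshold corresponds to roughly 13.6 minutes. That leaves some headroom per segment.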

Step 2: Speech-to-Text and Speaker Identification

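import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

@Service
@Slf4j
@RequiredArgsConstructor
public class SpeechToTextService {

    // Must match the segment length used by AudioPreprocessor (10 minutes)
    private static final int SEGMENT_SECONDS = 10 * 60;

    @Value("${openai.api.key}")
    private String openaiApiKey;

    private final RestTemplate restTemplate;

    /**
     * Transcribe with the Whisper API (with segment-level timestamps).
     * Whisper itself does not return speaker labels.
     */
    public TranscriptionResult transcribe(ProcessedAudio audio,
            TranscriptionOptions options) {

        List<TranscriptionSegment> allSegments = new ArrayList<>();

        for (int i = 0; i < audio.getSegments().size(); i++) {
            Path segmentFile = audio.getSegments().get(i);
            // Shift timestamps so they are relative to the start of the whole meeting
            int offsetSeconds = audio.isSegmented() ? i * SEGMENT_SECONDS : 0;

            log.info("Transcribing segment {}/{}: {}", i + 1, audio.getSegments().size(),
                segmentFile.getFileName());

            List<TranscriptionSegment> segments = transcribeSegment(
                segmentFile, options, offsetSeconds);
            allSegments.addAll(segments);
        }

        return TranscriptionResult.builder()
            .segments(allSegments)
            .language(options.getLanguage())
            .totalDurationSeconds(calculateTotalDuration(allSegments))  // helper not shown in the article
            .build();
    }

    private List<TranscriptionSegment> transcribeSegment(Path audioFile,
            TranscriptionOptions options, int timeOffsetSeconds) {

        try {
            // Whisper API request
            HttpHeaders headers = new HttpHeaders();
            headers.setBearerAuth(openaiApiKey);
            headers.setContentType(MediaType.MULTIPART_FORM_DATA);

            MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
            body.add("file", new FileSystemResource(audioFile));
            body.add("model", "whisper-1");
            if (options.getLanguage() != null) {
                body.add("language", options.getLanguage());
            }
            body.add("response_format", "verbose_json");  // needed to get timestamps
            body.add("timestamp_granularities[]", "segment");

            if (options.getPrompt() != null) {
                // Supply meeting-specific vocabulary to improve recognition accuracy
                body.add("prompt", options.getPrompt());
            }

            HttpEntity<MultiValueMap<String, Object>> request =
                new HttpEntity<>(body, headers);

            ResponseEntity<Map> response = restTemplate.postForEntity(
                "https://api.openai.com/v1/audio/transcriptions",
                request, Map.class
            );

            return parseWhisperResponse(response.getBody(), timeOffsetSeconds);

        } catch (Exception e) {
            log.error("Whisper transcription failed: {}", e.getMessage(), e);
            throw new RuntimeException("Speech transcription failed", e);
        }
    }

    @SuppressWarnings("unchecked")
    private List<TranscriptionSegment> parseWhisperResponse(Map responseBody,
            int timeOffsetSeconds) {
        // An empty body or a body without "segments" yields an empty list instead of an NPE
        if (responseBody == null || responseBody.get("segments") == null) {
            return List.of();
        }
        List<Map<String, Object>> rawSegments =
            (List<Map<String, Object>>) responseBody.get("segments");

        return rawSegments.stream().map(seg -> TranscriptionSegment.builder()
            .text((String) seg.get("text"))
            .startSeconds(((Number) seg.get("start")).doubleValue() + timeOffsetSeconds)
            .endSeconds(((Number) seg.get("end")).doubleValue() + timeOffsetSeconds)
            .build()
        ).collect(Collectors.toList());
    }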
}

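Speaker Clustering (Identifying Who Is Speaking)

This is the most technically complex part. Full speaker diarization requires a dedicated model (such as pyannote.audio). The simplified version below does not analyze the audio at all; it asks an LLM to infer the speaker from the content of the transcript: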

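import java.util.List;

import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class SpeakerDiarizationService {

    // The original does not show this declaration; AiClient is an assumed
    // name for the project's own LLM client wrapper.
    private final AiClient aiClient;

    /**
     * Without a dedicated speaker-separation tool, you can either
     * let users label speakers manually during transcription,
     * or infer them from the conversation content during AI analysis.
     */
    public TranscriptionResult inferSpeakers(TranscriptionResult result,
            List<String> knownParticipants) {

        // Strategy: let the AI infer the speaker from context
        String prompt = buildSpeakerInferencePrompt(result, knownParticipants);

        // Ask the AI to identify speakers (usually works well for small meetings)
        String speakerAnnotated = aiClient.complete(prompt);

        // parseSpeakerAnnotations and formatTime are helpers not shown in the article
        return parseSpeakerAnnotations(result, speakerAnnotated);
    }

    private String buildSpeakerInferencePrompt(TranscriptionResult result,
            List<String> participants) {
        StringBuilder transcriptText = new StringBuilder();
        result.getSegments().forEach(seg ->
            transcriptText.append(formatTime(seg.getStartSeconds()))
                .append(" ").append(seg.getText()).append("\n")
        );

        return String.format("""
            Below is the transcript of a meeting recording. The participants are: %s

            Identify who said each passage and return a JSON array:
            [
              {"start": start time in seconds, "speaker": "speaker name", "text": "what was said"},
              ...
            ]

            Rules:
            1. Decide based on the content, tone, and context of each passage
            2. If you cannot tell, label the speaker as "Unknown participant"
            3. Do not change the original text

            Transcript:
            %s
            """, String.join(", ", participants), transcriptText);
    }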
}
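
The service above calls a formatTime helper that the article does not show, and the next step expects the transcript as a single string. The sketch below is one plausible way to do both, under stated assumptions: the Segment record, the [mm:ss] timestamp format, and the class name are all hypothetical stand-ins, not the author's actual types.

```java
import java.util.List;

// Renders timestamped, speaker-labelled segments as plain text for an LLM prompt.
final class TranscriptFormatter {

    // Hypothetical stand-in for the article's TranscriptionSegment.
    record Segment(double startSeconds, String speaker, String text) {}

    // Formats seconds as [mm:ss], or [h:mm:ss] once the meeting passes one hour.
    static String formatTime(double seconds) {
        long total = (long) Math.floor(seconds);
        long h = total / 3600;
        long m = (total % 3600) / 60;
        long s = total % 60;
        return h > 0
            ? String.format("[%d:%02d:%02d]", h, m, s)
            : String.format("[%02d:%02d]", m, s);
    }

    // One line per segment: "[mm:ss] Speaker: text"; the speaker is omitted when unknown.
    static String render(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        for (Segment seg : segments) {
            sb.append(formatTime(seg.startSeconds())).append(' ');
            if (seg.speaker() != null) {
                sb.append(seg.speaker()).append(": ");
            }
            sb.append(seg.text().strip()).append('\n');
        }
        return sb.toString();
    }
}
```

Keeping one segment per line also gives the later chunking step natural boundaries to split on.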

Step 3: AI Structured Analysis

With the transcript in hand, we reach the key step: using AI to extract structured information.

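import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
@RequiredArgsConstructor
public class MeetingAnalysisService {

    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    private final AnthropicClient anthropicClient;

    public MeetingMinutes analyze(String transcript, MeetingContext context) {
        log.info("Analyzing meeting content, transcript length: {} characters", transcript.length());

        // For long meetings, analyze in chunks and then merge
        if (transcript.length() > 8000) {
            return analyzeInChunks(transcript, context);
        }

        return analyzeFull(transcript, context);
    }

    private MeetingMinutes analyzeFull(String transcript, MeetingContext context) {
        String prompt = buildAnalysisPrompt(transcript, context);
        String response = anthropicClient.complete(prompt);
        return parseMeetingMinutes(response);
    }

    private String buildAnalysisPrompt(String transcript, MeetingContext context) {
        return String.format("""
            You are a professional meeting-minutes assistant. Analyze the meeting transcript below and extract structured information.

            Meeting details:
            - Title: %s
            - Time: %s
            - Participants: %s

            Meeting transcript:
            %s

            Extract the following information and output it as JSON:

            {
              "executive_summary": "Overall summary of the meeting (3-5 sentences)",

              "key_points": [
                {
                  "topic": "Topic name",
                  "discussion": "Summary of the discussion",
                  "conclusion": "Conclusion reached (if any)"
                }
              ],

              "decisions": [
                {
                  "decision": "What was decided",
                  "rationale": "Basis for the decision (if any)",
                  "decided_by": "Decision maker"
                }
              ],

              "action_items": [
                {
                  "task": "Task description",
                  "owner": "Owner",
                  "due_date": "Due date (YYYY-MM-DD; null if not mentioned in the meeting)",
                  "priority": "HIGH/MEDIUM/LOW",
                  "context": "Background for the task"
                }
              ],

              "open_questions": [
                {
                  "question": "Unresolved question",
                  "raised_by": "Who raised it",
                  "status": "PENDING/BEING_INVESTIGATED"
                }
              ],

              "next_meeting": {
                "scheduled": true or false,
                "suggested_date": "Suggested date (if any)",
                "agenda_items": ["Agenda items for the next meeting"]
              }
            }

            Notes:
            1. An action item must have a clear owner; a topic without an owner is not an action item
            2. Extract dates accurately from context; do not guess
            3. Judge priority by urgency and importance
            4. The summary must accurately reflect the core content of the meeting, with no filler
            """,
            context.getTitle(),
            context.getMeetingTime().toString(),
            String.join(", ", context.getParticipants()),
            transcript
        );
    }

    private MeetingMinutes analyzeInChunks(String transcript, MeetingContext context) {
        // Split a long meeting into several chunks, analyze each, then merge
        List<String> chunks = splitTranscriptIntoChunks(transcript, 6000);
        List<MeetingMinutes> chunkResults = new ArrayList<>();

        for (int i = 0; i < chunks.size(); i++) {
            log.info("Analyzing meeting chunk {}/{}", i + 1, chunks.size());
            // The note only reaches the model if the prompt renders it;
            // the prompt shown above uses title, time, and participants only
            MeetingContext chunkContext = context.withNote(
                String.format("This is part %d of %d of a long meeting", i + 1, chunks.size()));
            chunkResults.add(analyzeFull(chunks.get(i), chunkContext));
        }

        return mergeChunkResults(chunkResults, context);
    }

    /**
     * Merge the per-chunk results, remove duplicates, and tidy up the action items.
     */
    private MeetingMinutes mergeChunkResults(List<MeetingMinutes> chunks,
            MeetingContext context) {

        // Collect all action items, then let the AI deduplicate and tidy them
        List<ActionItem> allActionItems = chunks.stream()
            .flatMap(c -> c.getActionItems().stream())
            .collect(Collectors.toList());

        // writeValueAsString throws a checked exception, so it must be handled here
        String actionItemsJson;
        try {
            actionItemsJson = OBJECT_MAPPER.writeValueAsString(allActionItems);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Failed to serialize action items", e);
        }

        String dedupePrompt = String.format("""
            Below is a list of action items extracted from different parts of a meeting. It may contain duplicate or similar entries.
            Tidy and deduplicate them, then output the final action item list (a JSON array with the same structure as the input):

            %s
            """, actionItemsJson);

        String deduped = anthropicClient.complete(dedupePrompt);
        List<ActionItem> finalActionItems = parseActionItems(deduped);

        // Generate the final summary
        String allSummaries = chunks.stream()
            .map(MeetingMinutes::getExecutiveSummary)
            .collect(Collectors.joining("\n\n"));

        String finalSummary = generateFinalSummary(allSummaries, context);

        return MeetingMinutes.builder()
            .executiveSummary(finalSummary)
            .keyPoints(mergeKeyPoints(chunks))
            .decisions(mergeDecisions(chunks))
            .actionItems(finalActionItems)
            .openQuestions(mergeOpenQuestions(chunks))
            .meetingContext(context)
            .build();
    }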
}
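
The splitTranscriptIntoChunks helper used above is not shown. One reasonable way to write it is to pack whole transcript lines into chunks of at most a given size, so that an utterance is not cut in the middle. This is a minimal sketch under that assumption (one utterance per line); the class name is illustrative and the author's actual splitting strategy may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a transcript into chunks of at most maxChars, breaking on line boundaries.
final class TranscriptChunker {

    static List<String> splitIntoChunks(String transcript, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String line : transcript.split("\n")) {
            // A single line longer than the limit is hard-split as a last resort.
            while (line.length() > maxChars) {
                if (current.length() > 0) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                chunks.add(line.substring(0, maxChars));
                line = line.substring(maxChars);
            }
            // Start a new chunk if adding this line (plus a newline) would overflow.
            if (current.length() > 0 && current.length() + 1 + line.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(line);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Counting characters is only a rough proxy for model tokens, which is one reason to keep the chunk size (6000 here) well below the single-pass threshold (8000).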

Step 4: Syncing Action Items to Jira

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
@RequiredArgsConstructor
public class JiraTaskSyncService {

    private final JiraClient jiraClient;
    private final UserMappingService userMappingService;

    public List<String> syncActionItems(List<ActionItem> actionItems,
            String projectKey) {
        List<String> createdIssueKeys = new ArrayList<>();

        for (ActionItem item : actionItems) {
            try {
                String issueKey = createJiraIssue(item, projectKey);
                createdIssueKeys.add(issueKey);
                log.info("Created Jira issue: {} -> {}", item.getTask(), issueKey);
            } catch (Exception e) {
                // One failed item should not block the rest; log it and continue
                log.error("Failed to create issue: {}", item.getTask(), e);
            }
        }

        return createdIssueKeys;
    }

    private String createJiraIssue(ActionItem item, String projectKey) {
        // Look up the owner's Jira account
        Optional<String> assigneeId = userMappingService
            .findJiraUserId(item.getOwner());

        // Priority mapping; fall back to Medium if the model returned no priority
        String jiraPriority = item.getPriority() == null ? "Medium" : switch (item.getPriority()) {
            case HIGH -> "High";
            case MEDIUM -> "Medium";
            case LOW -> "Low";
        };

        JiraIssueRequest request = JiraIssueRequest.builder()
            .projectKey(projectKey)
            .issueType("Task")
            .summary("[Meeting action item] " + item.getTask())
            .description(buildIssueDescription(item))
            .assigneeId(assigneeId.orElse(null))
            .priority(jiraPriority)
            .dueDate(item.getDueDate())
            .labels(List.of("meeting-action-item"))
            .build();

        return jiraClient.createIssue(request);
    }

    private String buildIssueDescription(ActionItem item) {
        // Jira wiki markup: "h2." is a level-2 heading, "*" is a bullet
        return String.format("""
            h2. Source
            This task comes from a meeting action item and was created automatically by AI.

            h2. Background
            %s

            h2. Notes
            * Owner: %s
            * Due date: %s
            * Priority: %s
            """,
            item.getContext(),
            item.getOwner(),
            item.getDueDate() != null ? item.getDueDate().toString() : "Not specified",
            item.getPriority()
        );
    }
}
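
The analysis prompt asks the model for due dates in YYYY-MM-DD format or null, but model output is not guaranteed to comply, and a bad date should not block the Jira sync. The sketch below shows one defensive way to convert that field before it reaches the issue request. The class name is hypothetical; the article does not show how the author parses this field.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Lenient-on-failure parser for the "due_date" field returned by the model.
final class DueDateParser {

    // Returns the date if the value is a valid ISO-8601 date (yyyy-MM-dd),
    // and an empty Optional for null, "null", blank, or anything unparseable.
    static Optional<LocalDate> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String value = raw.strip();
        if (value.isEmpty() || value.equalsIgnoreCase("null")) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDate.parse(value));
        } catch (DateTimeParseException e) {
            // e.g. "next Friday" or an impossible date: treat as "not specified"
            return Optional.empty();
        }
    }
}
```

Treating an unparseable date as "not specified" matches the prompt's instruction not to guess; the issue is still created, just without a due date.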

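Results: A Real Output Sample

Below is the output summary the system produced for a technical discussion of about 30 minutes (content anonymized):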

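# Meeting Minutes
**Time**: 2024-03-15 14:00-14:35  
**Attendees**: Zhang (architect), Li (backend), Wang (frontend)

## Summary
The meeting discussed the plan for splitting the user center into microservices and decided on an incremental migration strategy.
The phase-one goal is to deploy the authentication module independently, expected to take 4 weeks.

## Key Decisions
1. ✅ Use incremental migration rather than a big-bang rewrite (decided by: Zhang)
2. ✅ Authentication service goes first; other modules will be split out in later iterations

## Action Items
| # | Task | Owner | Due date | Priority |
|---|------|-------|----------|----------|
| 1 | Complete the interface design document for the authentication service | Li | 2024-03-20 | HIGH |
| 2 | Estimate the effort to rework the existing frontend login logic | Wang | 2024-03-22 | MEDIUM |
| 3 | Set up the basic scaffolding for the authentication service | Li | 2024-03-25 | HIGH |

## Open Questions
- ❓ How should session data from the old system be migrated? (raised by: Li, status: to be investigated)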

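Pitfall: Unstable Recognition Accuracy

Two weeks after deployment, users reported a problem: things said by the same person were sometimes attributed to different people.

Investigation showed that the problem came from segmented processing, where speaker context was lost between adjacent segments. When A was halfway through a sentence and the rest was cut into the beginning of the next segment, the AI could not tell who was speaking and either labeled it "unknown" or guessed.

The fix: keep a 30-second overlap when segmenting, so that every segment has enough context to determine who is speaking.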
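
The overlap can be planned before any audio is cut. The sketch below computes the start and length of each segment for a given overlap. It assumes the 10-minute segments and 30-second overlap described above; the class and record names are illustrative, not the author's code.

```java
import java.util.ArrayList;
import java.util.List;

// Plans overlapping segment windows over a recording of known length.
final class OverlappingSegmenter {

    // startSeconds and durationSeconds map onto FFmpeg's -ss and -t arguments.
    record Window(int index, long startSeconds, long durationSeconds) {}

    static List<Window> plan(long totalSeconds, long segmentSeconds, long overlapSeconds) {
        if (segmentSeconds <= 0 || overlapSeconds < 0 || overlapSeconds >= segmentSeconds) {
            throw new IllegalArgumentException("require 0 <= overlap < segment length");
        }
        // Each window starts one stride after the previous one,
        // so consecutive windows share overlapSeconds of audio.
        long stride = segmentSeconds - overlapSeconds;
        List<Window> windows = new ArrayList<>();
        for (long start = 0; start < totalSeconds; start += stride) {
            long duration = Math.min(segmentSeconds, totalSeconds - start);
            windows.add(new Window(windows.size(), start, duration));
            // Stop once a window reaches the end; otherwise a trailing window
            // would contain nothing but already-covered overlap.
            if (start + duration >= totalSeconds) {
                break;
            }
        }
        return windows;
    }
}
```

For a 25-minute recording (1500 seconds) this yields windows starting at 0, 570, and 1140 seconds. Two things change downstream once segments overlap: the timestamp offset for a segment is its window start rather than the index times 600, and the text inside each overlap is transcribed twice, so one copy has to be dropped when the segments are merged.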

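This kind of "boundary problem" shows up in every chunked-processing scenario. Whether you are chunking documents or segmenting audio, overlap is the general-purpose remedy.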