Article 1803: Integrating Speech-to-Text into AI Systems: Whisper API and Real-Time Streaming Transcription

Lao Zhang · 2026/4/30 · about 12 min read

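A conversation that stuck with me

I was having dinner with a friend who builds AI customer service, and he said: "Our speech recognition accuracy is 95%, which is pretty good."

I asked: "So where does the 5% of errors land?"

He paused, then said he wasn't sure.

That is exactly the problem. An accuracy figure is an average, but user experience is decided by the worst case. If that 5% happens to fall on the amounts, names, and technical terms the user says, the customer service record is useless and may even trigger a complaint.

Integrating speech-to-text (ASR) is not a matter of calling one API and being done. From using the Whisper API, to real-time streaming transcription, to post-processing and fault-tolerant design, there is a whole set of things to consider. This article walks through them systematically.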

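Whisper's basic capabilities and choosing an option

Let's get Whisper's variants straight first, so you don't pick the wrong one.

There are two ways to use OpenAI's Whisper:

Whisper API: call the OpenAI endpoint directly, billed per minute of audio, nothing to deploy
Open-source Whisper models: self-hosted, available in tiny/base/small/medium/large sizes

[Figure: selection decision tree]

Our selection logic for most enterprise scenarios:

Customer service recording analysis (offline): Whisper API, the least effort
Real-time voice assistant: streaming over WebSocket, needs extra engineering
Medical/legal (private data): must be self-hosted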

Basic integration: Whisper API

Start by getting the simplest offline scenario right:

@Component
public class WhisperClient {
    
    private static final String WHISPER_API_URL = "https://api.openai.com/v1/audio/transcriptions";
    private static final long MAX_FILE_SIZE = 25 * 1024 * 1024; // 25MB
    
    @Value("${openai.api.key}")
    private String apiKey;
    
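    @Autowired
    private RestTemplate restTemplate;
    
    // Used by parseVerboseResponse; assumes an ObjectMapper bean is available
    @Autowired
    private ObjectMapper objectMapper;
    
    /**
     * Basic transcription call
     */
    public TranscriptionResult transcribe(byte[] audioData, String filename) {
        validateAudioSize(audioData);
        
        // Build the multipart request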
        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
        body.add("file", new ByteArrayResource(audioData) {
            @Override
            public String getFilename() { return filename; }
        });
        body.add("model", "whisper-1");
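        body.add("language", "zh"); // Specify Chinese explicitly; more reliable than auto-detection
        body.add("response_format", "verbose_json"); // Returns segment-level timestamps and per-segment metrics
        body.add("temperature", "0"); // Temperature 0 reduces randomness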
        
        HttpHeaders headers = new HttpHeaders();
        headers.setBearerAuth(apiKey);
        headers.setContentType(MediaType.MULTIPART_FORM_DATA);
        
        HttpEntity<MultiValueMap<String, Object>> request = new HttpEntity<>(body, headers);
        
        try {
            ResponseEntity<String> response = restTemplate.postForEntity(
                WHISPER_API_URL, request, String.class);
            
            return parseVerboseResponse(response.getBody());
        } catch (RestClientException e) {
            // Covers 4xx and 5xx responses as well as I/O failures
            throw new TranscriptionException("Whisper API call failed: " + e.getMessage());
        }
    }
    
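    /**
     * Parses the verbose_json response.
     * Contains: full text, segments (with timestamps), detected language
     */
    private TranscriptionResult parseVerboseResponse(String jsonResponse) {
        JsonNode root;
        try {
            root = objectMapper.readTree(jsonResponse);
        } catch (JsonProcessingException e) {
            // readTree throws a checked exception, so it has to be handled here
            throw new TranscriptionException("Could not parse Whisper response: " + e.getMessage());
        }
        
        String fullText = root.path("text").asText();
        String detectedLanguage = root.path("language").asText();
        
        // Parse segments (segment-level timestamps)
        List<TranscriptionSegment> segments = new ArrayList<>();
        JsonNode segmentsNode = root.get("segments");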
        
        if (segmentsNode != null && segmentsNode.isArray()) {
            for (JsonNode seg : segmentsNode) {
                segments.add(TranscriptionSegment.builder()
                    .id(seg.get("id").asInt())
                    .start(seg.get("start").asDouble()) // seconds
                    .end(seg.get("end").asDouble())
                    .text(seg.get("text").asText().trim())
                    .avgLogprob(seg.get("avg_logprob").asDouble()) // confidence proxy
                    .compressionRatio(seg.get("compression_ratio").asDouble())
                    .noSpeechProb(seg.get("no_speech_prob").asDouble()) // probability that the segment contains no speech
                    .build());
            }
        }
        
        return TranscriptionResult.builder()
            .fullText(fullText)
            .language(detectedLanguage)
            .segments(segments)
            .confidence(calculateOverallConfidence(segments))
            .build();
    }
    
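    /**
     * Derives a confidence score from avg_logprob.
     * avg_logprob is a mean token log-probability, so it is never above 0;
     * it usually falls between -0.5 and 0, and closer to 0 is better.
     * It is mapped to a 0-1 confidence score below.
     */
    private double calculateOverallConfidence(List<TranscriptionSegment> segments) {
        if (segments.isEmpty()) return 0.0;
        
        // Drop segments that are probably not speech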
        List<TranscriptionSegment> speechSegments = segments.stream()
            .filter(s -> s.getNoSpeechProb() < 0.5)
            .collect(Collectors.toList());
        
        if (speechSegments.isEmpty()) return 0.0;
        
        double avgLogprob = speechSegments.stream()
            .mapToDouble(TranscriptionSegment::getAvgLogprob)
            .average()
            .orElse(-1.0);
        
        // avg_logprob = 0 -> confidence = 1.0
        // avg_logprob = -1.0 -> confidence ≈ 0.37
        // Convert with the exponential function
        return Math.exp(avgLogprob);
    }
    
    /**
     * Audio over 25 MB must be split before upload
     */
    private void validateAudioSize(byte[] audioData) {
        if (audioData.length > MAX_FILE_SIZE) {
            throw new AudioTooLargeException(
                String.format("Audio file is %.1f MB, over the 25 MB limit; split it first", 
                    audioData.length / 1024.0 / 1024.0));
        }
    }
}
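
To get a feel for what the 25 MB limit means once audio has been converted to WAV, here is a rough arithmetic sketch. It assumes uncompressed 16 kHz, 16-bit, mono PCM with a 44-byte WAV header; the class and method names are made up for illustration. Under those assumptions one request carries about 819 seconds of audio, a bit under 14 minutes, so longer calls have to be split or uploaded in a compressed format.

```java
/** Back-of-the-envelope check: how much PCM audio fits under an upload limit. */
class UploadBudget {
    static final long MAX_FILE_SIZE = 25L * 1024 * 1024; // 25 MB, the same limit as in WhisperClient
    static final int WAV_HEADER_BYTES = 44;              // canonical PCM WAV header (assumption)

    /** Bytes per second of uncompressed PCM audio. */
    static long bytesPerSecond(int sampleRate, int bytesPerSample, int channels) {
        return (long) sampleRate * bytesPerSample * channels;
    }

    /** Whole seconds of audio that fit in one upload. */
    static long maxSeconds(int sampleRate, int bytesPerSample, int channels) {
        return (MAX_FILE_SIZE - WAV_HEADER_BYTES)
            / bytesPerSecond(sampleRate, bytesPerSample, channels);
    }

    public static void main(String[] args) {
        long seconds = maxSeconds(16_000, 2, 1);
        System.out.println("16 kHz mono 16-bit PCM: " + seconds + " seconds per request");
    }
}
```

The same calculation gives a sensible upper bound for the maxChunkSeconds argument when splitting long recordings.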

Audio preprocessing

Audio often needs processing first to get good recognition results:

@Component
public class AudioPreprocessor {
    
    /**
     * Audio preprocessing pipeline
     * Goal: improve Whisper recognition accuracy
     */
    public byte[] preprocess(byte[] rawAudio, AudioFormat format) {
        // Step 1: format conversion (Whisper accepts mp3/mp4/mpeg/mpga/m4a/wav/webm)
        byte[] wavAudio = convertToWav(rawAudio, format);
        
        // Step 2: normalize the sample rate (Whisper models operate on 16 kHz audio)
        byte[] resampled = resampleTo16k(wavAudio);
        
        // Step 3: convert to mono (average the two channels of stereo input, which also reduces noise)
        byte[] mono = convertToMono(resampled);
        
        // Step 4: volume normalization
        byte[] normalized = normalizeVolume(mono);
        
        // Step 5: trim silence (remove pure silence at the start and end)
        byte[] trimmed = trimSilence(normalized);
        
        return trimmed;
    }
    
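    /**
     * Splitting strategy for long audio:
     * cut at silences so a sentence is not broken in the middle.
     * Assumes audioData is raw 16 kHz, 16-bit mono PCM.
     */
    public List<AudioChunk> splitLongAudio(byte[] audioData, int maxChunkSeconds) {
        // Detect silent intervals
        List<SilenceInterval> silences = detectSilences(audioData);
        
        List<AudioChunk> chunks = new ArrayList<>();
        int startSample = 0;
        int sampleRate = 16000;
        int maxChunkSamples = maxChunkSeconds * sampleRate;
        int totalSamples = audioData.length / 2; // /2 because 16-bit
        SilenceInterval lastSilence = null;      // most recent silence inside the current chunk
        
        for (SilenceInterval silence : silences) {
            // Once this silence lies past the limit, cut at the previous silence instead,
            // so the chunk stays within maxChunkSamples whenever a silence was available
            if (silence.getStartSample() - startSample > maxChunkSamples && lastSilence != null) {
                chunks.add(extractChunk(audioData, startSample, lastSilence.getStartSample()));
                startSample = lastSilence.getEndSample();
            }
            lastSilence = silence;
        }
        
        // The tail may still be too long: cut once more at the last silence if that helps
        if (totalSamples - startSample > maxChunkSamples
                && lastSilence != null && lastSilence.getStartSample() > startSample) {
            chunks.add(extractChunk(audioData, startSample, lastSilence.getStartSample()));
            startSample = lastSilence.getEndSample();
        }
        
        // Remaining audio
        if (startSample < totalSamples) {
            chunks.add(extractChunk(audioData, startSample, totalSamples));
        }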
        
        return chunks;
    }
    
    /**
     * Silence detection by energy threshold.
     * Simple and effective; works well on telephone recordings.
     */
    private List<SilenceInterval> detectSilences(byte[] wavData) {
        // Convert to a short array (16-bit PCM)
        short[] samples = convertToShortArray(wavData);
        
        int windowSize = 1600; // 100 ms window at 16,000 Hz
        double silenceThreshold = 500; // energy threshold; tune for the actual noise environment
        int minSilenceSamples = 8000; // minimum silence duration: 500 ms
        
        List<SilenceInterval> silences = new ArrayList<>();
        int silenceStart = -1;
        
        for (int i = 0; i < samples.length - windowSize; i += windowSize) {
            double energy = calculateEnergy(samples, i, windowSize);
            
            if (energy < silenceThreshold) {
                if (silenceStart < 0) silenceStart = i;
            } else {
                if (silenceStart >= 0 && (i - silenceStart) >= minSilenceSamples) {
                    silences.add(new SilenceInterval(silenceStart, i));
                }
                silenceStart = -1;
            }
        }
        
        return silences;
    }
}
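
detectSilences relies on two helpers that are not shown, convertToShortArray and calculateEnergy. A minimal sketch of what they might look like follows. It assumes headerless little-endian 16-bit PCM and takes "energy" to mean the RMS amplitude of a window; the author's helpers may well differ, and the class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class PcmEnergy {
    /** Interprets raw bytes as little-endian signed 16-bit samples (no WAV header). */
    static short[] convertToShortArray(byte[] pcm) {
        short[] samples = new short[pcm.length / 2];
        ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(samples);
        return samples;
    }

    /** RMS amplitude of samples[start, start + length), clipped to the array bounds. */
    static double calculateEnergy(short[] samples, int start, int length) {
        int end = Math.min(samples.length, start + length);
        if (end <= start) return 0.0;
        double sumSquares = 0.0;
        for (int i = start; i < end; i++) {
            sumSquares += (double) samples[i] * samples[i];
        }
        return Math.sqrt(sumSquares / (end - start));
    }
}
```

If the input still carries a WAV header, strip it before calling convertToShortArray, otherwise the header bytes are read as samples.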

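Prompt engineering: domain terminology is the biggest challenge

Whisper's general recognition is decent, but it often gets industry terms, personal names, and place names wrong. This is the biggest pain point in real applications.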

@Component
public class WhisperPromptOptimizer {
    
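    /**
     * The Whisper API accepts a prompt parameter.
     * It "warms up" the model by hinting at vocabulary that may appear.
     * Many people don't know about it, but it is very useful.
     * Keep it short: the model only considers the final 224 tokens of the prompt.
     */
    public String buildPrompt(TranscriptionContext context) {
        StringBuilder prompt = new StringBuilder();
        
        // Base prompt: transcription style (kept in Chinese to match the audio language)
        prompt.append("以下是一段");
        
        // Scene description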
        switch (context.getScene()) {
            case CUSTOMER_SERVICE:
                prompt.append("客服通话录音，包含客户咨询和客服回复。");
                break;
            case MEETING:
                prompt.append("会议录音，可能有多人发言。");
                break;
            case MEDICAL:
                prompt.append("医疗问诊录音，包含医学术语。");
                break;
        }
        
        // Company/product-specific vocabulary (most important!)
        if (context.getCustomVocabulary() != null && !context.getCustomVocabulary().isEmpty()) {
            prompt.append("可能出现的专有词汇：");
            prompt.append(String.join("、", context.getCustomVocabulary()));
            prompt.append("。");
        }
        
        // Speaker information (if known)
        if (context.getSpeakerNames() != null && !context.getSpeakerNames().isEmpty()) {
            prompt.append("说话人可能包括：");
            prompt.append(String.join("、", context.getSpeakerNames()));
            prompt.append("。");
        }
        
        return prompt.toString();
    }
    
    /**
     * Example call that actually uses the prompt
     */
    public TranscriptionResult transcribeWithContext(byte[] audioData, 
                                                      TranscriptionContext context) {
        String prompt = buildPrompt(context);
        
        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
        body.add("file", new ByteArrayResource(audioData) {
            @Override
            public String getFilename() { return "audio.wav"; }
        });
        body.add("model", "whisper-1");
        body.add("language", "zh");
        body.add("response_format", "verbose_json");
        body.add("prompt", prompt); // the key part!
        
        // ... send the request
        return sendRequest(body);
    }
}

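Real-time streaming transcription: the WebSocket approach

Real-time scenarios are more complex: you transcribe while the user is still speaking, and latency has to stay within 1-2 seconds.

The Whisper API itself does not support streaming. There are a few ways around that:

Rolling window: send a short audio clip every 1-2 seconds and stitch the results together
Use a service that supports streaming, such as Azure Speech Services or Google STT
Self-host an open-source model with streaming inference

We went with "rolling window + context carry-over":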

@Component
public class RealtimeTranscriber {
    
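    private static final int WINDOW_DURATION_MS = 2000;  // each window covers 2 seconds
    private static final int OVERLAP_MS = 500;           // 0.5 s overlap between windows, to avoid cutting words
    private static final int SAMPLE_RATE = 16000;
    private static final int BYTES_PER_SAMPLE = 2;       // 16-bit PCM
    
    /**
     * A real-time transcription session.
     * Core idea: keep an audio buffer and periodically send a slice to Whisper.
     * One instance exists per connection, so this is a plain class rather than a Spring bean.
     */
    public class TranscriptionSession {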
        
        private final CircularBuffer audioBuffer;
        private final ScheduledExecutorService scheduler;
        private final TranscriptionResultAggregator aggregator;
        private String lastTranscribedText = "";
        private long lastSentTimestamp = 0;
        
        public TranscriptionSession(TranscriptionContext context) {
            int bufferCapacity = SAMPLE_RATE * BYTES_PER_SAMPLE * 30; // 30-second buffer
            this.audioBuffer = new CircularBuffer(bufferCapacity);
            this.aggregator = new TranscriptionResultAggregator();
            this.scheduler = Executors.newSingleThreadScheduledExecutor();
            
            // Trigger transcription on a fixed schedule: a new window every WINDOW_DURATION_MS - OVERLAP_MS
            scheduler.scheduleAtFixedRate(
                () -> triggerTranscription(context),
                WINDOW_DURATION_MS, 
                WINDOW_DURATION_MS - OVERLAP_MS, 
                TimeUnit.MILLISECONDS
            );
        }
        
        /**
         * Receives an audio chunk (from the WebSocket)
         */
        public void feedAudio(byte[] audioChunk) {
            audioBuffer.write(audioChunk);
        }
        
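        /**
         * Timer callback: take the latest window and send it to Whisper
         */
        private void triggerTranscription(TranscriptionContext context) {
            long now = System.currentTimeMillis();
            // Multiply before dividing so a window length that is not a whole number of seconds is not truncated
            int windowBytes = SAMPLE_RATE * BYTES_PER_SAMPLE * WINDOW_DURATION_MS / 1000;
            
            byte[] windowData = audioBuffer.readLast(windowBytes);
            if (windowData == null || windowData.length < windowBytes / 4) {
                return; // not enough data, skip
            }
            
            // Call Whisper asynchronously so the scheduler thread is not blocked.
            // The buffer holds raw PCM, which has to be wrapped in a WAV container before upload;
            // toWav is a hypothetical helper that is not shown here.
            CompletableFuture.supplyAsync(() -> {
                return whisperClient.transcribe(toWav(windowData), "window.wav");
            }).thenAccept(result -> {
                processWindowResult(result, now);
            }).exceptionally(ex -> {
                log.warn("Window transcription failed", ex);
                return null;
            });
            
            lastSentTimestamp = now;
        }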
        
        /**
         * Handles the transcription result for one window.
         * Key point: remove the content that overlaps with the previous window.
         */
        private void processWindowResult(TranscriptionResult result, long windowTimestamp) {
            String newText = result.getFullText();
            
            // Find the new (non-overlapping) part
            String increment = findIncrement(lastTranscribedText, newText);
            
            if (!increment.isEmpty()) {
                lastTranscribedText = newText;
                
                // Push the incremental text to the front end
                aggregator.append(increment, windowTimestamp);
                pushToWebSocket(increment, windowTimestamp);
            }
        }
        
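        /**
         * Returns the part of newText that extends oldText.
         * Uses suffix/prefix overlap matching, a simplification rather than
         * a full longest-common-subsequence alignment.
         */
        private String findIncrement(String oldText, String newText) {
            if (oldText.isEmpty()) return newText;
            if (newText.isEmpty()) return "";
            
            // Find the longest suffix of oldText that is also a prefix of newText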
            int maxOverlap = Math.min(oldText.length(), newText.length());
            for (int overlapLen = maxOverlap; overlapLen > 0; overlapLen--) {
                String oldSuffix = oldText.substring(oldText.length() - overlapLen);
                if (newText.startsWith(oldSuffix)) {
                    return newText.substring(overlapLen);
                }
            }
            
            // No overlap found; treat everything as new content
            return newText;
        }
    }
}
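
One detail the session code glosses over: the circular buffer holds raw PCM bytes, while the upload is named window.wav. For the API to decode it, the slice needs a WAV header first. The sketch below shows one way to do that, assuming little-endian linear PCM and the canonical 44-byte header; WavWrapper and toWav are hypothetical names, not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class WavWrapper {
    /** Prepends a canonical 44-byte PCM WAV header to raw little-endian PCM data. */
    static byte[] toWav(byte[] pcm, int sampleRate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8; // bytes per sample frame
        int byteRate = sampleRate * blockAlign;        // bytes per second

        ByteBuffer buf = ByteBuffer.allocate(44 + pcm.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + pcm.length);                   // RIFF chunk size = file size - 8
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                                // fmt chunk size for PCM
        buf.putShort((short) 1);                       // audio format 1 = linear PCM
        buf.putShort((short) channels);
        buf.putInt(sampleRate);
        buf.putInt(byteRate);
        buf.putShort((short) blockAlign);
        buf.putShort((short) bitsPerSample);
        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(pcm.length);                        // data chunk size
        buf.put(pcm);
        return buf.array();
    }
}
```

With the constants used above, a 2-second window would be wrapped as WavWrapper.toWav(windowData, 16000, 1, 16) before being passed to the transcription client.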

WebSocket server

@ServerEndpoint("/api/transcribe/realtime")
@Component
public class TranscriptionWebSocketServer {
    
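    // By default the container creates one endpoint instance per connection, so shared state must be static
    private static final Map<String, RealtimeTranscriber.TranscriptionSession> sessions = 
        new ConcurrentHashMap<>();
    
    @OnOpen
    public void onOpen(Session session) {
        log.info("New transcription session: {}", session.getId());
        
        // Parse the session configuration (from the handshake headers or the first message)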
        TranscriptionContext context = TranscriptionContext.builder()
            .scene(TranscriptionScene.CUSTOMER_SERVICE)
            .customVocabulary(Arrays.asList("会员卡", "积分", "退款", "换货")) // business vocabulary
            .build();
        
        RealtimeTranscriber.TranscriptionSession transcriptionSession = 
            realtimeTranscriber.createSession(context, text -> {
                // Callback: push new text to the client as it arrives
                sendText(session, text);
            });
        
        sessions.put(session.getId(), transcriptionSession);
    }
    
    @OnMessage
    public void onBinaryMessage(byte[] audioData, Session session) {
        RealtimeTranscriber.TranscriptionSession transcriptionSession = 
            sessions.get(session.getId());
        
        if (transcriptionSession != null) {
            transcriptionSession.feedAudio(audioData);
        }
    }
    
    @OnClose
    public void onClose(Session session) {
        RealtimeTranscriber.TranscriptionSession transcriptionSession = 
            sessions.remove(session.getId());
        
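        if (transcriptionSession != null) {
            // Process any audio left at the end
            TranscriptionResult finalResult = transcriptionSession.flush();
            // The connection is usually already closed at this point, so only send if it is still open
            if (finalResult != null && session.isOpen()) {
                sendText(session, finalResult.getFullText());
            }
            transcriptionSession.close();
        }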
    }
    
    private void sendText(Session session, String text) {
        try {
            Map<String, String> message = Map.of(
                "type", "transcript",
                "text", text,
                "timestamp", String.valueOf(System.currentTimeMillis())
            );
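            // Serialize sends: transcription callbacks may arrive on different threads
            synchronized (session) {
                session.getBasicRemote().sendText(objectMapper.writeValueAsString(message));
            }
        } catch (IOException e) {
            log.error("WebSocket send failed", e);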
        }
    }
}

Post-processing: improving final quality

Raw transcription output usually needs post-processing before it is usable:

@Component
public class TranscriptionPostProcessor {
    
    @Autowired
    private LlmClient llmClient;
    
    /**
     * Three post-processing steps: punctuation, correction, speaker separation
     */
    public ProcessedTranscription process(TranscriptionResult raw, 
                                           TranscriptionContext context) {
        // Step 1: punctuation restoration (Whisper's Chinese punctuation is often inaccurate)
        String withPunctuation = restorePunctuation(raw.getFullText());
        
        // Step 2: domain term correction
        String corrected = correctDomainTerms(withPunctuation, context.getDomainTerms());
        
        // Step 3: speaker separation (for multi-speaker recordings)
        List<SpeakerTurn> turns = null;
        if (context.isMultiSpeaker()) {
            turns = separateSpeakers(raw.getSegments(), corrected);
        }
        
        return ProcessedTranscription.builder()
            .rawText(raw.getFullText())
            .processedText(corrected)
            .speakerTurns(turns)
            .confidence(raw.getConfidence())
            .build();
    }
    
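    /**
     * Restores punctuation with an LLM.
     * Works better than rule-based methods because it understands semantic sentence boundaries.
     */
    private String restorePunctuation(String text) {
        if (text.length() < 50) return text; // leave short text alone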
        
        String prompt = String.format("""
            请为以下语音转录文字添加标点符号，注意：
            1. 只添加标点，不修改任何词语
            2. 使用中文标点（，。？！）
            3. 保持原有换行
            4. 不要添加任何解释
            
            原文：%s
            """, text);
        
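        String punctuated = llmClient.complete(prompt).trim();
        
        // Guard: if the LLM changed anything other than punctuation or whitespace, keep the original
        String ignorable = "[\\p{P}\\s]";
        boolean wordsUnchanged = punctuated.replaceAll(ignorable, "")
            .equals(text.replaceAll(ignorable, ""));
        return wordsUnchanged ? punctuated : text;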
    }
    
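    /**
     * Domain term correction.
     * Idea: edit-distance matching to find likely misrecognitions of known terms.
     */
    private String correctDomainTerms(String text, List<String> domainTerms) {
        if (domainTerms == null || domainTerms.isEmpty()) return text;
        
        String result = text;
        
        for (String term : domainTerms) {
            // Scale the allowed distance with term length: with a fixed distance of 2,
            // a two-character term such as "积分" would match every two-character substring.
            // These cut-offs are a heuristic; tune them on real data.
            if (term.length() < 3) continue;
            int maxDistance = term.length() < 6 ? 1 : 2;
            List<String> candidates = findSimilarSubstrings(result, term, maxDistance);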
            
            for (String candidate : candidates) {
                if (!candidate.equals(term)) {
                    result = result.replace(candidate, term);
                    log.debug("Corrected term: {} -> {}", candidate, term);
                }
            }
        }
        
        return result;
    }
    
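    /**
     * Speaker separation, simple version.
     * Production-grade speaker separation needs a more complex voiceprint model;
     * this is a heuristic based on segment timestamps and text cues.
     */
    private List<SpeakerTurn> separateSpeakers(List<TranscriptionSegment> segments, 
                                                  String text) {
        // Simplified: look for obvious role-switch markers.
        // In customer service, "您好" (hello) and "请问" (may I ask) usually open an agent turn;
        // "我要" (I want) and "为什么" (why) usually come from the customer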
        
        List<SpeakerTurn> turns = new ArrayList<>();
        String currentSpeaker = "UNKNOWN";
        StringBuilder currentText = new StringBuilder();
        double currentStart = 0;
        
        for (TranscriptionSegment segment : segments) {
            String segText = segment.getText();
            
            // Simple speaker-switch detection
            String detectedSpeaker = detectSpeaker(segText, currentSpeaker);
            
            if (!detectedSpeaker.equals(currentSpeaker)) {
                // Close the previous turn if it has any text
                if (currentText.length() > 0) {
                    turns.add(SpeakerTurn.builder()
                        .speaker(currentSpeaker)
                        .text(currentText.toString().trim())
                        .startTime(currentStart)
                        .endTime(segment.getStart())
                        .build());
                    currentText = new StringBuilder();
                }
                // Always switch, so the first segment's text is not filed under UNKNOWN
                currentStart = segment.getStart();
                currentSpeaker = detectedSpeaker;
            }
            
            currentText.append(segText);
        }
        
        if (currentText.length() > 0) {
            turns.add(SpeakerTurn.builder()
                .speaker(currentSpeaker)
                .text(currentText.toString().trim())
                .startTime(currentStart)
                .endTime(segments.isEmpty() ? 0 : 
                    segments.get(segments.size()-1).getEnd())
                .build());
        }
        
        return turns;
    }
}
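
correctDomainTerms calls findSimilarSubstrings, which is not shown. The sketch below is one possible version, not the author's: it slides a window of the term's length over the text and keeps windows within a given Levenshtein distance. It also skips any window that overlaps an exact occurrence of the term, because a correct occurrence shifted by one character is itself at distance 2 and would otherwise be "corrected" into garbage. The class name and that overlap rule are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FuzzyTerms {
    /** Classic Levenshtein distance over UTF-16 chars, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distinct same-length substrings within 1..maxDistance edits of term. */
    static List<String> findSimilarSubstrings(String text, String term, int maxDistance) {
        int n = term.length();
        if (n == 0 || n > text.length()) return new ArrayList<>();

        // Mark characters that belong to an exact occurrence of the term
        boolean[] covered = new boolean[text.length()];
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + 1)) {
            Arrays.fill(covered, i, i + n, true);
        }

        Set<String> found = new LinkedHashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            boolean overlapsExact = false;
            for (int j = i; j < i + n; j++) overlapsExact |= covered[j];
            if (overlapsExact) continue;

            String window = text.substring(i, i + n);
            int d = editDistance(window, term);
            if (d >= 1 && d <= maxDistance) found.add(window);
        }
        return new ArrayList<>(found);
    }
}
```

For Chinese text, plain character distance is a blunt tool, since misrecognitions tend to be homophones; treat the returned candidates as suggestions to review rather than certain errors.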

Complete example for the customer service scenario

Putting the pieces above together into a complete call analysis flow:

@Service
public class CustomerServiceCallAnalyzer {
    
    @Autowired
    private AudioPreprocessor audioPreprocessor;
    
    @Autowired
    private WhisperClient whisperClient;
    
    @Autowired
    private WhisperPromptOptimizer promptOptimizer;
    
    @Autowired
    private TranscriptionPostProcessor postProcessor;
    
    @Autowired
    private LlmClient llmClient;
    
    /**
     * Complete customer service call analysis flow
     */
    public CallAnalysisReport analyzeCall(byte[] audioData, CallMetadata metadata) {
        
        // 1. Audio preprocessing
        byte[] processedAudio = audioPreprocessor.preprocess(audioData, AudioFormat.MP3);
        
        // 2. Build the transcription context
        TranscriptionContext context = TranscriptionContext.builder()
            .scene(TranscriptionScene.CUSTOMER_SERVICE)
            .isMultiSpeaker(true)
            .customVocabulary(getProductVocabulary(metadata.getProductType()))
            .domainTerms(getDomainTerms())
            .build();
        
        // 3. Speech to text (transcribeWithContext is defined on WhisperPromptOptimizer)
        TranscriptionResult transcription = promptOptimizer.transcribeWithContext(
            processedAudio, context);
        
        // 4. Post-processing
        ProcessedTranscription processed = postProcessor.process(transcription, context);
        
        // 5. Call analysis: sentiment, key events, outcome
        CallAnalysis analysis = analyzeContent(processed, metadata);
        
        return CallAnalysisReport.builder()
            .callId(metadata.getCallId())
            .transcript(processed.getProcessedText())
            .speakerTurns(processed.getSpeakerTurns())
            .analysis(analysis)
            .transcriptionConfidence(transcription.getConfidence())
            .build();
    }
    
    private CallAnalysis analyzeContent(ProcessedTranscription transcript, 
                                          CallMetadata metadata) {
        String prompt = String.format("""
            以下是一段客服通话转录：
            
            %s
            
            请分析以下内容，以JSON格式返回：
            {
              "call_reason": "客户来电原因（一句话）",
              "resolution": "问题是否解决（YES/NO/PARTIAL）",
              "customer_sentiment": "客户情绪（POSITIVE/NEUTRAL/NEGATIVE/ANGRY）",
              "key_issues": ["关键问题点1", "关键问题点2"],
              "action_items": ["需要跟进的事项1"],
              "quality_score": 0-100的服务质量分,
              "quality_issues": ["服务问题1", "服务问题2"]
            }
            """, transcript.getProcessedText());
        
        String response = llmClient.complete(prompt);
        return parseCallAnalysis(response);
    }
}

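Pitfalls we hit

Pitfall 1: mixed Chinese and English

A lot of business speech mixes Chinese and English, for example "ERP系统", "SKU", "KPI达成". With mixed input, Whisper often transcribes the English as similar-sounding Chinese characters, for example "ERP" comes out as "意哦啊".

Fix: list the common English abbreviations in the prompt, or correct them afterwards with regex matching.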
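
The "correct afterwards with regex" route can be sketched as a small lookup table compiled into one pattern. Only the "意哦啊" to "ERP" pair comes from the example above; the class name and the idea of keeping the table in code are assumptions, and in practice the table would be built from errors observed in your own transcripts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AcronymFixer {
    private final Map<String, String> corrections = new LinkedHashMap<>();
    private Pattern pattern;

    /** Registers a known misrecognition and the acronym it should become. */
    AcronymFixer add(String misheard, String acronym) {
        corrections.put(misheard, acronym);
        // Longest keys first, so a longer misrecognition wins over one of its prefixes
        String alternation = corrections.keySet().stream()
            .sorted((a, b) -> b.length() - a.length())
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        pattern = Pattern.compile(alternation);
        return this;
    }

    /** Replaces every registered misrecognition found in the text. */
    String fix(String text) {
        if (pattern == null) return text;
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(corrections.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Usage would look like new AcronymFixer().add("意哦啊", "ERP").fix(transcript). Because the match is literal, it only fixes errors you have already seen, which is why the prompt-side fix is still worth doing first.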

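Pitfall 2: dialects

In our experience Whisper handles Cantonese and Hokkien (Minnan) poorly. If a large share of your users speak a dialect, handle them separately rather than counting on Whisper. For this case we use Tencent Cloud's dialect recognition, which works much better.

Pitfall 3: telephone audio quality

Customer service recordings are usually telephone speech, G.711-encoded at an 8 kHz sample rate, and the quality is poor. Upsample to 16 kHz first, then apply noise reduction. In our experience, skipping this step costs at least 10 percentage points of accuracy.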
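
To make the upsampling step concrete, here is a deliberately crude sketch that doubles the sample rate by linear interpolation. It assumes the G.711 audio has already been decoded to 16-bit linear PCM, and the class name is made up. A real pipeline would more likely hand this to a proper resampler with a low-pass filter, for example through ffmpeg or sox; upsampling also cannot restore frequencies the 8 kHz recording never captured.

```java
class Upsampler {
    /** Doubles the sample rate by inserting the midpoint between neighboring samples. */
    static short[] upsample2x(short[] in) {
        short[] out = new short[in.length * 2];
        for (int i = 0; i < in.length; i++) {
            int next = (i + 1 < in.length) ? in[i + 1] : in[i]; // hold the last sample
            out[2 * i] = in[i];
            out[2 * i + 1] = (short) ((in[i] + next) / 2);
        }
        return out;
    }
}
```

An 8 kHz buffer of N samples becomes a 16 kHz buffer of 2N samples covering the same duration.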

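Pitfall 4: drift in long meeting recordings

With one- or two-hour meeting recordings processed by rolling window, accumulated error grows as the recording goes on and the context breaks easily. The fix is to periodically re-align the full text and to keep each cut point as semantically complete as possible.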
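
One small, mechanical part of keeping a long recording aligned is mapping each chunk's local timestamps back onto the timeline of the whole recording before merging. The sketch below shows only that step; the record types are hypothetical stand-ins for the segment classes used earlier, and it does not attempt the full-text re-alignment itself.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkMerger {
    /** A transcribed segment with times in seconds (hypothetical minimal type). */
    record Segment(double start, double end, String text) {}

    /** One chunk's segments plus where the chunk began in the original recording. */
    record ChunkResult(double offsetSeconds, List<Segment> segments) {}

    /** Shifts each chunk's segment times by the chunk offset and concatenates them in order. */
    static List<Segment> merge(List<ChunkResult> chunks) {
        List<Segment> merged = new ArrayList<>();
        for (ChunkResult chunk : chunks) {
            for (Segment s : chunk.segments()) {
                merged.add(new Segment(
                    s.start() + chunk.offsetSeconds(),
                    s.end() + chunk.offsetSeconds(),
                    s.text()));
            }
        }
        return merged;
    }
}
```

The offset for each chunk is the start sample of the cut divided by the sample rate, so it should be recorded at split time rather than reconstructed from transcript lengths.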

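Summary

The Whisper API is a good starting point for speech-to-text, but using it well in production takes more:

Audio preprocessing is the foundation; with poor audio, even the best model won't help
Use the prompt well: feeding business vocabulary to the model improves accuracy noticeably
Real-time scenarios need a rolling window; watch the overlap and the increment extraction
Don't skip post-processing; punctuation restoration and term correction matter
Monitor confidence, and give low-confidence segments special handling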
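
For the last point, a simple way to start is to flag individual segments using the three metrics that verbose_json already returns. The sketch below borrows its thresholds from the decoding defaults of the open-source Whisper implementation (-1.0, 2.4, and 0.6); whether those are the right cut-offs for API output on your audio is an assumption to verify, and the types here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ConfidenceMonitor {
    /** Per-segment metrics as returned in verbose_json (hypothetical minimal type). */
    record Segment(String text, double avgLogprob, double compressionRatio, double noSpeechProb) {}

    // Borrowed from the open-source Whisper decoding defaults; tune on your own data
    static final double LOGPROB_THRESHOLD = -1.0;
    static final double COMPRESSION_RATIO_THRESHOLD = 2.4;
    static final double NO_SPEECH_THRESHOLD = 0.6;

    /** True if any of the three signals suggests the segment should be reviewed. */
    static boolean needsReview(Segment s) {
        boolean lowConfidence = s.avgLogprob() < LOGPROB_THRESHOLD;
        boolean repetitive = s.compressionRatio() > COMPRESSION_RATIO_THRESHOLD; // likely looping output
        boolean probablySilence = s.noSpeechProb() > NO_SPEECH_THRESHOLD;
        return lowConfidence || repetitive || probablySilence;
    }

    /** Returns the segments that should go to manual review or a second pass. */
    static List<Segment> flag(List<Segment> segments) {
        List<Segment> flagged = new ArrayList<>();
        for (Segment s : segments) {
            if (needsReview(s)) flagged.add(s);
        }
        return flagged;
    }
}
```

Flagged segments that contain amounts, names, or domain terms are the ones most worth routing to a human, since those are exactly where an error ruins the record.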

Speech is a very information-dense input channel; done well, it adds a lot of value to an AI system.