Article 2209: Integrating Speech-to-Text with LLMs: Engineering a Voice Assistant Backend

Lao Zhang | 2026/4/30 | about 6 minutes

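Intended readers: Java engineers who need to build voice interaction features | Reading time: about 16 minutes | Core value: a complete backend implementation of speech recognition + LLM, including streaming and multi-turn conversation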

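Demand for voice features has always been there, but getting them to feel smooth is genuinely hard.

I have built two voice assistant projects and hit plenty of pitfalls along the way. The first added voice input to a customer service system: the user speaks, the system transcribes the speech and answers. The second produced meeting records for conference rooms, generating the minutes while the meeting was still in progress.

The two requirements look different, but the engineering core is the same: speech recognition (ASR) + LLM understanding + streaming output.

The hard part is not any single stage but controlling latency once the three stages are chained together. The time from the end of the user's speech to the first character of the AI's reply has to stay under 1.5 seconds, otherwise the interaction feels sluggish.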

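1. Choosing an ASR Toolchain

OpenAI Whisper (recommended general-purpose choice)

Best recognition quality in my experience (especially for spoken Chinese and technical terms)
Multilingual (close to 100 languages)
Local deployment: Whisper Large v3 needs about 10GB of GPU memory
Cloud API: $0.006 per minute, good value for money

Baidu / iFlytek / Alibaba Cloud ASR (for deployments in China)

Strong Chinese recognition, with better dialect support
Real-time streaming recognition (transcribes while the user is still speaking), suited to low-latency scenarios
Priced per call, which works out more expensive than Whisper at high volume

Recommendations:

Meeting records (processed after the fact): Whisper, quality first
Real-time voice interaction (low latency): Baidu / iFlytek, which support streaming
Multilingual scenarios: Whisper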
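
The recommendations and the per-minute price above can be written down as a small helper. This is only an illustrative sketch: `AsrSelector`, its enum, and the tie-break for a request that is both real-time and multilingual (Whisper wins here) are assumptions made for the example, not part of any vendor SDK.

```java
// Hypothetical helper that encodes the rules of thumb listed above.
final class AsrSelector {

    enum Engine { WHISPER, DOMESTIC_STREAMING }

    /**
     * Picks an engine from two coarse requirements.
     * Assumption: when both flags are set, multilingual support takes priority.
     */
    static Engine choose(boolean needsRealtime, boolean multilingual) {
        if (multilingual) {
            return Engine.WHISPER;              // multilingual scenarios: Whisper
        }
        if (needsRealtime) {
            return Engine.DOMESTIC_STREAMING;   // low latency: Baidu / iFlytek streaming ASR
        }
        return Engine.WHISPER;                  // after-the-fact processing: quality first
    }

    /** Cost in USD of the cloud Whisper API at the $0.006 per minute quoted above. */
    static double whisperApiCostUsd(double audioMinutes) {
        return audioMinutes * 0.006;
    }
}
```

For example, 1,000 minutes of meeting audio through the cloud API comes to about 6 USD at that price.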

2. Integrating Whisper from Java

Whisper is a Python library, so the Java side integrates with it through an HTTP service:

# whisper_service.py (Python side, minimal implementation)
from typing import Optional

from fastapi import FastAPI, File, Form, UploadFile
import whisper
import tempfile
import os

app = FastAPI()
model = whisper.load_model("large-v3")  # load the model once at startup

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    # Read from the multipart form, matching the field the Java client sends;
    # None lets Whisper auto-detect the language
    language: Optional[str] = Form(None),
):
    # Save the uploaded audio to a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    
    try:
        result = model.transcribe(tmp_path, language=language, 
                                   word_timestamps=True)
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"]
        }
    finally:
        # Always remove the temporary file, even if transcription fails
        os.unlink(tmp_path)

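// WhisperASRClient.java (Java side)
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

@Component
public class WhisperASRClient {
    
    private static final Logger log = LoggerFactory.getLogger(WhisperASRClient.class);
    
    private final RestTemplate restTemplate;
    
    @Value("${asr.whisper.url:http://localhost:8200}")
    private String whisperUrl;
    
    // Constructor injection; assumes a RestTemplate bean is defined elsewhere
    public WhisperASRClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }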
    
    /**
     * Speech to text
     * @param audioBytes audio file as a byte array (WAV/MP3/M4A/WebM and similar formats)
     * @param language language code such as "zh" or "en"; null means auto-detect
     */
    public TranscriptionResult transcribe(byte[] audioBytes, String language) {
        long startTime = System.currentTimeMillis();
        
        try {
            HttpHeaders headers = new HttpHeaders();
            headers.setContentType(MediaType.MULTIPART_FORM_DATA);
            
            MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
            body.add("file", new ByteArrayResource(audioBytes) {
                @Override
                public String getFilename() { return "audio.wav"; }
            });
            
            if (language != null) {
                body.add("language", language);
            }
            
            HttpEntity<MultiValueMap<String, Object>> entity = new HttpEntity<>(body, headers);
            
            ResponseEntity<Map> response = restTemplate.postForEntity(
                whisperUrl + "/transcribe", entity, Map.class);
            
            long elapsed = System.currentTimeMillis() - startTime;
            
            @SuppressWarnings("unchecked")
            Map<String, Object> responseBody = response.getBody();
            
            // Guard against an empty body or missing text before dereferencing
            if (responseBody == null || responseBody.get("text") == null) {
                return new TranscriptionResult("", language, elapsed, false,
                    "ASR service returned no text");
            }
            
            String text = (String) responseBody.get("text");
            String detectedLanguage = (String) responseBody.get("language");
            
            log.info("Transcription finished in {}ms, detected language: {}", elapsed, detectedLanguage);
            
            return new TranscriptionResult(text.trim(), detectedLanguage, elapsed, true, null);
            
        } catch (Exception e) {
            log.error("Transcription failed", e);
            return new TranscriptionResult("", language,
                System.currentTimeMillis() - startTime, false, e.getMessage());
        }
    }
    
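    /**
     * Convert audio to WAV (Whisper handles WAV input most reliably)
     * Invokes the ffmpeg command-line tool, which must be installed on the system
     */
    public byte[] convertToWAV(byte[] inputAudio, String inputFormat) throws IOException {
        // Write the input to a temporary file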
        Path inputPath = Files.createTempFile("audio_input_", "." + inputFormat);
        Path outputPath = Files.createTempFile("audio_output_", ".wav");
        
        try {
            Files.write(inputPath, inputAudio);
            
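            ProcessBuilder pb = new ProcessBuilder(
                "ffmpeg", "-y",
                "-i", inputPath.toString(),
                "-ar", "16000",    // 16kHz sample rate
                "-ac", "1",        // mono
                "-f", "wav",
                outputPath.toString()
            );
            pb.redirectErrorStream(true);
            // Discard ffmpeg's log output; an unread pipe can fill up and block waitFor()
            pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);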
            
            Process process = pb.start();
            int exitCode = process.waitFor();
            
            if (exitCode != 0) {
                throw new IOException("ffmpeg conversion failed, exit code: " + exitCode);
            }
            
            return Files.readAllBytes(outputPath);
            
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Audio conversion was interrupted", e);
        } finally {
            Files.deleteIfExists(inputPath);
            Files.deleteIfExists(outputPath);
        }
    }
    
    public record TranscriptionResult(String text, String language, long latencyMs, 
                                       boolean success, String errorMessage) {}
}
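
Since `convertToWAV` targets 16kHz mono, it can be useful to check whether an upload already has that shape before spawning ffmpeg. The sketch below is one possible way to do so and is not part of the client above: `WavHeader` is a hypothetical helper, and it assumes the canonical WAV layout in which the `fmt ` chunk immediately follows the RIFF header. Files with extra chunks in front are simply reported as unknown (null).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads channel count and sample rate from a canonical WAV header.
final class WavHeader {
    final int channels;
    final int sampleRate;
    final int bitsPerSample;

    private WavHeader(int channels, int sampleRate, int bitsPerSample) {
        this.channels = channels;
        this.sampleRate = sampleRate;
        this.bitsPerSample = bitsPerSample;
    }

    /** Returns null unless the bytes start with RIFF....WAVE followed directly by a "fmt " chunk. */
    static WavHeader parse(byte[] audio) {
        if (audio == null || audio.length < 36) {
            return null;
        }
        if (!tag(audio, 0).equals("RIFF") || !tag(audio, 8).equals("WAVE")
                || !tag(audio, 12).equals("fmt ")) {
            return null;
        }
        // WAV stores its numeric fields in little-endian byte order
        ByteBuffer buf = ByteBuffer.wrap(audio).order(ByteOrder.LITTLE_ENDIAN);
        int channels = buf.getShort(22);
        int sampleRate = buf.getInt(24);
        int bitsPerSample = buf.getShort(34);
        return new WavHeader(channels, sampleRate, bitsPerSample);
    }

    /** True when the file already matches the 16kHz mono target used by convertToWAV. */
    boolean isSixteenKhzMono() {
        return channels == 1 && sampleRate == 16000;
    }

    private static String tag(byte[] bytes, int offset) {
        return new String(bytes, offset, 4, StandardCharsets.US_ASCII);
    }
}
```

A caller could skip the ffmpeg step when `WavHeader.parse(bytes)` is non-null and `isSixteenKhzMono()` returns true, and fall back to conversion in every other case.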

3. The Complete Backend Pipeline for a Voice Assistant

Once ASR finishes, the text goes to the LLM, which streams its output:

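import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// ConversationHistoryService and ASRException are application classes not shown in this article
@Service
public class VoiceAssistantService {
    
    private static final Logger log = LoggerFactory.getLogger(VoiceAssistantService.class);
    
    private final WhisperASRClient asrClient;
    private final ChatClient chatClient;
    private final ConversationHistoryService historyService;
    
    // Constructor injection; assumes a ChatClient bean is defined elsewhere
    public VoiceAssistantService(WhisperASRClient asrClient, ChatClient chatClient,
                                 ConversationHistoryService historyService) {
        this.asrClient = asrClient;
        this.chatClient = chatClient;
        this.historyService = historyService;
    }
    
    private static final String SYSTEM_PROMPT = """
        You are a voice assistant. The user talks to you by voice, and what follows is the transcribed text.
        Note:
        1. Speech recognition may contain minor errors; infer the user's intent from context
        2. Keep answers concise and suitable for being read aloud (avoid Markdown formatting)
        3. If the transcription is unclear, politely ask the user to repeat
        """;
    
    /**
     * Process voice input and return a streaming text response
     */
    public Flux<String> processVoiceInput(byte[] audioBytes, String sessionId, 
                                           String audioFormat) {
        return Mono.fromCallable(() -> {
            // 1. Audio preprocessing (convert to WAV)
            byte[] wavAudio = "wav".equalsIgnoreCase(audioFormat) 
                ? audioBytes 
                : asrClient.convertToWAV(audioBytes, audioFormat);
            
            // 2. Speech recognition
            WhisperASRClient.TranscriptionResult transcription = 
                asrClient.transcribe(wavAudio, "zh");
            
            if (!transcription.success() || transcription.text().isEmpty()) {
                throw new ASRException("Speech recognition failed or returned empty text");
            }
            
            log.info("Transcription result: {}", transcription.text());
            return transcription.text();
        })
        // ffmpeg and the HTTP call block, so keep them off the event-loop threads
        .subscribeOn(Schedulers.boundedElastic())
        .flatMapMany(userText -> {
            // 3. Load the conversation history before recording the new turn
            List<Message> history = historyService.getHistory(sessionId);
            
            // 4. Save the user message first so it precedes the assistant reply in history
            historyService.addUserMessage(sessionId, userText);
            
            // 5. Stream the LLM response
            return chatClient.prompt()
                .system(SYSTEM_PROMPT)
                .messages(history)
                .user(userText)
                .stream()
                .content()
                // 6. Append each chunk to the conversation history as it arrives
                .doOnNext(chunk -> historyService.appendAssistantChunk(sessionId, chunk));
        });
    }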
    
    /**
     * WebSocket endpoint that handles the audio stream
     */
    @ServerEndpoint("/voice-assistant/{sessionId}")
    @Component
    public static class VoiceAssistantWebSocket {
        
        @OnMessage
        public void onAudioChunk(byte[] audioData, Session session,
                                  @PathParam("sessionId") String sessionId) {
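            // The WebSocket receives audio data and processes it once a full sentence has accumulated
            // Simplified for demonstration; a real implementation needs VAD (voice activity detection) to detect the end of a sentence
        }
        
        @OnOpen
        public void onOpen(Session session, @PathParam("sessionId") String sessionId) {
            log.info("Voice session opened: sessionId={}", sessionId);
        }
        
        @OnClose
        public void onClose(Session session, @PathParam("sessionId") String sessionId) {
            log.info("Voice session closed: sessionId={}", sessionId);
        }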
        }
    }
}
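
The pipeline relies on a `ConversationHistoryService` that the article does not show. As a rough idea of what it might look like, here is an in-memory sketch with the same three method names. Everything about it is an assumption: the class name `InMemoryHistory`, the use of plain strings instead of Spring AI `Message` objects, and the choice to merge streamed chunks into a single assistant turn.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the history service used by the pipeline.
final class InMemoryHistory {

    private static final class Turn {
        final String role;
        final StringBuilder text = new StringBuilder();

        Turn(String role, String text) {
            this.role = role;
            this.text.append(text);
        }
    }

    private final Map<String, List<Turn>> sessions = new HashMap<>();

    private List<Turn> turns(String sessionId) {
        return sessions.computeIfAbsent(sessionId, id -> new ArrayList<>());
    }

    synchronized void addUserMessage(String sessionId, String text) {
        turns(sessionId).add(new Turn("user", text));
    }

    /** Streamed chunks extend the current assistant turn instead of creating one turn per chunk. */
    synchronized void appendAssistantChunk(String sessionId, String chunk) {
        List<Turn> list = turns(sessionId);
        if (list.isEmpty() || !list.get(list.size() - 1).role.equals("assistant")) {
            list.add(new Turn("assistant", ""));
        }
        list.get(list.size() - 1).text.append(chunk);
    }

    /** Snapshot of the session as "role: text" lines, oldest first. */
    synchronized List<String> getHistory(String sessionId) {
        List<String> out = new ArrayList<>();
        for (Turn turn : turns(sessionId)) {
            out.add(turn.role + ": " + turn.text);
        }
        return out;
    }
}
```

With this shape, the order of calls matters: a user message has to be recorded before the assistant chunks of the same turn, otherwise the reply ends up ahead of the question in the history. A production version would more likely sit on Redis or a database and cap the number of turns kept per session.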

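4. Voice Activity Detection (VAD): Deciding Whether the User Has Finished Speaking

VAD (Voice Activity Detection) is a key component of voice interaction that is easy to overlook: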

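import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Queue;

import org.springframework.stereotype.Component;

@Component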
public class VoiceActivityDetector {
    
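    /**
     * Simple volume-based VAD.
     * Detects whether an audio frame contains voice activity.
     * Expects 16-bit little-endian mono PCM.
     */
    public VADResult detect(byte[] pcmAudio, int sampleRate) {
        // Convert the PCM bytes into an array of samples
        short[] samples = new short[pcmAudio.length / 2];
        ByteBuffer.wrap(pcmAudio).order(ByteOrder.LITTLE_ENDIAN)
            .asShortBuffer().get(samples);
        
        if (samples.length == 0) {
            // An empty frame carries no speech; this also avoids 0/0 = NaN below
            return new VADResult(false, Double.NEGATIVE_INFINITY, 0);
        }
        
        // Compute the RMS (root mean square, a proxy for volume)
        double sumSquares = 0;
        for (short sample : samples) {
            sumSquares += (double) sample * sample;
        }
        double rms = Math.sqrt(sumSquares / samples.length);
        double dbFS = 20 * Math.log10(rms / 32768.0); // convert to dBFS (pure silence gives -Infinity)
        
        // Decide whether there is voice activity.
        // Rule of thumb: speech is usually above -40 dBFS and ambient noise below -60 dBFS; tune per microphone
        boolean isSpeech = dbFS > -40;
        
        return new VADResult(isSpeech, dbFS, estimateSpeechDuration(samples, sampleRate));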
    }
    
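    /**
     * Detect silence (used to decide that the user has stopped speaking).
     * If the most recent frames have been continuously silent for the threshold
     * (for example 500ms), treat the sentence as finished.
     */
    public boolean isSilenceDetected(Queue<VADResult> recentResults, int silenceThresholdMs) {
        if (recentResults.isEmpty()) return false;
        
        // Assumes the queue iterates oldest-first. Only the trailing run of silent
        // frames counts, so any speech frame resets the accumulated silence.
        double trailingSilenceMs = 0;
        for (VADResult r : recentResults) {
            trailingSilenceMs = r.isSpeech() ? 0 : trailingSilenceMs + r.durationSeconds() * 1000;
        }
        return trailingSilenceMs >= silenceThresholdMs;
    }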
    
    private double estimateSpeechDuration(short[] samples, int sampleRate) {
        return (double) samples.length / sampleRate;
    }
    
    public record VADResult(boolean isSpeech, double dbFS, double durationSeconds) {}
}
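
The WebSocket handler in the previous section leaves "accumulate a full sentence, then process" as a comment. One way that buffering could work is sketched below: frames are fed in one at a time, leading silence is dropped, and the buffered utterance is returned once enough trailing silence has built up. `UtteranceSegmenter` and its parameters are hypothetical, it repeats the RMS calculation so the example is self-contained, and it assumes fixed-length frames of 16-bit little-endian mono PCM.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: buffers PCM frames and cuts an utterance at trailing silence.
final class UtteranceSegmenter {
    private final double speechThresholdDb;
    private final int silenceThresholdMs;
    private final int frameMs;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private boolean heardSpeech = false;
    private int trailingSilenceMs = 0;

    UtteranceSegmenter(double speechThresholdDb, int silenceThresholdMs, int frameMs) {
        this.speechThresholdDb = speechThresholdDb;
        this.silenceThresholdMs = silenceThresholdMs;
        this.frameMs = frameMs;
    }

    /** Feeds one frame; returns the utterance bytes when the sentence ends, otherwise null. */
    byte[] accept(byte[] frame) {
        boolean speech = dbFs(frame) > speechThresholdDb;
        if (!heardSpeech && !speech) {
            return null;                       // drop silence before the user starts talking
        }
        heardSpeech = true;
        buffer.write(frame, 0, frame.length);
        trailingSilenceMs = speech ? 0 : trailingSilenceMs + frameMs;
        if (trailingSilenceMs < silenceThresholdMs) {
            return null;                       // still inside the sentence
        }
        byte[] utterance = buffer.toByteArray();
        buffer.reset();                        // start fresh for the next sentence
        heardSpeech = false;
        trailingSilenceMs = 0;
        return utterance;
    }

    /** RMS level of a 16-bit little-endian PCM frame in dBFS. */
    static double dbFs(byte[] frame) {
        int n = frame.length / 2;
        if (n == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double sumSquares = 0;
        for (int i = 0; i < n; i++) {
            int low = frame[2 * i] & 0xFF;
            int high = frame[2 * i + 1];       // keeps the sign of the sample
            short sample = (short) ((high << 8) | low);
            sumSquares += (double) sample * sample;
        }
        return 20 * Math.log10(Math.sqrt(sumSquares / n) / 32768.0);
    }
}
```

In a handler like `onAudioChunk`, each incoming frame would go through `accept`, and a non-null result would be handed to the ASR step. The returned bytes include the trailing silence, which could be trimmed before transcription.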

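5. Latency Optimization: Cutting End-to-End Time in Voice Interaction

Latency is the core of the user experience. From the moment the user stops speaking to the moment they see or hear the AI's response, the target is under 1.5 seconds:

Time breakdown:
- Audio transmission (network): <100ms
- VAD end-of-sentence detection: +300ms (waiting to confirm silence)
- ASR recognition (Whisper Large on a local GPU): ~800ms
- LLM first token: ~500ms (with streaming, the first token arrives sooner)
- Total: ~1700ms, which misses the 1.5-second target

Optimization directions:
1. Use streaming ASR (recognize while the user speaks): saves the 300ms silence wait
2. Use a small model such as Whisper Tiny/Base for a first pass: ~200ms (quality drops)
3. Use a faster LLM (Qwen-turbo instead of Qwen-plus): first token drops from 500ms to about 200ms
Optimized total: ~700ms, a clearly noticeable improvement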
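
The breakdown above is simple addition, and keeping it in code makes it easy to re-check whenever a stage changes. The following is a small sketch with a hypothetical `LatencyBudget` class; the stage names are made up, and the numbers are the article's estimates with the upper bound of 100ms used for the network. Summing the optimized per-stage figures as listed gives 500ms, a little below the ~700ms quoted, which presumably leaves room for overhead that is not itemized.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: sums per-stage latency estimates and compares them to a target.
final class LatencyBudget {
    private final Map<String, Integer> stagesMs = new LinkedHashMap<>();

    LatencyBudget stage(String name, int ms) {
        stagesMs.put(name, ms);
        return this;
    }

    int totalMs() {
        int sum = 0;
        for (int ms : stagesMs.values()) {
            sum += ms;
        }
        return sum;
    }

    boolean within(int targetMs) {
        return totalMs() <= targetMs;
    }

    /** The unoptimized pipeline from the time breakdown. */
    static LatencyBudget baseline() {
        return new LatencyBudget()
            .stage("network", 100)
            .stage("vadSilenceWait", 300)
            .stage("asrWhisperLarge", 800)
            .stage("llmFirstToken", 500);
    }

    /** All three optimizations applied, using the per-stage figures as listed. */
    static LatencyBudget optimized() {
        return new LatencyBudget()
            .stage("network", 100)
            .stage("vadSilenceWait", 0)
            .stage("asrSmallModel", 200)
            .stage("llmFirstToken", 200);
    }
}
```

`LatencyBudget.baseline().within(1500)` is false (the total is 1700ms), while the optimized budget fits with a wide margin. Measured timings from production logs would be a better input than these estimates.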