Voice AI Application Development: Building Voice Interaction in Java
2026-06-28 · about 19 min read · Tags: Voice AI, TTS, STT, Whisper, Spring AI, Java voice interaction
The voice button that lifted retention by 25%
In November 2025, Wang Fang, a product manager at an online education company in Guangzhou, wrote a post-interview report that startled the engineering team.
One number stood out: of 200 core users interviewed, 147 (73.5%) raised the same request — "I want to just say my question out loud instead of typing it."
Wang Fang wrote:
"30% of our users are working professionals over 45. They type slowly and struggle to phrase things; many would rather give up than spend 10 minutes typing out a question. With voice input, they would ask far more questions."
Li Qiang, the engineering lead, took on the task.
Two weeks later, voice Q&A shipped.
The numbers three months on:
- 7-day user retention: 41% → 52% (up 25%)
- Questions per user per day: 3.2 → 5.7 (up 78%)
- Retention among users 45+: 28% → 61% (up 118%)
Below is a full reconstruction of Li Qiang's technical approach — all of it reproducible in Java.
Chapter 1: The End-to-End Voice AI Architecture
1.1 The complete voice interaction pipeline
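In outline, the pipeline looks like this (a rough sketch; the components map onto the chapters below):

```
[Mic / recorder] → [front-end VAD] → [file upload or WebSocket stream]
    → [ASR: Whisper / streaming engine] → [recognized text]
    → [LLM: Spring AI ChatClient] → [answer text]
    → [TTS: OpenAI TTS] → [MP3] → [OSS/CDN] → [playback on the client]
```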
1.2 Technology selection matrix
Recommended combinations, by scenario:
| Scenario | ASR pick | TTS pick | Est. monthly cost |
|---|---|---|---|
| Domestic (China) consumer product | Baidu / Alibaba Cloud | Alibaba Cloud TTS | ¥500-2000 |
| International product | OpenAI Whisper | OpenAI TTS | $200-800 |
| Lowest possible cost | Local Whisper | Edge TTS | $0 |
| Enterprise on-premises | Local Whisper | Azure TTS (private deployment) | hardware-dependent |
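The matrix above can be encoded as a tiny selection helper. A sketch only — the enum constants and the `Pick` record are invented for illustration, not part of the project code:

```java
// Minimal encoding of the provider-selection matrix above. All names are illustrative.
public class ProviderMatrix {

    public enum Scenario { DOMESTIC_CONSUMER, INTERNATIONAL, LOWEST_COST, ON_PREMISES }

    public record Pick(String asr, String tts) {}

    public static Pick recommend(Scenario scenario) {
        return switch (scenario) {
            case DOMESTIC_CONSUMER -> new Pick("Baidu/Alibaba Cloud ASR", "Alibaba Cloud TTS");
            case INTERNATIONAL     -> new Pick("OpenAI Whisper", "OpenAI TTS");
            case LOWEST_COST       -> new Pick("Local Whisper", "Edge TTS");
            case ON_PREMISES       -> new Pick("Local Whisper", "Azure TTS (private)");
        };
    }

    public static void main(String[] args) {
        System.out.println(recommend(Scenario.INTERNATIONAL));
    }
}
```

A switch over an enum keeps the decision in one place, so adding a scenario later is a compile-checked change.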
Chapter 2: Integrating OpenAI Whisper with Spring AI
2.1 Dependencies and configuration
<!-- pom.xml -->
<dependencies>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Audio transcoding (JAVE2, bundles FFmpeg binaries) -->
    <dependency>
        <groupId>ws.schild</groupId>
        <artifactId>jave-all-deps</artifactId>
        <version>3.3.1</version>
    </dependency>
    <!-- Audio processing -->
    <dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>javacv-platform</artifactId>
        <version>1.5.9</version>
    </dependency>
</dependencies>

# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            language: zh                   # pin the language to improve accuracy
            response-format: verbose_json  # include timestamp information
            temperature: 0                 # 0 for transcription, reduces hallucinated text
        speech:
          options:
            model: tts-1
            voice: nova                    # alloy/echo/fable/onyx/nova/shimmer
            speed: 1.0
            response-format: mp3

# Voice-related settings
voice:
  max-file-size: 25MB                      # Whisper API upload limit
  supported-formats: [mp3, mp4, mpeg, mpga, m4a, wav, webm]
  temp-dir: /tmp/voice-uploads
  cleanup-interval: 3600                   # temp-file cleanup interval (seconds)

2.2 The speech-to-text core service
package com.example.voice.service;

import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
import org.springframework.ai.openai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.openai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import ws.schild.jave.Encoder;
import ws.schild.jave.EncoderException;
import ws.schild.jave.MultimediaObject;
import ws.schild.jave.encode.AudioAttributes;
import ws.schild.jave.encode.EncodingAttributes;
import lombok.extern.slf4j.Slf4j;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/**
 * Speech-to-text service (ASR - Automatic Speech Recognition)
 */
@Slf4j
@Service
public class SpeechToTextService {
    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public SpeechToTextService(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    /**
     * Transcribe an audio byte array to text.
     *
     * @param audioData audio bytes (mp3/wav/webm/m4a supported)
     * @param fileName  file name (Whisper infers the format from the extension)
     * @param language  language code ("zh" = Chinese, "en" = English, null = auto-detect)
     * @return transcription result
     */
    public TranscriptionResult transcribe(byte[] audioData, String fileName, String language) {
        Instant start = Instant.now();
        // Enforce the Whisper API file-size limit (25 MB)
        if (audioData.length > 25 * 1024 * 1024) {
            throw new BusinessException(ErrorCode.AUDIO_TOO_LARGE,
                "Audio file too large; max 25MB, got: " + audioData.length / 1024 / 1024 + "MB");
        }
        // Build the request
        ByteArrayResource audioResource = new ByteArrayResource(audioData) {
            @Override
            public String getFilename() {
                return fileName; // must be set: Whisper relies on the extension to detect the format
            }
        };
        OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
            .language(language)
            .temperature(0.0f)
            .responseFormat(OpenAiAudioTranscriptionOptions.TranscriptResponseFormat.VERBOSE_JSON)
            .build();
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioResource, options);
        try {
            AudioTranscriptionResponse response = transcriptionModel.call(prompt);
            long elapsedMs = Duration.between(start, Instant.now()).toMillis();
            String text = response.getResult().getOutput();
            log.info("Transcription OK, audio: {}KB, took: {}ms, text length: {} chars",
                audioData.length / 1024, elapsedMs, text.length());
            return TranscriptionResult.builder()
                .text(text.trim())
                .language(language)
                .durationMs(elapsedMs)
                .audioSizeBytes(audioData.length)
                .build();
        } catch (Exception e) {
            log.error("Transcription failed, file: {}", fileName, e);
            throw new BusinessException(ErrorCode.ASR_FAILED, "Speech recognition failed, please retry");
        }
    }

    /**
     * Transcribe from a MultipartFile.
     */
    public TranscriptionResult transcribeFromFile(MultipartFile file, String language) {
        validateAudioFile(file);
        try {
            byte[] audioData = file.getBytes();
            return transcribe(audioData, file.getOriginalFilename(), language);
        } catch (IOException e) {
            throw new BusinessException(ErrorCode.FILE_READ_FAILED, "Failed to read audio file");
        }
    }

    /**
     * Audio preprocessing: convert WebM/OGG to MP3 (improves Whisper recognition).
     */
    public byte[] convertToMp3(byte[] inputData, String sourceFormat) throws IOException {
        if ("mp3".equalsIgnoreCase(sourceFormat) || "wav".equalsIgnoreCase(sourceFormat)) {
            return inputData; // already supported, pass through
        }
        // Transcode with JAVE2
        Path tempInput = Files.createTempFile("audio_input_", "." + sourceFormat);
        Path tempOutput = Files.createTempFile("audio_output_", ".mp3");
        try {
            Files.write(tempInput, inputData);
            AudioAttributes audioAttributes = new AudioAttributes();
            audioAttributes.setCodec("libmp3lame");
            audioAttributes.setBitRate(128000);
            audioAttributes.setChannels(1);         // mono, smaller files
            audioAttributes.setSamplingRate(16000); // 16 kHz, Whisper's preferred rate
            EncodingAttributes encodingAttributes = new EncodingAttributes();
            encodingAttributes.setOutputFormat("mp3");
            encodingAttributes.setAudioAttributes(audioAttributes);
            Encoder encoder = new Encoder();
            encoder.encode(new MultimediaObject(tempInput.toFile()),
                tempOutput.toFile(),
                encodingAttributes);
            byte[] mp3Data = Files.readAllBytes(tempOutput);
            log.info("Transcoded {} -> mp3, original: {}KB, converted: {}KB",
                sourceFormat, inputData.length / 1024, mp3Data.length / 1024);
            return mp3Data;
        } catch (EncoderException e) {
            throw new IOException("Audio transcoding failed", e);
        } finally {
            Files.deleteIfExists(tempInput);
            Files.deleteIfExists(tempOutput);
        }
    }

    private void validateAudioFile(MultipartFile file) {
        if (file.isEmpty()) {
            throw new BusinessException(ErrorCode.EMPTY_AUDIO_FILE, "Audio file is empty");
        }
        String filename = file.getOriginalFilename();
        if (filename == null) {
            throw new BusinessException(ErrorCode.INVALID_AUDIO_FORMAT, "Invalid file name");
        }
        String extension = filename.substring(filename.lastIndexOf(".") + 1).toLowerCase();
        List<String> supported = List.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");
        if (!supported.contains(extension)) {
            throw new BusinessException(ErrorCode.INVALID_AUDIO_FORMAT,
                "Unsupported audio format: " + extension + "; supported: " + String.join(", ", supported));
        }
    }
}

Chapter 3: Real-Time Streaming ASR — Recognizing as the User Speaks
3.1 WebSocket real-time recognition architecture
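At a glance (a sketch; the partial/final message shapes follow the protocol documented in the handler below):

```
Browser (recorder, PCM 16 kHz 16-bit mono)
   │ binary audio frames        │ JSON control: start_stream / end_stream / ping
   ▼                            ▼
WebSocket /ws/speech-recognition ──► streaming ASR engine
   ◄── {"type":"partial", ...}   ◄── interim hypotheses
   ◄── {"type":"final", ...}     ──► ChatClient (LLM) ──► {"type":"ai_chunk", ...}
```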
3.2 WebSocket real-time ASR implementation
package com.example.voice.websocket;

import org.springframework.web.socket.*;
import org.springframework.web.socket.handler.AbstractWebSocketHandler;
import lombok.extern.slf4j.Slf4j;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * WebSocket handler for real-time speech recognition.
 *
 * Endpoint: ws://your-server/ws/speech-recognition
 *
 * Message protocol:
 * - Client sends: binary audio frames (PCM, 16 kHz, 16-bit, mono)
 * - Server pushes: recognition results as JSON
 *   {"type": "partial", "text": "hello", "confidence": 0.85}
 *   {"type": "final", "text": "hello world", "confidence": 0.95}
 *   {"type": "error", "message": "recognition timed out"}
 */
@Slf4j
public class SpeechRecognitionWebSocketHandler extends AbstractWebSocketHandler {
    private final StreamingAsrService streamingAsrService;
    private final ChatClient chatClient;
    // One ASR session per WebSocket connection
    private final ConcurrentHashMap<String, AsrSession> sessions = new ConcurrentHashMap<>();

    public SpeechRecognitionWebSocketHandler(StreamingAsrService streamingAsrService,
                                             ChatClient chatClient) {
        this.streamingAsrService = streamingAsrService;
        this.chatClient = chatClient;
    }

    @Override
    public void afterConnectionEstablished(WebSocketSession session) throws Exception {
        String sessionId = session.getId();
        log.info("New speech-recognition connection, sessionId: {}", sessionId);
        // Create the ASR session and wire up its callbacks
        AsrSession asrSession = streamingAsrService.createSession(sessionId,
            new AsrCallback() {
                @Override
                public void onPartialResult(String text, double confidence) {
                    sendMessage(session, Map.of(
                        "type", "partial",
                        "text", text,
                        "confidence", confidence
                    ));
                }
                @Override
                public void onFinalResult(String text, double confidence) {
                    sendMessage(session, Map.of(
                        "type", "final",
                        "text", text,
                        "confidence", confidence
                    ));
                    // Recognition finished: hand the text to the AI automatically
                    processWithLLM(session, text);
                }
                @Override
                public void onError(String errorMessage) {
                    sendMessage(session, Map.of("type", "error", "message", errorMessage));
                }
            });
        sessions.put(sessionId, asrSession);
        // Confirm the connection to the client
        sendMessage(session, Map.of("type", "connected", "sessionId", sessionId));
    }

    @Override
    protected void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
        String sessionId = session.getId();
        AsrSession asrSession = sessions.get(sessionId);
        if (asrSession == null) {
            log.warn("Audio received for unknown session: {}", sessionId);
            return;
        }
        // Copy the frame out of the (possibly read-only or direct) buffer
        // before forwarding it to the ASR engine
        ByteBuffer payload = message.getPayload();
        byte[] audioFrame = new byte[payload.remaining()];
        payload.get(audioFrame);
        asrSession.sendAudioFrame(audioFrame);
    }

    @Override
    protected void handleTextMessage(WebSocketSession session, TextMessage message) {
        // Control messages
        String payload = message.getPayload();
        Map<String, Object> control = JsonUtils.fromJson(payload, Map.class);
        String type = (String) control.get("type");
        switch (type) {
            case "start_stream" -> {
                // Begin a new utterance
                AsrSession asrSession = sessions.get(session.getId());
                if (asrSession != null) {
                    asrSession.reset();
                }
                sendMessage(session, Map.of("type", "stream_started"));
            }
            case "end_stream" -> {
                // End of utterance: flush and wait for the final result.
                // (Avoid naming this method finalize(): it would clash with Object.finalize().)
                AsrSession asrSession = sessions.get(session.getId());
                if (asrSession != null) {
                    asrSession.finish();
                }
            }
            case "ping" -> sendMessage(session, Map.of("type", "pong"));
        }
    }

    @Override
    public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
        String sessionId = session.getId();
        AsrSession asrSession = sessions.remove(sessionId);
        if (asrSession != null) {
            asrSession.close();
        }
        log.info("Speech-recognition connection closed, sessionId: {}, status: {}", sessionId, status);
    }

    /**
     * Feed the recognized text to the LLM once recognition completes.
     */
    private void processWithLLM(WebSocketSession session, String recognizedText) {
        sendMessage(session, Map.of(
            "type", "ai_thinking",
            "message", "AI is thinking..."
        ));
        // Stream the AI answer, pushing chunks as they arrive
        chatClient.prompt()
            .user(recognizedText)
            .stream()
            .content()
            .subscribe(
                chunk -> sendMessage(session, Map.of("type", "ai_chunk", "text", chunk)),
                error -> sendMessage(session, Map.of("type", "ai_error", "message", error.getMessage())),
                () -> sendMessage(session, Map.of("type", "ai_done"))
            );
    }

    private void sendMessage(WebSocketSession session, Object data) {
        if (!session.isOpen()) return;
        try {
            session.sendMessage(new TextMessage(JsonUtils.toJson(data)));
        } catch (IOException e) {
            log.error("Failed to send WebSocket message: {}", e.getMessage());
        }
    }
}

/**
 * WebSocket configuration
 */
@Configuration
@EnableWebSocket
public class WebSocketConfig implements WebSocketConfigurer {
    private final SpeechRecognitionWebSocketHandler speechHandler;

    public WebSocketConfig(SpeechRecognitionWebSocketHandler speechHandler) {
        this.speechHandler = speechHandler;
    }

    @Override
    public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
        registry.addHandler(speechHandler, "/ws/speech-recognition")
            .setAllowedOriginPatterns("*"); // restrict to known origins in production
        // No SockJS fallback here: SockJS does not carry binary frames,
        // and this handler receives raw audio as binary messages.
    }
}

Chapter 4: Implementing TTS (Text-to-Speech)
4.1 Integrating OpenAI TTS
package com.example.voice.service;

import org.springframework.ai.openai.OpenAiAudioSpeechModel;
import org.springframework.ai.openai.OpenAiAudioSpeechOptions;
import org.springframework.ai.openai.audio.speech.SpeechPrompt;
import org.springframework.ai.openai.audio.speech.SpeechResponse;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.DigestUtils;
import lombok.extern.slf4j.Slf4j;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

/**
 * Text-to-speech service (TTS)
 */
@Slf4j
@Service
public class TextToSpeechService {
    private final OpenAiAudioSpeechModel speechModel;
    private final RedisTemplate<String, byte[]> redisTemplate;

    public TextToSpeechService(OpenAiAudioSpeechModel speechModel,
                               RedisTemplate<String, byte[]> redisTemplate) {
        this.speechModel = speechModel;
        this.redisTemplate = redisTemplate;
    }

    /**
     * Convert text to MP3 audio.
     *
     * @param text  text to convert (4096 characters max)
     * @param voice voice (alloy/echo/fable/onyx/nova/shimmer)
     * @param speed speed (0.25-4.0, default 1.0)
     * @return MP3 bytes
     */
    public byte[] textToSpeech(String text, String voice, float speed) {
        // Validate input
        if (text == null || text.trim().isEmpty()) {
            throw new BusinessException(ErrorCode.EMPTY_TEXT, "Text to convert must not be empty");
        }
        // Truncate overlong text (OpenAI TTS caps input at 4096 characters)
        if (text.length() > 4096) {
            log.warn("Text too long ({} chars), truncating to 4096", text.length());
            text = text.substring(0, 4096);
        }
        // Cache lookup: identical text + voice + speed returns the cached audio
        String cacheKey = "tts:" + DigestUtils.md5DigestAsHex(
            (text + voice + speed).getBytes(StandardCharsets.UTF_8));
        byte[] cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) {
            log.debug("TTS cache hit, key: {}", cacheKey);
            return cached;
        }
        // Call the TTS API
        Instant start = Instant.now();
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
            .model("tts-1") // tts-1 (fast, for real time) or tts-1-hd (higher quality)
            .voice(voice)
            .speed(speed)
            .responseFormat(OpenAiAudioSpeechOptions.AudioResponseFormat.MP3)
            .build();
        SpeechPrompt prompt = new SpeechPrompt(text, options);
        SpeechResponse response = speechModel.call(prompt);
        byte[] audioData = response.getResult().getOutput();
        long elapsedMs = Duration.between(start, Instant.now()).toMillis();
        log.info("TTS done, text: {} chars, audio: {}KB, took: {}ms",
            text.length(), audioData.length / 1024, elapsedMs);
        // Cache the result (24 h TTL)
        redisTemplate.opsForValue().set(cacheKey, audioData, Duration.ofHours(24));
        return audioData;
    }

    /**
     * Segmented TTS for long text (over 4096 characters):
     * split by sentence, generate segments concurrently, then merge.
     */
    public byte[] longTextToSpeech(String longText, String voice) throws Exception {
        if (longText.length() <= 4096) {
            return textToSpeech(longText, voice, 1.0f);
        }
        // Split at sentence boundaries (never mid-sentence)
        List<String> segments = splitBySentence(longText, 3000);
        log.info("Long-text TTS, total: {} chars, split into {} segments",
            longText.length(), segments.size());
        // Generate segments concurrently (at most 4 in flight)
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<byte[]>> futures = segments.stream()
                .map(segment -> CompletableFuture.supplyAsync(
                    () -> textToSpeech(segment, voice, 1.0f), executor))
                .collect(Collectors.toList());
            // Wait for every segment
            List<byte[]> audioSegments = new ArrayList<>();
            for (CompletableFuture<byte[]> future : futures) {
                audioSegments.add(future.get(30, TimeUnit.SECONDS));
            }
            // Merge the MP3 segments
            return mergeMp3Files(audioSegments);
        } finally {
            executor.shutdown();
        }
    }

    /**
     * Split at sentence boundaries (avoid breaking inside a sentence).
     */
    private List<String> splitBySentence(String text, int maxLength) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Split after Chinese/ASCII sentence-ending punctuation
        String[] sentences = text.split("(?<=[。!?.!?])");
        for (String sentence : sentences) {
            if (current.length() + sentence.length() > maxLength) {
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current = new StringBuilder();
                }
                // A single overlong sentence: hard-split it repeatedly,
                // so no leftover piece can still exceed maxLength
                while (sentence.length() > maxLength) {
                    segments.add(sentence.substring(0, maxLength));
                    sentence = sentence.substring(maxLength);
                }
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    /**
     * Merge MP3 segments by plain binary concatenation. Most players tolerate
     * concatenated MP3 frames, but the result is not a strictly valid single file
     * (duration metadata can be off); use FFmpeg for production-grade merging.
     */
    private byte[] mergeMp3Files(List<byte[]> audioSegments) throws IOException {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (byte[] segment : audioSegments) {
            merged.write(segment);
        }
        return merged.toByteArray();
    }
}

4.2 Streaming TTS playback API
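The segmentation logic in `splitBySentence` above is easy to get subtly wrong (an overlong sentence must be hard-split repeatedly, not just once), so here is a standalone, testable sketch of the same idea — the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone mirror of the sentence-splitting strategy used by longTextToSpeech:
// split on CJK/ASCII sentence punctuation, pack sentences into segments of at
// most maxLength characters, and hard-split any single sentence that is longer.
public class SentenceSplitter {

    public static List<String> split(String text, int maxLength) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Lookbehind split keeps the punctuation attached to its sentence
        for (String sentence : text.split("(?<=[。!?.!?])")) {
            if (current.length() + sentence.length() > maxLength) {
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
                while (sentence.length() > maxLength) { // hard-split overlong sentences
                    segments.add(sentence.substring(0, maxLength));
                    sentence = sentence.substring(maxLength);
                }
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(split("你好。世界!How are you?好的。", 5));
    }
}
```

Two properties are worth asserting: no segment exceeds the limit, and joining the segments reproduces the input, so no text is ever lost.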
package com.example.voice.controller;

import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;
import lombok.extern.slf4j.Slf4j;

/**
 * TTS API controller
 */
@Slf4j
@RestController
@RequestMapping("/api/voice")
public class VoiceController {
    private final TextToSpeechService ttsService;
    private final SpeechToTextService sttService;
    private final ChatClient chatClient;
    private final OssService ossService;

    public VoiceController(TextToSpeechService ttsService,
                           SpeechToTextService sttService,
                           ChatClient chatClient,
                           OssService ossService) {
        this.ttsService = ttsService;
        this.sttService = sttService;
        this.chatClient = chatClient;
        this.ossService = ossService;
    }

    /**
     * Voice Q&A endpoint (full pipeline: ASR → LLM → TTS)
     */
    @PostMapping("/ask")
    public ResponseEntity<VoiceAnswerResponse> voiceAsk(
            @RequestParam("audio") MultipartFile audioFile,
            @RequestParam(value = "language", defaultValue = "zh") String language,
            @RequestParam(value = "voice", defaultValue = "nova") String ttsVoice,
            @RequestHeader(value = "X-Session-Id", required = false) String sessionId) {
        long startTime = System.currentTimeMillis();
        // Step 1: speech to text (ASR)
        TranscriptionResult transcription = sttService.transcribeFromFile(audioFile, language);
        log.info("ASR done, recognized text: {}", transcription.getText());
        // Step 2: generate the answer with the LLM
        String aiAnswer = chatClient.prompt()
            .system("You are a friendly AI assistant. Answer concisely and clearly, in a style suited to being read aloud (avoid heavy Markdown).")
            .user(transcription.getText())
            .call()
            .content();
        // Step 3: text to speech (TTS)
        byte[] audioData = ttsService.textToSpeech(aiAnswer, ttsVoice, 1.0f);
        // Step 4: upload to OSS and return the URL
        String audioUrl = ossService.uploadAudio(audioData, "mp3");
        long totalTime = System.currentTimeMillis() - startTime;
        return ResponseEntity.ok(VoiceAnswerResponse.builder()
            .recognizedText(transcription.getText())
            .aiAnswer(aiAnswer)
            .audioUrl(audioUrl)
            .totalDurationMs(totalTime)
            .build());
    }

    /**
     * Streaming TTS playback endpoint (play while writing, lowering perceived latency)
     */
    @GetMapping(value = "/tts/stream", produces = "audio/mpeg")
    public ResponseEntity<StreamingResponseBody> ttsStream(
            @RequestParam String text,
            @RequestParam(defaultValue = "nova") String voice) {
        StreamingResponseBody stream = outputStream -> {
            try {
                byte[] audioData = ttsService.textToSpeech(text, voice, 1.0f);
                // Write in 64 KB chunks
                int chunkSize = 64 * 1024;
                for (int offset = 0; offset < audioData.length; offset += chunkSize) {
                    int length = Math.min(chunkSize, audioData.length - offset);
                    outputStream.write(audioData, offset, length);
                    outputStream.flush();
                }
            } catch (Exception e) {
                log.error("Streaming TTS output failed", e);
            }
        };
        return ResponseEntity.ok()
            .contentType(MediaType.parseMediaType("audio/mpeg"))
            .header("Content-Disposition", "inline; filename=\"speech.mp3\"")
            .header("Cache-Control", "no-cache")
            .body(stream);
    }

    /**
     * Plain TTS endpoint (text in, MP3 URL out)
     */
    @PostMapping("/tts")
    public ResponseEntity<TtsResponse> textToSpeech(@RequestBody TtsRequest request) {
        byte[] audioData = ttsService.textToSpeech(
            request.getText(),
            request.getVoice() != null ? request.getVoice() : "nova",
            request.getSpeed() > 0 ? request.getSpeed() : 1.0f
        );
        String audioUrl = ossService.uploadAudio(audioData, "mp3");
        return ResponseEntity.ok(TtsResponse.builder()
            .audioUrl(audioUrl)
            .durationSeconds(estimateDuration(request.getText()))
            .build());
    }

    /**
     * Estimate audio duration (about 150 characters per minute)
     */
    private double estimateDuration(String text) {
        return text.length() / 150.0 * 60;
    }
}

Chapter 5: Multilingual Support and Mixed-Language Speech
5.1 Mixed Chinese-English recognition
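The service below leans on one heuristic: the share of CJK code points in the recognized text. A minimal standalone version of that check (covering only the base block U+4E00–U+9FFF, as the service does; the class name is illustrative):

```java
// Fraction of characters in the CJK Unified Ideographs block (U+4E00..U+9FFF).
// Extension blocks (e.g. U+3400..U+4DBF) are deliberately ignored here,
// matching the simpler check used in the service.
public class CjkRatio {

    public static double of(String text) {
        if (text == null || text.isEmpty()) return 0;
        long cjk = text.chars()
            .filter(c -> c >= 0x4E00 && c <= 0x9FFF)
            .count();
        return (double) cjk / text.length();
    }

    public static void main(String[] args) {
        System.out.println(of("你好world")); // 2 CJK characters out of 7
    }
}
```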
/**
 * Multilingual voice-processing service
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class MultilingualVoiceService {
    private final SpeechToTextService sttService;
    private final TextToSpeechService ttsService;

    /**
     * Smart language detection: pick the best recognition strategy from the audio itself.
     */
    public TranscriptionResult smartTranscribe(byte[] audioData, String filename) {
        // First pass: auto-detect the language
        TranscriptionResult autoResult = sttService.transcribe(audioData, filename, null);
        String detectedText = autoResult.getText();
        // If the text is mostly Chinese but confidence is low, retry in Chinese mode
        // (assumes TranscriptionResult exposes a confidence score from verbose_json)
        double chineseRatio = calculateChineseRatio(detectedText);
        if (chineseRatio > 0.5 && autoResult.getConfidence() < 0.8) {
            log.info("Detected Chinese content ({}%), switching to Chinese-optimized mode",
                String.format("%.1f", chineseRatio * 100));
            return sttService.transcribe(audioData, filename, "zh");
        }
        return autoResult;
    }

    /**
     * Mixed Chinese-English TTS.
     * For mixed text, the voice switching is handled automatically.
     */
    public byte[] mixedLanguageTTS(String text, String primaryVoice) {
        // Check whether the text contains a substantial amount of English
        double englishRatio = calculateEnglishRatio(text);
        if (englishRatio > 0.3) {
            // Mixed text: Chinese in the Chinese voice, English terms in native pronunciation.
            // OpenAI TTS reads mixed Chinese-English input natively, so no special casing is needed.
            log.info("Mixed-language text detected (English share {}%), using mixed TTS",
                String.format("%.1f", englishRatio * 100));
        }
        // OpenAI TTS handles mixed Chinese-English text well
        return ttsService.textToSpeech(text, primaryVoice, 1.0f);
    }

    /**
     * Fraction of CJK characters in the text.
     */
    private double calculateChineseRatio(String text) {
        if (text == null || text.isEmpty()) return 0;
        long chineseCount = text.chars()
            .filter(c -> c >= 0x4E00 && c <= 0x9FFF) // CJK Unified Ideographs block
            .count();
        return (double) chineseCount / text.length();
    }

    /**
     * Fraction of English letters in the text.
     */
    private double calculateEnglishRatio(String text) {
        if (text == null || text.isEmpty()) return 0;
        long englishCount = text.chars()
            .filter(c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            .count();
        return (double) englishCount / text.length();
    }
}

Chapter 6: Voice Activity Detection (VAD)
6.1 Front-end VAD (working with the back end)
/**
 * VAD (Voice Activity Detection) endpoints.
 *
 * During recording, the front end implements VAD with the Web Audio API:
 * 1. Analyze the audio level in real time
 * 2. Detect speech start (level above a threshold)
 * 3. Detect speech end (silence longer than 500 ms)
 * 4. Automatically ship the segment to the back end
 *
 * The matching back-end endpoint:
 */
@Slf4j
@RequiredArgsConstructor
@RestController
@RequestMapping("/api/voice/vad")
public class VadController {
    private final SpeechToTextService sttService;
    private final ChatClient chatClient;

    /**
     * Receive an audio segment cut by the front-end VAD.
     * The front end posts the segment as soon as it detects end of speech.
     */
    @PostMapping("/segment")
    public ResponseEntity<VadSegmentResponse> processSegment(
            @RequestParam("audio") MultipartFile audioSegment,
            @RequestParam(value = "segmentIndex") int segmentIndex,
            @RequestParam(value = "sessionId") String sessionId) {
        // Drop segments that are too short (likely false triggers)
        if (audioSegment.getSize() < 5000) { // under 5 KB, roughly 0.15 s at typical bitrates
            return ResponseEntity.ok(VadSegmentResponse.builder()
                .ignored(true)
                .reason("Segment too short, ignored")
                .build());
        }
        // Transcribe the segment
        TranscriptionResult result = sttService.transcribeFromFile(audioSegment, "zh");
        String text = result.getText().trim();
        // Drop meaningless content
        if (text.isEmpty() || text.length() < 2) {
            return ResponseEntity.ok(VadSegmentResponse.builder()
                .ignored(true)
                .reason("Recognized text too short")
                .build());
        }
        log.info("VAD segment recognized, segmentIndex: {}, text: {}", segmentIndex, text);
        return ResponseEntity.ok(VadSegmentResponse.builder()
            .ignored(false)
            .recognizedText(text)
            .segmentIndex(segmentIndex)
            .build());
    }
}

Chapter 7: Audio Storage and Management
7.1 The OSS audio management service
package com.example.voice.service;

import com.aliyun.oss.OSS;
import com.aliyun.oss.model.DeleteObjectsRequest;
import com.aliyun.oss.model.ListObjectsRequest;
import com.aliyun.oss.model.ObjectListing;
import com.aliyun.oss.model.ObjectMetadata;
import com.aliyun.oss.model.OSSObjectSummary;
import com.aliyun.oss.model.PutObjectRequest;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import lombok.extern.slf4j.Slf4j;
import java.io.ByteArrayInputStream;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

/**
 * OSS storage service for audio files
 */
@Slf4j
@Service
public class AudioOssService {
    private final OSS ossClient;

    @Value("${aliyun.oss.bucket-name}")
    private String bucketName;

    @Value("${aliyun.oss.cdn-domain}")
    private String cdnDomain;

    public AudioOssService(OSS ossClient) {
        this.ossClient = ossClient;
    }

    /**
     * Upload an audio file to OSS.
     *
     * @param audioData audio bytes
     * @param format    format (mp3/wav/ogg)
     * @return a CDN-accessible URL
     */
    public String uploadAudio(byte[] audioData, String format) {
        // Storage path layout: voice/2026/06/28/uuid.mp3
        String date = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy/MM/dd"));
        String objectKey = String.format("voice/%s/%s.%s", date, UUID.randomUUID(), format);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType(getContentType(format));
        metadata.setContentLength(audioData.length);
        // Static audio: let clients cache it for a day
        metadata.setCacheControl("public, max-age=86400");
        // inline, so browsers play instead of downloading
        metadata.setContentDisposition("inline");
        PutObjectRequest request = new PutObjectRequest(
            bucketName, objectKey, new ByteArrayInputStream(audioData), metadata);
        ossClient.putObject(request);
        String url = cdnDomain + "/" + objectKey;
        log.info("Audio uploaded, size: {}KB, URL: {}", audioData.length / 1024, url);
        return url;
    }

    /**
     * Remove expired audio (scheduled task, 02:00 daily).
     */
    @Scheduled(cron = "0 0 2 * * ?")
    public void cleanupExpiredAudio() {
        // Delete audio from 7 days ago: these are throwaway voice Q&A files with no
        // long-term retention need. Note this removes exactly the day-7 prefix,
        // so it assumes the job runs every day without gaps.
        LocalDate cutoffDate = LocalDate.now().minusDays(7);
        String prefix = "voice/" + cutoffDate.format(DateTimeFormatter.ofPattern("yyyy/MM/dd"));
        log.info("Starting expired-audio cleanup, prefix: {}", prefix);
        // List and delete in batches (at most 1000 keys per request)
        String nextMarker = null;
        int totalDeleted = 0;
        do {
            ListObjectsRequest listRequest = new ListObjectsRequest(bucketName)
                .withPrefix(prefix)
                .withMarker(nextMarker)
                .withMaxKeys(1000);
            ObjectListing listing = ossClient.listObjects(listRequest);
            List<String> keys = listing.getObjectSummaries().stream()
                .map(OSSObjectSummary::getKey)
                .collect(Collectors.toList());
            if (!keys.isEmpty()) {
                ossClient.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keys));
                totalDeleted += keys.size();
            }
            nextMarker = listing.getNextMarker();
        } while (nextMarker != null);
        log.info("Expired-audio cleanup done, {} files deleted", totalDeleted);
    }

    private String getContentType(String format) {
        return switch (format.toLowerCase()) {
            case "mp3" -> "audio/mpeg";
            case "wav" -> "audio/wav";
            case "ogg" -> "audio/ogg";
            case "m4a" -> "audio/mp4";
            default -> "audio/mpeg";
        };
    }
}

Chapter 8: Mobile Adaptation
8.1 Calling the voice API from mobile clients
/**
 * Voice API adapted for mobile clients.
 * Handles iOS/Android specifics.
 */
@Slf4j
@RequiredArgsConstructor
@RestController
@RequestMapping("/api/mobile/voice")
public class MobileVoiceController {
    private final SpeechToTextService sttService;
    private final TextToSpeechService ttsService;
    private final ChatClient chatClient;
    private final AudioOssService audioOssService; // hosts the generated answer audio

    /**
     * Mobile voice Q&A (tuned for mobile clients).
     *
     * Mobile constraints:
     * 1. Unstable networks: timeouts are mandatory
     * 2. Audio formats: iOS records m4a by default, Android webm/aac
     * 3. The returned audio URL must be HTTPS (required on iOS)
     * 4. Tighter file-size limits (limited device memory)
     */
    @PostMapping(value = "/ask", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public ResponseEntity<MobileVoiceResponse> mobileVoiceAsk(
            @RequestParam("audio") MultipartFile audioFile,
            @RequestParam(value = "deviceType", defaultValue = "unknown") String deviceType,
            @RequestParam(value = "language", defaultValue = "zh") String language) {
        log.info("Mobile voice request, device: {}, size: {}KB, content type: {}",
            deviceType, audioFile.getSize() / 1024, audioFile.getContentType());
        // Mobile upload cap (10 MB)
        if (audioFile.getSize() > 10 * 1024 * 1024) {
            return ResponseEntity.status(413)
                .body(MobileVoiceResponse.error("Audio file too large; please record a shorter clip (60 seconds or less)"));
        }
        try {
            // Step 1: ASR
            TranscriptionResult transcription = sttService.transcribeFromFile(audioFile, language);
            if (transcription.getText().isEmpty()) {
                return ResponseEntity.ok(MobileVoiceResponse.builder()
                    .success(false)
                    .errorMessage("No speech detected, please record again")
                    .build());
            }
            // Step 2: LLM (with a timeout; mobile users will not wait long)
            String aiAnswer = callLLMWithTimeout(transcription.getText(), 15);
            // Step 3: TTS. Keep spoken answers short on mobile: truncate to ~200 characters.
            String speakableAnswer = truncateForSpeech(aiAnswer, 200);
            byte[] audioData = ttsService.textToSpeech(speakableAnswer, "nova", 1.0f);
            String audioUrl = audioOssService.uploadAudio(audioData, "mp3");
            return ResponseEntity.ok(MobileVoiceResponse.builder()
                .success(true)
                .recognizedText(transcription.getText())
                .fullAnswer(aiAnswer)             // full answer (shown on screen)
                .speakableAnswer(speakableAnswer) // the spoken version
                .audioUrl(audioUrl)
                .build());
        } catch (TimeoutException e) {
            return ResponseEntity.ok(MobileVoiceResponse.builder()
                .success(false)
                .errorMessage("AI response timed out, please try again later")
                .build());
        }
    }

    /**
     * LLM call with a timeout (java.util.concurrent.TimeoutException throughout).
     */
    private String callLLMWithTimeout(String question, int timeoutSeconds)
            throws TimeoutException {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(
            () -> chatClient.prompt()
                .system("Answer concisely, within 200 characters, in a style suited to being read aloud.")
                .user(question)
                .call()
                .content()
        );
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            throw e;
        } catch (Exception e) {
            throw new RuntimeException("LLM call failed", e);
        }
    }

    /**
     * Truncate text to a length suitable for speech playback,
     * cutting at a sentence boundary rather than mid-sentence.
     */
    private String truncateForSpeech(String text, int maxLength) {
        if (text.length() <= maxLength) return text;
        // Cut at the last sentence boundary before maxLength
        int lastSentenceEnd = Math.max(
            text.lastIndexOf("。", maxLength),
            text.lastIndexOf("!", maxLength)
        );
        lastSentenceEnd = Math.max(lastSentenceEnd, text.lastIndexOf("?", maxLength));
        if (lastSentenceEnd > maxLength * 0.6) {
            return text.substring(0, lastSentenceEnd + 1);
        }
        return text.substring(0, maxLength) + "...";
    }
}

Chapter 9: Performance Optimization and Monitoring
9.1 Voice-service performance metrics
Production numbers (May 2026, from an education platform's live traffic):
| Metric | Target | Measured | Optimization |
|---|---|---|---|
| ASR latency (Whisper API) | <2s | 1.2s | audio preprocessing, concurrency limits |
| TTS latency (OpenAI TTS) | <1.5s | 0.8s | caching hot texts |
| End-to-end voice Q&A latency | <5s | 3.5s | pipelined concurrency |
| TTS cache hit rate | >30% | 42% | pre-generating common phrases |
| ASR accuracy (Mandarin) | >95% | 97.2% | setting language=zh |
| Service availability | >99.9% | 99.95% | multi-region failover |
/**
 * Voice-service performance monitoring
 */
@Component
public class VoiceServiceMetrics {
    private final MeterRegistry meterRegistry;
    // Metric definitions
    private final Timer asrLatencyTimer;
    private final Timer ttsLatencyTimer;
    private final Counter ttsCacheHitCounter;
    private final Counter ttsCacheMissCounter;
    private final Gauge activeConnectionsGauge;
    private final AtomicInteger activeConnections = new AtomicInteger(0);

    public VoiceServiceMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.asrLatencyTimer = Timer.builder("voice.asr.latency")
            .description("ASR speech-to-text latency")
            .tag("service", "whisper")
            .register(meterRegistry);
        this.ttsLatencyTimer = Timer.builder("voice.tts.latency")
            .description("TTS text-to-speech latency")
            .tag("service", "openai")
            .register(meterRegistry);
        this.ttsCacheHitCounter = Counter.builder("voice.tts.cache.hit")
            .description("TTS cache hits")
            .register(meterRegistry);
        this.ttsCacheMissCounter = Counter.builder("voice.tts.cache.miss")
            .description("TTS cache misses")
            .register(meterRegistry);
        this.activeConnectionsGauge = Gauge.builder("voice.websocket.active_connections",
                activeConnections, AtomicInteger::get)
            .description("Active WebSocket connections")
            .register(meterRegistry);
    }

    public void recordAsrLatency(long latencyMs) {
        asrLatencyTimer.record(latencyMs, TimeUnit.MILLISECONDS);
    }

    public void recordTtsLatency(long latencyMs, boolean cacheHit) {
        ttsLatencyTimer.record(latencyMs, TimeUnit.MILLISECONDS);
        if (cacheHit) {
            ttsCacheHitCounter.increment();
        } else {
            ttsCacheMissCounter.increment();
        }
    }

    public void incrementActiveConnections() {
        activeConnections.incrementAndGet();
    }

    public void decrementActiveConnections() {
        activeConnections.decrementAndGet();
    }
}

Chapter 10: FAQ
FAQ
Q1: What recognition accuracy can the Whisper API reach?
A: Measured results:
- Standard Mandarin: 97-99%
- Accented Mandarin: 90-95%
- Dialects (Cantonese/Hokkien): 75-85% (set the language parameter)
- Domain terminology (medical/legal): 85-92% (hint vocabulary via the prompt parameter)
- Noisy environments: 80-90% (depends on the signal-to-noise ratio)
Q2: How does real-time streaming recognition differ from the Whisper API?
A:
- Whisper API: batch mode — record first, then transcribe; 1-3 s latency, but higher accuracy
- Streaming ASR (Azure/Alibaba Cloud): transcribes while the user speaks; under 200 ms latency, suited to real-time interaction
For most Q&A scenarios the Whisper API gives better quality; real-time call centers, live captions, and similar use cases need streaming ASR.
Q3: How much will costs rise after launching voice features?
A: Take 1,000 users, each running 3 voice Q&As per day:
- Assume per interaction: 10 s of audio + a 100-character answer
- ASR cost: $0.006/min × (10/60) min × 3,000 calls = $3/day
- TTS cost: $15 per 1M characters × 100 characters × 3,000 calls = $4.5/day
- LLM cost: about $5/day
- Total: about $12.5/day (roughly ¥90/day)
Set against the LTV gained from a 25% retention lift, that cost is usually an easy trade.
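The arithmetic above, spelled out (the prices are the assumptions stated in the answer, not current list prices):

```java
// Back-of-envelope daily cost for 1,000 users × 3 voice Q&As each, using the
// assumed prices above: Whisper at $0.006/minute, TTS at $15 per 1M characters.
public class VoiceCostEstimate {

    public static double dailyCostUsd(int calls, double audioSeconds,
                                      int answerChars, double llmDailyUsd) {
        double asr = 0.006 * (audioSeconds / 60.0) * calls;  // per-minute ASR pricing
        double tts = 15.0 / 1_000_000 * answerChars * calls; // per-character TTS pricing
        return asr + tts + llmDailyUsd;
    }

    public static void main(String[] args) {
        // 3,000 calls/day, 10 s of audio, 100-character answers, ~$5/day of LLM usage
        System.out.println(dailyCostUsd(3000, 10, 100, 5.0)); // ≈ 12.5
    }
}
```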
Q4: How do we handle users who deny the microphone permission?
A: The front end must degrade gracefully:
- Request the microphone via getUserMedia and handle the NotAllowedError rejection
- When permission is denied, hide the voice button and fall back to text input
- Guide users to re-enable the microphone from the settings page
- Back-end endpoints accept both text and voice input
Summary
Voice is not just a nice-to-have; in the right scenarios it is a core competitive advantage.
Key outcome data:
- User retention up 20-30% on average
- Daily interactions up 50-100%
- The effect is strongest among users 45 and older
Core implementation takeaways:
- Whisper API + Spring AI is the fastest integration path, with dependable accuracy
- TTS caching cuts costs by 40%+ (pre-generate hot texts)
- Streaming TTS playback lowers perceived latency, far better than waiting for the full audio
- A front-end VAD is the essential building block for real-time streaming recognition
