Article 1804: Productizing Text-to-Speech, with Emotion and Voice Control for TTS in Intelligent Customer Service

Lao Zhang · 2026/4/30 · about 12 min read

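The time a user chewed us out

After the first version of our intelligent customer service went live, we got quite a bit of feedback. One comment stuck with me:

"This bot sounds so stiff, like it's reading from a script. There's no warmth at all, and the longer I listen the more annoyed I get."

At the time we were using the most basic TTS setup: take the text generated by the LLM and convert it straight to speech. The text itself was well written, but the voice came out in a flat reading tone, and any natural rise and fall was down to luck.

Only later did I realize that TTS is not simply "text in, sound out". Doing it well also takes emotion control, speaking rate and rhythm, pause design, voice selection, and more. Only when these come together does the speech sound like a real person talking to you.

This article covers productizing TTS for intelligent customer service, from API selection to emotion control to the full engineering implementation.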

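Choosing a TTS solution

There are many TTS options available in China and abroad. A few dimensions matter most when choosing:

From hands-on project experience:

OpenAI TTS (tts-1/tts-1-hd): first-rate for English; Chinese is acceptable but not impressive; a good fit for international products
Azure Cognitive Services TTS: currently the most mature for Chinese emotion control, with complete SSML support; recommended
ByteDance Volcano Engine TTS: very natural Chinese, low latency; a good fit for the domestic market
Alibaba Cloud TTS: stable and reliable, simple to integrate; a good place to start

In production we mix Azure with a domestic vendor: international customers go through Azure and domestic customers through Volcano Engine, which gives us a good balance of cost and quality.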

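SSML: the real control language of TTS

Many people use TTS by simply passing in plain text, and the result tends to be flat. SSML (Speech Synthesis Markup Language) is the key to fine-grained control over speech.

SSML is an XML-based markup language that can control:

Pause duration (<break>)
Speaking rate and pitch (<prosody>)
Emphasis on a word (<emphasis>)
Emotional style (<mstts:express-as>)
Special readings for numbers, dates, and currency (<say-as>)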
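
To make the elements above concrete, here is a small hand-written SSML document held in a Java text block. It is an illustrative sketch, not output from the builder in this article: the class name SsmlSample and the sample sentence are made up, and which styles a voice accepts depends on the provider's voice list, so treat `friendly` as an assumption to verify.

```java
/** Illustrative only: a hand-written SSML document using the elements listed above. */
final class SsmlSample {

    // The mstts namespace is an Azure extension; the other elements are standard SSML.
    static final String SAMPLE = """
        <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
               xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='zh-CN'>
          <voice name='zh-CN-XiaoxiaoNeural'>
            <mstts:express-as style='friendly' styledegree='1.2'>
              <prosody rate='-10%'>
                您好，<break time='200ms'/>
                您的订单将在
                <say-as interpret-as='date' format='ymd'>2024-03-15</say-as>
                送达，请<emphasis level='moderate'>注意</emphasis>查收。
              </prosody>
            </mstts:express-as>
          </voice>
        </speak>
        """;

    private SsmlSample() {}
}
```

Because SSML is XML, every tag has to be closed and nested properly; a quick parse with any XML parser catches most mistakes before the request reaches the TTS service.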

@Component
public class SsmlBuilder {
    
    /**
     * Build SSML for a customer service reply.
     * Core idea: add emotion and rhythm markup automatically, based on the meaning of the content.
     * Note: the text must already be XML-escaped when it arrives here (see pitfall 1).
     */
    public String buildCustomerServiceSsml(String text, 
                                            EmotionType emotion,
                                            SpeechStyle style) {
        StringBuilder ssml = new StringBuilder();
        
        ssml.append("<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'");
        ssml.append(" xmlns:mstts='https://www.w3.org/2001/mstts'");
        ssml.append(" xml:lang='zh-CN'>");
        
        // Voice selection (in our tests, Azure's Chinese female voice Xiaoxiao sounded best)
        ssml.append("<voice name='zh-CN-XiaoxiaoNeural'>");
        
        // Emotional style
        String styleStr = mapEmotionToStyle(emotion, style);
        ssml.append(String.format("<mstts:express-as style='%s' styledegree='1.2'>", 
                                  styleStr));
        
        // Speaking rate
        String rate = mapStyleToRate(style);
        ssml.append(String.format("<prosody rate='%s'>", rate));
        
        // Process the text: add pauses and emphasis
        ssml.append(processTextWithPauses(text));
        
        ssml.append("</prosody>");
        ssml.append("</mstts:express-as>");
        ssml.append("</voice>");
        ssml.append("</speak>");
        
        return ssml.toString();
    }
    
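    /**
     * Map an emotion to an Azure speaking style.
     * Azure styles include chat, customerservice, empathetic,
     * excited, friendly, hopeful, sad, serious, and others.
     * Availability differs per voice, so check the styles listed for the voice you use.
     */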
    private String mapEmotionToStyle(EmotionType emotion, SpeechStyle style) {
        return switch (emotion) {
            case WELCOME -> "friendly";
            case APOLOGIZE -> "empathetic";
            case EXPLAIN -> "customerservice";
            case ENCOURAGE -> "hopeful";
            case SERIOUS -> "serious";
            case EXCITED -> "excited";
            default -> "chat";
        };
    }
    
    /**
     * Text processing: detect pause points automatically.
     * 
     * After a full stop or exclamation mark -> longer pause
     * After a comma -> short pause
     * Numbered lists -> slower rate
     * Important words -> light emphasis
     */
    private String processTextWithPauses(String text) {
        StringBuilder processed = new StringBuilder();
        
        // Split after each sentence-ending mark, keeping the mark with its sentence
        String[] sentences = text.split("(?<=[。！？])");
        
        for (int i = 0; i < sentences.length; i++) {
            String sentence = sentences[i];
            
            // Short pause after commas inside the sentence
            sentence = sentence.replaceAll("([，、])", "$1<break time='200ms'/>");
            
            // Slow down ordinal sequences such as "第一、第二、第三"
            sentence = processNumberedLists(sentence);
            
            // Emphasize key words such as "重要", "注意", "必须"
            sentence = emphasizeKeyWords(sentence);
            
            processed.append(sentence);
            
            // Pause between sentences
            if (i < sentences.length - 1) {
                processed.append("<break time='400ms'/>");
            }
        }
        
        return processed.toString();
    }
    
    private String processNumberedLists(String text) {
        // Add a pause before ordinal words such as "第一" or "首先" and slow them down slightly
        return text.replaceAll("(第[一二三四五六七八九十]+|首先|其次|最后|另外)",
            "<break time='200ms'/><prosody rate='-5%'>$1</prosody>");
    }
    
    // Longer phrases come first so that "请注意" wins over "注意" and nothing is wrapped twice
    private static final Pattern EMPHASIS_WORDS = Pattern.compile(
        "请注意|一定要|重要|注意|必须|千万|特别");
    
    private String emphasizeKeyWords(String text) {
        // A single regex pass never re-scans its own output.
        // Support for <emphasis> varies by voice, so verify that it is audible with yours.
        return EMPHASIS_WORDS.matcher(text)
            .replaceAll("<emphasis level='moderate'>$0</emphasis>");
    }
    
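    /**
     * Map a speech style to a speaking rate.
     */
    private String mapStyleToRate(SpeechStyle style) {
        return switch (style) {
            case URGENT -> "+15%";     // slightly faster, for urgent issues
            case NORMAL -> "+0%";
            case EXPLAIN -> "-10%";    // slightly slower for explanations
            case SOOTHING -> "-15%";   // slower still when calming someone down
        };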
    }
}

A complete wrapper around Azure TTS

@Component
public class AzureTtsService {
    
    @Value("${azure.speech.key}")
    private String speechKey;
    
    @Value("${azure.speech.region}")
    private String speechRegion;
    
    private SpeechConfig speechConfig;
    
    @PostConstruct
    public void init() {
        speechConfig = SpeechConfig.fromSubscription(speechKey, speechRegion);
        speechConfig.setSpeechSynthesisOutputFormat(
            SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3);
    }
    
    /**
     * Synthesize speech and return it as a byte array.
     */
    public TtsResult synthesize(String ssml) {
        try (SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null)) {
            
            SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(ssml).get();
            
            if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
                byte[] audioData = result.getAudioData();
                
                return TtsResult.builder()
                    .audioData(audioData)
                    .format(AudioFormat.MP3_16KHZ)
                    .durationMs(estimateDuration(audioData))
                    .success(true)
                    .build();
            } else {
                SpeechSynthesisCancellationDetails cancellation = 
                    SpeechSynthesisCancellationDetails.fromResult(result);
                
                throw new TtsException("Speech synthesis failed: " + cancellation.getErrorDetails());
            }
        } catch (TtsException e) {
            throw e; // already descriptive, so do not wrap it a second time
        } catch (Exception e) {
            throw new TtsException("TTS call failed", e);
        }
    }
    
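    /**
     * Streaming synthesis (for low-latency scenarios).
     * Audio is played while it is still being synthesized; first-chunk latency can be kept under 300 ms.
     */
    public void synthesizeStream(String ssml, AudioStreamCallback callback) {
        
        AudioConfig audioConfig = AudioConfig.fromStreamOutput(
            AudioOutputStream.createPullStream());
        
        try (SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig)) {
            
            // Forward each audio chunk as soon as it is synthesized
            synthesizer.SynthesizingAudio.addEventListener((s, e) -> {
                byte[] chunk = e.getResult().getAudioData();
                if (chunk != null && chunk.length > 0) {
                    callback.onAudioChunk(chunk);
                }
            });
            
            synthesizer.SynthesisCompleted.addEventListener((s, e) -> {
                callback.onComplete();
            });
            
            synthesizer.SynthesisCanceled.addEventListener((s, e) -> {
                callback.onError("Synthesis was canceled");
            });
            
            // Start synthesis and block on the returned Future until it finishes;
            // the listeners above fire while we wait
            synthesizer.SpeakSsmlAsync(ssml).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            callback.onError("Synthesis was interrupted");
        } catch (Exception e) {
            callback.onError("Synthesis failed: " + e.getMessage());
        }
    }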
    
    /**
     * Cached synthesis: reuse audio for common replies.
     * Fixed phrases such as "您好，欢迎致电XXX" can be synthesized ahead of time and cached.
     */
    @Autowired
    private TtsAudioCache audioCache;
    
    // Builds the SSML on a cache miss
    @Autowired
    private SsmlBuilder ssmlBuilder;
    
    public TtsResult synthesizeWithCache(String text, EmotionType emotion, 
                                          SpeechStyle style) {
        String cacheKey = buildCacheKey(text, emotion, style);
        
        // Check the cache first
        TtsResult cached = audioCache.get(cacheKey);
        if (cached != null) {
            return cached;
        }
        
        // Cache miss: synthesize
        String ssml = ssmlBuilder.buildCustomerServiceSsml(text, emotion, style);
        TtsResult result = synthesize(ssml);
        
        // Cache only short texts, which are mostly fixed phrases
        if (text.length() < 100) {
            audioCache.put(cacheKey, result, Duration.ofHours(24));
        }
        
        return result;
    }
}
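
The listing above calls buildCacheKey without showing it. The following is a minimal sketch of one way such a key could be derived, not the article's actual implementation: TtsCacheKeys is a hypothetical name, and including the voice name in the key is my assumption, on the grounds that the same text rendered with a different voice, emotion, or rate is a different audio file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a fixed-length cache key from everything that affects the audio. */
final class TtsCacheKeys {

    private TtsCacheKeys() {}

    static String build(String voice, String emotion, String style, String text) {
        // Length-prefix every field so that ("ab", "c") and ("a", "bc") cannot collide
        String material = String.join("|",
            field(voice), field(emotion), field(style), field(text));
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(material.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("tts:audio:");
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    private static String field(String value) {
        return value.length() + ":" + value;
    }
}
```

Hashing keeps Redis keys short and free of special characters no matter how long the reply text is. If the SSML template itself changes (for example a new styledegree), adding a template version to the key material would avoid serving stale audio.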

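Driving TTS emotion from emotion recognition

A smarter approach: instead of specifying the emotion by hand, analyze the user's tone first and then choose the TTS emotion automatically.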

@Component
public class EmotionDrivenTtsOrchestrator {
    
    @Autowired
    private LlmClient llmClient;
    
    @Autowired
    private AzureTtsService ttsService;
    
    @Autowired
    private SsmlBuilder ssmlBuilder;
    
    /**
     * Emotion-aware TTS synthesis.
     * Adjusts the speaking style of the agent's reply to the emotion in the user's input.
     */
    public TtsResult synthesizeAdaptive(String userInput, String agentReply) {
        
        // Analyze the user's emotion
        UserEmotion userEmotion = analyzeUserEmotion(userInput);
        
        // Decide the agent's reply style from the user's emotion
        EmotionType responseEmotion = determineResponseEmotion(userEmotion);
        SpeechStyle responseStyle = determineSpeechStyle(userEmotion, agentReply);
        
        // Synthesize
        String ssml = ssmlBuilder.buildCustomerServiceSsml(
            agentReply, responseEmotion, responseStyle);
        
        return ttsService.synthesize(ssml);
    }
    
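    /**
     * User emotion analysis.
     * Asks the LLM to classify the message into one of a few key emotions.
     */
    private UserEmotion analyzeUserEmotion(String userInput) {
        String prompt = String.format("""
            Classify the emotional state of the following user message. Answer with a single word.
            
            User message: %s
            
            Emotion options:
            - CALM (calm, a normal inquiry)
            - CONFUSED (confused, unclear about something)
            - FRUSTRATED (mildly dissatisfied, minor issues such as a long wait)
            - ANGRY (angry, a strong complaint or dissatisfaction)
            - URGENT (urgent, has a pressing need)
            - SAD (sad, facing a discouraging situation)
            
            Answer with the emotion code only.
            """, userInput);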
        
        String result = llmClient.complete(prompt).trim().toUpperCase();
        try {
            return UserEmotion.valueOf(result);
        } catch (IllegalArgumentException e) {
            return UserEmotion.CALM;
        }
    }
    
    /**
     * Response strategy: user emotion -> the emotion the agent should convey.
     * 
     * Key principle: never answer anger with anger, and never sound cheerful to an angry user.
     */
    private EmotionType determineResponseEmotion(UserEmotion userEmotion) {
        return switch (userEmotion) {
            case CALM -> EmotionType.FRIENDLY;       // calm meets calm: friendly
            case CONFUSED -> EmotionType.EXPLAIN;    // confused: explain patiently
            case FRUSTRATED -> EmotionType.APOLOGIZE; // mildly dissatisfied: apologize and reassure
            case ANGRY -> EmotionType.APOLOGIZE;     // angry: empathy matters most
            case URGENT -> EmotionType.SERIOUS;      // urgent: serious and efficient
            case SAD -> EmotionType.APOLOGIZE;       // sad: show sympathy
        };
    }
    
    /**
     * Decide the speech style.
     */
    private SpeechStyle determineSpeechStyle(UserEmotion userEmotion, String replyText) {
        // Angry users get the soothing (slow) rate
        if (userEmotion == UserEmotion.ANGRY) return SpeechStyle.SOOTHING;
        
        // Urgent situations: slightly faster
        if (userEmotion == UserEmotion.URGENT) return SpeechStyle.URGENT;
        
        // Long replies are treated as explanations and read slightly slower
        if (replyText.length() > 150) return SpeechStyle.EXPLAIN;
        
        return SpeechStyle.NORMAL;
    }
}
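
The classifier above relies on the LLM answering with exactly one code, and anything else silently becomes CALM. LLMs often wrap the answer in extra words or punctuation, so a slightly more tolerant parser may help. The sketch below is my own assumption about how that could look; EmotionParser is a hypothetical name and the enum is redeclared only to keep the example self-contained.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls an emotion code out of a possibly noisy LLM reply. */
final class EmotionParser {

    enum UserEmotion { CALM, CONFUSED, FRUSTRATED, ANGRY, URGENT, SAD }

    // Match a code only when it is not part of a longer Latin word (e.g. not "URGENTLY")
    private static final Pattern CODE = Pattern.compile(
        "(?<![A-Z])(CALM|CONFUSED|FRUSTRATED|ANGRY|URGENT|SAD)(?![A-Z])");

    private EmotionParser() {}

    static UserEmotion parse(String llmReply) {
        if (llmReply == null) {
            return UserEmotion.CALM;
        }
        Matcher matcher = CODE.matcher(llmReply.toUpperCase(Locale.ROOT));
        // Fall back to CALM when no code is found, as the original listing does
        return matcher.find() ? UserEmotion.valueOf(matcher.group(1)) : UserEmotion.CALM;
    }
}
```

With this, a reply such as "情绪代码：angry。" still resolves to ANGRY. Logging the raw reply whenever the fallback is taken makes it easier to notice when the prompt needs tightening.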

Voice management

Enterprise TTS usually has to manage several voices, with different voices for different scenarios:

@Component
public class VoiceProfileManager {
    
    /**
     * Voice configuration: different business scenarios use different voices.
     * This is a plain holder of constants, so it needs no Spring annotation.
     */
    public static class VoiceProfiles {
        
        // Pre-sales consulting: lively and friendly
        public static final VoiceProfile SALES_CONSULT = VoiceProfile.builder()
            .voiceName("zh-CN-XiaoxiaoNeural")
            .defaultStyle("friendly")
            .defaultRate("+5%")
            .defaultPitch("+2Hz")
            .build();
        
        // After-sales service: professional and patient
        public static final VoiceProfile AFTER_SALES = VoiceProfile.builder()
            .voiceName("zh-CN-YunxiNeural") // male voice, steadier
            .defaultStyle("customerservice")
            .defaultRate("-5%")
            .defaultPitch("-1Hz")
            .build();
        
        // Notifications: clear and accurate
        public static final VoiceProfile NOTIFICATION = VoiceProfile.builder()
            .voiceName("zh-CN-XiaoyiNeural")
            .defaultStyle("serious")
            .defaultRate("+0%")
            .defaultPitch("+0Hz")
            .build();
        
        // Children's apps: gentle and lively
        public static final VoiceProfile CHILDREN = VoiceProfile.builder()
            .voiceName("zh-CN-XiaohanNeural")
            .defaultStyle("cheerful")
            .defaultRate("+10%")
            .defaultPitch("+5Hz")
            .build();
    }
    
    /**
     * Pick a voice automatically from the business scenario.
     */
    public VoiceProfile selectProfile(BusinessScenario scenario, 
                                       UserProfile userProfile) {
        // A stored user preference takes priority, which allows personalization
        if (userProfile != null && userProfile.getPreferredVoice() != null) {
            return loadCustomProfile(userProfile.getPreferredVoice());
        }
        
        return switch (scenario) {
            case SALES -> VoiceProfiles.SALES_CONSULT;
            case AFTER_SALES -> VoiceProfiles.AFTER_SALES;
            case NOTIFICATION -> VoiceProfiles.NOTIFICATION;
            default -> VoiceProfiles.SALES_CONSULT;
        };
    }
    
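    /**
     * A/B test: bucket users deterministically by user ID, so each user always hears
     * the same voice, then collect preference data per group.
     */
    public VoiceProfile abTestProfile(String userId) {
        int hash = Math.floorMod(userId.hashCode(), 2);
        
        return switch (hash) {
            case 0 -> VoiceProfiles.SALES_CONSULT; // group A: Xiaoxiao
            case 1 -> VoiceProfile.builder()       // group B: new voice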
                .voiceName("zh-CN-XiaomoNeural")
                .defaultStyle("cheerful")
                .build();
            default -> VoiceProfiles.SALES_CONSULT;
        };
    }
}
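
The A/B split above hashes the user ID alone, so every experiment would put the same users in the same group. A common refinement, sketched here as an assumption and not as the article's method, is to mix an experiment name into the hash. AbBucket and its parameters are hypothetical names.

```java
/** Hypothetical helper: stable A/B bucketing that differs per experiment. */
final class AbBucket {

    private AbBucket() {}

    /**
     * Returns a bucket in [0, buckets). The same (experiment, userId) pair always
     * maps to the same bucket because String.hashCode is specified by the language.
     */
    static int of(String experiment, String userId, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod never returns a negative value, unlike the % operator
        return Math.floorMod((experiment + ":" + userId).hashCode(), buckets);
    }
}
```

A call such as AbBucket.of("voice-test-1", userId, 2) could then replace the inline hash. String.hashCode is simple and not uniformly distributed for all inputs, so a stronger hash may be preferable when group sizes must be closely balanced.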

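Handling numbers and other special content

This is a detail that is very easy to overlook in TTS, yet it has a large impact on user experience.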

@Component
public class TextNormalizer {
    
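    /**
     * Text normalization before TTS.
     * Core idea: make sure numbers, dates, amounts, and the like are read correctly.
     */
    public String normalize(String text) {
        String result = text;
        
        // Phone numbers: group digits 3-4-4 so they are easier to follow
        result = formatPhoneNumbers(result);
        
        // Dates: convert to a natural spoken form
        result = formatDates(result);
        
        // Amounts: read yuan, jiao, and fen correctly
        result = formatCurrency(result);
        
        // English abbreviations: expand them or annotate the reading
        result = expandAbbreviations(result);
        
        // URLs and email addresses: simplify
        result = simplifyUrls(result);
        
        return result;
    }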
    
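    /**
     * Phone number formatting: 138-xxxx-xxxx sounds more natural.
     */
    private String formatPhoneNumbers(String text) {
        // Match an 11-digit mobile number that is not part of a longer digit run
        // (otherwise order numbers and ID numbers would be mangled)
        return text.replaceAll("(?<!\\d)(1[3-9]\\d)(\\d{4})(\\d{4})(?!\\d)", 
            "<say-as interpret-as='telephone'>$1-$2-$3</say-as>");
    }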
    
    /**
     * Date formatting
     * 2024-03-15 -> 二零二四年三月十五日
     */
    private String formatDates(String text) {
        // Match YYYY-MM-DD that is not part of a longer digit run
        Pattern datePattern = Pattern.compile("(?<!\\d)(\\d{4})-(\\d{2})-(\\d{2})(?!\\d)");
        Matcher matcher = datePattern.matcher(text);
        
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String year = matcher.group(1);
            String month = matcher.group(2);
            String day = matcher.group(3);
            
            // Mark it up with SSML say-as
            String formatted = String.format(
                "<say-as interpret-as='date' format='ymd'>%s-%s-%s</say-as>",
                year, month, day);
            matcher.appendReplacement(sb, formatted);
        }
        matcher.appendTail(sb);
        
        return sb.toString();
    }
    
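    /**
     * Currency formatting
     * ¥1234.56 -> 一千两百三十四元五角六分
     * Note: TTS engines often read "¥" as "人民币" and the decimal point as "点",
     * which may not sound formal enough in financial scenarios, so control it explicitly.
     */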
    private String formatCurrency(String text) {
        Pattern moneyPattern = Pattern.compile("[¥￥](\\d+(?:\\.\\d{1,2})?)");
        Matcher matcher = moneyPattern.matcher(text);
        
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String amount = matcher.group(1);
            String spoken = convertAmountToSpoken(amount);
            matcher.appendReplacement(sb, spoken);
        }
        matcher.appendTail(sb);
        
        return sb.toString();
    }
    
    private String convertAmountToSpoken(String amount) {
        try {
            BigDecimal bd = new BigDecimal(amount);
            long yuan = bd.longValue();
            int fen = bd.remainder(BigDecimal.ONE).multiply(BigDecimal.TEN.pow(2))
                       .intValue();
            
            StringBuilder spoken = new StringBuilder();
            spoken.append(numberToChineseCapital(yuan)).append("元");
            
            if (fen > 0) {
                int jiao = fen / 10;
                int fenZi = fen % 10;
                if (jiao > 0) spoken.append(jiao).append("角");
                if (fenZi > 0) spoken.append(fenZi).append("分");
            } else {
                spoken.append("整");
            }
            
            return spoken.toString();
        } catch (NumberFormatException e) {
            return amount; // fall back to the original text if conversion fails
        }
    }
    
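    /**
     * Expand common abbreviations
     * so that TTS reads the full term instead of spelling it letter by letter.
     */
    private String expandAbbreviations(String text) {
        Map<String, String> abbreviations = Map.of(
            "ERP", "企业资源规划系统",
            "CRM", "客户关系管理系统",
            "API", "接口",
            "APP", "应用",
            "URL", "网址",
            "ID", "编号"
        );
        
        // Replace only stand-alone abbreviations, not ones embedded in a longer word.
        // ASCII lookarounds are used instead of \b because on older JDKs \b can treat
        // Chinese characters as word characters, so "的API是" would not match.
        for (Map.Entry<String, String> entry : abbreviations.entrySet()) {
            text = text.replaceAll(
                "(?<![A-Za-z0-9])" + entry.getKey() + "(?![A-Za-z0-9])", entry.getValue());
        }
        
        return text;
    }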
}
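
convertAmountToSpoken calls numberToChineseCapital, which the listing does not show. As a stand-in, here is a minimal sketch that reads a non-negative integer the way it is spoken. It is my own illustration under stated assumptions: ChineseNumberReader is a hypothetical name, it covers only 0 to 99,999,999, it uses everyday numerals (not the financial capitals 壹贰叁), and it says 二百 where the comment above uses the colloquial 两百.

```java
/** Hypothetical helper: reads 0..99,999,999 as spoken Chinese, e.g. 1234 -> 一千二百三十四. */
final class ChineseNumberReader {

    private static final String[] DIGITS = {"零", "一", "二", "三", "四", "五", "六", "七", "八", "九"};
    private static final String[] SMALL_UNITS = {"", "十", "百", "千"};

    private ChineseNumberReader() {}

    static String read(long n) {
        if (n < 0 || n > 99_999_999L) {
            throw new IllegalArgumentException("out of supported range: " + n);
        }
        if (n == 0) {
            return DIGITS[0];
        }
        int high = (int) (n / 10_000); // the part counted in 万
        int low = (int) (n % 10_000);
        StringBuilder sb = new StringBuilder();
        if (high > 0) {
            sb.append(readSection(high)).append("万");
        }
        if (low > 0) {
            // A gap such as 10001 needs a spoken 零 between the two sections
            if (high > 0 && low < 1000) {
                sb.append("零");
            }
            sb.append(readSection(low));
        }
        String spoken = sb.toString();
        // 10 is spoken 十 rather than 一十 when it leads the number
        return spoken.startsWith("一十") ? spoken.substring(1) : spoken;
    }

    /** Reads 1..9999, inserting a single 零 for any run of inner zeros. */
    private static String readSection(int section) {
        StringBuilder sb = new StringBuilder();
        boolean zeroPending = false;
        int divisor = 1000;
        for (int pos = 3; pos >= 0; pos--, divisor /= 10) {
            int digit = section / divisor % 10;
            if (digit == 0) {
                // Leading zeros are skipped; inner zeros are remembered
                if (sb.length() > 0) {
                    zeroPending = true;
                }
            } else {
                if (zeroPending) {
                    sb.append("零");
                    zeroPending = false;
                }
                sb.append(DIGITS[digit]).append(SMALL_UNITS[pos]);
            }
        }
        return sb.toString();
    }
}
```

Spelling the amount out this way removes any dependence on how a particular TTS engine reads digit strings, at the cost of maintaining the conversion yourself.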

Streaming TTS and WebSocket push

In real-time customer service, LLM text generation and TTS audio synthesis need to run as a pipeline:

@Service
public class StreamingTtsPipeline {
    
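    /**
     * Pipeline: LLM generation -> TTS synthesis.
     * Goal: the first sentence starts playing in under 1 second.
     */
    public void streamToClient(String userQuery, WebSocketSession session) {
        
        // Buffer: accumulate LLM output into whole sentences before handing it to TTS.
        // The lambda binds the session, since SentenceBuffer expects a Consumer<String>.
        SentenceBuffer sentenceBuffer = new SentenceBuffer(
            sentence -> onSentenceComplete(sentence, session));
        
        // Stream the LLM output
        llmClient.streamGenerate(userQuery, chunk -> {
            sentenceBuffer.append(chunk);
        }, () -> {
            // The LLM has finished; flush whatever is left in the buffer
            sentenceBuffer.flush();
        });
    }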
    
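    /**
     * A sentence is complete; start TTS synthesis.
     */
    private void onSentenceComplete(String sentence, WebSocketSession session) {
        
        // Quick emotion check for this sentence (simplified version)
        EmotionType emotion = quickEmotionCheck(sentence);
        SpeechStyle style = sentence.length() > 80 ? 
            SpeechStyle.EXPLAIN : SpeechStyle.NORMAL;
        
        // Synthesize asynchronously so the next sentence is not blocked.
        // Caveat: independent futures can complete out of order, so a production
        // version has to send audio to the client in sentence order.
        CompletableFuture.supplyAsync(() -> {
            String ssml = ssmlBuilder.buildCustomerServiceSsml(sentence, emotion, style);
            return ttsService.synthesize(ssml);
        }).thenAccept(ttsResult -> {
            // Send to the client
            sendAudioToClient(ttsResult.getAudioData(), session);
        });
    }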
    
    /**
     * Sentence buffer: splits the stream at sentence-ending punctuation.
     */
    private static class SentenceBuffer {
        
        private static final Set<Character> SENTENCE_ENDINGS = 
            Set.of('。', '！', '？', '…', '\n');
        
        private final StringBuilder buffer = new StringBuilder();
        private final Consumer<String> onComplete;
        
        SentenceBuffer(Consumer<String> onComplete) {
            this.onComplete = onComplete;
        }
        
        void append(String chunk) {
            buffer.append(chunk);
            
            // Emit every complete sentence currently in the buffer
            while (true) {
                int endPos = findSentenceEnd();
                if (endPos < 0) break;
                
                String sentence = buffer.substring(0, endPos + 1).trim();
                buffer.delete(0, endPos + 1);
                
                if (!sentence.isEmpty()) {
                    onComplete.accept(sentence);
                }
            }
        }
        
        void flush() {
            String remaining = buffer.toString().trim();
            if (!remaining.isEmpty()) {
                onComplete.accept(remaining);
            }
            buffer.setLength(0);
        }
        
        private int findSentenceEnd() {
            for (int i = 0; i < buffer.length(); i++) {
                if (SENTENCE_ENDINGS.contains(buffer.charAt(i))) {
                    return i;
                }
            }
            return -1;
        }
    }
}
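
One thing the pipeline above leaves open is ordering: each sentence is synthesized in its own CompletableFuture, so a short second sentence can finish before a long first one. The sketch below shows one way to keep synthesis parallel while delivering results strictly in submission order. It is an illustration under my own assumptions; OrderedSender is a hypothetical name, and one instance would be needed per WebSocket session.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: runs work in parallel but hands results to the sink in submission order. */
final class OrderedSender<T> {

    private final Consumer<T> sink;
    // Completes once everything submitted so far has been delivered
    private CompletableFuture<Void> tail = CompletableFuture.completedFuture(null);

    OrderedSender(Consumer<T> sink) {
        this.sink = sink;
    }

    synchronized CompletableFuture<Void> submit(Supplier<T> work) {
        // The work itself starts immediately, in parallel with earlier items
        CompletableFuture<T> result = CompletableFuture.supplyAsync(work);
        // Delivery waits for both this result and the previous delivery
        tail = tail
            .thenAcceptBoth(result, (ignored, value) -> sink.accept(value))
            // A failed item is skipped so that later sentences are still delivered
            .exceptionally(failure -> null);
        return tail;
    }
}
```

In the pipeline, the sink would be the call that sends audio over the session and the supplier would wrap the synthesize call. Skipping a failed sentence silently is a simplification; a real system would probably log it or fall back to text.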

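Audio caching strategy

Synthesizing high-frequency phrases ahead of time and caching them cuts both latency and cost substantially: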

@Slf4j // Lombok, assumed here: it supplies the `log` field used below
@Component
public class TtsAudioCache {
    
    @Autowired
    private RedisTemplate<String, byte[]> redisTemplate;
    
    @Autowired
    private AzureTtsService ttsService;
    
    private static final String CACHE_PREFIX = "tts:audio:";
    
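    /**
     * Cache warm-up: synthesize all fixed phrases once the application has started.
     * @Async has no effect on a @PostConstruct method because the async proxy is not
     * in place yet at that point, so this listens for ApplicationReadyEvent instead.
     */
    @Async
    @EventListener(ApplicationReadyEvent.class)
    public void warmUpCache() {
        List<FixedPhrase> fixedPhrases = loadFixedPhrases();
        
        log.info("Warming up TTS cache with {} phrases", fixedPhrases.size());
        
        for (FixedPhrase phrase : fixedPhrases) {
            String cacheKey = CACHE_PREFIX + phrase.getId();
            
            // hasKey returns a nullable Boolean, so compare explicitly
            if (!Boolean.TRUE.equals(redisTemplate.hasKey(cacheKey))) {
                try {
                    TtsResult result = ttsService.synthesizeWithCache(
                        phrase.getText(), 
                        phrase.getEmotion(), 
                        phrase.getStyle());
                    
                    redisTemplate.opsForValue().set(
                        cacheKey, 
                        result.getAudioData(),
                        Duration.ofDays(7));
                    
                    log.debug("Warm-up done: {}", phrase.getId());
                } catch (Exception e) {
                    log.warn("Warm-up failed: {}", phrase.getId(), e);
                }
            }
        }
        
        log.info("TTS cache warm-up complete");
    }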
    
    /**
     * Fixed phrases to pre-synthesize.
     */
    private List<FixedPhrase> loadFixedPhrases() {
        return Arrays.asList(
            new FixedPhrase("welcome", 
                "您好，欢迎致电智能客服，请问有什么可以帮您？",
                EmotionType.WELCOME, SpeechStyle.NORMAL),
            new FixedPhrase("hold_on",
                "好的，请稍等一下，我来为您查询。",
                EmotionType.FRIENDLY, SpeechStyle.NORMAL),
            new FixedPhrase("transfer",
                "我来帮您转接人工客服，请稍候。",
                EmotionType.FRIENDLY, SpeechStyle.NORMAL),
            new FixedPhrase("goodbye",
                "感谢您的来电，再见，祝您生活愉快！",
                EmotionType.FRIENDLY, SpeechStyle.NORMAL),
            new FixedPhrase("apology",
                "非常抱歉给您带来了不便，我们会尽快处理。",
                EmotionType.APOLOGIZE, SpeechStyle.SOOTHING)
        );
    }
}

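Pitfalls we actually hit

Pitfall 1: escaping in SSML

If the text contains XML special characters such as &, <, or >, SSML parsing fails. The text must be escaped before any SSML tags are added to it: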

private String escapeXml(String text) {
    return text
        .replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\"", "&quot;")
        .replace("'", "&apos;");
}
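
Order matters here: the builder earlier in this article inserts tags such as <break> into the text, so escaping has to run on the raw text first, otherwise the tags themselves would be escaped too. A minimal sketch of that ordering follows; SafeSsmlText and escapeThenAddPauses are hypothetical names, and the pause rule is copied from the builder only for illustration.

```java
/** Hypothetical helper: shows that escaping must happen before SSML tags are inserted. */
final class SafeSsmlText {

    private SafeSsmlText() {}

    static String escapeXml(String text) {
        // "&" must be replaced first, or the other entities would be escaped twice
        return text
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;")
            .replace("'", "&apos;");
    }

    static String escapeThenAddPauses(String rawText) {
        // Step 1: neutralize anything in the user or LLM text that looks like markup
        String escaped = escapeXml(rawText);
        // Step 2: only now add our own tags, which therefore stay real tags
        return escaped.replaceAll("([，、])", "$1<break time='200ms'/>");
    }
}
```

Running the two steps in the opposite order would turn every inserted <break> into literal text that the engine reads aloud or rejects.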

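Pitfall 2: style degree set too high

If Azure's styledegree parameter is set too high (for example 2.0), the voice becomes exaggerated and unnatural. In my tests, 1.0 to 1.3 was a comfortable range.

Pitfall 3: mixed Chinese and English

Take "我们的API是RESTful风格的": TTS often mangles "RESTful" in a sentence like this. Either use the prompt to steer the wording of the reply, for example toward "我们的接口是RESTful风格", or annotate the pronunciation explicitly in SSML.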

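For the SSML route, the standard <sub alias='...'> element tells the engine to speak a substitute string in place of the written one. The sketch below wraps known terms that way. It is only an illustration: PronunciationHints is a hypothetical name, the alias spellings are guesses that would need tuning by ear for a given voice, and the input is assumed to be XML-escaped already.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: wraps hard-to-read terms in SSML <sub alias='...'> tags. */
final class PronunciationHints {

    // Made-up aliases for illustration; adjust them after listening to the result
    private static final Map<String, String> ALIASES = new LinkedHashMap<>();
    static {
        ALIASES.put("RESTful", "rest full");
        ALIASES.put("OAuth", "oh auth");
    }

    private PronunciationHints() {}

    static String annotate(String escapedText) {
        String result = escapedText;
        for (Map.Entry<String, String> entry : ALIASES.entrySet()) {
            String tag = "<sub alias='" + entry.getValue() + "'>" + entry.getKey() + "</sub>";
            // Safe only while no alias contains another term from the table
            result = result.replace(entry.getKey(), tag);
        }
        return result;
    }
}
```

Keeping the table in configuration makes it possible to add a term the moment someone reports a mispronunciation, without redeploying.
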
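Pitfall 4: stacked pauses

If the text already contains a comma (，) and you add a <break> on top of it, the pauses stack up and become too long. Either strip the original punctuation before adding SSML pauses, or add pauses only where there is no natural punctuation.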
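
Both options can be sketched in a few lines. The first method swaps the comma for a <break> instead of keeping both; the second collapses back-to-back <break> tags, which the builder in this article can produce when a comma is followed by an ordinal word such as "其次". PauseNormalizer is a hypothetical name, and whether dropping the comma changes intonation for a given voice is something to check by listening.

```java
/** Hypothetical helper: two ways to keep pauses from stacking up. */
final class PauseNormalizer {

    private PauseNormalizer() {}

    /** Replaces each comma with a break, so the engine sees one pause cue instead of two. */
    static String replaceCommasWithBreaks(String text) {
        return text.replaceAll("[，、]\\s*", "<break time='200ms'/>");
    }

    /** Keeps only the first tag in any run of consecutive break tags. */
    static String collapseBreaks(String ssmlBody) {
        return ssmlBody.replaceAll(
            "(<break time='\\d+ms'/>)(?:\\s*<break time='\\d+ms'/>)+", "$1");
    }
}
```

collapseBreaks is meant to run as the last step, after all other rules have inserted their tags.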

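Summary

The core of productizing TTS is not which API you pick, but:

SSML is the key; without it you get maybe half of the achievable quality
Emotion should follow the user; never use a cheerful voice when the user is angry
Normalize numbers and special content; amounts and phone numbers need their own handling
Design it as a pipeline; run LLM output and TTS synthesis in parallel to cut end-to-end latency
Cache fixed phrases; pre-synthesizing high-frequency sentences wins on both speed and cost

Whether a voice interaction feels good comes down to these details, and that feeling is what keeps users around.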