Post #2023: Speculative Decoding, the Counterintuitive Technique That Makes LLM Inference 2x Faster

老张 | 2026/4/30 | about 6 min read

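Audience: engineers working on LLM inference performance | Reading time: about 17 minutes | Key takeaway: understand how speculative decoding works, when it is worth using, and how to configure it

The first time I heard about speculative decoding, the idea struck me as odd: use a small model to predict on behalf of the big model, then have the big model verify the small model's predictions. Isn't that redundant work?

Once you understand where the bottleneck in LLM inference actually is, though, the design turns out to be very clever.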

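The real bottleneck in LLM inference speed

LLM inference generates one token at a time. Each new token requires:

1. Feeding the embeddings of all tokens so far into the Transformer (in practice, a KV cache avoids recomputing earlier positions)
2. Computing multi-head attention (very expensive)
3. Producing the probability distribution over the next token
4. Sampling the next token from that distribution

This process cannot be parallelized: token N cannot start until token N-1 has been generated.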

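But there is one key observation: a single GPU forward pass takes almost the same time whether it verifies 1 token or 10. At small batch sizes, decoding is limited by memory bandwidth rather than arithmetic. Every pass has to stream the model weights from GPU memory while the massively parallel compute units sit mostly idle, so scoring a few extra tokens in the same pass costs very little (up to a certain size).

Speculative decoding exploits exactly this property.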
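To make that observation concrete, here is a toy estimate of how long one forward pass takes as the number of tokens scored in that pass grows. It is only a sketch under loud assumptions: the model size, memory bandwidth, and peak FLOPS below are illustrative round numbers rather than measurements, and the model ignores attention and KV-cache traffic entirely.

```java
/** Toy roofline estimate for one decoder forward pass. All hardware numbers are illustrative assumptions. */
final class DecodeStepRoofline {
    // Assumed: an 8B-parameter model stored in 16-bit weights (2 bytes per parameter).
    static final double PARAMS = 8e9;
    static final double WEIGHT_BYTES = PARAMS * 2;
    // Assumed round numbers for a data-center GPU, not measured values.
    static final double MEM_BANDWIDTH_BYTES_PER_S = 2e12;
    static final double PEAK_FLOPS = 3e14;

    /** Estimated milliseconds for a pass that scores `tokens` positions at once. */
    static double stepMillis(int tokens) {
        // Every pass streams the full set of weights from GPU memory once.
        double memorySeconds = WEIGHT_BYTES / MEM_BANDWIDTH_BYTES_PER_S;
        // Roughly 2 FLOPs per parameter per token (one multiply, one add).
        double computeSeconds = 2.0 * PARAMS * tokens / PEAK_FLOPS;
        // Whichever resource is slower bounds the step time.
        return Math.max(memorySeconds, computeSeconds) * 1000.0;
    }

    public static void main(String[] args) {
        for (int tokens : new int[] {1, 4, 10, 64, 256, 1024}) {
            System.out.printf("tokens=%d -> %.2f ms%n", tokens, stepMillis(tokens));
        }
    }
}
```

Under these assumed numbers the estimate stays flat for small token counts and only starts growing once the compute term overtakes the memory term, which is the "up to a certain size" caveat in the text.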

How speculative decoding works
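As a rough picture of the draft-then-verify loop, here is a minimal sketch with toy models. It makes several assumptions: both models are plain functions that return the next token greedily, verification accepts a draft token only when it equals the target's own choice, and the class and method names are made up for illustration. Production engines verify all positions in one batched pass and use a probabilistic accept/reject rule so that sampled output still follows the target model's distribution. This sketch only loops and counts passes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Greedy draft-then-verify loop with toy models (hypothetical names, illustration only). */
final class GreedySpeculativeDecoding {
    /** Simulated number of target-model verification passes in the last run. */
    static int targetPasses = 0;

    static List<Integer> generate(Function<List<Integer>, Integer> draft,
                                  Function<List<Integer>, Integer> target,
                                  List<Integer> prompt, int k, int maxNewTokens) {
        List<Integer> out = new ArrayList<>(prompt);
        int limit = prompt.size() + maxNewTokens;
        targetPasses = 0;
        while (out.size() < limit) {
            // 1. The draft model proposes k tokens, one after another (cheap).
            List<Integer> ctx = new ArrayList<>(out);
            List<Integer> proposal = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                int token = draft.apply(ctx);
                proposal.add(token);
                ctx.add(token);
            }
            // 2. The target model checks every proposed position.
            //    A real engine does this in one batched pass, so we count it as one.
            targetPasses++;
            for (int i = 0; i <= k && out.size() < limit; i++) {
                int expected = target.apply(out);
                out.add(expected); // the emitted token is always the target's choice
                // Stop after the bonus token (i == k) or at the first mismatch.
                if (i == k || proposal.get(i) != expected) {
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy target: count upward modulo 10. Toy draft: same, but wrong after a 2.
        Function<List<Integer>, Integer> target = c -> (c.get(c.size() - 1) + 1) % 10;
        Function<List<Integer>, Integer> draft =
                c -> c.get(c.size() - 1) == 2 ? 9 : (c.get(c.size() - 1) + 1) % 10;
        List<Integer> out = generate(draft, target, List.of(0), 4, 12);
        System.out.println(out + " in " + targetPasses + " target passes");
    }
}
```

Because every emitted token is the target's own greedy choice, the output is identical to what the target model would produce alone. The draft model only changes how many target passes are needed to get there.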

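The key point: verifying 4 tokens in parallel costs the large model about as much as generating 1 token serially, yet in the best case it yields all 4 tokens (standard implementations also emit one extra token from the verification pass itself). The expected speedup depends on the draft model's acceptance rate: if 80% of candidate tokens are accepted, the speedup is roughly 3x before accounting for the draft model's own cost.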
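The "roughly 3x" figure can be sanity-checked with the commonly used estimate for speculative decoding, sketched below. The assumptions are significant: each draft token is accepted independently with the same probability alpha, every verification pass costs the same as one ordinary decoding step, and `draftCostRatio` (the cost of one draft step relative to one target step) is a hypothetical input you would have to measure yourself.

```java
/** Back-of-the-envelope speedup estimate (assumes independent, identical acceptance per token). */
final class SpeculativeSpeedupEstimate {
    /** Expected tokens emitted per target verification pass with k draft tokens. */
    static double expectedTokensPerPass(double alpha, int k) {
        if (alpha >= 1.0) {
            return k + 1; // every draft token accepted, plus the bonus token
        }
        // Geometric series 1 + alpha + alpha^2 + ... + alpha^k
        return (1 - Math.pow(alpha, k + 1)) / (1 - alpha);
    }

    /** Estimated speedup once the draft model's cost is included. */
    static double estimatedSpeedup(double alpha, int k, double draftCostRatio) {
        // One round costs k draft steps plus one target pass, in units of target passes.
        return expectedTokensPerPass(alpha, k) / (k * draftCostRatio + 1);
    }

    public static void main(String[] args) {
        System.out.printf("tokens per pass: %.3f%n", expectedTokensPerPass(0.8, 4));
        System.out.printf("speedup with a free draft model: %.2fx%n", estimatedSpeedup(0.8, 4, 0.0));
        System.out.printf("speedup with a draft at 5%% of target cost: %.2fx%n", estimatedSpeedup(0.8, 4, 0.05));
    }
}
```

With alpha = 0.8 and k = 4 the estimate gives about 3.36 tokens per pass, so "about 3x" is an upper bound that shrinks as the draft model gets more expensive.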

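Choosing a draft model

There are several options for the draft model:

Option 1: a separate small model

The most direct approach, for example a 3B model assisting a 70B model. The small model must use the same vocabulary (tokenizer) as the large model, otherwise the tokens will not line up.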

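# Configure speculative decoding in vLLM (separate draft model)
#   --num-speculative-tokens 5: propose 5 candidate tokens per round
#   --speculative-draft-tensor-parallel-size 1: run the draft model on 1 GPU
# Flag names vary across vLLM releases; check the docs for the version you run.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
    --num-speculative-tokens 5 \
    --speculative-draft-tensor-parallel-size 1 \
    --port 8080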

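Option 2: a smaller sibling (same model family)

Llama3-70B + Llama3-8B: because they come from the same family, the vocabularies are identical and the acceptance rate tends to be higher.

Option 3: Medusa (multiple heads on the same model)

Medusa adds several "guessing heads" on top of the original model, so no separate draft model is needed: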

# Medusa configuration (adds speculative heads to the same model)
python -m vllm.entrypoints.openai.api_server \
    --model FasterDecoding/medusa-vicuna-7b-v1.3 \
    --speculative-model "[medusa]" \
    --num-speculative-tokens 3 \
    --port 8080

Measured speedups

I tested speculative decoding on an A100 across several scenarios:

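// Benchmark code
// Assumes Spring Web and Lombok on the classpath; BenchmarkComparison is an
// application-defined Lombok @Builder class that is not shown here.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;
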
@Component
@RequiredArgsConstructor
public class SpeculativeDecodingBenchmark {
    
    @Value("${vllm.standard.url:http://localhost:8080}")
    private String standardUrl;
    
    @Value("${vllm.speculative.url:http://localhost:8081}")
    private String speculativeUrl;
    
    private final RestTemplate restTemplate;
    
    /**
     * Compare the latency of standard inference and speculative decoding
     */
    public BenchmarkComparison benchmark(List<String> testPrompts) {
        List<Long> standardLatencies = new ArrayList<>();
        List<Long> speculativeLatencies = new ArrayList<>();
        
        for (String prompt : testPrompts) {
            // Time standard inference
            long start = System.currentTimeMillis();
            callModel(standardUrl, prompt, 200);
            standardLatencies.add(System.currentTimeMillis() - start);
            
            // Time speculative decoding
            start = System.currentTimeMillis();
            callModel(speculativeUrl, prompt, 200);
            speculativeLatencies.add(System.currentTimeMillis() - start);
        }
        
        double stdAvg = standardLatencies.stream().mapToLong(Long::longValue).average().orElse(0);
        double specAvg = speculativeLatencies.stream().mapToLong(Long::longValue).average().orElse(0);
        
        return BenchmarkComparison.builder()
            .standardAvgLatency(stdAvg)
            .speculativeAvgLatency(specAvg)
            .speedup(stdAvg / specAvg)
            .build();
    }
    
    private String callModel(String baseUrl, String prompt, int maxTokens) {
        Map<String, Object> body = Map.of(
            "model", "local",
            "messages", List.of(Map.of("role", "user", "content", prompt)),
            "max_tokens", maxTokens
        );
        
        ResponseEntity<Map> response = restTemplate.postForEntity(
            baseUrl + "/v1/chat/completions",
            body,
            Map.class
        );
        
        List<Map> choices = (List<Map>) response.getBody().get("choices");
        Map message = (Map) choices.get(0).get("message");
        return (String) message.get("content");
    }
}

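Speedup by input type:

Input type	Draft acceptance rate	Measured speedup	Why
Code completion (highly regular)	85%	2.8x	Code is highly regular, so the small model guesses right often
Document summarization	75%	2.2x	Summaries follow familiar patterns, so acceptance is fairly high
Open-domain chat	55%	1.6x	Creative answers are hard to predict, so acceptance is low
Translation	80%	2.5x	Source-to-target mappings are regular and easy to guess

The pattern: the more regular and predictable the output, the better speculative decoding performs.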

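When to use it, and when not to

Speculative decoding is not a silver bullet. It performs poorly in a few situations:

Scenario 1: batch processing (batch inference)

The advantage of speculative decoding is that it cuts down on "serial waiting". But when you process many requests at once, the GPU is already fully loaded and Continuous Batching is already very efficient. With the compute units saturated, verifying extra tokens is no longer nearly free, so speculative decoding can actually add overhead here (there is now a draft model to run as well).

Speculative decoding fits best in low-concurrency, latency-sensitive online services such as real-time chat.

Scenario 2: the draft and target models differ a lot in style

If the draft model has gone through special fine-tuning (for example with a different instruction format), its distribution drifts away from the large model's, the acceptance rate drops sharply, and the speedup disappears.

Scenario 3: very short outputs

If the average output is only 20-30 tokens, the startup overhead of speculative decoding takes up a large share of the total. This scenario is not a good fit.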

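import org.springframework.stereotype.Component;

/**
 * Decides whether to enable speculative decoding for a given request.
 * ChatRequest is an application-defined DTO that is not shown here.
 */
@Component
public class SpeculativeRoutingDecider {
    
    /**
     * Inspect the request and decide whether to route it to the speculative decoding service
     */
    public boolean shouldUseSpeculative(ChatRequest request) {
        // Requests with a short expected output are not worth speculating on
        if (request.getMaxTokens() < 50) {
            return false;
        }
        
        // High concurrency (server already saturated): skip speculative decoding
        if (getCurrentServerQps() > 20) {
            return false;
        }
        
        // Regular tasks (translation, summarization, code) benefit the most.
        // The keywords match Chinese-language system prompts:
        // "翻译" = translation, "摘要" = summary, "代码" = code.
        String systemPrompt = request.getSystemPrompt();
        if (systemPrompt != null && (
                systemPrompt.contains("翻译") ||
                systemPrompt.contains("摘要") ||
                systemPrompt.contains("代码"))) {
            return true;
        }
        
        // Default: ordinary chat is not necessarily worth it
        return false;
    }
    
    private int getCurrentServerQps() {
        // Placeholder: fetch the current QPS from the monitoring system
        return 0;  // a real implementation would query Prometheus
    }
}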

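Sketch of the Java integration

On the Java side, speculative decoding is completely transparent to the caller. The interface is identical to the regular OpenAI API, and the only difference is in the server-side configuration: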

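/**
 * Smart routing: latency-sensitive traffic goes to speculative decoding,
 * throughput-heavy traffic goes to standard inference.
 */
@Service
public class SmartLlmRouter {
    
    private final ChatClient speculativeClient;  // points at the speculative decoding service
    private final ChatClient standardClient;     // points at the standard inference service
    private final SpeculativeRoutingDecider decider;
    
    // Explicit constructor: Lombok's @RequiredArgsConstructor does not copy a
    // field-level @Qualifier onto constructor parameters unless lombok.config
    // lists it under copyableAnnotations, so the qualifiers are declared here.
    public SmartLlmRouter(
            @Qualifier("speculativeVllmClient") ChatClient speculativeClient,
            @Qualifier("standardVllmClient") ChatClient standardClient,
            SpeculativeRoutingDecider decider) {
        this.speculativeClient = speculativeClient;
        this.standardClient = standardClient;
        this.decider = decider;
    }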
    
    public String chat(ChatRequest request) {
        ChatClient client = decider.shouldUseSpeculative(request) 
            ? speculativeClient 
            : standardClient;
        
        return client.prompt()
            .system(request.getSystemPrompt())
            .user(request.getUserMessage())
            .call()
            .content();
    }
}

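Summary

What makes speculative decoding counterintuitive is that it uses more compute (an extra draft model) yet runs faster overall. The reason is that it puts the GPU's parallelism to better use: verifying in parallel is more efficient than generating serially.

In real-world engineering, speculative decoding pays off most for low-concurrency, long-output, highly regular tasks, where a 2-3x speedup is common. For high-concurrency batch workloads, Continuous Batching is the better fit. The two are not mutually exclusive, and you can switch between them based on traffic characteristics.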