Understanding Speculative Decoding

Speculative decoding is a latency-reduction technique for large language model (LLM) inference. Instead of letting the large target model generate every token sequentially, a smaller, faster draft model first proposes several candidate tokens ahead of time. The target model then verifies these proposals in a single parallel forward pass, accepting the longest prefix consistent with its own predictions and substituting its own token at the first disagreement. Because every emitted token is one the target model would have produced itself, output quality is preserved while the number of expensive sequential forward passes drops significantly.
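The draft-then-verify loop can be sketched with toy "models". This is a minimal greedy illustration, not a real implementation: both models are plain functions mapping a token sequence to the next token (standing in for LLM forward passes), and all names (`draft_next`, `target_next`, `speculative_step`) are illustrative.

```python
def draft_next(seq):
    # Toy draft model: cheap but imperfect next-token rule.
    return (seq[-1] + 1) % 10

def target_next(seq):
    # Toy target model: the "ground truth" next-token rule.
    # It disagrees with the draft whenever the last token is 4.
    return (seq[-1] + 1) % 10 if seq[-1] != 4 else 0

def speculative_step(seq, k=5):
    """Propose k draft tokens, then verify them against the target model.

    Returns the tokens accepted this step. The target model's k
    verification calls are independent of one another, so a real
    implementation runs them as one batched (parallel) forward pass.
    """
    # 1. Draft phase: autoregressively propose k candidate tokens.
    draft = []
    ctx = list(seq)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: keep the longest prefix the target agrees with.
    accepted = []
    ctx = list(seq)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)          # draft agrees with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)   # replace first mismatch, then stop
            break
    else:
        # All k drafts accepted: the target still yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

seq = [0]
while len(seq) < 12:
    seq.extend(speculative_step(seq, k=5))
print(seq[:12])  # identical to generating with target_next alone
```

Each call to `speculative_step` emits up to k+1 tokens for a single round of (parallelizable) target-model verification, which is where the latency savings come from; when the draft model disagrees early, the step degrades gracefully toward one token per round.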
