Understanding Speculative Decoding

Speculative decoding is a latency-reduction technique for large language model (LLM) inference. Instead of letting the large target model generate every token sequentially, a smaller, faster draft model first proposes several candidate tokens ahead of time. The target model then verifies these proposals in a single parallel forward pass, accepting the longest prefix consistent with its own predictions and substituting its own token at the first disagreement. Because every emitted token is one the target model would have produced itself, output quality is preserved while the number of expensive sequential forward passes drops significantly.
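The draft-then-verify loop can be sketched with toy "models". This is a minimal greedy illustration, not a real implementation: both models are plain functions mapping a token sequence to the next token (standing in for LLM forward passes), and all names (`draft_next`, `target_next`, `speculative_step`) are illustrative.

```python
def draft_next(seq):
    # Toy draft model: cheap but imperfect next-token rule.
    return (seq[-1] + 1) % 10

def target_next(seq):
    # Toy target model: the "ground truth" next-token rule.
    # It disagrees with the draft whenever the last token is 4.
    return (seq[-1] + 1) % 10 if seq[-1] != 4 else 0

def speculative_step(seq, k=5):
    """Propose k draft tokens, then verify them against the target model.

    Returns the tokens accepted this step. The target model's k
    verification calls are independent of one another, so a real
    implementation runs them as one batched (parallel) forward pass.
    """
    # 1. Draft phase: autoregressively propose k candidate tokens.
    draft = []
    ctx = list(seq)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: keep the longest prefix the target agrees with.
    accepted = []
    ctx = list(seq)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)          # draft agrees with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)   # replace first mismatch, then stop
            break
    else:
        # All k drafts accepted: the target still yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

seq = [0]
while len(seq) < 12:
    seq.extend(speculative_step(seq, k=5))
print(seq[:12])  # identical to generating with target_next alone
```

Each call to `speculative_step` emits up to k+1 tokens for a single round of (parallelizable) target-model verification, which is where the latency savings come from; when the draft model disagrees early, the step degrades gracefully toward one token per round.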
