
Diffusion models have been shown to be as fast as, or even faster than, traditional models of comparable size. The researchers behind LLaDA report that their 8 billion parameter model performs on par with LLaMA3 8B, achieving competitive scores across benchmarks including MMLU, ARC, and GSM8K.
Mercury, meanwhile, has highlighted impressive speed gains. Its Mercury Coder Mini scored 88.0% on HumanEval and 77.1% on MBPP, rivaling GPT-4o Mini, while generating a striking 1,109 tokens per second compared to GPT-4o Mini’s 59 tokens per second. That amounts to roughly a 19-fold speedup with comparable performance on coding benchmarks.
According to Mercury’s documentation, their models run at over 1,000 tokens per second on Nvidia H100 GPUs, speeds previously attainable only with custom chips from specialized hardware makers such as Groq, Cerebras, and SambaNova. Compared with other speed-optimized models, the advantage is substantial: Mercury Coder Mini is estimated to be about 5.5 times faster than Gemini 2.0 Flash-Lite (201 tokens per second) and roughly 18 times faster than Claude 3.5 Haiku (61 tokens per second).
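To put those throughput numbers in user-facing terms, here is a quick back-of-envelope calculation in Python (a sketch, not a benchmark: it assumes sustained decode speed and ignores time-to-first-token and network overhead):

```python
# Back-of-envelope latency for a 500-token completion, using the
# throughput figures quoted above. Assumes sustained decode speed;
# ignores time-to-first-token, batching, and network overhead.
throughputs = {
    "Mercury Coder Mini": 1109,
    "Gemini 2.0 Flash-Lite": 201,
    "Claude 3.5 Haiku": 61,
    "GPT-4o Mini": 59,
}
for name, tps in throughputs.items():
    print(f"{name:>22}: {500 / tps:5.1f} s for 500 tokens")
```

At the quoted rates, a 500-token completion takes about half a second on Mercury Coder Mini versus roughly eight and a half seconds on GPT-4o Mini.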
Exploring New Possibilities in LLMs
Despite their advantages, diffusion models come with certain trade-offs. They generally require multiple forward passes through the network to produce a complete response, whereas a traditional autoregressive model generates each token with a single forward pass. Because every diffusion pass updates many token positions in parallel, however, these models can still achieve higher overall throughput, as the sketch below illustrates.
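Here is a minimal, purely illustrative Python sketch of the two decoding loops. Everything in it is hypothetical: fake_forward is a random stand-in for a real network, not the API of Mercury or LLaDA, and the re-masking schedule only crudely imitates how masked-diffusion decoders refine low-confidence positions.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def fake_forward(tokens):
    """Stand-in for one forward pass: fills every masked position at once."""
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def autoregressive_generate(length):
    """One forward pass per token: `length` passes for `length` tokens."""
    out = []
    for _ in range(length):
        out.append(fake_forward([MASK])[0])  # one pass, one token
    return out, length

def diffusion_generate(length, steps=4):
    """A fixed number of parallel refinement passes, regardless of length."""
    seq = [MASK] * length
    for step in range(steps):
        seq = fake_forward(seq)  # all masked positions filled in parallel
        if step < steps - 1:
            # Re-mask a shrinking random subset, crudely mimicking the
            # coarse-to-fine refinement of a real masked-diffusion decoder.
            n_remask = length * (steps - 1 - step) // steps
            for i in random.sample(range(length), n_remask):
                seq[i] = MASK
    return seq, steps

_, ar_passes = autoregressive_generate(16)
_, diff_passes = diffusion_generate(16, steps=4)
print(f"autoregressive: {ar_passes} forward passes; diffusion: {diff_passes}")
```

The point of the toy is the pass count: the autoregressive loop makes 16 passes for 16 tokens and would make 500 for 500, while the diffusion loop always makes 4, each one touching every position at once.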
Inception believes these speed advantages could transform code completion tools by giving developers instant suggestions, and could also benefit conversational AI applications, resource-constrained mobile environments, and AI agents that need to respond quickly.
If diffusion-based language models can maintain high quality while also speeding up processing times, they could significantly shift the trajectory of AI text generation. AI researchers have been increasingly open to exploring new methodologies.
Independent AI researcher Simon Willison shared with Ars Technica, “I love that people are experimenting with alternative architectures to transformers; it illustrates how much of the realm of LLMs remains uncharted.”
On X, former OpenAI researcher Andrej Karpathy noted about Inception, “This model has the potential to be distinct, showcasing unique psychological traits, as well as new strengths and weaknesses. I encourage experimentation!”
Questions linger over whether larger diffusion models can match the likes of GPT-4o and Claude 3.7 Sonnet, especially on complex reasoning tasks. For now, though, they offer a viable alternative among smaller AI language models, delivering speed without sacrificing capability.
You can try Mercury Coder yourself on Inception’s demo site, download LLaDA’s code, or try a LLaDA demo on Hugging Face.
