Submission: Cerebras Breaks AI Inference Speed Records, Outpaces GPUs by 20x (cerebras.ai)
mimd writes: Cerebras Systems is making big claims about its new AI inference speeds, boasting 1,850 tokens per second on Llama 3.1 8B and 446 tokens per second on Llama 3.1 70B, according to benchmarks from Artificial Analysis. Those numbers are nearly 20x faster than what OpenAI, Google, and Anthropic are currently putting out.
Powered by the pizza-sized WSE-3 chip, a massive piece of silicon with 900,000 cores and 44GB of on-chip memory, Cerebras is pushing the boundaries of what’s possible in AI inference. The company claims the approach scales to even larger models, with Llama 3.1 405B support arriving in the next few weeks at less than 1% loss in efficiency. The real question is: can they deliver consistently at scale, and how will this affect chip makers that don’t physically couple memory with compute?
If you want to see for yourself, they’ve got a free chat UI (sign-in required) where you can kick the tires. For more details, check out their press release.
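If you’d rather measure than take their word for it, below is a minimal sketch of timing decode throughput against an OpenAI-compatible streaming endpoint. The base URL, model name, and one-token-per-chunk assumption are illustrative guesses for the sake of the example, not confirmed Cerebras values:

    # Rough tokens/sec measurement against an OpenAI-compatible streaming API.
    # base_url and model below are assumptions for illustration only.
    import time
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",  # hypothetical endpoint
        api_key="YOUR_API_KEY",
    )

    first_token_at = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model="llama3.1-8b",  # hypothetical model identifier
        messages=[{"role": "user", "content": "Explain wafer-scale integration."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # clock starts at first token
            n_chunks += 1  # roughly one token per streamed chunk

    if first_token_at and n_chunks > 1:
        elapsed = time.perf_counter() - first_token_at
        # Tokens after the first, divided by time since the first arrived
        print(f"~{(n_chunks - 1) / elapsed:.0f} tokens/sec (decode only, chunk-counted)")

Chunk-counting only approximates tokenizer counts, and this measures steady-state decode speed rather than time-to-first-token, but it’s enough to sanity-check a four-digit tokens/sec claim.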