Last week, Chinese AI company DeepSeek dropped R1, its latest reasoning model, similar to OpenAI's o1 released last fall. It's an exciting and potentially groundbreaking achievement, particularly given that it's open source.
I'm not the one to pick apart the performance implications or benchmark comparisons against the likes of GPT, Claude, and other models. What's most interesting to me in this story is the nature of the breakthrough: the situation the company found itself in compelled a fundamentally different approach to model training.
Foundation LLMs like GPT and Claude are trained through self-supervised learning on massive text datasets. The process requires enormous computational resources and typically costs millions in computing power. This is why we see these massive superclusters of GPUs, with every AI lab investing billions.
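To make "self-supervised learning" concrete: these models are trained to predict the next token of raw text, so the data supervises itself and no human labels are needed. Here is a deliberately tiny sketch of that objective in PyTorch; the model, the random "corpus", and the hyperparameters are toy placeholders, nothing like a production pipeline.

```python
# Toy sketch of self-supervised next-token prediction, the core training
# objective behind foundation LLMs. Everything here (model size, fake corpus,
# hyperparameters) is a placeholder for illustration only.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, dim),     # token ids -> vectors
    nn.Linear(dim, vocab_size),        # vectors -> scores for the next token
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in "corpus": random token ids playing the role of tokenized web text.
corpus = torch.randint(0, vocab_size, (1000,))

for step in range(100):
    start = torch.randint(0, len(corpus) - 65, (1,)).item()
    chunk = corpus[start : start + 65]
    inputs, targets = chunk[:-1], chunk[1:]  # shift by one: predict the next token
    logits = model(inputs)                   # shape (64, vocab_size)
    loss = loss_fn(logits, targets)          # no human labels: the text supervises itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The expensive part is simply doing this over trillions of tokens with billions of parameters, which is where the superclusters come in.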
DeepSeek reportedly improved on this methodology by leaps and bounds. From one report:

DeepSeek-R-1 was released just last week. It performs about as well as OpenAI’s o1 reasoning model but is about a tenth the cost. DeepSeek’s non-reasoning model, V3, is similarly disruptive. About as good as GPT-4o but one-fifth the price.
A breakthrough like this seems hard to believe, and the market will eventually separate fact from fiction. So far, though, the claims appear to hold up.
It seems that the catalyst for their unconventional approach was constraint:
So how did DeepSeek achieve such a breakthrough?
R-1 appears to borrow heavily from methods pioneered by reasoning models like o1. The remarkable achievement is how DeepSeek researchers managed to replicate advanced reasoning on what would typically be considered “lower-grade” hardware, simultaneously lowering both cost and latency. It’s important to note that China already had a range of GPT-4-level models by late 2023—Qwen 72B, for instance, exceeded GPT-4 on certain Chinese benchmarks in December 2023. While o1 showcased a new paradigm of reasoning models by scaling test-time compute, DeepSeek demonstrated that replicating such reasoning may not be as resource-intensive as originally assumed.
Whereas the main US labs have access to the latest Nvidia chips, DeepSeek, at the mercy of US export restrictions, does not. This meant they used Nvidia’s cut-down H800 chips. Necessity is the mother of invention. How many? We don’t know.
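The "scaling test-time compute" idea in that passage is worth making concrete. One simple, well-known flavor of it is to sample many reasoning attempts and take a majority vote over the final answers, spending more inference compute to buy more reliability. The sketch below is a toy illustration of that principle only; answer_once is a made-up stand-in for a real model call, not DeepSeek's or OpenAI's actual technique.

```python
# Toy illustration of spending more compute at inference time: sample several
# answers and take a majority vote. answer_once() is a made-up stand-in for a
# model call; this shows the general principle, not any lab's actual method.
import random
from collections import Counter

def answer_once(question: str) -> str:
    # Pretend each sampled reasoning chain is right 60% of the time.
    return random.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

def answer_with_more_compute(question: str, n_samples: int) -> str:
    # More samples means more test-time compute and a more reliable majority.
    votes = Counter(answer_once(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_more_compute("What is 6 * 7?", n_samples=1))   # noisy
print(answer_with_more_compute("What is 6 * 7?", n_samples=64))  # almost always "42"
```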
Ben Thompson made the same point about the hardware constraint this week in his DeepSeek FAQ:
Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with much fewer optimizations specifically focused on overcoming the lack of bandwidth.
Unlike the big foundation models, DeepSeek R1 was developed under significant computational constraint. The team couldn't rely on massive computing resources, which led to several innovative adaptations.
As I've written before, embracing constraints forces you to visit paths that might be overlooked by others who are less constrained. The DeepSeek team's lack of access to higher-bandwidth, higher-performance gear forced them to focus on the directions still open to them. They couldn't just throw compute at the problem; they had to drop down into the lower levels of the GPU instruction set and look for optimizations there. That's relatively nuts to even consider bothering with if you've got access to superclusters with 100K state-of-the-art nodes.
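What does bandwidth-driven, low-level optimization even look like? DeepSeek's specific kernel-level work isn't something I can reproduce here, but a loose, CPU-side analogy is tiling a matrix multiply so data gets reused from fast memory instead of repeatedly pulled across a slow bus. The same instinct, applied at the GPU level, is the kind of work a bandwidth-limited chip forces on you.

```python
# A loose, CPU-side analogy for bandwidth-aware optimization: tile a matrix
# multiply so each block of data is reused many times from fast memory instead
# of being re-fetched across a slow bus. Purely illustrative; this is not
# DeepSeek's method, just the general shape of data-movement optimization.
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each small block is loaded once and reused for a whole tile of C.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)  # same result, different data-movement pattern
```

The arithmetic is identical either way; the gain comes purely from moving less data over the slow link, which is the general problem a bandwidth-limited chip forces you to solve.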
Constraints act as a forcing function for creativity. When you don't have access to tens of billions in capital or the same technical resources, you find yourself testing the options that the unconstrained innovator ignores or overlooks.
This isn't to say that "unconstrained" labs like OpenAI aren't innovators; everyone has constraints somewhere. But a team searching for breakthroughs is incentivized to look for those leaps in areas where it has a unique edge. If you have access to $1bn to spend and no one else does, you're naturally incentivized to use that advantage. To look where others can't.
Looking back over the history of invention, we find countless examples of limitations driving creativity. IKEA figured out modular, flat-packed, self-assembled furniture because of shipping constraints. Twitter's character limit originated in the limits of SMS and generated a new communication format. The Apollo 13 crew famously had to cobble together a CO₂ filter from the materials on hand in the capsule. Constraint forces us to get creative.
DeepSeek had the incentive to look for answers under the rocks that the big labs walked right past.
What are some other examples of finding novelty in constrained environments?