Part I: The Gala, the Suburbs, and the “Months Behind” Myth in LLM Labs
robot flex, tech science parks, the supply chain of LLMs, post-training, data arbitrage
Hello! Over the last week, X has gone into a frenzy about the CCTV Spring Gala’s robotics shows. And tbh they were truly impressive in their agility, flexibility, and mobility. I’m resurfacing some old robotics pieces I’ve written and put under “Physical AI” on AI Proem. But I’m not going into details about that today, because even though these robots are impressive looking right now, I still don’t think they’re ready for mass scale yet.
Anyway, I am going to break down my Spring Festival series into 3 parts for ease of reference, since otherwise one piece would be way too long.
Part 1 (this post) focuses on what, how, and when the Chinese labs are chasing the frontier labs. Part 2 will look at the business models and constraints of Chinese LLM labs, and Part 3 will look at how the Spring Festival Gala is like China’s Super Bowl.
Reflections
So last week I was in Beijing. What really interested me was what was happening outside the 5th Ring Road. Driving around Changping, Miyun, and Shunyi, districts that were mostly just farmland and suburban houses, I saw a lot of “new tech” injection.
The first thing you feel is a sense of connectivity. The high-speed bullet train has now connected many of these areas to the city center for under 35 RMB (about five bucks), with a 20–25 minute ride. A few years ago, that was a minimum one-hour commute. That is not a lifestyle upgrade. It changes the economics of distance. Places that used to feel like “far suburbs” suddenly feel like plausible overflow for offices, campuses, and talent.
The second thing you see is capacity. Many new developments are either just completed or clearly up and coming. A lot of these science parks don’t seem vacant, but they’re not filled up either. They sit in that uncomfortable middle state that shows up when a place builds for a future that hasn’t fully arrived yet.
Then there’s the symbolism. Even the streets are named after ones near the Tsinghua University campus, the Beijing hub for innovation and technology. About an hour north of the city center and an hour east of the university cluster in Haidian, I saw a newly built campus with roads named 英才路 (Talent Road) and 未来科学路 (Future Science Road). This is just one of many science parks and tech office complexes being built out.
It’s not that a street sign proves anything. But it does reveal intent from top-down planning. And that context matters for how I’ve been interpreting China’s AI moment, especially after a week of conversations across labs, American AI companies operating in China, and product managers at US-listed tech firms.
I went into the week carrying a ‘dumb question’ I’ve had for a long time.
“What does it mean that Chinese research is only a few months away from US research in LLM development?”
I never understood how research gets quantified by time. And last week, someone finally gave me a framing that translated “months” into something concrete.
Why Chinese labs can keep catching up to frontier labs
So to start, here’s how some people explained the catch-up mechanism to me, and why it finally made “months behind” feel like a real statement instead of a vague boast or warning, depending on where you sit.
Put simply, it was said that China’s top models can keep “catching up” because DeepSeek supplies a frontier-ish base model via open release, and everyone else closes the remaining gap through post-training once the right inputs become accessible. Some people described it as “copying the homework” more efficiently than distillation.
If that’s true, then China’s odds of genuine paradigm innovation, not just catch-up, depend disproportionately on DeepSeek. And it’s hard not to notice that a lot of people are still pondering where DeepSeek’s latest release is.
The part that clicked for me is that “months behind” doesn’t have to mean “months worse at research.” It can mean a predictable lag in access and cycle time: exclusivity ends, procurement happens, post-training runs, iteration loops tighten, and the gap narrows.
Now, there’s obviously been debate over whether Chinese labs are copying or innovating. From what I’ve seen and heard, it’s a bit of both. The cleanest version of the story is about access and iteration. The reality includes real systems work, too. But the mechanism is still useful because it explains how “catch-up” can be fast.

Let me break it down the way it was explained to me.
First, the base model. A person at a frontier lab framed it bluntly: in China, the foundation or base model is often anchored to DeepSeek’s open releases. They gave me examples, which I’m passing along as unverified claims: Kimi K2.5’s fundamental infrastructure is built on DeepSeek V3, and GLM-5 is similarly built on DeepSeek V3.2.
The implication was pragmatic rather than ideological. If DeepSeek releases something near the frontier, then for many Chinese labs, doing frontier pretraining themselves is not the best use of time. Waiting becomes leverage. One person even described a mindset inside their org that months of experiments can still underperform simply waiting for DeepSeek’s next open release.
And to be fair, calling this “free-riding” misses the point of open release. The whole logic of open sourcing is diffusion. If one team publishes an upstream base and others build downstream capability on top, that’s not necessarily a failure of innovation. It’s how ecosystems scale.
Second, post-training. This is where even I get lost sometimes, because the terminology can feel like alphabet soup. The simplest translation I’ve heard is: pretraining teaches the model the world; post-training teaches the model how to behave in it.
When people say post-training, they’re usually talking about some combination of:
SFT (supervised fine-tuning): training on human-labeled examples of what “good” answers look like
RLHF (reinforcement learning from human feedback): using preference judgments to push the model toward outputs humans like and trust
DPO (direct preference optimization): a preference-based method that can be more operationally straightforward than full RLHF in some pipelines
Tool-use training: teaching the model to call tools and follow structured workflows reliably
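To make the DPO line a little less alphabet-soupy, here is a minimal sketch of the per-example loss from the published DPO formulation, in plain Python. The function name, argument names, toy numbers, and the default beta of 0.1 are illustrative assumptions, not any lab’s actual recipe. Real pipelines compute this over batches of token log-probabilities inside a deep learning framework.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss; inputs are summed log-probabilities of each answer."""
    # How much more the model being trained favors each answer than the
    # frozen reference model does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy numbers only: the policy has moved toward the human-preferred answer.
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Same numbers with the preference flipped.
worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(better < worse)  # prints True
```

The point of the sketch is that there is no separate reward model and no RL loop here: the preference pair itself is the training signal, which is why some pipelines find DPO operationally simpler.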
The off-the-record anecdote I heard repeatedly went like this: a frontier lab, let’s say the team behind Gemini, pays ~$10 million for a set of premium labeled data and negotiates one to two months of exclusivity during which the vendor cannot resell it.
Then, after that exclusivity ends, labs like Zhipu, MiniMax, and others can allegedly spend ~$1–2 million to buy a portion of that already labeled dataset. And then they aggressively post-train, using high-signal data to target alignment and capability tuning.
People described it as “having the answer keys to the quiz already” and laughed at it all, saying it was more effective than distillation. The intuition is that direct, high-quality labeled preference and trajectory data have higher information density than learning indirectly from another model’s outputs.
At the same time, I also heard what some people described as a hybrid approach, which makes the whole story more believable than a simplistic “dataset resale” narrative. Distill first from the frontier model trained on the $10 million dataset, then buy a narrower labeled set for ~$1–2 million to supplement or fine-tune. The aim is to hit 80–90% of the capability while spending one-fifth to one-tenth of the money. In that framing, “post-training beats distillation” really means “post-training with the right labeled data is higher-signal than pure distillation when you can afford it.”
Third, what this implies about talent. If the base model is “good enough” and the biggest gains come from post-training recipes and fast iteration loops, then elite pretraining pedigree matters less on the margin. Execution matters more: data acquisition, evaluation loops, post-training engineering, and speed without regressions, because what they’re optimizing is the downstream lever.
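The arithmetic behind that hybrid pitch can be sketched in a few lines of Python. The dollar figures and the 80–90% range come from the anecdote; the function names are hypothetical, and the “capability per dollar” ratio is a crude illustration I am assuming for the sake of comparison. It ignores distillation and compute costs and treats “percent of capability” as if it were a measurable unit, which it isn’t.

```python
def spend_fraction(follower_musd, frontier_musd):
    """Follower's data spend as a fraction of the frontier lab's spend."""
    return follower_musd / frontier_musd


def capability_per_dollar_ratio(capability_fraction, spend_frac):
    """Crude ratio: share of capability reached divided by share of money spent."""
    return capability_fraction / spend_frac


FRONTIER_MUSD = 10.0  # ~$10M premium dataset, per the anecdote

# Low and high ends of the spend range quoted in the anecdote.
cheap = spend_fraction(1.0, FRONTIER_MUSD)   # ~$1M
pricey = spend_fraction(2.0, FRONTIER_MUSD)  # ~$2M
print(f"spend: {cheap:.0%} to {pricey:.0%} of the frontier budget")

# Assumed pairing: the worst case is 80% capability at the higher spend,
# the best case is 90% capability at the lower spend.
worst = capability_per_dollar_ratio(0.80, pricey)
best = capability_per_dollar_ratio(0.90, cheap)
print(f"capability per dollar: roughly {worst:.0f}x to {best:.0f}x")
```

Under those assumptions the follower spends 10% to 20% of the frontier data budget and gets roughly 4x to 9x the capability per data dollar, which is the whole appeal of the approach.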
So when people online are trolling developers for their “taste,” the taste isn’t in their Lululemon ABC pants or their On sneakers and HOKA runners. It’s in their ability to pick which dataset to train on, which tasks to prioritize, and which benchmark to max.
All of this only works if a few assumptions hold simultaneously. DeepSeek’s base models must be reusable and near the frontier to serve as a platform. Premium post-training datasets have to become available after exclusivity, and Chinese labs have to be able to reliably buy meaningful slices. Post-training still has to deliver large marginal returns, meaning you can close big gaps without redoing frontier-scale pretraining. And teams have to be able to run the loop repeatedly: filter data, train, evaluate, iterate.
So now, let’s say those assumptions hold; the implications are quite interesting.
“Catch-up” starts looking like a system-level arbitrage. As a non-technical person, I always thought “months away” meant the research just arbitrarily took a few extra months to crack, and I never bought that. If you give me a Math Olympiad question, no matter how much time you give me, I’m not solving it. And raw intelligence isn’t the answer either. These researchers move from lab to lab; their IQs aren’t the bottleneck. Data access, the creativity of how to synthesize data, and the speed of iteration are what create the lag. That framing finally made the “months” language make sense.
In this framing, open base models reduce the cost of frontier exploration. Dataset exclusivity creates a time-lag window. China closes gaps by exploiting that window through cheaper post-training. “Months behind” becomes procurement plus cycle time.
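One way to read “procurement plus cycle time” is as simple addition. The sketch below is a toy model under stated assumptions: only the one-to-two-month exclusivity window comes from the anecdote above. The procurement time, the length of a post-training cycle, and the number of cycles are hypothetical placeholders, not reported figures, and the function name is my own.

```python
def visible_lag_months(exclusivity, procurement, cycle_length, cycles=1):
    """Months from a frontier lab receiving a dataset to a follower shipping on it."""
    return exclusivity + procurement + cycle_length * cycles


# Hypothetical placeholders, not reported figures.
PROCUREMENT_MONTHS = 0.5
CYCLE_MONTHS = 1.0
CYCLES = 2

# Exclusivity of one to two months is the only input taken from the anecdote.
for exclusivity in (1.0, 2.0):
    lag = visible_lag_months(exclusivity, PROCUREMENT_MONTHS, CYCLE_MONTHS, CYCLES)
    print(f"exclusivity {exclusivity:.0f} mo -> visible lag {lag:.1f} mo")
```

The specific numbers don’t matter. What matters is that every term is a scheduling or access variable, not a measure of research ability, so shortening any one of them shortens the “months.”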
It also makes DeepSeek look less like “another lab” and more like THE upstream infrastructure. It sets the ceiling and tempo for China’s base-model capability. Other labs become downstream capability packagers: alignment, tool use, productization, distribution.
And the loaded implication follows: if most labs rationally “wait for DeepSeek,” then frontier innovation gets concentrated. China’s odds of genuine paradigm leadership depend disproportionately on whether DeepSeek can keep pushing the frontier, not just releasing.
The fastest catch-up should show up in areas responsive to post-training data, such as tool use, instruction-following, formatting, domain QA, and preference alignment. It should be slower in areas that require deep pretraining innovations. China’s big base-model jumps should cluster around DeepSeek’s major releases. And the narrative should drift toward agents, outcomes, subscriptions, and away from “we invented new pretraining paradigms.”
But there is an important caveat: the clean mechanism is not the whole story. It risks understating the amount of real systems work required and overstating one upstream node.
Even the public technical artifacts around the models people like to cite point in that direction. Kimi K2’s technical report describes large-scale agentic data synthesis plus a joint RL stage. GLM-5’s model card highlights an asynchronous RL infrastructure they call “slime,” designed to speed up post-training iteration. And the NVIDIA model card for Kimi K2.5 describes continual pretraining on roughly 15T multimodal tokens on top of Kimi-K2-Base. Those aren’t “copy a dataset, and you’re done” signals. They’re signs of continued scaling and infrastructure investment.
So I’ll keep four closing thoughts in view:
Post-training is a huge lever, but it isn’t cheap copying. Great data doesn’t remove the need for reward modeling and preference modeling, stable RL infrastructure, careful evaluation to avoid regressions, and deployment efficiency. Iteration loops matter.
“China’s paradigm innovation depends on DeepSeek” is too concentrated a bet, and also anecdotal. But it’s still a useful reminder: in technology and in business, you don’t always need to be the first mover if you can reliably close the gap.
This framing helps explain why Chinese models can end up tracking American labs’ taste. If the post-training signal comes from labeled data shaped by frontier ecosystems, and in some cases models are distilled from frontier models trained on those datasets, alignment preferences converge.
Which brings me back to DeepSeek’s absence. We kept hearing that DeepSeek would release something during Spring Festival, but it hasn’t happened yet. Some people told me the team has found yet another breakthrough in engineering efficiency and is still working on it. Others said they’ve been hit by the poaching of critical team members. But the downstream expectation is consistent: people are waiting for an efficiency jump across cost, chips, and inference.
Parting words for now
So, if you can start from a frontier-ish base model when it becomes available, then buy or synthesize the right post-training signal when exclusivity windows open, and run a tight train-eval-iterate loop without breaking the model, you can compress the visible gap faster than most people expect.
That doesn’t mean China is “just copying,” nor does it mean China is leading. The public artifacts already show how much real-world systems work lies beneath the best Chinese models. Continued pretraining, synthetic data pipelines, and RL infrastructure are not the hallmarks of a lazy strategy. They’re the hallmarks of teams that are serious about closing the last mile of behavior, reliability, and product usefulness.
But the framing does change how I interpret the direction of travel. If a meaningful portion of post-training signal is shaped by frontier ecosystems, then the “catch-up” path naturally pulls Chinese models toward the frontier labs’ tastes. It also means that, when people say “China is a few months behind,” they may be describing something very literal: an access window, a procurement cycle, and a post-training schedule.
The risky and kinda sketchy implication is concentration. If a lot of teams rationally treat DeepSeek as upstream infrastructure, then the ceiling and tempo of base-model progress become more dependent on a smaller number of upstream breakthroughs. That’s why the absence of a new DeepSeek release is more than gossip. It becomes a system-level question: what happens to the downstream cadence when the upstream supplier pauses, accelerates, or changes direction?
If, and it’s a big if, this all holds, then the things to watch are frontier capability jumps followed by downstream closing behavior, which capabilities move fastest, and whether the next step change comes from a new method, a new data regime, or an efficiency breakthrough. That’s the lens Part 1 gives me.
And it tees up the next question cleanly: even if catch-up is structurally possible, does any of this translate into a durable business?
That’s for Part 2.
P.S. Excited to see many of you in SF in a few days.





This is really well done. I have one question, though.
We often see Chinese labs introduce interesting architectural innovations, for example Kimi’s MuonClip optimizer or DeepSeek’s hyper-connections approach. In these cases, the improvements don’t primarily come from pre-training scale or post-training techniques. They’re more algorithmic or architectural breakthroughs.
How do you think these kinds of innovations factor into the overall trajectory? Are they central drivers of progress, or more like exceptions to the broader trend?
Grace, awesome article, as always!
The only question I’m still puzzled about: if Chinese labs just wait for the new DeepSeek models to come out, why do GLM and Kimi perform consistently better than DeepSeek on benchmarks?
https://artificialanalysis.ai/#intelligence#artificial-analysis-intelligence-index
Is it because post-training gives them an additional boost in performance?