DeepSeek-OCR Showcases the Team's Creativity, Yet Again
Who said Chinese AI developers are just a bunch of nerds?
Imagine this: a room of AI developers hunched over their computers, sipping late-night bubble teas, when suddenly a light bulb goes off and someone asks, “If a picture is worth a thousand words, why don’t we try using images as input to LLMs?”
We all know this saying, which originated in the ad industry. But now, the DeepSeek team has made it the brain prompt of the month for AI developers.
DeepSeek dropped an OCR paper today, and what jumped out to me is their extreme creativity and ability to think outside the box. DeepSeek’s team first showed its engineering ingenuity in its R1 model and has now turned “a picture is worth a thousand words” into a literal technological breakthrough.
From the Pink Floyd devotees on Moonshot’s team to the philosophers at DeepSeek: who said they’re all just a bunch of nerds?
How it works
In plain English: DeepSeek-OCR (an Optical Character Recognition system) treats a page image like a super-compressed “zip file” for text. Instead of keeping thousands of text tokens in memory, the model turns the image into a small set of vision tokens, and when you need the content back, it “reads” those tokens into words. The paper reports roughly 97% text recovery at about 10× compression, and still around 60% at about 20×, which is a big deal for cost and speed.
Think of a memory from twenty years ago. You remember enjoying that holiday by the beach with your parents, but you’re no longer sure what flavor your ice cream cone was. The details are fuzzy, but with some focus and a bit of backward reasoning you can reconstruct a lot, and things become much more vivid. The same idea applies here: you keep a compact “picture memory,” recover about 97% of what it contained, and then perhaps fill in the remaining details with some tricks and prompting. It’s a clever, very practical way to preserve brain real estate.
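The compression figures above reduce to simple arithmetic: the ratio is the number of text tokens a page would need divided by the number of vision tokens it is stored as. Here is a minimal sketch of that calculation. The function names and the example page size are illustrative assumptions, not anything taken from the paper or the model's code.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if text_tokens < 0 or vision_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and vision_tokens > 0")
    return text_tokens / vision_tokens


def tokens_saved(text_tokens: int, vision_tokens: int) -> int:
    """Return the context tokens freed by storing the page as vision tokens."""
    return text_tokens - vision_tokens


# Hypothetical page: 1,000 text tokens stored as 100 vision tokens.
ratio = compression_ratio(1000, 100)
saved = tokens_saved(1000, 100)
print(f"{ratio:.1f}x compression, {saved} tokens saved")
# prints "10.0x compression, 900 tokens saved"
```

At that 10× point the reported recovery is roughly 97%. Push the same page down to 50 vision tokens and you are at 20×, where the reported figure drops to around 60%.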
How it’s built
Their innovation has two parts: a DeepEncoder that turns a page into a tiny number of vision tokens (64/100/256/400, plus a tiled “Gundam” mode for dense pages), and a lightweight Mixture-of-Experts decoder that turns those tokens back into text (only ~570M parameters “wake up” at each step, so it runs like a small model). The encoder is like a smart zipper: it looks locally first, shrinks the token count ~16× before doing the global pass, and lets you pick the cheapest setting that still reads cleanly. When you need the content later, the decoder reads the tokens back into words. Simple.
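To make “pick the cheapest setting that still reads cleanly” concrete, here is a rough sketch of one possible selection rule. The four budgets are the ones listed above. The rule itself, the `max_ratio` threshold of 10× (borrowed from the roughly 97% recovery figure), and the function name are assumptions made for illustration, not the team's actual logic.

```python
from typing import Optional

# Vision-token budgets listed in the article, cheapest first.
VISION_TOKEN_BUDGETS = (64, 100, 256, 400)


def pick_vision_budget(text_tokens: int, max_ratio: float = 10.0) -> Optional[int]:
    """Return the smallest budget whose compression ratio stays within max_ratio.

    Hypothetical rule: returns None when even the largest budget compresses
    too hard, which is the case where a tiled mode for dense pages would apply.
    """
    for budget in VISION_TOKEN_BUDGETS:
        if text_tokens / budget <= max_ratio:
            return budget
    return None


for page_tokens in (600, 900, 5000):
    print(page_tokens, pick_vision_budget(page_tokens))
# prints:
# 600 64
# 900 100
# 5000 None
```

A 600-token page fits in the 64-token setting at about 9.4×, a 900-token page needs the 100-token setting, and a 5,000-token page exceeds 10× even at 400 tokens, so it falls through to `None`.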
Why this matters
This unlocks much cheaper memory for long contexts such as old chat history or big PDFs. Store images/vision tokens instead of mountains of text tokens, then recover the text when you actually need it. They even show a “gradual forgetting” trick: keep older pages at lower resolution to save more tokens. And the throughput isn’t hand-wavy: the paper cites hundreds of thousands of pages per day on a single A100-40G, so this feels deployable, not just academic. If this can scale, it could redefine how LLMs handle context and, more importantly, cost.
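The “gradual forgetting” idea can be sketched as an age-based token budget: recent pages get the most vision tokens, older pages get fewer. The tier schedule below reuses the four token counts mentioned earlier, but the mapping from age to tier, the `pages_per_tier` parameter, and the function names are hypothetical choices for illustration. The paper describes the idea in terms of lowering resolution, not this exact scheme.

```python
# Token budgets from the article, ordered from freshest tier to oldest tier.
BUDGETS_NEWEST_TO_OLDEST = (400, 256, 100, 64)


def budget_for_age(age_in_pages: int, pages_per_tier: int = 2) -> int:
    """Return the vision-token budget for a page that is age_in_pages old.

    Age 0 is the most recent page. Every pages_per_tier pages, the budget
    drops one tier, bottoming out at the smallest budget.
    """
    if age_in_pages < 0 or pages_per_tier <= 0:
        raise ValueError("age_in_pages must be >= 0 and pages_per_tier > 0")
    tier = min(age_in_pages // pages_per_tier, len(BUDGETS_NEWEST_TO_OLDEST) - 1)
    return BUDGETS_NEWEST_TO_OLDEST[tier]


def history_cost(num_pages: int, pages_per_tier: int = 2) -> int:
    """Return total vision tokens for a history of num_pages pages."""
    return sum(budget_for_age(age, pages_per_tier) for age in range(num_pages))


pages = 8
print(history_cost(pages), "tokens with forgetting")
print(pages * BUDGETS_NEWEST_TO_OLDEST[0], "tokens at full resolution")
# prints:
# 1640 tokens with forgetting
# 3200 tokens at full resolution
```

Under this made-up schedule, an eight-page history costs about half as many tokens as keeping every page at the largest setting, at the price of blurrier recall for the oldest pages.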
Bigger picture
As Andrej Karpathy has pointed out, pure text can potentially be a wasteful interface for models. Multimodal approaches like this hint at a future where images (and other modalities) could carry context more efficiently, and even Elon Musk chimed in on this thought. So maybe it’s only a matter of time before LLM inputs expand far beyond text.
Related material:
DeepSeek’s Ascent Reshapes China’s Burgeoning AI App Landscape
DeepSeek’s Open Source Week: Sharing the Future of AI Efficiency
DeepSeek V3 puts China AI on the global map: consumer use and capital expenditure implications
Everybody Losing Sleep Over DeepSeek: Industry Implications to LLMs and AI Infrastructure
The Jevons Paradox in AI Infrastructure: DeepSeek Efficiency Breakthroughs to Drive Energy Demand





