Own the Videos, Own the Models: The Hidden Logic Behind AI's Multimodal Race

I recently stumbled onto an interesting pattern:

China’s two short-video giants — Douyin (ByteDance) and Kuaishou — have each produced world-class video generation models. Meanwhile, the most competitive image and video models outside China — Nano Banana and Veo — come from Google, which owns YouTube.

There’s a formula hiding here: companies that own video platforms built the best video models.

But if that formula holds, Meta — owner of Instagram — should be dominating multimodal generation. It isn’t. And MiniMax, which owns no video platform at all, somehow built Hailuo into a globally competitive video model.

So is the formula right? Let’s break it down.

Platform → Data → Model

ByteDance & Kuaishou: The Short-Video AI Dividend

It was almost inevitable that ByteDance and Kuaishou would build great video models.

Douyin sees hundreds of millions of short videos uploaded daily. That’s near-infinite training data with built-in quality labels — user likes, completion rates, and shares are natural annotations. Which frames look good, which transitions are smooth, which content hooks people — the platform knows this better than anyone.
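To make the "natural annotations" idea concrete, here's a toy sketch — with made-up field names, weights, and thresholds, not any platform's real schema — of how engagement signals could be turned into weak quality labels for curating a video training set:

```python
# Toy illustration: turning engagement signals into weak quality labels
# for curating video training data. Field names, weights, and the
# threshold are invented for this example.
from dataclasses import dataclass

@dataclass
class ClipStats:
    clip_id: str
    completion_rate: float   # fraction of viewers who watched to the end
    like_rate: float         # likes per view
    share_rate: float        # shares per view

def quality_score(c: ClipStats) -> float:
    # Weighted blend of engagement signals; weights are arbitrary here.
    return 0.5 * c.completion_rate + 0.3 * c.like_rate + 0.2 * c.share_rate

def curate(clips: list[ClipStats], threshold: float = 0.4) -> list[str]:
    # Keep only clips whose engagement suggests they're worth training on.
    return [c.clip_id for c in clips if quality_score(c) >= threshold]

if __name__ == "__main__":
    sample = [
        ClipStats("a", completion_rate=0.82, like_rate=0.10, share_rate=0.03),
        ClipStats("b", completion_rate=0.25, like_rate=0.01, share_rate=0.00),
    ]
    print(curate(sample))  # ['a']
```

Real curation pipelines are far richer than this, of course — but the principle is the point: the platform's engagement data doubles as free annotation.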

ByteDance's Seedance and Kuaishou's Kling being world-class is fundamentally a data-moat story.

Google: The Giant That Finally Woke Up

Google’s situation is similar but distinct. YouTube is the world’s largest video library — over 500 hours uploaded every minute. Crucially, YouTube videos come with subtitles, tags, comments, and timestamps — precious multimodal alignment data.

Google was mocked for years as having “started early but finished late.” But the problem was never capability — it was organization. Google Brain and DeepMind spent years in internal competition, with research and products hopelessly disconnected. Now that the two have been consolidated into Google DeepMind, Gemini 3, Nano Banana, and Veo prove that when Google gets serious, decades of accumulated data finally translate into model superiority.

Always criticized, always delivering surprises — Google’s comeback has been genuinely impressive.

Has Platform, Didn’t Build It

Meta: Not Can’t, Won’t

Meta owns Instagram and Facebook — arguably the world’s largest social image and video datasets. By the formula, Meta should be the multimodal king.

But Meta’s AI strategy has always run on two tracks: open-source language models (Llama) + ad recommendation systems.

Why no generative models? Because Meta’s business model doesn’t need them. Meta needs to understand content to sell ads, not generate content. The recommendation algorithm is Meta’s real battleground. Llama is a defensive investment to avoid being locked out by OpenAI and Google.

As for Llama itself, calling it irrelevant is too harsh. Llama 4’s Scout and Maverick remain competitive — it just no longer has a commanding lead. When everyone’s building open-source LLMs, first-mover advantage naturally erodes.

No Platform, Built It Anyway

MiniMax / Hailuo: The Most Interesting Case

MiniMax’s Hailuo video model is genuinely competitive internationally, despite the company owning no video platform. How?

Likely through a combination of approaches (sketched loosely right after this list):

  • Public datasets: WebVid, Panda-70M, and other academic datasets provide baseline training material
  • Public video scraping: YouTube, Vimeo, and other platforms’ public content (a copyright gray area)
  • Synthetic data and augmentation: Stretching limited high-quality data further — the standard weapon of data-poor players
  • Licensed procurement: Buying data from rights holders — expensive but compliant
  • Architecture innovation: MiniMax’s core technical team came out of SenseTime, with deep expertise in video understanding. Better model architectures and training techniques can partially compensate for a data disadvantage
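Here's a loose sketch of what that mix might look like in practice. The source names echo the list above, but the sampling weights and the sampler itself are purely illustrative assumptions, not MiniMax's actual pipeline:

```python
# Toy sketch: mixing heterogeneous video data sources with sampling
# weights -- the kind of curation a platform-less lab might lean on.
# The weights below are invented for illustration only.
import random

# Hypothetical relative sampling weights per source.
SOURCE_WEIGHTS = {
    "public_datasets": 0.35,   # e.g. WebVid- or Panda-70M-style corpora
    "licensed": 0.25,          # purchased, rights-cleared footage
    "web_scraped": 0.20,       # public web video (the gray area)
    "synthetic": 0.20,         # augmented or re-rendered clips
}

def sample_source(rng: random.Random) -> str:
    # Draw a data source proportionally to its weight for each example.
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_source(rng) for _ in range(10)])
```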

But long-term, companies without proprietary data sources face an uphill battle. As model architectures converge, data quality and scale will reassert themselves as the decisive factors.

OpenAI: Big Ideas, No Videos

OpenAI is another fascinating case. Their moat was never data — it was first-mover advantage, brand, and talent density.

Early GPT was built on audacity — scraping internet data at scale while others hesitated over copyright. But that advantage has been commoditized. In the video domain, Sora was delayed repeatedly and underwhelmed, precisely because lacking a proprietary video platform is a real handicap.

Can OpenAI win in the long run? Honestly, hard to say. Their text-domain lead is eroding, and they have no data advantage in video. But AI competition is full of wildcards — a single technical breakthrough can reshape the landscape overnight.

So, Does the Formula Hold?

Mostly yes, but with important caveats.

Owning a video platform = massive high-quality training data = natural advantage in video models. ByteDance, Kuaishou, and Google all validate this.

But the formula needs two corrections:

  1. Having data doesn’t mean using it well. Meta has the data but strategically deprioritized generative models. Google had the data but lost years to organizational dysfunction. Data is necessary, not sufficient.

  2. Lacking data doesn’t mean you can’t compete. MiniMax proved that architectural innovation combined with public and synthetic data can produce competitive models. The question is long-term sustainability.

The likely competitive landscape shakes out like this:

Platform giants (ByteDance, Google) maintain long-term advantages through a continuous data supply. Platform-less innovators (MiniMax, OpenAI) can win particular windows through technical breakthroughs, but must keep running to hold their position.

And players like Meta — with resources but different priorities — could pivot back at any time. The data is still there. They just chose a different race.


This article grew out of a conversation about AI video model competition. Views are my own.

If you found this helpful, consider buying me a coffee to support more content like this.
