Yann and Me: Different Orbits, Same Planet
From World Models to Story Worlds
Ok, I’ll admit it. I’m a bit of a Yann LeCun fanboy. I mean, we have so much in common. Yann and I both went to school in France (him all the way through a PhD in computer science from Université Pierre et Marie Curie, me for 3 months improving my French when I was 12), both attended NYU (me as a film student, Yann as Silver Professor of Computer Science and Neural Science). We both worked in R&D at US telcos (me on an abortive attempt to launch interactive TV, Yann in groundbreaking Bell Labs work developing convolutional neural networks that revolutionized computer vision and became so effective at reading handwritten digits that NCR deployed it in bank check-reading machines). And of course, we both worked for Meta - me as a creative grunt in the AI content tech mines, Yann as Chief Scientist. It doesn’t take an AI to see the pattern that’s emerging here.
A couple of weeks after my final day at Meta, Yann LeCun announced that he too was leaving the company. Coincidence? Well, yes.
But what’s most interesting about Yann’s departure is what he plans to do next. While Meta will continue to focus on Large Language Models (LLMs), Yann is founding a new start-up to build world models - AI systems that develop internal understanding of environments to simulate cause-and-effect scenarios. Another AI luminary, Fei-Fei Li, recently released Marble, a 3D world model, through her company World Labs. Both scientists believe that contextual intelligence will come from training AI in dimensional space, ultimately leading to real-world understanding, not simply pattern prediction.
When Yann looks at the AI landscape, he sees the fundamental architecture required for a path toward scientific breakthroughs like artificial general intelligence. When I look at the AI landscape, I see ways to make immersive cartoons and experience novel storyworlds. But we’re both looking at the same systems, just through very different lenses. Same planet, completely different orbits.
What Orbit Are You In?
In The Context Maker, I’ve talked about “Horses for Courses” - casting the right model for the task at hand, since each has its own strengths and weaknesses. But I find that most people I talk to still see “AI” as a monolith. The tech industry bombards us with new upgrades, tools, and releases while presenting its AI as a single Swiss Army knife of a brand. “Just ask ChatGPT.” “Claude can do that.” As if these were singular entities rather than orchestrated systems.
But in reality, many interactions with ChatGPT or Claude rely on multiple models, coordinated by an overarching LLM interface. When you ask ChatGPT to search the web, you’re not talking to the same model that writes your emails. When you ask Claude to analyze an image, you’re triggering a completely different computational process than the one generating your story outline.
This matters because different AI architectures have different native properties - different artifacts, different strengths, different creative possibilities. You can’t lean into the weird if you don’t know which weird you’re working with.
So let’s demystify. Not with a technical taxonomy that requires a computer science degree (the ‘Yann’ version), but with a practical field guide for creators trying to navigate this landscape (the ‘Japhet’ version).
Part 1: LLMs and Diffusion Models - The Current Generation
For the past several years, the AI revolution has been powered by two types of models. Large Language Models - for example, ChatGPT, Claude, Gemini - recognize and generate patterns in language. Diffusion models - Midjourney, DALL-E, Stable Diffusion, Runway, Veo and more - do the same for images and video.
Both are extraordinarily good at what they do. But here’s their fundamental limitation: they’re pattern matchers, not understanders.
For example, an LLM has been trained on millions of descriptions of kitchens. It knows that stoves are usually near counters, that recipes involve heating, that steam rises from boiling water. It can write perfectly plausible text about cooking. But it doesn’t understand heat, or causality, or physical space in the way you do.
As Fei-Fei Li puts it: “Today, leading AI technology such as large language models have begun to transform how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark; eloquent but inexperienced, knowledgeable but ungrounded.”
When Yann looks at LLMs, he sees their fundamental limitations as pattern matchers - systems that can never achieve true intelligence because they lack grounding in physical reality. When I look at LLMs, I see tools that aren’t creative in and of themselves, but that can generate creative outputs if I use them creatively - leaning into what they don’t know as much as what they do know to invent narratives we haven’t seen before. LLMs let other people talk to my constructs. Same technology, different concerns.
What LLMs actually do:
The breakthrough isn’t that LLMs are “smart” - it’s that they let humans and machines actually communicate. Natural language as the universal API. You don’t need to learn code, navigate complex menus, or master arcane commands. You just talk. Ask. Describe what you want.
This is why ChatGPT, when it launched 3 years ago, changed our relationship with machine learning forever. Not because LLMs are “smarter” than previous AI (though they’re more capable), but because they made AI accessible to anyone who can type or speak.
LLMs as orchestration layer:
Here’s what most people don’t realize: when you prompt ChatGPT or Claude, you’re triggering an orchestrated system of multiple models working together.
Ask it to search the web? It routes your query to a search model. Ask for an image? It hands off to a diffusion model like DALL-E or Nano Banana. Ask it to write or execute code? It’s running that in a separate environment and interpreting results. Say something that triggers safety concerns? There are classifier models running in parallel, evaluating your prompt and the response.
The LLM is the conductor, not the whole orchestra. It’s deciding which specialized model handles which task - search, image generation, code execution, or just conversation.
This convergence is visible in models like Nano Banana Pro, built on Gemini 3 Pro, which combines language understanding with image generation and can use Google Search to ground outputs in real information, creating contextually accurate infographics and diagrams.
Understanding this orchestration helps you work with the system rather than against it. You’re not just prompting one AI - you’re prompting an ecology of specialized models coordinated by a language interface.
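To make the conductor metaphor concrete, here’s a minimal sketch of what that kind of routing could look like. The handlers and keyword matching below are hypothetical stand-ins of my own invention, not how ChatGPT or Claude actually wire things up - real systems use models, not keywords, to decide the route.

```python
# A toy orchestrator: one "conductor" routes each request to a specialized
# handler. The handlers are hypothetical stand-ins for real subsystems
# (search, image generation, safety classification), not any vendor's API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Response:
    handler: str
    output: str

def search_tool(prompt: str) -> Response:
    return Response("search", f"[web results relevant to: {prompt}]")

def image_tool(prompt: str) -> Response:
    return Response("image", f"[generated image for: {prompt}]")

def chat_tool(prompt: str) -> Response:
    return Response("chat", f"[conversational reply to: {prompt}]")

def safety_check(prompt: str) -> bool:
    # A real system runs classifier models in parallel; here we just block a keyword.
    return "forbidden" not in prompt.lower()

# Crude intent detection; a production system would use a model for this step too.
ROUTES: list[tuple[str, Callable[[str], Response]]] = [
    ("search", search_tool),
    ("draw", image_tool),
]

def orchestrate(prompt: str) -> Response:
    if not safety_check(prompt):
        return Response("safety", "Request declined.")
    for keyword, handler in ROUTES:
        if keyword in prompt.lower():
            return handler(prompt)
    return chat_tool(prompt)  # default: plain conversation

if __name__ == "__main__":
    for p in ["search the latest world model papers",
              "draw a teapot orbiting a planet",
              "help me outline episode three"]:
        print(orchestrate(p))
```

Swap real search and image models in for the stand-ins and you have the skeleton of that ecology of specialized models coordinated by a language interface.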
But understanding what these systems are is only half the picture — the other half is how you choose to use them.
Part 2: How You Use It Matters As Much As What It Is
When Yann thinks about AI architectures, he’s focused on how they process information and whether they can achieve genuine reasoning. When I think about AI architectures, I care about something much more practical: is my audience experiencing a finished artifact I’ve created, or are they interacting directly with the AI system?
This distinction cuts across all types of AI - LLMs, diffusion models, world models - and it fundamentally changes how you work with them.
And this difference isn’t subtle. It’s the fork in the road that determines whether you’re crafting a fixed artifact or designing a living system.
Think of it as the difference between using AI behind the scenes versus letting it perform live in front of an audience.
AI as Production Tool:
You use AI during creation, but your audience consumes a fixed, finished artifact. Using Midjourney to generate concept art that becomes part of a film. Using ChatGPT to draft a script that you then refine and produce. Using AI video tools to create footage that gets edited into final content. Using Claude to ideate characters for a TV show.
The AI’s role ends before your audience ever sees the work. They’re experiencing your curation and contextualization of what the AI helped you make.
AI as Real-Time Engine:
Your audience directly interacts with the AI system. Every person gets a different experience. NPCs in games responding dynamically to player choices. My Quantum Teapot storyworld that adapts to each visitor. Voice AI characters having unique conversations with each user. Interactive narratives that generate content in response to user input.
The AI is performing live, in the moment, for each individual user.
This distinction matters more than most people realize.
When you’re using AI as a production tool, you control quality through curation. You can iterate until it’s right. Your taste and judgment shape the final artifact. The audience trusts you as curator.
When AI is the real-time engine, it performs without your supervision. Every interaction is unique and unrepeatable. You’re designing systems and boundaries, not crafting specific outputs. The audience experiences the AI’s native behavior directly.
Part 3: World Models - Beyond Pattern Matching
And this is exactly where the conversation is turning among people building the next generation of systems.
While LLMs and diffusion models have dominated the AI conversation for the past few years, a fundamental shift is happening. The smartest minds in AI are betting that the next breakthrough won’t come from bigger language models - it will come from AI that actually understands how the world works.
Yann is leaving Meta to build world models - AI systems that develop internal understanding of environments to simulate cause-and-effect scenarios. Li’s World Labs just launched Marble with $230 million in funding, creating what she calls “spatial intelligence” - the ability for AI to perceive, model, reason about, and take actions within physical or geometric space. Google DeepMind has released Genie 3, generating interactive 3D environments at 720p and 24fps that you can navigate in real-time.
When Yann looks at world models, he sees the path to AGI - systems that can finally understand causality and physical reality, not just recognize patterns in text. For researchers like Yann, world models aren’t valuable because they produce rich 3D spaces — they’re valuable because they represent the missing cognitive machinery needed for actual intelligence: grounding, prediction, and causal reasoning.
When I look at world models, I see a way to build immersive environments and interactive experiences without spending years learning Unreal Engine. Once again, same technology, very different motivations.
What makes world models different?
Current LLMs are extraordinary pattern matchers. As we’ve discussed, they can describe kitchens and what might go where. They can also write enticing text about the food you might cook in that kitchen (though of course they’ll have no idea what the dish from a recipe might actually taste like).
But they don’t understand heat. They don’t grasp causality. They can recite facts about what happens when you remove a hot pan from a stove, or explain that water boils at different temperatures at different altitudes. But they can’t feel the weight of the pan or the heat of the steam.
Li describes this beautifully: “Spatial intelligence plays a fundamental role in defining how we interact with the physical world. Every day, we rely on it for the most ordinary acts: parking a car by imagining the narrowing gap between bumper and curb, catching a set of keys tossed across the room, navigating a crowded sidewalk without collision.”
World models are different. They build internal representations of how things actually work - causality, physics, spatial relationships, consequences.
It’s the gap between noticing what often happens and understanding why it happens.
A pattern matcher like an LLM might say: “In my training data, when people describe moving objects, they often mention speed and direction and collision.”
A world model understands: “This object has mass and velocity. If it collides with that wall at this angle and speed, it will bounce back with this trajectory and reduced energy. The wall may crack depending on material properties.”
One is statistical correlation. The other is causal understanding.
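To see that difference in miniature, here’s a toy sketch - my own illustration, not anyone’s actual world model - of prediction by causal rules: a tiny state rolled forward by explicit physics (position, velocity, a wall, an energy loss on impact) rather than retrieved from what usually co-occurs in text.

```python
# A toy "world model" step: the state evolves by explicit causal rules
# (velocity moves position, the wall reflects velocity and bleeds off energy),
# not by looking up what usually co-occurs in descriptions of moving objects.

from dataclasses import dataclass

@dataclass
class Ball:
    x: float      # position (m)
    vx: float     # velocity (m/s)
    mass: float   # kg

WALL_X = 10.0         # position of the wall (m)
RESTITUTION = 0.8     # fraction of speed retained after a bounce

def step(ball: Ball, dt: float) -> Ball:
    """Advance the world by dt seconds, applying the collision rule."""
    x = ball.x + ball.vx * dt
    vx = ball.vx
    if x >= WALL_X:                  # collision: reflect and lose energy
        x = WALL_X - (x - WALL_X)
        vx = -vx * RESTITUTION
    return Ball(x=x, vx=vx, mass=ball.mass)

if __name__ == "__main__":
    ball = Ball(x=0.0, vx=3.0, mass=0.5)
    for t in range(6):
        print(f"t={t}s  x={ball.x:5.2f}m  vx={ball.vx:5.2f}m/s")
        ball = step(ball, dt=1.0)
```

The point isn’t the physics, which here is hand-coded and crude; it’s that the next state is computed from the current one. A world model learns rules like these from experience rather than having them written in, but the shape of the computation - state in, predicted consequence out - is the same.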
Yann has been making this point forcefully. LLMs are trained on text - an amount that would take a human maybe 450,000 years to read. But a four-year-old child who’s been awake for 16,000 hours has processed vastly more sensory data about how the world actually works. The implication is stark: you can’t achieve human-level intelligence through text alone.
How world models fit the production/real-time framework:
Just like LLMs and diffusion models, world models serve both purposes - and understanding which mode you’re working in matters just as much as understanding the underlying architecture.
World Models as Production Tool:
Marble lets you turn text, images, or video into persistent, downloadable 3D environments for game development, VFX and virtual production, architectural visualization, training simulations. As Li explains: “World Labs’ Marble platform will be putting unprecedented spatial capabilities and editorial controllability in the hands of filmmakers, game designers, architects, and storytellers of all kinds, allowing them to rapidly create and iterate on fully explorable 3D worlds.”
You generate the world, refine it, export it, use it in your project. The AI’s role ends before your audience experiences it.
World Models as Real-Time Engine:
Genie 3 generates dynamic worlds you can navigate in real-time, with the environment responding to your actions and maintaining consistency for several minutes. Interactive gaming experiences generated dynamically. Educational simulations where students explore generated environments. Virtual tourism through historical or fantastical spaces. Immersive storytelling where audiences shape the world.
Every user gets a unique experience. The AI performs live, in the moment.
Yann sees the opportunity to train AI agents in virtual spaces with physical laws - teaching them spatial reasoning before unleashing them in the real world. (Yes, the androids are coming, folks. But at least they’ll know not to walk into us.) I see the opportunity to raise talking plants in your own virtual garden that attract miniature flying ponies.
But the implications of world models extend far beyond the AGI debate.
Why this matters - even if you’re not chasing AGI:
Yann believes intelligence won’t emerge from scale alone - from bigger LLMs with more parameters - but from architecture and autonomy, from systems that understand cause and effect rather than just recognizing patterns. Li writes that “spatial intelligence represents the frontier beyond language—the capability that links imagination, perception and action, and opens possibilities for machines to truly enhance human life.”
They’re betting on world models because they believe you can’t achieve AGI through language alone. You need systems that understand how the world actually works.
I’m betting on world models because they’ll let me create interactive narrative experiences and immersive environments that would be impossible or prohibitively expensive with traditional tools.
But LLMs will likely remain the interface. You’ll still use natural language to communicate with these world systems. Just as Nano Banana Pro combines Gemini 3’s language understanding with image generation and Google Search to create contextually accurate outputs, future systems will orchestrate language models with world models.
You’ll describe what you want in words. Behind the scenes, world models will generate the spatially coherent, physically plausible, causally consistent results.
Every new model is a kind of Rorschach test, and what you see in it depends on the angle of your orbit.
Yann’s looking at this evolution from his lofty orbit and seeing the path to machines that genuinely think. I’m looking at it from mine and seeing new ways to tell stories. And the truth is, neither of us — from our respective orbits — can predict with any certainty where this all leads.
That’s one more thing Yann and I have in common.
What do you see from your orbit?


