Fei-Fei Li on the next step beyond LLMs: AI must possess "spatial intelligence" to understand the real world. How does Marble achieve this?
Fei-Fei Li, founder of World Labs and known as the "Godmother of AI," recently sat down for an interview to discuss why AI must go beyond language and develop "spatial intelligence" so that machines can truly understand and construct the 3D physical world.

At a time when large language models are sweeping the world, Fei-Fei Li, a professor at Stanford University, has set her sights on the next frontier of artificial intelligence: spatial intelligence. After leaving Google Cloud, she founded the high-profile startup World Labs and launched its first world model product, Marble. In this in-depth interview on Eye on AI, she explains why AI must not only understand words but also be able to "see," "perceive," and "build" the 3D world.

The interview covers several core topics:
Beyond language: why human knowledge cannot be fully captured in words, and why AI needs multimodal learning.
Technical deep dive: how World Labs' RTFM model produces geometrically consistent 3D worlds on a single GPU.
Academic perspective: how Fei-Fei Li's methodology compares with the world model ideas of Yann LeCun, Meta's chief AI scientist.
Future outlook: when AI will truly understand the laws of physics and even show the creativity of scientific inquiry.

The following is a full translation of the conversation.

Moderator: I don't want to spend too much time on Marble, your new model that generates a consistent, persistent 3D world and lets the viewer move through it, although it really is great. I want to dig into why you focus on "world models" and "spatial intelligence." Why is it necessary to go beyond learning from language? And how does your approach differ from Yann LeCun's? First, can you talk about whether the world model grew out of your research on ambient intelligence, or is it a parallel research track?

Fei-Fei Li: The spatial intelligence work I have been thinking about for the past few years is really a continuation of my entire career, which has focused on computer vision and visual intelligence. I emphasize "spatial" because our technology has advanced to the point where its complexity and depth are no longer limited to looking at pictures or understanding simple videos. It is depth-aware, spatial, and connected to robotics, embodied AI, and ambient AI. So from that point of view, it is really a continuation of my career in computer vision and AI.

Moderator: I have also talked about the importance of spatial intelligence on this podcast for a while. Language models learn from human knowledge encoded in words, but that is only a fraction of human knowledge. As you and many others have pointed out, humans often learn by interacting with the world, without language. So this matters: while current LLMs are amazing, if we want to go beyond them, we need models that experience the world more directly and learn from it directly. Your approach, with Marble as an example, is to take the internal representations the model has learned and use them to create an external visual reality.
LeCun's approach, on the other hand, builds internal representations from direct experience or video input, allowing the model to learn things like the physics of motion. Is there a parallel between the two? Are the approaches complementary, or do they overlap?

Fei-Fei Li: First of all, I don't actually see myself as pitted against Yann, because I think we are both on the academic spectrum that leads to spatial intelligence and world models. You may have read my recent long essay, a manifesto for spatial intelligence, in which I made this clear. I actually think that if we are eventually aiming for a universal, fully capable world model, we may need both implicit representations and, ultimately, some degree of explicit representation, especially at the output level. Each plays a different role. For example, World Labs' current world model, Marble, does explicitly output 3D representations, but inside the model there are implicit representations alongside the explicit output. Honestly, I think we ultimately need both.

As for input modalities, yes, learning from video is very important. Video is an input made up of a huge number of consecutive frames, but for an agent, or simply for an animal, the world is not just a passive view. It also includes movement, interaction, touch, sound, smell, and embodied experiences such as physical force and temperature. So I think it is deeply multimodal. Of course, Marble as a model is only the first step, but in the technical article we published a few days ago, we made clear that we believe multimodality is both a learning paradigm and an input paradigm. There has been a lot of academic discussion about this, which also reflects the early excitement in the field. So I would not say we have fully worked out the exact model architecture and representation.

Moderator: In your world model, is the input mostly video, from which the model then builds an internal representation of the world?

Fei-Fei Li: Not exactly. If you have tried our world model, Marble, its input is actually very multimodal. You can use plain text, single or multiple images, video, or a rough 3D layout such as boxes or voxels. So it is multimodal, and we will continue to deepen that as the model evolves.

Moderator: Beyond being a great product with many applications, is your ambition to build a system, as I said with video as input, that learns from direct experience? Is it learning through video or other modalities, rather than through a secondary medium like text?

Fei-Fei Li: Yes, I think a world model is about learning about the world, and the world is very multimodal. Whether it's a machine or an animal, we are multisensory. Learning takes place through perception, and perception has different modalities. Words are one of those forms. That is also part of what sets us apart from animals, because most animals don't learn through complex language, but humans do. Today's AI world models learn from large amounts of language input as well as other modalities, but they are not limited to language as the only channel.

Moderator: One limitation of LLMs is that the model's parameters are fixed after training; the model does not keep learning. There is some degree of learning at test time, during inference, but is continual learning something you are trying to solve in your world model? After all, a world model should presumably be able to keep learning when it encounters a new environment.

Fei-Fei Li: Yes…