LeCun's Latest Interview: Why the Physical World Will Ultimately Be the "Achilles' Heel" of LLMs
In the field of artificial intelligence, few scholars are as active on social media at the age of 65 as Yann LeCun.
Yann LeCun is known for his outspoken criticism within the AI field. He is an active champion of the open-source spirit; the team he leads at Meta released the popular Llama 2 model, making Meta a leader in open-source large models. While many people are anxious about the future of AI and worry about possible doomsday scenarios, LeCun takes the opposite view: he firmly believes that the development of AI, and in particular the eventual arrival of superintelligence, will have a positive impact on society.
Recently, LeCun returned to Lex Fridman's podcast for a nearly three-hour conversation about the importance of open source, the limitations of LLMs, and why he believes the currently popular path to AGI is mistaken.
Watch the episode: https://youtu.be/5t1vTLU7s40?feature=shared
Here are some of the key takeaways from the podcast:
Lex Fridman: You have said that autoregressive LLMs are not the way we will make progress toward superhuman intelligence. Why can't they take us all the way there?
Yann LeCun: For many reasons. First, intelligent behavior has many characteristics: for example, the ability to understand the world, to understand the physical world, to remember and retrieve things, to have persistent memory, to reason, and to plan. These are four essential properties of intelligent systems or entities, of humans and animals. LLMs cannot do these things, or can only do them in a very primitive way; they do not really understand the physical world. They have no persistent memory, cannot truly reason, and certainly cannot plan. So if you expect a system to be intelligent even though it cannot do these things, that is a mistake. This is not to say that autoregressive LLMs are useless — they are certainly useful, they are interesting, and we can build a whole ecosystem of applications around them. But as a path toward human-level intelligence, they are missing essential components.
We take in far more information through sensory input than through language. Despite our intuitions, most of what we learn and know comes from observing and interacting with the real world, not through language. Everything we learn in the first few years of life, and certainly everything animals learn, has nothing to do with language.
Lex Fridman: So you're saying that LLMs lack an understanding of the physical world? Intuitive physics, common-sense reasoning about physical space and physical reality — to you, is that a giant leap that LLMs simply cannot make?
Yann LeCun: The LLMs we use today cannot do this, for several reasons, but the main one is the way they are trained: you take a piece of text, remove some of the words, mask them by replacing them with blank tokens, and train a giant neural network to predict the missing words. If you build the neural network in a particular way, so that it can only look at the words to the left of the one it is trying to predict, then what you have is a system that is essentially trying to predict the next word in a text. You can then give it a text, a prompt, and ask it to predict the next word. It can never predict the next word exactly.
What it can do is produce a probability distribution over all the possible words in the dictionary. In fact, it does not predict words at all; it predicts tokens, which are sub-word units. The prediction uncertainty is easy to handle because there is only a finite number of possible tokens in the dictionary, and you just compute a distribution over them. The system then picks a word from this distribution; of course, words with higher probability in that distribution are more likely to be picked. So you sample from that distribution, actually produce a word, and then shift that word into the input, so the system can predict the second word.
This is called autoregressive prediction, which is why these LLMs should really be called "autoregressive LLMs," though here we just call them LLMs. And this process is very different from what we do before we produce words.
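To make the mechanism concrete, here is a minimal sketch of autoregressive generation. The vocabulary size, the prompt, and the toy scoring function standing in for a trained network are all invented for illustration; the point is only the loop of predicting a distribution over tokens, sampling one, and shifting it into the input.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50          # toy vocabulary of sub-word tokens (illustrative)
CONTEXT = [3, 17, 42]    # the prompt, already tokenized (illustrative)

def toy_next_token_logits(context):
    """Stand-in for a trained autoregressive model: one unnormalized
    score per vocabulary token, conditioned on the context."""
    local = np.random.default_rng(sum(context))  # deterministic toy scores
    return local.normal(size=VOCAB_SIZE)

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# Autoregressive loop: predict a distribution, sample a token,
# append it to the input, and repeat.
tokens = list(CONTEXT)
for _ in range(10):
    probs = softmax(toy_next_token_logits(tokens))
    next_token = rng.choice(VOCAB_SIZE, p=probs)  # higher-probability tokens are picked more often
    tokens.append(int(next_token))

print(tokens)
```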
When you and I talk, and we are both bilingual, we think about what we are going to say, and that is relatively independent of the language we are going to say it in. When we talk about a mathematical concept, the thinking we do and the answer we intend to give have nothing to do with whether we express it in French, Russian, or English.
Lex Fridman: Chomsky rolled his eyes, but I get it, so you're saying there's a larger abstraction that exists before language and maps to it?
Yann LeCun: For a lot of the thinking we do, yes.
Lex Fridman: Is your humor abstract? When you tweet, and your tweets are sometimes a little spicy, do you have an abstract representation in your brain before the tweet is mapped to English?
Yann LeCun: There really is an abstract representation — imagining the reader's reaction to the text. But thinking about a mathematical concept, or imagining what you want to build out of wood, or something like that, has absolutely nothing to do with language. You're not having an inner monologue in a specific language; you are imagining a mental model of the thing. I mean, if I ask you to imagine what this water bottle would look like if I rotated it 90 degrees, that has nothing to do with language. It's clear that most of our thinking happens at a more abstract level of representation. If the output is language, we plan what we are going to say: rather than outputting muscle movements directly, we plan the answer before we produce it.
An LLM doesn't do that; it just produces one word after another, instinctively. It's a bit like a subconscious reaction: someone asks you a question and you answer without thinking, because the question is simple and you don't need time to think about the answer. You don't need to pay attention; you react automatically. That's what an LLM does. It doesn't really think about its answers. Because it has accumulated a lot of knowledge it can retrieve things, but it just spits out one token after another without planning the answer.
Lex Fridman: Generating token by token sounds necessarily simplistic, but if the world model is sophisticated enough, the most likely sequence of tokens it generates could be something quite profound.
Yann LeCun: But that rests on the assumption that these systems actually have an internal model of the world.
Lex Fridman: So the real question is: can we build a model that has a deep understanding of the world?
Yann LeCun: Can you build it from prediction? The answer is probably yes. But can you build it by predicting words? The answer is most likely no, because language is very poor — it is weak, low-bandwidth, and does not carry enough information. So building a model of the world means looking at the world and understanding why the world evolves the way it does; and then an additional component of a world model is being able to predict how the world will evolve as a consequence of actions you might take.
So a true world model is: here is my idea of the state of the world at time T, and here is an action I might take; what is the predicted state of the world at time T+1? That state of the world does not need to represent everything about the world; it just has to capture enough information that is relevant for planning the action, not necessarily all the details.
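In code, the world model described here is just a function from the current abstract state and a candidate action to a predicted next state. A toy sketch, with made-up dimensions and randomly initialized parameters standing in for a learned model:

```python
import numpy as np

def world_model(state_t, action_t, W_s, W_a):
    """Toy world model: predict the (abstract) state at time T+1
    from the state at time T and the action taken.
    W_s and W_a are hypothetical learned parameters."""
    return np.tanh(W_s @ state_t + W_a @ action_t)

state_dim, action_dim = 8, 2                     # illustrative sizes
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
W_a = rng.normal(scale=0.1, size=(state_dim, action_dim))

s_t = rng.normal(size=state_dim)                 # abstract state of the world at time T
a_t = np.array([0.5, -0.1])                      # the action we consider taking
s_t1 = world_model(s_t, a_t, W_s, W_a)           # predicted abstract state at time T+1
print(s_t1)
```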
Now, here comes the problem: generative models cannot do this. So a generative model would need to be trained on video, and we have been trying to do that for ten years: you take a video, show the system a piece of it, and ask it to predict the remainder of the video — basically to predict what is going to happen next.
You can build large video models if you want. The idea has been around for a long time; at FAIR, my colleagues and I have been trying to do it for ten years. But you cannot really play the same trick as with an LLM because, as I said, you cannot exactly predict which word will follow a sequence of words, but you can predict a distribution over words. Now, if you go to video, what you would have to do is predict a distribution over all possible frames of the video, and we do not know how to do that properly.
We do not know how to represent distributions over high-dimensional continuous spaces in a useful way. That is the main problem, and it exists because the world is vastly more complex and information-rich than words: text is discrete, while video is high-dimensional, continuous, and full of detail. So if I take a video of this room with the camera panning around, I simply cannot predict everything that will be in the room as it pans, and neither can the system. Maybe it can predict that this is a room, that there is a light, a wall, that kind of thing. It cannot predict what a painting on the wall looks like, or the texture of the sofa, and certainly not the texture of the carpet. There is no way to predict all those details.
So one possible way to handle this, which we have been studying, is to build a model with so-called latent variables. The latent variable is fed into the neural network and is supposed to represent all the information about the world that you have not yet perceived — the information you need to boost the system's predictive power so that it can predict pixels well, including the fine texture of the carpet, the sofa, and the paintings on the wall.
We have tried plain neural networks, we have tried GANs, we have tried VAEs, we have tried all kinds of regularized autoencoders. We have also tried using these methods to learn good representations of images or videos that could then be used as input to image classification systems and so on. Basically, all of it failed.
All the systems that try to predict missing parts from a corrupted version of an image or video basically do this: take the image or video, corrupt or transform it in some way, then try to reconstruct the complete video or image from the corrupted version, and hope that the system develops good internal representations of images that can be used for object recognition, segmentation, whatever. That approach has been essentially a complete failure, whereas it works extremely well for text. This is the principle used for LLMs.
Lex Fridman: Where does the failure come from? Is it that it is hard to form a good representation of an image, an embedding that captures all the important information in it? Is it the consistency from image to image to image that forms the video? What would a compilation of all the ways this fails look like?
Yann LeCun: First, I have to tell you what doesn't work, because there are other things that do. What doesn't work is training a system to learn representations of images by training it to reconstruct good images from corrupted ones.
We have a whole family of techniques for this, all variants of denoising autoencoders. Some of my colleagues at FAIR developed one called MAE, the masked autoencoder. It is basically like an LLM in the sense that you train the system by corrupting the input, except here you corrupt an image by removing patches from it, and then train a giant neural network to reconstruct it. The features you get are not good, and you know they are not good because if you take the same architecture but train it supervised, with labeled data, text descriptions of the images and so on, you do get good representations, and the performance on recognition tasks is much better than with this kind of self-supervised pre-training.
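For illustration, here is a minimal masked-reconstruction setup in the spirit of what is described above — a toy sketch with made-up dimensions, not the actual MAE code: a large fraction of patches is hidden, a network reconstructs them, and the loss is computed only on the hidden patches.

```python
import torch
import torch.nn as nn

patch_dim, n_patches, mask_ratio = 16 * 16 * 3, 196, 0.75  # illustrative sizes

encoder = nn.Sequential(nn.Linear(patch_dim, 256), nn.GELU(), nn.Linear(256, 256))
decoder = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, patch_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

patches = torch.rand(8, n_patches, patch_dim)     # a toy batch of patchified images
mask = torch.rand(8, n_patches) < mask_ratio      # True = hidden patch

visible = patches.clone()
visible[mask] = 0.0                               # corrupt: zero out the masked patches

recon = decoder(encoder(visible))                 # reconstruct every patch
loss = ((recon - patches)[mask] ** 2).mean()      # penalize errors on hidden patches only
loss.backward()
opt.step()
print(float(loss))
```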
The architecture is fine, and the architecture of the encoder is fine, but the fact that you train the system to reconstruct images does not lead it to produce good, general-purpose features of images. So what's the alternative? The alternative is joint embedding.
Lex Fridman: What is the fundamental difference between a joint-embedding architecture and an LLM? Can JEPA take us to AGI?
Yann LeCun: First, how does it differ from generative architectures like LLMs? An LLM, or a vision system trained by reconstruction, generates the input: it tries to reproduce the original, uncorrupted, untransformed input, so it has to predict all the pixels, and it takes an enormous amount of resources for the system to actually predict all the pixels and all the detail. In a JEPA, you do not try to predict all the pixels; you only try to predict an abstract representation of the input. That is much easier in many ways. So what a JEPA system has to do during training is extract as much information as possible from the input, but only the information that is relatively easy to predict. There are many things in the world that we cannot predict. For example, if you have a self-driving car driving down the road, there might be trees around, and it might be a windy day, so the leaves are moving in a semi-chaotic, random way that you cannot predict, do not care about, and do not want to predict. You want the encoder to basically eliminate all of those details. It will tell you the leaves are moving, but not exactly what is happening. So when you make predictions in representation space, you do not have to predict every pixel of every leaf. Not only is that much simpler, it also allows the system to essentially learn an abstract representation of the world in which what can be modeled and predicted is retained, and the rest is treated by the encoder as noise and eliminated.
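A minimal sketch of this "predict in representation space" idea, illustrative only and with invented dimensions (a real JEPA also needs a mechanism to prevent representation collapse, which is only hinted at in a comment here):

```python
import torch
import torch.nn as nn

dim_in, dim_rep = 512, 128  # illustrative sizes
encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_rep))
predictor = nn.Sequential(nn.Linear(dim_rep, dim_rep), nn.ReLU(), nn.Linear(dim_rep, dim_rep))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

x = torch.randn(32, dim_in)                    # the full input (toy feature vectors)
x_corrupted = x * (torch.rand_like(x) > 0.5)   # a corrupted / partially masked view

target = encoder(x).detach()                   # abstract representation of the full input
pred = predictor(encoder(x_corrupted))         # predict it from the corrupted view
loss = ((pred - target) ** 2).mean()           # prediction error in representation space,
loss.backward()                                # not in pixel space
opt.step()                                     # (a real system adds an anti-collapse term)
print(float(loss))
```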
So this raises the level of abstraction of the representation. If you think about it, this is something we do all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't use quantum field theory to describe every natural phenomenon — that would be impossible. So we have multiple levels of abstraction to describe what happens in the world, from quantum field theory to atomic theory, molecules, chemistry, materials, all the way up to concrete objects in the real world and so on. We cannot just simulate everything at the lowest level. And that is exactly the idea behind JEPA: learn abstract representations in a self-supervised manner, and learn them hierarchically as well. I think that is an essential component of intelligent systems. In language, we do not have to do this, because language is already abstract to some degree and has already eliminated a lot of unpredictable information, so we can predict words directly without doing joint embedding or raising the level of abstraction.
Lex Fridman: You're talking about language, and we get to be lazy with language because we've been handed its abstract representations for free. Now we have to zoom out and think about intelligent systems in general: we have to deal with physical reality, which is messy. And you really have to make that jump from full, rich, detailed reality to an abstract representation of it that you can reason over, and all of that.
Yann LeCun: That's right. Self-supervised algorithms that learn by prediction, even in representation space, learn more concepts when the input data is more redundant: the more redundant the data, the better they capture its internal structure. In sensory input such as vision there is far more redundant structure than in text. Language may actually carry more information because it has already been compressed — you're right about that — but that also means it is less redundant, so self-supervised learning doesn't work as well on it.
Lex Fridman: Is it possible to combine self-supervised training on visual data with self-supervised training on language data? Even though you talk down those 10^13 tokens, there is a huge amount of knowledge in them. Those 10^13 tokens represent everything we humans have figured out — the crap on Reddit, the contents of all the books and articles, everything the human intellect has ever created.
Yann LeCun: Well, eventually, yes. But I think if we do it too early, we risk being tempted to cheat. And in fact, that is exactly what people are doing now with vision-language models: we are basically cheating, using language as a crutch to help our deficient visual systems learn good representations from images and video.
The problem is that we might improve language models by feeding them images, but we still will not reach even the level of intelligence or understanding of the world that a cat or a dog has — and they have no language. They have no language, yet they understand the world far better than any LLM. They can plan quite complex actions and imagine the consequences of a sequence of actions. How do we get machines to learn that before we combine it with language? Obviously, if we combine it with language we will get something, but before that we have to focus on getting systems to learn how the world works.
In fact, the techniques we use are non-contrastive. So not only is the architecture non-generative, the learning procedures we use are also non-contrastive. We have two families of techniques. One family is based on distillation, and many methods use this principle: DeepMind has one called BYOL, and there are several from FAIR — one called VICReg and one called I-JEPA. To be precise, VICReg is not a distillation method, but I-JEPA and BYOL certainly are. There is another one called DINO, also from FAIR. The idea behind these methods is that you run the complete input, say an image, through an encoder to produce a representation; then you corrupt or transform the input and run it through what is essentially the same encoder, with some minor differences, and you train a predictor.
Sometimes the predictor is very simple, sometimes there is no predictor at all, but you train a predictor to predict the representation of the first, uncorrupted input from that of the corrupted input. And you only train the second branch — only the part of the network that receives the corrupted input. The other network does not need to be trained, but since they share the same weights, when you modify the first one, the second one is modified as well. With various tricks you can keep the system from collapsing in the way I explained earlier, where the system essentially ignores the input. This works very well; the two techniques we developed at FAIR, DINO and I-JEPA, are very effective in this regard.
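Here is a rough sketch of this distillation-style, non-contrastive recipe, in the spirit of BYOL or I-JEPA rather than any actual FAIR implementation: only the branch that sees the corrupted input receives gradients, while the other branch is an exponential moving average of it. All names and sizes are illustrative.

```python
import copy
import torch
import torch.nn as nn

dim_in, dim_rep, ema = 512, 128, 0.99
online = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_rep))
target = copy.deepcopy(online)          # second branch: same weights, never trained directly
predictor = nn.Linear(dim_rep, dim_rep)
opt = torch.optim.Adam(list(online.parameters()) + list(predictor.parameters()), lr=1e-3)

x = torch.randn(32, dim_in)                     # uncorrupted input
x_corrupted = x * (torch.rand_like(x) > 0.5)    # corrupted view

with torch.no_grad():                           # stop-gradient: target branch gets no gradients
    t = target(x)
loss = ((predictor(online(x_corrupted)) - t) ** 2).mean()
loss.backward()
opt.step()

with torch.no_grad():                           # target branch slowly tracks the online branch (EMA)
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(ema).add_(p_o, alpha=1 - ema)
```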
Our latest version is called V-JEPA. It is basically the same idea as I-JEPA, but applied to video: you take an entire video and mask out a chunk of it. What we mask is actually a temporal tube — the same spatial region across every frame of the clip.
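The "temporal tube" can be pictured as choosing a set of spatial patch positions and hiding those same positions in every frame. A tiny illustrative sketch — the shapes and masking ratio are invented, not the actual V-JEPA settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_patches, mask_ratio = 16, 196, 0.5     # illustrative clip of 16 frames, 196 patches each

spatial_mask = rng.random(n_patches) < mask_ratio  # which patch positions to hide
tube_mask = np.broadcast_to(spatial_mask, (n_frames, n_patches))  # same positions in every frame

# Every frame hides exactly the same set of patches, forming a tube through time.
print(tube_mask.shape, bool((tube_mask[0] == tube_mask[-1]).all()))
```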
This is the first system we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you with fairly high accuracy what action is taking place in the video. It is the first time we have obtained something of this quality.
The results seem to indicate that our system can use those representations to tell whether a video is physically possible or completely impossible — because some object disappears, or an object suddenly jumps from one location to another, or changes shape, or something like that.
Lex Fridman: Does this allow us to build a model of the world that understands it well enough to be able to drive a car?
Yann LeCun: It may take a while to get there, but there are already robotic systems based on this idea. What you need is a slightly modified version: imagine you have a complete video and you shift it in time toward the future, so the system sees only the beginning of the video and the second half of the original is occluded. You can then train a JEPA system, or a system like the one I described, to predict the representation of the full video from the part it can see. But you also feed the predictor an action — for example, the steering wheel turns 10 degrees to the right, or something like that, right?
So if this is a car camera and you know the angle of the steering wheel, then to some extent you should be able to predict how what you see will change. Obviously you cannot predict every detail of the objects that appear in the view, but at an abstract level of representation you may be able to predict what will happen. So now you have an internal model that says, "This is my idea of the state of the world at time T, here is the action I am taking, and here is my prediction of the state of the world at T plus 1, T plus delta T, T plus 2 seconds," whatever it is. If you have such a model, you can use it for planning. So now you can do what LLMs cannot do: plan what you are going to do so that you arrive at a particular outcome or satisfy a particular objective.
And you can have many objectives. I can predict that if I hold an object like this and open my hand, it will fall. If I push it along the table with a certain force, it will move; if I push the table itself with the same force, it probably will not move. We carry this internal model of the world in our minds, and it lets us plan sequences of actions to reach particular goals. Now, if you have this kind of world model, you can imagine a sequence of actions, predict the outcome of that sequence, measure how well the final state satisfies a particular objective — such as moving the bottle to the left side of the table — and then plan a sequence of actions that minimizes that objective.
We are not talking about learning here, we are talking about inference time, so this really is planning. In optimal control this is a very classical thing; it is called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you plan a sequence of commands so that, according to your model, the end state of the system satisfies the objective you set. Rocket trajectories have been planned this way since the early days of computing, back in the early 1960s.
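Below is a bare-bones sketch of planning by model predictive control as described above. The dynamics are a toy linear system standing in for a learned world model, and the optimizer is simple random shooting; real systems use gradient-based or more elaborate search. Every name and number is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, horizon, n_candidates = 4, 2, 5, 256

A = 0.9 * np.eye(state_dim)                      # toy linear dynamics standing in for
B = rng.normal(scale=0.3, size=(state_dim, action_dim))  # a learned world model

def world_model(state, action):
    return A @ state + B @ action

def cost(final_state, goal):
    """How far the final state is from the goal (lower is better)."""
    return float(np.sum((final_state - goal) ** 2))

s0 = np.zeros(state_dim)
goal = np.array([1.0, -1.0, 0.5, 0.0])

# Random shooting: sample candidate action sequences, roll the model forward,
# and keep the sequence whose predicted end state best satisfies the objective.
best_cost, best_plan = np.inf, None
for _ in range(n_candidates):
    plan = rng.uniform(-1, 1, size=(horizon, action_dim))
    s = s0
    for a in plan:
        s = world_model(s, a)
    c = cost(s, goal)
    if c < best_cost:
        best_cost, best_plan = c, plan

print("best cost:", best_cost)
print("first action of the best plan:", best_plan[0])
```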
Lex Fridman: So you recommend abandoning generative models in favor of joint-embedding architectures, abandoning probabilistic models in favor of the energy-based models we talked about, and abandoning contrastive methods in favor of regularized methods — and you've been a critic of reinforcement learning for some time. It feels like court testimony.
Yann LeCun: I don't think it should be abandoned completely, but I do think its use should be minimized, because it is very sample-inefficient. The right way to train a system is therefore to first have it learn good representations of the world, and a model of the world, mostly from observation (and maybe a little interaction).
Lex Fridman: Why does RLHF work so well?
Open Source
Yann LeCun: Meta is built around a business model in which you provide a service funded either by advertising or by business customers.
For example, if you have an LLM that can help a pizza shop by talking to its customers through WhatsApp, the customer just orders a pizza and the system asks them, "What toppings do you want, what size," and so on. The merchant pays for that, and that's the model. Otherwise, if it is a more classic service, it can be supported by advertising, or there are several modes. But the point is, if you already have a large enough potential customer base that you need to build the system for them anyway, there is no harm in releasing it as open source.
Lex Fridman: Meta’s bet is: Will we do better?
Yann LeCun: No. We already have a huge user base and customer base. It doesn't hurt to provide open-source systems or foundation models for others to build applications on. If those applications turn out to be useful for our customers, we can buy them directly from those developers. They may also improve the platform — in fact, we've seen this happen. LLaMA 2 has been downloaded millions of times, and thousands of people have come up with ideas for improving the platform. This obviously speeds up the process of making the system usable by a wide range of users, and thousands of businesses are building applications with it. So Meta's ability to generate revenue from this technology is not harmed by distributing the base model as open source.
Llama 3
Lex Fridman: What are you most excited about in LLaMA 3?
Yann LeCun: There will be various versions of LLaMA that improve on the previous ones: bigger, better, multimodal, that sort of thing. And then, in future generations, there will be systems capable of planning, that really understand how the world works — probably trained on video — so they will have some model of the world and may be able to do the kind of reasoning and planning I talked about earlier. How long will that take? When will research in this direction make its way into the product line? I don't know, and I can't tell you. There are basically a few breakthroughs we have to make before we get there, but people can monitor our progress because we publish our research publicly. So last week we published the V-JEPA work, a first step toward a video-trained system, and the next step will be to train a world model on the basis of this kind of video idea. DeepMind has similar work, and UC Berkeley has work on world models and video. Many people are working on this, and I think a lot of good ideas are emerging. My bet is that these systems will be JEPA-like systems rather than generative models, and we will see what happens. More than thirty years ago, when we were working on convolutional networks and early neural networks, I could see a path to human-level intelligence: systems that could understand the world, remember, plan, reason. There are ideas now that can move this forward and that might have a chance of working, and I'm really excited about that. What I like is that we are, in some sense, moving in a good direction, and we may succeed before my brain turns to white sauce or before I need to retire.

Lex Fridman: Is most of your excitement still on the theory side — that is, the software side?

Yann LeCun: I used to be a hardware guy many years ago. Scale is necessary but not sufficient. Even a decade from now, there will probably still be some distance to run. And certainly, the further we get in terms of energy efficiency, the more progress we make in hardware: we have to reduce power consumption. Today a GPU consumes between half a kilowatt and a kilowatt, while a human brain runs on about 25 watts, and a GPU delivers far less computing power than a human brain — you would need something like 100,000 to a million of them to match it, so we are off by a huge factor.

Lex Fridman: You often say that AGI is not coming anytime soon. What is the underlying intuition behind that?

Yann LeCun: The idea, popularized by science fiction and Hollywood, that someone will discover the secret of AGI, or human-level AI, or AMI (whatever you want to call it), switch on a machine, and then we have AGI — that is just not going to happen. It will be a gradual process. Will we have systems that can learn how the world works from video and learn good representations? Yes, but it will take quite some time before we reach the scale and performance we observe in humans — not just a day or two. Will we have systems with large amounts of associative memory so they can remember things? Yes, but that won't happen tomorrow either. We need to develop some basic techniques; we have many of them, but getting them to work together in a complete system is another story. Will we have systems that can reason and plan, perhaps along the lines of the goal-driven AI architectures I described earlier? Yes, but it will take a while before we get them to work properly.
It's going to be at least a decade, probably more, before we get all of these things to work together — before we get systems based on this that learn hierarchical planning and hierarchical representations, and that can be configured, the way a human brain can, for whatever situation is at hand — because there are a lot of problems we haven't seen yet, that we haven't encountered, so we don't know whether there are simple solutions within this framework. For the past decade or so, I've heard people claim that AGI is just around the corner, and they've all been wrong. IQ can measure something about humans because humans are relatively uniform in form, but it only measures one kind of ability that may be relevant to some tasks and not others. And if you're talking about other intelligent entities for which the basic things that are easy to do are completely different, then it makes no sense. Intelligence is a collection of skills plus the ability to acquire new skills efficiently, and the set of skills a particular intelligent entity possesses or can learn quickly is different from that of another. Because this is a multidimensional thing — the skill set is a high-dimensional space — you can't measure it with a single number, and you can't compare two entities to say one is smarter than the other. It is multidimensional.

Lex Fridman: You often speak out against the so-called AI doomers. Explain their views and why you think they are wrong.

Yann LeCun: The AI doomers imagine all kinds of catastrophe scenarios in which AI escapes or takes control and essentially kills us all, and that relies on a whole bunch of assumptions, most of which are wrong. The first assumption is that the emergence of superintelligence will be an event: at some point we will discover the secret, we will switch on a superintelligent machine, and because we have never done this before, it will take over the world and kill us all. That is wrong. It will not be an event. We will have systems that are, say, as smart as a cat — they will have all the characteristics of human-level intelligence, but at the level of a cat, or a parrot, or something like that. Then we will gradually make them smarter, and as we make them smarter, we will also put guardrails on them and learn how to set guardrails so that they behave properly. In nature, it does seem that the smarter species ends up dominating the others, sometimes even driving other species extinct — sometimes intentionally, sometimes simply by accident. So you might think, "Well, if AI systems are smarter than us, they will surely wipe us out, if not on purpose, then simply because they don't care about us," and that is absurd. The first reason is that they will not be a species competing with us and will not have the desire to dominate, because the desire to dominate is not something intrinsic to intelligent systems. It is deeply wired into humans, and shared by baboons, chimpanzees, and wolves, but not by orangutans. This drive to dominate, to submit, to seek status is specific to social species. Non-social species like orangutans have no such drive, and they are nearly as smart as we are.

Lex Fridman: Do you think there will be millions of humanoid robots walking around soon?

Yann LeCun: Not soon, but it will happen. I think the next ten years of the robotics industry will be very interesting. People have been expecting the rise of the robotics industry for ten or twenty years, and it has not really materialized yet.
The main question remains Moravec's paradox: how do we get these systems to understand how the world works and plan actions, so that they can accomplish truly specialized tasks? What Boston Dynamics does is built largely on hand-crafted dynamical models and careful advance planning — very classical robotics with a lot of innovation and a little bit of perception — but it is still not enough, and they cannot build a household robot. We are also still some distance away from fully autonomous Level 5 driving, for example a system that can train itself with about 20 hours of practice the way a 17-year-old does. So we will not make significant progress in robotics until we have world models — systems that can train themselves to understand how the world works.