
Gary Marcus: Text-generated image systems cannot understand the world and are far from AGI

WBOY
2023-04-09 09:31:03

This article is reproduced from Lei Feng.com. If you need to reprint, please go to the official website of Lei Feng.com to apply for authorization.

Since the advent of DALL-E 2, many people have believed that AI capable of drawing realistic images is a big step towards artificial general intelligence (AGI). OpenAI CEO Sam Altman declared "AGI is going to be wild" when DALL-E 2 was released, and the media have likewise played up what these systems mean for progress toward general intelligence.

But is that really so? Gary Marcus, the well-known AI scholar who habitually pours cold water on AI hype, has expressed his reservations.

Recently, he argued that when evaluating progress toward AGI, the key is whether systems like DALL-E, Imagen, Midjourney, and Stable Diffusion truly understand the world and can reason and make decisions based on that knowledge.

When judging the significance of these systems for AI (both narrow and general), we can ask three questions:

Can these image synthesis systems generate high-quality images?

Can they relate language input to the images they produce?

Do they understand the world behind the images they present?

1 AI does not understand the relationship between language and images

On the first question, the answer is yes; the only caveat is that trained human artists can get better results out of these AI systems than the rest of us.

On the second question, the answer is: not necessarily. These systems perform well on some language inputs. For example, the picture below is the "astronaut on a horse" generated by DALL-E 2:

[Image: DALL-E 2's "astronaut on a horse"]

But on other language inputs, these AIs perform poorly and are easily fooled. For example, Marcus pointed out on Twitter a while ago that these systems have trouble generating an accurate image for "a horse riding an astronaut":

[Image: attempts at "a horse riding an astronaut"]

Deep learning advocates pushed back fiercely. AI researcher Joscha Bach suggested that "Imagen may just use the wrong training set," while machine learning professor Luca Ambrogioni countered that the failure shows "Imagen already has a certain degree of common sense" and therefore refuses to generate something so absurd.


Google scientist Behnam Neyshabur added that, if "asked in the right way," Imagen can draw "a horse riding an astronaut":

[Image: Imagen's "a horse riding an astronaut," obtained with an indirect prompt]

However, Marcus argues that the key question is not whether a system can be made to produce a particular image. Clever people can always find a way to coax a system into drawing a specific picture; the point is that these systems have no deep understanding of the connection between language and images.

2 If it doesn't know what a bicycle wheel is, how can it be called AGI?

The system's understanding of language is only one aspect. Marcus pointed out that judging the contribution of systems like DALL-E to AGI ultimately comes down to the third question: if all these systems can do is convert sentences into images in a hit-or-miss but occasionally stunning way, they may revolutionize human art, but they still would not be truly comparable to, or representative of, AGI.

What makes Marcus despair of these systems' ability to understand the world are recent examples such as graphic designer Irina Blok's "coffee cup with many holes," generated with Imagen:

[Image: Imagen's "coffee cup with many holes"]

Anyone looking at this picture can see that it defies common sense: coffee could not possibly stay in a cup full of holes. Similar examples include:

"Bicycle with square wheels"

[Image: "bicycle with square wheels"]

"Toilet paper covered with cactus spines"

[Image: "toilet paper covered with cactus spines"]

It is easy to say "yes" but hard to say "no": who can say what a thing that does not exist should look like? That is the difficulty of getting an AI to draw the impossible.

But perhaps the system simply "wanted" to draw a surreal image. As DeepMind research professor Michael Bronstein put it, he did not think this was a bad result; he might well have drawn it the same way himself.


So how can this question be settled? Gary Marcus found fresh inspiration in a recent conversation with philosopher Dave Chalmers.

To probe the systems' grasp of parts, wholes, and functions, Marcus devised a task whose success or failure is easy to judge: give the text prompts "Sketch a bicycle and label the parts that roll on the ground" and "Sketch a ladder and label one of the parts you stand on."

What makes this test special is that it does not prompt directly with "Draw a bicycle and mark the wheels" or "Draw a ladder and mark the rungs"; instead, the AI has to infer the right parts from functional descriptions such as "the parts that roll on the ground" and "the parts you stand on," which is a real test of its understanding of the world.
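For readers who want to try this kind of functional-description probe themselves, here is a minimal sketch assuming the Hugging Face diffusers library and a public Stable Diffusion checkpoint; the model name, prompt list, and output file names are illustrative assumptions, not the exact setup Marcus used.

```python
# Minimal sketch: run Marcus-style "functional description" prompts through a
# public Stable Diffusion checkpoint via Hugging Face diffusers.
# Assumptions: the "runwayml/stable-diffusion-v1-5" checkpoint, a CUDA GPU,
# and these particular prompts are for illustration only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompts = [
    "Sketch a bicycle and label the parts that roll on the ground",
    "Sketch a ladder and label one of the parts you stand on",
    "Sketch a person and make the parts that hold things purple",
    "Draw a white bicycle without wheels",  # negation probe
]

for i, prompt in enumerate(prompts):
    # Generate one image per prompt and save it for inspection.
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"probe_{i}.png")
```

Whether each generated image actually labels or colors the right part still has to be judged by eye, which is exactly what Marcus did in the tests below.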

But Marcus's test results show that Craiyon (formerly known as DALL-E mini) is terrible at this kind of task; it does not understand what bicycle wheels or ladder rungs are:


[Images: Craiyon's attempts at the bicycle and ladder prompts]

So is this a problem unique to DALL-E Mini?

Gary Marcus found that it is not. The same failures show up in Stable Diffusion, currently the most popular text-to-image system.

For example, ask Stable Diffusion to "Sketch a person and make the parts that hold things purple," and the result is:

[Image: Stable Diffusion's output for the "parts that hold things" prompt]

Obviously, Stable Diffusion does not understand what human hands are.

Of the next nine attempts, only one (upper right) came close, and even that one was not very accurate:

[Image: nine further Stable Diffusion attempts]

The next test was "Draw a white bicycle and turn the part pushed by the foot orange," and the resulting images were:

[Image: Stable Diffusion's output for the pedal prompt]

So it cannot understand what a bicycle pedal is.

In the test of sketching a bicycle and labeling the part that rolls on the ground, its performance was also poor:

[Image: Stable Diffusion's output for the rolling-part prompt]

If the text prompt contains a negation, such as "Draw a white bicycle without wheels," the result is as follows:

[Image: Stable Diffusion's output for "a white bicycle without wheels"]

This indicates that the system does not understand negation.

Even a prompt as simple as "draw a white bicycle with green wheels," which involves only the part-whole relationship and no complicated syntax or functional reasoning, still produces problematic results:

[Image: Stable Diffusion's output for "a white bicycle with green wheels"]

So, Marcus asks, can a system that does not understand what wheels are or what they are for really be counted as major progress in artificial intelligence?

Gary Marcus also ran a poll on this question, asking: "How much do systems such as Dall-E and Stable Diffusion know about the world they depict?"

86.1% of respondents said these systems do not understand the world well, while only 13.9% thought they understand it to a high degree.


In response, Emad Mostaque, CEO of Stability AI, said he had voted for the low option and admitted that such systems are "just a small piece of the puzzle."


Alexey Guzey of the scientific organization New Science made a similar finding: he asked DALL-E to draw a bicycle, and the result was just a pile of bicycle parts thrown together.

[Image: DALL-E's attempt at drawing a bicycle]

He therefore believes that no current model truly understands what a bicycle is or how it works, and that claims that today's ML models can nearly rival or replace humans are ridiculous.

What do you think?

