Large language model beats diffusion model! Video and image generation dual SOTA: Google and CMU's latest research, first-authored by a Peking University alumnus
A language model has defeated diffusion models and achieved double SOTA in video and image generation!
This is the latest research result from Google and CMU.
According to the paper, this is the first time a language model has beaten diffusion models on the iconic ImageNet benchmark. The key component behind it is the visual tokenizer, which maps pixel-space inputs into discrete tokens suitable for LLM learning. The Google and CMU team proposed MAGVIT-v2, which also surpasses the previous best visual tokenizer on two additional tasks.
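To make the interface concrete, here is a minimal toy sketch in PyTorch. It is not the paper's architecture: a single strided 3D convolution stands in for the real encoder, a classic VQ codebook lookup stands in for the quantizer (MAGVIT-v2 replaces this lookup with LFQ, sketched later in the article), and all layer sizes and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Toy stand-in for a visual tokenizer: video pixels -> discrete token ids."""

    def __init__(self, vocab_size: int = 1024, latent_dim: int = 8):
        super().__init__()
        # a single strided 3D conv stands in for the real CNN encoder
        self.encoder = nn.Conv3d(3, latent_dim, kernel_size=4, stride=4)
        self.codebook = nn.Embedding(vocab_size, latent_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        z = self.encoder(video)                          # (B, d, T', H', W')
        z = z.permute(0, 2, 3, 4, 1)                     # (B, T', H', W', d)
        # nearest-codebook assignment (classic VQ lookup)
        dists = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        ids = dists.argmin(dim=-1)
        return ids.view(z.shape[:-1])                    # one token id per latent position

video = torch.randn(1, 3, 16, 128, 128)                  # (B, C, T, H, W), sizes are illustrative
tokens = ToyVisualTokenizer()(video)
print(tokens.shape)                                       # e.g. torch.Size([1, 4, 32, 32])
```

Flattened into a 1D sequence, these integer ids are what the language model is trained to predict token by token; a paired decoder then maps generated ids back to pixels.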
## Large language model defeats diffusion model
While large language models dominate generative tasks in language, they have long lagged behind diffusion models in visual generation.
The team believes the main reason is the lack of a good visual representation, analogous to our natural language system, that can effectively model the visual world. Unlike natural language, humans have not evolved an optimal vocabulary for the visual world, and this limits the visual generation capabilities of large language models.
Based on this judgment, the research makes three main contributions:
- A new visual tokenizer that delivers the best performance to date on visual generation, video compression, and action recognition. Building on the previous SOTA visual tokenizer MAGVIT (Masked Generative Video Transformer), it introduces two key designs: Lookup-Free Quantization (LFQ) and a joint image-video tokenizer (see the LFQ sketch after this list).
- In video and image generation, it outperforms diffusion models on both ImageNet 512×512 and Kinetics-600.
- In video compression and action recognition, it also surpasses previous results.
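As a rough illustration of the LFQ idea, here is a minimal sketch: each dimension of the continuous latent is quantized independently to {-1, +1} by its sign, so the codebook of size 2^d is implicit and no embedding lookup is needed. The encoder and the entropy regularization used in training are omitted, and the straight-through trick shown here is an assumption about a common way to keep the step differentiable.

```python
import torch

def lfq_quantize(z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Lookup-free quantization sketch.

    z: continuous latents of shape (..., d). Each dimension is quantized
    independently to {-1, +1} by its sign, so the implicit codebook has
    2**d entries and no embedding table is needed.
    Returns (quantized latents, integer token ids).
    """
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # per-dimension sign quantization
    q = z + (q - z).detach()                      # straight-through estimator: gradients pass through z
    bits = (q > 0).long()                         # (..., d) binary code
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)
    token_ids = (bits * powers).sum(dim=-1)       # integer id in [0, 2**d)
    return q, token_ids

# toy usage: 4 latent vectors with d = 10 -> an implicit vocabulary of 1024 tokens
z = torch.randn(4, 10, requires_grad=True)
q, ids = lfq_quantize(z)
print(ids)   # values in [0, 1023]
```

Because the codebook never has to be looked up or stored as embeddings, the vocabulary can be made very large (2^d) without the usual codebook-collapse issues of standard VQ.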
## The first author is an alumnus of Peking University
Lijun Yu is currently a doctoral student at CMU's Language Technologies Institute in the School of Computer Science, advised by Professor Alexander G. Hauptmann, and is also a Student Researcher at Google. His research interests lie in multimodal foundation models, especially for multi-task video generation. Before coming to CMU, he received a double bachelor's degree in computer science and economics from Peking University. There are many other Chinese researchers on the team as well.
The corresponding author, Lu Jiang, is a research scientist at Google Research and an adjunct professor at CMU. His research focuses on multimodal big data, especially robust deep learning, generative AI, and multimodal foundation models.
Paper link:
https://arxiv.org/abs/2310.05737
https://magvit.cs.cmu.edu/v2/