
30-page paper! New work by Philip S. Yu's (Yu Shilun's) team: a comprehensive AIGC survey tracing the development from GAN to ChatGPT

WBOY (forwarded), 2023-04-04 11:50:02

2022 can fairly be called year one of generative AI. Recently, Philip S. Yu's (Yu Shilun's) team published a comprehensive survey of AIGC that traces its development from GAN to ChatGPT.

The year 2022, which has just ended, was undoubtedly the singularity at which generative AI exploded.

Since 2021, generative AI has appeared in Gartner's "Hype Cycle for Artificial Intelligence" for two consecutive years and is regarded as an important AI technology trend for the future.


Paper address: https://arxiv.org/pdf/2303.04226.pdf

This article presents excerpts from the paper.

The singularity has arrived?

In recent years, artificial intelligence-generated content (AIGC, also known as generative AI) has attracted widespread attention outside the computer science community.

Society at large has begun to take great interest in the content generation products built by large technology companies, such as ChatGPT and DALL-E-2.

AIGC refers to content produced with generative AI (GAI) techniques, which can automatically create large amounts of content in a short time.

ChatGPT is a conversational AI system developed by OpenAI. It can understand human language and respond to it in a meaningful way.

In addition, DALL-E-2 is another state-of-the-art GAI model developed by OpenAI, capable of creating unique high-quality images from text descriptions in minutes.


Example of AIGC in image generation

Technically speaking, AIGC means that, given instructions that guide the model toward a task, GAI is used to generate content that satisfies those instructions. This generation process usually consists of two steps: extracting intent information from the instructions, and generating content based on the extracted intent.

However, as previous research shows, this two-step paradigm for GAI models is not entirely new.

Compared with previous work, the core of recent AIGC progress is training more sophisticated generative models on larger datasets, using larger foundation model architectures, and having access to extensive computing resources.

For example, the main framework of GPT-3 is the same as that of GPT-2, but the pre-training data grows from WebText (38GB) to CommonCrawl (570GB after filtering), and the model size grows from 1.5B to 175B parameters.


Therefore, GPT-3 has better generalization ability than GPT-2 on various tasks.

In addition to the benefits of increased data volumes and computing power, researchers are also exploring ways to combine new technologies with GAI algorithms.

For example, ChatGPT uses reinforcement learning from human feedback (RLHF) to determine the most appropriate response to a given instruction, thereby improving the model's reliability and accuracy over time. This approach enables ChatGPT to better understand human preferences in long conversations.

Meanwhile, in computer vision (CV), Stable Diffusion, proposed by Stability AI in 2022, has also achieved great success in image generation.

Unlike previous methods, generative diffusion models can help generate high-resolution images by controlling the trade-off between exploration and exploitation, which yields a harmonious combination of diversity in the generated images and similarity to the training data.

By combining these advances, models have made significant progress on AIGC tasks and have been adopted in industries as diverse as art, advertising, and education.

In the near future, AIGC will remain an important area of machine learning research.

Generally speaking, GAI models can be divided into two types: unimodal models and multimodal models.

Therefore, it is crucial to conduct a comprehensive review of past research and identify the open problems in this field. This is the first comprehensive survey of AIGC that summarizes GAI in terms of both core technologies and applications.

Previous surveys mainly introduced GAI from different perspectives, including natural language generation, image generation, and multi-modal machine learning generation. However, these previous works only focused on specific parts of AIGC.

The survey first reviews the basic technologies commonly used in AIGC. It then provides a comprehensive summary of advanced GAI algorithms, covering both unimodal and multimodal generation. It also examines the applications and potential challenges of AIGC.

Finally, it highlights future directions for the field. In summary, the main contributions of the paper are as follows:

- To the best of our knowledge, we are the first to provide a formal definition and comprehensive survey of AIGC and AI-augmented generative processes.

- We review the history and basic technology of AIGC, and conduct a comprehensive analysis of the latest progress in GAI tasks and models from the perspectives of unimodal and multimodal generation.

- We discuss the main challenges facing AIGC and future research trends.

Generative AI History

Generative models have a long history in artificial intelligence, dating back to the development of Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) in the 1950s.

These models generated sequential data such as speech and time series. However, it was not until the advent of deep learning that the performance of generative models improved significantly.

In the early years of deep generative models, different fields generally did not overlap much.

The development history of generative AI in CV, NLP and VL

In NLP, the traditional way to generate sentences is to learn the distribution of words with an N-gram language model and then search for the best sequence. However, this method does not adapt well to long sentences.
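To make the N-gram idea concrete, here is a minimal sketch of a bigram (N = 2) model with greedy search. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are illustrative assumptions, not something taken from the paper.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next | current) from whitespace-tokenized sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(tokens, tokens[1:]):
            counts[current][nxt] += 1
    # Normalize raw counts into conditional probabilities.
    return {
        current: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for current, following in counts.items()
    }

def greedy_generate(model, max_len=10):
    """Greedy search: always follow the most probable next word."""
    token, output = "<s>", []
    for _ in range(max_len):
        nxt = max(model[token], key=model[token].get)
        if nxt == "</s>":
            break
        output.append(nxt)
        token = nxt
    return output

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat", "the cat sat", "the dog ran"]
model = train_bigram(corpus)
sentence = greedy_generate(model)
print(" ".join(sentence))
```

Because each prediction depends only on the single previous word, the model has no way to carry information across a long sentence, which is the limitation the text describes.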

To solve this problem, Recurrent Neural Networks (RNNs) were later introduced to language modeling tasks, allowing relatively long dependencies to be modeled.

This was followed by the development of long short-term memory (LSTM) and gated recurrent units (GRU), which use gating mechanisms to control memory during training. These methods can attend to roughly 200 tokens in a sample, a significant improvement over N-gram language models.

Meanwhile, in CV, before deep learning-based methods emerged, traditional image generation algorithms relied on techniques such as texture synthesis and texture mapping.

These algorithms are based on hand-designed features and have limited capabilities in generating complex and diverse images.

In 2014, Generative Adversarial Networks (GANs) were first proposed and became a milestone in the field of artificial intelligence because of their impressive results in various applications.

Variational autoencoders (VAEs) and other methods, such as generative diffusion models, have also been developed to provide more fine-grained control over the image generation process and to enable the generation of high-quality images.

The development of generative models in different fields has followed different paths, but eventually an intersection emerged: the Transformer architecture.

In 2017, the Transformer was introduced for NLP tasks by Vaswani et al. It was later applied to CV and then became the dominant architecture for many generative models across fields.

In NLP, many well-known large language models, such as BERT and GPT, adopt the Transformer architecture as their main building block, which offers advantages over the previous building blocks, namely LSTM and GRU.

In CV, Vision Transformer (ViT) and Swin Transformer later developed this concept further, combining the Transformer architecture with a vision component so that it can be applied to image-based downstream tasks.

In addition to the improvements brought by Transformer to a single modality, this crossover also enables models from different fields to be fused together to perform multi-modal tasks.

An example of a multimodal model is CLIP, a joint vision-language model. It combines the Transformer architecture with a visual component, allowing it to be trained on large amounts of text and image data.

Because it combines visual and linguistic knowledge during pre-training, CLIP can also serve as an image encoder in multimodal prompting for generation. In short, the emergence of Transformer-based models has revolutionized AI generation and made large-scale training possible.

In recent years, researchers have also begun to introduce new technologies based on these models.

For example, in NLP, few-shot prompting is sometimes preferred to help the model better understand task requirements. It means including in the prompt a few examples selected from the dataset.

In vision-language research, modality-specific models are combined with self-supervised contrastive learning objectives to provide more powerful representations.
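A few-shot prompt can be sketched as plain string assembly. The `Input:`/`Output:` layout, the sentiment task, and the helper name below are assumptions chosen for illustration; the paper does not prescribe a prompt format.

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical examples "selected from the dataset".
examples = [
    ("I loved this film.", "positive"),
    ("The plot was dull.", "negative"),
]
prompt = build_few_shot_prompt(
    examples,
    query="A wonderful soundtrack.",
    instruction="Classify the sentiment of each input.",
)
print(prompt)
```

The resulting string would be sent to a language model as-is; no model weights are updated, which is what distinguishes prompting from fine-tuning.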

In the future, as AIGC becomes more and more important, more and more technologies will be introduced, which will give this field great vitality.

AIGC Basics

This section introduces the foundation models commonly used in AIGC.

Foundation Model

Transformer

Transformer is the backbone architecture of many state-of-the-art models, such as GPT-3, DALL-E-2, Codex and Gopher.

It was first proposed to solve the limitations of traditional models, such as RNNs, in processing variable-length sequences and context awareness.

The architecture of Transformer is mainly based on a self-attention mechanism, which enables the model to pay attention to different parts of the input sequence.

Transformer consists of an encoder and a decoder. The encoder receives an input sequence and generates a hidden representation, while the decoder receives a hidden representation and generates an output sequence.

Each layer of the encoder and decoder consists of a multi-head attention module and a feed-forward neural network. Multi-head attention is the core component of the Transformer; it learns to assign different weights to tokens according to their relevance.

This way of routing information enables the model to handle long-range dependencies better and therefore improves performance on a wide range of NLP tasks.

Another advantage of the Transformer is that its architecture is highly parallelizable and lets the data dominate over inductive bias. This makes the Transformer well suited to large-scale pre-training, allowing Transformer-based models to adapt to different downstream tasks.
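The weighting described above can be illustrated with a single-head scaled dot-product attention sketch. It is a simplification under stated assumptions: a real Transformer first applies learned query, key, and value projections and runs several heads in parallel, whereas here the raw token vectors are used directly as queries, keys, and values, and the three toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Relevance of every token to the current query token.
        row = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        weights.append(row)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights

# Toy token vectors (hypothetical); no learned projections are applied.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in one step, regardless of distance, and each row can be computed independently, which is why the architecture handles long-range dependencies and parallelizes well.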

Pre-trained language model

Since the introduction of the Transformer architecture, it has become a mainstream choice for natural language processing due to its parallelism and learning capabilities.

These Transformer-based pre-trained language models can generally be divided into two categories according to their training tasks: autoregressive language models and masked language models.

Given a sentence consisting of multiple tokens, the goal of masked language models, such as BERT and RoBERTa, is to predict the probability of a masked token given its context.

The most notable example of a masked language model is BERT, whose pre-training includes masked language modeling and next-sentence prediction tasks. RoBERTa uses the same architecture as BERT and improves performance by increasing the amount of pre-training data and incorporating more challenging pre-training objectives.

XL-Net, which also builds on BERT, incorporates permutation operations that change the prediction order in each training iteration, enabling the model to learn more information across tokens.

Autoregressive language models, such as GPT-3 and OPT, model the probability of a token given the previous tokens, and are therefore left-to-right language models. Unlike masked language models, autoregressive language models are better suited to generative tasks.
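The difference between the two objectives shows up in how training examples are built from the same sentence. The sketch below is illustrative only: the function names and the `[MASK]` placeholder string are assumptions, and the masked position is fixed by hand here rather than sampled at random as in actual pre-training.

```python
def autoregressive_examples(tokens):
    """Each target is predicted from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_examples(tokens, mask_positions, mask_token="[MASK]"):
    """Each target is predicted from both its left and right context."""
    examples = []
    for pos in mask_positions:
        context = tokens[:pos] + [mask_token] + tokens[pos + 1:]
        examples.append((context, tokens[pos]))
    return examples

tokens = ["the", "cat", "sat", "down"]
ar = autoregressive_examples(tokens)
mlm = masked_examples(tokens, mask_positions=[1])

for context, target in ar:
    print("AR :", context, "->", target)
for context, target in mlm:
    print("MLM:", context, "->", target)
```

An autoregressive model can generate text by repeatedly appending its own prediction to the context, whereas a masked model needs the surrounding tokens to exist already, which is one way to see why the former fits generative tasks better.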

Reinforcement Learning from Human Feedback

Despite being trained on large-scale data, AIGC models may not always produce output consistent with user intent.

To make AIGC output better match human preferences, reinforcement learning from human feedback (RLHF) has been applied to model fine-tuning in various applications, such as Sparrow, InstructGPT, and ChatGPT.

Typically, the whole RLHF pipeline consists of three steps: pre-training, reward learning, and fine-tuning with reinforcement learning.
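Of the three steps, reward learning is the easiest to show in a few lines. The sketch below fits a reward model from pairwise human comparisons using a pairwise logistic loss, which is one common choice; the excerpt does not specify a loss, so treat it as an assumption. The linear reward model, the two made-up features (helpfulness, rudeness), and the toy comparisons are likewise hypothetical stand-ins for a neural reward network and real annotations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward model: a stand-in for a neural reward network."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher than rejected ones."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of -log(sigmoid(margin)) with respect to the margin.
            scale = sigmoid(margin) - 1.0
            for i in range(dim):
                weights[i] -= lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical response features: [helpfulness, rudeness].
# Each pair is (response the annotator preferred, response they rejected).
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
]
weights = train_reward_model(comparisons, dim=2)
print([round(w, 2) for w in weights])
```

In the third step, the learned reward would score the language model's outputs and a reinforcement learning algorithm would update the model to increase that score; that part is not shown here.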

Computing

Hardware

In recent years, hardware technology has made significant progress, facilitating the training of large models.

In the past, training a large neural network using a CPU could take days or even weeks. However, with the increase in computing power, this process has been accelerated by several orders of magnitude.

For example, NVIDIA's A100 GPU is 7 times faster than the V100 and 11 times faster than the T4 in BERT-Large inference.

In addition, Google’s Tensor Processing Unit (TPU) is designed for deep learning and provides higher computing performance compared to the A100 GPU.

The accelerated advancement of computing power has significantly improved the efficiency of artificial intelligence model training, providing new possibilities for the development of large and complex models.

Distributed training

Another major improvement is distributed training.

In traditional machine learning, training is usually performed on a single machine with a single processor. This approach works well for small datasets and models, but becomes impractical for large datasets and complex models.

In distributed training, the training workload is split across multiple processors or machines, which greatly improves training speed.
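One common form of this is synchronous data parallelism, simulated below in a single process. The one-parameter model `y = w * x`, the toy data, and the two equal-sized shards are assumptions for illustration; in a real system each shard's gradient would be computed on a separate device and the averaging would be done by a collective communication step.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients per worker, then an average."""
    # In a real system these would run concurrently on separate devices.
    local_grads = [gradient(w, shard) for shard in shards]
    # Stands in for an all-reduce across workers.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad

# Toy data generated from y = 3x, split into two equal-sized shards.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.05)
print(round(w, 6))
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so the workers reach the same result as a single machine would while each one processes only part of the data.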

Some companies have also released frameworks that simplify the distributed training process of deep learning stacks. These frameworks provide tools and APIs that allow developers to easily distribute training tasks across multiple processors or machines without having to manage the underlying infrastructure.

Cloud computing

Cloud computing also plays a vital role in training large models. Previously, models were often trained locally. Now, with cloud computing services such as AWS and Azure providing access to powerful computing resources, deep learning researchers and practitioners can create large GPU or TPU clusters required for large model training on demand.

Collectively, these advances make it possible to develop more complex and accurate models, opening up new possibilities in various areas of artificial intelligence research and applications.

About the author

Philip S. Yu is a scholar in the field of computer science, an ACM/IEEE Fellow, and a distinguished professor in the Department of Computer Science at the University of Illinois at Chicago (UIC).

He has made world-renowned achievements in the theory and technology of big data mining and management. In response to the challenges big data poses in scale, speed, and diversity, he has proposed effective, cutting-edge solutions in data mining and management methods and technologies, with groundbreaking contributions in integrating diverse data and in mining data streams, frequent patterns, subspaces, and graphs.


He also made pioneering contributions to parallel and distributed database processing technology, which was applied in the IBM S/390 Parallel Sysplex system and helped the traditional IBM mainframe transition successfully to a parallel microprocessor architecture.
