Read all SOTA generative models in one article: a complete review of 21 models in nine categories!

In the past two years, the AI industry has seen a surge in releases of large-scale generative models. The open-sourcing of Stable Diffusion and the public release of ChatGPT in particular have further fueled the industry's enthusiasm for generative models.

But generative models come in many types and are released at a rapid pace. If you are not careful, you may miss a new state of the art (SOTA).

Recently, researchers from Comillas Pontifical University in Spain comprehensively reviewed the latest progress of AI across fields. They divided generative models into nine categories by task modality and domain, and summarized 21 generative models released in 2022, giving an at-a-glance view of how generative models have developed.

Paper link: https://arxiv.org/abs/2301.04655

Generative AI classification

Models can be classified by their input and output data types. There are currently nine main categories.
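
As a rough illustration of this input/output classification, the sketch below stores the nine categories as a lookup table keyed by (input type, output type). The model names are the ones mentioned in this article, grouped the way the article groups them; the dictionary layout and the `models_for` helper are hypothetical and not taken from the paper.

```python
# Taxonomy sketch: each category is keyed by (input type, output type).
# Model names come from this article; the layout itself is illustrative only.
TAXONOMY = {
    ("text", "image"): ["DALL-E 2", "Imagen", "Stable Diffusion", "Muse"],
    ("text", "3D"): ["DreamFusion", "Magic3D"],
    ("image", "text"): ["Flamingo", "VisualGPT"],
    ("text", "video"): ["Phenaki", "Soundify"],
    ("text", "audio"): ["AudioLM", "Jukebox", "Whisper"],
    ("text", "text"): ["ChatGPT", "LaMDA", "PEER"],
    ("text", "code"): ["Codex", "AlphaCode"],
    ("text", "science"): ["Galactica", "Minerva"],
    ("other", "other"): ["AlphaTensor", "GATO", "Human Motion Diffusion"],
}


def models_for(source, target):
    """Return the models listed for a given input/output pair (empty if none)."""
    return TAXONOMY.get((source, target), [])


print(len(TAXONOMY))               # number of categories
print(models_for("text", "code"))  # models that turn text into code
```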

Interestingly, only six organizations (OpenAI, Google, DeepMind, Meta, Runway, Nvidia) are behind the deployment of these state-of-the-art models.

The main reason is that estimating the parameters of these models requires enormous computing power, as well as a team that is highly skilled and experienced in data science and data engineering.

Thus, only these companies, with the help of acquired startups and collaborations with academia, can successfully deploy generative AI models.

As for big companies' involvement with startups, Microsoft invested $1 billion in OpenAI and helps it develop models; similarly, Google acquired DeepMind in 2014.

On the university side, VisualGPT was developed by King Abdullah University of Science and Technology (KAUST), Carnegie Mellon University and Nanyang Technological University, and the Human Motion Diffusion model was developed by Tel Aviv University in Israel.

Similarly, other projects are collaborations between companies and universities: Stable Diffusion was developed by Runway, Stability AI and the University of Munich; Soundify by Runway and Carnegie Mellon University; and DreamFusion by Google and the University of California, Berkeley.

Text-to-image model

DALL-E 2

DALL-E 2, developed by OpenAI, can generate original, realistic images and art from a text description, and OpenAI has provided an API to access the model.

What is special about DALL-E 2 is its ability to combine concepts, attributes and different styles. This ability comes from CLIP (Contrastive Language-Image Pre-training), a neural network that, given an image, can be instructed in natural language to predict the most relevant text snippet.
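
The idea of picking the most relevant text snippet for an image can be sketched as a nearest-neighbor search in a shared embedding space. The example below is a minimal sketch, assuming image and text embeddings are compared by cosine similarity; the three-dimensional vectors are hand-made stand-ins for real encoder outputs, and `most_relevant_text` is a hypothetical helper, not part of any CLIP API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_relevant_text(image_embedding, text_embeddings):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(
        text_embeddings,
        key=lambda caption: cosine(image_embedding, text_embeddings[caption]),
    )


# Hand-made vectors standing in for real image/text encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a diagram of a circuit": [0.2, 0.1, 0.9],
}

best = most_relevant_text(image_embedding, text_embeddings)
print(best)
```

Because the candidate captions are supplied at query time, the same procedure works for labels never seen during training, which is the intuition behind zero-shot use.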

Specifically, CLIP embeddings have several desirable properties: they are robust to shifts in image distribution, have strong zero-shot capabilities, and achieve state-of-the-art results after fine-tuning.

To obtain a complete image generation model, the CLIP image-embedding decoder is combined with a prior model, which generates the relevant CLIP image embeddings from a given text caption.

Other models include Imagen, Stable Diffusion and Muse.

Text-to-3D model

For some industries, generating 2D images alone is not enough to complete automation. In gaming, for example, 3D models need to be generated.

DreamFusion

DreamFusion, developed by Google Research, uses a pre-trained 2D text-to-image diffusion model for text-to-3D synthesis.

DreamFusion replaces earlier CLIP-based techniques with a loss derived from distilling a 2D diffusion model. That is, the diffusion model serves as a loss in a general continuous optimization problem to generate samples.

Whereas other methods mainly sample in pixel space, DreamFusion samples in parameter space, which is much harder because the goal is a 3D model whose renderings look like good images from random angles. To solve this, DreamFusion uses a differentiable generator.
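
To make "a diffusion model used as a loss in parameter space" more concrete, here is a toy sketch of a score-distillation-style update. Everything in it is an assumption for illustration: the "image" is a single number, `render` is a hypothetical differentiable generator, and `predicted_noise` is the exact noise prediction for Gaussian data, standing in for a trained text-to-image diffusion model. It is not DreamFusion's implementation.

```python
import math
import random

# Toy "image" distribution N(MU, S0^2) that the stand-in diffusion model knows.
MU, S0 = 3.0, 0.5


def predicted_noise(x_t, alpha, sigma):
    """Exact noise prediction for Gaussian data (stand-in for a trained model)."""
    return sigma * (x_t - alpha * MU) / (alpha ** 2 * S0 ** 2 + sigma ** 2)


def render(theta):
    """Hypothetical differentiable generator: parameter -> one-number 'image'."""
    return 2.0 * theta


def render_grad(theta):
    """Derivative of render with respect to theta."""
    return 2.0


def distill(theta=0.0, steps=4000, lr=0.005, seed=0):
    """Optimize theta so that noised renderings look plausible to the model."""
    rng = random.Random(seed)
    for _ in range(steps):
        t = rng.uniform(0.02, 0.98)  # random noise level
        alpha, sigma = math.sqrt(1.0 - t), math.sqrt(t)
        eps = rng.gauss(0.0, 1.0)
        x_t = alpha * render(theta) + sigma * eps  # noised rendering
        # Update direction: (predicted noise - injected noise) * d render / d theta
        grad = (predicted_noise(x_t, alpha, sigma) - eps) * render_grad(theta)
        theta -= lr * grad
    return theta


theta = distill()
print(render(theta))  # should end up close to MU
```

The optimization never samples an image directly. It only moves the generator's parameter until its output sits where the stand-in model assigns high probability, which is the sense in which the diffusion model acts as a loss.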

Other models include Magic3D, developed by NVIDIA.

Image-to-Text model

It is also useful to obtain text that describes an image, which is the inverse of image generation.

Flamingo

Developed by DeepMind, this model can perform few-shot learning on open-ended vision-language tasks, prompted with just a few input/output examples.

Specifically, Flamingo is a visually conditioned autoregressive text generation model: it takes a sequence of text tokens interleaved with images or videos as input and produces text as output.

Users can enter a query and attach a photo or video, and the model will respond with a text answer.
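
The interleaved few-shot input described above can be pictured as a flat list of image and text segments. The sketch below only builds such a list; the `Image` class, the file names and `build_few_shot_prompt` are hypothetical illustrations, and no model is called.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Image:
    path: str  # hypothetical file path; nothing is loaded from disk


Segment = Union[Image, str]


def build_few_shot_prompt(examples, query_image, question):
    """Interleave (image, text) demonstrations, then append the query."""
    prompt: List[Segment] = []
    for image, answer in examples:
        prompt.append(image)
        prompt.append(answer)
    prompt.append(query_image)
    prompt.append(question)
    return prompt


prompt = build_few_shot_prompt(
    examples=[
        (Image("example_1.jpg"), "This is a chinchilla."),
        (Image("example_2.jpg"), "This is a shiba."),
    ],
    query_image=Image("query.jpg"),
    question="This is",
)
for segment in prompt:
    print(segment)
```

A model of this kind would be expected to continue the final text segment, using the preceding image/text pairs as demonstrations of the task.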

The Flamingo model leverages two complementary models: a visual model that analyzes visual scenes and a large language model that performs basic forms of reasoning.

VisualGPT

VisualGPT is an image captioning model that leverages knowledge from GPT-2, the pre-trained language model developed by OpenAI.

To bridge the semantic gap between the different modalities, the researchers designed a new encoder-decoder attention mechanism with a rectified gating function.

The biggest advantage of VisualGPT is that it does not require as much data as other image-to-text models. It improves the data efficiency of image captioning models, so it can be applied in niche fields or used to describe rare objects.

Text-to-Video model

Phenaki

Developed by Google Research, this model performs realistic video synthesis from a sequence of text prompts.

Phenaki is the first model capable of generating videos from open-domain, time-variable prompts.

To solve the data problem, the researchers jointly trained on a large image-text pair dataset and a smaller number of video-text examples, ultimately achieving generalization capabilities beyond the video dataset.

The main reason is that image-text datasets often contain billions of examples, while text-video datasets are much smaller. Handling videos of variable length is also a difficult problem.

The Phenaki model contains three parts: C-ViViT encoder, training Transformer and video generator.

After the input tokens are converted into embeddings, they pass through a temporal Transformer and then a spatial Transformer. A single linear projection without activation then maps the tokens back to pixel space.

The final model can generate temporally coherent and diverse videos conditioned on open-domain prompts, and can even handle some new concepts that do not exist in the dataset.

Related models include Soundify.

Text-to-Audio model

For video generation, sound is also an indispensable part.

AudioLM

This model was developed by Google and can generate high-quality audio with long-term consistency.

What is special about AudioLM is that it maps the input audio into a sequence of discrete tokens and treats audio generation as a language modeling task in this representation space.

By training on a large corpus of raw audio waveforms, AudioLM learned to generate natural and coherent speech continuations from short prompts. The method even extends beyond speech, for example to piano music continuations, without any symbolic representation being added during training.
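
The "audio as a token sequence" idea can be sketched with deliberately simple stand-ins. In the toy example below, a uniform quantizer plays the role of the learned audio tokenizer and a bigram count table plays the role of the language model; both are assumptions chosen for brevity and are far simpler than what AudioLM actually uses.

```python
import math
from collections import Counter, defaultdict


def quantize(samples, levels=16):
    """Map samples in [-1, 1] to integer tokens in [0, levels - 1]."""
    return [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]


def fit_bigram(tokens):
    """Count which token follows which: a minimal 'language model'."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_sequence(counts, prompt, n_new):
    """Greedily extend a token prompt with the most frequent next token."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out


# A sine wave stands in for a raw audio waveform.
wave = [math.sin(2 * math.pi * i / 32) for i in range(512)]
tokens = quantize(wave)
model = fit_bigram(tokens)
continuation = continue_sequence(model, tokens[:4], n_new=8)
print(continuation)
```

Once audio is a sequence of integers, "continue this audio prompt" becomes the same next-token prediction problem that text language models solve.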

Because audio signals involve abstractions at multiple scales, it is very challenging to achieve high audio quality while remaining consistent across those scales during synthesis. AudioLM achieves this by combining recent advances in neural audio compression, self-supervised representation learning, and language modeling.

For subjective evaluation, raters were asked to listen to a 10-second sample and decide whether it was human speech or synthesized speech. Over the 1000 ratings collected, the rate of correct answers was 51.2%, which is not statistically different from assigning labels at random; in other words, humans could not distinguish synthetic samples from real ones.

Other related models include Jukebox and Whisper.
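
The claim that 51.2% over 1000 ratings is indistinguishable from chance can be checked with a quick calculation. The sketch below uses a normal-approximation test of a proportion against 50%; treating 51.2% as 512 correct answers out of 1000 is an assumption, and this is not necessarily the test the AudioLM authors ran.

```python
import math


def two_sided_p_value(successes, trials, p0=0.5):
    """Normal-approximation test of an observed proportion against p0."""
    p_hat = successes / trials
    standard_error = math.sqrt(p0 * (1.0 - p0) / trials)
    z = (p_hat - p0) / standard_error
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z, p_value = two_sided_p_value(successes=512, trials=1000)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

The z statistic comes out at roughly 0.76, and the p-value is far above the usual 0.05 threshold, so the result is consistent with raters guessing.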

Text-to-Text model

These models are commonly used for question-answering tasks.

ChatGPT

The popular ChatGPT was developed by OpenAI to interact with users in a conversational manner.

The user asks a question or supplies the first part of a text, and the model completes the rest. It can also challenge incorrect premises and reject inappropriate requests.

Specifically, ChatGPT is built on the Transformer, and its training process relies mainly on reinforcement learning from human feedback.

The initial model was trained with supervised fine-tuning: human trainers provided conversations in which they played both sides, the user and the AI assistant. The trainers then corrected the responses returned by the model, helping it improve with the correct answers.

The resulting dataset was mixed with the InstructGPT dataset, which was converted into a dialogue format.

Other related models include LaMDA and PEER.

Text-to-Code model

This category is similar to text-to-text, except that it generates a special type of text, namely code.

Codex

This model, developed by OpenAI, can translate text into code.

Codex is a general programming model that can be applied to basically any programming task.

Programming can be divided into two parts: 1) decomposing a problem into simpler problems; 2) mapping those simpler problems onto existing code (a library, API or function).

The second part is the most time-consuming for programmers, and it is also what Codex is best at.
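
The two steps can be seen in a small hand-written example. The function below is an illustration of the decomposition/mapping split, not output from Codex: the docstring states the problem, the comments mark the decomposition, and the body maps each sub-problem onto code that already exists in Python's standard library.

```python
from collections import Counter


def most_common_words(text, n):
    """Return the n most frequent words in text, ignoring case."""
    # Step 1 (decomposition): split into words, normalize case, count, rank.
    words = text.lower().split()
    # Step 2 (mapping): counting and ranking already exist as Counter.most_common.
    return [word for word, _ in Counter(words).most_common(n)]


print(most_common_words("the cat and the hat and the bat", 2))
```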

The training data was collected in May 2020 from public software repositories hosted on GitHub and contains 179GB of Python files. The model was fine-tuned from GPT-3, which already contains strong natural language representations.

Related models also include AlphaCode.

Text-to-Science model

Scientific text is also one of the targets of AI text generation, but there is still a long way to go before it delivers results.

Galactica

This model was jointly developed by Meta AI and Papers with Code. It is a large model for automatically organizing scientific text.

The main advantage of Galactica is that the model does not overfit even after training for multiple epochs, and both upstream and downstream performance improve with repeated tokens.

And the design of the dataset is crucial to this approach, as all data is processed in a common markdown format, enabling the mixing of knowledge from different sources.

Citations are handled through a dedicated token, which lets researchers predict a citation given any input context. Galactica's ability to predict citations improves with scale.

Additionally, the model uses a decoder-only Transformer architecture with GeLU activation at all model sizes, and it can perform multi-modal tasks involving SMILES chemical formulas and protein sequences.
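
A dedicated citation token can be pictured as a pair of markers wrapped around each reference in the training text. In the sketch below, the marker names `[START_REF]` and `[END_REF]` are not given in this article and should be treated as an assumption; the helper functions are hypothetical and only manipulate strings.

```python
# Assumed marker names; they are not stated in this article.
START_REF, END_REF = "[START_REF]", "[END_REF]"


def wrap_citation(text_before, reference):
    """Training-style text: the reference is enclosed in citation markers."""
    return f"{text_before} {START_REF}{reference}{END_REF}"


def citation_prompt(text_before):
    """Prompt that ends with the opening marker, so the next tokens are a reference."""
    return f"{text_before} {START_REF}"


print(wrap_citation("Recurrent networks with gated memory cells", "Long Short-Term Memory, Hochreiter"))
print(citation_prompt("Attention-based sequence models"))
```

Ending a prompt with the opening marker is what turns citation prediction into ordinary next-token generation in any context.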

Minerva

Minerva's main purpose is to solve mathematical and scientific problems. To this end, it focuses on collecting a large amount of relevant training data, tackles quantitative reasoning problems, builds models at scale, and employs state-of-the-art inference techniques.

Minerva solves an input problem by sampling a step-by-step solution from the language model. The generated solution includes the necessary calculations and symbolic manipulation, without relying on external tools.

Other models

There are also some models that do not fall into the previously mentioned categories.

AlphaTensor

Developed by DeepMind, it is a revolutionary model in the industry because of its ability to discover new algorithms.

In the published example, AlphaTensor created a more efficient matrix multiplication algorithm. This algorithm is so important that everything from neural networks to scientific computing programs can benefit from this efficient multiplication calculation.

The method is based on deep reinforcement learning: the agent, AlphaTensor, is trained to play a single-player game whose goal is to find tensor decompositions within a finite factor space.

At each step of TensorGame, the player chooses how to combine different entries of the matrices to be multiplied, and receives a reward based on the number of operations required to reach the correct multiplication result. AlphaTensor uses a special neural network architecture to exploit the symmetries of the synthetic training game.
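
To see why the number of operations matters, consider the classic Strassen algorithm, which multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8. It is shown here as a well-known example of the kind of algorithm this search space contains; it is not one of the algorithms AlphaTensor discovered.

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices using 7 scalar multiplications."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(a, b):
    """Reference implementation using 8 scalar multiplications."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(strassen_2x2(a, b))
print(naive_2x2(a, b))
```

Each of the seven products corresponds to one term of a decomposition of the matrix multiplication tensor, so fewer terms means fewer multiplications. Applied recursively to matrix blocks, saving one multiplication per level compounds into a lower overall cost.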

GATO

This model is a generalist agent developed by DeepMind. It works as a multi-modal, multi-task, multi-embodiment generalist policy.

The same network with the same weights can host very different capabilities: playing Atari games, captioning images, chatting, stacking blocks, and more.

Using a single neural sequence model across all tasks has many benefits: it reduces the need to hand-craft policy models with their own inductive biases, and it increases the amount and diversity of training data.

This general-purpose agent is successful on a wide range of tasks and can be tuned with little additional data to succeed on even more tasks.

GATO currently has approximately 1.2B parameters, a model scale that allows real-time control of real-world robots.

Other published generative AI models include models for generating human motion, among others.

Reference: https://arxiv.org/abs/2301.04655

Statement
This article is reproduced from 51CTO.COM. In case of any infringement, please contact admin@php.cn for removal.