What technologies does ByteDance have behind the misunderstood 'Chinese version of Sora'?

At the beginning of 2024, OpenAI dropped a blockbuster in the field of generative AI: Sora.

In recent years, technological iteration in video generation has continued to accelerate, and many technology companies have announced research progress and shipped products. Pika and Runway had already launched similar products, but the demos released for Sora clearly raised the bar for the entire field.

In the future competition, which company will be the first to create a product that surpasses Sora is still unknown.

Domestically, attention is focused on a number of major technology companies.

It was previously reported that ByteDance had developed a video generation model called Boximator before Sora was released.

Boximator provides a way to precisely control the generation of objects in videos. Users do not need to write complex text instructions, but simply draw a box in the reference image to select the target, and then add additional boxes and lines to define the target's end position or the entire cross-frame motion path, as shown in the following figure:


ByteDance has maintained a low-key attitude towards this. Relevant people responded to the media that Boximator is their project to research technical methods for controlling object movement in the field of video generation. It is not yet fully finished, and there is still a big gap between it and leading foreign video generation models in terms of picture quality, fidelity and video duration.

The technical paper (https://arxiv.org/abs/2402.01566) describes Boximator as a plug-in that can be easily integrated with existing video generation models. By adding motion control, it improves flexibility and usability while maintaining video quality.

Video generation draws on several subfields and is closely tied to image/video understanding, image generation, super-resolution, and related technologies. A closer look shows that ByteDance has published research results in several of these branches.

This article introduces 9 studies from ByteDance's intelligent creation team, covering recent results in text-to-image generation, text-to-video generation, image-to-video generation, and video understanding. Together, these studies trace the team's progress in exploring visual generative models.

What has ByteDance achieved in video generation?

In early January this year, ByteDance released the video generation model MagicVideo-V2, which sparked heated discussion in the community.



  • Paper title: MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
  • Paper link: https://arxiv.org/abs/2401.04468
  • Project address: https://magicvideov2.github.io/

The innovation of MagicVideo-V2 is that it integrates a text-to-image model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. Thanks to this design, MagicVideo-V2 performs consistently well on "aesthetics": it generates attractive high-resolution videos with relatively good fidelity and smoothness.

Specifically, the researchers first use the text-to-image (T2I) module to create a 1024×1024 image that captures the described scene. The image-to-video (I2V) module then animates this static image into a 600×600×32 sequence of frames, with the underlying noise ensuring continuity from the initial frame. The video-to-video (V2V) module enhances these frames to 1048×1048 resolution while refining the video content. Finally, the interpolation module extends the sequence to 94 frames, yielding a 1048×1048 video with high aesthetic quality and temporal smoothness.
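The shape bookkeeping of these four stages can be sketched as follows. This is only an illustration under loud assumptions: every stage is replaced by a trivial stand-in (random pixels, nearest-neighbour resizing, linear blending), the `SCALE` factor and all function names are hypothetical, and reaching 94 frames by inserting 2 frames into each of the 31 gaps is simply one way the reported counts work out, not a detail confirmed by the text above.

```python
import numpy as np

SCALE = 8  # hypothetical shrink factor so the sketch runs quickly; not from the paper


def resize_nearest(frames: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of (T, H, W, C) frames to (T, size, size, C)."""
    _, h, w, _ = frames.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frames[:, rows][:, :, cols]


def interpolate_frames(frames: np.ndarray, inserted_per_gap: int) -> np.ndarray:
    """Insert linearly blended frames between each adjacent pair of frames."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        for k in range(1, inserted_per_gap + 1):
            w = k / (inserted_per_gap + 1)
            out.append((1 - w) * a + w * b)
    out.append(frames[-1])
    return np.stack(out)


rng = np.random.default_rng(0)

# T2I stand-in: one 1024x1024 image (random pixels instead of a generated scene).
image = rng.random((1024 // SCALE, 1024 // SCALE, 3))

# I2V stand-in: turn the still image into 32 frames at 600x600.
first = resize_nearest(image[None], 600 // SCALE)
frames = np.repeat(first, 32, axis=0) + 0.01 * rng.standard_normal((32, 1, 1, 3))

# V2V stand-in: enhance the frames to 1048x1048.
frames = resize_nearest(frames, 1048 // SCALE)

# Interpolation stand-in: 32 frames plus 2 new frames in each of the 31 gaps.
video = interpolate_frames(frames, inserted_per_gap=2)
print(video.shape)  # (frames, height, width, channels) at 1/SCALE resolution
```

The point of the sketch is only the flow of tensor shapes through the pipeline; the real modules are learned diffusion components.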


A large-scale user evaluation conducted by the researchers shows that MagicVideo-V2 is preferred over several well-known T2V methods (the green, gray, and pink bars indicate that MagicVideo-V2 was rated better, on par, or worse, respectively).


Behind high-quality video generation

Research paradigm that unifies visual and language learning

The MagicVideo-V2 paper shows that progress in video generation depends on groundwork laid by AIGC technologies such as text-to-image and image-to-video generation. Generating highly aesthetic content rests on understanding, and in particular on improving the model's ability to learn and fuse the visual and language modalities.

In recent years, the scalability and general capabilities of large language models have given rise to a research paradigm that unifies vision and language learning. To bridge the natural gap between the visual and language modalities, researchers connect the representations of pre-trained large language models and vision models, extract cross-modal features, and tackle tasks such as visual question answering, image captioning, visual knowledge reasoning, and dialogue.

In these directions, ByteDance also has related explorations.

For example, to address multi-target reasoning and segmentation in open-world vision tasks, ByteDance teamed up with researchers from Beijing Jiaotong University and University of Science and Technology Beijing to propose PixelLM, an efficient large multimodal model for pixel-level reasoning, and released it as open source.



  • Paper title: PixelLM: Pixel Reasoning with Large Multimodal Model
  • Paper link: https://arxiv.org/pdf/2312.02228.pdf
  • Project Address: https://pixellm.github.io/

PixelLM can handle tasks with any number of open-set targets and varying reasoning complexity. The figure below demonstrates PixelLM's ability to generate high-quality object masks in various segmentation tasks.


The core of PixelLM is a novel pixel decoder and a segmentation codebook. The codebook contains learnable tokens that encode context and knowledge relevant to referring to targets at different visual scales, and the pixel decoder generates target masks from the hidden embeddings of the codebook tokens together with image features. While keeping the basic structure of a large multimodal model (LMM), PixelLM can generate high-quality masks without an additional, expensive visual segmentation model, which improves efficiency and transferability to different applications.
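To make the idea of "mask from codebook-token embeddings plus image features" concrete, here is a deliberately simplified sketch. It assumes one token embedding per visual scale and scores each pixel by a plain dot product with that embedding; PixelLM's actual decoder is a learned module and is not reproduced here. The names `decode_mask` and `toy_features` are hypothetical.

```python
import numpy as np


def decode_mask(token_embeds, feature_maps, out_size):
    """Toy mask decoder.

    token_embeds: list of (C,) arrays, one hidden embedding per scale.
    feature_maps: list of (H_l, W_l, C) image feature maps, one per scale.
    Returns a boolean (out_size, out_size) mask.
    """
    logits = np.zeros((out_size, out_size))
    for tok, feat in zip(token_embeds, feature_maps):
        score = feat @ tok  # per-pixel similarity to the token, shape (H_l, W_l)
        h, w = score.shape
        rows = np.arange(out_size) * h // out_size
        cols = np.arange(out_size) * w // out_size
        logits += score[rows][:, cols]  # nearest-neighbour upsample, then fuse scales
    return 1.0 / (1.0 + np.exp(-logits)) > 0.5


def toy_features(size, channels=4):
    """Feature map whose channel 0 is +1 in the top-left quadrant and -1 elsewhere."""
    feat = np.zeros((size, size, channels))
    feat[..., 0] = -1.0
    feat[: size // 2, : size // 2, 0] = 1.0
    return feat


token = np.zeros(4)
token[0] = 4.0  # pretend the token learned to respond to channel 0

mask = decode_mask([token, token], [toy_features(8), toy_features(16)], out_size=16)
print(mask.sum())  # number of pixels assigned to the target
```

Fusing per-scale scores by summation is an arbitrary choice made for brevity; the text above only states that tokens exist for different visual scales.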


It is worth noting that the researchers constructed MUSE, a comprehensive multi-target reasoning segmentation dataset. They selected a total of 910k high-quality instance segmentation masks from the LVIS dataset, together with detailed text descriptions based on image content, and used these to construct 246k question-answer pairs.

Video content is considerably more challenging for a model than images, because video not only contains rich and varied visual information but also involves dynamic changes over time.

When existing large multimodal models process video content, they usually convert video frames into a series of visual tokens and combine them with language tokens to generate text. However, as the generated text grows longer, the influence of the video content gradually weakens, so the text drifts further and further from the original video and produces so-called "hallucinations."

To address this problem, ByteDance and Zhejiang University proposed Vista-LLaMA, a multimodal large model designed specifically for the complexity of video content.


  • Paper title: Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
  • Paper link: https://arxiv.org/pdf/2312.08870.pdf
  • Project address: https://jinxxian.github.io/Vista-LLaMA/

Vista-LLaMA adopts an improved attention mechanism, Equal-Distance to Visual Tokens (EDVT) attention, which removes the relative position encoding between visual tokens and text tokens while retaining the relative position encoding among text tokens. This greatly improves the depth and accuracy of the language model's understanding of video content.
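A minimal sketch of this idea, assuming rotary position embeddings (RoPE) as the relative position encoding and ignoring causal masking, multiple heads, and softmax: text-to-text logits are computed from rotated queries and keys, while text-to-visual logits use the unrotated vectors, so every visual token sits at the same "distance" from every text token. The function names and the explicit `text_pos` argument are hypothetical and not taken from the Vista-LLaMA code.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply a rotary position embedding to x of shape (n, d), with d even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def edvt_scores(q_text, k_text, k_vis, text_pos):
    """Attention logits for text queries over [visual keys, text keys]."""
    d = q_text.shape[1]
    # Text-to-text: rotate both sides, so the logit depends on relative position.
    s_tt = rope(q_text, text_pos) @ rope(k_text, text_pos).T
    # Text-to-visual: no rotation, so the logit is independent of position.
    s_tv = q_text @ k_vis.T
    return np.concatenate([s_tv, s_tt], axis=1) / np.sqrt(d)


rng = np.random.default_rng(0)
n_vis, n_text, dim = 5, 4, 8
q = np.tile(rng.standard_normal(dim), (n_text, 1))       # same query at every position
k_text = np.tile(rng.standard_normal(dim), (n_text, 1))  # same key at every position
k_vis = rng.standard_normal((n_vis, dim))

scores = edvt_scores(q, k_text, k_vis, np.arange(n_text))
# The visual columns are identical for the first and the last text position.
print(np.allclose(scores[0, :n_vis], scores[-1, :n_vis]))
```

Using identical query and key vectors at every text position isolates the effect of position: any difference between rows then comes from the position encoding alone.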

In particular, the sequential visual projector introduced by Vista-LLaMA offers a new perspective on temporal modeling in video: it encodes the temporal context of visual tokens through a linear projection layer, which strengthens the model's ability to understand dynamic changes in the video.


In a study recently accepted by ICLR 2024, ByteDance researchers also explored a pre-training method that improves a model's ability to learn video content.

Because video-text training corpora are limited in scale and quality, most vision-language foundation models are pre-trained on image-text datasets and focus mainly on modeling visual semantic representations, while temporal semantic representations and correlations are ignored.

To solve this problem, they proposed COSA, a vision-language foundation model pre-trained on concatenated samples.



  • Paper title: COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
  • Paper link: https://arxiv.org/pdf/2306.09085.pdf
  • Project homepage: https://github.com/TXH-mercury/COSA

COSA jointly models visual content and event-level temporal cues using only image-text corpora. The researchers concatenate multiple image-text pairs in sequence as the input for pre-training. This transformation effectively converts an existing image-text corpus into a pseudo long-form video-paragraph corpus, enabling richer scene transitions and explicit event-description correspondences. Experiments demonstrate that COSA consistently improves performance on a variety of downstream tasks, including long/short video-text tasks and image-text tasks such as retrieval, captioning, and question answering.
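The concatenation step can be sketched in a few lines. This is an illustration only: the `Pair` type, the `concatenate_samples` helper, the random sampling strategy, and the example captions are all hypothetical, and images are represented by identifiers rather than pixels.

```python
import random
from dataclasses import dataclass


@dataclass
class Pair:
    image_id: str
    caption: str


def concatenate_samples(corpus, n_concat, rng):
    """Build one pseudo video-paragraph sample from n_concat image-text pairs."""
    picked = rng.sample(corpus, n_concat)
    frames = [p.image_id for p in picked]              # images act as "video frames"
    paragraph = " ".join(p.caption for p in picked)    # captions act as the "paragraph"
    return frames, paragraph


corpus = [
    Pair("img_001", "A dog runs on the beach."),
    Pair("img_002", "A chef slices vegetables."),
    Pair("img_003", "A train enters a station."),
    Pair("img_004", "Children fly a kite."),
    Pair("img_005", "A cat sleeps on a sofa."),
]

frames, paragraph = concatenate_samples(corpus, n_concat=3, rng=random.Random(0))
print(frames)
print(paragraph)
```

Because the i-th sentence of the paragraph describes the i-th frame, each pseudo sample carries the explicit event-description correspondence mentioned above.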


From image to video

Revisiting the "diffusion model"

In addition to the visual-language model, the diffusion model is also a technology used by most video generation models.

Through rigorous training on a large dataset of image-text pairs, diffusion models are able to generate detailed images based entirely on textual information. In addition to image generation, diffusion models can also be used for audio generation, time series generation, 3D point cloud generation, and more.

For example, in some short-video applications, users only need to provide a single picture to generate a strikingly realistic motion video.

The Mona Lisa, who has kept her mysterious smile for hundreds of years, can suddenly break into a run:


The technology behind this interesting application is "MagicAnimate" jointly launched by researchers from the National University of Singapore and ByteDance.

MagicAnimate is a diffusion-based human image animation framework. When generating videos that follow a given motion sequence, it maintains temporal consistency across the entire animation and improves animation fidelity. The MagicAnimate project is also open source.


  • Paper title: MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
  • Paper link: https://arxiv.org/pdf/2311.16498.pdf
  • Project address: https://showlab.github.io/magicanimate/

To solve the common "flickering" problem in generated animations, the researchers incorporated temporal attention blocks into the diffusion backbone network, building a video diffusion model for temporal modeling.

MagicAnimate breaks the entire video into overlapping segments and simply averages the predictions for the overlapping frames. Finally, the researchers also introduced an image-video joint training strategy to further enhance reference image preservation and single-frame fidelity. Although trained only on real human data, MagicAnimate has demonstrated the ability to generalize to a variety of application scenarios, including animation of unseen-domain data, integration with text-to-image diffusion models, and multi-person animation.
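The overlapping-segment averaging can be sketched as below. The window and stride values, the function name, and the `predict_segment` callback (a stand-in for one per-segment pass of the video diffusion model) are assumptions made for illustration.

```python
import numpy as np


def predict_long_video(latents, predict_segment, window, stride):
    """Run a per-segment predictor over overlapping windows and average overlaps.

    latents: array of shape (num_frames, ...).
    predict_segment: callable mapping a (k, ...) segment to a (k, ...) prediction.
    """
    n = latents.shape[0]
    acc = np.zeros_like(latents, dtype=float)
    count = np.zeros(n)

    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)  # make sure the final frames are covered

    for s in starts:
        seg = slice(s, s + window)
        acc[seg] += predict_segment(latents[seg])
        count[seg] += 1

    # Frames covered by several segments receive the mean of their predictions.
    return acc / count.reshape(-1, *([1] * (latents.ndim - 1)))


rng = np.random.default_rng(0)
latents = rng.standard_normal((10, 4, 4))

# With an identity predictor, averaging must return the input unchanged.
out = predict_long_video(latents, lambda seg: seg, window=4, stride=3)
print(np.allclose(out, latents))
```

Averaging where segments overlap smooths the transition between neighbouring windows, which is the motivation given above for splitting the video this way.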


Another diffusion-based study, "DREAM-Talk", tackles the task of generating emotional talking faces from a single portrait image.



  • Paper title: DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation
  • Paper link: https://arxiv.org/pdf/2312.13578.pdf
  • Project address: https://dreamtalkemo.github.io/

In this task, it is difficult to achieve expressive emotional speech and accurate lip synchronization at the same time: expressiveness is often heavily compromised to keep lip synchronization accurate.

"DREAM-Talk" is a diffusion-based, audio-driven framework with two stages. First, the researchers propose a novel diffusion module, EmoDiff, which generates a variety of highly dynamic emotional expressions and head poses from audio and a reference emotion style. Then, given the strong correlation between lip movements and audio, they refine the dynamics using audio features and the emotion style to improve lip synchronization accuracy. They also deploy a video-to-video rendering module to transfer the expressions and lip movements to any portrait.

Judging from the results, DREAM-Talk does perform well in expressiveness, lip synchronization accuracy, and perceived quality:


But whether it is image generation or video generation, current research based on the diffusion model route still has some basic challenges that need to be solved.

For example, many people are concerned about the quality of generated content (corresponding to SAG, DREAM-Talk). This may be related to some steps in the generation process of the diffusion model, such as guided sampling.

Guided sampling in diffusion models can be roughly divided into two categories: methods that require training and methods that do not. Training-free guided sampling uses off-the-shelf pre-trained networks (such as aesthetic evaluation models) to guide the generation process, aiming to draw on the knowledge in the pre-trained models with fewer steps and higher accuracy. Current training-free guided sampling algorithms obtain the guidance energy function from a one-step estimate of the clean image. However, the pre-trained network is trained on clean images, and the one-step estimate of the clean image can be inaccurate, especially in the early stages of sampling, which leads to inaccurate guidance at early time steps.
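Why the one-step estimate degrades at early (noisy) time steps can be seen in a small numerical sketch. It assumes the standard DDPM-style forward process x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, which the text above does not spell out, and it models the denoiser's imperfection as a fixed perturbation `eps_err` of the true noise; all names are hypothetical.

```python
import numpy as np


def one_step_x0(x_t, eps_pred, alpha_bar_t):
    """One-step estimate of the clean image from a noisy sample and predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))             # the (unknown) clean image
eps = rng.standard_normal((4, 4))            # the true noise
eps_err = 0.1 * rng.standard_normal((4, 4))  # assumed error of the noise predictor

errors = []
for alpha_bar in (0.99, 0.5, 0.01):          # late, middle, early sampling stage
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    estimate = one_step_x0(x_t, eps + eps_err, alpha_bar)
    errors.append(float(np.abs(estimate - x0).max()))

print(errors)  # the error grows as alpha_bar shrinks, i.e. as the sample gets noisier
```

The same prediction error is amplified by sqrt((1 − ᾱ_t)/ᾱ_t), so a guidance network that expects clean images receives its worst inputs exactly at the early time steps.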

In response to this problem, researchers from ByteDance and the National University of Singapore jointly proposed Symplectic Adjoint Guidance (SAG).


  • Paper title: Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method
  • Paper link: https://arxiv.org/pdf/2312.12030.pdf

SAG computes the gradient guidance in two inner stages. First, SAG estimates the clean image through n function calls, where n is a flexible parameter that can be adjusted to specific image quality requirements. Second, SAG uses the symplectic adjoint method to obtain the gradients accurately and in a memory-efficient way. This approach supports a variety of image and video generation tasks, including style-guided image generation, aesthetic improvement, and video stylization, and effectively improves the quality of the generated content.

A paper recently accepted to ICLR 2024 focuses on the "adjoint sensitivity method for gradient backpropagation of diffusion probabilistic models".



  • Paper title: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models
  • Paper link: https://arxiv.org/pdf/2307.10711.pdf

Since the sampling process of a diffusion probabilistic model involves recursive calls to the denoising U-Net, naive gradient backpropagation needs to store the intermediate states of all iterations, resulting in extremely high memory consumption.

In this paper, the researchers propose AdjointDPM, which first generates new samples from the diffusion model by solving the corresponding probability flow ODE. The gradient of the loss with respect to the model parameters (including conditioning signals, network weights, and initial noise) is then backpropagated using the adjoint sensitivity method, by solving another augmented ODE. To reduce numerical errors during forward generation and gradient backpropagation, the researchers further reparameterize the probability flow ODE and the augmented ODE into simple non-stiff ODEs using exponential integration.
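The adjoint sensitivity method itself can be illustrated on a toy problem. The sketch below is not AdjointDPM: it assumes a scalar ODE dx/dt = −θx with loss L = ½·x(T)², uses plain Euler steps instead of exponential integration, and all names are hypothetical. It shows the key property that the gradient is obtained by integrating a second ODE backward in time, without storing the forward trajectory.

```python
import math


def forward(x0, theta, T, steps):
    """Euler integration of dx/dt = -theta * x from t = 0 to t = T."""
    h = T / steps
    x = x0
    for _ in range(steps):
        x = x + h * (-theta * x)
    return x


def adjoint_gradient(x_T, theta, T, steps):
    """dL/dtheta for L = 0.5 * x(T)**2, computed by a backward augmented ODE.

    Only the current state x, the adjoint a = dL/dx(t), and the running
    gradient are kept in memory; no intermediate forward states are stored.
    """
    h = T / steps
    x, a, grad = x_T, x_T, 0.0        # a(T) = dL/dx(T) = x(T)
    for _ in range(steps):
        grad += h * a * (-x)          # accumulate a * df/dtheta, with df/dtheta = -x
        x = x - h * (-theta * x)      # step the state backward in time
        a = a - h * (theta * a)       # da/dt = -a * df/dx = theta * a
    return grad


x0, theta, T, steps = 2.0, 0.7, 1.0, 1000
x_T = forward(x0, theta, T, steps)
grad = adjoint_gradient(x_T, theta, T, steps)

# Closed form for comparison: x(T) = x0 * exp(-theta * T), so dL/dtheta = -T * x(T)**2.
analytic = -T * (x0 * math.exp(-theta * T)) ** 2
print(grad, analytic)
```

In a diffusion model, x would be the image (or latent) state and f the learned probability flow drift, which is where the memory savings over naive backpropagation matter.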

The researchers report that AdjointDPM is extremely valuable in three tasks, where it reduces the cost of optimization: converting visual effects into identification text embeddings, fine-tuning diffusion probabilistic models for specific types of stylization, and optimizing the initial noise to generate adversarial samples for security auditing.

For visual perception tasks, the method of using text-to-image diffusion model as a feature extractor has also received more and more attention. In this direction, ByteDance researchers proposed a simple and effective solution in their paper.


  • Paper title: Harnessing Diffusion Models for Visual Perception with Meta Prompts
  • Paper link: https://arxiv.org/pdf/2312.14733.pdf

The core innovation of this paper is to introduce learnable embeddings (meta prompts) into a pre-trained diffusion model to extract perceptual features, without relying on additional multimodal models to generate image captions and without using class labels from the dataset.

Meta prompts serve two purposes. First, as a direct replacement for the text embeddings in T2I models, they activate task-relevant features during feature extraction. Second, they are used to rearrange the extracted features so that the model focuses on the features most relevant to the task at hand. In addition, the researchers designed a recurrent refinement training strategy that makes full use of the properties of the diffusion model to obtain stronger visual features.

How far is there to go before the "Chinese version of Sora" is born?

In these new papers, we have learned about a series of active explorations in video generation technology by domestic technology companies such as ByteDance.

But compared with Sora, both ByteDance and the star companies in AI video generation show a clearly visible gap. Sora's advantages rest on its commitment to scaling laws and on breakthrough technical innovation: it unifies video data through patches and relies on architectures such as the Diffusion Transformer and on the semantic understanding capabilities of DALL·E 3, which puts it truly "far ahead".

From the explosion of text-to-image generation in 2022 to the emergence of Sora in 2024, the pace of technological iteration in artificial intelligence has exceeded everyone's imagination. In 2024, there will likely be more "hot products" in this field.

ByteDance is clearly also stepping up its investment in research and development. Recently, Jiang Lu, the lead of Google's VideoPoet project, and Chunyuan Li, a member of the open-source multimodal large model LLaVA team and former principal researcher at Microsoft Research, were both reported to have joined ByteDance's intelligent creation team. The team is also recruiting vigorously, and a number of positions related to large model algorithms have been posted on the official website.

It is not only ByteDance: established giants such as BAT have also released many eye-catching video generation research results, and a number of large-model startups are even more aggressive. What new breakthroughs will text-to-video technology bring? We'll see.


Statement
This article is reproduced from 51CTO.COM.