Generating videos is so easy: just give a prompt, and you can also try it online


王林

May 20, 2023 pm 05:16 PM


Typing in text and having AI generate a video used to exist only in people's imagination. With advances in technology, it has now become reality.

In recent years, generative artificial intelligence has attracted huge attention in the field of computer vision. With the advent of diffusion models, generating high-quality images from text prompts, i.e., text-to-image synthesis, has become very popular and successful.

Recent research has extended text-to-image diffusion models to text-to-video generation and editing by reusing them in the video domain. Although such methods have achieved promising results, most of them require extensive training on large amounts of labeled data, which may be too expensive for many users.

To make video generation cheaper, Tune-A-Video, proposed by Jay Zhangjie Wu et al. last year, introduced a mechanism for applying the Stable Diffusion (SD) model to the video domain. It needs to be fine-tuned on only a single video, which greatly reduces the training workload. Although this is much more efficient than previous methods, it still requires an optimization step. Furthermore, Tune-A-Video's generation capabilities are limited to text-guided video editing, and synthesizing videos from scratch remains beyond its reach.

In the paper covered here, researchers from Picsart AI Research (PAIR), the University of Texas at Austin and other institutions propose a new zero-shot, training-free method for text-to-video synthesis. It is a step forward on the problem of generating videos from text prompts without any optimization or fine-tuning.


Paper address: https://arxiv.org/pdf/2303.13439.pdf
Project address: https://github.com/Picsart-AI-Research/Text2Video-Zero
Trial address: https://huggingface.co/spaces/PAIR/Text2Video-Zero

Let's look at some results. Example prompts include a panda surfing and a bear dancing in Times Square.


The method can also generate videos whose motion follows a given target, such as pose guidance.


In addition, generation can be guided by edge maps.


A key idea of the approach proposed in the paper is to modify a pre-trained text-to-image model (such as Stable Diffusion) so that it produces temporally consistent output. By building on an already trained text-to-image model, the approach leverages its excellent image generation quality and extends its applicability to the video domain without additional training.

To enhance temporal consistency, the paper proposes two modifications: (1) enrich the latent codes of the generated frames with motion information to keep the global scene and background temporally consistent; (2) use a cross-frame attention mechanism to preserve the context, appearance, and identity of foreground objects throughout the sequence. Experiments show that these simple modifications can produce high-quality, temporally consistent videos (shown in Figure 1).
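As a rough illustration of the second modification, the sketch below implements one possible form of cross-frame attention in NumPy. The article does not say which frame the other frames attend to; the choice of the first frame as the anchor, the function names, and the tensor shapes are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def cross_frame_attention(queries, keys, values, anchor=0):
    """Attend every frame to the keys/values of a single anchor frame.

    queries, keys, values: arrays of shape (frames, tokens, channels).
    anchor: index of the reference frame (an assumption of this sketch).
    """
    channels = queries.shape[-1]
    anchor_keys = keys[anchor]      # (tokens, channels)
    anchor_values = values[anchor]  # (tokens, channels)
    # (frames, tokens, channels) @ (channels, tokens) -> (frames, tokens, tokens)
    scores = queries @ anchor_keys.T / np.sqrt(channels)
    return softmax(scores) @ anchor_values  # (frames, tokens, channels)


rng = np.random.default_rng(0)
frames, tokens, channels = 4, 16, 8
q = rng.standard_normal((frames, tokens, channels))
k = rng.standard_normal((frames, tokens, channels))
v = rng.standard_normal((frames, tokens, channels))
out = cross_frame_attention(q, k, v)
print(out.shape)
```

Because every frame reads its keys and values from the same anchor frame, all frames draw appearance information from one shared source, which is the intuition behind preserving the identity of foreground objects.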


Although competing methods are trained on large-scale video data, this method achieves similar and sometimes better performance (see Figures 8 and 9).


The method is not limited to text-to-video synthesis; it also applies to conditional (see Figures 5 and 6) and specialized video generation (see Figure 7), as well as instruction-guided video editing, which can be called Video Instruct-Pix2Pix because it is driven by Instruct-Pix2Pix (see Figure 9).


The paper uses the text-to-image synthesis capability of Stable Diffusion (SD) to handle the text-to-video task in the zero-shot setting. Because the goal is video generation rather than image generation, SD must operate on sequences of latent codes. The naive approach is to independently sample m latent codes from a standard Gaussian distribution, i.e. $x_T^1, \dots, x_T^m \sim \mathcal{N}(0, I)$, apply DDIM sampling to obtain the corresponding tensors $x_0^k$, where $k = 1, \dots, m$, and then decode them to get the generated video sequence. However, as shown in the first row of Figure 10, this results in completely random image generation: the frames share only the semantics described by the text prompt, without consistency in object appearance or motion.
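To make the problem with the naive approach concrete, the following sketch draws m independent latent codes from a standard Gaussian and measures how similar neighbouring frames' latents are. The latent shape (4, 64, 64) is an assumption corresponding to SD's latent for a 512x512 image, and the similarity check is only an illustration; no DDIM sampling or decoding is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                        # number of frames
latent_shape = (4, 64, 64)   # assumed SD latent size for 512x512 images

# One independent draw from N(0, I) per frame.
latents = rng.standard_normal((m, *latent_shape))


def cosine(a, b):
    """Cosine similarity between two flattened arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Similarity between the latent codes of consecutive frames.
neighbour_similarity = [cosine(latents[i], latents[i + 1]) for i in range(m - 1)]
print(max(abs(s) for s in neighbour_similarity))
```

The similarities are all close to zero: consecutive frames start from unrelated noise, so nothing ties their layout or motion together apart from the shared text prompt.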

To solve this problem, the paper proposes two measures: (i) introduce motion dynamics between the latent codes $x_T^1, \dots, x_T^m$ to maintain the temporal consistency of the global scene; (ii) use a cross-frame attention mechanism to preserve the appearance and identity of foreground objects. Each component of the method is described in detail below, and an overview of the method can be found in Figure 2.
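One simple way to picture measure (i) is to start all frames from a single shared latent code and shift it a little more for each frame. The article does not describe how the motion is applied, so the global translation below, the wrap-around behaviour of `np.roll`, and the `step` parameter are stand-ins chosen for this sketch rather than the paper's actual procedure.

```python
import numpy as np


def translated_latents(first_latent, num_frames, step=(1, 1)):
    """Build per-frame latents by shifting one shared latent.

    first_latent: array of shape (channels, height, width).
    step: hypothetical per-frame shift (rows, cols) in latent pixels.
    """
    dy, dx = step
    frames = [
        # Frame k is the shared latent shifted by k * step (with wrap-around).
        np.roll(first_latent, shift=(k * dy, k * dx), axis=(1, 2))
        for k in range(num_frames)
    ]
    return np.stack(frames)


rng = np.random.default_rng(0)
x_first = rng.standard_normal((4, 64, 64))
latents = translated_latents(x_first, num_frames=8)
print(latents.shape)
```

Unlike independently sampled noise, these latents are the same content seen under a steadily growing offset, so the frames share a common global structure.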


Note that to simplify notation, the paper denotes the entire sequence of latent codes as $x_T^{1:m} = [x_T^1, \dots, x_T^m]$.

Experiment

Qualitative results

All applications of Text2Video-Zero show that it successfully generates videos in which the global scene and background are temporally consistent, and the context, appearance, and identity of the foreground object are maintained throughout the sequence.

In the text-to-video case, the method produces high-quality videos that are well aligned with the text prompts (see Figure 3). For example, the panda is depicted walking naturally on the street. Likewise, with additional edge or pose guidance (see Figures 5, 6, and 7), it generates high-quality videos that match both the prompt and the guidance, showing good temporal consistency and identity preservation.


In the case of Video Instruct-Pix2Pix (see Figure 1), the generated videos show high fidelity to the input video while strictly following the instructions.

Comparison with Baseline

The paper compares its method with two publicly available baselines: CogVideo and Tune-A-Video. Since CogVideo is a text-to-video method, it is compared in the purely text-guided video synthesis scenario; Tune-A-Video is compared against Video Instruct-Pix2Pix.

For the quantitative comparison, the paper uses the CLIP score, which measures how well a video aligns with its text prompt. The authors randomly took 25 videos generated by CogVideo and synthesized corresponding videos with the same prompts using their own method. The CLIP scores of this method and CogVideo are 31.19 and 29.63 respectively. The method is therefore slightly better than CogVideo (by 1.56 points), although the latter has 9.4 billion parameters and requires large-scale training on videos.
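The article does not give the exact formula behind the CLIP score. A common formulation, assumed here, is the cosine similarity between each frame's image embedding and the prompt's text embedding, averaged over frames and multiplied by 100. The sketch below uses toy vectors in place of real CLIP embeddings; the function name and the `scale` parameter are hypothetical.

```python
import numpy as np


def clip_score(frame_embeddings, text_embedding, scale=100.0):
    """Mean scaled cosine similarity between frame and text embeddings.

    frame_embeddings: (frames, dim) image embeddings, one per video frame.
    text_embedding: (dim,) embedding of the prompt.
    scale: assumed scaling factor (not stated in the article).
    """
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    return float(scale * np.mean(frames @ text))


# Toy embeddings, not real CLIP outputs.
text = np.array([1.0, 0.0])
frames = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
score = clip_score(frames, text)
print(round(score, 2))
```

In practice the embeddings would come from a CLIP image encoder and text encoder; a higher score means the frames lie closer to the prompt in the shared embedding space.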

Figure 8 shows several results of the method proposed in this paper and provides a qualitative comparison with CogVideo. Both methods show good temporal consistency throughout the sequence, preserving the identity of the object as well as its context. Our method shows better text-video alignment capabilities. For example, our method correctly generates a video of a person riding a bicycle in the sun in Figure 8 (b), while CogVideo sets the background to moonlight. Also in Figure 8 (a), our method correctly shows a person running in the snow, while the snow and the running person are not clearly visible in the video generated by CogVideo.

Qualitative results for Video Instruct-Pix2Pix, together with a visual comparison against per-frame Instruct-Pix2Pix and Tune-A-Video, are shown in Figure 9. While Instruct-Pix2Pix shows good editing performance on each frame, it lacks temporal consistency. This is especially noticeable in the video depicting skiers, where the snow and sky are drawn in different styles and colors from frame to frame. Video Instruct-Pix2Pix resolves these issues, producing temporally consistent edits throughout the sequence.

Although Tune-A-Video produces temporally consistent videos, compared with the paper's method it follows the instructions less closely, has difficulty making local edits, and loses details of the input sequence. This becomes apparent in the edit of the dancer video shown on the left of Figure 9. Compared to Tune-A-Video, the paper's method paints the entire outfit brighter while better preserving the background: the wall behind the dancer remains almost unchanged, whereas Tune-A-Video paints a heavily deformed wall. In addition, the method is more faithful to the input details. For example, compared to Tune-A-Video, Video Instruct-Pix2Pix draws the dancer in the provided pose (Figure 9, left) and shows all the skiers that appear in the input video (see the last frame on the right side of Figure 9). All of the above weaknesses of Tune-A-Video can also be observed in Figures 23 and 24.



Statement

This article is reproduced from 51CTO.COM. In case of infringement, please contact admin@php.cn for removal.
