
The most powerful domestic Sora at present! Tsinghua team breaks through 16-second long video, understands multi-lens language, and can simulate physical laws

王林
2024-04-28

Ask for the box to be filled with diamonds, and the box fills with diamonds, more dazzling than a real shot. What film crew wouldn't want skills like that?


This is the "magic" presented by Adobe's video editing software Premiere Pro some time ago. This software introduces AI video tools such as Sora, Runway, and Pika to achieve the ability to add objects, remove objects, and generate video clips in videos. This is regarded as another technological innovation in the video field.

From Sora sweeping the world in February to Adobe's new magic, things overseas are in full swing. By contrast, China has remained in a "wait and see" state in the video field, especially in long-video generation. Over the past two months we have heard plenty of claims of chasing Sora, but have yet to see significant domestic progress. The short film just released by Shengshu Technology, however, delivers plenty of surprises.

This is the official demo of "Vidu", the latest video model released by Shengshu Technology and Tsinghua University. The videos it generates are no longer few-second "GIFs" but run over ten seconds (up to roughly 16 seconds). More surprising still, "Vidu"'s visual quality comes very close to Sora's: it performs well on multi-shot camera language, spatio-temporal consistency, and adherence to physical laws, and it can also invent surreal images that do not exist in the real world, something current video generation models struggle to achieve. That Shengshu Technology reached this level in just two months is genuinely striking.

The first video model in China that comprehensively benchmarks Sora

Since Sora's release, the race for a "domestic Sora" has been on. But in fixating on the "long" attribute, the industry overlooks that Sora's real advance is comprehensive quality: consistency, realism, and aesthetics sustained across long sequences.

Judged by comprehensive quality, "Vidu" is the first and so far only video model, not just in China but globally, to fully benchmark Sora at the level of output quality, and the first breakthrough video model since Sora. Its demos show several clear strengths:

Inject "lens language" into the video

There is a crucial concept in video production: camera language. It is the main means by which the picture conveys the storyline, reveals characters' psychology, builds atmosphere, and guides the audience's emotions. Choices of shot, angle, movement, and combination greatly shape the narrative and the viewing experience.

In existing AI-generated videos the camera language is noticeably monotonous: camera movement is limited to slight pushes, pulls, and pans. The main reason is that most current systems generate a single frame first and then predict the preceding and following frames, a mainstream technical path that struggles with long-range coherent prediction and yields only small dynamics.

For example, the science fiction movie trailer "Trailer: Genesis" ("Genesis"), generated in July last year.

"Vidu" breaks through these limitations. In a clip with the theme of "Seaside House", we can see that "Vidu" generates a clip at one time involving multiple shots. The picture includes both a close-up of the house and a distant view of the sea. The overall view There is a sense of narrative from inside the house to the corridor to enjoying the scenery by the railing. It can be seen that "Vidu" can switch between different shots such as long shot, close shot, medium shot, and close-up around a unified subject in a frame.


Prompt: In a quaint seaside cottage, sunlight bathes the room; the camera slowly transitions to a balcony overlooking the tranquil sea, and finally freezes on the floating sea, sailboats, and reflected clouds. (Complete video clip released on the official page of Shengshu's PixWeaver product)

Multiple clips in the short film also show that "Vidu" can directly generate transitions, focus pulls, long takes, and other effects, producing film-grade shots that inject camera language into the video and strengthen its overall sense of narrative.


Maintaining consistency in time and space

Coherence and fluency of the picture are crucial, and behind them lies the spatio-temporal consistency of characters and scenes: a character's movement through space must stay consistent, and a scene must not mutate without a transition. This is hard for AI. As time stretches on, AI-generated videos develop narrative breaks, visual incoherence, and logical errors, which seriously undermine realism and watchability.

"Vidu" overcomes these problems to a certain extent. From the video of "Cat with a Pearl Earring" generated by it, we can see that as the camera moves, the cat as the subject of the picture always maintains the same expression and clothing in the 3D space, and the video as a whole is very coherent and smooth. , maintaining good time and space consistency.


Prompt: This is a portrait of an orange cat with blue eyes, slowly rotating, inspired by Vermeer's "Girl with a Pearl Earring", wearing pearl earrings, with brown hair like a Dutch cap, black background, studio lighting. (Complete video clip released on the official page of Shengshu's PixWeaver product)

Simulating the real physical world

One of Sora's most striking capabilities is simulating the motion of the real physical world, such as the movement and interaction of objects. A classic Sora demo, "an old SUV driving on a hillside", convincingly renders the dust kicked up by the tires, the light and shadow in the woods, and the shifting shadows as the car drives. Given the same prompt, "Vidu"'s output is highly similar to Sora's, with details such as dust and lighting very close to human experience of the real physical world.

Prompt: The camera follows an old white SUV with a black roof rack as it speeds down a steep dirt road surrounded by pine trees on a steep hillside. The tires kick up dust, and the sun shines on the SUV, casting a warm glow over the scene. The dirt road winds gently into the distance, with no other cars or vehicles in sight. Redwood trees line both sides of the road, with patches of green scattered here and there. Viewed from behind, the car follows the curves with ease, as if driving over rough terrain. Steep hills and mountains surround the dirt road, under a clear blue sky with wisps of cloud. (Complete video clip released on the official page of Shengshu's PixWeaver product)

Compared with Sora's output for the same prompt, "Vidu" admittedly failed to render the "black roof rack" detail, but this flaw does not overshadow its merits: the overall effect comes remarkably close to the real world.

Rich imagination

Compared with live-action shooting, AI video generation has one big advantage: it can produce images that do not exist in the real world. Such shots used to demand enormous manpower and resources for set construction or special effects; AI can now generate them automatically in a short time.

For example, in the scene below, a "sailboat" and "waves" would rarely share a studio, yet the interaction between the waves and the sailboat looks entirely natural.

(Complete video clip released on the official page of Shengshu's PixWeaver product)


The "fish tank girl" clip in the short film is likewise fantastical yet carries a certain plausibility. This ability to fabricate scenes that do not exist in the real world is a great help for surrealist content: it can inspire creators, deliver novel visual experiences, and broaden the boundaries of artistic expression, enabling richer and more diverse forms of content.

Understanding Chinese Elements

Beyond the four characteristics above, the short film holds another surprise: "Vidu" can generate imagery with distinctly Chinese elements, such as pandas, dragons, and palace scenes.

Prompt: By the tranquil lake, a panda plays the guitar with gusto, bringing the whole environment alive. Reflected on calm waters under clear skies, the scene is captured in vivid panoramic shots that blend realism with the giant panda's lively spirit, creating a harmonious blend of energy and calm. (Complete video clip released on the official page of Shengshu's PixWeaver product)

How was such a rapid breakthrough achieved in two months?

The R&D team behind "Vidu" Shengshu Technology is a domestic entrepreneurial team in the direction of multi-modal large models. The core members are from the Artificial Intelligence Research Institute of Tsinghua University. The team focuses on images, 3D, video, etc. The field of multimodal generation.

In January this year, Shengshu Technology launched a short-video generation feature on PixWeaver, its visual creative design platform, supporting four-second, highly aesthetic clips. After Sora's debut in February, Shengshu Technology reportedly set up a formal internal research team to accelerate R&D on its existing video direction: it achieved 8-second generation internally in March, then broke through to 16-second generation in April, improving generation quality and duration across the board.

As is well known, Sora has disclosed few technical details. The core of this team's ability to break through in such a short time is its deep technical accumulation and a series of original, zero-to-one results, above all at the level of the core technical architecture.

At the bottom, "Vidu" is built on the fully self-developed U-ViT architecture, which the team proposed in September 2022, predating the DiT architecture adopted by Sora; it was the world's first architecture to fuse Diffusion and Transformer.


Two months before the DiT paper appeared, Zhu Jun's team at Tsinghua University submitted the paper "All are Worth Words: A ViT Backbone for Diffusion Models". It proposes U-ViT, a network architecture that replaces the CNN-based U-Net with a Transformer. This is the most important technical foundation of "Vidu".
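To make the idea concrete, here is a minimal sketch of a U-ViT-style backbone based only on the paper's publicly described design: the timestep, the condition, and the noisy image patches are all treated as tokens, and long skip connections link shallow and deep Transformer blocks. All names and hyperparameters below are illustrative, not taken from the paper or from "Vidu".

```python
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    """Everything-is-a-token diffusion backbone with long skip connections."""

    def __init__(self, dim=768, depth=12, n_heads=12):
        super().__init__()
        assert depth % 2 == 0
        make_block = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.in_blocks = nn.ModuleList(make_block() for _ in range(depth // 2))
        self.mid_block = make_block()
        self.out_blocks = nn.ModuleList(make_block() for _ in range(depth // 2))
        # Fuse each long skip by concatenation + linear, mirroring U-Net.
        self.skip_proj = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(depth // 2))
        self.time_embed = nn.Linear(1, dim)   # the timestep becomes one token
        self.cond_embed = nn.Linear(dim, dim) # condition (e.g. text) tokens

    def forward(self, x_tokens, t, cond_tokens):
        # Concatenate time token, condition tokens, and image-patch tokens.
        t_tok = self.time_embed(t.view(-1, 1, 1).float())
        h = torch.cat([t_tok, self.cond_embed(cond_tokens), x_tokens], dim=1)
        skips = []
        for blk in self.in_blocks:
            h = blk(h)
            skips.append(h)
        h = self.mid_block(h)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            h = proj(torch.cat([h, skips.pop()], dim=-1))  # long skip
            h = blk(h)
        return h  # noise prediction is read off the image-token positions
```

The long skips are what keep the network "U"-shaped despite having no convolutions: they shuttle low-level detail past the bottleneck, just as U-Net's skip connections do.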

On the technical route, "Vidu" adopts the same Diffusion-plus-Transformer fusion architecture as Sora. Rather than the multi-step approach of generating long videos by interpolating frames, "Vidu" follows the same route as Sora: generating high-quality video directly in a single pass. Viewed at the low level, it is a "one-step", fully end-to-end process built on a single model, with no intermediate frame interpolation or other multi-step processing; the conversion from text to video is direct and continuous.
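As a hedged illustration of the difference between the two routes, the sketch below contrasts multi-step keyframe interpolation with single-pass diffusion over a whole video latent. Names like `denoise_step` and the helper callables are hypothetical placeholders standing in for whole models, not Vidu's or Sora's actual APIs.

```python
import torch

# Route A (common today): generate keyframes, then interpolate between them.
# Each stage is a separate model and a separate source of drift.
def multi_step_route(prompt, keyframe_model, interpolator, n_keyframes=4):
    keyframes = [keyframe_model(prompt) for _ in range(n_keyframes)]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video += interpolator(a, b)          # hallucinated in-between frames
    return video

# Route B (Sora/"Vidu"-style): one diffusion process over the full
# spatio-temporal latent, so all frames are denoised jointly and stay coherent.
def single_pass_route(prompt_emb, model, steps=50,
                      shape=(1, 4, 16, 32, 32)):  # (B, C, T, H, W) latent
    z = torch.randn(shape)
    for t in reversed(range(steps)):
        z = model.denoise_step(z, t, prompt_emb)  # hypothetical denoiser call
    return z  # decode with a video VAE afterwards
```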

In addition, building on the U-ViT architecture, in March 2023 the team trained UniDiffuser, a multi-modal model with 1 billion parameters, on the open-source large-scale image-text dataset LAION-5B and open-sourced it (see "Tsinghua Zhu Jun's team open-sources the first Transformer-based large multi-modal diffusion model, covering text-image co-generation and rewriting").

UniDiffuser specializes in image-text tasks and supports arbitrary generation and conversion between the image and text modalities. Its implementation carries an important value: it verified for the first time the scalability (scaling law) of the fusion architecture on large-scale training tasks, effectively running the U-ViT architecture through every stage of large-scale training. Notably, UniDiffuser preceded Stable Diffusion 3, an image model built on the same DiT-style architecture, by a full year.
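The publicly described core trick in UniDiffuser is to give each modality its own independent timestep, so one network covers marginal, conditional, and joint distributions as special cases. A minimal training-step sketch under that reading follows; the noise schedule and all names are illustrative, not the released code.

```python
import torch

def add_noise(x, eps, t, T=1000):
    # Simple linear-in-t schedule, for illustration only.
    alpha = (1 - t.float() / T).view(-1, *([1] * (x.dim() - 1)))
    return alpha.sqrt() * x + (1 - alpha).sqrt() * eps

def unidiffuser_training_step(net, x_img, x_txt):
    """One network, two modalities, two independent noise levels.

    At sampling time, fixing t_txt = 0 gives text-conditioned image
    generation; t_img = 0 gives captioning; equal t gives joint generation.
    """
    B = x_img.size(0)
    t_img = torch.randint(0, 1000, (B,))
    t_txt = torch.randint(0, 1000, (B,))
    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    z_img = add_noise(x_img, eps_img, t_img)  # each modality noised
    z_txt = add_noise(x_txt, eps_txt, t_txt)  # to its own level
    pred_img, pred_txt = net(z_img, z_txt, t_img, t_txt)
    return (((pred_img - eps_img) ** 2).mean()
            + ((pred_txt - eps_txt) ** 2).mean())
```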

The engineering experience accumulated on image-text tasks laid the groundwork for the video model. Video is essentially a stream of images, an extension of the image along the time axis, so results from image-text tasks can often be reused for video. Sora does exactly this: it applies DALL・E 3's re-annotation technique to generate detailed descriptions for its visual training data, letting the model follow users' text instructions more faithfully when generating videos. The same effect naturally carries over to "Vidu".
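The recaptioning recipe itself is simple to state. A hedged sketch, where `vlm.generate` is a hypothetical interface for any sufficiently strong vision-language captioner:

```python
# Hypothetical re-annotation pass: replace sparse alt-text labels with
# detailed synthetic captions so the generator learns fine-grained
# text-to-visual correspondences.
def recaption(clips, vlm):
    prompt = ("Describe this clip in detail: subjects, actions, "
              "camera movement, lighting, and setting.")
    return [(clip, vlm.generate(clip, prompt)) for clip in clips]
```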

According to earlier reports, "Vidu" also reuses much of Shengshu Technology's experience from image-text tasks, including training acceleration, parallel training, and low-memory training, which let the team run through the training pipeline quickly. They reportedly used video-data compression to reduce the sequence dimension of the input and adopted a self-developed distributed training framework; while preserving computational accuracy, communication efficiency doubled, memory overhead fell by 80%, and training speed rose by a cumulative factor of 40.
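Reducing the sequence dimension is the main lever behind numbers like these. As an illustration of the general technique (the 3D patch sizes below are made up, not Shengshu's disclosed settings), spatio-temporal patchification shrinks the token count the Transformer must attend over, and attention cost falls roughly quadratically with it:

```python
import torch

def patchify_video(latent, pt=4, ph=2, pw=2):
    """Fold a (B, C, T, H, W) video latent into Transformer tokens.

    Grouping pt*ph*pw latent elements per token cuts sequence length by
    that factor, which cuts self-attention cost roughly quadratically.
    """
    B, C, T, H, W = latent.shape
    x = latent.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)          # group patch dims together
    return x.reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)

latent = torch.randn(1, 4, 16, 32, 32)             # 16 latent frames
print(patchify_video(latent).shape)                # torch.Size([1, 1024, 64])
```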

From unifying image-text tasks to integrating video capability, "Vidu" can be regarded as a general visual model able to support more diverse and longer video content. Officials also revealed that "Vidu" is iterating rapidly, and that its flexible architecture will be made compatible with a wider range of multi-modal capabilities.

A capable team from Tsinghua University

Finally, a word about the team behind "Vidu": Shengshu Technology, a capable team with a Tsinghua background.

Shengshu Technology's core team comes from the Artificial Intelligence Research Institute of Tsinghua University. Chief scientist Zhu Jun is deputy director of the institute; CEO Tang Jiayu earned bachelor's and master's degrees from Tsinghua's Department of Computer Science and is a member of the THUNLP group; CTO Bao Fan is a doctoral student in the same department and a member of Professor Zhu Jun's research group, has long focused on diffusion-model research, and led the completion of both U-ViT and UniDiffuser.

The team has worked on generative AI and Bayesian machine learning for more than 20 years and did in-depth research in the early days of deep generative models. On diffusion models, it was the first in China to take up this direction, with results spanning the full stack: backbone networks, fast inference algorithms, and large-scale training.


The team has published nearly 30 multi-modal papers at top AI conferences such as ICML, NeurIPS, and ICLR. Among them, the training-free inference algorithms Analytic-DPM and DPM-Solver stand out: they won an ICLR Outstanding Paper Award, were adopted by leading foreign institutions such as OpenAI, Apple, and Stability.ai, and are used in star projects such as DALL・E 2 and Stable Diffusion.
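DPM-Solver is indeed now a stock sampler in open-source toolchains. A minimal usage sketch with Hugging Face's diffusers library (the checkpoint ID is just a common public model chosen for illustration):

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Swap in the DPM-Solver multistep scheduler: good samples in ~20 steps
# instead of the 50+ a plain DDPM/DDIM schedule typically needs.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("an old white SUV on a dirt road",
             num_inference_steps=20).images[0]
image.save("suv.png")
```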

Since its founding in 2023, the team has won backing from well-known institutions such as Ant Group, Qiming Venture Partners, BV Baidu Ventures, and ByteDance's Jinqiu Fund, completing financing worth hundreds of millions of yuan. Shengshu Technology is reportedly the highest-valued startup on China's multi-modal large-model track, and the launch of "Vidu" marks another act of innovation and leadership by Shengshu Technology in multi-modal native large models.

Related reading:

Exclusive interview with Tang Jiayu of Shengshu Technology: After raising hundreds of millions, building multi-modal large models with Transformers

Are domestic companies expected to make Sora? This large model team from Tsinghua University gives hope
