
More beautiful image generation, minute-level video output: the leapfrog journey of a domestically self-developed DiT architecture

王林 | Original
2024-07-12 18:49:14
In the blink of an eye, 2024 is already half over. An increasingly obvious trend has emerged in AI, and in AIGC in particular: the text-to-image track has entered a stage of steady advancement and accelerating commercialization, yet static images alone can no longer satisfy people's appetite for generative AI. The demand for dynamic video creation has never been higher.
The text-to-video track has therefore stayed hot. Ever since OpenAI released Sora at the beginning of the year, video generation models built on the Diffusion Transformer (DiT) architecture have entered a period of explosive growth, and video generation vendors at home and abroad are quietly waging a technology race on this track.

In China, a generative AI startup founded in March last year that focuses on building visual multi-modal foundation models and applications keeps appearing in the public eye: HiDream.ai (Zhixiang Future). Its self-developed visual multi-modal foundation model generates and converts between modalities, supporting text-to-image, text-to-video, image-to-video, and text-to-3D, and the company has launched "Pixeling", a one-stop AI image and video generation platform open to the public.

Try it at: www.hidreamai.com

Since the Zhixiang large model launched in August 2023, it has gone through several rounds of iteration and polish, optimizing the foundation model and deeply expanding AIGC capabilities such as text-to-image and text-to-video. In video generation especially, the supported duration has grown from the initial 4 seconds to 15 seconds, with visibly better image quality.

Now the Zhixiang large model has been upgraded again. Built on a domestically self-developed DiT architecture, it delivers more powerful, more stable, and more user-friendly image and video generation, including more aesthetic and artistic image generation, text embedding in images, and minute-level video generation.
All of these new image and video generation capabilities rest on Zhixiang Future's technological accumulation and continuous innovation in multi-modal visual generation.

Generation quality keeps improving
A more powerful foundation model is the engine

From the beginning, the Zhixiang large model has targeted joint modeling of text, image, video, and 3D. Its interactive generation technology enables precise, controllable multi-modal content generation and builds strong prototype capabilities, giving users a better creative experience on its text-to-image and text-to-video AIGC platform.
This overall upgrade to Zhixiang Large Model 2.0 brings qualitative changes in the underlying architecture, training data, and training strategy compared with version 1.0, yielding another leap in multi-modal capabilities across text, images, video, and 3D, and a tangible improvement in the interactive experience.


It is fair to say the upgraded Zhixiang model has been enhanced across the board in image and video generation, injecting stronger momentum into its one-stop multi-modal AIGC creation platform.


Text-to-image capabilities evolve again
With a higher bar for quality

For a one-stop AIGC platform, text-to-image is both the prerequisite for text-to-video and an important technical moat. Zhixiang Future has accordingly placed high expectations on text-to-image, pushing at its own pace toward more diverse functionality, more realistic visuals, and a more user-friendly experience.

After a series of targeted adjustments and optimizations, the text-to-image capability of Zhixiang Large Model 2.0 has improved markedly over previous versions, as several externally visible results make easy to see.

First, the images generated by Zhixiang Large Model 2.0 are more beautiful and more artistic. Today's text-to-image models do very well on the more objective aspects such as semantic understanding, image structure, and picture detail, but can fall short on more subjective qualities such as texture, beauty, and artistry. The pursuit of beauty therefore became the focus of this text-to-image upgrade. What is the effect? Consider the following two examples.

The prompt for the first example is "a little girl wearing a huge hat; on the hat are many castles, flowers, trees, and birds; colorful, close-up, detailed, illustration style".

(Generated image)

The prompt for the second example is "close-up photo of green plant leaves, dark theme, water-drop details, mobile wallpaper".

(Generated image)

Both generated images are eye-catching in composition, tone, and richness of detail, which greatly enhances the overall beauty of the picture.

Beyond making generated images look better, their relevance to the prompt is also stronger, an aspect that draws close attention once image generation matures to a certain stage.

To improve prompt relevance, the Zhixiang large model focuses on strengthening its understanding of complex logic: different spatial layouts, positional relationships, object categories, object counts, and so on, all important factors for higher relevance. After targeted training, the model handles image generation involving multiple objects, multi-position layouts, and complex spatial logic with ease, better meeting users' real-world needs.

Consider the following three examples, each requiring a deep understanding of different objects and their spatial relationships. The results show that the text-to-image model now comfortably handles both long and short prompts containing complex logic.

The prompt for the first example is "There are three baskets filled with fruit on the kitchen table. The middle basket is filled with green apples. The left basket is filled with strawberries. The right basket is filled with blueberries. Behind the baskets is a white dog. The background is a turquoise wall with the colorful text 'Pixeling v2'".

(Generated image)

The prompt for the second example is "a cat on the right, a dog on the left, and a green cube placed on a blue ball in the middle".

(Generated image)

The prompt for the third example is "On the moon, an astronaut wearing a pink tutu and holding a blue umbrella is riding a cow. To the right of the cow is a penguin wearing a top hat. The text 'HiDream.Al' is written at the bottom".

(Generated image)

Meanwhile, the generation of text embedded in images is more accurate and efficient, a capability used frequently in posters and marketing copy.

Technically, generating embedded text requires the model to deeply understand both the visual description and the exact text content in the input prompt, so that the text is depicted accurately while the image stays beautiful and artistic overall.

In an exclusive interview with this site, Dr. Yao Ting, CTO of Zhixiang Future, noted that previous versions often could not generate such images at all, and even when they could, the rendered characters and their accuracy were lacking. These problems are now well solved: the Zhixiang large model supports embedding long text, up to dozens of characters, in images.

The three examples below, from left to right, show strong text embedding, especially the rightmost image, where more than twenty words and punctuation marks are embedded accurately.

(Generated images)

It is fair to say the Zhixiang model's text-to-image capability has reached an industry-leading level, laying a key foundation for video generation.

Video generation has reached the minute level

If the upgraded Zhixiang Large Model 2.0 achieved steady progress in text-to-image, it has made a genuine leap in text-to-video.

Last December, the Zhixiang large model's text-to-video broke the 4-second barrier, supporting generation of more than 15 seconds. Half a year later, its text-to-video has improved markedly in duration, naturalness of the picture, and consistency of content and characters, and this is thanks to its mature self-developed DiT architecture.

Compared with U-Net, the DiT architecture is more flexible and better at raising the quality of image and video generation; Sora's emergence verified this quite intuitively. Diffusion models on this kind of architecture show a natural aptitude for generating high-quality images and videos, and hold relative advantages in the customizability and controllability of generated content. The DiT architecture adopted by Zhixiang Large Model 2.0 has some unique features of its own.
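For readers unfamiliar with DiT, the sketch below shows a generic DiT-style Transformer block in PyTorch, with adaptive layer norm ("adaLN") conditioning on the diffusion timestep as described in the public DiT paper. It is illustrative only; the module sizes and structure are our assumptions, not Zhixiang's actual implementation.

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    # A single DiT-style block: self-attention and MLP sub-layers, each
    # modulated by shift/scale/gate parameters regressed from the condition.
    def __init__(self, dim: int = 1024, num_heads: int = 16, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))
        # The condition embedding (e.g. diffusion timestep) produces
        # 6 * dim values: shift, scale, gate for each of the two sub-layers.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patch tokens; cond: (batch, dim) condition.
        s1, b1, g1, s2, b2, g2 = self.ada_ln(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)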

We know the DiT architecture is implemented on top of the Transformer. Zhixiang Model 2.0 uses fully self-developed components throughout: the Transformer network structure, the composition of the training data, and the training strategy, with the network training strategy in particular carefully thought out.

First, the Transformer network adopts an efficient spatio-temporal joint attention mechanism, which both fits the characteristics of video in the spatial and temporal domains and overcomes the practical training-speed bottleneck of a traditional attention mechanism.
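To make the idea concrete, one common way to build such a mechanism is to factorize attention into a spatial pass within each frame and a temporal pass across frames, which is far cheaper than full joint attention over all video tokens. The PyTorch sketch below is a generic illustration under that assumption, not Zhixiang's disclosed design.

import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    # Attend within each frame (spatial), then across frames at each spatial
    # position (temporal). Versus full joint attention over all T*S tokens,
    # per-layer cost drops from O((T*S)^2) to O(T*S^2 + S*T^2).
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape                      # (batch, frames, patches, dim)
        h = x.reshape(b * t, s, d)                # each frame as one sequence
        h = h + self.spatial(h, h, h, need_weights=False)[0]
        h = h.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        h = h + self.temporal(h, h, h, need_weights=False)[0]
        return h.reshape(b, s, t, d).transpose(1, 2)   # back to (b, t, s, d)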

Second, generating long shots places higher demands on the sourcing and screening of training data. The Zhixiang large model therefore trains on video clips up to several minutes, even ten minutes, long, making direct output of minute-level videos possible. Describing minute-level video content is itself difficult, so Zhixiang Future independently developed a captioning model that produces detailed, accurate video descriptions.

Finally, on training strategy: because long-shot video data is limited, Zhixiang Large Model 2.0 jointly trains on video and image data with clips of different lengths, dynamically varying the sampling rate for each length to complete long-shot training. Reinforcement learning on user feedback data is also applied during training to further optimize model performance.
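As a toy illustration of what a dynamic sampling rate might look like, the Python sketch below shifts sampling probability from images toward long clips as training progresses. The bucket names and schedule are hypothetical, chosen only to show the mechanism, and are not Zhixiang's actual recipe.

import random

# Three hypothetical data buckets: single images, short clips, long clips.
BUCKETS = ["image", "short_clip", "long_clip"]

def bucket_weights(step: int, total_steps: int) -> list:
    # Shift sampling probability from plentiful images toward scarce long
    # clips as training progresses; the weights always sum to 1.0.
    p = step / total_steps
    return [0.5 * (1 - p), 0.3, 0.2 + 0.5 * p]

def sample_bucket(step: int, total_steps: int) -> str:
    return random.choices(BUCKETS, weights=bucket_weights(step, total_steps))[0]

print(bucket_weights(0, 10000))      # start of training: [0.5, 0.3, 0.2]
print(bucket_weights(10000, 10000))  # end of training:   [0.0, 0.3, 0.7]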

The more powerful self-developed DiT architecture thus provides the technical backing for further gains in text-to-video quality.

The video duration supported by Zhixiang Large Model 2.0 has now grown from roughly 15 seconds to the minute level, a high bar in the industry.

Beyond minute-level duration, variable duration and size are another highlight of this text-to-video upgrade.

Current video generation models usually have a fixed output duration that users cannot choose. Zhixiang Future will open duration up to users, letting them specify it or letting the model judge dynamically from the prompt: a more complex prompt yields a longer video, a simpler one a shorter video, adaptively meeting the user's creative needs. The size of the generated video can likewise be customized, which is very user-friendly.

In addition, the overall look and feel of the picture is better: the actions and movements of objects in generated videos are more natural and fluid, details are rendered more faithfully, and 4K ultra-high-definition output is supported.
In just half a year, the upgraded text-to-video capability has been, compared with previous versions, practically reborn. Still, in Dr. Yao Ting's view, most video generation today, whether from Zhixiang Future or its peers, remains at the single-shot stage. Measured against the L1-L5 levels of autonomous driving, text-to-video sits at roughly L2. With this upgrade of its foundation model, Zhixiang will pursue higher-quality multi-shot video generation and has taken a key step toward exploring the L3 stage.


Zhixiang Future says the upgraded text-to-video features will go live in mid-July. They are well worth the wait!

Final thoughts

In less than a year and a half since its founding, whether in the continuous iteration of its foundation model or in the improved hands-on experience of text-to-image and text-to-video, Zhixiang Future has moved both steadily and quickly in visual multi-modal generation, winning a large base of consumer and enterprise users.

We understand that Zhixiang Future's consumer side sees more than a million visits per month, and the total number of AI images and videos generated has passed ten million. A low barrier to entry and strong usability characterize the Zhixiang model, on which the company has built the first AIGC application platform suited to the general public.

On the enterprise side, Zhixiang Future has signed strategic cooperation agreements with companies including China Mobile, Lenovo Group, iFLYTEK, Shanghai Film Group, Ciwen Group, Digital China, CCTV.com, Yinxiang Biji, Tiangong Yicai, and Hangzhou Lingban, deepening application scenarios and extending model capabilities to more industries, including telecom operators, smart terminals, film and television production, e-commerce, culture and tourism promotion, and brand marketing, ultimately unlocking the model's potential and creating value through commercialization.

Today, the Zhixiang model serves about 100 leading enterprise customers and has provided AIGC services to more than 30,000 small-business customers.


Before the release of Zhixiang Large Model 2.0, Zhixiang Future had already partnered with China Mobile's Migu Group to launch the mass-market AIGC application "AI 一語成片" ("one sentence to video"). It not only lets ordinary users create AI video ringtones with zero prior experience, but also helps enterprise customers produce rich brand and marketing video content and build video-ringtone brands of their own, showing the great potential of video generation fused with industry scenarios.

The AI ecosystem is another key battleground for large model vendors. Here Zhixiang Future takes an open stance, joining major customers such as Lenovo Group, iFLYTEK, and Digital China, along with small development teams and independent developers, to build a broad AI ecosystem that includes video generation and covers users' increasingly diverse needs.

2024 is seen as the inaugural year for large model applications landing in production, a critical juncture for every vendor. Zhixiang Future is digging deep around stronger foundation model capabilities.

On one hand, it is strengthening multi-modal understanding and generation across image, video, and 3D within a unified framework, for example by continuing to optimize the underlying architecture, algorithms, and data of video generation for bigger breakthroughs in duration and quality, aiming to become an indispensable part of the path toward general artificial intelligence. On the other hand, it is pushing on user experience, innovative applications, and the industry ecosystem to expand its influence.

To seize the high ground of the video generation track, Zhixiang Future has made ample preparation.

