Zhipu AI has open-sourced the large video generation model it developed in-house.
The domestic video generation field is becoming increasingly active. Zhipu AI has just announced the open-sourcing of CogVideoX, a video generation model built on the same foundation as "Qingying". The repository gathered 4k stars within just a few hours.
- Code repository: https://github.com/THUDM/CogVideo
- Model download: https://huggingface.co/THUDM/CogVideoX-2b
- Technical report: https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf
On July 26, Zhipu AI officially released the video generation product "Qingying", which has been widely praised. Given a good idea (a few words to a few hundred words) and a little patience (30 seconds), "Qingying" can generate a high-precision video at 1440x960 resolution. Zhipu has announced that Qingying is now live in the Qingyan App, where all users can try it out in full. Anyone who wants to try it can go to "Zhipu Qingyan" to experience Qingying's video generation capabilities. "Qingying" has been hailed as the first Sora-like model available to everyone in China. Six days after its release, the number of videos generated with "Qingying" exceeded one million.
- PC access link: https://chatglm.cn/
- Mobile access link: https://chatglm.cn/download?fr=web_home
Why is Zhipu AI's open-source model so popular? Although video generation technology is gradually maturing, there is still no open-source video generation model that meets the requirements of commercial-grade applications. The familiar Sora, Gen-3, and others are all closed source. For researchers, open-sourcing CogVideoX is akin to OpenAI open-sourcing the model behind Sora, which makes it highly significant. The CogVideoX series includes models of several sizes. The currently open-sourced CogVideoX-2B requires only 18GB of VRAM for inference at FP16 precision and only 40GB of VRAM for fine-tuning, which means a single 4090 graphics card can run inference and a single A6000 graphics card can complete fine-tuning. CogVideoX-2B accepts prompts of up to 226 tokens and generates 6-second videos at 8 frames per second with a resolution of 720x480. Zhipu AI has left ample room for improving video quality and looks forward to developers' open-source contributions on prompt optimization, video length, frame rate, resolution, scene fine-tuning, and other features built around video. Models with stronger performance and larger parameter counts are on the way, so stay tuned.
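For reference, here is a minimal inference sketch assuming the released checkpoint is loaded through the Hugging Face diffusers `CogVideoXPipeline`; the repository also ships its own inference scripts, and exact parameter names and defaults may differ by version:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the open-sourced 2B checkpoint in FP16 to keep VRAM usage low.
# Memory optimizations such as pipe.enable_model_cpu_offload() can reduce it further.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")

prompt = "A wooden toy boat sails across a blue carpet that mimics ocean waves."

# 6 seconds at 8 fps corresponds to 49 frames at 720x480.
video_frames = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

export_to_video(video_frames, "output.mp4", fps=8)
```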
Video data contains both spatial and temporal information, so its data volume and computational burden far exceed those of image data. To address this challenge, Zhipu proposed a video compression method based on a 3D variational autoencoder (3D VAE). The 3D VAE compresses the spatial and temporal dimensions of video simultaneously through three-dimensional convolutions, achieving a higher compression rate and better reconstruction quality.
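To make the joint spatio-temporal compression concrete, below is a minimal PyTorch sketch of 3D convolutional downsampling in the spirit of the description above. The layer counts, channel sizes, and downsampling factors are illustrative assumptions, not the actual CogVideoX configuration:

```python
import torch
import torch.nn as nn

class Tiny3DEncoder(nn.Module):
    """Toy 3D-conv encoder: compresses time, height, and width jointly, as a 3D VAE encoder would."""
    def __init__(self, in_ch=3, latent_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            # Each stride-2 Conv3d halves the temporal and both spatial dimensions.
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            # Project to the latent channels consumed by the downstream model.
            nn.Conv3d(128, latent_ch, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, video):  # video: (batch, channels, frames, height, width)
        return self.net(video)

x = torch.randn(1, 3, 16, 256, 256)   # a 16-frame RGB clip
z = Tiny3DEncoder()(x)
print(z.shape)                         # torch.Size([1, 16, 4, 64, 64]): 4x temporal, 4x spatial compression
```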
The model structure includes an encoder, a decoder, and a latent space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporally causal convolution preserves the causal order of frames and reduces communication overhead. Zhipu AI uses context parallelism to handle large-scale video processing. In experiments, Zhipu AI found that encoding at large resolutions generalizes easily, while increasing the number of frames is more challenging. Therefore, Zhipu AI trains the model in two stages: first on lower frame rates and mini-batches, then fine-tuning on higher frame rates with context parallelism. The training loss combines an L2 loss, an LPIPS perceptual loss, and a GAN loss from a 3D discriminator.
Zhipu AI uses the VAE's encoder to compress the video into the latent space, then splits the latent space into patches and expands it into a long sequence of embeddings, z_vision. At the same time, Zhipu AI uses T5 to encode the text input into text embeddings z_text, and concatenates z_text and z_vision along the sequence dimension. The concatenated embeddings are fed into a stack of expert Transformer blocks for processing. Finally, the embeddings are split back apart to recover the original latent space shape and decoded with the VAE to reconstruct the video.
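The sequence-level fusion of text and video tokens can be sketched as follows. This is a simplified illustration with made-up dimensions and a plain Transformer encoder standing in for the expert Transformer blocks; it is not the actual CogVideoX architecture:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real CogVideoX dimensions differ.
d_model, text_len = 512, 226
t, h, w = 4, 8, 8  # latent frames and spatial patch grid

z_vision = torch.randn(1, t * h * w, d_model)  # patchified video latents flattened into a sequence
z_text = torch.randn(1, text_len, d_model)     # T5-encoded prompt embeddings

# Concatenate text and vision tokens along the sequence dimension.
tokens = torch.cat([z_text, z_vision], dim=1)

# Stand-in for the stack of expert Transformer blocks.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(block, num_layers=2)
out = backbone(tokens)

# Drop the text tokens and fold the vision tokens back to the latent shape for the VAE decoder.
vision_out = out[:, text_len:, :].reshape(1, t, h, w, d_model)
print(vision_out.shape)  # torch.Size([1, 4, 8, 8, 512])
```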
Training a video generation model requires screening high-quality video data so the model can learn real-world dynamics, but videos can be misleading due to human editing or filming issues. Zhipu AI developed negative tags to identify and exclude low-quality videos such as over-edited, choppy-motion, low-quality, lecture-style, text-dominated, and screen-noise clips. Using filters trained with video-llama, Zhipu AI annotated and filtered 20,000 video data points. At the same time, optical flow and aesthetic scores are computed, with thresholds adjusted dynamically to ensure the quality of the generated videos. Video data usually lacks text descriptions and must be converted into them for text-to-video training. Existing video caption datasets have short captions that cannot fully describe the video content. Zhipu AI proposes a pipeline that generates video captions from image captions and fine-tunes an end-to-end video captioning model to obtain denser captions. This approach generates short captions with the Panda70M model and dense image captions with the CogView3 model, then summarizes them with GPT-4 to produce the final video caption. Zhipu AI also fine-tuned a CogVLM2-Caption model based on CogVLM2-Video and Llama 3, trained on dense caption data, to accelerate video caption generation.
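A hypothetical filtering pass over candidate clips might look like the sketch below. The scoring functions, negative-tag check, and threshold values are placeholders for illustration; the actual CogVideoX filtering pipeline and thresholds are not reproduced here:

```python
def filter_clips(clips, flow_fn, aesthetic_fn, neg_tag_fn,
                 flow_min=0.5, aesthetic_min=4.0):
    """Keep clips that pass negative-tag screening, motion, and aesthetic thresholds."""
    kept = []
    for clip in clips:
        # Exclude clips hit by any negative tag (over-edited, choppy motion, lecture-style, ...).
        if neg_tag_fn(clip):
            continue
        # Require enough motion (optical flow score) and sufficient visual quality (aesthetic score).
        if flow_fn(clip) >= flow_min and aesthetic_fn(clip) >= aesthetic_min:
            kept.append(clip)
    return kept
```

In practice the thresholds would be tuned dynamically per data source, as the article describes.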
To evaluate the quality of text-to-video generation, Zhipu AI uses multiple metrics from VBench, such as human action, scene, and dynamic degree. Zhipu AI also uses two additional video evaluation tools that focus on the dynamic characteristics of video: Dynamic Quality from Devil and the GPT4o-MT score from Chrono-Magic. Zhipu AI has verified the effectiveness of scaling laws in video generation and, going forward, will continue to scale up both data and model size while exploring new model architectures with more breakthrough innovation, more efficient compression of video information, and a fuller fusion of text and video content.
Finally, let's take a look at what "Qingying" can do.
Prompt: "A delicate wooden toy boat with beautifully carved masts and sails glides smoothly across a plush blue carpet that mimics ocean waves. The hull is painted a rich brown and has small windows. The carpet, soft and textured, provides a perfect backdrop reminiscent of a vast sea, and around the boat are various toys and children's items, suggesting a playful environment. The scene symbolizes endless adventures aboard a toy boat."
Prompt: "The camera tracks an old white SUV with a black roof rack as it charges up a steep slope, its tires kicking up dust under a blazing sun. Speeding along the unpaved road, the SUV curves into the distance in warm light, with sequoia trees growing on both sides of the road. Seen from behind, the car takes the curves smoothly, giving the impression of being surrounded by rugged hills and mountains, with thin clouds spread overhead. Snow-covered trees stand in rows and the ground is blanketed in snow, creating a bright, tranquil atmosphere. The style of the video is a nature-landscape shot focused on the beauty of the snow-covered forest and the quiet of the road. Close-up of a grill with light char and light smoke."