
Revealed: Step Star trillion MoE+ multi-modal large model matrix unveiled

WBOY · Original · 2024-07-12 05:52:58

At the 2024 World Artificial Intelligence Conference (WAIC), many people lined up at one booth just to have a large AI model "assign" them an errand in heaven.


Process:
  1. Provide personal photos
  2. Generate a celestial-style portrait (in the visual style of "Havoc in Heaven")
  3. Interactive plot selection and conversation session
  4. Evaluate MBTI personality type based on choices and answers
  5. "Arrange" heavenly errands according to personality type

Experience method:

  • Queue on site
  • Online experience (via the QR code in the original article)


Large model startup Step Star announces its big moves

The "AI + Havoc in Heaven" interactive experience, created with Shanghai Film Studio, is just an appetizer for Step Star's showcase of its large models. During WAIC, the company launched three major releases:

  1. MoE large model with trillions of parameters: Step-2 official version
  2. Multi-modal large model with hundreds of billions of parameters: Step-1.5V
  3. Image generation large model: Step-1X

Step-2 trillion parameter large model

Since its preview debut in March, Step-2 has evolved to approach GPT-4 level across the board, with excellent performance in mathematical logic, programming, Chinese knowledge, English knowledge, and instruction following.

Step-1.5V multi-modal large model

Based on the Step-2 model, Step Star developed the multi-modal large model Step-1.5V, which not only has powerful perception and video understanding capabilities, but can also perform advanced reasoning over image content (such as solving math problems, writing code, and composing poetry).

Step-1X large image generation model

The image generation in "AI + Havoc in Heaven" is handled by the Step-1X model, which is deeply optimized for Chinese elements and shows excellent semantic alignment and instruction following.

Step Star has built a complete large model matrix spanning trillion-parameter MoE models and multi-modal models, placing it in the first tier of large model startups. This comes from its persistence with the Scaling Law and the technical and resource strength to match.

Step-2 trillion parameter large model trained from scratch

Training a trillion-parameter model from scratch significantly improves its reasoning capabilities in fields such as mathematics and programming. Step-2 can solve more complex mathematical logic and programming problems than 100-billion-parameter models, a gain quantitatively confirmed by benchmark evaluations.

In addition, its Chinese and English capabilities and instruction following have also improved significantly.
Step-2 performs this well partly because of its huge parameter count and partly because of how it was trained.
There are two main ways to train a MoE model. The first is upcycling: reusing intermediate checkpoints or an already-trained model to lift performance in a cheaper, more efficient way. This approach needs little computing power and trains quickly, but the resulting model tends to have a lower ceiling. For example, if the experts of a MoE model are obtained by copying and fine-tuning the same base model, they may end up highly similar, and that homogeneity limits how far the MoE model can improve, as the sketch below makes concrete.
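To make the homogeneity problem concrete, here is a minimal sketch, assuming PyTorch and invented dimensions, of how upcycling typically seeds a MoE layer: every expert begins as a copy of the same trained dense FFN, so the experts start out literally identical.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Seed a MoE layer by cloning one trained dense FFN into N experts.

    All experts start with identical weights: cheap to train, but they
    stay highly similar even after fine-tuning -- the homogeneity
    ceiling described above.
    """
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

# Illustrative dimensions only (not Step-2's actual configuration).
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_ffn_to_moe(dense_ffn, num_experts=8)
```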
Considering these limitations, Step Star chose the other approach: fully independent development and training from scratch. This route is harder and consumes far more computing power, but it raises the model's ceiling.
Specifically, the team first innovated on the MoE architecture, including parameter sharing across some experts and a heterogeneous expert design. The former lets multiple experts share certain common capabilities while each retains its uniqueness; the latter boosts diversity and overall performance by designing different types of experts so that each has unique advantages on specific tasks.
Building on these innovations, Step-2 not only reaches a trillion total parameters; the parameters activated on each training or inference step also exceed those of most dense models on the market, as the toy layer below illustrates.
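Here is a toy PyTorch sketch of those two ideas, always-on shared experts plus routed experts of different sizes; the dimensions, expert counts, and dense routing loop are illustrative, not Step-2's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d_model: int, d_hidden: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                         nn.Linear(d_hidden, d_model))

class SharedHeteroMoE(nn.Module):
    """Toy MoE layer with (a) shared experts that always fire and
    (b) heterogeneous routed experts with different hidden sizes."""

    def __init__(self, d_model=1024, num_shared=2,
                 routed_hidden=(2048, 2048, 4096, 4096, 8192, 8192), top_k=2):
        super().__init__()
        self.shared = nn.ModuleList(ffn(d_model, 4096) for _ in range(num_shared))
        # Heterogeneous experts: different capacities suit different tasks.
        self.routed = nn.ModuleList(ffn(d_model, h) for h in routed_hidden)
        self.router = nn.Linear(d_model, len(routed_hidden))
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)           # shared experts always fire
        topw, topi = self.router(x).topk(self.top_k, dim=-1)
        topw = F.softmax(topw, dim=-1)                 # normalize over the chosen k
        for e_idx, expert in enumerate(self.routed):   # dense loop: clear, not fast
            w = torch.where(topi == e_idx, topw,
                            torch.zeros_like(topw)).sum(-1, keepdim=True)
            out = out + w * expert(x)
        return out

layer = SharedHeteroMoE()
y = layer(torch.randn(2, 16, 1024))
```

Per token, the activated parameters are the shared experts plus the top-k routed experts; with experts at trillion-model scale, that total can exceed an entire dense model, which matches the claim above.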
In addition, training such a trillion-parameter model from scratch is a major test for the systems team. Fortunately, Step Star's systems team has deep hands-on experience building and managing training infrastructure, which allowed it to break through key technologies such as 6D parallelism, extreme GPU memory management, and fully automated operations during training, and to complete Step-2's training successfully.
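The article does not spell out the six dimensions behind "6D parallelism". A common decomposition combines data, tensor, pipeline, sequence, expert, and ZeRO (optimizer-state) parallelism; the sketch below shows how such a hypothetical layout multiplies out to a GPU count. It is illustrative only, not Step Star's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class ParallelLayout:
    """One plausible reading of '6D parallelism' (assumed, not confirmed):
    data, tensor, pipeline, sequence, expert, and ZeRO sharding."""
    data: int = 8
    tensor: int = 8
    pipeline: int = 8
    sequence: int = 2
    expert: int = 4
    zero_shards: int = 4   # optimizer state sharded within the data group

    def required_gpus(self) -> int:
        # Data/tensor/pipeline/sequence/expert groups partition the cluster;
        # ZeRO shards optimizer state across the data-parallel replicas,
        # so it does not multiply the GPU count.
        return self.data * self.tensor * self.pipeline * self.sequence * self.expert

layout = ParallelLayout()
print(layout.required_gpus())  # 4096 GPUs for this illustrative layout
```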
Step-1.5V: the multi-modal large model standing on the shoulders of Step-2

Three months ago, Step Star released the Step-1V multi-modal large model. Recently, alongside the official release of Step-2, it has been upgraded to version 1.5.
Step-1.5V focuses mainly on multi-modal understanding. Compared with previous versions, its perception has improved greatly: it can understand complex charts and flowcharts, accurately perceive complex geometric relationships in physical space, and process high-resolution images with extreme aspect ratios.


It can also understand video: the objects, people, and environments in a clip, as well as the overall atmosphere and the characters' emotions.

As mentioned earlier, Step-2 played an indispensable role in Step-1.5V's birth: during Step-1.5V's RLHF (reinforcement learning from human feedback) training, Step-2 served as the supervising model; in effect, Step-1.5V had a trillion-parameter model as its teacher. Under this teacher's guidance, Step-1.5V's reasoning improved greatly, and it can now perform advanced reasoning over image content, such as solving math problems, writing code, and composing poetry. This is also one of the capabilities recently demonstrated by OpenAI's GPT-4o, and it has raised high expectations for real-world applications.
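The article does not describe the supervision mechanism in detail. One common pattern consistent with "a trillion-parameter teacher" is to have the larger model score the smaller model's candidate answers and turn those scores into preference pairs for RLHF-style training. The sketch below illustrates that pattern with stub models and an invented grading prompt; it is not Step Star's actual pipeline.

```python
import random

def generate_candidates(student, prompt: str, n: int = 4) -> list[str]:
    """Sample n candidate answers from the smaller multi-modal model."""
    return [student(prompt) for _ in range(n)]

def teacher_score(teacher, prompt: str, answer: str) -> float:
    """Ask the trillion-parameter model to grade an answer (hypothetical rubric)."""
    grading_prompt = f"Rate this answer from 0 to 1.\nQ: {prompt}\nA: {answer}"
    return teacher(grading_prompt)

def preference_pairs(student, teacher, prompt: str):
    """Turn teacher scores into (better, worse) pairs for preference training."""
    cands = generate_candidates(student, prompt)
    ranked = sorted(cands, key=lambda a: teacher_score(teacher, prompt, a),
                    reverse=True)
    return [(ranked[0], worse) for worse in ranked[1:]]

# Stub models so the sketch runs end to end.
student = lambda p: f"answer-{random.randint(0, 99)}"
teacher = lambda p: random.random()
print(preference_pairs(student, teacher, "Solve: 12 * 7 = ?"))
```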

Multi-modal generation is embodied mainly in the new Step-1X model. Compared with similar models, it has better semantic alignment and instruction following, and it has been deeply optimized for Chinese elements, making it a better fit for Chinese aesthetics.

The "AI + Havoc in Heaven" interactive experience built on this model integrates image understanding, style transfer, image generation, and plot creation, vividly showcasing Step Star's industry-leading multi-modal capabilities. For example, when generating the initial character, the system first checks whether the user's uploaded photo meets the requirements for "face pinching", then gives feedback in the distinctive language style of "Havoc in Heaven"; this reflects both the model's image understanding and its language ability. Backed by large model technology, the game gives players an interactive experience entirely unlike traditional H5 web games: because all interactive questions, user images, and analysis results are generated in real time from learned features, every player truly gets a unique experience with open-ended plots. A hypothetical sketch of this flow follows.
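As a rough illustration of how those capabilities might be orchestrated, here is a sketch of the flow described above; every object and method name is invented, since the real system's internals are not public.

```python
def run_heaven_experience(photo, vision_model, image_model, chat_model):
    """Hypothetical orchestration of the 'AI + Havoc in Heaven' flow
    described above; all APIs here are invented for illustration."""
    # 1. Image understanding: is the photo usable for "face pinching"?
    if not vision_model.is_valid_portrait(photo):
        # Feedback phrased in the film's language style, per the article.
        return chat_model.reply_in_style("Please offer a clearer mortal visage!")
    # 2. Style transfer + generation: celestial portrait in the film's aesthetic.
    avatar = image_model.stylize(photo, style="Havoc in Heaven")
    # 3. Plot creation: interactive dialogue generated on the fly per user.
    answers = chat_model.run_interactive_plot(avatar)
    # 4. Personality analysis and heavenly job assignment.
    mbti = chat_model.infer_mbti(answers)
    return chat_model.assign_heavenly_errand(mbti)
```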

These results rest on the DiT model architecture that Step Star developed fully in-house (OpenAI's Sora also uses a DiT architecture). To let more people use the model, Step Star offers Step-1X in three parameter sizes, 600M, 2B, and 8B, to suit different computing budgets.
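A DiT (Diffusion Transformer) replaces the U-Net backbone of a diffusion model with transformer blocks whose layer norms are modulated by the conditioning signal (timestep and text embedding). Below is a minimal DiT-style block in PyTorch; all dimensions are illustrative, and this is not Step-1X's actual code.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Minimal DiT-style transformer block: the conditioning vector
    modulates each sub-layer via adaptive LayerNorm (scale, shift)
    plus a residual gate, as in the original DiT paper."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # One projection emits 6 signals: shift/scale/gate x 2 sub-layers.
        self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, cond):   # x: (B, T, D) patch tokens; cond: (B, D)
        s1, b1, g1, s2, b2, g2 = self.adaln(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

# Scaling depth/width yields different parameter budgets, analogous to
# (but not identical with) Step-1X's 600M / 2B / 8B variants.
block = DiTBlock()
out = block(torch.randn(2, 64, 512), torch.randn(2, 512))
```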

At its debut event in March, Step Star founder Jiang Daxin stated clearly that, in his view, large models will evolve through three stages:

  1. In the first stage, each modality such as language, vision, and sound develops independently, and the model of each modality focuses on learning and characterizing the characteristics of its specific modality.
  2. In the second stage, different modalities begin to merge, but the integration is incomplete: understanding and generation remain separate, so a model ends up strong at understanding but weak at generation, or vice versa.
  3. In the third stage, generation and understanding are unified in a model, and then fully integrated with the robot to form embodied intelligence. Next, embodied intelligence actively explores the physical world, and then gradually evolves into a world model, thereby realizing AGI.

This is also the route that Jiang Daxin and his team have held to since founding the company. On this road, trillion-scale parameters and multi-modal fusion are both indispensable, and Step-2, Step-1.5V, and Step-1X are the nodes they have reached so far.

Moreover, these nodes link together. Take OpenAI as an example: Sora, the video generation model released at the start of the year, used an internal OpenAI tool (most likely GPT-4V) for data annotation, and GPT-4V was in turn trained on technology derived from GPT-4. Seen this way, strong single-modal models lay the foundation for multi-modality, and multi-modal understanding lays the foundation for generation. Relying on such a model matrix, OpenAI bootstraps each new model on the one before it, and Step Star is validating the same route in China.

We look forward to this company bringing more surprises to the domestic large model field.
