Revealed: Step Star trillion MoE+ multi-modal large model matrix unveiled
At the 2024 World Artificial Intelligence Conference, many people lined up in front of a booth just to let a large AI model "assign" them a job in the Heavenly Palace.
The AI interactive experience "AI + Havoc in Heaven", created in cooperation with Shanghai Film Studio, was just an appetizer for Step Star to showcase the appeal of its large models. During WAIC, the company made a far bigger move, unveiling the following:
Step-2 trillion parameter large model
After its preview debut at Step Star's launch event in March, Step-2 has evolved to approach the GPT-4 level across the board, with excellent performance in mathematical logic, programming, Chinese knowledge, English knowledge, and instruction following.
Step-1.5V multi-modal large model
Based on the Step-2 model, Step Star developed the multi-modal large model Step-1.5V, which not only has powerful perception and video understanding capabilities, but can also perform advanced reasoning on image content (such as solving math problems, writing code, and composing poetry).
Step-1X large image generation model
The image generation in "AI + Havoc in Heaven" is handled by the Step-1X model, which is deeply optimized for Chinese elements and has excellent semantic alignment and instruction-following ability.
With this, Step Star has established a complete large model matrix spanning a trillion-parameter MoE model and multi-modal models, placing it in the first echelon of large model startups. The achievement rests on the company's persistence with the Scaling Law and the technical and resource strength to match.
Step-2 trillion parameter large model

Training a trillion-parameter model from scratch significantly improves its reasoning capabilities in fields such as mathematics and programming. Step-2 can solve more complex mathematical logic and programming problems than 100-billion-parameter models, as quantitatively confirmed by benchmark evaluations. Its Chinese and English capabilities and instruction-following ability have also improved markedly.
Step-2 performs this well for two reasons: its enormous parameter count, and the way it was trained.
There are two main ways to train a MoE model. The first is upcycling: reusing an already-trained model (or intermediate checkpoints from training) to build the MoE more efficiently and economically. This approach needs less compute and trains quickly, but the resulting model often has a lower ceiling. For example, if the experts of a MoE model are obtained by copying and fine-tuning the same base model, they may end up highly similar to one another, and this homogeneity limits how much room the MoE model has to improve.
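To make the upcycling idea concrete, here is a minimal sketch (our illustration, not Step Star's code) of how a trained dense FFN can simply be cloned into a set of experts, which is exactly why the experts start out homogeneous:

```python
# Illustrative MoE "upcycling": every expert begins as a copy of one
# pre-trained dense FFN. Cheap to do, but all experts share the same
# starting point, which caps the diversity of the resulting MoE model.
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Build a list of MoE experts by cloning an already-trained dense FFN."""
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

# Example: a toy dense FFN upcycled into 8 identical experts.
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_ffn_to_moe(dense_ffn, num_experts=8)
```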
Given these limitations, Step Star chose the other approach: fully independent development and training from scratch. Although this path is harder and consumes far more compute, it allows a higher model ceiling.
Specifically, the team first introduced several innovations in the MoE architecture, including partial parameter sharing among experts and heterogeneous expert design. The former lets certain common capabilities be shared across experts while each expert retains its uniqueness; the latter improves the model's diversity and overall performance by designing experts of different types, each with unique advantages on specific tasks.
Built on these innovations, Step-2 not only reaches a trillion total parameters, but also activates more parameters per forward pass (in training or inference) than most dense models on the market.
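A minimal, purely illustrative sketch of these two ideas together (hypothetical sizes and routing; the article does not disclose Step-2's actual design): a MoE layer with one always-active shared expert plus heterogeneous routed experts, and a count of total versus per-token-activated parameters:

```python
import torch
import torch.nn as nn

def ffn(d_model: int, d_hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                         nn.Linear(d_hidden, d_model))

class SharedPlusRoutedMoE(nn.Module):
    def __init__(self, d_model: int, expert_hidden_sizes: list, top_k: int = 2):
        super().__init__()
        self.shared = ffn(d_model, 4 * d_model)  # always active: common capabilities
        # Heterogeneous experts: different hidden widths give each expert a
        # different capacity profile (a stand-in for "different expert types").
        self.experts = nn.ModuleList(ffn(d_model, h) for h in expert_hidden_sizes)
        self.router = nn.Linear(d_model, len(self.experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); naive per-token dispatch, for clarity only
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        rows = []
        for t in range(x.size(0)):
            y = self.shared(x[t])
            for w, i in zip(topw[t], topi[t]):
                y = y + w * self.experts[int(i)](x[t])
            rows.append(y)
        return torch.stack(rows)

moe = SharedPlusRoutedMoE(d_model=1024, expert_hidden_sizes=[2048, 3072, 4096, 6144])
total = sum(p.numel() for p in moe.parameters())
expert_sizes = sorted((sum(p.numel() for p in e.parameters()) for e in moe.experts),
                      reverse=True)
activated = (sum(p.numel() for p in moe.shared.parameters())
             + moe.router.weight.numel() + moe.router.bias.numel()
             + sum(expert_sizes[:moe.top_k]))  # worst case: the two largest experts
print(f"total={total:,}  activated per token (at most)={activated:,}")
```

The gap between the two printed numbers is the MoE trade-off in miniature: total parameters set the model's capacity ceiling, while activated parameters set the per-token compute cost.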
In addition, training such a trillion-parameter model from scratch is a major test for the systems team. Fortunately, Step Star's systems team has deep hands-on experience in cluster construction and management, which let them break through key technologies such as 6D parallelism, extreme GPU memory management, and fully automated operations during training, and bring Step-2's training to a successful completion.
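The article does not spell out which six dimensions Step Star's "6D parallelism" combines. For orientation only, a common decomposition in public trillion-parameter MoE recipes (Megatron-/DeepSpeed-style) stacks the six dimensions below; every number here is hypothetical:

```python
# Hypothetical 6D-parallel layout; NOT Step Star's disclosed configuration.
parallel_config = {
    "data_parallel": 16,        # replicate the model, split the global batch
    "tensor_parallel": 8,       # split individual weight matrices across GPUs
    "pipeline_parallel": 8,     # split consecutive layers into stages
    "sequence_parallel": True,  # split activations along the sequence axis
                                # (usually reuses the tensor-parallel group)
    "expert_parallel": 8,       # shard MoE experts across devices
                                # (usually folded into the data-parallel group)
    "optimizer_sharding": True, # ZeRO-style sharding of optimizer state
}

# Illustrative world size: only dimensions occupying distinct GPU groups
# multiply; sequence/expert/optimizer sharding reuse existing groups here.
world_size = (parallel_config["data_parallel"]
              * parallel_config["tensor_parallel"]
              * parallel_config["pipeline_parallel"])
print(f"world size (illustrative): {world_size} GPUs")  # 1024
```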
Step-1.5V multi-modal large model standing on the shoulders of Step-2

Three months ago, Step Star released the Step-1V multi-modal large model. Now, alongside the official release of Step-2, that multi-modal model has been upgraded to version 1.5.
Step-1.5V focuses mainly on multi-modal understanding. Compared with the previous version, its perception capabilities are greatly improved: it can understand complex charts and flowcharts, accurately perceive complex geometric positions in physical space, and handle images with high resolutions and extreme aspect ratios.
As mentioned earlier, Step-2 played an indispensable role in the birth of Step-1.5V: during Step-1.5V's RLHF (reinforcement learning from human feedback) training, Step-2 served as the supervising model, in effect giving Step-1.5V a trillion-parameter teacher. Under this teacher's guidance, Step-1.5V's reasoning ability improved substantially, and it can now perform various advanced reasoning tasks grounded in image content, such as solving math problems, writing code, and composing poetry. This is also one of the capabilities recently demonstrated by OpenAI's GPT-4o, and it has raised outside expectations for the application prospects of such models.
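A minimal sketch of the "large model as teacher" pattern in RLHF-style training. All function names here are hypothetical placeholders; the article does not describe Step Star's actual pipeline, only that Step-2 supervised Step-1.5V:

```python
from typing import Callable, List

def rlhf_step(student_sample: Callable[[str], str],
              teacher_score: Callable[[str, str], float],
              update_policy: Callable[[str, List[str], List[float]], None],
              prompt: str, n_candidates: int = 4) -> None:
    # 1. The student (a Step-1.5V-like model) samples several responses.
    candidates = [student_sample(prompt) for _ in range(n_candidates)]
    # 2. The teacher (a Step-2-like trillion-parameter model) scores them;
    #    its judgments act as the reward signal.
    rewards = [teacher_score(prompt, c) for c in candidates]
    # 3. A policy-gradient update pushes the student toward high-reward outputs.
    update_policy(prompt, candidates, rewards)
```

The appeal of this setup is that a stronger model's judgment substitutes for expensive human preference labels, so the student inherits reasoning standards it could not easily learn from raw data alone.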
The multi-modal generation capability is mainly embodied in the new Step-1X model. Compared with similar models, it offers better semantic alignment and instruction following; at the same time, it is deeply optimized for Chinese elements and better suits Chinese aesthetic preferences.

The "AI + Havoc in Heaven" interactive experience built on this model integrates image understanding, style transfer, image generation, plot creation, and other capabilities, showcasing Step Star's industry-leading multi-modal level in a rich, three-dimensional way. For example, when generating the initial character, the system first judges whether the photo uploaded by the user meets the requirements for "face sculpting", then gives feedback in a very "Havoc in Heaven" language style, reflecting both the model's image understanding and its language abilities. Powered by large model technology, the game gives players an interactive experience completely different from traditional online H5 games: because all interactive questions, user images, and analysis results are generated in real time from features the model learns on the fly, every player truly gets a personalized, open-ended storyline.
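Piecing together the capabilities the article lists, the experience plausibly flows as outlined below. Every function name is a hypothetical placeholder standing in for a model capability, not a real Step Star API:

```python
def havoc_in_heaven_experience(user_photo, understand_image, respond_in_style,
                               stylize, create_plot):
    # 1. Image understanding: is the photo suitable for "face sculpting"?
    analysis = understand_image(user_photo)
    if not analysis["suitable"]:
        # 2. The language model replies in the film's signature register.
        return respond_in_style(analysis, style="Havoc in Heaven")
    # 3. Style transfer + image generation produce the celestial avatar.
    avatar = stylize(user_photo, style="Havoc in Heaven")
    # 4. Plot creation builds a personalized storyline around the avatar.
    return create_plot(avatar, analysis)
```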
These strong results are inseparable from the DiT (Diffusion Transformer) architecture Step Star developed across the full stack (OpenAI's Sora also uses a DiT architecture). To let more people use the model, Step Star offers Step-1X in three parameter sizes: 600M, 2B, and 8B, covering scenarios with different compute budgets.
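For readers unfamiliar with DiT, here is a heavily simplified sketch of the published Diffusion Transformer idea (not Step-1X's actual architecture): a Transformer block over image-latent patches whose normalization is modulated by the diffusion timestep embedding, here with a 4-way modulation instead of DiT's full adaLN-Zero:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # adaLN-style modulation: the timestep embedding yields scale/shift pairs.
        self.modulation = nn.Linear(dim, 4 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, patches, dim) latent tokens; t_emb: (batch, dim)
        s1, b1, s2, b2 = self.modulation(t_emb).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + self.mlp(h)

block = DiTBlock(dim=512)
x = torch.randn(2, 256, 512)     # 2 images, 256 latent patches each
t = torch.randn(2, 512)          # timestep embeddings
print(block(x, t).shape)         # torch.Size([2, 256, 512])
```

Because the backbone is a plain Transformer, scaling it to the 600M/2B/8B sizes mentioned above is mostly a matter of widening `dim` and stacking more blocks.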
At the March debut event, Step Star founder Jiang Daxin stated clearly that he believes the evolution of large models will pass through three stages. This is also the route Jiang Daxin and his team have adhered to since founding the company, and on it, "trillion parameters" and "multi-modal fusion" are both indispensable. Step-2, Step-1.5V, and Step-1X are the nodes they have reached so far along this road.
Moreover, these nodes link together. Take OpenAI as an example: Sora, the video generation model released at the beginning of this year, was annotated using an internal OpenAI tool (most likely GPT-4V), and GPT-4V was in turn trained on technology related to GPT-4. Seen this way, powerful single-modal models lay the foundation for multi-modality, and multi-modal understanding lays the foundation for generation. Relying on such a model matrix, OpenAI keeps bootstrapping itself upward, each model lifting the next. Step Star is now validating the same route in China.
We look forward to this company bringing more surprises to the domestic large model field.