Recognition from the first prize of Science and Technology Progress Award: Tencent solved the problem of training large models with trillions of parameters
The Science and Technology Progress Award recognizes the research and application behind machine learning platform projects. Against the backdrop of the rapid development of large-scale models, the award fully affirms the value and importance of model training platforms.
With the rise of deep learning, major companies have begun to realize the importance of machine learning platforms in the development of artificial intelligence technology. Companies such as Google, Microsoft, and Nvidia have launched their own machine learning platforms to speed up the training process of artificial intelligence models. These platforms provide developers with convenient support, allowing them to build and optimize complex artificial intelligence systems faster. This trend has prompted people to pay more attention to the development of machine learning technology and laid a solid foundation for future artificial intelligence applications.
Since 2023, the rise of large-scale models has further driven growth in parameter counts. Major companies have launched models with hundreds of billions or even trillions of parameters, generally built on deep neural network architectures. This development has also exposed two core pain points: the difficulty of distributed model training, and the model design challenges posed by application complexity.
Why the Angel machine learning platform?
Detailed explanation of the four core technology breakthroughs
The appraisal committee, composed of several academicians and other authoritative experts, concluded that the Tencent Angel machine learning platform is technically complex, difficult to develop, and highly innovative, with broad application prospects. Its overall technology has reached the internationally advanced level; its efficient cache scheduling and management technology for all-to-all communication, and its adaptive pre-sampling and graph structure search technology, have reached the internationally leading level.
The platform adopts a distributed parameter server architecture, in which the two tasks of storing model parameters and performing model computation run on different servers; by adding more servers, larger models with higher computational requirements can be supported. This design makes training more efficient and able to handle large-scale datasets and complex model computation, while giving the system good scalability and flexibility to serve machine learning tasks of different scales and needs. It also makes effective use of cluster resources, improving computing efficiency and delivering faster service to users. On top of this architecture, the platform has achieved technological breakthroughs in core areas such as caching, model storage and scheduling, multi-modal model and fusion-learning ranking, and large-scale graph models and structure search.
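The parameter-server pattern described above can be illustrated with a minimal sketch: a server process holds the parameters, while workers pull them, compute gradients on their own data shards, and push updates back. The classes and the toy quadratic loss below are illustrative assumptions, not Angel's actual implementation.

```python
import numpy as np

class ParameterServer:
    """Holds the shared model parameters; workers pull params and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.params.copy()

    def push(self, grad):
        # Apply a worker's gradient with plain SGD.
        self.params -= self.lr * grad

def worker_step(server, local_data):
    # Each worker computes a gradient on its own shard of data.
    w = server.pull()
    grad = 2 * (w - local_data.mean())  # gradient of the toy loss (w - mean)^2
    server.push(grad)

# Two workers with different data shards update the shared parameters.
ps = ParameterServer(dim=1)
shards = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
for _ in range(100):
    for shard in shards:
        worker_step(ps, shard)
print(ps.params)  # settles near the overall data mean of 2.5
```

Because storage (the server) and computation (the workers) are separated, scaling up is a matter of adding more of either kind of node.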
To improve training efficiency, terabyte-scale machine learning models usually adopt distributed training, which requires synchronizing a large number of parameters and gradients. Taking thousand-card (roughly 1,000-GPU) training of a 1.8T-parameter model as an example, the I/O communication volume reaches 25 TB, accounting for 53% of time consumption. In addition, the heterogeneous network environment between different compute clusters means communication latencies vary, placing higher demands on communication overhead during training. The Tencent Angel machine learning platform builds efficient communication and cache scheduling management on Tencent Cloud's Xingmai network, effectively addressing the high communication overhead of TB-scale model training, reducing network communication time by 80% and achieving distributed training performance 2.5 times that of mainstream industry solutions.
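A quick back-of-envelope calculation shows why communication dominates at this scale. The fp16 precision assumed below is illustrative; the article's 25 TB figure covers the full I/O of a training step, while this sketch only estimates the raw gradient payload of one synchronization.

```python
# Rough gradient-sync volume for a single full synchronization,
# assuming fp16 gradients (2 bytes per parameter). Illustrative only.
params = 1.8e12       # 1.8T parameters, as in the article's example
bytes_per_grad = 2    # fp16
sync_bytes = params * bytes_per_grad
print(f"{sync_bytes / 1e12:.1f} TB of gradient traffic per synchronization")
```

Even a single synchronization of fp16 gradients moves terabytes of data, which is why caching, scheduling, and overlap of communication with computation matter so much.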
Under current compute conditions, even though models have reached TB scale, mainstream GPUs still offer only 80 GB of memory, so parameter storage is a bottleneck. To address the difficulty of storing terabyte-scale model training parameters, the Tencent Angel machine learning platform proposes a storage management mechanism that treats GPU memory and host memory from a unified perspective, achieving twice the model storage capacity and twice the training performance of mainstream industry solutions.
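The idea of managing GPU and host memory from a unified perspective can be sketched as a placement plan: tensors needed every step stay on the GPU, and the rest spill to host RAM. The byte-cost factors below (fp16 weights and gradients, an Adam-style factor of 6 for fp32 master weights plus two moments) are assumptions for illustration, not Angel's actual policy.

```python
GPU_MEM_BYTES = 80 * 2**30  # one 80 GB accelerator

def plan_placement(n_params, dtype_bytes=2, optim_state_factor=6):
    """Decide which tensors stay in GPU memory and which spill to host RAM.

    Adam-style training stores weights, gradients, and optimizer states;
    optim_state_factor=6 assumes fp32 master weights plus two fp32 moments.
    """
    weights = n_params * dtype_bytes
    grads = n_params * dtype_bytes
    optim = n_params * optim_state_factor
    if weights + grads + optim <= GPU_MEM_BYTES:
        return {"gpu": weights + grads + optim, "host": 0}
    # Keep weights and gradients on the GPU (touched every step); spill
    # optimizer states to host memory and stream them in during updates.
    return {"gpu": weights + grads, "host": optim}

plan = plan_placement(n_params=int(1e10))  # a 10B-parameter model
print(plan)  # optimizer states (60 GB) spill to host, 40 GB stays on GPU
```

Pooling host memory this way is what lets a model far larger than a single GPU's 80 GB still be trained, at the cost of extra transfers that the scheduler must hide.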
Developing a large model into a general-purpose model is inseparable from support for multi-modal data, and it is difficult to align, integrate, and understand data of different modalities such as text, images, and video. For multi-modal model training, the Tencent Angel machine learning platform proposes a full-link ranking and advertising recommendation technology based on multi-modal fusion learning for advertising scenarios, helping increase the advertising recall rate by more than 40%.
In addition, for graph model training in recommendation systems, the Tencent Angel machine learning platform has designed a graph-node-feature-adaptive graph network structure search technology that automatically outputs the optimal structure. This solves the problem of "difficult graph data mining" in TB-scale graph model applications, improves model training performance by 28 times, and offers the best scalability in the industry.
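The notion of searching over graph network structures can be illustrated with a toy example: enumerate a small space of aggregation operators and propagation depths, and pick the combination that best fits a target signal on a tiny graph. The candidate space and fitness criterion here are invented for illustration and bear no relation to Angel's actual search algorithm.

```python
import numpy as np
from itertools import product

# Candidate aggregation operators over an adjacency matrix m and features x.
AGGS = {
    "mean": lambda m, x: m @ x / m.sum(axis=1, keepdims=True),
    "sum": lambda m, x: m @ x,
}

def propagate(adj, feats, agg, depth):
    """Apply the chosen aggregation `depth` times (a bare-bones GNN layer stack)."""
    h = feats
    for _ in range(depth):
        h = AGGS[agg](adj, h)
    return h

def search(adj, feats, target):
    """Exhaustively score every (aggregator, depth) structure; keep the best."""
    best = None
    for agg, depth in product(AGGS, [1, 2, 3]):
        err = np.abs(propagate(adj, feats, agg, depth) - target).mean()
        if best is None or err < best[0]:
            best = (err, agg, depth)
    return best

# A 2-node graph where mean aggregation at depth 1 reproduces the target exactly.
adj = np.array([[1.0, 1.0], [1.0, 1.0]])
feats = np.array([[1.0], [3.0]])
best = search(adj, feats, target=np.array([[2.0], [2.0]]))
print(best[1], best[2])  # the winning structure: mean aggregation, depth 1
```

A real structure search replaces this brute-force loop with a learned or sampled search over far larger spaces, but the output is the same kind of object: a concrete network structure chosen automatically from the data.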
The Road to Forging Tencent Angel Machine Learning Platform
Tencent Hunyuan Large Model Expands to Trillion Scale
As the foundational platform for Tencent's artificial intelligence technology, the Tencent Angel platform was born in 2015, supporting PS-Worker distributed training and the training of billion-parameter LDA models.
In 2017, the Angel framework was open-sourced on GitHub. Technically, Angel solved the communication problem under heterogeneous networks and further improved performance. In 2019, it made a breakthrough in multi-modal understanding technology for scalable graph models, solving the problem of scaling graph models to trillions of nodes. In 2021, it proposed a unified GPU-memory and host-memory storage technology, solving the storage and performance problems of large model parameters.
In the creation of Tencent’s general artificial intelligence large model Tencent Hunyuan, Tencent’s Angel machine learning platform also played an important role.
In September 2023, Tencent’s Hunyuan large model was officially unveiled. The pre-training corpus exceeds 2 trillion tokens, and it has strong Chinese understanding and creation capabilities, logical reasoning capabilities, and reliable task execution capabilities.
Facing the demands of building the Tencent Hunyuan large model, the Tencent Angel machine learning platform created the self-developed frameworks Angel PTM and Angel HCF for large model training and inference, supporting single-task training at the ten-thousand-card scale and large-scale inference service deployment. Large model training efficiency reached 2.6 times that of mainstream open-source frameworks, and training hundred-billion-parameter models can save 50% of compute costs. After the upgrade, a single task supports ultra-large-scale training on ten thousand cards. On the inference side, the platform's inference speed increased 1.3 times; in Tencent Hunyuan's text-to-image application, Wenshengtu, inference time was shortened from 10 seconds to 3 to 4 seconds.
In addition, Angel provides a one-stop platform from model development to application deployment, allowing users to quickly call Tencent Hunyuan's large model capabilities through API interfaces or fine-tuning, accelerating the construction of large model applications. More than 400 Tencent products and scenarios, including Tencent Meeting, Tencent News, and Tencent Video, have connected to Tencent Hunyuan for internal testing.
Tencent Hunyuan has expanded to trillions of parameters by adopting a mixture-of-experts (MoE) structure, improving performance while reducing inference costs. As a general-purpose model, Tencent Hunyuan leads the industry in Chinese-language performance, especially in text generation, mathematical logic, and multi-turn dialogue. Tencent Hunyuan is also actively developing multi-modal models to further enhance its text-to-image and text-to-video capabilities.
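The MoE structure mentioned above is what makes trillion-parameter scale affordable at inference time: a gating network routes each token to only a few experts, so most parameters sit idle on any given token. The minimal numpy sketch below shows top-k routing with softmax mixing; the shapes and the dense expert matrices are illustrative assumptions, not Hunyuan's architecture.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: (tokens, d); gate_w: (d, n_experts); experts: list of (d, d) matrices.
    """
    logits = x @ gate_w                        # (tokens, n_experts) gate scores
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        weights = np.exp(logits[t, sel])
        weights /= weights.sum()               # softmax over the selected experts
        for w, e in zip(weights, sel):
            out[t] += w * (x[t] @ experts[e])  # only k experts run per token
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 3
x = rng.standard_normal((tokens, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, rng.standard_normal((d, n_experts)), experts)
print(y.shape)
```

With k fixed, per-token compute stays roughly constant as more experts are added, which is exactly the lever that lets total parameter count grow to the trillions without inference cost growing with it.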
Tencent's many application scenarios provide a proving ground for the Tencent Angel machine learning platform. Besides the Hunyuan large model, the platform also supports products such as Tencent Advertising and Tencent Meeting, and serves industry and enterprise customers through Tencent Cloud, assisting the digital and intelligent development of all walks of life.
Take Tencent Advertising as an example: using innovations such as the Tencent Angel machine learning platform's distributed training optimization and multi-modal understanding with graph data mining, the training speed of multi-modal large models in advertising scenarios has increased 5 times, the model scale has grown 10 times, and the advertising recall rate has improved significantly.