Huawei Cloud brings AI computing power into the world
Throughout 2023, the global technology community has been buzzing about large models. As the shortest path between large AI models and industrial scenarios, cloud computing has naturally drawn attention in this wave. Cloud vendors have entered the large-model arena one after another, exploring from multiple angles the possibilities that large models bring to the cloud computing industry.
But we tend to overlook a key point: the first challenge posed by large models, in all their many forms, is AI computing power. To deliver good large-model services, vendors must first deliver good computing power services.
Hidden in the most basic capability, AI computing power, is the real answer to how AI cloud services can go further.
On September 21, at Huawei Connect 2023, Huawei Managing Director and Huawei Cloud CEO Zhang Pingan delivered a keynote speech on "Building a Cloud Foundation for an Intelligent World, Letting AI Reshape Thousands of Industries". He released innovative Pangu large-model services for mining, government affairs, automobiles, meteorology, medicine, digital humans, R&D, and other fields, and announced the official launch of the Huawei Cloud Ascend AI cloud service, accelerating the spread of large models to thousands of industries. Centered on the implementation of the Pangu large model, the theme of Huawei Cloud's sharing this time was "solve difficult problems and do difficult things". Making AI computing power available, sufficient, easy to use, and genuinely useful is the first problem Huawei Cloud set out to solve.
With computing power in place, AI begins to take off.
To do this important "difficult thing", the Ascend AI cloud service set out.
Big mountains and big rivers: AI needs huge computing power
Since the information revolution, humans have gradually discovered that the extent of technological innovation is directly proportional to the consumption of computing power. This has been confirmed once again by large models.
The emergence and maturity of large models bring new opportunities for the intelligent transformation of thousands of industries. It can be said that every scenario in every industry is worth integrating with large models, and most can even be reshaped by them. Whether it is the nature of large models themselves, with their large scale and many parameters, or the emerging demand for large models across industries, both point to the same result: the AI computing power consumed by society's production systems will grow exponentially.
Large models require large computing power; this has become an industry consensus. But if we unpack the issue, we find that the industrial challenges surrounding AI computing power are quite diverse. They can be summarized into four types: the contradiction between supply and demand, energy efficiency challenges, operation and maintenance needs, and security concerns.
Let’s first look at the core challenge of AI computing power, which is the objective imbalance between supply and demand.
As of July 2023, a total of 130 large models had been released in China. This "battle of a hundred models" has brought a huge increase in demand for AI computing power. According to industry reports, global demand for AI computing power has grown 300,000-fold over the past 10 years and is expected to grow another 500-fold over the next 10. According to the "2022-2023 China Artificial Intelligence Computing Power Development Assessment Report", the total AI computing performed in China in 2022 already exceeded general-purpose computing. For the foreseeable future, AI computing power will be the form of computing with the greatest demand, the largest supply-demand gap, and the tightest resource constraints across industries.
Secondly, large models and large computing power bring acute energy efficiency issues.
Since large model training requires clustered AI computing, training tasks rely heavily on data centers. The power density of AI servers is far higher than that of ordinary servers, with single-cabinet power consumption 6 to 8 times what it used to be. Against the backdrop of China's dual-carbon goals, data center energy efficiency ratios must keep falling, so the energy consumption problem created by large models has become urgent. Striking a balance between growing AI computing power and a declining energy efficiency ratio is a problem the industry must confront.
In addition, we need to recognize the operation and maintenance problems that arise as AI computing power is put to use. Because the training and deployment goals of large models differ, and their respective environments differ greatly, problems such as network latency, model reliability, and high O&M management thresholds naturally arise. For example, some large models require extremely large computing clusters, and coordination problems frequently occur among the many servers and computing units; once a computing unit fails, developers must restart training. Frequent failures and restarts of training tasks impose enormous costs in time, talent, and computing power. Large models therefore need not only sufficient AI computing power but also sophisticated computing power services that help users reduce the overall O&M burden.
Finally, we will also see that large models bring new security concerns.
Since the deployment scenarios of large models are mostly related to the national economy and people’s livelihood, all security risks must be eliminated. In areas such as data access, storage encryption, and transmission security, large models still have many security risks.
Overall, large models are not highly standardized products. Their technical classifications are complex, their engineering paths vary, and each user needs to fine-tune and customize large models to their own needs. These factors create differentiated demands on AI computing power services from many aspects and angles.
Meeting the AI computing power requirements of large models has become the first test question in the era of large models.
Making the Ascend AI cloud service practical, refined, and competitive
For Huawei Cloud, answering the computing power question well means building along two lines: first, making AI computing power sufficient and available; second, meeting the operation and maintenance, security, and energy efficiency challenges that come with computing power services. AI computing power must be both abundant and refined.
In July this year, Huawei Cloud released the latest Ascend AI cloud service, which can provide users with surging AI computing power in thousands of industries. Behind it is Huawei Cloud's solid computing infrastructure construction.
To date, Huawei Cloud has built three major AI computing centers, in Gui'an, Ulanqab, and Wuhu. On this basis, the Ascend AI cloud service achieves a 20 ms latency circle covering the whole country: users can access the nearest center, a single optical fiber connects them to abundant AI computing power, and the service works out of the box. To secure large-model training data across its full life cycle, the Ascend AI cloud service also applies multiple technologies, including encrypted data transmission and storage, secure data erasure, data access control, and data watermarking against leaks. It is worth noting that, for enterprises and for society as a whole, cloud services are the most energy-efficient way to obtain AI computing power in the dual-carbon era.
To push AI computing performance further, Huawei Cloud has also optimized its AI cloud services on top of this infrastructure. For example, ModelArts provides three layers of acceleration covering data, training, and inference. DataTurbo data acceleration technology uses compute-node storage resources to build a distributed cache, reducing data read latency to the sub-millisecond level. TrainTurbo training acceleration technology shortens data reading time by 50% when training data exceeds 100 TB, improving overall training efficiency by more than 20%. For model inference, InferTurbo accelerates inference through graph compilation and improves large-model inference performance by 30% through full-link vertical collaborative optimization.
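The idea behind this kind of data acceleration is to keep hot training data close to the compute nodes so that most reads never touch remote storage. The following single-process sketch shows only that read-through caching principle; the class name, capacity, and stand-in backing store are illustrative assumptions, not Huawei Cloud's DataTurbo implementation:

```python
class ReadThroughCache:
    """Minimal read-through cache for training-data shards.

    Illustrative only: a real data-acceleration layer is distributed
    across compute nodes; this sketch shows just the cache-hit path
    that cuts read latency.
    """

    def __init__(self, backing_store, capacity=1024):
        self.backing_store = backing_store   # e.g. remote object storage
        self.capacity = capacity
        self.cache = {}                      # shard_id -> bytes

    def read(self, shard_id):
        # Fast path: serve from the local (node-level) cache.
        if shard_id in self.cache:
            return self.cache[shard_id]
        # Slow path: fetch from remote storage, then populate the cache.
        data = self.backing_store(shard_id)
        if len(self.cache) >= self.capacity:
            # Evict the oldest entry (dicts preserve insertion order).
            self.cache.pop(next(iter(self.cache)))
        self.cache[shard_id] = data
        return data

# Hypothetical backing store standing in for remote storage.
def slow_store(shard_id):
    return b"shard-%d" % shard_id

cache = ReadThroughCache(slow_store, capacity=2)
first = cache.read(1)   # miss: fetched from the store
second = cache.read(1)  # hit: served from the local cache
```

In production the cache would be shared across nodes and sized to the working set, but the latency win comes from the same hit/miss split shown here.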
With the support of such infrastructure and core technologies, users can obtain top-tier AI computing performance. To make computing power not only "sufficient" but also "easy to use", Huawei Cloud has also carried out a series of refinements.
For example, AI development relies on comprehensive tools and platforms. Behind the Ascend AI cloud service stand the AI development tools and technology platforms Huawei continues to build, such as the heterogeneous computing architecture CANN, the full-scenario AI framework MindSpore, and the AI development production line ModelArts. Together they provide key capabilities for large models, including distributed parallel acceleration, operator and compilation optimization, and cluster-level communication optimization, laying the foundation for AI computing power services.
As mentioned above, training and deploying large models also raises operation and maintenance, energy efficiency, and related issues. On the service side, the Ascend AI cloud service provides longer and more stable AI computing: the 30-day stability rate of thousand-card (1,000-accelerator) cluster training reaches 90%. It also achieves minute-level fault discovery, fault demarcation within 2 hours, and a solution within 24 hours; breakpoint recovery takes no more than 10 minutes, and task recovery less than half an hour.
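Breakpoint recovery of this kind generally rests on periodic checkpointing: training state is saved at intervals so that after a failure the job resumes from the last checkpoint instead of restarting from scratch. Below is a minimal single-process sketch of that pattern; the file path, interval, and toy training step are illustrative assumptions, not Huawei Cloud's implementation:

```python
import json
import os

CKPT_PATH = "train_ckpt.json"  # hypothetical checkpoint file

def save_checkpoint(step, state):
    # Write to a temp file and rename atomically, so a crash
    # mid-write cannot corrupt the last good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # fresh start

def train(total_steps=100, ckpt_every=10):
    # Resume from the last checkpoint if one exists.
    step, state = load_checkpoint()
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # stand-in for a real training step
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step, state

step, state = train()
```

If the process dies at, say, step 57, rerunning `train()` reloads the step-50 checkpoint and repeats only 7 steps of work, which is the trade-off behind the "no more than 10 minutes" recovery figure: more frequent checkpoints mean less lost work but more I/O overhead.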
For model migration, Huawei Cloud provides users with a migration tool chain and integrated full-stack development tools, enabling typical model migration in as little as 2 weeks and self-service migration in mainstream scenarios. The Ascend AI cloud service has also been adapted to the industry's mainstream open-source large models, such as LLaMA and Stable Diffusion, truly allowing it to serve large models in all their many forms.
For an athlete, basic skills are the source of competitiveness. On the cloud large-model track, Huawei Cloud has likewise turned AI computing power into a competitive strength by integrating infrastructure, technology, and services.
The surging and easy-to-use AI computing power is the cornerstone of the industrialization of large models and the beginning of all stories.
Computing power, technology, scenarios: building a positive cycle for large models
Facing the sudden, enormous opportunity of large models, users across thousands of industries have huge and differentiated needs. Some lack computing power, some need models, some are looking for scenarios, and some need specific technical tools. Missing any one of these conditions keeps the large-model flywheel from turning.
Looking at it from another perspective, the scenario-based application of computing resources, technical tools, and models can all become fulcrums. These elements can help business users and developers embark on the road to large models through full-stack cooperation and mutual promotion.
The Ascend AI cloud service provided by Huawei Cloud not only delivers AI computing power that is available, sufficient, and easy to use; it also links with Huawei Cloud's broader portfolio to make large models genuinely useful, so that the Pangu large model can solve difficult problems and do difficult things.
For example, many technological innovations complement the Ascend AI cloud service, jointly tackling the challenge of high demand for, and scarce supply of, AI computing power. Huawei Cloud's newly released distributed QingTian architecture has exactly this effect.
The computing power requirements of large-model applications represented by AIGC depend heavily on distributed computing capabilities, which poses new challenges for the computing architecture. As a new generation of peer-to-peer architecture, the distributed QingTian architecture evolves traditional data center computing clusters into peer-to-peer, pooled system clusters based on high-speed buses, breaking the performance limits of any single component and achieving software-hardware collaboration. Its management and control plane is fully offloaded, with zero loss of resources and performance, ultimately giving users a superior experience in performance, reliability, and security.
In addition, Huawei Cloud has made technological innovations in fields such as AI cloud storage, the GaussDB vector database, digital-intelligence fusion, and Serverless large-model engineering suites, focusing on the evolving needs of AI. These achieve systematic innovation across the system architecture, data processing, model training, and application development layers, clearing obstacles for users of AI cloud services and truly aligning AI computing power with the scenario-based needs of thousands of industries.
On top of the basic AI computing power and this series of technological innovations sits the Pangu series of large models provided by Huawei Cloud. At Huawei Connect 2023, we can also see that the Pangu models have been implemented in deeper industry scenarios.
In the automotive industry, the Pangu automotive large model covers design, production, marketing, R&D, and other aspects of car companies' work, and delivers unique value in fields such as autonomous driving training and special-scenario deployment. For autonomous driving, the model can build a digital twin of a scene from photos and videos collected in the field, adding movable objects and editable weather and lighting to generate scene samples for autonomous driving training. In operating scenarios such as ports and mining areas, its multi-scenario, multi-vehicle control algorithm keeps the lateral error of a 60-ton heavy truck under 0.2 meters and the precise docking error under 0.1 meters. Currently, 23 unmanned heavy trucks operate around the clock at Xinjiang Jiangna Mining Industry and the Inner Mongolia Yimin Open-pit Coal Mine.
In the field of live-stream digital humans, the Pangu digital human large model is pre-trained on a corpus of 100,000-plus high-quality live-stream sales scripts. It can automatically generate professional scripts that introduce products accurately and fluently, and it can capture bullet-screen comments and interact with the audience in real time. In Danzhai, Guizhou, the Pangu digital human model has brought batik, a local intangible-cultural-heritage craft, to the world.
To summarize, it is not difficult to see that Huawei Cloud has formed a positive cycle for large models, with AI computing power as the base, technological innovation as the driver, and the Pangu large model integrated into industry scenarios. Surging AI computing power drives the use of large models; technological innovation keeps lowering the threshold for large models; industry scenarios drive their large-scale implementation. Computing power propels scenarios and technology; technological progress lets computing power be fully released and the value of scenarios be deeply explored; progress in scenarios in turn drives the construction of computing power and leads technological progress. The three roll forward together, drawing thousands of industries to the cloud in search of answers about large models.
Based on computing power and keyed to computing power, letting AI flow into the mountains and rivers: this is Huawei Cloud's long song of AI.
The above is the detailed content of Huawei Cloud brings AI computing power into the world. For more information, please follow other related articles on the PHP Chinese website!