AI Is Advancing Rapidly: Storage Capacity Must Lead the Way
On May 30, at the results conference of the 2023 Zhongguancun Forum, the "Implementation Plan for Beijing to Accelerate the Construction of a Globally Influential Artificial Intelligence Innovation Center (2023-2025)" was officially released. The plan calls for supporting innovation entities in pursuing breakthroughs in technologies such as distributed, efficient deep-learning frameworks and new infrastructure for large models, and for vigorously advancing technological innovation around large models.
The industry regards this as further proof that China will vigorously promote the development of large models. Indeed, from central ministries and commissions down to provinces and cities, policy support for developing AI technology and seizing the large-model opportunity has been mounting; both the pace of policy releases and their strategic weight are striking.
There is reason to believe that China can achieve a leap in AI with large models as the breakthrough point. Having launched its new-generation artificial intelligence development strategy in 2017, China is now positioned to use the current window of opportunity to drive an overall takeoff of the AI industry.
We all know that seizing AI development opportunities requires technological breakthroughs and infrastructure construction. When the infrastructure of the AI industry comes up, AI chips, deep-learning frameworks, and pre-trained large models are generally mentioned. However, another key element is often overlooked: large models bring enormous data pressure, and data storage is likewise the backbone of the AI development process.
ChatGPT is the vanguard of this round of the AI explosion, and the data problems that large-scale deployment of large models will cause are already written in ChatGPT's own story.
Faced with this coming pressure, is China ready?
Viewing the data challenges of the AI boom through ChatGPT
Since Google released BERT in 2018, the industry has set out on the road of pre-trained large models. Large models are characterized by enormous training-data volumes and parameter counts, which pose severe challenges to storage, and this is already evident in ChatGPT.
The "bigness" of pre-trained large models is reflected in deep-learning networks with many layers and many connections, complex parameters, more varied types of training data, and far richer data volumes. When deep learning was first born, mainstream models had only a few million parameters; by the time BERT was released, model parameters had exceeded 100 million, pushing deep learning into the large-model stage. By the ChatGPT stage, mainstream models already have hundreds of billions of parameters, and the industry has even begun planning trillion-parameter models. In just a few years, the parameter counts of AI models have grown by a factor of thousands, and such huge data and models must be stored, as the rough estimate below illustrates. This has become the first major test that the AI explosion poses to storage.
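To get a feel for the scale, here is a minimal back-of-envelope sketch in Python. The parameter counts are the rough figures cited above (BERT-large at ~340 million is a known reference point), and the bytes-per-parameter figures are common rules of thumb for mixed-precision training, not measurements of any particular system:

```python
# Illustrative back-of-envelope storage estimates; parameter counts are
# the rough figures cited in the text, and the bytes-per-parameter rules
# of thumb are common assumptions, not measurements of any real system.

def checkpoint_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of one set of FP16 weights (2 bytes per parameter), in GB."""
    return n_params * bytes_per_param / 1e9

def training_state_gb(n_params: float) -> float:
    """Weights + gradients + Adam optimizer state under mixed precision,
    ~16 bytes per parameter by a widely used rule of thumb."""
    return n_params * 16 / 1e9

for name, n in [("BERT-large", 340e6),
                ("GPT-3 class", 175e9),
                ("1T parameters", 1e12)]:
    print(f"{name:>13}: weights ~{checkpoint_size_gb(n):,.1f} GB, "
          f"full training state ~{training_state_gb(n):,.1f} GB")
```

Even before any training data is counted, a single run at GPT-3 scale must persist terabytes of state, and every retained checkpoint multiplies that figure.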
In addition, it is widely noted that large AI models adopt new model structures, which absorb unstructured data more effectively and robustly. This matters greatly for the final quality of AI, but it also raises a derivative question: how to properly store and recall massive amounts of unstructured data. For example, after its upgrade ChatGPT added multimodal capabilities such as image recognition, so its training data must include large numbers of images on top of text. Likewise, self-driving vehicles need to store large volumes of road-test video every day as a basis for model training. The growth of this unstructured data creates the problem of massive growth in AI-related data, spanning both data storage and data processing.
According to statistics, 80% of the world's newly generated data is unstructured, and it is growing at a compound annual rate of 38%. Coping with this surge of diversified data, compounding year over year as the sketch below shows, has become a hurdle the large-model era must clear.
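A 38% compound annual growth rate is easy to underestimate. This small illustrative calculation, using only the growth figure cited above, shows how quickly it multiplies a data volume:

```python
# How a 38% compound annual growth rate multiplies a data volume.
rate = 0.38  # CAGR for unstructured data cited above
for year in (2, 4, 6, 8, 10):
    print(f"after {year:2d} years: {(1 + rate) ** year:5.1f}x the starting volume")
```

At that rate, an unstructured-data estate grows roughly 25-fold in a decade, which is why capacity planning cannot be linear.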
There is another problem: large models read and call data frequently. ChatGPT's data accesses reach 1.76 billion in a single month, with an average response time within 10 seconds. The workflow of an AI model includes four stages: collection, preparation, training, and inference, and each stage reads and writes different types of data. Large models therefore also impose demands on storage performance.
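For intuition, converting the cited monthly figure into an average request rate is a one-line calculation; note the result is only an average, and real peaks would be far higher:

```python
# Average request rate implied by the monthly access figure cited above.
monthly_accesses = 1.76e9
seconds_per_month = 30 * 24 * 3600          # ~2.59 million seconds
print(f"average: ~{monthly_accesses / seconds_per_month:,.0f} requests/second")
# Peaks will be far higher than this average, and each request must be
# served within the ~10-second response window mentioned in the text.
```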
In addition, the series of data-sovereignty and data-protection disputes surrounding ChatGPT reminds us that large AI models bring new risks to data security. Just imagine: if criminals attacked the database and caused a large language model to generate false information to deceive users, the harm would be both serious and hard to detect.
Overall, impressive as ChatGPT is, it poses challenges to data storage in scale, performance, security, and more. For anyone committed to developing large models and ChatGPT-like applications, storage is a hurdle that must be cleared.
Is China's storage capacity ready?
In recent years, we have kept saying that computing power is productivity. But wherever there is computing power, there must be storage capacity, and the ceiling of storage capacity likewise determines the upper limit of digital productivity.
So, is China's storage capacity ready for the inevitable surge of Chinese large models? Unfortunately, viewed from several angles, China's preparation on the storage side is still insufficient today and needs further upgrading. We can look at several problems in China's storage capacity to judge whether it can cope with the data pressure that large models bring.
1. Insufficient storage capacity caps the development of the AI industry
Large models bring massive amounts of data, so the first priority is to store that data properly. At the current stage, however, China still suffers from insufficient storage capacity, and a large amount of data never even reaches the storage stage. Judging from 2022 figures, China's data production reached an astonishing 8.1 ZB, second in the world; yet China's storage capacity is only about 1,000 EB, meaning the data storage rate is only about 12% (the arithmetic below makes this explicit) and the vast majority of data cannot be effectively retained. China has clearly designated data as the fifth factor of production, and the development of intelligence depends on making full use of data, yet a huge share of that data is never preserved. This is no small problem. China still needs to maintain rapid, large-scale growth in storage capacity in order to seize the AI development opportunity that large models present.
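The 12% figure follows directly from the two numbers cited above; a minimal check, assuming only the unit conversion 1 ZB = 1000 EB:

```python
# Storage rate implied by the 2022 figures cited above (1 ZB = 1000 EB).
data_produced_eb = 8.1 * 1000     # 8.1 ZB of data produced
storage_capacity_eb = 1000        # ~1000 EB of installed capacity
print(f"storage rate ≈ {storage_capacity_eb / data_produced_eb:.0%}")  # ≈ 12%
```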
2. Under the impact of massive data, management and access efficiency are low
As mentioned earlier, the main data challenge brought by large AI models is inefficiency in managing huge data volumes and in handling data acquisition and storage. Improving access efficiency requires data to be written and read in a high-efficiency, low-energy-consumption manner. However, about 75% of data in China is still stored on mechanical hard drives. Compared with flash, mechanical hard drives have lower capacity density, slower reads, higher energy consumption, and poorer reliability. All-flash storage, by contrast, offers high density, low energy consumption, high performance, and high reliability, yet China's all-flash replacement still has a long way to go.
3. Mounting data risks make the storage security situation serious
Data security has become an urgent concern for AI companies and the AI industry as a whole. In 2020, a data security incident at the US company Clearview AI exposed information involving more than 2,000 customers and some 3 billion data records. The case shows how serious the data security situation in the AI industry is, and why security must be addressed starting from the data storage stage. Especially as large AI models play an ever larger role in the national economy and people's livelihoods, storage security capabilities must be strengthened against every possible risk.
Objectively speaking, China's storage capacity has kept up a high development speed, but it still falls short in overall scale, in the proportion of all-flash storage, and in technological innovation capability. The time has come for a storage upgrade that serves industrial intelligence and the large-scale implementation of AI.
Facing the intelligent era: opportunities and directions for the storage industry
Combining the pressure that large AI models, as represented by ChatGPT, put on storage with the current state of China's storage capacity, a clear conclusion follows: China's storage must support the rise of AI and complete a large-scale upgrade.
The development directions of the storage industry are clearly visible, and their urgency and broad headroom together constitute a major opportunity for the industry.
First, the scale of storage capacity must expand, and the build-out of all-flash storage must accelerate.
Replacing mechanical hard disks with all-flash storage, "silicon advancing as magnetic recedes," has been the storage industry's overall trend for many years. Facing the industrial opportunity created by the rise of AI, China's storage industry needs to accelerate the roll-out of all-flash replacement and bring all-flash advantages such as high performance and high reliability fully to bear on the data storage needs of large AI models.
It should also be noted that opportunities for all-flash distributed storage are growing. With the rise of large AI models and the explosion of unstructured data, the importance of data is rising markedly. At the same time, AI has penetrated the production core of large government and enterprise organizations, and more enterprise users prefer to run AI training on-premises and store data via file protocols rather than on public cloud platforms, which has strengthened demand for distributed storage.
Together, these two trends will keep accelerating the adoption of all-flash storage, making it the core track for the development of China's storage industry.
Second, storage technology innovation must improve to meet the development needs of AI models.
As noted above, the data test AI poses is not only one of scale but also one of data complexity and the diversity of application workflows, so the sophistication of storage must improve further. For example, to handle AI's frequent data access, storage read/write bandwidth and access efficiency need upgrading. To meet the data needs of large AI models, the storage industry needs comprehensive technical upgrades.
In terms of data formats, traditional formats such as "files" and "objects" were never designed to match the training needs of AI models, and unstructured data comes in non-uniform formats. As a result, when an AI model calls data, a great deal of work is needed to re-parse and align file formats, which lowers the model's operating efficiency and increases the computing power consumed in training.
To this end, a new "data paradigm" needs to take shape on the storage side. Take autonomous-driving training as an example: many different types of data are involved in training, and a new data paradigm on the storage side can help unify them, adapting the data better to AI model training and thereby accelerating the training of autonomous vehicles. Imagine AI as a new kind of animal that needs a new kind of feed: fed data in traditional formats, it suffers indigestion. The new data paradigm builds, at the storage layer, data that is fully suited to AI, making the process of "feeding the AI" smooth; the toy sketch below illustrates the idea.
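What a unified record format could look like is sketched below in Python. The TrainingRecord class, its fields, and the normalize helper are illustrative inventions under the assumption that heterogeneous samples are wrapped in one self-describing schema; they are not any vendor's actual data paradigm:

```python
# A hypothetical, minimal sketch of a unified record format on the
# storage side: heterogeneous raw inputs (text, images, video frames)
# are wrapped in one self-describing schema so the training pipeline
# reads a single format. TrainingRecord and its fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    modality: str                                  # "text", "image", "video", ...
    payload: bytes                                 # raw sample bytes
    metadata: dict = field(default_factory=dict)   # labels, timestamps, source

def normalize(path: str, modality: str, **meta) -> TrainingRecord:
    """Wrap any raw file in the common record format; bytes are unchanged."""
    with open(path, "rb") as f:
        return TrainingRecord(modality=modality, payload=f.read(), metadata=meta)

# A trainer then consumes one schema regardless of source, e.g.:
# record = normalize("frame_0001.jpg", "image", vehicle="test-car-07")
```

The design point is that format alignment happens once, at ingest, instead of repeatedly inside every training job.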
In AI development work, data management accounts for a huge share of the workload, and data-silo problems exist between different data sets; data fabric (data weaving) technology can deal with both effectively. Through a data fabric, storage gains built-in data analysis capabilities and can integrate physically and logically scattered data into a global view with data scheduling and flow capabilities, thereby managing the massive data AI brings and improving data utilization efficiency. A toy sketch of the idea follows.
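This sketch models only the core data-fabric idea described above: a catalog indexes metadata from every silo and answers queries with a global view, while the underlying bytes stay where they live. The class, store names, and fields are all hypothetical:

```python
# A toy sketch of the data-fabric concept: index metadata across silos,
# query through one global view, move no data. All names are hypothetical.
from typing import Iterator

class DataFabricCatalog:
    def __init__(self) -> None:
        self._index: list[dict] = []   # the global metadata view

    def register(self, store: str, path: str, tags: set[str]) -> None:
        """Index a dataset in place; no data is copied or moved."""
        self._index.append({"store": store, "path": path, "tags": tags})

    def find(self, tag: str) -> Iterator[dict]:
        """Query across all silos through the unified view."""
        return (e for e in self._index if tag in e["tags"])

catalog = DataFabricCatalog()
catalog.register("nas-beijing", "/ad/road_tests/2023-05", {"video", "autonomous-driving"})
catalog.register("object-lab", "corpus/zh-news", {"text", "pretraining"})
for entry in catalog.find("autonomous-driving"):
    print(entry["store"], entry["path"])
```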
These technological innovations on the storage side can form a closer fit between data storage and AI development.
In addition, security must be built into storage itself, strengthening its capacity for proactive defense.
As the value of AI grows, data security incidents cost enterprise users more, so enterprises must improve their data security capabilities. The most important step is to improve data resilience: make the storage itself secure and protect data at the source. Going forward, more data resilience capabilities will be embedded into data storage products, such as ransomware detection, data encryption, security snapshots, and AirGap isolated-recovery features; the illustrative sketch below models the security-snapshot idea.
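As an illustration only: a security snapshot is a read-only, time-locked copy that cannot be altered or deleted before its retention period expires. Real products enforce this inside the array or filesystem; the class below merely models the behavior under that assumption:

```python
# An illustrative model of the "security snapshot" concept: a read-only,
# time-locked copy that resists modification or deletion until its
# retention period expires. This models the behavior only; real systems
# enforce immutability at the storage layer, not in application code.
import time

class SecureSnapshot:
    def __init__(self, data: bytes, retention_seconds: int) -> None:
        self._data = bytes(data)                       # immutable copy
        self._expires = time.time() + retention_seconds

    def read(self) -> bytes:
        return self._data                              # reads always succeed

    def delete(self) -> None:
        if time.time() < self._expires:
            raise PermissionError("snapshot is locked until retention expires")
        self._data = b""
```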
It is worth noting that the industry has already begun exploring comprehensive storage upgrades in response to the rise of large AI models. Through high-quality all-flash products, Huawei Storage integrates advanced storage technology with built-in security capabilities, so that storage innovation and AI development fit closely and advance toward each other.
Overall, the development of the storage industry and the progress of China's storage capacity are of decisive significance to the implementation of large AI models and to the intelligent upgrading of thousands of industries. Without storage development, the data flood that AI brings cannot be properly absorbed, and AI technology itself could become a tree without roots for lack of data to support it.
The storage industry faces the opportunities and the responsibilities of the intelligent era at once. With the continued exploration of excellent brands such as Huawei, China's storage faces unprecedented opportunities while shouldering the responsibilities the times have given it.
Many industry experts call large language models the "iPhone moment" in the history of AI. If so, the wave of storage upgrades driven by AI may likewise become a milestone for China's storage industry, and the prelude to a golden age.