Over the past year, large models have developed very rapidly. The stacking of computing power and data has given models a general-purpose structure and the ability to answer questions, bringing people to the stage of artificial intelligence they have long dreamed of. For example, when chatting with a large language model, you feel you are facing not a stiff robot but a flesh-and-blood person, which opens up far more room for imagination. Human-computer interaction originally required the keyboard and mouse, conveying our instructions to the machine in formatted ways. Now people can interact with computers through language, and machines can understand what we mean and respond.
To keep up with this trend, many technology companies have begun to focus on large model research. 2023 is considered the first year of artificial intelligence, much as the launch of the iPhone opened a new era of mobile Internet. The real breakthrough this time lies in the application of large-scale computing power and massive data.
From the perspective of model structure, the Transformer architecture has actually been around for quite some time. The GPT model was in fact published earlier than BERT, but due to the computing power limitations of the time, GPT was far less effective, so BERT became popular first and was applied to tasks such as translation with very good results. This year, however, the focus has shifted to GPT, and the reason behind it is very high computing power. Thanks to the efforts of hardware manufacturers and progress in packaging and memory chips, we can now stack very high computing power, which enables a deeper understanding of more data and brings breakthrough results in AI. With strong support from the underlying platform, algorithm engineers can develop and iterate models more conveniently and efficiently, driving rapid model evolution.
The general model development cycle is shown in the figure below:
Many people think model training is the most critical step, but in fact a large amount of data must be collected, cleaned, and managed before training even begins. This process involves many checks, such as whether there is dirty data and whether the statistical distribution of the data is representative. After the model is produced, it must be tested and verified; this too is a verification of the data, since the data provides feedback on the model's effectiveness.
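For illustration, here is a minimal sketch in Python (using pandas, with hypothetical column names) of the kind of pre-training checks described above: dirty-data detection and a rough look at whether the label distribution is representative.

```python
import pandas as pd

def validate_dataset(df: pd.DataFrame, label_col: str = "label") -> dict:
    """Minimal pre-training data checks: dirty values and label distribution."""
    report = {}
    # Dirty data: missing values and exact duplicate rows.
    report["null_ratio"] = df.isna().mean().to_dict()
    report["duplicate_rows"] = int(df.duplicated().sum())
    # Distribution: is any label class severely under-represented?
    label_freq = df[label_col].value_counts(normalize=True)
    report["label_distribution"] = label_freq.to_dict()
    report["min_class_ratio"] = float(label_freq.min())
    return report

# Example: flag the dataset if any class falls below 1% of samples.
df = pd.DataFrame({"text": ["a", "b", "b"], "label": [0, 1, 1]})
report = validate_dataset(df)
if report["min_class_ratio"] < 0.01:
    print("Warning: label distribution may not be representative")
```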
Better machine learning is 80% data plus 20% model, and the focus should be on the data.
This also reflects the evolution of model development: it was originally model-centric, but has now become data-centric.
In the early days of deep learning, supervised learning was the main paradigm, and the most important thing was labeled data. Labeled data is divided into two sets: training data and validation data. The training data is used to train the model, and the validation data is used to check whether the model gives good results. Labeling is very expensive because it requires human annotators, so to improve model performance, a great deal of time and manpower went into the model structure, improving generalization through structural changes and reducing overfitting. This is the model-centric development paradigm.
With the accumulation of data and computing power, unsupervised learning gradually came into use. With massive data, the model can autonomously discover the relationships that exist in the data. At this point, development enters a data-centric paradigm.
In the data-centric development paradigm, model structures are similar, basically stacks of Transformers, so more attention is paid to how data is used. Using data involves a great deal of cleaning and comparison, which takes a lot of time because the volumes are so large. How precisely the data is controlled determines how fast the model converges and iterates.
Alibaba Cloud has always emphasized the integration of AI and big data. We therefore built a platform with very strong infrastructure, including high-bandwidth GPU clusters for high-performance AI computing power and CPU clusters for cost-effective storage and data management. On top of this we built an integrated big data and AI PaaS platform, which includes a big data platform, an AI platform, a high-computing-power platform, a cloud-native platform, and more. The engine layer includes streaming computing, the big data offline computing engine MaxCompute, and the AI platform PAI.
In the service layer sit the large model application platform Bailian and the open source model community ModelScope. Alibaba has been actively promoting model community sharing, hoping that the Model-as-a-Service concept will inspire more users with AI needs to build AI applications quickly on top of these models' basic capabilities.
The following two cases illustrate why big data and AI need to be linked.
In a large model question answering system, one starts from the base model, embeds the target documents, and stores the embedding results in a vector database. The number of documents can be very large, so embedding requires batch-processing capability, and the base model's inference service is itself resource-intensive, depending of course on how big the base model is and how it is parallelized. All generated embeddings are poured into the vector database. At query time, the query must also be vectorized, and then, through vector retrieval, knowledge potentially relevant to the question is extracted from the vector database. This demands very good inference service performance.
After retrieving the vectors, the documents they represent are used as context to constrain the large model, which answers on that basis. The resulting answers are far better than what users could find by searching on their own, and they come in natural language.
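A minimal sketch of this retrieval-augmented question answering loop is shown below. The embedding model, vector store, and LLM are all hypothetical placeholders; in practice the embedding step runs as a batch job on the big data platform, and the vector database and inference service are production systems.

```python
import numpy as np

# Hypothetical components: a real system would plug in a trained embedding
# model served in batch, a production vector database, and an LLM service.
def embed(texts):
    """Placeholder embedding model (random vectors stand in for real ones)."""
    rng = np.random.default_rng(0)
    return rng.random((len(texts), 768))

class VectorStore:
    """Toy in-memory vector database with cosine-similarity search."""
    def __init__(self, docs):
        self.docs = docs
        self.vecs = embed(docs)              # offline step: batch-embed all docs
    def search(self, query_vec, k=3):
        sims = self.vecs @ query_vec / (
            np.linalg.norm(self.vecs, axis=1) * np.linalg.norm(query_vec))
        return [self.docs[i] for i in np.argsort(-sims)[:k]]

def answer(question, store, llm):
    q_vec = embed([question])[0]             # the query is vectorized too
    context = "\n".join(store.search(q_vec)) # retrieve related knowledge
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return llm(prompt)                       # the context constrains the model
```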
The above process requires both an offline distributed big data platform to generate embeddings quickly and an AI platform for large model training and serving, connected into one pipeline, to form a large model question answering system.
Another example is personalized recommendation. These models often require high timeliness because everyone's interests and tastes change. To capture these changes, a streaming computing system analyzes the data collected in the app, and the extracted features are continuously fed into online model learning: whenever new data arrives, the model is updated, and the new model then serves customers. This scenario therefore requires streaming computing capabilities as well as model serving and training capabilities.
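The sketch below illustrates the online-learning half of this loop, assuming a logistic click model updated one event at a time; the event stream and features are hypothetical stand-ins for what a streaming engine such as Flink would produce.

```python
import numpy as np

class OnlineRecommender:
    """Minimal online-learning scorer: the model updates on every event."""
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.lr = lr
    def predict(self, x):
        # Serve with the latest weights (logistic click-probability score).
        return 1.0 / (1.0 + np.exp(-self.w @ x))
    def update(self, x, clicked):
        # One SGD step per incoming event; no offline retraining needed.
        self.w -= self.lr * (self.predict(x) - clicked) * x

# In production the (features, click) events arrive from the streaming
# system; a random generator stands in for that stream here.
rng = np.random.default_rng(0)
model = OnlineRecommender(n_features=8)
for _ in range(100):
    x, clicked = rng.random(8), rng.integers(0, 2)
    model.update(x, clicked)     # the model refreshes as interests change
```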
Through the above cases, we can see that the combination of AI and big data has become an inevitable development trend. Based on this concept, we first need to have a workspace that can manage the big data platform and the AI platform together. This is why the AI workspace was born.
This AI workspace supports Flink clusters, the offline computing cluster MaxCompute, AI platforms, container service computing platforms, and more.
Unifying big data and AI is only the first step; what matters more is connecting them in a workflow. Workflows can be built in many ways, such as through an SDK, a graphical UI, or a written spec. The nodes in the workflow can be big data processing nodes or AI processing nodes, so complex processes can be connected end to end.
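As a sketch, a workflow spec of this kind might look like the following; the node types and field names are hypothetical, not any platform's actual format.

```python
# Hypothetical workflow spec mixing big data nodes and AI nodes; actual
# platforms define their own SDKs and spec formats, this only shows the idea.
workflow = {
    "name": "doc-qa-pipeline",
    "nodes": [
        {"id": "clean", "type": "maxcompute_sql",      # big data node
         "sql": "INSERT OVERWRITE TABLE docs_clean "
                "SELECT id, text FROM docs_raw WHERE text IS NOT NULL"},
        {"id": "embed", "type": "batch_inference",     # AI node
         "model": "text-embedding", "input": "docs_clean"},
        {"id": "train", "type": "model_training",      # AI node
         "input": "docs_clean"},
    ],
    "edges": [("clean", "embed"), ("embed", "train")], # the connected flow
}
```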
To further improve efficiency and reduce costs, serverless cloud-native services are needed. What serverless means is described in detail in the image above. Cloud native has many levels, from share nothing (the non-cloud approach) to share everything (the fully cloud approach). The higher the level, the higher the degree of resource sharing and the lower the unit computing cost, but the greater the pressure on the system.
Over the past two years, the big data and database fields have gradually moved toward serverless, also for cost reasons. Originally, even services on the cloud, such as cloud databases, existed as instances, and behind each instance lay the shadow of resources: how many CPUs and cores the instance had. The first level of the gradual shift to serverless is single-tenant computing, which means setting up a cluster on the cloud and deploying big data or database platforms in it. The cluster is single-tenant in that it shares physical machines with other users; the physical machines are virtualized into virtual machines on which the big data platform is built. This is single-tenant computing, single-tenant storage, and single-tenant management and control. What users get are elastic ECS machines on the cloud, but big data management, operations, and maintenance must be handled by themselves. EMR is the classic solution of this kind.
Gradually we move from single-tenant storage to shared storage, which is the data lake solution. Data lives in a more shared big data system; computation dynamically spins up a cluster, and when the computation finishes the cluster goes away, but the data does not, because it sits on reliable remote storage. This is shared storage. Typical examples are the data lake DLF and serverless EMR solutions.
The most extreme form is share everything. With BigQuery or Alibaba Cloud's MaxCompute, you see only a platform and some virtualized project management: the user submits a query, and the platform performs metering and billing based on that query.
This brings many benefits. For example, many nodes in big data computation require no user code, because they are built-in operators such as join and aggregation. Since these are deterministic operators that have been rigorously tested and contain no malicious code or arbitrary UDF code, they do not need a relatively heavy sandbox, and the virtualization overhead can be eliminated.
The benefit of UDFs is flexibility: they make it possible to process rich data and scale well when data volumes are large. The challenge UDFs bring, however, is the need for security and isolation.
Both Google's BigQuery and MaxCompute are based on the share-everything architecture. We believe that only through continuous technical improvement can resources be used more efficiently and computing costs be reduced, so that more companies can afford to work with data at this scale and put it to use in model training.
Because of share everything, we can not only manage big data and AI in a unified way through the workspace and connect them through PAI-flow, but also schedule them uniformly under the share-everything model. In this way, the R&D costs of enterprise AI and big data are further reduced.
There is a lot of work to be done here. K8S scheduling itself is oriented toward microservices, which poses great challenges for big data, because big data scheduling granularity is very fine: many tasks live only a few seconds to tens of seconds, so overall scheduling pressure increases by several orders of magnitude. We mainly need to solve how to scale this scheduling capability on K8S. The Koordinator open source project we launched aims to improve scheduling capability and integrate big data and AI into the K8S ecosystem.
Another important task is multi-tenant security isolation: how to implement multi-tenancy in the K8S service and control layers, and how to implement network-level multi-tenancy through overlays, so that multiple users can be served on one K8S cluster while each user's data and resources remain effectively isolated.
Alibaba Cloud has launched a container service called ACS, which uses the two technologies introduced above to expose all resources through containers, enabling users to work with the big data platform and the AI platform seamlessly. It is multi-tenant and supports the scheduling requirements of big data, which are several orders of magnitude higher than those of microservices and AI and must be handled well. On this basis, ACS helps customers manage their resources well.
Enterprises have many demands and need to manage resources more carefully. For example, an enterprise is divided into departments and sub-teams. When building large models, resources are split across many directions, with each team pursuing divergent innovation to find the scenarios where the base model can be applied well. But at certain moments, the enterprise wants to concentrate its forces and pool all computing power and resources to train the next iteration of the base model. To solve this problem, we introduced multi-level quota management: when a higher-priority task arrives, a higher level can merge and consolidate all the sub-quotas below it.
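A minimal sketch of the idea, with GPUs as a hypothetical resource unit: each quota node owns some capacity, and a higher level can consolidate everything below it when a big job arrives.

```python
class Quota:
    """Sketch of a multi-level quota node (hypothetical resource unit: GPUs)."""
    def __init__(self, name, gpus=0):
        self.name, self.gpus, self.children = name, gpus, []
    def add(self, child):
        self.children.append(child)
        return child
    def consolidate(self):
        # Merge all sub-quotas upward so one high-priority job gets the pool.
        for child in self.children:
            child.consolidate()
            self.gpus += child.gpus
            child.gpus = 0

org = Quota("org")
org.add(Quota("team-recsys", gpus=64))   # teams innovate divergently...
org.add(Quota("team-nlp", gpus=128))
org.consolidate()                        # ...then pool for the next base model
print(org.gpus)                          # 192
```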
AI scenarios in fact have many particularities. Synchronous computation is often required, and it is very latency-sensitive; AI computing density is high, and the demands on the network are very high. To keep the computing power fed, data must be supplied and gradient information exchanged, and with model parallelism even more is exchanged. In these cases, to ensure communication does not become the weak link, topology-aware scheduling is needed.
For example, in the all-reduce phase of model training, random scheduling produces many connections that cross port switches, whereas finely controlled placement keeps cross-switch connections very clean, so latency can be well guaranteed because there are no conflicts at the upper-layer switches.
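The placement idea can be sketched in a few lines: group workers by their leaf switch so that ring all-reduce neighbors sit next to each other, leaving only a handful of links that cross the upper-layer switches. Worker and switch IDs here are hypothetical.

```python
from collections import defaultdict

def topology_aware_order(workers):
    """Order workers so ring all-reduce neighbors share a leaf switch.
    `workers` is a list of (worker_id, switch_id) pairs."""
    by_switch = defaultdict(list)
    for worker, switch in workers:
        by_switch[switch].append(worker)
    order = []
    for switch in sorted(by_switch):   # keep each switch's workers adjacent
        order.extend(by_switch[switch])
    # In the resulting ring, only the boundaries between switch groups
    # cross the upper-layer switch, instead of nearly every hop.
    return order

workers = [("w0", "s1"), ("w1", "s0"), ("w2", "s1"), ("w3", "s0")]
print(topology_aware_order(workers))   # ['w1', 'w3', 'w0', 'w2']
```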
With these optimizations, performance can be greatly improved. How to surface topology-aware scheduling to the manager of the entire platform is also an issue to consider as AI expands data platform management.
The previous sections covered the management of resources and platforms; the management of data is equally crucial. What we have long worked on is the data warehouse system: data governance, data quality, and so on. To connect the data system with the AI system, the data warehouse must provide an AI-friendly data link. For example, AI development happens in the Python ecosystem, so the question is how the data side can be used through a Python SDK. The most popular data structure in Python is the pandas-style DataFrame, so we can wrap the big data engine's client side in a pandas-like interface, allowing all AI developers familiar with Python to use the data platform well. This is the philosophy behind MaxFrame, the framework we launched on MaxCompute this year.
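The sketch below only illustrates the general idea of hiding a big data engine behind a pandas-style client that builds queries lazily; it is not MaxFrame's actual API, and the table and column names are hypothetical.

```python
# Hypothetical pandas-style client over a big data engine. Filtering builds
# a lazy query plan locally; the engine executes the SQL remotely.
class EngineFrame:
    def __init__(self, table, filters=None):
        self.table = table
        self.filters = filters or []
    def __getitem__(self, condition):
        # Each filter returns a new frame, like pandas boolean indexing.
        return EngineFrame(self.table, self.filters + [condition])
    def to_sql(self):
        where = " AND ".join(self.filters) if self.filters else "1=1"
        return f"SELECT * FROM {self.table} WHERE {where}"

logs = EngineFrame("user_logs")
clicks = logs["event = 'click'"]       # feels like pandas filtering
print(clicks.to_sql())                 # SELECT * FROM user_logs WHERE event = 'click'
```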
Data processing systems are in many cases highly cost-sensitive, and data warehouses are sometimes kept on higher-density storage systems. So that such a system is not wasted, many GPUs are deployed alongside it; these high-density clusters place heavy demands on the network and the GPUs, and the two systems tend toward storage-compute separation. Our data system leans toward governance and management, while the computing system leans toward computation, connected remotely even though both are managed under one K8S. To avoid waiting for data during computation, we built the dataset acceleration DataSetAcc, which is essentially a data cache: it seamlessly connects to the data on remote storage nodes and, behind the scenes, pulls it into local memory or SSD for the algorithm engineers' computation.
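A minimal sketch of such a cache, with hypothetical paths and a local file copy standing in for the remote fetch: a background thread prefetches data so the GPU never waits on remote storage.

```python
import os, queue, shutil, threading

class DatasetCache:
    """Sketch of a DataSetAcc-style cache (hypothetical paths): a background
    thread pulls remote files to local SSD before training needs them."""
    def __init__(self, local_dir="/ssd/cache"):
        self.local_dir = local_dir
        self.pending = queue.Queue()
        threading.Thread(target=self._fetch_loop, daemon=True).start()
    def prefetch(self, remote_path):
        self.pending.put(remote_path)          # enqueue ahead of training
    def local_path(self, remote_path):
        return os.path.join(self.local_dir, os.path.basename(remote_path))
    def _fetch_loop(self):
        while True:
            remote = self.pending.get()
            local = self.local_path(remote)
            if not os.path.exists(local):
                shutil.copy(remote, local)     # stands in for a remote fetch
```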
Through the above methods, the AI and big data platforms can be organically combined, enabling new work. For example, when supporting model training for general-purpose model series, a great deal of data needs cleaning, because Internet data contains heavy duplication, so deduplicating data through a big data system is critical. Precisely because the two systems are organically combined, it is easy to clean the data on the big data platform and feed the results into model training immediately.
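As a toy illustration of the deduplication step, the sketch below drops exact duplicates by content hash; a web-scale pipeline would run the same idea as a distributed job on the big data platform and typically add near-duplicate detection such as MinHash.

```python
import hashlib

def dedup(texts):
    """Drop exact duplicates by normalized content hash."""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

corpus = ["Hello world.", "hello world.", "Training data matters."]
print(dedup(corpus))   # ['Hello world.', 'Training data matters.']
```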
The preceding sections mainly covered how big data supports AI model training. Conversely, AI technology can also assist data insight, moving toward a BI + AI data processing model.
In data processing, AI can help data analysts build analyses more easily. Originally they might have to write SQL and learn how to interact with tools and data systems. The AI era has changed human-computer interaction: one can now talk to data systems in natural language. For example, a Copilot-style programming assistant can generate SQL and help complete the various steps of the data development process, greatly improving development efficiency.
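A sketch of what such SQL generation looks like; `llm` here is a hypothetical text-completion callable, not any specific product's API, and the schema is made up.

```python
def nl_to_sql(question, schema, llm):
    """Copilot-style SQL generation from a natural language question."""
    prompt = (
        f"Table schema:\n{schema}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only the SQL."
    )
    return llm(prompt)

schema = "sales(region STRING, amount DOUBLE, dt DATE)"
# nl_to_sql("total sales per region last month", schema, llm)
# -> e.g. SELECT region, SUM(amount) FROM sales
#         WHERE dt >= DATE '2024-01-01' GROUP BY region
```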
In addition, AI can power data insights. For a given piece of data, questions such as how many unique keys it has and which visualization method suits it can all be answered with AI. AI can observe and understand data from many angles, enabling automatic data exploration, intelligent data queries, chart generation, one-click analysis reports, and more. This is an intelligent analysis service.
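A toy sketch of automatic data exploration in this spirit: detect candidate unique keys and suggest a chart type per column with simple heuristics (a real service would use a model rather than fixed rules).

```python
import pandas as pd

def explore(df: pd.DataFrame):
    """Find candidate unique keys and suggest a chart type per column."""
    n = len(df)
    for col in df.columns:
        distinct = df[col].nunique()
        if distinct == n:
            hint = "candidate unique key"
        elif pd.api.types.is_numeric_dtype(df[col]):
            hint = "histogram"            # continuous values: show distribution
        elif distinct <= 10:
            hint = "bar or pie chart"     # low cardinality: show categories
        else:
            hint = "top-N bar chart"
        print(f"{col}: {distinct} distinct values -> {hint}")

explore(pd.DataFrame({"id": [1, 2, 3], "city": ["a", "b", "a"]}))
```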
Driven by big data and AI, recent years have brought some very gratifying technological developments. To stay ahead of this trend, big data and AI must be linked; only when the two complement each other can we achieve faster AI iteration and better data understanding.