One article to understand the implementation process of the Dewu cloud-native AI platform KubeAI
In the past few years, the cloud-native field, represented by container technology, has received great attention and developed rapidly. Containerization is an important step for enterprises to reduce costs and increase efficiency. To date, Dewu has essentially completed containerization across all domains. During this process, service deployment and O&M were smoothly switched from the previous ECS mode to a containerized mode, and the company achieved substantial gains in resource utilization and R&D efficiency.
Dewu is a new-generation trendy online shopping community. Search engines and personalized recommendation systems built on AI and big data technology are strong supports for the business, so applications in the algorithm domain account for about 10% of business applications, a sizable share. During containerization, given the differences between the R&D process of algorithm services and that of ordinary services, and after fully investigating the needs of algorithm engineers, we built the Dewu cloud-native AI platform, KubeAI, for R&D scenarios in the algorithm domain. After continuous iteration of features and expansion of supported scenarios, KubeAI now supports business domains involving AI capabilities such as CV, search and recommendation, risk control algorithms, and data analysis, and has completed their containerization, with good results in both resource utilization and R&D efficiency. This article walks through the implementation process of KubeAI.
AI business development is largely an iterative process of model development. The development of a model can usually be summarized in the following steps:
Determine the demand scenario: This step must clearly define what problem is to be solved, which scenario the capability serves, and what the inputs and outputs of the function/service are. For example: which brand of shoes is to be identified or quality-inspected, what the product features of that brand are, which feature dimensions the samples have, the feature types, and so on. Different scenarios have different requirements for the sample data and the processing algorithms used.
Data preparation: Based on the results of the scenario analysis, obtain sample data through various channels (online/offline/mock, etc.) and perform the necessary cleaning, labeling, and other operations on it. This step is the foundation of AI business development, because all subsequent steps are carried out on the basis of the data.
Determine the algorithm and write the training script: Based on their understanding of the business goal, algorithm engineers select a suitable algorithm and write the model training script, drawing on past experience or on scenario research and experiment results.
Model training: An algorithm model can be understood as a complex mathematical formula with many parameters, like w and b in f(x) = wx + b. Training is the process of using a large amount of sample data to find the optimal parameters so that the model achieves a high recognition rate. Model training is a very important part of the AI development process; it can be said that the achievement of the business goal depends on the accuracy of the model. This step therefore requires more time, energy, and resources than the others, and training must be repeated experimentally to reach the best model accuracy and prediction accuracy. Nor is it a one-time event: it must be carried out periodically as the business scenario and the data are updated. For developing and training algorithm models, the industry offers several mainstream AI engines to choose from, such as TensorFlow, PyTorch, and MXNet. These engines provide a degree of API support that makes it easier for algorithm developers to run distributed training of complex models or to apply hardware-specific optimizations that improve training efficiency. The output of model training is a model file, whose content can be understood as the saved parameters of the model.
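To make "finding w and b" concrete, here is a minimal sketch (not Dewu's actual training code) of fitting f(x) = wx + b by gradient descent on toy data, using plain NumPy:

```python
import numpy as np

# Toy data generated from a "true" model y = 3x + 2, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

# Start from arbitrary parameters and refine them with gradient descent
# on the mean squared error -- this search for w and b is "training".
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")  # should approach w=3, b=2

# "Saving the model" amounts to persisting the learned parameters.
np.save("model.npy", np.array([w, b]))
```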
Model evaluation: To prevent underfitting caused by high bias or overfitting caused by high variance, evaluation metrics are usually needed to help developers judge the model's generalization ability. Commonly used metrics include precision, recall, the ROC curve/AUC, and the PR curve.
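As an illustration, several of these metrics can be computed with scikit-learn; the labels and scores below are made up:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# y_true: held-out labels; y_prob: model scores; threshold at 0.5 for labels.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9, 0.1, 0.3, 0.7]
y_pred = [int(p >= 0.5) for p in y_prob]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))
```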
Model deployment: After repeated training and evaluation, an ideal model is obtained that can help the business process online/production data. The model must then be deployed as a service or a task so that it can accept input data and return inference results. We call such a service a model service. A model service is an online service script that loads the model and, once ready, performs inference on preprocessed data.
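A model service can be as simple as a script that loads the parameter file and answers requests. Here is a minimal sketch with Flask, reusing the hypothetical model.npy from the training sketch above:

```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained parameters once at startup; each request then only
# runs the (cheap) inference computation.
w, b = np.load("model.npy")

@app.route("/predict", methods=["POST"])
def predict():
    x = float(request.get_json()["x"])  # minimal preprocessing
    return jsonify({"y": float(w * x + b)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```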
After a model service goes live, it will need to be iterated as data characteristics change, algorithms are upgraded, the online inference script is upgraded, or the business sets new requirements for inference metrics. Note that such an iteration may require re-training and re-evaluating the model, or it may only involve an iterative upgrade of the inference script.
Since last year, we have gradually promoted containerization of business services across Dewu's domains. To avoid forcing users to change their operating habits as the deployment method changed, we kept the existing publishing-platform deployment process, shielding users from the differences between container deployment and ECS deployment.
In the CI process, different compile-and-build templates are customized for different development languages. Everything from source compilation to container image production is handled uniformly by the container platform layer, which spares ordinary developers, who may lack container knowledge, from having to turn their project code into a container image themselves. In the CD process, we manage configuration hierarchically at the application-type, environment, and environment-group levels. At deployment time, the multi-layer configuration is merged into the values.yaml of a Helm chart and the orchestration file is submitted to the container cluster. Users only need to set the relevant environment variables according to their actual needs and submit the deployment to obtain an application cluster instance (a container instance, analogous to an ECS service instance).
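The layered merge can be pictured as a recursive dictionary merge whose result is dumped to values.yaml; the layer contents below are hypothetical:

```python
import yaml

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; more specific layers win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical config layers, from most generic to most specific.
app_type_cfg = {"replicas": 2, "resources": {"cpu": "1", "memory": "2Gi"}}
env_cfg = {"env": {"LOG_LEVEL": "info"}, "resources": {"memory": "4Gi"}}
env_group_cfg = {"env": {"LOG_LEVEL": "debug"}, "replicas": 1}

values = deep_merge(deep_merge(app_type_cfg, env_cfg), env_group_cfg)
with open("values.yaml", "w") as f:
    yaml.safe_dump(values, f)

# The merged values.yaml can then be passed to the chart, e.g.:
#   helm upgrade --install my-app ./chart -f values.yaml
```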
For application cluster O&M, the container platform provides a WebShell for logging in to container instances, just like logging in to an ECS instance, which makes it easy to troubleshoot application process problems; it also provides file upload and download, instance restart and rebuild, resource monitoring, and other O&M functions.
AI business (CV, search and recommendation, risk control algorithm services, etc.), as a relatively large part of the business, took part in the containerization process alongside ordinary services, and we gradually completed the migration of core scenario services such as the community and transaction waterfall feeds and the "King Kong" icon-grid positions. After containerization, resource utilization improved greatly in both the test and production environments, and O&M efficiency doubled.
Containerization coincided with the rapid development of the company's technology system, and the initially immature AI R&D process placed more demands on it. This let us, who had originally focused only on containerizing online inference services, see the pain points and difficulties algorithm engineers face during model development.
Pain point 1: Model management and inference service management are disconnected. Most CV models are trained on desktop machines and manually uploaded to OSS, after which the model file's OSS address is configured into the online inference service. Most search-and-recommendation models are trained on PAI, but are likewise manually stored on OSS and released in a similar way to CV models. Model management is thus disconnected between training and release: it is impossible to trace which services a model is deployed on, or to see intuitively which model or models a given service uses, and model version management is inconvenient.
Pain point 2: Preparing a model development environment takes a long time, and there is no flexible mechanism for applying for and using resources. Before containerization, resources were generally provided as ECS instances; applying for them required going through a process, after which all kinds of initialization had to be done: installing software and dependencies and transferring data (most software libraries used in algorithm work are large and complicated to install). If resources later proved insufficient, the same application process had to be repeated, which was inefficient. Meanwhile, resource applications were subject to quota (budget) constraints and lacked a mechanism for autonomous management and flexible, on-demand application and release.
Pain point 3: It was difficult to try model solutions supported by cloud native. As cloud-native technology lands in more and more fields, solutions such as Kubeflow and Argo Workflow provide good support for AI scenarios. For example, tfjob-operator manages distributed training tasks based on the TensorFlow framework as a CRD; users only need to set the parameters of the corresponding components (Chief, PS, Worker, etc.) before submitting training tasks to the Kubernetes cluster. Before containerization, an algorithm engineer who wanted to try this solution had to be familiar with Docker, Kubernetes, and other container knowledge, and could not use the capability as an ordinary user.
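For illustration, submitting a TFJob through the Kubernetes Python client might look like the sketch below. It assumes the Kubeflow training operator is installed in the cluster; the image name and namespace are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

def replica(count: int, image: str) -> dict:
    """One component (Chief/PS/Worker) of the distributed training job."""
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "tensorflow",  # the TFJob operator expects this name
            "image": image,
            "command": ["python", "/train.py"],
        }]}},
    }

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "demo-training", "namespace": "ai-jobs"},
    "spec": {"tfReplicaSpecs": {
        "Chief": replica(1, "my-registry/train:latest"),
        "PS": replica(2, "my-registry/train:latest"),
        "Worker": replica(4, "my-registry/train:latest"),
    }},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="ai-jobs", plural="tfjobs", body=tfjob,
)
```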
Pain point 4: When non-algorithm departments wanted to quickly verify an algorithm, they could not find a platform to support it well. AI capabilities are clearly useful across business domains, especially mature algorithms, which business teams can readily use for baseline metric prediction or classification to help the business perform better. What is needed is a platform that provides AI-related capabilities, satisfies these scenarios' needs for heterogeneous resources (CPU/GPU/storage/network, etc.) and algorithm model management, and offers ready-to-use functionality.
Based on our sorting and analysis of the above pain points, and on other requirements algorithm engineers raised for the container platform during containerization (unified model management, log collection, resource pools, data persistence, etc.), we discussed and solved them one by one. While solving the immediate problems, we also considered the platform's long-term feature planning, and gradually built the KubeAI platform solution on top of the container platform, oriented to the AI business.
After thoroughly researching the basic architecture and product forms of AI platforms in the industry, and focusing on AI business scenarios and their surrounding needs, the container technology team designed and gradually implemented the cloud-native AI platform, KubeAI, during the containerization process. KubeAI focuses on solving the pain points of algorithm engineers, providing the necessary functional modules throughout model development, release, and O&M, helping algorithm developers obtain and use AI infrastructure resources quickly and cost-effectively, and enabling them to design, develop, and experiment with algorithm models efficiently.
The KubeAI platform provides the following functional modules around the entire life cycle of the model:
Dataset management: mainly compatible with different data sources, and provides data caching and acceleration capabilities.
Model training: It not only provides Notebook for model development and training, but also supports the management of one-time/periodic tasks; in this process, heterogeneous resources (CPU/GPU/storage) are elastically applied for and released.
Model management: Unified management of model metadata (basic model information, version list, etc.), seamlessly connected with model service release and operation and maintenance processes.
Inference service management: Decouples the model from the inference code, eliminating the need to package the model into the image, which improves the efficiency of inference service updates; supports model upgrades for online services.
AI-Pipeline engine: supports orchestrating tasks as pipelines to meet the needs of data analysis, periodic routine model training, model iteration, and other scenarios.
The KubeAI platform supports both individual users and platform users. Individual developers use KubeAI Notebooks to develop model scripts; smaller models can be trained directly in the Notebook, while complex models are trained through tasks. Produced models are managed uniformly on KubeAI, including publishing them as inference services and iterating new versions. Third-party business platforms obtain KubeAI's capabilities through an OpenAPI for upper-layer business innovation.
Below we focus on four modules: dataset management, model training, model service management, and the AI-Pipeline engine.
Our survey found that the data used by algorithm engineers is either stored on NAS, read from ODPS, or pulled from OSS. To unify data management, KubeAI provides users with the concept of a dataset based on Kubernetes PVC (PersistentVolumeClaim) resources, compatible with different data source formats. Meanwhile, to address the high data-access overhead caused by the separation of compute and storage, we use Fluid to configure a caching service for datasets, so that data can be cached near the compute for the next round of iterative computation, or tasks can be scheduled onto the compute nodes where the dataset is already cached.
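A rough sketch of exposing a dataset as a PVC via the Kubernetes Python client follows; the "fluid" storage class and all names are assumptions for illustration (Fluid's actual setup goes through its own Dataset/Runtime CRDs, which then back a PVC):

```python
from kubernetes import client, config

config.load_kube_config()

# A dataset is exposed to training Pods as a PVC; the storage class
# determines the backing data source and cache behavior.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="dataset-shoes-v1", namespace="ai-jobs"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadOnlyMany"],
        storage_class_name="fluid",  # assumption: a Fluid-managed cache class
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("ai-jobs", pvc)

# Training Pods then mount the claim like any other volume:
#   volumes:      [{name: data, persistentVolumeClaim: {claimName: dataset-shoes-v1}}]
#   volumeMounts: [{name: data, mountPath: /data}]
```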
For model training, we mainly do three aspects of work:
(1) Based on JupyterLab, a Notebook function is provided; users can develop algorithm models through a shell or Web IDE in the same way as they would locally.
(2) Model training is performed as tasks, which allows resources to be applied for and released more flexibly, improving resource utilization and greatly reducing the cost of model training. Thanks to Kubernetes's good extensibility, the various TrainingJob CRDs in the industry can easily be connected, so training frameworks such as TensorFlow, PyTorch, and XGBoost are all supported. Tasks support batch scheduling and priority queues.
(3) In cooperation with the algorithm team, we partially optimized the TensorFlow training framework, achieving improvements in batch data-reading efficiency and PS/Worker communication speed, and provided solutions to problems such as PS load imbalance and slow workers.
Compared with ordinary services, the biggest characteristic of a model service is that it needs to load one or more model files. In the early days of containerization, for historical reasons, most CV model services packaged the model files and inference scripts directly into the container image, which made images large and model version updates cumbersome.
KubeAI addresses this problem. Based on standardized model management, a model service is associated with its model through configuration; at release time, the platform pulls the corresponding model file according to the configuration for the inference script to load. This reduces the burden on algorithm developers of managing inference service images/versions, reduces storage redundancy, speeds up model updates and rollbacks, raises the model reuse rate, and helps algorithm teams manage models and their associated online inference services more conveniently and quickly.
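As a sketch of this decoupling, the inference container might pull its model from OSS at startup based on configuration injected as environment variables; all variable names here are hypothetical, and the download uses the oss2 SDK:

```python
import os
import oss2  # Alibaba Cloud OSS SDK

# Hypothetical variables injected by the platform from the model config;
# the image itself contains only the inference code, never the model.
endpoint = os.environ["OSS_ENDPOINT"]
bucket_name = os.environ["MODEL_BUCKET"]
model_key = os.environ["MODEL_KEY"]  # e.g. models/shoe-cls/v7/model.pb

auth = oss2.Auth(os.environ["OSS_AK"], os.environ["OSS_SK"])
bucket = oss2.Bucket(auth, endpoint, bucket_name)
bucket.get_object_to_file(model_key, "/models/current")

# The inference script then loads /models/current; upgrading the model
# only changes the configuration, not the image.
```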
A real business scenario is rarely a single task node. For example, a complete model iteration roughly includes a data processing step, a model training step, updating the online inference service with the new model, a small-traffic verification step, and the official release. The KubeAI platform provides a workflow orchestration engine based on Argo Workflow. Workflow nodes support custom tasks, platform-preset template tasks, and various deep-learning training tasks (TFJob, PyTorchJob, etc.).
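A simplified sketch of such a pipeline as an Argo Workflow, submitted via the Kubernetes custom-objects API; the step scripts and image are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()

def step(name: str) -> dict:
    return {"name": name, "template": name}

def container_template(name: str, cmd: list) -> dict:
    return {"name": name,
            "container": {"image": "my-registry/pipeline:latest", "command": cmd}}

# Hypothetical three-step iteration: process data, train, update the service.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "model-iteration-", "namespace": "ai-jobs"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {"name": "main", "steps": [  # each inner list is one sequential stage
                [step("process-data")],
                [step("train-model")],
                [step("update-service")],
            ]},
            container_template("process-data", ["python", "/steps/process.py"]),
            container_template("train-model", ["python", "/steps/train.py"]),
            container_template("update-service", ["python", "/steps/update.py"]),
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="ai-jobs", plural="workflows", body=workflow,
)
```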
CV algorithm models are generally developed by studying the theoretical algorithm while building the engineering implementation, debugging at any time. Because CV models are generally smaller and cheaper to train than search-and-recommendation models, CV engineers are more accustomed to developing the training script in a Notebook and then training directly in the Notebook. Users can independently choose and configure resources such as CPU, GPU cards, and network storage disks for their Notebook.
Once the training script has been developed and debugged to meet the need, users can use KubeAI's task management function to configure it as a single-node or distributed training task and submit it to the KubeAI platform for execution. The platform schedules the task to a resource pool with sufficient resources. After a successful run, the model is pushed to the model repository and registered in KubeAI's model list, or saved to a designated location for the user to select and confirm.
After the model is generated, users can directly deploy it as an inference service in KubeAI's model service management. When a new version of the model is produced later, users can configure the new model version for the inference service; then, depending on whether the inference engine supports model hot update, the upgrade is completed either by redeploying the service or by creating a model upgrade task.
In the machine-identification business scenario, the above process is orchestrated through an AI-Pipeline workflow and model iteration is performed periodically, raising model iteration efficiency by about 65%. After CV scenarios were connected to the KubeAI platform, the previous local training approach was abandoned; the platform's flexible, on-demand resource acquisition greatly improved resource utilization, and R&D efficiency in model management, inference service management, and model iteration improved by about 50%.
Compared with CV models, search-and-recommendation model training costs more, mainly because the data samples are large and training times are long; a single task requires a large amount of resources. Before KubeAI was launched, since our data was stored on ODPS (a data warehouse service of Alibaba's general computing platform, since renamed MaxCompute), most search-and-recommendation algorithm engineers built data processing tasks in the DataWorks console (a big data development and management platform based on ODPS) and submitted model training tasks to the PAI platform. However, since PAI is a public cloud product, the cost of a single task submitted to it is higher than the resource cost of the task itself; the difference can be understood as a technical service fee. Moreover, such public cloud products cannot satisfy the company's internal cost-control needs.
As KubeAI gradually landed, we migrated the model training tasks of search-and-recommendation scenarios from PAI to our platform in two ways. Method 1 preserves the users' habit of working in DataWorks: some SQL tasks are still completed in DataWorks, and tasks are then submitted to the KubeAI platform via shell commands. Method 2 has users submit tasks directly to the KubeAI platform. As the data warehouse infrastructure improves, we hope to switch gradually to the second method.
The search-and-recommendation training workflow makes full use of the development environment and tools provided by KubeAI. With a self-developed training framework, CPU-only training can match the training time of GPU training on PAI. The training engine also supports large-model training and real-time training scenarios and, together with a multi-type storage (OSS/file storage) solution and a model distribution solution, ensures the success rate of large-model training tasks and efficiently pushes model updates to online services.
In resource scheduling and management, KubeAI makes full use of cluster federation, overselling, task bundling, and other techniques to gradually move training tasks from dedicated resource pools to elastic resources allocated to task Pods and scheduled onto the online resource pool or the public resource pool. We take advantage of the fact that production tasks run periodically while development tasks run mainly during the day to implement peak-shifting and differentiated scheduling (for example, elastic resources for small tasks and regular resources for large ones). Data from recent months shows that we have continued to absorb considerable growth in tasks while total resources have grown only slightly.
This is a typical case of a non-algorithm business using AI capabilities: for example, using Facebook's Prophet algorithm to predict the baseline of a business metric. KubeAI provides basic AI capabilities for these scenarios, solving their difficulty of "quickly verifying a mature algorithm". Users only need to implement the algorithm model in an engineering way (using existing best practices or secondary development), build a container image, submit a task on KubeAI, and start execution to obtain the desired results; they can also run training and inference periodically to obtain baseline prediction results.
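For instance, a baseline prediction with Prophet boils down to a few lines; the metric history below is synthetic:

```python
import pandas as pd
from prophet import Prophet

# Hypothetical daily metric history; Prophet expects columns "ds" and "y".
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=90, freq="D"),
    "y": [100 + i * 0.5 + (i % 7) * 3 for i in range(90)],  # trend + weekly cycle
})

m = Prophet()
m.fit(df)

# Predict a 14-day baseline; yhat_lower/yhat_upper bound the expected range,
# so actual values falling outside the band can be flagged as anomalies.
future = m.make_future_dataframe(periods=14)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(14))
```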
Users can configure and use the computing resources or other heterogeneous resources a task needs on demand. Taking the 12 metrics of one online business scenario as an example, nearly 20,000 tasks run every day; compared with the previous resource costs for similar needs, KubeAI saves nearly 90% of the cost and improves R&D efficiency by about 3 times.
Dewu containerized its business in a short time thanks, on the one hand, to the growing maturity of cloud-native technology itself and, on the other, to our in-depth understanding of our own business scenarios, which allowed us to offer targeted solutions. The KubeAI platform grew out of our in-depth analysis of the pain points of algorithm business scenarios, and was implemented step by step and iteratively, guided by the goals of continuously improving the engineering efficiency of AI scenarios, improving resource utilization, and lowering the threshold for AI model/service development.
In the future, we will continue to work hard on training engine optimization, refined AI task scheduling, and elastic model training to further improve AI model training and iteration efficiency and resource utilization.