
Engineering practice of large-scale deep learning model for takeaway advertising


Author: 亚廼英梁陈龙, et al

Introduction

As Meituan's food delivery business continues to grow, the food delivery advertising engine team has carried out engineering exploration and practice in several areas and achieved some results. We will share them as a series of articles covering: ① the practice of business platformization; ② the engineering practice of large-scale deep learning models; ③ the exploration and practice of near-line computing; ④ the practice of large-scale index construction and online retrieval services; ⑤ the practice of the mechanism engineering platform. We recently published the article on business platformization (see "Exploration and Practice of Meituan Takeout Advertising Platformization"). This article is the second in the series. Focusing on the challenges that large-scale deep models bring across the full link, it starts from two aspects, online latency and offline efficiency, and describes the engineering practice of advertising with large-scale deep models. We hope it brings some help or inspiration to everyone.

1 Background

In core Internet business scenarios such as search, recommendation, and advertising (hereinafter "search-promotion"), mining data, modeling user interests, and providing users with high-quality services have become key to improving user experience. In recent years, driven by data dividends and hardware dividends, deep learning models have been widely deployed in the industry for search-promotion businesses. In CTR scenarios in particular, the industry has gradually moved from simple DNN models to large Embedding models with trillions of parameters, or even super-large models. The takeaway advertising business line has gone through the evolution of "LR shallow models (tree models)" -> "deep learning models" -> "large-scale deep learning models". The overall trend is a transition from simple models based on hand-crafted features to complex, data-centric deep learning models. The use of large models has greatly improved the expressive power of the model, matched the supply and demand sides more accurately, and opened up more possibilities for subsequent business development. But as model and data scale keep growing, we find that efficiency has the following relationship with them:

[Figure: "duration" grows as data scale and model scale increase]

As shown in the figure above, as the data scale and model scale increase, the corresponding "duration" becomes longer and longer. On the offline side, this "duration" corresponds to efficiency; on the online side, it corresponds to latency. Our work revolves around optimizing this "duration".

2 Analysis

Compared with ordinary small models, the core problem of large models is that as the data volume and model scale grow by tens or even hundreds of times, storage, communication, and computation across the whole link face new challenges, which affect the offline iteration efficiency of the algorithm and the online latency constraints. How do we break through this series of problems? Let's first analyze the whole link, as shown below:

[Figure: full-link analysis of large-model challenges]

The "duration" becomes longer, which will mainly be reflected in the following aspects:

  • Online latency: At the feature level, with online requests unchanged, the growth in the number of features makes problems such as increased IO and feature-computation time particularly prominent, requiring a reshaping of feature operator parsing and compilation, internal task scheduling of feature extraction, and network I/O transmission. At the model level, models have grown from hundreds of MB/GB to hundreds of GB, bringing a two-order-of-magnitude increase in storage. In addition, the computation of a single model has also grown by orders of magnitude (FLOPs from millions to tens of millions). Relying on the CPU alone cannot meet the huge demand for computing power, so building a CPU + GPU + hierarchical cache inference architecture to support large-scale deep learning inference is imperative.
  • Offline efficiency: As the number of samples and features grows severalfold, the time for sample construction and model training is greatly lengthened, and may even become unacceptable. How to handle massive sample construction and model training with limited resources is the primary problem of the system. At the data level, the industry generally tackles it from two angles: on the one hand, continuously optimizing the constraints of the batch process; on the other hand, "turning batches into streams", moving data processing from centralized to distributed, which greatly improves data readiness time. At the training level, acceleration is achieved through GPU hardware combined with architecture-level optimization. In addition, algorithm innovation is often human-driven: how can new data quickly match models, and how can new models be quickly applied by other businesses? If N people are placed on N business lines to independently do the same optimization, this should instead evolve into optimizing on one business line and broadcasting the result to the other business lines at the same time, freeing N-1 people for new innovation and greatly shortening the innovation cycle. This matters especially as the model scale keeps growing and the cost of manual iteration inevitably rises; we need a deep transformation from "people looking for features/models" to "features/models finding people", reducing "repeated innovation" and achieving intelligent matching between models and data.
  • Other Pipeline issues: The machine learning Pipeline is not unique to large-scale deep learning model links, but with the rollout of large models it brings new challenges, for example: ① how the system process supports both full and incremental online deployment; ② the rollback time of a model, the time to do things correctly, and the recovery time after doing things wrong. In short, new demands arise in development, testing, deployment, monitoring, rollback, and so on.

This article focuses on two aspects, online latency (model inference, feature service) and offline efficiency (sample construction, data preparation), and gradually expounds the engineering practice of advertising with large-scale deep models. We will share how to optimize "duration" and other related issues in subsequent chapters, so stay tuned.

3 Model Inference

At the model inference level, takeaway advertising has gone through three generations: the 1.0 era, represented by DNN models supporting a small scale; the 2.0 era, supporting multi-service iteration with high efficiency and low code; and now the 3.0 era, which is gradually meeting the needs of deep learning DNN computing power and large-scale storage. The main evolution trends are shown in the figure below:

[Figure: evolution of the model inference architecture]

Facing large-model inference scenarios, the two core problems solved by the 3.0 architecture are the "storage problem" and the "performance problem". Of course, how to iterate N models of hundreds of GB each, how to ensure online stability when the computational load increases by dozens of times, and how to strengthen the Pipeline are also challenges for the project. Below we focus on how the Model Inference 3.0 architecture solves large-model storage through "distribution", and how it solves performance and throughput through CPU/GPU acceleration.

3.1 Distributed

The parameters of a large model are mainly divided into two parts: Sparse parameters and Dense parameters.

  • Sparse parameters: The number of parameters is very large, typically at the hundred-million level or even the billion/tens-of-billions level, which leads to a large storage footprint, usually at the hundred-GB level or even the TB level. Their characteristics: ① difficult to load on a single machine: in single-machine mode, all Sparse parameters need to be loaded into machine memory, leading to serious memory shortage and affecting stability and iteration efficiency; ② sparse reads: each inference computation only needs to read part of the Sparse parameters; for example, the full set of User parameters is at the 200-million level, but each inference request only needs to read one User's parameters.
  • Dense parameters: The parameter scale is not large; the model generally has 2 to 3 fully connected layers, and the parameter count is at the million/ten-million level. Characteristics: ① loadable on a single machine: Dense parameters occupy only tens of MB, so single-machine memory can load them normally. For example, with an input layer of 2000 and fully connected layers of [1024, 512, 256], the total number of parameters is 2000 × 1024 + 1024 × 512 + 512 × 256 + 256 = 2,703,616, i.e. about 2.7 million parameters, occupying well under 100 MB of memory; ② full reads: each inference computation needs to read the full set of parameters.
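As a quick sanity check of the arithmetic above, a few lines of Python reproduce the 2.7-million figure for the example layer sizes:

```python
# Quick check of the Dense-parameter estimate above: a 2000 -> 1024 -> 512 -> 256 MLP.
dims = [2000, 1024, 512, 256]
weights = sum(a * b for a, b in zip(dims, dims[1:]))  # 2000*1024 + 1024*512 + 512*256
print(weights + 256)  # 2703616 parameters, roughly 10 MB in FP32
```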

Therefore, the key to handling the growth of large-model parameter scale is to move Sparse parameters from single-machine storage to distributed storage. The transformation involves two parts: ① model network structure conversion; ② Sparse parameter export.

3.1.1 Model network structure conversion

In the industry, there are roughly two ways to obtain distributed parameters: an external service obtains the parameters in advance and passes them to the prediction service; or the prediction service obtains the parameters from distributed storage internally by transforming TF (TensorFlow) operators. To reduce the cost of architectural modification and the intrusion into the existing model structure, we chose to obtain distributed parameters by modifying TF operators.

Under normal circumstances, a TF model uses native operators to read Sparse parameters, the core one being the GatherV2 operator. The operator has two main inputs: ① the list of IDs to query; ② the Embedding table that stores the Sparse parameters.

The operator's function is to read the Embedding data corresponding to the ID-list indices from the Embedding table and return it; it is essentially a hash-lookup process. In the single-machine model, the Sparse parameters in the Embedding table are all stored in single-machine memory.

Transforming the TF operator is essentially a transformation of the model network structure. The transformation has two core parts: ① network graph reconstruction; ② a custom distributed operator.

1. Network graph reconstruction: transform the model network structure, replace the native TF operator with a custom distributed operator, and at the same time solidify the native Embedding table.

  • Distributed operator replacement: traverse the model network, replace the GatherV2 operators that need replacing with the custom distributed operator MtGatherV2, and modify the Input/Output of the upstream and downstream nodes accordingly.
  • Native Embedding table solidification: the native Embedding table is solidified as a placeholder, which both preserves the integrity of the model network structure and avoids the single-machine memory footprint of the Sparse parameters.

[Figure: network graph reconstruction]

2. Custom distributed operator: modify the ID-list-based Embedding query from looking up the local Embedding table to looking up the distributed KV.

  • Request query: deduplicate the input IDs to reduce query volume, and query the two-level cache (local cache + remote KV) concurrently via sharding to obtain the Embedding vectors.
  • Model management: maintain the registration and offloading process of the model's Embedding meta information, as well as the creation and destruction of the cache.
  • Model deployment: trigger the loading of model resource information and the process of importing the Embedding data into the KV in parallel.
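To make the request-query path concrete, below is a minimal Python sketch of an MtGatherV2-style lookup: the input IDs are deduplicated, the local cache is consulted first, and only the misses are batch-fetched from the remote KV. The `remote_kv.mget` and `local_cache` interfaces are hypothetical stand-ins, not the actual online implementation.

```python
import numpy as np

class DistributedEmbeddingLookup:
    """Sketch of the MtGatherV2 lookup path: dedupe -> local cache -> remote KV."""

    def __init__(self, remote_kv, local_cache, dim):
        self.remote_kv = remote_kv   # hypothetical client: mget(ids) -> {id: np.ndarray}
        self.cache = local_cache     # hypothetical LRU cache: get(id) / put(id, vec)
        self.dim = dim

    def lookup(self, ids):
        unique_ids, inverse = np.unique(np.asarray(ids, dtype=np.int64),
                                        return_inverse=True)
        found, misses = {}, []
        for i in unique_ids:
            vec = self.cache.get(i)
            if vec is None:
                misses.append(i)
            else:
                found[i] = vec
        if misses:
            fetched = self.remote_kv.mget(misses)       # sharded and concurrent in production
            for i in misses:
                vec = fetched.get(i)
                if vec is None:                         # unseen ID: fall back to zeros
                    vec = np.zeros(self.dim, dtype=np.float32)
                self.cache.put(i, vec)
                found[i] = vec
        table = np.stack([found[i] for i in unique_ids])
        return table[inverse]                           # restore original order and duplicates
```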

3.1.2 Sparse parameter export

  • Sharded parallel export: parse the model Checkpoint file, obtain the Part information corresponding to the Embedding table, split it by Part, and export each Part file to HDFS in parallel through multiple Worker nodes.
  • Import into KV: pre-allocate multiple Buckets in advance; the Buckets store information such as the model version to facilitate online routing queries. The model's Embedding data is also stored by Bucket and imported into the KV in parallel by shard.

The overall process is shown in the figure below. Through offline distributed model-structure conversion, near-line data-consistency guarantees, and online hot-data caching, we ensure the normal iteration requirements of a hundreds-of-GB model.

[Figure: overall process of the distributed transformation]

As can be seen, the distributed storage used here is an external KV capability, which will be replaced in the future by an Embedding Service that is more efficient, flexible, and easy to manage.

3.2 CPU acceleration

In addition to optimizing the model itself, there are two main common CPU acceleration approaches: ① instruction-set optimization, such as using the AVX2 and AVX512 instruction sets; ② using acceleration libraries (TVM, OpenVINO, etc.).

  1. Instruction-set optimization: for a TensorFlow model, simply add the instruction-set optimization options to the compilation options when compiling the TensorFlow framework code. Practice has proven that introducing the AVX2 and AVX512 instruction sets brings an obvious optimization effect, increasing the throughput of the online inference service by 30%.
  2. Acceleration-library optimization: acceleration libraries optimize and fuse the network model structure to achieve inference acceleration. Commonly used acceleration libraries in the industry include TVM and OpenVINO. TVM supports cross-platform use and has good generality; OpenVINO is optimized specifically for Intel hardware, with average generality but a good acceleration effect.

Below, we focus on some of our practical experience using OpenVINO for CPU acceleration. OpenVINO is a deep-learning compute acceleration and optimization framework launched by Intel that supports compression optimization, accelerated computing, and other functions for machine learning models. Its acceleration principle can be summarized in two parts: linear operator fusion and data-precision calibration.

  1. Linear operator fusion: OpenVINO uses the Model Optimizer to linearly fuse multi-layer operators in the model network, reducing operator scheduling overhead and the data-access overhead between operators, for example merging the three operators Conv + BN + ReLU into a single CBR structure operator.
  2. Data-precision calibration: after the model has been trained offline, since no backpropagation is needed during inference, data precision can be appropriately reduced, for example to FP16 or INT8, resulting in a smaller memory footprint and lower inference latency.
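For reference, a minimal sketch of serving an accelerated sub-model with the OpenVINO runtime looks roughly as follows, assuming the 2022+ Python API; the IR file name and input shape are hypothetical, and FP16 compression is assumed to have been applied when the IR was exported by the Model Optimizer.

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("mlp_submodel.xml")            # IR produced offline by the Model Optimizer
compiled = core.compile_model(model, device_name="CPU")
request = compiled.create_infer_request()

batch = np.random.rand(128, 256).astype(np.float32)    # one fixed-Batch candidate queue (toy shape)
request.infer({0: batch})                              # bind the single input by index
scores = request.get_output_tensor().data              # numpy view of the single output
print(scores.shape)
```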

CPU acceleration usually accelerates inference for fixed-Batch candidate queues, but in search-promotion scenarios the candidate queue is dynamic. This means that before model inference, a Batch-matching operation must be added, i.e. mapping the requested dynamic-Batch candidate queue to the closest Batch model; but this requires building N matched models, resulting in N times the memory usage. With model sizes already at the hundreds-of-GB level, memory is extremely tight, so choosing a reasonable network structure for acceleration is a key consideration. The figure below shows the overall operating architecture:

[Figure: overall architecture of OpenVINO-based CPU acceleration]

  1. Network splitting: the overall network structure of the CTR model is abstracted into three parts: the Embedding layer, the Attention layer, and the MLP layer. The Embedding layer is used for data acquisition, the Attention layer contains more logical operations and lightweight network computation, and the MLP layer is dense network computation.
  2. Acceleration-network selection: OpenVINO accelerates pure network computation well and can be applied effectively to the MLP layer. In addition, most of the model's data is stored in the Embedding layer, and the MLP layer occupies only tens of MB of memory. If multiple Batch versions are built only for the MLP-layer network, the model memory footprint goes from (Embedding + Attention + MLP) before optimization to (Embedding + Attention + MLP × number of Batches) after optimization, which has a small impact on memory usage. Therefore, we finally selected the MLP-layer network as the model-acceleration network. A minimal sketch of the Batch matching mentioned above follows this list.
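The sketch below illustrates the Batch-matching step: a dynamic candidate queue is zero-padded up to the nearest fixed Batch so it can be served by one of the pre-built fixed-Batch networks. The bucket sizes are illustrative assumptions, not the production configuration.

```python
import numpy as np

FIXED_BATCHES = [32, 64, 128, 256]        # assumed bucket sizes, one accelerated network per bucket

def pad_to_bucket(item_features):
    """Map a dynamic candidate queue to the closest (>=) fixed Batch by zero-padding."""
    n = item_features.shape[0]
    bucket = next(b for b in FIXED_BATCHES if b >= n)   # assumes n <= max(FIXED_BATCHES)
    padded = np.zeros((bucket,) + item_features.shape[1:], dtype=item_features.dtype)
    padded[:n] = item_features
    return bucket, padded, n

candidates = np.random.rand(70, 256).astype(np.float32)  # 70 candidate ads in this request
bucket, padded, n = pad_to_bucket(candidates)
# run the network built for `bucket`, then keep only the first n scores
print(bucket, padded.shape, n)
```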

At present, the OpenVINO-based CPU acceleration solution has achieved good results in the production environment: with the same CPU as the baseline, service throughput is increased by 40% and average latency is reduced by 15%. If you want to do some acceleration at the CPU level, OpenVINO is a good choice.

3.3 GPU acceleration

On the one hand, as the business develops, business forms become richer, traffic grows higher, models become wider and deeper, and the consumption of computing power increases sharply; on the other hand, advertising scenarios mainly use DNN models, involving a large number of sparse-feature Embeddings and neural-network floating-point operations. As a memory-access-intensive and compute-intensive online service, it must meet low-latency and high-throughput requirements while ensuring availability, which challenges the computing power of a single machine. If these conflicts between computing-resource demand and supply are not resolved well, they will greatly limit business development: before the model is widened and deepened, pure-CPU inference services can provide considerable throughput; but after the model is widened and deepened, the computation becomes complex, and ensuring high availability consumes a large amount of machine resources, making large models unable to be applied online at scale. At present, a common solution in the industry is to use GPUs to solve this problem, since GPUs are better suited to compute-intensive tasks. Using GPUs requires solving the following challenge: how to achieve as high a throughput as possible while ensuring availability and low latency, while also taking ease of use and versatility into account. To this end, we have done a lot of practical work on GPUs, such as TensorFlow-GPU, TensorFlow-TensorRT, and TensorRT. In order to take into account both the flexibility of TF and the acceleration effect of TensorRT, we adopted an independent two-stage architecture design of TensorFlow plus TensorRT.

3.3.1 Acceleration analysis

  • Heterogeneous computing: our idea is consistent with CPU acceleration: a 200 GB deep learning CTR model cannot be put directly onto the GPU; memory-access-intensive operators (such as Embedding-related operations) are suited to the CPU, and compute-intensive operators (such as the MLP) are suited to the GPU.
  • Several points to pay attention to when using GPUs: ① frequent interaction between host memory and GPU memory; ② latency and throughput; ③ the trade-off between scalability and performance optimization; ④ GPU utilization.
  • Choice of inference engine: commonly used inference acceleration engines in the industry include TensorRT, TVM, XLA, and ONNXRuntime. TensorRT goes deeper than the other engines in operator optimization, can implement arbitrary operators through custom plugins, and is highly extensible. Moreover, TensorRT supports models from common learning platforms (Caffe, PyTorch, TensorFlow, etc.), and its ecosystem is becoming more and more complete (the model conversion tool onnx-tensorrt, the performance analysis tool nsys, etc.), so the acceleration engine on the GPU side uses TensorRT.
  • Model analysis: the overall network structure of the CTR model is abstracted into three parts: the Embedding layer, the Attention layer, and the MLP layer. The Embedding layer is used for data acquisition and is suited to the CPU; the Attention layer contains more logical operations and lightweight network computation; the MLP layer focuses on network computation, and these computations can be done in parallel, making it suitable for the GPU and able to make full use of the GPU cores (CUDA Cores, Tensor Cores) to improve parallelism.
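For reference, building a TensorRT engine for the dense (MLP) sub-model from an ONNX export looks roughly like the sketch below, assuming the TensorRT 8.x Python API; the file names are hypothetical and error handling is simplified.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("mlp_submodel.onnx", "rb") as f:          # MLP sub-model exported to ONNX
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)               # data-precision optimization (FP16)
engine_bytes = builder.build_serialized_network(network, config)

with open("mlp_submodel.plan", "wb") as f:          # serialized engine, loaded by the online service
    f.write(engine_bytes)
```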

3.3.2 Optimization Goals

The inference phase of deep learning has high requirements on computing power and latency. If the trained neural network is deployed directly to the inference side, problems such as insufficient computing power or long inference time may occur. Therefore, we need to perform certain optimizations on the trained neural network. The general industry approach to optimizing neural-network models covers different aspects such as model compression, merging of different network layers, sparsification, and the use of low-precision data types; it may even require targeted optimization based on hardware characteristics. To this end, we mainly optimize around the following two goals:

  1. Throughput under latency and resource constraints: when shared resources such as registers and caches are not contended, increasing concurrency can effectively improve resource utilization (CPU, GPU, etc.), but it may increase request latency. Since the latency limits of the online system are very strict, the throughput ceiling of the online system cannot simply be derived from resource-utilization metrics; it must be evaluated comprehensively under the latency constraint combined with the resource ceiling. When system latency is low and resource (Memory/CPU/GPU, etc.) utilization is the constraint, model optimization can be used to reduce resource utilization; when resource utilization is low and latency is the constraint, fusion optimization and engine optimization can be used to reduce latency. By combining these optimization methods, the overall capability of the system service can be effectively improved, thereby increasing system throughput.
  2. Compute density under computing constraints: under a CPU/GPU heterogeneous system, model inference performance is mainly affected by data-copy efficiency and computing efficiency, which are determined by memory-access-intensive operators and compute-intensive operators respectively. Data-copy efficiency is affected by PCIe data transmission and CPU/GPU memory read/write efficiency, while computing efficiency is affected by the efficiency of computing units such as CPU Cores, CUDA Cores, and Tensor Cores. With the rapid development of hardware such as GPUs, the processing capability of compute-intensive operators has grown rapidly, making the phenomenon of memory-access-intensive operators holding back system serving capability increasingly prominent. Therefore, reducing memory-access-intensive operators and improving compute density is increasingly important to system serving capability, i.e. reducing data copies and kernel launches while the model's amount of computation remains largely unchanged. For example, model optimization and fusion optimization are used to reduce operator transformations (operators such as Cast/Unsqueeze/Concat), and CUDA Graph is used to reduce kernel launches.

Below, focusing on the above two goals, we introduce in detail some of the work we have done on model optimization, fusion optimization, and engine optimization.

3.3.3 Model Optimization

1. Computation and transmission deduplication: during inference, each Batch contains only one user's information. Therefore, the user information can be reduced from Batch Size to 1 before inference and expanded only when really needed, reducing the cost of data-transmission copies and repeated computation. As shown in the figure below, the User-class feature information can be queried only once before inference, pruned so that it flows only through the user-related sub-network, and then expanded when the user-item association needs to be computed (a minimal sketch of this expansion follows this list).

  • Automated process: find the repeatedly computed nodes (red nodes); if all of a node's leaf nodes are repeated-computation nodes, that node is also a repeated-computation node; search upward layer by layer from the leaf nodes until traversal is complete; then find all edges connecting red and white nodes and insert a User-feature expansion node there to expand the User features.

[Figure: deduplication of user-side computation and transmission]

2. Data-precision optimization: during model training, backpropagation is needed to update gradients, which requires high data precision; during model inference, only forward inference is performed and gradients are not updated. Therefore, under the premise of preserving the effect, FP16 or mixed precision is used for optimization, saving memory, reducing transmission overhead, and improving inference performance and throughput.

3. Computation pushdown: the CTR model structure mainly consists of three layers: Embedding, Attention, and MLP. The Embedding layer is biased toward data acquisition, while Attention is partly logic and partly computation. To fully exploit the potential of the GPU, most of the computation logic of Attention and MLP in the CTR model structure is moved from the CPU down to the GPU, and overall throughput is greatly improved.
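Returning to point 1 above (computation and transmission deduplication), the expansion step can be sketched in a few lines of TensorFlow: the user sub-network runs once on a [1, D] input, and its output is tiled to the Item Batch only where the user-item interaction is computed. The shapes below are toy examples, not the production network.

```python
import tensorflow as tf

def expand_user_to_items(user_vec, num_items):
    """Tile the once-computed user representation to the item Batch: [1, D] -> [num_items, D]."""
    return tf.tile(user_vec, [num_items, 1])

user_out = tf.random.normal([1, 64])       # user sub-network output, computed once per request
item_out = tf.random.normal([50, 64])      # 50 candidate ads
joint = tf.concat([expand_user_to_items(user_out, 50), item_out], axis=1)  # interaction input
print(joint.shape)                          # (50, 128)
```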

3.3.4 Fusion Optimization

During online model inference, the computation of each layer is done by the GPU; in practice, the CPU completes it by launching different CUDA kernels. CUDA kernels compute tensors very quickly, but a lot of time is often wasted on launching CUDA kernels and on reading and writing the input/output tensors of each layer, which causes a memory-bandwidth bottleneck and wastes GPU resources. Here we mainly introduce the automatic optimization and manual optimization parts of TensorRT.

1. Automatic optimization: TensorRT is a high-performance deep-learning inference optimizer that provides low-latency, high-throughput inference deployment for deep-learning applications. TensorRT can be used to accelerate inference for very large models, embedded platforms, or autonomous-driving platforms. It now supports almost all deep-learning frameworks such as TensorFlow, Caffe, MXNet, and PyTorch; combining TensorRT with NVIDIA GPUs enables fast and efficient deployment and inference in almost all frameworks. Some optimizations require little user involvement, such as Layer Fusion and Kernel Auto-Tuning.

  • Layer Fusion: TensorRT greatly reduces the number of network layers by fusing layers horizontally or vertically. Simply put, it fuses some computation ops or removes some redundant ops to reduce the number of data round trips, frequent video-memory usage, and scheduling overhead. Common examples include the fusion of Convolution and ElementWise operations, and CBR fusion. The following figure shows the structure of some sub-graphs in the entire network before and after fusion. During fusion, FusedNewOP may involve a variety of Tactics, such as CudnnMLPFC, CudnnMLPMM, and CudaMLP; an optimal Tactic is finally chosen as the fused structure based on duration. Through fusion, the number of network layers is reduced and the data path is shortened; identical structures are merged to make the data path wider, achieving more efficient use of GPU resources.

[Figure: sub-graph structure before and after layer fusion]

  • Kernel Auto-Tuning: when the network model runs inference, it calls the GPU's CUDA kernels for computation. TensorRT can adjust the CUDA kernels for different network models, GPU architectures, numbers of SMs, core frequencies, and so on, choosing different optimization strategies and computation methods to find the optimal computation method for the current situation, ensuring that the current model achieves the best performance on the specific platform. The figure above shows the main idea of the optimization: each op has multiple kernel optimization strategies (cuDNN, cuBLAS, etc.); based on the current architecture, inefficient kernels are filtered out of all the strategies and the optimal kernel is selected, finally forming a new Network.

2. Manual optimization: as is well known, GPUs are suited to compute-intensive operators and not very friendly to other types of operators (lightweight computation operators, logical operators, etc.). When using GPU computation, each operation generally goes through several steps: the CPU allocates GPU memory -> the CPU sends data to the GPU -> the CPU launches the CUDA kernel -> the CPU retrieves the data -> the CPU releases GPU memory. To reduce overheads such as scheduling, kernel launch, and memory access, network fusion is needed. Because the structure of large CTR models is flexible and variable, it is difficult to unify the fusion pattern, and specific problems must be analyzed case by case. For example, in the vertical direction, Cast, Unsqueeze, and Less are fused, and TensorRT's internal Conv, BN, and ReLU are fused; in the horizontal direction, input operators of the same dimension are fused. To this end, we use NVIDIA performance-analysis tools (NVIDIA Nsight Systems, NVIDIA Nsight Compute, etc.) to analyze specific problems based on actual online business scenarios. We integrate these tools into the online inference environment to obtain GPU profiling files of the inference process. The profiling files show the inference process clearly; we found that the kernel-launch-bound phenomenon of some operators in the overall inference is serious, and the gaps between some operators are large, leaving room for optimization, as shown in the following figure:

[Figure: GPU profiling timeline showing kernel-launch gaps]

To this end, we analyze the entire Network using the performance-analysis tools and the converted model, find the parts that TensorRT has already optimized, and then perform network fusion on the other sub-structures in the Network that can still be optimized, while ensuring that these sub-structures account for a certain proportion of the entire Network so that compute density can rise to a certain extent after fusion. As for which network fusion method to use, it can be applied flexibly according to the specific scenario. The figure below compares the sub-graph structures before and after our fusion:

[Figure: sub-graph structure before and after manual fusion]

3.3.5 Engine Optimization

  1. Multiple models: the scale of user requests in takeaway advertising is uncertain, with sometimes more ads and sometimes fewer, so multiple models are loaded, each corresponding to a different input Batch. The input scale is divided into buckets and padded to several fixed Batches, which are then mapped to the corresponding model for inference.
  2. Multi-context and multi-stream: for each Batch model, using multiple contexts and multiple streams not only avoids the overhead of the model waiting on a single context, but also makes full use of the concurrency of multiple streams to achieve overlap between streams; at the same time, to better solve the problem of resource contention, CAS is introduced. As shown in the figure below, a single stream becomes multiple streams:

[Figure: from a single stream to multiple streams]

  3. Dynamic Shape: to deal with unnecessary data padding in scenarios where the input Batch is uncertain, and at the same time to reduce the number of models and the waste of resources such as GPU memory, Dynamic Shape is introduced: the model runs inference on the actual input data, reducing data padding and unnecessary computing-resource waste, and ultimately achieving performance optimization and throughput improvement.
  4. CUDA Graph: on modern GPUs, the time spent on each operation (kernel execution, etc.) is at least at the microsecond level, and submitting each operation to the GPU also incurs some overhead (at the microsecond level). In real inference, a large number of kernel operations often need to be executed; each of them is submitted to the GPU separately and computed independently. If the overhead of all these submission launches could be consolidated, it should bring an overall performance improvement. CUDA Graph accomplishes this by defining the entire computation process as a graph rather than a list of individual operations, and then reducing kernel-submission launch overhead by providing a way for a single CPU operation to launch multiple GPU operations in the graph. The core idea of CUDA Graph is to reduce the number of kernel launches: by capturing the graph before and after inference, and updating the graph according to the needs of inference, subsequent inference no longer requires one kernel launch after another, only a graph launch, ultimately reducing the number of kernel launches. As shown in the figure below, one inference performs 4 kernel-related operations, and the optimization effect of using CUDA Graph can be clearly seen.

[Figure: kernel launches for one inference, with and without CUDA Graph]

  5. Multi-level PS: to further exploit the performance of the GPU acceleration engine, queries of Embedding data can go through a multi-level PS: GPU-memory cache -> CPU-memory cache -> local SSD/distributed KV. Hot data can be cached in GPU memory, and the cached data can be dynamically updated through mechanisms such as data-hotspot migration, promotion, and eviction, making full use of the GPU's parallel computing power and memory-access capability for efficient queries. Offline testing shows that GPU-cache query performance is 10x that of the CPU cache; for GPU-cache misses, queries fall back to the CPU cache, and the two-level cache satisfies 90% of data accesses; long-tail requests fetch data from the distributed KV. The specific structure is as follows:

[Figure: multi-level PS structure]

3.3.6 Pipeline

The whole process from offline training to final online loading of the model is cumbersome and error-prone, and the model cannot be used universally across different GPU cards or different TensorRT and CUDA versions, which adds more possibilities for errors in model conversion. Therefore, in order to improve the overall efficiency of model iteration, we have built the relevant Pipeline capabilities, as shown in the figure below:

[Figure: model conversion and deployment Pipeline]

Pipeline construction includes two parts: the offline model splitting and conversion process, and the online model deployment process:

  1. Offline side: you only need to provide the model splitting node, and the platform automatically splits the original TF model into an Embedding sub-model and a computation-graph sub-model. The Embedding sub-model goes through the distributed converter to perform distributed-operator replacement and Embedding import; the computation-graph sub-model performs TensorRT model conversion and compilation optimization according to the selected hardware environment (GPU model, TensorRT version, CUDA version). Finally, the conversion results of the two sub-models are stored in S3 for subsequent model deployment. The entire process is completed automatically by the platform, without the user being aware of the execution details.
  2. Online side: you only need to select the model-deployment hardware environment (consistent with the environment used for model conversion), and the platform will perform adaptive push-loading of the model according to the environment configuration, completing model deployment and launch with one click.

Through configuration-based and one-click capabilities, the Pipeline has greatly improved model-iteration efficiency, helping algorithm and engineering students focus more on their own work. The figure below shows the overall benefits achieved in our GPU practice compared with pure CPU inference:

[Figure: overall benefits of GPU inference versus pure CPU inference]

4 Feature Service CodeGen Optimization

Feature extraction is the pre-stage of model computation. Whether it is a traditional LR model or the increasingly popular deep learning models, the input is obtained through feature extraction. In a previous blog post, Construction and Practice of the Meituan Takeout Feature Platform, we described our model-feature self-description MFDL, which configures the feature-calculation process to ensure the consistency of samples between online estimation and offline training. With rapid business iteration, the number of model features keeps increasing, and large models in particular introduce a large number of discrete features, causing the amount of calculation to multiply. To this end, we have made some optimizations to the feature-extraction layer and achieved significant gains in throughput and time consumption.

4.1 Full-process CodeGen optimization

The DSL is a description of the feature-processing logic. In early feature-calculation implementations, the DSL configured for each model was interpreted and executed. The advantage of interpretation is that it is simple to implement and reasonably effective with a good design, such as the commonly used iterator pattern; the disadvantage is low execution performance, since for the sake of generality a lot of branching, jumps, and type conversions cannot be avoided at the implementation level. In fact, for a fixed model configuration version, all of its feature-transformation rules are fixed and do not change with the request. In the extreme case, based on this known information, each model feature could be hard-coded to achieve the ultimate performance. Obviously, model feature configurations are ever-changing and it is impossible to hand-code each model, hence the idea of CodeGen: automatically generating a set of dedicated code for each configuration at compile time. CodeGen is not a specific technology or framework but an idea: completing the conversion from an abstract description language to a concrete execution language. In fact, it is common industry practice to use CodeGen to accelerate compute-intensive scenarios. For example, Apache Spark uses CodeGen to optimize SparkSQL execution performance, from ExpressionCodeGen in 1.x for accelerating expression operations to WholeStageCodeGen introduced in 2.x for whole-stage acceleration, with very obvious performance gains. In the machine-learning field, some TF model-acceleration frameworks, such as TensorFlow XLA and TVM, are also based on the CodeGen idea: they compile Tensor nodes into a unified intermediate-layer IR and perform scheduling optimization based on the IR combined with the local environment, thereby accelerating runtime model computation.

[Figure: the CodeGen idea]

Drawing on Spark's WholeStageCodeGen, our goal is to compile the entire feature-calculation DSL into an executable method, thereby reducing performance loss at runtime. The entire compilation process can be divided into front end (FrontEnd), optimizer (Optimizer), and back end (BackEnd). The front end is mainly responsible for parsing the target DSL and converting the source code into an AST or IR; the optimizer optimizes the intermediate code produced by the front end to make it more efficient; the back end converts the optimized intermediate code into native code for the respective platform. The specific implementation is as follows:

  1. Front end: each model corresponds to a node DAG; the calculation DSL of each feature is parsed one by one to generate an AST, and the AST nodes are added to the graph.
  2. Optimizer: optimize the DAG nodes, for example common-operator extraction, constant folding, etc.
  3. Back end: compile the optimized graph into bytecode.

[Figure: front end, optimizer, and back end of the feature CodeGen compilation]
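To illustrate the idea only (this is not our actual DSL or generated code), the toy sketch below "compiles" a small feature configuration into a single Python function at load time, so per-request execution runs straight-line code instead of interpreting the configuration; the feature names and expressions are hypothetical.

```python
import math

FEATURE_DSL = {  # hypothetical per-model configuration: feature name -> transform expression
    "uid_hash":   "hash(raw['uid']) % 1000000",
    "price_log":  "math.log(raw['price'] + 1.0)",
    "uid_x_cate": "hash((raw['uid'], raw['cate'])) % 1000000",
}

def codegen(dsl):
    lines = ["def extract(raw):", "    out = {}"]
    for name, expr in dsl.items():
        lines.append(f"    out['{name}'] = {expr}")   # each rule becomes one straight-line statement
    lines.append("    return out")
    scope = {"math": math}
    exec(compile("\n".join(lines), "<codegen>", "exec"), scope)  # compiled once per model version
    return scope["extract"]

extract = codegen(FEATURE_DSL)
print(extract({"uid": "u_42", "price": 12.5, "cate": "pizza"}))
```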

After optimization, the translation of the node DAG, i.e. the back-end code implementation, determines the final performance. This is also one of the difficulties and the reason why existing open-source expression engines cannot be used directly: the feature-calculation DSL is not a purely computational expression; it describes the feature acquisition and processing flow through a combination of read operators and transformation operators:

  1. Read operator: the process of obtaining features from the storage system, which is an IO-type task, for example querying a remote KV system.
  2. Transformation operator: transforming the features locally after they are obtained, which is a compute-intensive task, for example hashing the feature values.

So in the actual implementation, the scheduling of different types of tasks needs to be considered in order to maximize machine-resource utilization and optimize the overall time consumption. Combining industry research with our own practice, we tried the following three implementations:

[Figure: three scheduling implementations for feature calculation]

  1. Stages divided by task type: the entire process is divided into two stages, acquisition and calculation; shards within a stage are processed in parallel, and the next stage starts after the previous stage completes. This is the solution we used early on: it is simple to implement and allows choosing different shard sizes for different task types, e.g. IO-type tasks can use larger shards. But the shortcoming is also obvious: the long tails of different stages are superimposed, and the long tail of each stage affects the time consumption of the whole process.
  2. Stages divided as a pipeline: to reduce the superposition of long tails across stages, the data can first be sharded, with a read shard created for each feature and a callback added so that the calculation task is triggered after the IO task completes, making the whole process flow like an assembly line. Shard scheduling allows shards whose previous stage is ready earlier to enter the next stage in advance, reducing waiting time and thus reducing the long tail of the overall request. The disadvantage is that a unified shard size cannot fully improve the utilization of each stage: smaller shards bring more network overhead for IO tasks, while larger shards increase the time consumption of compute tasks.
  3. SEDA (Staged Event-Driven Architecture) approach: the staged event-driven approach uses queues to isolate the acquisition stage and the calculation stage; each stage is assigned an independent thread pool and batch-processing queue, consuming N (the batching factor) elements at a time. This allows each stage to choose its shard size independently, and the event-driven model also keeps the process flowing. This is the direction we are currently exploring. A toy sketch of this arrangement follows this list.
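As an illustration only (not the production scheduler), the toy sketch below wires two SEDA-style stages together: each stage has its own queue, its own worker threads, and its own batching factor, and the acquisition stage hands fetched shards to the calculation stage through the queue.

```python
import queue
import threading

results = queue.Queue()

class Stage:
    """Minimal SEDA-style stage: an input queue, dedicated workers, and a batching factor."""

    def __init__(self, name, handler, workers, batch_n, downstream=None):
        self.q = queue.Queue()
        self.handler = handler          # processes a batch, returns a list of outputs
        self.batch_n = batch_n          # batching factor N: max elements consumed per loop
        self.downstream = downstream    # next stage, or None to emit into `results`
        for _ in range(workers):
            threading.Thread(target=self._loop, name=name, daemon=True).start()

    def _loop(self):
        while True:
            batch = [self.q.get()]                      # block for the first element
            while len(batch) < self.batch_n:
                try:
                    batch.append(self.q.get_nowait())   # then drain up to N elements
                except queue.Empty:
                    break
            sink = self.downstream.q if self.downstream else results
            for out in self.handler(batch):
                sink.put(out)

def fetch(keys):        # acquisition stage: pretend these are remote KV reads (IO-bound)
    return [(k, f"raw_{k}") for k in keys]

def transform(items):   # calculation stage: hash the fetched values (CPU-bound)
    return [hash(raw) for _, raw in items]

compute_stage = Stage("compute", transform, workers=2, batch_n=4)
fetch_stage = Stage("fetch", fetch, workers=8, batch_n=16, downstream=compute_stage)

for k in range(100):
    fetch_stage.q.put(k)
print(len([results.get() for _ in range(100)]), "features transformed")
```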

The CodeGen solution is not perfect: dynamically generated code reduces code readability and increases debugging costs. However, using CodeGen as an adaptation layer also opens up space for deeper optimization. Based on CodeGen and an asynchronous non-blocking implementation, good benefits have been achieved online: on the one hand, it reduces feature-calculation time; on the other hand, it also significantly reduces CPU load and improves system throughput. In the future, we will continue to exploit the advantages of CodeGen and carry out targeted optimization in back-end compilation, such as exploring the combination of hardware instructions (such as SIMD) or heterogeneous computing (such as GPUs) for deeper optimization.

4.2 Transmission Optimization

The online prediction service has a two-layer architecture overall: the feature-extraction layer is responsible for model routing and feature calculation, and the model-calculation layer is responsible for model calculation. The original system flow was to splice the results of feature calculation into an M (prediction Batch Size) × N (sample width) matrix, then serialize it and transmit it to the calculation layer. The reasons were partly historical: the input format of many early non-DNN simple models is a matrix, so after the routing layer splices it, the calculation layer can use it directly without conversion; and partly that the array format is compact and saves network-transmission time.

[Figure: the original matrix-based transmission flow]

However, with the iterative development of models, DNN models have gradually become mainstream, and the disadvantages of matrix transmission are also very obvious:

  1. Poor scalability: the data format is uniform and cannot accommodate non-numeric feature values.
  2. Transmission performance loss: with the matrix format, features need to be aligned; for example, Query/User-dimension features need to be copied and aligned to each Item, which increases the amount of data transmitted over the network in requests to the calculation layer.

To solve the above problems, the optimized flow adds a conversion layer above the transmission layer to convert the calculated model features into the required format, such as Tensor, matrix, or CSV format for offline use.

[Figure: the optimized flow with a conversion layer]

Most of the actual online models are TF models. To further save transmission cost, the platform designed the Tensor Sequence format to store each Tensor matrix: r_flag marks whether a feature is an Item-class feature; length represents the length of the Item feature, with value M (number of Items) × NF (feature length); and data stores the actual feature values. For Item features, the M feature values are stored flat; request-type features are filled directly. Based on the compact Tensor Sequence format, the data structure is more compact and the amount of data transmitted over the network is reduced. The optimized transmission format has achieved good results online: the request size from the routing layer to the calculation layer has been reduced by 50%, and network-transmission time has been significantly reduced.
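For concreteness, the Tensor Sequence layout described above can be illustrated with a toy encoder; the actual field widths and byte layout of the platform's format are not public, so the types below are assumptions.

```python
import struct

def encode_tensor_sequence(features):
    """features: list of (r_flag, values); r_flag=1 marks an Item-class feature."""
    buf = bytearray()
    for r_flag, values in features:
        buf += struct.pack("<BI", r_flag, len(values))      # r_flag + length header (assumed widths)
        buf += struct.pack(f"<{len(values)}f", *values)     # data: feature values stored flat
    return bytes(buf)

# one request-level feature (filled once) and one Item feature (M=3 items x NF=2, flattened)
payload = encode_tensor_sequence([
    (0, [14.0]),                                # e.g. an hour-of-day feature, request dimension
    (1, [9.9, 1.0, 15.0, 2.0, 23.5, 3.0]),      # M x NF item values stored flat
])
print(len(payload), "bytes on the wire")
```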

4.3 High-dimensional ID feature encoding

Discrete features and sequence features can be unified as Sparse features. In the feature-processing stage, the original features are hashed into ID-class features. Facing features with hundreds of billions of dimensions, the process of string concatenation plus hashing can no longer meet requirements in terms of either expression space or performance. Based on industry research, we designed and applied a Slot-based feature-encoding format:

[Figure: Slot-based feature-encoding layout]

Here, feature_hash is the hashed value of the original feature value. Integer features are filled in directly; non-integer features or cross features are hashed first and then filled in, and values exceeding 44 bits are truncated. After the Slot-encoding scheme was launched, it not only improved the performance of online feature calculation, but also brought a significant improvement in model effect.
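A minimal sketch of the Slot-encoding idea is shown below. The 44-bit truncation of feature_hash comes from the scheme above; the placement of the slot id in the high bits and the concrete hash function are assumptions made for the illustration, and the online implementation likely differs.

```python
import hashlib

HASH_BITS = 44                          # feature_hash is truncated to 44 bits (per the scheme above)
HASH_MASK = (1 << HASH_BITS) - 1

def feature_hash(value):
    if isinstance(value, int):
        return value & HASH_MASK        # integer features are filled in directly (truncated)
    digest = hashlib.md5(str(value).encode("utf-8")).digest()   # stand-in for the real hash function
    return int.from_bytes(digest[:8], "little") & HASH_MASK

def encode(slot_id, value):
    """Pack the slot id into the high bits and the 44-bit feature hash into the low bits."""
    return (slot_id << HASH_BITS) | feature_hash(value)

print(hex(encode(17, "poi_123#lunch")))   # a cross feature: hashed, then filled
print(hex(encode(3, 42)))                 # an integer feature: filled directly
```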

5 Sample Construction

5.1 Streaming Sample

To solve the problem of online/offline consistency, the industry generally dumps the feature data used for real-time online scoring, called a feature snapshot, rather than constructing samples through simple offline label splicing and feature backfilling, since that approach would cause large data inconsistencies. The original architecture is shown in the figure below:

[Figure: original feature-snapshot-based sample architecture]

As the feature scale grows and iteration scenarios become more and more complex, the problems of this solution become increasingly prominent. The biggest problem is that the online feature-extraction service is under great pressure; secondly, the cost of collecting the entire data stream is too high. This sample-collection scheme has the following problems:

  1. Long readiness time: under the current resource constraints, it takes nearly T+2 to prepare such a large amount of sample data, which affects algorithm-model iteration.
  2. High resource consumption: the existing sample-collection method calculates the features of all requests and then splices them with exposures and clicks. Since the features of unexposed Items are also calculated and the data lands in tables, a large amount of data is stored, consuming a lot of resources.

5.1.1 Common solutions

In order to solve the above problems, there are two common solutions in the industry: ① Flink real-time stream processing; ② KV cache secondary processing. The specific process is shown in the figure below:

[Figure: streaming splicing solution versus KV caching solution]


  1. Streaming splicing solution: use the low-latency stream-processing capability of stream-processing frameworks (Flink, Storm, etc.) to directly Join the exposure/click real-time streams with the feature-snapshot stream data in memory; streaming training samples are generated first, and then transferred to offline samples for model training. Streaming samples and offline samples are stored in different storage engines, supporting different types of model training. The problems with this solution: the amount of data on the data-stream link is still very large, occupying considerable message-queue resources (such as Kafka); and Flink resource consumption is too high: if the data volume is hundreds of GB per second, a 30-minute window Join requires 30 minutes × 60 × 100 GB of memory resources.
  2. KV caching solution: write all the feature snapshots from feature extraction into a KV store (such as Redis) and cache them for N minutes; the business system transmits the Items in the candidate queue to the real-time computing system (Flink or a consumer application) through a message mechanism. The number of Items at this point is much smaller than the number of previously requested Items, so the feature data of these Items is retrieved from the feature-snapshot cache and output through the message stream to support streaming training. This method relies on external storage; regardless of how features or traffic grow, Flink resources are controllable and operation is more stable. But the outstanding problem is that a large amount of memory is needed to cache the data.

5.1.2 Improvement and optimization

From the perspective of reducing invalid computation, not all requested data will be exposed, and the strategy has a stronger demand for exposed data, so moving day-level processing forward to stream processing can greatly improve data readiness time. Secondly, in terms of data content, features include request-level changing data and day-level changing data; flexibly separating the processing of the two on the link can greatly improve resource utilization. The figure below shows the specific plan:

[Figure: improved streaming sample pipeline]

1. Data splitting: to solve the problem of the large data-transmission volume (the feature-snapshot stream problem), the predicted Label is matched with the real-time data one by one, while the offline data is accessed a second time during backfill, which greatly reduces the size of the link data stream.

  • Only contextual real-time features remain in the sample stream, which increases the stability of reading the data stream; at the same time, because only real-time features need to be stored, Kafka disk storage is reduced by 10x.

2. Delayed-consumption Join: solves the problem of high memory usage.

  • The exposure stream is used as the main stream and written to HBase; at the same time, so that other streams can later Join the exposures in HBase, the RowKey is written to Redis. When subsequent streams write to HBase via the RowKey, the splicing of exposures, clicks, and features is completed with the help of external storage, ensuring that the system can run stably as data volume grows.
  • The sample stream is consumed with a delay: the sample stream from the backend service often arrives before the exposure stream, so to Join 99% of the exposure data, the sample stream's waiting window must be at least N minutes. The implementation stores all the data in the window period on Kafka's disk and uses the disk's sequential-read performance, avoiding the large amount of memory that would otherwise be needed to cache data during the window.

3. Sample feature backfill: through the Label Join, the number of feature requests backfilled here is less than 20% of the online volume. The samples are read with a delay, spliced with exposures, and filtered to the exposed model-service requests (with Context real-time features); then all offline features are backfilled, forming complete sample data that is written to HBase.

5.2 Structured Storage

As the business iterates, the number of features in the feature snapshot becomes larger and larger, bringing the overall feature snapshot to the tens-of-TB-per-day level in a single business scenario. From a storage perspective, the feature snapshots of a single business over multiple days are already at the PB level, approaching the storage threshold of the advertising algorithm, and storage pressure is high. From a computing perspective, with the original computation process and the resource limitations of the computing engine (Spark) (shuffle is used, and the data in the shuffle-write phase is written to disk; if the allocated memory is insufficient, multiple disk writes and external sorts occur), memory of the same size as the data itself and more computing CUs are needed to complete the computation effectively, so memory usage is high. The core flow of the sample-construction process is shown in the figure below:

[Figure: core flow of sample construction]

When backfilling features, there are the following problems:

  1. Data redundancy: the offline table used for feature backfill is generally a full dump, with entries in the hundreds of millions, while the number of entries used in sample construction is roughly the day's DAU, i.e. tens of millions; so the backfill table brings redundant data into the computation.
  2. Join order: the computation process of feature backfill is dimension-feature completion, involving multiple Join computations, so Join performance has a lot to do with the order of the Join tables. As shown in the figure above, if the left table is a large table at the tens-of-TB level, the subsequent shuffle computation will generate a large amount of network IO and disk IO.

To solve the problem of low sample-construction efficiency, in the short term we start with data structure management. The detailed process is shown in the figure below:

[Figure: sample construction process after structured-storage optimization]

  1. Structured splitting. The data is split into Context data and dimension data held in structured storage, instead of being stored in mixed form. This removes the large amount of redundant data that used to be carried along while splicing new features onto Label samples; and after structured storage, offline features achieve substantial storage compression.
  2. Efficient pre-filtering. Data filtering is moved ahead of the Join, reducing the amount of data involved in feature computation and effectively cutting network IO. During splicing, the Hive table used for feature backfill is generally a full table whose entry count roughly equals monthly active users, while the number of entries actually used in splicing is roughly daily active users, so there is a large amount of redundant data, and this invalid data brings extra IO and computation. The optimization is to precompute the dimension keys that will be used and generate the corresponding Bloom filter, then filter with that Bloom filter while reading the data, which greatly reduces redundant data transmission and redundant computation during backfill.
  3. High-performance Join. An efficient strategy is used to arrange the Join order, improving the efficiency and resource usage of feature backfill. Feature splicing involves Join operations across multiple tables, and the Join order strongly affects splicing performance: as shown in the figure above, if the left table being spliced is large, overall performance is poor. We can borrow the idea of the Huffman algorithm, treating each table as a node and its data volume as the node's weight, and roughly approximating the Join cost between two tables as the sum of the two nodes' weights. The problem can then be abstracted as constructing a Huffman tree, and the construction process of the Huffman tree gives the optimal Join order (a minimal sketch is given after this list).
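As an illustration of the Huffman-style ordering, the sketch below repeatedly merges the two "lightest" tables first so that the largest tables enter the join as late as possible; the table names and sizes are made up for demonstration.

```python
# Minimal sketch of Huffman-style join ordering: repeatedly merge the two
# smallest tables first, approximating each join's cost by the sum of the two
# node weights. Table names and sizes (in GB) are illustrative only.
import heapq

def huffman_join_order(table_sizes):
    """table_sizes: dict of table name -> data volume. Returns the merge plan."""
    heap = [(size, name) for name, size in table_sizes.items()]
    heapq.heapify(heap)
    plan = []
    while len(heap) > 1:
        w1, t1 = heapq.heappop(heap)          # two lightest nodes
        w2, t2 = heapq.heappop(heap)
        plan.append((t1, t2, w1 + w2))        # join cost ~ sum of the two weights
        heapq.heappush(heap, (w1 + w2, f"({t1} JOIN {t2})"))
    return plan

if __name__ == "__main__":
    sizes = {"label_samples": 200, "user_dim": 50, "poi_dim": 30, "context_dim": 10}
    for left, right, cost in huffman_join_order(sizes):
        print(f"join {left} with {right}, approximate cost {cost}")
```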

With these changes, offline data storage has been reduced by 80% and sample-construction efficiency has improved by 200%. We are also currently migrating the full sample data onto a data lake to further improve data efficiency.

6 Data preparation

The platform has accumulated a large amount of valuable content such as features, samples, and models, and hopes that reusing these data assets will help strategy engineers run better business iterations and achieve better business results. Feature optimization accounts for 40% of all the methods algorithm engineers use to improve model performance, yet traditional feature mining suffers from long turnaround time, low mining efficiency, and repeated mining of the same features. The platform therefore hopes to empower the business along the feature dimension. An automated experimental pipeline that verifies the effect of any feature and reports the resulting metrics to users would undoubtedly save strategists a great deal of time: once the full link is built, one only needs to feed in different candidate feature sets to obtain the corresponding effect metrics. To this end, the platform has built an intelligent mechanism of "addition", "subtraction", "multiplication", and "division" over features and samples.

6.1 Do "addition"

Feature recommendation is based on a model-testing approach: features are reused in the existing models of other business lines, new samples and models are constructed, and the offline performance of the new model is compared with the Base model to obtain the benefit of the new features, which is then automatically pushed to the relevant business owners. The specific feature recommendation process is shown in the figure below:

[Figure: feature recommendation process]

  1. Feature awareness: feature recommendation is triggered through the launch wall or by taking stock of existing features across businesses. These features have already been verified to some extent, which helps ensure the success rate of feature recommendation.
  2. Sample production: features are extracted according to the configuration file during sample production, and the process automatically adds the new features to the configuration file and then produces new sample data. After the new features are obtained, the original features, dimensions, and UDF operators they depend on are analyzed, and the new feature configuration plus its dependent source data are merged into the baseline model's original configuration file to form a new feature configuration file. New samples are then built automatically: during construction, the relevant features are pulled from the feature warehouse by feature name and the configured UDFs are invoked for feature computation. The time range for sample construction is configurable.
  3. Model training: the model structure and sample-format configuration are transformed automatically, and model training is then performed. TensorFlow is used as the training framework, with TFRecord as the sample input format. New features are placed into two groups, A and B, for numerical and ID-type features respectively; ID-type features go through an embedding table lookup and are then appended uniformly to the existing features, so the new samples can be consumed for model training without modifying the model structure (a minimal sketch is given after this list).
  4. Automatic configuration of new training parameters: the training dates, sample paths, model hyperparameters, and so on are configured automatically, the data is split into training and test sets, and the new model is trained automatically.
  5. Model evaluation: the evaluation interface is called to obtain offline metrics, the evaluation results of the new and old models are compared, and per-feature evaluation results are retained; by shuffling individual features, the contribution of each single feature is reported. The evaluation results are sent to users in a unified manner.
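As a concrete illustration of step 3, the sketch below appends a new numerical feature group and a new ID-type feature (via hashing plus an embedding lookup) to an existing feature vector without touching the rest of the network. The layer sizes, vocabulary size, and feature names are assumptions for illustration, not the production model structure.

```python
# Minimal TensorFlow/Keras sketch: append new numerical (group A) and ID-type
# (group B) features to the existing feature vector without changing the rest
# of the model. All sizes and feature names here are illustrative assumptions.
import tensorflow as tf

existing_features = tf.keras.Input(shape=(128,), name="existing_features")
new_numeric = tf.keras.Input(shape=(4,), name="new_numeric")             # group A
new_id = tf.keras.Input(shape=(1,), dtype=tf.string, name="new_id")      # group B

# ID-type feature: hash into a fixed-size vocabulary, then embedding table lookup.
hashed = tf.keras.layers.Hashing(num_bins=100_000)(new_id)
embedded = tf.keras.layers.Embedding(input_dim=100_000, output_dim=8)(hashed)
embedded = tf.keras.layers.Flatten()(embedded)

# Append the new groups uniformly to the existing features.
concat = tf.keras.layers.Concatenate()([existing_features, new_numeric, embedded])
hidden = tf.keras.layers.Dense(256, activation="relu")(concat)
output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs=[existing_features, new_numeric, new_id], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```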


6.2 Do “subtraction”

After feature recommendation was implemented in advertising and delivered a certain amount of revenue, we made some new explorations at the feature-empowerment level. As models are continuously optimized, features expand very quickly and the resource consumption of model services rises sharply, so eliminating redundant features and "slimming down" the model became imperative. The platform therefore built a set of end-to-end feature screening tools.

[Figure: end-to-end feature screening process]

  1. Feature scoring: all features of the model are scored by evaluation algorithms such as WOE (Weight of Evidence); features with higher scores are of higher quality and the evaluation accuracy is higher (a minimal WOE sketch is given after this list).
  2. Effect verification: after the model is trained, features are sorted by score and eliminated in batches. Specifically, the feature-shuffling method is used to compare the evaluation results of the original model and the shuffled model; when the gap exceeds a threshold, the evaluation ends and the features that can be eliminated are reported.
  3. End-to-end solution: once the user configures the experiment parameters and metric thresholds, the removable features, together with the offline evaluation results of the model after removing them, can be provided without human intervention.
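Below is a minimal sketch of WOE-based feature scoring on a binary label, using equal-frequency bins and Information Value (IV) as the final score. The column names, binning scheme, and the use of IV as the score are simplifications of the platform's scoring step, not its actual implementation.

```python
# Minimal sketch of WOE (Weight of Evidence) feature scoring on a binary label,
# using equal-frequency bins and Information Value (IV) as the feature score.
# Column names and the binning scheme are illustrative simplifications.
import numpy as np
import pandas as pd

def woe_iv_score(df, feature, label, bins=10, eps=1e-6):
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    grouped = df.groupby(binned)[label].agg(["sum", "count"])
    pos = grouped["sum"]                       # positives (e.g. clicks) per bin
    neg = grouped["count"] - grouped["sum"]    # negatives per bin
    pos_rate = (pos + eps) / (pos.sum() + eps)
    neg_rate = (neg + eps) / (neg.sum() + eps)
    woe = np.log(pos_rate / neg_rate)
    return ((pos_rate - neg_rate) * woe).sum()  # higher IV -> more informative feature

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({"ctr_7d": rng.random(10_000),
                         "click": rng.integers(0, 2, 10_000)})
    print("IV of ctr_7d:", woe_iv_score(demo, "ctr_7d", "click"))
```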

In the end, after taking 40% of the features offline in an internal model, the decline in business metrics was still kept within a reasonable threshold.

6.3 Do “multiplication”

To obtain better model performance, a number of new explorations have been started within advertising, including large models, real-time features, feature libraries, and so on. Behind these explorations lies one key goal: the need for more and better data to make models smarter and more effective. Starting from the current situation of advertising, we proposed building a sample bank (Data Bank) to bring in external data of more types and at larger scale and apply it to existing businesses. The details are shown in the figure below:

[Figure: sample bank (Data Bank)]

We have established a universal sample-sharing platform on which samples from other business lines can be borrowed to generate incremental samples. We have also built a common Embedding-sharing architecture to integrate large- and small-scale businesses. The following takes reusing non-advertising samples in the advertising business line as an example; the specific approach is as follows:

  1. Sample expansion: a highly scalable sample library, DataBank, is built on the Flink stream-processing framework. Business A can easily reuse exposure, click, and other Label data from Business B and Business C for experiments; for small business lines in particular, this expands a large amount of valuable data. Compared with offline backfill, this approach offers stronger consistency, and the feature platform provides online/offline consistency guarantees.
  2. Sharing: once the samples are ready, a very typical application scenario is transfer learning. In addition, a data path for Embedding sharing has also been built (it does not depend heavily on the "sample expansion" process): all business lines can train on top of large Embeddings, each business party can also update the Embedding, and an Embedding version mechanism is maintained online for use by multiple business lines (a minimal sketch is given after this list).
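Below is a minimal sketch of consuming a shared Embedding in a downstream business model, assuming the shared table has been exported as a NumPy matrix. The export path, vocabulary size, versioned file name, and the choice to fine-tune are all hypothetical assumptions for illustration.

```python
# Minimal sketch of reusing a shared Embedding table in a downstream business
# model. The export path, vocabulary size, and fine-tuning choice are all
# hypothetical; the versioned file name stands in for the version mechanism.
import numpy as np
import tensorflow as tf

SHARED_EMBEDDING_PATH = "/shared/embeddings/user_v20230401.npy"   # hypothetical path

shared_weights = np.load(SHARED_EMBEDDING_PATH)                   # shape: (vocab, dim)
vocab_size, embed_dim = shared_weights.shape

user_id = tf.keras.Input(shape=(1,), dtype=tf.int64, name="user_id")
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(shared_weights),
    trainable=True,                    # fine-tune on the downstream business samples
)(user_id)
flattened = tf.keras.layers.Flatten()(embedding)
output = tf.keras.layers.Dense(1, activation="sigmoid")(flattened)

model = tf.keras.Model(inputs=user_id, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```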

For example, by reusing non-advertising samples in one business within advertising, the number of samples was increased several times; combined with transfer learning, offline AUC improved by four thousandths, and CPM increased by 1% after going online. In addition, we are also building an advertising sample theme library to manage the sample data generated by each business in a unified way (unified metadata), expose a unified sample theme classification to users for quick registration, search, and reuse, and unify the underlying storage, which saves storage and computing resources, reduces data joins, and improves timeliness.

6.4 Do "division"

Through feature "subtraction" we can remove features with no positive effect, but observation shows that the model still contains many features of low value. We therefore go a step further and consider value and cost together: under full-link, cost-based constraints, we screen out features whose input-output ratio is relatively low and reduce resource consumption. This process of solving under cost constraints is defined as "division". The overall process is shown in the figure below.

[Figure: overall process of feature "division"]

Along the offline dimension, we have established a feature-value evaluation system that gives each feature's cost and value; during online inference, the feature-value information can be used for operations such as traffic degradation and elastic feature computation. The key steps of "division" are as follows:

  1. Problem abstraction: if we can obtain a value score for each feature, and we can also obtain its cost (storage, communication, computation, and processing), then the problem becomes how to maximize the total value of the features given a known model structure and a fixed resource cost.
  2. Value evaluation under cost constraints: based on the model's feature set, the platform first aggregates statistics on cost and value; the cost covers both offline and online cost, and a trained evaluation model is used to obtain a comprehensive ranking of the features (a minimal selection sketch is given after this list).
  3. Scenario modeling: different feature sets can be selected for modeling under different resource conditions. With limited resources, the model with the greatest value is chosen to serve online. In addition, a model can be built for a relatively large feature set and enabled during off-peak hours to improve resource utilization and bring greater benefit to the business. Another application scenario is traffic degradation: the inference service monitors online resource consumption, and once computation hits a bottleneck, it switches to the degraded model.
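As an illustration of the selection step implied by items 1 and 2, the sketch below ranks features by value per unit cost and keeps them greedily until a resource budget is exhausted. The feature names, values, costs, and budget are made up, and the greedy ratio heuristic only approximates an exact solution of the constrained optimization.

```python
# Minimal sketch of selecting features under a fixed resource budget: rank by
# value per unit cost and keep features greedily until the budget is spent.
# The feature names, values, costs, and budget below are illustrative only,
# and the greedy ratio heuristic only approximates an exact knapsack solution.
def select_features(features, budget):
    """features: list of (name, value_score, cost). Returns the kept feature names."""
    ranked = sorted(features, key=lambda f: f[1] / f[2], reverse=True)
    kept, spent = [], 0.0
    for name, value, cost in ranked:
        if spent + cost <= budget:
            kept.append(name)
            spent += cost
    return kept

if __name__ == "__main__":
    candidates = [
        ("user_ctr_7d", 0.80, 2.0),
        ("poi_category", 0.50, 1.0),
        ("realtime_geo_hash", 0.30, 3.0),
        ("long_tail_cross_feature", 0.05, 4.0),
    ]
    print(select_features(candidates, budget=6.0))
```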

7 Summary and Outlook

The above is our practice of counteracting the "growth" in large-scale deep learning engineering, helping to reduce business costs and improve efficiency. In the future, we will continue to explore and practice in the following directions:

  1. Full-link GPU adoption: at the inference level, switching to GPUs supports more complex business iteration while also greatly reducing overall cost. Next we will carry out GPU-oriented transformation of sample construction and feature services, and jointly push forward the upgrade of offline training.
  2. Sample data lake: build a larger sample warehouse by means of data-lake capabilities such as Schema Evolution and Patch Update, delivering low-cost, high-value data to the business side.
  3. Pipeline: during iteration across the algorithm's whole life cycle, debugging touches many stages, yet the link information is not sufficiently "connected in series", and the offline, online, and effect-metric perspectives remain fragmented. Standardization and observability across the full link are the general trend, and they are the foundation for intelligent, elastic deployment of the subsequent links. MLOps and cloud native, which are popular in the industry today, offer many ideas worth borrowing.
  4. Intelligent matching of data and models: as described above, with the model structure fixed, features can be automatically added to or removed from the model; similarly, at the model level, with the feature inputs fixed, new model structures can be embedded automatically. In the future, we will also complete the matching of data and models automatically, based on the business domain and through the platform's feature and model systems.

8 Authors of this article

Yajie, Yingliang, Chen Long, Chengjie, Dengfeng, Dongkui, Tongye, Simin, Lebin, etc., all come from Meituan's food delivery technical team.

