Adaptive and unsupervised multi-scenario modeling practice in Taobao personalized recommendations
This article shares our thinking and practice on adaptive and unsupervised multi-scenario modeling in Taobao's personalized recommendation scenarios. The work was published at CIKM 2022 (paper title: Scenario-Adaptive and Self-Supervised Model for Multi-Scenario Personalized Recommendation). The article introduces how multi-scenario modeling characterizes, in a fine-grained way, the transfer relationship between the full-domain scenario and individual scenarios to achieve domain adaptation, how unsupervised data is introduced into multi-scenario modeling, and how multi-scenario modeling was put into practice in the recall stage of recommendation.
We first introduce the business background, modeling motivation and solution selection for multi-scenario modeling. This article focuses on the multi-scenario modeling problem of recommendation systems, a problem that is common across recommendation systems and urgently needs solving. Specifically, the discussion is organized around five questions.
The multi-scenario problem can be explained from the business perspective and from the model perspective. From a business perspective, a "scenario" can be simply understood as one of the recommendation entrances or recommendation hosting pages on a recommendation platform. For example, in advertising, the same advertisement can be placed on different media terminals and in different delivery forms, such as information-flow advertisements or open-screen advertisements. In e-commerce there is also a rich set of recommendation pages. Taking Taobao as an example, there is the guess-your-favorite product recommendation interface on the homepage, recommendation scenarios in the shopping cart, and related recommendations on the product details page. In content recommendation, such as our actual business, shopping on Taobao, the recommendation pages include a one-hop scene displayed in two columns, as well as an immersive, endless recommendation feed that the user swipes up and down through after clicking into the second hop. In these examples each hosting page can be treated as a scene, so a single recommendation platform has multiple recommendation scenes.
From the modeling perspective, a multi-scenario problem can be simply defined as multiple data sets that share the same feature space and label space but have different data distributions. The data logged in each recommendation scenario forms its own data set. Although the data sets come from different sources, their feature systems and label spaces are consistent.
The first point is that traffic overlaps. The same user can visit different scenarios within one product or recommendation system, and leaves browsing, clicking and other interactive behaviors in multiple scenarios.
The second point is that supply overlaps. Whether it is an advertisement, a product or a piece of content, it can be placed and displayed in different scenes.
The third point is the distribution difference between scenarios. In terms of behavior patterns, the same user visits multiple scenarios and may behave differently in each. Similarly, from a supply perspective, the same advertisement, product or content shows different exposure and interaction performance in different scenarios. The data distributions of the scenarios therefore differ. The multi-scenario recommendation problem we face must capture both what scenarios have in common and what is specific to each: users may have continuing interests, but the way those interests are expressed can differ across scenarios.
The fourth point is that the feature spaces and targets of the scenarios are consistent. The differences between scenarios show up mainly in data distribution, while the feature system is relatively consistent. In addition, the label space is the same across scenes, which means the modeling tasks of the scenes are the same. For example, the task may be modeling clicks in product scenes, or modeling video completion or long plays in content scenes.
Given the specific business background, we hope multi-scenario modeling will achieve the following two goals.
The first is the performance goal, which focuses on the problem of data sparseness. A single recommendation scenario usually suffers from sparse user behavior data, and this is especially obvious in small or new scenarios. One important goal of multi-scenario modeling is to alleviate data sparseness and improve the business indicators of all scenarios by sharing information between scenarios.
The second is the goal in terms of iteration, operation and maintenance costs. The traditional approach is to allocate dedicated personnel to maintain and optimize an independent model for each scenario. When a new scenario is connected, the entire access process and model training process may have to be repeated. This means that personnel allocation, model deployment work and maintenance costs all grow with the number of scenarios, which leads to relatively high overall maintenance and iteration costs.
Considering these costs, we hope to build a unified algorithm framework that can serve all scenarios at the same time and support rapid access for new scenarios. This is the other goal we want to achieve through multi-scenario modeling.
It should also be emphasized that the multi-scenario problem differs from the well-known multi-target and cross-domain problems. A multi-scenario problem has multiple data sources that share the same feature system and the same target. A multi-target problem has multiple optimization objectives for the same data source. A cross-domain problem is usually divided into a source domain and a target domain, and it is generally assumed that the source domain has more data and better results, so the source domain is used to help improve the target domain. In a multi-scenario problem, no scenario is assumed to be superior or inferior to another, and the modeling goal is to improve the results of all scenarios.
Before applying this in practice, we made a simple classification of the existing industry solutions for multi-scenario modeling. They fall roughly into the following four types.
The first is the most intuitive and is still used in many businesses: treat each scenario independently and train a separate model on each scenario's own data. At online deployment and estimation time, each scenario then has its own model. With this approach, any existing model structure in the industry can be chosen for each scenario. It also has shortcomings, mainly the following. First, it cannot solve data sparseness within a single scenario, because it cannot supplement that scenario with information from other, similar scenarios. The problem is most prominent in scenarios with little data or sparse behavior. Second, as mentioned before, single-scenario-single-model modeling consumes considerable manpower and resources for model maintenance and iteration. Third, connecting a new scenario to the business incurs the same high cost again.
Since the single-scenario-single-model approach suffers from sparse samples, the second solution is to mix the sample data of all scenes, train one model on the mixed data, and deploy that same model to all scenes. This alleviates two problems of the first approach, because it uses samples from all scenes and all scenes share one model. Its shortcoming is that such coarse mixed training destroys the data distribution of each scene and introduces noise. In addition, the overall model may be dominated by the data of a few large scenes, which hurts the results of small scenes.
The third type of solution is two-stage training: first train a base model on the mixed samples of all scenes, then fine-tune that base model within each scene using the scene's own samples. For deployment and estimation, each scenario goes online with the model fine-tuned on its own data. The disadvantage is that each scenario still needs its own separately deployed model. In addition, this direct pre-train-then-fine-tune method does not model the relationship between scenes very well.
The last category is the current mainstream industry method for multi-scenario modeling. Borrowing from multi-task learning architectures, it takes the data of every scenario into account in the model structure for joint modeling, and uses the design of the model structure to capture what scenes share and how they differ. In the past two years there have been many attempts and implementations in the industry. For example, SAR-Net is trained in a way similar to MMOE, and STAR builds a globally shared network plus a network specific to each scenario and characterizes scene differences through matrix mapping between model parameters. Other work characterizes scene differences through dynamic parameter networks.
Considering the shortcomings of the first three types of methods, we chose to carry out subsequent work based on joint training.
In actual business implementation, jointly modeling multiple scenarios with a unified model also brings some more specific challenges, mainly in the following aspects.
Challenge 1: We use joint modeling in the hope of making full use of the information in all scenarios to solve data sparseness in a single scene. In essence, this means transferring the effective information of other scenes to a given scene. Although some of the methods mentioned earlier use parameter matrix operations or dynamic parameter networks to characterize the commonalities and differences of scenes, these transformations are relatively implicit. They cannot explain in an interpretable way whether other scenes have transferred information to the given scene, or how much. How to achieve fine-grained and effective transfer of scene information is therefore the first challenge we face. Put simply: how to model whether to transfer information, and how much to transfer.
Challenge 2: Model training is mainly based on user interaction behaviors. Positive feedback signals such as product clicks or video completions are used to construct positive samples, so training happens on labeled sample data. This causes serious data sparseness in scenarios with few behaviors. If we can use unsupervised tasks to extend the training data from the labeled sample space to the unlabeled sample space, it will help alleviate data sparseness.
Challenge 3: The recommendation pipeline is divided into several core stages such as recall and ranking. In our early research we found that joint modeling for multi-scenario problems mainly focused on the ranking stage, and the models listed above are basically ranking models. Recall, as the first stage of the pipeline, differs greatly from ranking in the size of the candidate set and in the retrieval method. How to implement multi-scenario joint modeling in the recall stage is therefore also a challenge we face.
Next we introduce the model solution we implemented in our actual business, referred to as SASS. The solution centers on three keywords. The first is scenario-adaptive transfer (Scenario-Adaptive), the second is unsupervised learning (Self-Supervised), and the third is implementation exploration for the recall task.
The overall model framework contains two stages: a pre-training task and a fine-tuning task. The first stage constructs an unsupervised pre-training task on an unlabeled sample set and models the relationship between scenes through contrastive learning. In addition, because the model is implemented in the recall stage, the user side and item side need to be modeled independently, so we use a symmetrical structural design on the user side and item side.
The second stage is a fine-tuning task that reuses the model structure of the first stage and loads the embeddings and network parameters pre-trained there. It trains a recall task on the labeled sample space and outputs the user-side and item-side representation vectors. Next, we introduce these two stages in detail.
In the first stage, the pre-training task, we construct an unsupervised contrastive learning task between scenes. As shown in the upper right corner of the figure, the classic training paradigm of contrastive learning should be familiar: the same object x is turned into two different feature sets by two different data augmentation methods, a feature extraction network and a projection network then produce two different vector representations of the same object, and a contrastive metric loss pulls the two vectors closer together, which gives an unsupervised pre-training task.
Inspired by contrastive learning, we combined the alignment of representations across scenes in multi-scene modeling with a contrastive pre-training task. As mentioned earlier, the same user may visit multiple scenes and behave differently in each, leaving scene-related statistical information. We can therefore regard the difference in a user's behavior across scenes as a natural form of data augmentation. The interests of the same user are continuous, but the way they are expressed may differ between scenes. We build the unsupervised contrastive learning task on this basis.
Looking at the specific model, as shown on the left side of the figure, we build a unified feature system for the different scenarios, but the specific feature values are scenario-dependent. For example, we split the user behavior sequence by scene, and the user has corresponding statistical features, such as interests and preferences, in each scene. Through this splitting, the same user has multiple scenario-specific sets of feature values, for example the features of the same user in scene a and in scene b in the figure. A unified representation network (introduced later) then produces the representation vectors of the same user in the different scenes, and a contrastive learning loss pulls the two closer together.
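To make the splitting step concrete, here is a minimal sketch that groups one user's behavior log into per-scene sequences. The `(scene_id, item_id)` event format and the function name `split_behaviors_by_scene` are assumptions made for illustration; the real feature pipeline also includes per-scene statistical features that are not shown here.

```python
from collections import defaultdict

def split_behaviors_by_scene(behaviors):
    """Group one user's (scene_id, item_id) events into per-scene sequences.

    The order of events inside each scene is preserved.
    """
    per_scene = defaultdict(list)
    for scene_id, item_id in behaviors:
        per_scene[scene_id].append(item_id)
    return dict(per_scene)

# Hypothetical interaction log for a single user, in time order.
behaviors = [("scene_a", 101), ("scene_b", 205), ("scene_a", 102), ("scene_b", 207)]
views = split_behaviors_by_scene(behaviors)
```

Each value in `views` can then be treated as one scene-specific view of the same user.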
What was just described is the training method on the user side. In a recall task, the user side and item side are usually modeled independently. The item side therefore uses a symmetrical structure and task for training, and the user side and item side share the same embedding layer. Specifically, for the same item we split the item-side feature values by scene. After passing through the item-side representation network we obtain the item's vector representation in each scene, and then train with the same contrastive learning loss.
When constructing samples we apply one special treatment: a user may visit more than two scenes, so when constructing the contrastive learning task we combine the scenes visited by the user in pairs to build multiple training samples. On the item side, multiple samples are built through the same pairwise combination.
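The pairwise combination can be sketched as follows. This is only an illustration: the dictionary layout and the helper name `build_contrastive_pairs` are hypothetical, and the feature sets are reduced to short item-id lists.

```python
from itertools import combinations

def build_contrastive_pairs(scene_features):
    """Return one (view_1, view_2) pair for every pair of scenes visited.

    scene_features maps a scene id to that scene's feature set for one
    user (or one item). Scenes are sorted so the output is deterministic.
    """
    scenes = sorted(scene_features)
    return [(scene_features[a], scene_features[b]) for a, b in combinations(scenes, 2)]

# A hypothetical user who visited three scenes yields three positive pairs.
user_views = {"scene_a": [101, 102], "scene_b": [205], "scene_c": [309, 310]}
pairs = build_contrastive_pairs(user_views)
```

A user with n visited scenes contributes n(n-1)/2 pairs, so users active in many scenes contribute more pre-training samples.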
In terms of specific contrastive learning tasks, we continue to use the loss form of InfoNCE for training.
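Below is a minimal NumPy sketch of an InfoNCE loss. The temperature value, the cosine normalization and the use of the other rows in the batch as negatives are assumptions for illustration; the text above only states that the InfoNCE form is used.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE with in-batch negatives.

    z_a[i] and z_b[i] are the representations of the same user (or item)
    in two different scenes; every other row of z_b is a negative for z_a[i].
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

rng = np.random.default_rng(0)
z_a = rng.normal(size=(4, 8))
aligned = info_nce(z_a, z_a + 0.01 * rng.normal(size=(4, 8)))  # matching views
shuffled = info_nce(z_a, z_a[::-1].copy())                     # mismatched views
```

The loss is low when the two scene views of the same object are close and high when they are mismatched, which is the behavior the pre-training task relies on.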
By modeling scenes and the contrastive learning task between scenes, we achieve pre-training on unlabeled data across multiple scenes. Next, we introduce the more important design details of the representation network in the model framework.
The scene representation network in the model is a multi-layer, scene-adaptive transfer network. Looking at the overall structure, parameters are shared in the embedding layer, and the representation network consists of several components. The first is the network shared by all scenes, the blue part on the left side of the model in the figure. Samples from every scene pass through this globally shared network, so it can be seen as a representation structure trained on the mixture of all scene samples on the user side or item side. The second is the network specific to each scene, the gray part corresponding to each scene in the figure. The samples of each scene are trained through the corresponding network. Because the network parameters of each scene are separate, this representation can capture the distribution differences between scenes well. In addition, in the lower left corner of the figure, we introduce an auxiliary bias network. Its input includes the scene id, some scene-specific features and some contextual features, which further characterizes the differences and bias information between scene contexts on top of the shared feature system.
In training, each sample passes through the unified feature embedding layer, the embeddings are concatenated, and the result enters both the full-scene shared network and the network specific to the sample's scene for forward and backward propagation. Throughout the network, the output of each layer of the full-scene shared network passes through a scene-adaptive gating unit, which transfers the fused full-scene information to the single scene and thereby achieves fine-grained transfer of scene information.
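As a rough illustration of one layer of this gated transfer, the NumPy sketch below assumes two sigmoid gates: one scales how much full-scene information flows in, the other weights the fusion with the scene's own hidden state. The gate formulas, the parameter names `W_adapt` and `W_update`, and the shapes are assumptions for illustration, not the exact formulation in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scene_adaptive_transfer(h_global, h_scene, bias_emb, params):
    """One gated transfer layer (assumed form).

    h_global: (batch, d) output of the shared network at this layer
    h_scene:  (batch, d) output of the scene-specific network at this layer
    bias_emb: (batch, b) output of the scene bias network
    params:   dict with "W_adapt" and "W_update", each of shape (2*d + b, d)
    """
    gate_in = np.concatenate([h_global, h_scene, bias_emb], axis=-1)
    adaptive = sigmoid(gate_in @ params["W_adapt"])    # how much global info may flow in
    update = sigmoid(gate_in @ params["W_update"])     # fusion weight
    transferred = adaptive * h_global
    return update * transferred + (1.0 - update) * h_scene

d, b = 4, 2
params = {"W_adapt": np.zeros((2 * d + b, d)), "W_update": np.zeros((2 * d + b, d))}
out = scene_adaptive_transfer(np.ones((3, d)), np.zeros((3, d)), np.ones((3, b)), params)
```

With all-zero gate weights both gates output 0.5, so `out` equals 0.5 * (0.5 * h_global) + 0.5 * h_scene. Learned weights would let each scene decide, per dimension, how much shared information to accept.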
For details, please refer to the structure in the upper right corner of the model in the figure. The transfer structure mainly consists of an adaptive gate and an update gate. The output of the adaptive gate controls how much of the full-scene information can be transferred to a single scene, while the output of the update gate gives the weights for the weighted fusion of the information transferred from the full-scene network and the original information of the single scene. The inputs of both gate networks include the full-scene network's information, the single-scene network's information and the bias information of the scene itself. Through this fine-grained, adaptive transfer structure, both the direction of transfer and the amount of information transferred are modeled. We stack the transfer structure in multiple layers, so that each sample finally obtains a vector representation for its scene. Finally, the output of each scene is fused with the output of the bias network for that scene to obtain the final vector representation in that scene.
2. Phase 2: Fine-tuning tasks
The second stage is the fine-tuning task. Since we want to implement the model in the recall stage of the recommendation pipeline, the goal of the fine-tuning task is aligned with the recall task. For sample selection, we use the items clicked by the user as positive samples, construct negative samples through random sampling, and then build triples and compute a pairwise loss for training.
In addition, the fine-tuning stage reuses the model structure and parameters: it uses the same representation network structure as the pre-training stage and loads the embedding layer and network parameters from pre-training, which retains the information learned by the unsupervised training between scenes in the first stage.
On top of the user-item metric matching task in the fine-tuning phase, we also introduce a new auxiliary task to help training. As mentioned earlier, each sample obtains two vector representations from the representation network. One is the output of its single-scene network, which captures its independent representation in the corresponding scene. The other is the globally shared output, which captures the global representation of the user or item features. The training task in the fine-tuning phase therefore contains two losses. One is computed between the user embedding and item embedding output by the single-scene network. The other is computed in the same way between the user embedding and item embedding output by the full-scene network. The weighted sum of the two losses is used as the final training loss. Introducing the full-scene loss as an auxiliary task amounts to describing how the same user and item are represented across the whole domain. Although that representation may not suit the independent features of each scene, adding it as a global task benefits the convergence of the overall effect, which the later experimental analysis also demonstrates.
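The combined objective can be sketched as below. The hinge form of the pairwise loss, the margin, the inner-product score and the weight `aux_weight` are all assumptions for illustration; the text only states that a pairwise loss on triples is used and that the two losses are summed with weights.

```python
import numpy as np

def pairwise_loss(user, pos_item, neg_item, margin=0.1):
    """Hinge-style triplet loss on inner-product scores (an assumed form)."""
    pos_score = np.sum(user * pos_item, axis=1)
    neg_score = np.sum(user * neg_item, axis=1)
    return float(np.mean(np.maximum(0.0, margin - pos_score + neg_score)))

def finetune_loss(scene_triplet, global_triplet, aux_weight=0.5):
    """Scene-specific loss plus a weighted auxiliary loss on the shared outputs.

    Each triplet is (user_emb, clicked_item_emb, sampled_item_emb).
    """
    return pairwise_loss(*scene_triplet) + aux_weight * pairwise_loss(*global_triplet)

user = np.array([[1.0, 0.0]])
clicked = np.array([[1.0, 0.0]])
sampled = np.array([[0.0, 1.0]])
loss_good = finetune_loss((user, clicked, sampled), (user, clicked, sampled))
loss_bad = finetune_loss((user, sampled, clicked), (user, sampled, clicked))
```

In this toy case the loss is zero when the clicked item already scores higher than the sampled one by at least the margin, and positive when the order is reversed.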
Next, we introduce how the recall model is deployed. After training is complete, we deploy the fine-tuned model online. During online estimation, the information of each scene passes through the network of the corresponding scene in the model to obtain the representation vector in that scene.
In addition, the output of the full-scene network from the auxiliary task is used only in the training phase. Because it is trained on mixed samples it may contain some noise, so at prediction time each scene still uses the vector output by its own scene network. For the recall task, on the item side we generate this vector for all candidates and build the corresponding index. The user vector is generated during online estimation by the deployed model, the top-k results are obtained through vector retrieval, and the results are passed on to the ranking stage for the subsequent steps of the recommendation pipeline.
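The retrieval step can be illustrated with a brute-force inner-product search over a per-scene item matrix. This is only a sketch: the per-scene dictionary `item_index`, the scoring by inner product and the exhaustive search are assumptions, and a production system would normally use an approximate nearest-neighbor index instead.

```python
import numpy as np

def retrieve_top_k(user_vector, item_vectors, k):
    """Return the row indices of the k items with the highest inner product."""
    scores = item_vectors @ user_vector
    return np.argsort(-scores)[:k].tolist()

# Hypothetical scene-specific item vectors, one matrix per scene.
item_index = {
    "scene_a": np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]),
}
user_vector_in_scene_a = np.array([1.0, 0.2])
top_items = retrieve_top_k(user_vector_in_scene_a, item_index["scene_a"], k=2)
```

The scores here are 1.0, 0.2 and 0.84, so items 0 and 2 are returned, in that order.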
Next, we introduce some experimental analysis and the applications in which this model has already been implemented.
We compared our model with other methods on two open-source data sets and on an industrial data set from our own business.
The comparison methods fall into three categories. The first is the traditional single-scene model. Because we focus on a recall task, we compare against some popular recall methods in the industry, such as YoutubeDNN, MIND, BST and DSSM. These single-scene models are trained with independent samples from each scene. The second category trains on samples mixed from multiple scenes, with the model still being one of the single-scene models commonly used in the industry. The third category consists of the existing multi-scenario joint modeling methods in the industry together with the ones we propose. Some of these methods were designed for the ranking stage, so to compare them fairly in the recall stage we modified them slightly: we take the output of the last layer of the ranking model network as a representation vector to adapt it to the recall task.
The last two columns in the above table are the models we proposed. SASS-Base is the model structure without pre-training, while SASS adds the pre-training stage. Because the second data set we used lacks some features and cannot support the pre-training task, we focused on comparing SASS-Base with the other methods on that data set.
From comparing the various methods, we drew several useful conclusions. First, a single-scene model trained with mixed samples is in most cases less effective than a single-scene model trained on its own separate samples. This is consistent with our earlier analysis and survey: mixing samples may introduce more noise and break the original data distribution of each scene. However, for some small scenes with particularly sparse data, mixed samples can achieve better results. For these scenes it is difficult to learn effective information from sparse data alone, and although mixed data may be biased, the larger sample size brings some benefit and improves the effect. Second, models trained through multi-scenario joint modeling are generally better than the first two types of single-scenario methods. Our model without the pre-training task, that is, the SASS-Base structure, is basically better than or on par with the other multi-scenario joint modeling methods in each scene. After adding the pre-training task, the overall effect improves further.
We subsequently conducted a series of ablation experiments, which mainly included the following parts. The first concerns the adaptive gate structure that transfers information from the global scene to a single scene. We compared the model's gate network with other existing transfer methods, including (1) using matrix multiplication mapping to transfer information; (2) a Simnet-like approach that combines the two features by addition, multiplication and concatenation and then fuses them through an MLP; (3) a network structure similar to MOE, which transfers information through a Sigmoid gate. Judging from the experimental results, our adaptive method achieves good results.
The second compares whether to add a pre-training task and how different pre-training task types affect the results. The pre-training task used as the comparison is predicting the next video or next item from the user's behavior sequence. The comparison shows that adding a pre-training task helps, and that contrastive learning between scenes improves the model's effect.
The third examines the auxiliary task and the auxiliary network in the model design. One ablation concerns the globally shared network we introduce in the fine-tuning stage, whose output is used for auxiliary fine-tuning training. The other concerns the scene-related bias features that we fuse into the output of each scene. The experimental results show that adding each of these two structures brings a certain improvement to the overall model effect.
In addition, since our representation network is a multi-layer information transfer structure, we also examined how the number of network layers affects the model. The overall trend is that as the number of layers increases, the effect first improves and then declines. Our analysis is that the decline may be because the number of parameters grows with the number of layers, which leads to some over-fitting. In addition, transferring so much information after the upper-level representations have been obtained may make the single-scene representation more susceptible to full-scene information. Increasing the number of layers in this structure can therefore improve the effect to a certain extent, but more layers are not necessarily better.
In the next set of experiments, we compared different item-side representation vectors for the recall task, because each scene generates its own vectors for recall. In some multi-scenario modeling work the user side has different representations per scene, but the item side is not described in detail. In our recall task, both the user side and the item side have an independent vector representation for each scene, so we compared the scene-specific item representations with a single item-side embedding shared by all scenes. The comparison shows that independent vector representations per scene also make a difference on the item side.
Finally, we ran an online A/B experiment with this model in an actual content recommendation scenario. It achieved good results on several experimental indicators, and the improvement was larger in relatively small scenarios and scenarios with sparse data.
Currently, the model solution we proposed has been promoted in Taobao’s content recommendation scenarios, including short videos, image and text recommendations, etc., and this model has become one of the main recall methods in various scenarios.
Finally, let's summarize. The problem we want to solve is multi-scenario modeling in the recommendation field, which is a common problem in recommendation systems. Our core goal for multi-scenario modeling is to build a unified framework that makes maximal use of the information across scenarios. This joint modeling approach addresses data sparseness and improves the business indicators of each scenario, and using one shared architecture reduces the cost of model iteration and deployment. In our actual business applications, however, multi-scenario modeling faces three core challenges. The first is how to achieve fine-grained and effective transfer of scene information. The second is how to solve data sparseness in multi-scene modeling by introducing unlabeled data. The third is how to implement multi-scene joint modeling in the recall stage.
In our practice, we address these challenges by designing an adaptive scene information transfer network, constructing unsupervised contrastive learning tasks between scenes, and adapting the model structure, training method and deployment to the task of the recall stage. This scene-adaptive unsupervised model is now well implemented in all scenes and is used as a main recall method.
A1: This is a question about model evaluation. Evaluation needs to match the modeling goal of each scenario, and the improvement in indicators should be examined scenario by scenario. For a task in the recall stage, select recall-related evaluation indicators for each scene, such as Hit Rate or NDCG. For the ranking stage, focus mainly on ranking-related indicators such as AUC and GAUC.
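A small sketch of per-scene recall evaluation follows. It assumes a single ground-truth item per test case, so the ideal DCG is 1; the data layout and the function names are hypothetical.

```python
import math

def hit_rate_at_k(ranked_items, target, k):
    """1.0 if the target item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item, so the ideal DCG equals 1."""
    top = ranked_items[:k]
    if target not in top:
        return 0.0
    return 1.0 / math.log2(top.index(target) + 2)

def evaluate_by_scene(eval_cases, k):
    """eval_cases: {scene: [(ranked_items, target), ...]} -> per-scene averages."""
    report = {}
    for scene, cases in eval_cases.items():
        hr = sum(hit_rate_at_k(r, t, k) for r, t in cases) / len(cases)
        ndcg = sum(ndcg_at_k(r, t, k) for r, t in cases) / len(cases)
        report[scene] = {"hr": hr, "ndcg": ndcg}
    return report

eval_cases = {
    "scene_a": [([5, 3, 9], 3), ([1, 2, 4], 7)],
    "scene_b": [([8, 6, 2], 8)],
}
report = evaluate_by_scene(eval_cases, k=2)
```

Reporting the metrics per scene, rather than one pooled number, keeps large scenes from hiding regressions in small ones.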
A2: We are solving the multi-scenario problem as a whole. By definition, samples are largely aligned across scenarios, so in modeling we align and flatten the features as far as possible. For features that are unique to a scenario, we designed a scenario bias network: features that cannot be fully aligned go into this separate network structure.
A3: This code is currently used in the company's production business, so open-sourcing it must comply with the company's information disclosure requirements. We will follow up on this and may release a demo version as open source.
A4: Our model is currently implemented in the recall stage, so the whole model is deployed there. That said, what we provide is a fairly general scheme, and with some modifications it can be used as a ranking model.
A5: Negative sampling is performed mainly in the fine-tuning stage. We use a method that is common in recall tasks: the user's clicks are positive samples, and negatives are randomly sampled according to each item's exposure probability in each scenario. Because the training task is still trained separately for each scenario, the negatives for a given positive sample are drawn from the exposure space of that same scenario. This yields a pair of positive and negative samples.
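The sampling described here can be sketched roughly as follows. This is a minimal illustration, not the production sampler: the exposure counts, scenario names, and function names are hypothetical, and rejecting the clicked item when it is drawn as its own negative is an assumption added for the example.

```python
import random


def build_samplers(exposure_counts):
    """Turn per-scenario exposure counts into (items, weights) lists."""
    samplers = {}
    for scenario, counts in exposure_counts.items():
        items = sorted(counts)
        samplers[scenario] = (items, [counts[item] for item in items])
    return samplers


def sample_negatives(samplers, scenario, positive_item, num_neg, rng):
    """Draw negatives from the exposure space of the given scenario only.

    Items are drawn with probability proportional to their exposure count.
    Assumption: a draw equal to the clicked item is rejected and redrawn.
    """
    items, weights = samplers[scenario]
    if all(item == positive_item for item in items):
        raise ValueError("no negative candidates in this scenario")
    negatives = []
    while len(negatives) < num_neg:
        candidate = rng.choices(items, weights=weights, k=1)[0]
        if candidate != positive_item:
            negatives.append(candidate)
    return negatives


# Hypothetical exposure counts per scenario.
exposure_counts = {
    "homepage": {"i1": 50, "i2": 30, "i3": 20},
    "immersive_feed": {"i7": 10, "i8": 90},
}
rng = random.Random(0)
samplers = build_samplers(exposure_counts)

# User u1 clicked i1 on the homepage: pair the click with homepage negatives.
negatives = sample_negatives(samplers, "homepage", "i1", 4, rng)
pairs = [("u1", "i1", negative) for negative in negatives]
feed_negatives = sample_negatives(samplers, "immersive_feed", "i8", 3, rng)
```

Because each scenario has its own sampler, an item that was only exposed in another scenario can never be drawn as a negative here.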
A6: This is a fairly broad question, so let me break it into two parts. First, the current model is implemented in Taobao's content recommendation scenarios, where content (image-and-text posts and videos) is the main carrier. The model adapts fully across these scenarios because their feature systems are essentially the same and can be reused. Products and content, by contrast, differ in both data distribution and feature system, which makes that case more of a cross-domain problem: a fusion of the product domain and the content domain. In the next phase we hope to bring cross-domain ideas into multi-scenario modeling and migrate behavioral information from the product domain to the content domain. Second, on unifying objectives: we have already done this. On the home page, for example, the user's click is the positive signal. In the immersive up-and-down feed there are no clicks, so we use duration signals such as long plays and completed plays as positive feedback. The labels are thus flattened into a single binary dimension.
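The label unification can be pictured with a small sketch like the one below. The event fields, scenario names, and the 10-second long-play threshold are all hypothetical; the talk only says that clicks are positives on the home page and that long plays and completions are positives in the immersive feed.

```python
def unified_label(event, long_play_seconds=10.0):
    """Map one exposure event to a binary label (1 = positive feedback).

    long_play_seconds is a made-up threshold, not a value from the talk.
    """
    if event["scenario"] == "homepage":
        # Click-based scenario: the click itself is the positive signal.
        return 1 if event.get("clicked", False) else 0
    if event["scenario"] == "immersive_feed":
        # No clicks here: a long play or a completed play counts as positive.
        long_play = event.get("play_seconds", 0.0) >= long_play_seconds
        completed = event.get("completed", False)
        return 1 if (long_play or completed) else 0
    raise ValueError(f"unknown scenario: {event['scenario']}")


events = [
    {"scenario": "homepage", "clicked": True},
    {"scenario": "homepage", "clicked": False},
    {"scenario": "immersive_feed", "play_seconds": 25.0, "completed": False},
    {"scenario": "immersive_feed", "play_seconds": 3.0, "completed": True},
    {"scenario": "immersive_feed", "play_seconds": 2.0, "completed": False},
]
labels = [unified_label(event) for event in events]
```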
A7: In general we follow the contrastive learning approach. The key point is to split features by scenario and then build a pre-training task on them. For details, please refer to the earlier slides or to our paper (Scenario-Adaptive and Self-Supervised Model for Multi-Scenario Personalized Recommendation).
A8: The pre-training task runs on an unlabeled space. Suppose a user has visited two scenarios, a and b. The user has some static attribute features plus a historical behavior sequence retained in each scenario, and together these form the user's feature set in scenario a and in scenario b. The pre-training objective is to model this visiting user and, through these features, pull the user's two representation vectors (one for scenario a, one for scenario b) closer together. It is therefore an unsupervised task, trained on a sample space that does not require clicks.
A9: That is actually a different problem. What we are solving now is the multi-scenario problem, where we want all scenarios to share the same objective: the click target, or the binary target we convert to, describes whether the user is interested in the content or video. The multi-scenario, multi-objective setting mentioned in the question differs somewhat from this, but as I understand it, multi-objective modeling can be built on top of multi-scenario modeling, and there is already a good deal of work on joint multi-scenario, multi-objective representation. In our current architecture, for example, each sample comes out of the representation network with an independent vector for its scenario. Taking that vector as input and building an objective-specific network for each objective on top of it would give a joint multi-objective, multi-scenario task. So you can treat our multi-scenario training method as a base framework and model multiple objectives on top of it. I think that is reasonable.
A10: The loss is similar to the traditional contrastive learning scheme. We also use the InfoNCE loss: within a training batch of size n, the two vectors generated for the same user or item in its two scenarios are treated as a positive pair, and the 2n-2 vectors generated for the other samples are its negatives. The training loss is built on this.
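A minimal, framework-free sketch of such a loss is shown below. It assumes cosine similarity and a temperature of 0.1, neither of which is stated in the talk, and the function names are invented for the example. For each of the 2n vectors in a batch, the denominator contains its one positive and the 2n-2 vectors from other samples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE over a batch of n entities, each with two scenario views.

    view_a[i] and view_b[i] are the vectors of the same user (or item)
    in two scenarios. Assumption: cosine similarity with a temperature.
    """
    n = len(view_a)
    vectors = view_a + view_b  # 2n vectors; indices i and i + n form a pair
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # Every vector except the anchor: 1 positive + (2n - 2) negatives.
        logits = [
            cosine(vectors[i], vectors[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        pos_logit = cosine(vectors[i], vectors[pos]) / temperature
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_denom = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy batch with n = 2: matching views should give a much lower loss
# than views paired with the wrong entity.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]
shuffled_b = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(view_a, aligned_b)
loss_shuffled = info_nce(view_a, shuffled_b)
```

In a real training job this would be written with batched tensor operations in the training framework; the loop form is only meant to make the positive and negative sets explicit.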
A11: We focused on the user side just now, but the item side has the same structure. It also has a globally shared network for the item, plus scenario-specific network parameters for the item in each scenario. The item side is therefore fully symmetric with the user side: each item's information is transferred through the globally shared parameters and the per-scenario networks, and the item ends up with an independent, scenario-specific output.
A12: This is a good question. Our model is currently trained offline and updated daily. We are also experimenting with online learning and streaming training to improve timeliness. There are still problems, though, mainly that multi-scenario training needs data from multiple sources at the same time. In streaming training, ingesting multiple data sources simultaneously and keeping training stable are both significant challenges, which is why the model is still updated offline at day level. One thing we may try in the future is to treat the offline multi-scenario joint model as a base model, restore it for a single scenario, fine-tune it on that scenario's streaming data, and update iteratively in this way.
A13: This was covered earlier. The core idea is to split the features that the same user has in different scenarios. Take two scenarios a and b: the user has some static features, and also leaves an independent behavior sequence in each scenario, along with per-scenario statistics such as category preferences, account preferences, and click and exposure counts. Splitting the feature structure by scenario in this way gives the same user different features in different scenarios. That is the construction on the feature side. On the sample side, as mentioned, if a user has visited multiple scenarios, the scenarios are combined in pairs to construct sample pairs. Likewise on the item side, an item may be placed in multiple scenarios, so pairwise combinations of those scenarios produce item sample pairs.
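The pairwise construction can be sketched as below. The feature layout (static features plus a per-scenario behavior sequence) follows the description above, but the dictionary keys, scenario names, and function name are hypothetical; the same function would apply to items placed in several scenarios.

```python
from itertools import combinations


def build_contrastive_pairs(entity_features):
    """Build one sample pair per pair of scenarios an entity appears in.

    entity_features: {entity_id: {scenario: feature_dict}}.
    Returns (entity_id, scenario_a, features_a, scenario_b, features_b)
    tuples. Entities seen in fewer than two scenarios produce no pair.
    """
    pairs = []
    for entity_id, by_scenario in entity_features.items():
        for sc_a, sc_b in combinations(sorted(by_scenario), 2):
            pairs.append(
                (entity_id, sc_a, by_scenario[sc_a], sc_b, by_scenario[sc_b])
            )
    return pairs


# Hypothetical users: static features are shared across scenarios, while
# behavior sequences are kept separately per scenario.
static_u1 = {"age_bucket": 3, "city_tier": 1}
user_features = {
    "u1": {
        "homepage": {"static": static_u1, "behavior_seq": ["i1", "i2"]},
        "immersive_feed": {"static": static_u1, "behavior_seq": ["i8"]},
        "detail_page": {"static": static_u1, "behavior_seq": ["i3", "i4"]},
    },
    "u2": {
        "homepage": {"static": {"age_bucket": 2}, "behavior_seq": ["i5"]},
    },
}
user_pairs = build_contrastive_pairs(user_features)
```

A user seen in three scenarios yields three pairs, and a user seen in only one scenario contributes nothing to pre-training.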
A14: Our pre-training stage is an unsupervised task. Its purpose is to obtain the embedding representations and the initialization of the corresponding network parameters. So when we evaluate the contrastive pre-training, we mainly analyze its effect by visualizing clusters of the vectors produced in the pre-training stage.
A15: This is about the commonalities and differences between scenarios. The original intent of our multi-scenario solution is to learn with one unified model and then optimize every scenario. In reality, especially in later iterations, one model architecture cannot succeed fully in all scenarios, and in some of them the gains may not be obvious. In that case the upper layers need some scenario-specific design on top of the current multi-scenario architecture. That is, the multi-scenario framework handles the underlying embedding sharing and information transfer, while the upper layers are adapted to each scenario. Each scenario has its own characteristics. For example, the strong trigger information in the second-hop scenario brings its own feature structures, which require some adaptation in the upper layers.
A16: The underlying embeddings and the globally shared network are shared across scenarios, while the scenario-specific network parameters are unique to each scenario.
A17: We currently build the indexes separately: the candidates of each scenario generate an independent index for that scenario.
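A toy version of per-scenario indexing might look like the following. It uses brute-force inner-product search as a stand-in for whatever vector index is used in production, and all names and vectors are hypothetical. Note that the same item can carry a different vector in each scenario, matching the scenario-specific outputs of the item tower.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def build_scenario_indexes(item_vectors):
    """Build one independent index per scenario.

    item_vectors: {scenario: {item_id: vector}}.
    Each index here is simply a sorted list of (item_id, vector) entries.
    """
    return {
        scenario: sorted(vectors.items())
        for scenario, vectors in item_vectors.items()
    }


def recall(indexes, scenario, user_vector, k):
    """Top-k items by inner product, searched only in that scenario's index."""
    scored = sorted(
        indexes[scenario],
        key=lambda entry: dot(user_vector, entry[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in scored[:k]]


# Hypothetical scenario-specific item vectors; i1 appears in both scenarios.
item_vectors = {
    "homepage": {"i1": [1.0, 0.0], "i2": [0.0, 1.0]},
    "immersive_feed": {"i1": [0.0, 1.0], "i3": [1.0, 0.0]},
}
indexes = build_scenario_indexes(item_vectors)
user_vector = [1.0, 0.2]  # would itself be scenario-specific in practice
homepage_top = recall(indexes, "homepage", user_vector, 1)
feed_top = recall(indexes, "immersive_feed", user_vector, 1)
```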
A18: There is no standard answer to this. It depends on the actual business scenario.
A19: The recall stage uses multi-channel recall: there are many types of recall methods, such as vector recall, along with manually configured recall channels driven by operational metrics. The current pipeline is therefore a fusion of various recall methods, whose results are then handed to the ranking side for unified scoring. As for the model discussed today, the focus is still on optimizing this one model.
A20: Our current approach is to load the pre-trained model parameters when the fine-tuning model is initialized. As for whether the embeddings and parameters should keep updating after pre-training, we ran two experiments: one freezes the pre-trained parameters after loading, and the other lets them continue to be trained during fine-tuning. What we use now is the second option, where the pre-trained parameters are trained further during the fine-tuning stage.
A21: The order of scenarios is random. Because our model transfers information from the global scenario to each single scenario, the training order across scenarios is simply random.
A22: Yes. We describe scenario differences through the fact that the same user or item has different features in different scenarios, and we treat these differences as a form of data augmentation. This covers not only the scenario itself but also the behavioral feature representations of users and items within that scenario.
A23: Pre-training exists to assist the training of the final recall task, so at inference time only the second-stage fine-tuning model is deployed. You can think of it as a two-tower model like the classic DSSM. The model structure from the fine-tuning stage is what goes online, and at inference it takes the input features and outputs the user-side and item-side representations.
A24: Put another way, the question is whether our current two-stage approach can be turned into end-to-end training, where one training task performs both the unsupervised training between scenarios and, on top of that, the joint user-item recall training. That seems feasible, but we did not do it because it does not solve the problem of using unlabeled data. The two stages use different samples: pre-training uses a larger pool of unlabeled data, while fine-tuning uses labeled data. Joint training could only run on the labeled data, which shrinks the sample space and departs from our original design intent.
A25: Transfer happens at every layer: a single layer performs the information transfer, and this structure is then stacked into multiple layers. According to the experimental results shown earlier, a setting of 3 layers gave some improvement after multiple comparisons, so in our actual business implementation the number of layers is also set to 3.
A26: Both the pre-training and fine-tuning tasks use day-level incremental training, and the fine-tuning stage imports the parameters from the pre-training stage. In effect, the two sides run incremental training in parallel, with parameters loaded between them.
A27: As just mentioned, our overall training method is incremental, so the time windows of the two stages are basically aligned.