# Application and Research of Industry Search Based on Pre-trained Language Models
Take the e-commerce scenario as an example: a user searches for "aj1 North Carolina blue new sneakers" in an e-commerce store. To understand such a query properly, a series of analysis tasks needs to be performed, such as word segmentation, named entity recognition, category prediction and term weighting.
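As a rough illustration of what such analysis might produce for this query (field names and values are hypothetical, not the production schema):

```python
# Hypothetical structured analysis of the query "aj1 north carolina blue new sneakers".
query = "aj1 north carolina blue new sneakers"

analysis = {
    "segmentation": ["aj1", "north carolina blue", "new", "sneakers"],
    "entities": {                       # named entity recognition on the query
        "aj1": "brand/series",
        "north carolina blue": "colorway",
        "sneakers": "category",
    },
    "category_prediction": "shoes > basketball shoes",
    "term_weights": {"aj1": 0.9, "north carolina blue": 0.7,
                     "sneakers": 0.5, "new": 0.2},   # used later for recall and ranking
}
print(analysis["entities"])
```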
By search paradigm, retrieval is generally divided into sparse retrieval and dense retrieval:

- Sparse retrieval: traditionally, an inverted index is built over words or characters, and a series of query-understanding capabilities is built on top of it, including text-relevance ranking (for example, BM25);
- Dense retrieval: queries and documents are encoded into vectors by a neural model, and recall is performed by vector similarity, typically with approximate nearest-neighbor search.
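A minimal sketch of the sparse-retrieval side, assuming a toy whitespace tokenizer and standard BM25 constants (not the production index):

```python
import math
from collections import defaultdict

def build_bm25(docs, k1=1.2, b=0.75):
    """Build a toy inverted index over whitespace tokens and return a BM25 scorer."""
    index = defaultdict(dict)          # term -> {doc_id: term frequency}
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_len[doc_id] = len(tokens)
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    avg_len = sum(doc_len.values()) / len(doc_len)
    n_docs = len(docs)

    def score(query):
        scores = defaultdict(float)
        for tok in query.lower().split():
            postings = index.get(tok, {})
            if not postings:
                continue
            idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
            for doc_id, tf in postings.items():
                norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len))
                scores[doc_id] += idf * norm
        return sorted(scores.items(), key=lambda kv: -kv[1])

    return score

docs = {"d1": "aj1 north carolina blue sneakers", "d2": "blue running shoes"}
search = build_bm25(docs)
print(search("aj1 blue sneakers"))   # d1 should rank above d2
```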
Generally, a search pipeline is divided into stages: recall, ranking (rough/pre-ranking and fine ranking), and re-ranking.
## Recall phase
From left to right along the pipeline, model complexity and accuracy increase; from right to left, the number of documents processed increases. Taking Taobao e-commerce as an example: recall handles billions of documents, pre-ranking hundreds of thousands, fine ranking hundreds to thousands, and re-ranking tens.
A production search pipeline is a system that trades off retrieval effectiveness against engineering efficiency. As computing power grows, complex models migrate forward in the pipeline: models that used to run only at the fine-ranking stage are now gradually applied at the rough-ranking or even the recall stage.
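As a minimal sketch of this funnel (the stage scorers below are placeholders; the cutoffs loosely follow the Taobao numbers above):

```python
def cascade_search(query, corpus, recall, prerank, rank, rerank,
                   k_recall=100_000, k_prerank=1_000, k_rank=50):
    """Toy cascade: each stage applies a more expensive scorer to fewer documents."""
    cands = recall(query, corpus)[:k_recall]    # cheap term/vector match over everything
    cands = prerank(query, cands)[:k_prerank]   # lightweight model (rough ranking)
    cands = rank(query, cands)[:k_rank]         # heavy model (fine ranking)
    return rerank(query, cands)                 # list-aware re-ranking of the top few

# Placeholder scorers that simply pass candidates through, just to show the data flow.
passthrough = lambda q, docs: list(docs)
print(cascade_search("aj1 sneakers", [f"doc{i}" for i in range(10)],
                     passthrough, passthrough, passthrough, passthrough))
```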
## Search effectiveness evaluation:
Search differs greatly across industry scenarios; broadly, it can be divided into consumer Internet search and industrial Internet search.
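Whatever the scenario, offline retrieval effectiveness is usually reported with metrics such as Recall@K and MRR (the metric behind the MS MARCO results mentioned later); a minimal sketch:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top-k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=10))  # 0.5
print(mrr_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=10))     # 0.5
```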
## AliceMind system

AliceMind is a hierarchical pre-trained language model system built by DAMO Academy. It contains general pre-trained models as well as multilingual, multimodal, dialogue and other variants, and serves as the base for all NLP tasks.

## Search word segmentation

Word segmentation is an atomic capability in search: it determines the granularity of the retrieval index and also affects subsequent relevance computation and BM25 granularity. For a specific task, customized pre-training usually works better than general pre-training. For example, recent research adds unsupervised statistical information to the native BERT pre-training task, such as statistical word and n-gram granularity or boundary entropy, incorporated through an additional MSE loss during pre-training; with this, many CWS/POS and NER tasks reach SOTA. Another line of work is cross-domain: labeling data and constructing supervision for every new domain is very expensive, so a cross-domain unsupervised word segmentation mechanism is needed. As an example, e-commerce word segmentation quality improves significantly over open-source segmenters. This method was published at ACL 2020.

## Search named entity recognition

Search NER mainly provides structured understanding of queries and documents, identifying key phrases and their types; the construction of the search knowledge graph also relies on NER. Search NER faces particular challenges: queries are often short and lack context, and query entities in e-commerce, for instance, are highly ambiguous and knowledge-dependent. The core optimization idea for NER in recent years is therefore to enhance its representations through context or the introduction of knowledge. In 2020 and 2021 we did implicit enhancement work (combo embedding), which dynamically integrates existing word extractor or GLUE representations and achieves SOTA on many business tasks. In 2021 we developed explicit retrieval enhancement: for a piece of text, enhanced context is retrieved through a search engine and integrated into the Transformer structure. This work was published at ACL 2021. Building on it, we participated in the SemEval 2022 multilingual NER evaluation, won 10 championships, and received the best system paper award.

Retrieval enhancement in short: besides the input sentence itself, additional context is retrieved and concatenated to the input, combined with a KL loss to aid learning; this obtains SOTA on many open-source datasets.

## Adaptive multi-task training

BERT itself is very effective, but production GPU clusters are small, and running inference separately for each task is very costly. We therefore asked whether the encoder could be run only once, with each task adapting on top of the shared encoding, while still getting good results. An intuitive way is to integrate the NLP query-analysis tasks through a meta-task framework. Traditional meta-task training samples tasks from a uniform distribution, so we proposed MOMETAS, an adaptive meta-learning method that adapts the sampling to different tasks. While learning multiple tasks, we periodically evaluate on validation data to see how well each task is being learned, and this reward in turn guides the sampling in subsequent training. Combined across many tasks, this mechanism yields clear improvements over uniform sampling (UB).
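The adaptive sampling idea can be sketched roughly as follows (a generic reward-weighted softmax over tasks; this is an illustrative assumption, not the published MOMETAS algorithm):

```python
import math
import random

class AdaptiveTaskSampler:
    """Sample training tasks with probabilities driven by periodic validation rewards."""

    def __init__(self, tasks, temperature=1.0):
        self.tasks = list(tasks)
        self.temperature = temperature
        self.rewards = {t: 0.0 for t in self.tasks}   # equal rewards -> uniform sampling

    def sample(self):
        weights = [math.exp(self.rewards[t] / self.temperature) for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=1)[0]

    def update(self, task, validation_score):
        # The periodically measured validation score acts as the reward signal.
        self.rewards[task] = validation_score

sampler = AdaptiveTaskSampler(["ner", "segmentation", "term_weight", "relevance"])
for step in range(1000):
    task = sampler.sample()
    # ... run one meta-training step on `task` here ...
    if step % 100 == 0:
        sampler.update(task, validation_score=random.random())  # placeholder reward
```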
Applying this mechanism to search scenarios in many industries, the benefit is that the BERT encoding is computed and stored only once and can then be reused directly by many downstream tasks, which greatly improves performance.

## Search recall pre-trained language model

Dense retrieval boils down to two-tower or single-tower architectures. The common training paradigm is to fine-tune a pre-trained model with supervised signals to obtain embeddings that represent the query and the document. Recent optimization routes focus on data augmentation and hard-sample mining on one hand, and on optimizing the pre-trained language model itself on the other. Native BERT is not particularly well suited to text representation for search, so pre-trained language models specialized for search text representation have emerged; other optimizations lie in multi-view text representation and special loss designs. Compared with the random sampling used in native BERT pre-training, we incorporate search term weights so that terms with higher weight are sampled with higher probability, making the learned representations better suited to search recall. In addition, sentence-level contrastive learning is added. Combining these two mechanisms, the ROM pre-trained language model is proposed. Experiments on MS MARCO achieve the best results compared with previous approaches, and in real-scenario search tasks ROM also brings large improvements. The model was also submitted to the MS MARCO leaderboard.

## HLATR re-ranking model

Beyond the ROM recall stage, for the fine-ranking and re-ranking stages we proposed HLATR, a list-aware Transformer re-ranking scheme that organically integrates the results of multiple rankers through a Transformer, bringing a relatively large improvement. Combining the ROM and HLATR solutions, the results have remained SOTA from March up to now (July).

## Address analysis product

The address analysis product developed by DAMO Academy is motivated by the fact that many industries hold large numbers of correspondence addresses. Chinese correspondence addresses have many peculiarities, such as frequent omissions in colloquial expressions. At the same time, an address often stands for a person or an organization, making it an important entity unit that bridges many entities in the objective world. Based on this, an address knowledge graph was established to provide address parsing, completion, search and analysis. In the product's technical block diagram, from bottom to top, there are the construction of the address knowledge graph and an address pre-trained language model, plus a search-engine-based framework connecting the whole pipeline; the resulting basic capabilities are exposed as APIs and packaged into industry solutions. One of the more important points in this technology is the pre-trained language model of geographic semantics. An address is represented as a string in text, but in space it is usually represented as longitude and latitude, and there is also a corresponding image on the map; the information from these three modalities is organically integrated into a multimodal geo-semantic language model to support downstream location tasks. As mentioned above, many address-related basic capabilities are required, such as word segmentation, error correction and structuring. An intuitive application of the address search system is address suggestion while filling in a form, or searching in the Amap map, where the query needs to be mapped to a specific point in space.
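The Family ID application described next relies on normalizing differently written addresses into a shared fingerprint; here is a toy sketch of that idea (the alias table and hashing choice are invented for illustration):

```python
import hashlib
import re

# Tiny alias table; a real system would rely on the address knowledge graph instead.
ALIASES = {"rd": "road", "st": "street", "apt": "apartment", "no": "number"}

def normalize_address(raw: str) -> str:
    """Rough normalization: lowercase, strip punctuation, expand a few aliases."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = [ALIASES.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

def address_fingerprint(raw: str) -> str:
    """Stable fingerprint so differently written addresses can be aggregated."""
    return hashlib.md5(normalize_address(raw).encode("utf-8")).hexdigest()

a = "No.1 Xihu Rd., Apt 302"
b = "no 1 xihu road apartment 302"
print(address_fingerprint(a) == address_fingerprint(b))   # True
```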
## New retail Family ID

Next, we introduce two fairly industrial application solutions. The first is the new-retail Family ID. The core requirement is maintaining a customer management system, but user information across the different systems is not connected, so effective integration cannot be achieved. For example, when a brand manufacturer sells an air conditioner, family members register various addresses and mobile phone numbers during purchase, installation and maintenance, yet these actually correspond to the same address. The address search and normalization technology described above normalizes differently written addresses, generates a fingerprint, and aggregates the different user IDs into a Family concept. Aggregating at the family level enables better penetration analysis, advertising reach and other marketing activities in new retail.

## Intelligent alarm receiving

Another application scenario is intelligent call receiving for 119, 129 and other emergency lines. Because people's lives and property are at stake, every second counts, and we hope to improve efficiency by combining speech recognition with text semantic understanding. The scenario has many characteristics, such as typos, disfluencies and colloquialisms in the ASR transcripts. The goal is to infer the location of an alarm from automated speech transcription and analysis. We proposed a complete system solution, including spoken-language disfluency removal and error correction for dialogue understanding, intent recognition, and a search-and-recall mechanism that ultimately produces an address recommendation. The pipeline is relatively mature and has been deployed in fire-protection systems in hundreds of cities in China: firefighters identify the specific location from the alarm conversation and, combining recommendation, matching and address fencing, determine the exact location and dispatch the alarm accordingly.

## Photo search in education

Next we turn to the education industry. The photo-based question search business has strong demand both for consumers (To C) and for teachers. Photo question search has several characteristics: an incrementally updated question bank and a large user base, and the domains corresponding to different subjects and age groups are highly knowledge-dependent. It is also a multimodal problem, with a pipeline running from OCR to subsequent semantic understanding and search. In recent years, a complete set of algorithms and systems has been built for photo question search.
After a photo is taken with a mobile phone and OCR is applied, a series of tasks such as spelling correction, subject prediction, word segmentation and term weighting is performed to help retrieval. Since OCR does not recognize spaces in English, a K12 English pre-trained model was trained to perform English segmentation. The subject and question type are also unknown in advance and need to be predicted, using multimodal understanding that combines the image and the text. Photo question search differs from ordinary user search: user queries are usually short, whereas a photographed question is usually complete. Many words in a question are unimportant, so term-weight analysis is needed to discard unimportant words or down-weight them in ranking.

The most noticeable optimization in the photo search scenario is vector recall. Performance requirements make it hard to use an OR recall mechanism, so AND logic has to be used, which yields relatively few recalled candidates; improving recall then requires more redundant modules such as term weighting and error correction. With multi-channel recall combining text and vectors, the effect exceeds pure OR logic while latency is reduced by a factor of 10. The photo-search recall channels include image vector recall, formula recall and personalized recall. Two examples: for a plain-text OCR result, the old pipeline was a simple ES OR recall scored with BM25, while the new pipeline with multi-channel recall and relevance ranking is greatly improved; for a photo containing figures, image recall must be combined into the multi-channel setup.

## Unified search of the power knowledge base

Enterprises hold a large amount of semi-structured and unstructured data, and unified search helps them integrate these data resources. The need is not limited to the electric power industry; other industries have similar requirements. Search here is no longer narrow retrieval: it also includes AI-based document preprocessing, knowledge graph construction, and the ability to bridge to question answering afterwards. The above describes building a knowledge base of institutional and standard documents for the electric power industry, from structuring to retrieval to application.
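Both photo search and the unified enterprise search above combine several recall channels before relevance ranking; a toy sketch of such a merge (the channel functions are placeholders, not real recall services):

```python
def multi_channel_recall(query, channels, per_channel_k=100):
    """Union the candidates from several recall channels, keeping per-channel scores."""
    merged = {}
    for name, recall_fn in channels.items():
        for doc_id, score in recall_fn(query)[:per_channel_k]:
            merged.setdefault(doc_id, {})[name] = score
    # Docs recalled by more channels first; a relevance model would rank them afterwards.
    return sorted(merged.items(), key=lambda kv: -len(kv[1]))

channels = {
    "text":   lambda q: [("q42", 12.3), ("q77", 9.8)],    # e.g. inverted index + BM25
    "vector": lambda q: [("q42", 0.91), ("q13", 0.88)],   # e.g. embedding ANN recall
}
print(multi_channel_recall("solve for x: 2x + 3 = 7", channels))
```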