Application and research of industry search based on pre-trained language model

## 1. Background of industry search

1. DAMO Academy's natural language intelligence overview

The figure above is the technical block diagram of DAMO Academy's natural language processing work. From bottom to top, it includes:

  • NLP data; basic lexical, syntactic, and semantic analysis technologies; and upper-level NLP technologies
  • Industry applications: beyond basic research, DAMO Academy empowers Alibaba Group and, through Alibaba Cloud, external industries. Many of these industry scenarios involve search.
2. The nature of industry search



The essence of search is the same on the industrial and consumer Internet: users have information needs, there is an information resource library, and a search engine bridges the two.

Take e-commerce as an example: a user searches for "aj1 North Carolina blue new sneakers" in an e-commerce store. To understand such a query well, a series of tasks must be performed:

  • Query understanding: NLP error correction, word segmentation, category prediction, entity recognition, term weighting, query rewriting, and other technologies
  • (Offline) document analysis: NLP analysis, quality and efficiency analysis
  • Retrieval and ranking: by analyzing both the query and the documents, combined with the search engine's own retrieval and ranking mechanisms, the two sides are bridged.
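The query-understanding steps above can be sketched as a small pipeline. Every component below is a toy stand-in (lookup tables, whitespace splits) for the production NLP models, and all names are illustrative assumptions:

```python
# Toy query-understanding pipeline; each function is a hypothetical stand-in
# for a real model (error correction, segmentation, term weighting).

def correct(query: str) -> str:
    # spelling-correction stub: a tiny lookup table for illustration
    fixes = {"sneekers": "sneakers"}
    return " ".join(fixes.get(tok, tok) for tok in query.split())

def segment(query: str) -> list:
    # word-segmentation stub: whitespace split stands in for a CWS model
    return query.split()

def term_weights(tokens: list) -> dict:
    # term-weighting stub: longer tokens get proportionally higher weight
    total = sum(len(tok) for tok in tokens) or 1
    return {tok: len(tok) / total for tok in tokens}

def analyze(query: str) -> dict:
    corrected = correct(query)
    tokens = segment(corrected)
    return {"query": corrected, "tokens": tokens, "weights": term_weights(tokens)}

result = analyze("aj1 north carolina blue new sneekers")
print(result["query"])  # aj1 north carolina blue new sneakers
```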
3. The industry search pipeline



By search paradigm, search generally divides into sparse retrieval and dense retrieval:

  • Sparse retrieval: traditionally builds an inverted index over words or characters, with a set of query-understanding capabilities built on top, including text-relevance ranking and so on.
  • Dense retrieval: with the rise of pre-trained language models, single-tower and two-tower models are built on pre-trained backbones and combined with a vector engine to form the retrieval mechanism.
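A minimal sketch contrasting the two paradigms: BM25 over whitespace tokens for sparse retrieval, and cosine similarity over bag-of-words count vectors standing in for a bi-encoder's embeddings (the real dense model would be a pre-trained two-tower network; the corpus is invented):

```python
import math
from collections import Counter

docs = ["red running sneakers", "blue basketball sneakers", "leather office shoes"]

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    # classic BM25 over whitespace tokens
    tf = Counter(doc.split())
    avgdl = sum(len(d.split()) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in query.split():
        df = sum(term in d.split() for d in corpus)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc.split()) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score

def embed(text, vocab):
    # bag-of-words counts stand in for a learned dense encoder
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = sorted({w for d in docs for w in d.split()})
query = "blue sneakers"
sparse_best = max(docs, key=lambda d: bm25_score(query, d, docs))
dense_best = max(docs, key=lambda d: cosine(embed(query, vocab), embed(d, vocab)))
print(sparse_best, "|", dense_best)
```

Here both paradigms agree on the best document; in practice they recall different candidate sets, which is why production systems combine them.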

Generally, search is divided into stages: recall and ranking (rough ranking, fine ranking, and re-ranking).


Recall stage:

  • Keyword recall from traditional sparse retrieval
  • Vector recall from dense retrieval, plus personalized recall

Rough ranking stage: filters candidates using (static) text-relevance scores.

Fine ranking stage: relatively complex; relevance models are used, possibly combined with business-efficiency models (learning to rank, LTR).


From left to right (recall toward re-ranking), model complexity and accuracy increase; from right to left, the number of documents processed grows. Taking Taobao e-commerce as an example: recall handles billions of documents, rough ranking hundreds of thousands, fine ranking hundreds to thousands, and re-ranking tens.
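The shrinking-candidate funnel can be sketched as a cascade in which each stage applies a (pretend) costlier scorer to fewer documents; the scorers and cut-off sizes here are illustrative only:

```python
# Toy cascade mirroring recall -> rough ranking -> fine ranking: each stage
# re-scores the survivors of the previous one and keeps fewer candidates.

def cascade(candidates, stages):
    for scorer, keep in stages:
        candidates = sorted(candidates, key=scorer, reverse=True)[:keep]
    return candidates

doc_ids = list(range(1000))           # stand-ins for a large recall set
stages = [
    (lambda d: -abs(d - 500), 100),   # cheap recall-style proxy score
    (lambda d: -abs(d - 510), 10),    # rough ranking
    (lambda d: -abs(d - 512), 3),     # fine ranking
]
survivors = cascade(doc_ids, stages)
print(survivors)
```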

A production search pipeline is a trade-off between retrieval effectiveness and engineering efficiency. As computing power grows, complex models migrate earlier in the pipeline: models once reserved for fine ranking are gradually moving into rough ranking or even recall.


Search effectiveness evaluation:

  • Recall stage: recall rate or no-result rate
  • Ranking stage: relevance and conversion efficiency (closer to the business)
  • Relevance metrics: NDCG, MRR
  • Conversion-efficiency metrics: click-through rate, conversion rate
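The two relevance metrics named above can be computed with their standard formulas; the example relevance labels below are invented for illustration:

```python
import math

def mrr(ranked_lists):
    # ranked_lists: per query, 0/1 relevance labels in ranked order
    total = 0.0
    for labels in ranked_lists:
        for rank, rel in enumerate(labels, 1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg(labels, k=None):
    # labels: graded relevance labels in ranked order
    k = k or len(labels)
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels[:k], 1))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, 1))
    return dcg / idcg if idcg else 0.0

m = mrr([[0, 1, 0], [1, 0, 0]])   # (1/2 + 1/1) / 2 = 0.75
n = ndcg([3, 2, 0, 1])            # near-ideal order, slightly below 1.0
print(round(m, 3), round(n, 3))
```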
4. Search on the consumer Internet vs. the industrial Internet


Search differs greatly across industry scenarios. Here we divide it into consumer Internet search and industrial Internet search:

  • User group and UV: consumer Internet search has very large UV, while industrial Internet search targets employees within government and enterprises.
  • Target metrics: the consumer Internet pursues not only accurate results but also high conversion rates; the industrial Internet is mainly about information matching, so it focuses on recall and relevance.
  • Engineering requirements: the consumer Internet has very high QPS requirements and accumulates massive user behavior, requiring real-time log analysis and real-time model training; industrial Internet requirements are lower.
  • Algorithm direction: the consumer Internet gains most from offline, near-line, and online modeling of massive user behavior. Industrial Internet user behavior is sparse, so it emphasizes content understanding such as NLP and visual understanding; research directions include low-resource and transfer learning.
## 2. Research on related technologies


Search is tightly coupled with its system framework: offline data, the search service framework (green part), and the search algorithm system (blue part). Its foundation is the AliceMind pre-trained language model system, which also underpins document analysis, query understanding, relevance, and so on.

1. AliceMind system


AliceMind is a hierarchical pre-trained language model system built by DAMO Academy. It includes general pre-trained models plus multilingual, multimodal, and dialogue models, and serves as the base for all NLP tasks.

2. Word segmentation


Search word segmentation is an atomic capability: it determines the retrieval index granularity and also affects downstream relevance and BM25 granularity. For specific tasks, customized pre-training outperforms general pre-training. For example, recent research adds unsupervised statistical information to the native BERT pre-training task, such as statistical words, n-gram granularity, or boundary entropy, incorporated via an MSE loss. On CWS/POS and NER benchmarks (figure on the right), many tasks reached SOTA.
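The boundary-entropy statistic mentioned above can be illustrated on raw text: it is the entropy of the character distribution adjacent to an n-gram, and high entropy on both sides suggests the n-gram is a plausible word unit. The tiny corpus is invented; the real system would compute this at scale before feeding it into pre-training:

```python
import math
from collections import Counter

corpus = "北京大学的北京生活北京大学好"

def boundary_entropy(gram, text, side="right"):
    # collect the characters immediately adjacent to each occurrence of `gram`
    neighbors = []
    start = text.find(gram)
    while start != -1:
        idx = start + len(gram) if side == "right" else start - 1
        if 0 <= idx < len(text):
            neighbors.append(text[idx])
        start = text.find(gram, start + 1)
    counts = Counter(neighbors)
    total = sum(counts.values())
    if not total:
        return 0.0
    # Shannon entropy of the neighbor distribution
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# "北京" is followed by 大, 生, 大 -> entropy of the distribution {大: 2/3, 生: 1/3}
print(round(boundary_entropy("北京", corpus), 3))  # 0.918
```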


Another line of work is cross-domain. Labeling data and constructing supervision tasks for every domain is very costly, so we built a cross-domain unsupervised word segmentation mechanism. The table in the lower right gives an example: e-commerce word segmentation quality improved significantly over open-source segmenters. This method was published at ACL 2020.

3. Named entity recognition



Named entity recognition in search mainly concerns structured understanding of queries and documents: identifying key phrases and their types. Construction of the search knowledge graph also relies on NER.

NER in search also presents challenges, mainly because queries are often short and lack context; e-commerce query entities, for example, are highly ambiguous and knowledge-dependent. The core optimization idea for NER in recent years has therefore been to enhance representations through context or the introduction of knowledge.


In 2020 and 2021 we worked on implicit enhancement via combo embeddings: by dynamically integrating representations from existing word extractors or GLUE-trained models, SOTA was achieved on many business tasks.

In 2021 we developed explicit retrieval enhancement: for a piece of text, additional context is retrieved through a search engine and integrated into the Transformer structure. This work was published at ACL 2021.

Building on this work, we participated in the SemEval 2022 multilingual NER evaluation, winning 10 first places as well as the best system paper award.



Retrieval enhancement: in addition to the input sentence itself, extra context is retrieved and concatenated to the input, combined with a KL-divergence loss to aid learning. This achieved SOTA on many open-source datasets.
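The retrieval-enhancement idea can be sketched as follows: fetch related sentences for a short input and concatenate them as extra context before encoding. The toy corpus and the word-overlap retriever are stand-ins for the real search engine and ranking model:

```python
# Toy "search engine": a three-sentence corpus ranked by word overlap.
corpus = [
    "aj1 is a classic basketball sneaker line",
    "north carolina blue is a light blue colorway",
    "the weather in carolina is mild",
]

def retrieve(query, k=2):
    # rank corpus sentences by word overlap with the query
    q_terms = set(query.split())
    return sorted(corpus, key=lambda s: len(q_terms & set(s.split())), reverse=True)[:k]

def augment(query):
    # concatenate retrieved context to the input before it reaches the encoder
    return query + " [SEP] " + " [SEP] ".join(retrieve(query))

print(augment("aj1 north carolina blue"))
```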

4. Adaptive multi-task training


BERT itself is very effective, but production GPU clusters are small, and running inference separately for every task is very costly. We asked whether we could run inference only once and let each task adapt itself on top of the shared encoder output, while still achieving good results.


An intuitive approach is to incorporate the NLP query-analysis tasks into a meta-task framework. Traditional meta-task training samples tasks uniformly; we instead propose MOMETAS, an adaptive meta-learning method that adapts the sampling for each task. During multi-task learning, validation data is periodically used to measure how well each task is learning, and the resulting reward guides the sampling for subsequent training. (Table below) Combined across many tasks, this mechanism brings solid improvements over uniform-distribution (UB) sampling.
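The reward-guided sampling idea can be sketched as follows. The task names, reward values, and the simple moving-average weight update are illustrative assumptions, not the published MOMETAS algorithm:

```python
import random

random.seed(0)
tasks = ["segmentation", "ner", "category"]
weights = {t: 1.0 for t in tasks}

def sample_task():
    # roulette-wheel sampling proportional to current task weights
    total = sum(weights.values())
    r = random.uniform(0, total)
    acc = 0.0
    for t, w in weights.items():
        acc += w
        if r <= acc:
            return t
    return tasks[-1]

def update(rewards, lr=0.5):
    # higher validation reward -> the task is sampled more in the next period
    for t, r in rewards.items():
        weights[t] = (1 - lr) * weights[t] + lr * r

# pretend periodic validation produced these rewards (invented numbers)
update({"segmentation": 0.2, "ner": 0.9, "category": 0.4})
counts = {t: 0 for t in tasks}
for _ in range(1000):
    counts[sample_task()] += 1
print(counts["ner"] > counts["segmentation"])
```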


Applying this mechanism in many industry search scenarios, BERT encoding is performed and stored only once and then directly reused by many downstream tasks, which greatly improves performance.

5. Pre-trained language models for search recall


Dense retrieval comes down to two-tower or single-tower models. The common training paradigm is to fine-tune a pre-trained model on supervised signals to obtain embeddings representing queries and documents. Recent optimization routes are mainly data augmentation or hard-negative mining on one hand, and optimizing the pre-trained language model itself on the other. Native BERT is not especially well suited to text representation for search, so pre-trained language models tailored to search text representation have emerged. Other optimizations lie in multi-view text representation and special loss designs.


Compared with native BERT's random masking, we incorporate search term weights so that words with higher weights are masked with higher probability, making the learned representations better suited to search recall. In addition, sentence-level contrastive learning is added. Combining these two mechanisms, we propose the ROM pre-trained language model.
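The weighted-masking idea can be sketched as follows. The term weights are invented, and real ROM-style pretraining would mask subword tokens inside BERT's MLM objective rather than whole words in a list:

```python
import random

random.seed(42)
tokens = ["buy", "aj1", "north", "carolina", "blue", "sneakers", "please"]
# invented term weights: content words score high, filler words low
term_weight = {"aj1": 0.9, "sneakers": 0.8, "north": 0.7, "carolina": 0.7,
               "blue": 0.6, "buy": 0.1, "please": 0.05}

def weighted_mask(tokens, n_mask=2):
    # sample mask positions without replacement, proportional to term weight
    weights = [term_weight.get(t, 0.1) for t in tokens]
    pool = list(range(len(tokens)))
    positions = []
    for _ in range(n_mask):
        total = sum(weights[i] for i in pool)
        r = random.uniform(0, total)
        acc = 0.0
        for i in pool:
            acc += weights[i]
            if r <= acc:
                positions.append(i)
                pool.remove(i)
                break
    return ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]

masked = weighted_mask(tokens)
print(masked)
```

High-weight words like "aj1" and "sneakers" are masked far more often than "buy" or "please", so the model spends its prediction budget on retrieval-relevant terms.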


Experiments on MS MARCO achieved the best results compared with previous approaches, and the model also brings large improvements on real-scenario search tasks. The model was also submitted to the MS MARCO leaderboard.

6. The HLATR re-ranking model


Beyond the ROM recall stage, for fine ranking and re-ranking we propose HLATR, a list-aware Transformer re-ranking scheme that organically fuses the outputs of multiple rankers through a Transformer, yielding a sizeable improvement.


Combining the ROM and HLATR solutions, the results have remained SOTA from March through now (July).

## 3. Industry search applications

1. Address analysis product


DAMO Academy's address analysis product is motivated by the many postal addresses found across industries. Chinese addresses have distinctive characteristics, such as frequent omissions in colloquial expression. At the same time, an address locates a person or thing, making it an important entity unit that bridges many entities in the physical world. On this basis, an address knowledge graph was built to provide address parsing, completion, search, and analysis.


This is the product's technical block diagram. From bottom to top, it includes construction of the address knowledge graph, the address pre-trained language model, and a search-engine-based framework connecting the whole pipeline. The base capabilities are provided as APIs and packaged into industry solutions.


One of the more important pieces of this technology is the geo-semantic pre-trained language model. An address is a string in text, but in space it is often a longitude and latitude, and it corresponds to imagery on the map. These three modalities are therefore organically integrated into a multimodal geo-semantic language model to support downstream location tasks.


As mentioned above, many basic address-related capabilities are required, such as word segmentation, error correction, and structured analysis.


The core pipeline bridges the geographic pre-trained language model, basic address tasks, and the search engine. For example, a search for "Zhejiang No. 1 Hospital" may undergo structuring, synonym correction, term weighting, vectorization, and Geohash prediction, with recall based on the analysis results. This is a standard search pipeline performing text recall, pinyin recall, and vector recall, with geographic recall added. Recall is followed by multi-stage ranking with multi-granularity feature fusion.


Intuitive applications of the address search system include address completion and suggestion while typing, or searching in Amap, where a query must be mapped to a point in space.


Next, two more industrial application solutions. The first is the new-retail Family ID. The core requirement is maintaining a customer management system, but user information is siloed across systems and cannot be effectively integrated.


For example, when a brand manufacturer sells an air conditioner, family members register different addresses and mobile numbers across purchase, installation, and maintenance, yet these actually refer to the same address. Our address normalization technology normalizes differently written addresses, generates fingerprints, and aggregates the different user IDs into a Family concept.
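The normalize-then-fingerprint idea can be sketched as follows; the alias rules and hashing choice are hypothetical stand-ins for the full structured address parser:

```python
import hashlib
import re

# Hypothetical alias rules; the real system uses full structured address parsing.
ALIASES = {"号楼": "栋", "室": "房"}

def normalize(addr: str) -> str:
    addr = re.sub(r"\s+", "", addr)          # drop whitespace variants
    for variant, canonical in ALIASES.items():
        addr = addr.replace(variant, canonical)
    return addr

def fingerprint(addr: str) -> str:
    # stable fingerprint of the normalized form, used as the aggregation key
    return hashlib.md5(normalize(addr).encode("utf-8")).hexdigest()[:12]

a1 = "浙江省杭州市文一西路969号 1号楼302室"
a2 = "浙江省杭州市文一西路969号1栋302房"
print(fingerprint(a1) == fingerprint(a2))  # differently written, same fingerprint
```

User IDs sharing a fingerprint can then be grouped into one Family record.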


Aggregating through the Family concept enables better penetration analysis, advertising reach, and other new-retail marketing activities.


Another application scenario is intelligent call-taking for 119, 120, and other emergency hotlines. Since people's lives and property are at stake, every second counts, and we aim to improve efficiency by combining speech recognition with text semantic understanding.


(Example on the left) The scenario has distinctive characteristics: ASR transcripts contain typos, disfluencies, and colloquialisms. The goal is to infer the caller's location automatically from the speech transcript.


We proposed a complete system solution covering dialogue understanding (spoken-language disfluency removal and error correction, intent recognition) and a search-and-recall mechanism that ultimately produces address recommendations. The pipeline is relatively mature and has been deployed in fire protection systems in hundreds of Chinese cities. Call takers identify specific locations from alarm conversations, combining recommendation, matching, and address geofences to pinpoint the location and dispatch accordingly.

2. Photo-based question search in education


Next, the photo question-search business in education, which sees strong demand both from consumers and from teachers.


Photo question search has several characteristics: an incrementally updated question bank, a large user base, and highly knowledge-dependent content across subjects and age groups. It is also a multimodal pipeline, running from OCR through semantic understanding to search.


In recent years, a complete pipeline from algorithms to systems has been built for photo question search.


For example, after a photo is taken with a phone and OCR is run, tasks such as spelling correction, subject prediction, word segmentation, and term weighting are performed to aid retrieval.


Since OCR does not recognize spaces in English text, a K12 English pre-trained model was trained to segment English words.


Meanwhile, the subject and question type are unknown and must be predicted in advance, using multimodal intent understanding that combines image and text.


Photo question search differs from ordinary user search: user queries are usually short, while a photographed question is usually complete. Many words in a question are unimportant, so term-weight analysis is needed to discard unimportant words or demote them in ranking.


The most visible optimization in the photo-search scenario is vector recall. Performance requirements make the OR recall mechanism impractical, forcing AND logic, which recalls relatively few documents; to raise recall, redundant modules such as term weighting and error correction are needed. (Right figure) Multi-channel recall of text plus vectors beats pure OR logic in effectiveness, with latency reduced tenfold.
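The OR-versus-AND trade-off can be illustrated on a toy inverted index: AND recall returns far fewer documents (cheaper, but it misses matches), which is why the redundancy modules above and additional recall channels are needed:

```python
# Toy inverted index over three question documents
docs = {
    1: "solve the quadratic equation x2 plus 2x",
    2: "quadratic equation word problem",
    3: "linear equation practice",
}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def recall(query, mode="OR"):
    # OR: union of posting lists; AND: intersection (every term must match)
    postings = [index.get(term, set()) for term in query.split()]
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result = result | p if mode == "OR" else result & p
    return result

q = "solve quadratic equation"
or_hits = recall(q, "OR")
and_hits = recall(q, "AND")
print(sorted(or_hits), sorted(and_hits))  # [1, 2, 3] [1]
```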


The photo-search pipeline also includes image vector recall, formula recall, and personalized recall.


Two examples. The first is OCR of plain text: (left column) the old result used Elasticsearch with simple OR recall plus BM25; (right column) the new pipeline with multi-channel recall and relevance ranking is greatly improved.

The second involves photos containing figures, which must combine image recall in the multi-channel setup.

3. Unified search for a power knowledge base



Enterprise search involves large amounts of semi-structured and unstructured data, and unified search helps enterprises integrate their data resources. Beyond electric power, other industries have similar needs. Search here is no longer narrow retrieval: it also covers AI document preprocessing, knowledge graph construction, and the ability to bridge into question answering. Above is a schematic of building a unified system for institutional and standards documents in the power knowledge base, from structuring to retrieval to application.

Statement: this article is reproduced from 51CTO.COM.