
Let’s talk about knowledge extraction. Have you learned it?


1. Introduction

Knowledge extraction usually refers to mining structured information, such as tags and phrases that carry rich semantic information, from unstructured text. It is widely used in the industry in scenarios such as content understanding and product understanding: valuable tags are extracted from user-generated text and then attached to the corresponding content or products.

Knowledge extraction is usually accompanied by classification of the extracted tags or phrases, and is typically modeled as a named entity recognition (NER) task. A common NER task identifies named entity mentions and classifies them into place names, person names, organization names, and so on; domain-specific tag word extraction identifies tag words and assigns them to domain-defined categories, such as series (Air Force One, Sonic 9), brand (Nike, Li Ning), category (shoes, clothing, digital), and style (INS style, retro style, Nordic style).

For convenience, tags and phrases rich in information are collectively referred to as tag words in the remainder of this article.

2. Knowledge extraction classification

This article introduces classic knowledge extraction methods from two perspectives: tag word mining and tag word classification. Tag word mining methods are divided into unsupervised, supervised, and distantly supervised methods, as shown in Figure 1. Tag word mining selects high-scoring tag words in two steps: candidate word mining and phrase scoring. Tag word classification usually models tag word extraction and classification jointly and casts the problem as a sequence labeling task for named entity recognition.

Figure 1 Classification of knowledge extraction methods

3. Tag word mining

Unsupervised method

Statistics-based method

First, segment the document, or combine the segmented words into N-grams, to obtain candidate words; then score the candidates based on statistical features.
  • TF-IDF (Term Frequency-Inverse Document Frequency): compute each word's TF-IDF score; the higher the score, the more information the word carries.

The score is computed as tfidf(t, d, D) = tf(t, d) * idf(t, D), where tf(t, d) = log(1 + freq(t, d)) and freq(t, d) is the number of times candidate word t appears in the current document d; idf(t, D) = log(N / count(d ∈ D : t ∈ d)) counts how many of the N documents contain the candidate word t and reflects its rarity. If a word appears in only one document, it is rare and carries more information.

In a specific business scenario, external tools can be used for a first round of candidate screening, for example using part-of-speech tags to keep only nouns.
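As a concrete illustration, here is a minimal pure-Python sketch of the TF-IDF scoring described above, assuming the documents have already been segmented into candidate words; the toy corpus is only a placeholder.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term in each document with TF-IDF.

    docs: list of documents, each a list of candidate words (already segmented).
    Returns a list of {term: score} dicts, one per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        freq = Counter(doc)
        doc_scores = {}
        for term, count in freq.items():
            tf = math.log(1 + count)            # tf(t, d) = log(1 + freq(t, d))
            idf = math.log(n_docs / df[term])   # idf(t, D) = log(N / count(d in D : t in d))
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

# Toy usage: three already-segmented documents.
docs = [["nike", "air", "force", "one", "shoes"],
        ["li", "ning", "shoes", "retro", "style"],
        ["nordic", "style", "sofa"]]
print(tfidf_scores(docs)[0])
```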

  • YAKE[1]: defines five features to capture keyword characteristics and combines them heuristically into a score for each keyword; the lower the score, the more important the keyword. 1) Casing: a term written in capital letters (other than the first word of a sentence) is more important than a lowercase term; for Chinese, the number of bold occurrences plays the corresponding role. 2) Word position: words near the beginning of a paragraph are more important than those that follow. 3) Word frequency: how often the word occurs. 4) Word context: the number of distinct words that co-occur with the term within a fixed window size; the more distinct co-occurring words, the lower the term's importance. 5) Sentence spread: the number of different sentences in which the word appears; the more sentences, the more important the word.
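If the open-source implementation linked in [1] is used (the pip-installable yake package), candidate extraction takes only a few lines; the parameter names below follow that project's README and may differ across versions.

```python
# pip install yake  (reference implementation from [1]); API assumed from its README.
import yake

text = ("Knowledge extraction mines structured information, "
        "such as tags and phrases, from unstructured text.")

# n: max n-gram length; top: number of keywords to return.
extractor = yake.KeywordExtractor(lan="en", n=3, top=5)
for kw in extractor.extract_keywords(text):
    print(kw)   # (phrase, score) pairs in recent versions; lower score = more important
```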

Graph-Based Model

  • TextRank[2]: first perform word segmentation and part-of-speech tagging on the text, filter out stop words, and keep only words with the specified parts of speech to build the graph. Each node is a word, and edges represent relations between words, constructed from word co-occurrence within a sliding window of a fixed size. PageRank is used to update the node weights until convergence; the node weights are then sorted in descending order and the top k words are taken as candidate keywords. Finally, the candidates are marked in the original text, and adjacent candidates are merged into multi-word keyphrases.
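Below is a compressed sketch of the TextRank pipeline just described, assuming the input has already been segmented and POS-filtered. It relies on networkx's PageRank instead of a hand-written update loop and omits the final step of merging adjacent candidates into multi-word phrases.

```python
import networkx as nx

def textrank_keywords(words, window=3, top_k=5):
    """words: POS-filtered tokens of one document, in original order."""
    graph = nx.Graph()
    graph.add_nodes_from(set(words))
    # Co-occurrence edges within a sliding window.
    for i, w in enumerate(words):
        for other in words[i + 1: i + window]:
            if other != w:
                graph.add_edge(w, other)
    ranks = nx.pagerank(graph)  # iterate PageRank until convergence
    return sorted(ranks, key=ranks.get, reverse=True)[:top_k]

words = ["knowledge", "extraction", "mines", "structured", "information",
         "unstructured", "text", "tags", "phrases", "information"]
print(textrank_keywords(words))
```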

Representation-based method (Embedding-Based Model)

Representation-based methods rank candidate words by the vector similarity between each candidate word and the document.
  • EmbedRank[3]: selects candidate words via word segmentation and part-of-speech tagging, uses pre-trained Doc2Vec and Sent2Vec as the vector representations of candidate words and documents, and ranks candidates by cosine similarity. Similarly, KeyBERT[4] replaces EmbedRank's vector representations with BERT.
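The core ranking step shared by EmbedRank and KeyBERT is a cosine-similarity sort; the sketch below shows it with toy vectors standing in for real Doc2Vec/Sent2Vec or BERT embeddings.

```python
import numpy as np

def rank_by_similarity(doc_vec, candidate_vecs):
    """Rank candidates by cosine similarity between their vectors and the document vector.

    doc_vec: 1-D array; candidate_vecs: dict of {candidate: 1-D array}.
    In EmbedRank these vectors come from Doc2Vec/Sent2Vec, in KeyBERT from BERT.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = {cand: cosine(doc_vec, vec) for cand, vec in candidate_vecs.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors standing in for real sentence embeddings.
doc_vec = np.array([0.9, 0.1, 0.3])
candidates = {"knowledge extraction": np.array([0.8, 0.2, 0.4]),
              "php website": np.array([0.1, 0.9, 0.0])}
print(rank_by_similarity(doc_vec, candidates))
```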

Supervised method

Supervised methods train a model to determine whether a candidate word is a tag word.
  • Screen candidates first, then classify tag words: the classic model KEA[5] uses Naive Bayes as the classifier to score N-gram candidate words on four designed features (a minimal sketch in this spirit follows this list).
  • Joint training of candidate screening and tag word recognition: BLING-KPE[6] takes the raw sentence as input, encodes its N-gram phrases with CNN and Transformer layers, and predicts the probability that each phrase is a tag word, using manually annotated tag words as labels. BERT-KPE[7] follows the idea of BLING-KPE and replaces ELMo with BERT to obtain better sentence representations.
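The sketch below follows the KEA idea: represent each candidate with a few simple statistical features (here TF-IDF and relative first-occurrence position, chosen for illustration rather than taken from the paper) and train a Naive Bayes classifier on whether the candidate was a manually annotated keyphrase. The data is a toy placeholder.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row: [tf-idf score, relative position of first occurrence in the document]
X_train = np.array([[0.82, 0.05],   # appears early, high tf-idf  -> keyphrase
                    [0.10, 0.70],   # late, low tf-idf            -> not a keyphrase
                    [0.65, 0.15],
                    [0.05, 0.90]])
y_train = np.array([1, 0, 1, 0])    # 1 = manually annotated keyphrase

clf = GaussianNB().fit(X_train, y_train)

# Score unseen candidates: probability of being a keyphrase.
X_new = np.array([[0.75, 0.10], [0.08, 0.80]])
print(clf.predict_proba(X_new)[:, 1])
```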

Figure 2 BLING-KPE model structure


Distant supervision method

AutoPhrase

A typical representative of distant supervision methods is AutoPhrase[10], which is widely used for tag word mining in industry. AutoPhrase performs distantly supervised training with an existing high-quality knowledge base, avoiding manual annotation.

The AutoPhrase paper defines a high-quality phrase as a word sequence with complete semantics that satisfies the following four conditions simultaneously:

  • Popularity: it appears frequently enough in the documents;
  • Concordance: its tokens co-occur far more often than the collocations obtained by replacing any token, i.e. the collocation frequency is high;
  • Informativeness: it is informative and specific; for example, "this is" is a negative example that carries no information;
  • Completeness: the phrase, and not merely its sub-phrases, must be a complete semantic unit.

The AutoPhrase tag mining process is shown in Figure 3. First, part-of-speech tagging is used to screen high-frequency N-grams as candidate words; then the candidates are classified with distant supervision; finally, the four conditions above are used to filter for high-quality phrases (phrase quality re-estimation).

Figure 3 AutoPhrase tag mining process

High-quality phrases obtained from an external knowledge base form the positive pool, and all other phrases form the negative pool. According to the paper's statistics, about 10% of the phrases in the negative pool are actually high quality and end up there only because the knowledge base does not contain them, so the paper uses the random forest ensemble classifier shown in Figure 4 to reduce the impact of this noise on classification. In industry applications, the classifier can also be trained as a binary sentence-pair classification task based on the pre-trained BERT model[13].

Figure 4 AutoPhrase tag word classification method
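A schematic sketch of this distant-supervision classification step follows: phrases found in the knowledge base form the positive pool, the remaining candidates form the noisy negative pool, and a random forest is trained on statistical phrase features. The feature values and candidate set below are invented placeholders, not AutoPhrase's actual feature definitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Candidate phrases with toy statistical features:
# [frequency, concordance (PMI-style), informativeness score]
candidates = {"air force one": [120, 8.2, 0.9],
              "li ning":       [300, 7.5, 0.8],
              "this is":       [900, 0.4, 0.1],
              "retro style":   [150, 6.1, 0.7],
              "of the":        [2000, 0.2, 0.05]}

knowledge_base = {"air force one", "li ning"}   # external high-quality phrases

phrases = list(candidates)
X = np.array([candidates[p] for p in phrases])
# Positive pool: phrases found in the knowledge base; everything else falls into the noisy negative pool.
y = np.array([1 if p in knowledge_base else 0 for p in phrases])

# An ensemble of trees trained on bootstrap samples reduces the impact of noisy negatives.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for p, score in zip(phrases, clf.predict_proba(X)[:, 1]):
    print(f"{score:.2f}\t{p}")
```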

4. Tag word classification

Supervised method

NER sequence labeling model

Named entity recognition (NER) is also a tag extraction approach that jointly trains candidate screening and tag word recognition. It usually targets scenarios with relatively repetitive sentence patterns, identifies the entity mentions in a sentence, and is implemented with a sequence labeling model. Taking a sentence as input, the model predicts for each token the probability of labels such as B(Begin)-LOC (place name), I(Inside)-LOC, E(End)-LOC, and O(Others), where the category after "-" is the entity type. Chinese NER is usually modeled at the character level rather than the word level to avoid error propagation from Chinese word segmentation, so lexicon information needs to be introduced to strengthen entity word boundaries.
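For clarity, the sketch below shows how annotated entity spans can be turned into the character-level B/I/E/O labels described above; the span format and example are assumptions for illustration only.

```python
def spans_to_labels(text, spans):
    """Convert entity spans into character-level B/I/E-<type> and O labels.

    spans: list of (start, end, entity_type) with end exclusive.
    Single-character entities keep only the B- label in this simplified scheme.
    """
    labels = ["O"] * len(text)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        if end - start == 1:
            continue
        for i in range(start + 1, end - 1):
            labels[i] = f"I-{etype}"
        labels[end - 1] = f"E-{etype}"
    return labels

text = "北京人和药店"
print(list(zip(text, spans_to_labels(text, [(0, 2, "LOC"), (2, 6, "ORG")]))))
```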

Lattice LSTM[8] was the first work to introduce lexicon information for Chinese NER. A lattice is a directed acyclic graph in which the first and last characters of a matched word determine its position, so matching a sentence against a lexicon (dictionary) yields a lattice-like structure, as shown in Figure 5(a). Lattice LSTM fuses the lexicon information into a native LSTM, as shown in Figure 5(b): for the current character, all lexicon words ending at that character are fused in; for example, the character "店" fuses the information of "人和药店" and "药店". For each character, Lattice LSTM uses an attention mechanism to fuse a variable number of word cells. Although Lattice LSTM effectively improves NER performance, its RNN structure cannot capture long-distance dependencies, introducing lexicon information in this way is lossy, and the dynamic lattice structure cannot fully exploit GPU parallelism. The FLAT[9] model effectively addresses these problems. As shown in Figure 5(c), FLAT captures long-distance dependencies with a Transformer structure and designs a position encoding to incorporate the lattice: the words matched against the characters are appended after the sentence, each character and word is given a head position encoding and a tail position encoding, and the lattice is thereby flattened from a directed acyclic graph into the flat Flat-Lattice Transformer structure.

Figure 5 NER models that introduce lexicon information
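The flattening step of FLAT can be illustrated in a few lines: characters keep their own positions, and every lexicon word matched in the sentence is appended as an extra token whose head and tail indices point back into the character sequence. This is only a sketch of the input construction, not the Transformer model itself.

```python
def build_flat_lattice(sentence, lexicon):
    """Return (token, head, tail) triples: characters first, then matched lexicon words."""
    tokens = [(ch, i, i) for i, ch in enumerate(sentence)]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            word = sentence[i:j]
            if len(word) > 1 and word in lexicon:
                tokens.append((word, i, j - 1))   # head/tail position encodings
    return tokens

lexicon = {"北京", "人和药店", "药店"}
for token, head, tail in build_flat_lattice("北京人和药店", lexicon):
    print(token, head, tail)
```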

Distant supervision method

AutoNER

AutoNER[11] uses an external dictionary to construct training data for distantly supervised entity recognition. It first performs entity boundary recognition (entity span recognition) and then entity classification. The external dictionary can be taken directly from an external knowledge base, or built by first mining tag words offline with AutoPhrase and then using the AutoNER model to update the tag words incrementally. To reduce the noise introduced by distant supervision, AutoNER replaces the BIOE labeling scheme with a Tie-or-Break scheme for entity boundary recognition, where Tie means the current word and the previous word belong to the same entity, and Break means the current word and the previous word are no longer in the same entity.
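A minimal sketch of producing Tie/Break boundary labels from dictionary matches is given below; the tokenization and dictionary are toy placeholders, and the real AutoNER handles overlapping matches and unknown-type spans more carefully.

```python
def tie_or_break(tokens, dictionary):
    """Label the gap before each token (from the 2nd token on) as 'Tie' or 'Break'.

    'Tie': the token stays inside the same dictionary-matched entity as the previous token.
    'Break': an entity boundary lies between the previous token and this one.
    """
    ties = set()
    for entry in dictionary:
        n = len(entry)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == entry:
                # Gaps strictly inside a dictionary match are ties.
                ties.update(range(i + 1, i + n))
    return ["Tie" if i in ties else "Break" for i in range(1, len(tokens))]

tokens = ["new", "york", "city", "is", "big"]
dictionary = [["new", "york", "city"]]
print(tie_or_break(tokens, dictionary))   # ['Tie', 'Tie', 'Break', 'Break']
```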

In the entity classification stage, a Fuzzy CRF is used to handle the case where one entity has multiple possible types.

Figure 6 AutoNER model structure

BOND

BOND[12] is a two-stage entity recognition model based on distantly supervised learning. In the first stage, distant labels are used to adapt a pre-trained language model to the NER task; in the second stage, both the student model and the teacher model are initialized with the model trained in stage one, and the student model is then trained on pseudo-labels generated by the teacher model, minimizing the impact of the noise introduced by distant supervision.

Figure 7 BOND training flow chart
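The stage-two mechanics can be illustrated with a heavily simplified self-training loop. The sketch below uses a plain logistic-regression classifier on toy feature vectors as a stand-in for the BERT token classifier, keeping only the teacher/student pseudo-labeling and the periodic teacher refresh; it is not the actual BOND implementation.

```python
import copy
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1 stand-in: a model already adapted with distant (noisy) labels.
X_distant = rng.normal(size=(50, 5))
y_distant = (X_distant[:, 0] > 0).astype(int)          # noisy distant labels
stage1 = LogisticRegression().fit(X_distant, y_distant)

teacher = copy.deepcopy(stage1)                        # stage 2: both start from the stage-1 model
student = copy.deepcopy(stage1)

X_unlabeled = rng.normal(size=(200, 5))                # unlabeled data for self-training
for step in range(5):
    proba = teacher.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) > 0.7                # keep only high-confidence pseudo-labels
    pseudo_y = proba.argmax(axis=1)
    student = LogisticRegression().fit(X_unlabeled[confident], pseudo_y[confident])
    teacher = copy.deepcopy(student)                   # periodically refresh the teacher

print("pseudo-labeled examples used in the last step:", int(confident.sum()))
```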

5. Summary

This article has introduced classic knowledge extraction methods from the two perspectives of tag word mining and tag word classification, including unsupervised methods such as TF-IDF and TextRank that need no manually annotated data, and distantly supervised methods widely used in industry such as AutoPhrase and AutoNER. They can serve as references for content understanding, dictionary construction for query understanding, NER, and related directions in industry.

References

[1] Campos R, Mangaravite V, Pasquali A, et al. YAKE! Collection-independent automatic keyword extractor[C]//Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings 40. Springer International Publishing, 2018: 806-810. https://github.com/LIAAD/yake

[2] Mihalcea R, Tarau P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.

[3] Bennani-Smires K, Musat C, Hossmann A, et al. Simple unsupervised keyphrase extraction using sentence embeddings[J]. arXiv preprint arXiv:1801.04470, 2018.

[4] KeyBERT, https://github.com/MaartenGr/KeyBERT

[5] Witten I H, Paynter G W, Frank E, et al. KEA: Practical automatic keyphrase extraction[C]//Proceedings of the Fourth ACM Conference on Digital Libraries. 1999: 254-255.

[6] Xiong L, Hu C, Xiong C, et al. Open domain web keyphrase extraction beyond language modeling[J]. arXiv preprint arXiv:1911.02671, 2019.

[7] Sun S, Xiong C, Liu Z, et al. Joint keyphrase chunking and salience ranking with BERT[J]. arXiv preprint arXiv:2004.13639, 2020.

[8] Zhang Y, Yang J. Chinese NER using lattice LSTM[C]. ACL 2018.

[9] Li X, Yan H, Qiu X, et al. FLAT: Chinese NER using flat-lattice transformer[C]. ACL 2020.

[10] Shang J, Liu J, Jiang M, et al. Automated phrase mining from massive text corpora[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(10): 1825-1837.

[11] Shang J, Liu L, Ren X, et al. Learning named entity tagger using domain-specific dictionary[C]. EMNLP, 2018.

[12] Liang C, Yu Y, Jiang H, et al. BOND: BERT-assisted open-domain named entity recognition with distant supervision[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1054-1064.

[13] Exploration and practice of NER technology in Meituan search, https://zhuanlan.zhihu.com/p/163256192

