
Let's talk about knowledge extraction. Have you learned it?

Nov 13, 2023 pm 08:13 PM
Tags: knowledge extraction, NER

1. Introduction

Knowledge extraction usually refers to mining structured information from unstructured text, such as tags and phrases that carry rich semantic information. It is widely used in industry for scenarios such as content understanding and product understanding: valuable tags are extracted from user-generated text and then attached to the corresponding content or products.

Knowledge extraction is usually accompanied by classification of the extracted tags or phrases, and is typically modeled as a named entity recognition (NER) task. A common NER task identifies named entity mentions and classifies them as place names, person names, organization names, and so on. Domain-specific tag word extraction identifies tag words and assigns them to domain-defined categories, such as series (Air Force One, Sonic 9), brand (Nike, Li Ning), type (shoes, clothing, digital), or style (INS style, retro style, Nordic style).

For convenience, in the rest of this article, information-rich tags and phrases are collectively called tag words.

2. Knowledge extraction classification

This article introduces classic knowledge extraction methods from two perspectives: tag word mining and tag word classification. Tag word mining methods are divided into unsupervised, supervised, and distantly supervised methods, as shown in Figure 1. Tag word mining selects high-scoring tag words in two steps: candidate word mining and phrase scoring. Tag word classification usually models tag word extraction and classification jointly, casting the problem as a sequence labeling task for named entity recognition.

Figure 1: Classification of knowledge extraction methods

3. Tag word mining

Unsupervised method

Statistics-based method

First segment the document, or combine the segmented words into N-grams, to form candidate words; then score the candidates based on statistical features.
  • TF-IDF (Term Frequency-Inverse Document Frequency): compute the TF-IDF score of each word. The higher the score, the more information the word carries.

Calculation: tfidf(t, d, D) = tf(t, d) * idf(t, D), where tf(t, d) = log(1 + freq(t, d)) and freq(t, d) is the number of times candidate word t appears in the current document d; idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) measures how many of the N documents contain candidate word t, capturing the rarity of the word. If a word appears in only one document, it is rare and carries more information.

In a specific business scenario, external tools can be used to run a first round of screening on candidate words, for example using part-of-speech tags to keep only nouns.
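As a concrete illustration, the TF-IDF scoring above can be sketched in a few lines of Python. This is a toy example on a tiny, pre-tokenized corpus (the documents and words are made up for illustration):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score every word in every document by TF-IDF.

    tf(t, d)  = log(1 + freq(t, d))
    idf(t, D) = log(N / |{d in D : t in d}|)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        freq = Counter(doc)
        scores.append({
            t: math.log(1 + f) * math.log(n_docs / df[t])
            for t, f in freq.items()
        })
    return scores

docs = [
    ["nike", "shoes", "retro", "style"],
    ["nike", "shoes", "running"],
    ["nordic", "style", "clothing"],
]
scores = tfidf_scores(docs)
```

In this toy corpus "retro" appears in only one document while "nike" appears in two, so "retro" receives the higher score in the first document, matching the intuition that rarer words carry more information.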

  • YAKE[1]: defines five features to capture keyword characteristics, which are combined heuristically to assign each keyword a score; the lower the score, the more important the keyword. 1) Casing: a capitalized term (other than the first word of a sentence) is more important than a lowercase one; in Chinese, the count of bold occurrences plays the corresponding role. 2) Word position: words near the beginning of each paragraph are more important than later ones. 3) Word frequency: how often the word occurs. 4) Word context: the number of distinct words that co-occur with the word within a fixed window; the more distinct co-occurring words, the lower the word's importance. 5) Sentence spread: the number of different sentences in which the word appears; the more sentences, the more important the word.

Graph-Based Model

  • TextRank[2]: first segment the text, apply part-of-speech tagging, filter out stop words, and keep only words with specified parts of speech to build the graph. Each node is a word, and edges represent relationships between words, constructed from word co-occurrence within a moving window of predetermined size. PageRank is used to update node weights until convergence; nodes are then sorted by weight in descending order and the top k words are taken as candidate keywords. Finally, the candidate words are marked in the original text, and adjacent candidates are merged into multi-word keyword phrases.
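The graph construction and PageRank iteration can be sketched as below. This is a simplified version: part-of-speech screening and stop-word filtering are omitted, edges are unweighted, and the token list is a made-up toy example:

```python
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iters=50):
    """Rank words by PageRank over a word co-occurrence graph.

    Nodes are words; an edge links two words that co-occur within
    `window` positions of each other.
    """
    neighbors = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                neighbors[tokens[i]].add(tokens[j])
                neighbors[tokens[j]].add(tokens[i])
    # Standard PageRank update with damping factor d.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

tokens = ["nike", "shoes", "nike", "retro", "style"]
ranking = textrank(tokens, window=1)
```

Words with more (and better-connected) neighbors accumulate higher weight; here "nike" and "retro" sit in the middle of the co-occurrence graph and rank above the peripheral words.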

Embedding-based method

Embedding-based methods rank candidate words by computing the vector similarity between each candidate and the document.
  • EmbedRank[3]: select candidate words via word segmentation and part-of-speech tagging, use pre-trained Doc2Vec and Sent2Vec as the vector representations of candidates and documents, and rank candidates by cosine similarity. Similarly, KeyBERT[4] replaces EmbedRank's vector representations with BERT.
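The ranking step itself is just cosine similarity. A minimal sketch, with tiny hand-made 3-dimensional vectors standing in for the Doc2Vec/Sent2Vec (or BERT) embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_candidates(doc_vec, candidates):
    """Rank candidate phrases by cosine similarity to the document vector."""
    return sorted(candidates, key=lambda c: cosine(candidates[c], doc_vec),
                  reverse=True)

# Toy vectors; in EmbedRank these come from a pre-trained embedding model.
doc_vec = [0.9, 0.1, 0.2]
candidates = {
    "retro style": [0.8, 0.2, 0.1],    # points in nearly the same direction
    "running shoes": [0.1, 0.9, 0.3],  # points elsewhere
}
ranking = rank_candidates(doc_vec, candidates)  # "retro style" ranks first
```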

Supervised method

Supervised methods train a model to decide whether a candidate word is a tag word.
  • Screen candidates first, then classify: the classic model KEA[5] uses Naive Bayes as the classifier to score N-gram candidate words on four hand-designed features.
  • Jointly train candidate screening and tag word recognition: BLING-KPE[6] takes the raw sentence as input, encodes the sentence's N-gram phrases with a CNN and a Transformer, and predicts the probability that each phrase is a tag word, with manually labeled tag words as supervision. BERT-KPE[7] follows the idea of BLING-KPE but replaces ELMo with BERT to obtain better sentence representations.

Figure 2: BLING-KPE model structure


Distant supervision method

AutoPhrase

The typical representative of distant supervision is AutoPhrase[10], which is widely used for tag word mining in industry. AutoPhrase uses existing high-quality knowledge bases for distantly supervised training, avoiding manual annotation.

The paper defines high-quality phrases as semantically complete words that satisfy the following four conditions simultaneously:

  • Popularity: the phrase occurs frequently enough in the documents;
  • Concordance: the tokens co-occur much more often than the collocations obtained by substituting other tokens, i.e., high co-occurrence frequency;
  • Informativeness: the phrase is informative and specific; for example, "this is" is a negative example carrying no information;
  • Completeness: the phrase must be a complete semantic unit rather than a fragment of a longer phrase.
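The first two conditions can be approximated with simple corpus statistics. The sketch below is not AutoPhrase itself, just an illustration: it scores bigram candidates by raw frequency as a stand-in for popularity, and by pointwise mutual information (how much more often the two tokens co-occur than chance predicts) as a stand-in for concordance:

```python
import math
from collections import Counter

def candidate_stats(tokens, min_count=2):
    """Toy popularity/concordance signals for bigram candidates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    stats = {}
    for (a, b), f in bigrams.items():
        if f < min_count:
            continue  # popularity filter: drop rare candidates
        # PMI: log of observed bigram probability over the product of
        # the unigram probabilities (what chance co-occurrence predicts).
        pmi = math.log((f / (n - 1)) / ((unigrams[a] / n) * (unigrams[b] / n)))
        stats[" ".join((a, b))] = {"popularity": f, "concordance": pmi}
    return stats

tokens = ["air", "force", "one", "is", "a", "classic",
          "the", "air", "force", "one", "sells", "well"]
stats = candidate_stats(tokens)
```

"air force" and "force one" pass the popularity filter and get positive PMI because their tokens almost always appear together, which is the intuition behind the concordance condition.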

The AutoPhrase tag mining process is shown in Figure 3. First, part-of-speech tagging is used to screen high-frequency N-grams as candidates. Then the candidate words are classified via distant supervision. Finally, the four conditions above are used to filter for high-quality phrases (phrase quality re-estimation).

Figure 3: AutoPhrase tag mining process

High-quality phrases obtained from the external knowledge base form the positive pool, and all other phrases are treated as negative examples. According to the paper's experimental statistics, about 10% of the phrases in the negative pool are actually high quality but simply missing from the knowledge base, so the paper uses the random forest ensemble classifier shown in Figure 4 to reduce the impact of this noise on classification. In industry applications, classifier training can also be cast as a sentence-pair binary classification task on the pre-trained BERT model[13].

Figure 4: AutoPhrase tag word classification method

4. Tag word classification

Supervised method

NER sequence annotation model

Named entity recognition (NER) is also a tag extraction method that jointly trains candidate word screening and tag word recognition, usually targeting scenarios where sentences are relatively information-rich. To identify entity mentions in a sentence, it is implemented as a sequence labeling model: taking the sentence as input, it predicts for each token the probability of labels such as B (Begin)-LOC (place name), I (Inside)-LOC, E (End)-LOC, and O (Others), where the part after "-" is the entity category. Chinese NER usually models sequence labeling at the character level rather than the word level, to avoid error propagation from Chinese word segmentation; lexicon information therefore needs to be introduced to strengthen entity word boundaries.
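Decoding entity spans from a BIOE label sequence can be sketched as follows (character-level, as in Chinese NER; the label sequence here is hand-written for illustration, whereas a real model would predict it):

```python
def decode_bioe(tokens, labels):
    """Decode (entity_text, entity_type) spans from per-token BIOE labels,
    e.g. B-LOC / I-LOC / E-LOC / O."""
    spans, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            current, etype = [tok], lab[2:]       # start a new entity
        elif lab.startswith("I-") and current:
            current.append(tok)                    # continue the entity
        elif lab.startswith("E-") and current:
            current.append(tok)                    # close the entity
            spans.append(("".join(current), etype))
            current, etype = [], None
        else:  # "O" or a malformed sequence: discard any partial entity
            current, etype = [], None
    return spans

# Character-level labeling of "我在北京工作" ("I work in Beijing").
tokens = list("我在北京工作")
labels = ["O", "O", "B-LOC", "E-LOC", "O", "O"]
spans = decode_bioe(tokens, labels)  # → [("北京", "LOC")]
```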

Lattice LSTM[8] is the first work to introduce lexicon information for Chinese NER. A lattice is a directed acyclic graph in which a matched word's position is determined by its first and last characters; matching a sentence against a lexicon (dictionary) yields a lattice-like structure, as shown in Figure 5(a). Lattice LSTM fuses the lexicon information into a native LSTM, as shown in Figure 5(b): for the current character, it fuses all dictionary words ending at that character. For example, the character "店" (store) fuses the information of "人和药店" (Renhe drugstore) and "药店" (drugstore). For each character, Lattice LSTM uses an attention mechanism to fuse a variable number of word units. Although Lattice LSTM effectively improves NER performance, the RNN structure cannot capture long-distance dependencies, introducing lexicon information this way is lossy, and the dynamic lattice structure cannot fully exploit GPU parallelism. The FLAT[9] model effectively addresses these problems. As shown in Figure 5(c), FLAT captures long-distance dependencies with a Transformer and designs a position encoding to integrate the lattice structure: the words matched in the sentence are appended after its characters, each character and word is assigned a head position encoding and a tail position encoding, and the lattice is thereby flattened from a directed acyclic graph into the flat Flat-Lattice Transformer structure.

Figure 5: NER models that introduce lexicon information

Distant supervision method

AutoNER

AutoNER[11] uses an external dictionary to construct training data for distantly supervised entity recognition. It first performs entity span recognition and then entity classification. The external dictionary can be taken directly from an external knowledge base, or built by first mining tag words offline with AutoPhrase and then incrementally updating them with the AutoNER model. To address the noise in distant supervision, entity boundary recognition replaces the BIOE labeling scheme with a Tie-or-Break scheme, where Tie means the current word and the previous word belong to the same entity, and Break means they do not.
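The Tie-or-Break scheme can be illustrated with a small decoding function. The boundary labels below are hand-written for illustration; in AutoNER they are predicted by the model, and the resulting spans are then typed in the entity classification stage:

```python
def spans_from_tie_break(tokens, boundaries):
    """Group tokens into candidate spans from Tie/Break boundary labels.

    boundaries[i] labels the gap between tokens[i] and tokens[i+1]:
    "Tie"   -> the two tokens stay in the same span,
    "Break" -> a span boundary falls between them.
    """
    assert len(boundaries) == len(tokens) - 1
    spans, current = [], [tokens[0]]
    for tok, b in zip(tokens[1:], boundaries):
        if b == "Tie":
            current.append(tok)
        else:
            spans.append(current)
            current = [tok]
    spans.append(current)
    return spans

tokens = ["prostaglandin", "synthesis", "was", "inhibited"]
boundaries = ["Tie", "Break", "Break"]
spans = spans_from_tie_break(tokens, boundaries)
# → [["prostaglandin", "synthesis"], ["was"], ["inhibited"]]
```

Because a wrong dictionary match usually corrupts only one boundary label rather than a whole BIOE label run, this scheme is more robust to distant-supervision noise.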

In the entity classification stage, a Fuzzy CRF is used to handle the case where one entity has multiple possible types.

Figure 6: AutoNER model structure

BOND

BOND[12] is a two-stage entity recognition model based on distantly supervised learning. In the first stage, distant labels are used to adapt a pre-trained language model to the NER task. In the second stage, a student model and a teacher model are both initialized from the stage-one model, and the student is then trained on pseudo-labels generated by the teacher, minimizing the impact of the noise introduced by distant supervision.

Figure 7: BOND training flow chart

5. Summary

This article has introduced classic knowledge extraction methods from the two perspectives of tag word mining and tag word classification, including the unsupervised classics TF-IDF and TextRank, which need no manually annotated data, and the distantly supervised methods AutoPhrase and AutoNER, which are widely used in industry. These methods can serve as references for content understanding, dictionary construction for query understanding, NER, and related directions.

References

【1】Campos R, Mangaravite V, Pasquali A, et al. YAKE! Collection-independent automatic keyword extractor[C]//Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings 40. Springer International Publishing, 2018: 806-810. https://github.com/LIAAD/yake

【2】Mihalcea R, Tarau P. Textrank: Bringing order into text[C]//Proceedings of the 2004 conference on empirical methods in natural language processing. 2004: 404-411.

【3】Bennani-Smires K, Musat C, Hossmann A, et al. Simple unsupervised keyphrase extraction using sentence embeddings[J]. arXiv preprint arXiv:1801.04470, 2018.

【4】KeyBERT, https://github.com/MaartenGr/KeyBERT

【5】Witten I H, Paynter G W, Frank E, et al. KEA: Practical automatic keyphrase extraction[C]//Proceedings of the fourth ACM conference on Digital libraries. 1999: 254-255.

【6】Xiong L, Hu C, Xiong C, et al. Open domain Web keyword extraction beyond language models[J]. arXiv preprint arXiv:1911.02671, 2019.

【7】Sun S, Xiong C, Liu Z, et al. Joint keyphrase chunking and salience ranking with BERT[J]. arXiv preprint arXiv:2004.13639, 2020.

【8】Zhang Y, Yang J. Chinese named entity recognition using lattice LSTM[C]. ACL 2018.

【9】Li X, Yan H, Qiu X, et al. FLAT: Chinese NER using flat-lattice transformer[C]. ACL 2020.

【10】Shang J, Liu J, Jiang M, et al. Automated phrase mining from massive text corpora[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(10): 1825-1837.

【11】Shang J, Liu L, Ren X, et al. Learning named entity tagger using domain-specific dictionary[C]. EMNLP, 2018.

【12】Liang C, Yu Y, Jiang H, et al. BOND: BERT-assisted open-domain named entity recognition with distant supervision[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1054-1064.

【13】Exploration and practice of NER technology in Meituan search, https://zhuanlan.zhihu.com/p/163256192
