Let's Talk About Knowledge Extraction
1. Introduction
Knowledge extraction usually refers to mining structured information from unstructured text, such as tags and phrases that carry rich semantic information. It is widely used in industry scenarios such as content understanding and product understanding: valuable tags extracted from user-generated text are applied to content or products.
Knowledge extraction is usually accompanied by the classification of the extracted tags or phrases, and is typically modeled as a named entity recognition (NER) task. A common NER task identifies named entity spans and classifies them into place names, person names, organization names, etc.; domain-specific tag word extraction identifies tag words and assigns them to domain-defined categories, such as series (Air Force One, Sonic 9), brand (Nike, Li Ning), type (shoes, clothing, digital), and style (INS style, retro style, Nordic style).
For convenience, tags and information-rich phrases are collectively referred to as tag words in the following.
2. Knowledge extraction classification
Figure 1 Classification of knowledge extraction methods
3. Tag word mining
Unsupervised methods
Statistics-based methods
- TF-IDF (Term Frequency–Inverse Document Frequency): compute the TF-IDF score of each word; the higher the score, the more information the word carries (see the first sketch after this list). The score is tfidf(t, d, D) = tf(t, d) * idf(t, D), where tf(t, d) = log(1 + freq(t, d)) and freq(t, d) is the number of times candidate word t appears in the current document d; idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) counts how many of the N documents contain t and measures the word's rarity: a word that appears in only one document is rare and carries more information. In a specific business scenario, external tools can be used for a first round of candidate screening, for example keeping only nouns based on part-of-speech tags.
- YAKE[1]: defines five features to capture keyword characteristics and combines them heuristically into a score for each keyword; the lower the score, the more important the keyword (see the second sketch after this list). 1) Casing: a capitalized term (other than the first word of a sentence) is more important than a lowercase one; for Chinese this corresponds to counting bold words. 2) Word position: words near the beginning of a paragraph are more important than later ones. 3) Word frequency: how often the word occurs. 4) Word context: the number of distinct words that co-occur with the word within a fixed window; the more distinct co-occurring words, the less important the word. 5) Sentence spread: the number of different sentences in which the word appears; the more sentences, the more important the word.
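A minimal sketch of the TF-IDF scoring above, assuming documents are already segmented into candidate words (word segmentation and part-of-speech filtering are left out); function and variable names are illustrative:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute tfidf(t, d, D) = log(1 + freq(t, d)) * log(N / df(t)) per the formulas above.

    `docs` is a list of pre-tokenized documents (lists of candidate words)."""
    n_docs = len(docs)
    # df(t): number of documents containing candidate word t
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        freq = Counter(doc)
        scores.append({
            t: math.log(1 + f) * math.log(n_docs / df[t])
            for t, f in freq.items()
        })
    return scores

# Example: rank candidate words in the first document by TF-IDF score
docs = [["nike", "shoes", "retro", "shoes"], ["nike", "clothing"], ["nordic", "style"]]
print(sorted(tfidf_scores(docs)[0].items(), key=lambda kv: -kv[1]))
```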
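For YAKE, an open-source implementation is available; a minimal usage sketch assuming the pip-installable yake package (parameter names may vary across versions):

```python
import yake

text = "Nike Air Force One is a classic retro style sneaker series."
# Extract candidate phrases up to 3-grams; keep the 10 best-scored
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)
for keyword, score in extractor.extract_keywords(text):
    print(f"{score:.4f}\t{keyword}")  # lower score = more important
```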
Graph-based methods
- TextRank[2]: first segment the text and run part-of-speech tagging, filter out stop words, and keep only words with specified parts of speech to build the graph. Each node is a word; edges represent relationships between words, constructed from word co-occurrence within a sliding window of fixed size. PageRank then updates the node weights until convergence; sorting nodes by weight in descending order yields the top-k words as candidate keywords. Finally, the candidates are marked in the original text, and adjacent candidates are merged into multi-word keyphrases. A sketch of the scoring step follows.
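A minimal sketch of the TextRank scoring step, assuming the networkx library for PageRank; tokens are assumed to be already filtered, and the final adjacent-phrase merging step is omitted:

```python
import networkx as nx

def textrank_keywords(tokens, window=3, top_k=5):
    """Build a co-occurrence graph over filtered tokens and rank nodes with PageRank."""
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        # Link each word to the others that co-occur within the sliding window
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)  # iterate node weights until convergence
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = ["nike", "retro", "sneaker", "series", "classic", "retro", "sneaker"]
print(textrank_keywords(tokens))
```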
Embedding-based methods
- EmbedRank[3]: select candidate words via word segmentation and part-of-speech tagging, use pre-trained Doc2Vec and Sent2Vec as vector representations of the candidates and the document, and rank candidates by cosine similarity. Similarly, KeyBERT[4] replaces EmbedRank's vector representations with BERT. A ranking sketch follows.
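A KeyBERT-style ranking sketch, assuming the sentence-transformers package; the model name and candidate list are illustrative choices, and candidate selection via part-of-speech tagging is omitted:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

doc = "Nike Air Force One is a classic retro style sneaker series."
candidates = ["nike air force", "retro style", "sneaker series"]

doc_vec = model.encode([doc])          # (1, dim) document embedding
cand_vecs = model.encode(candidates)   # (n, dim) candidate embeddings

# Cosine similarity between each candidate and the document
sims = (cand_vecs @ doc_vec.T).ravel() / (
    np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(doc_vec)
)
for cand, sim in sorted(zip(candidates, sims), key=lambda x: -x[1]):
    print(f"{sim:.3f}\t{cand}")
```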
Supervised methods
- Screen candidate words first, then classify tag words: the classic KEA[5] model uses Naive Bayes as the classifier to score N-gram candidate words on four hand-designed features.
- Joint training of candidate screening and tag word recognition: BLING-KPE[6] takes the raw sentence as input, encodes the sentence's N-gram phrases with CNN and Transformer layers, and computes the probability that each phrase is a tag word, with the tag word labels annotated manually. BERT-KPE[7] builds on the idea of BLING-KPE and replaces ELMo with BERT to obtain better sentence representations. A schematic scoring head is sketched after Figure 2.
Figure 2 BLING-KPE model structure
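A schematic PyTorch sketch of the joint N-gram scoring idea, not the papers' exact architecture; all hyperparameters and names are illustrative:

```python
import torch
import torch.nn as nn

class NGramScorer(nn.Module):
    """Schematic scoring head in the spirit of BLING-KPE / BERT-KPE:
    a width-n convolution composes n-gram representations from contextual
    token embeddings, and a shared linear layer scores each n-gram as
    tag word / not tag word."""

    def __init__(self, hidden=768, max_n=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=n) for n in range(1, max_n + 1)]
        )
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, token_embeds):             # (batch, seq_len, hidden)
        x = token_embeds.transpose(1, 2)         # (batch, hidden, seq_len)
        logits = []
        for conv in self.convs:                  # one convolution per n-gram width
            grams = conv(x).transpose(1, 2)      # (batch, seq_len - n + 1, hidden)
            logits.append(self.classifier(grams).squeeze(-1))
        return logits  # per-width scores; trained with BCE against human labels

scorer = NGramScorer()
token_embeds = torch.randn(2, 16, 768)  # stand-in for BERT/ELMo outputs
print([l.shape for l in scorer(token_embeds)])
```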
Distant supervision methods
AutoPhrase
The AutoPhrase[10] paper defines high-quality phrases as semantically complete phrases that satisfy the following four conditions simultaneously:
- Popularity: the phrase occurs frequently enough in the documents;
- Concordance: the tokens of the phrase co-occur far more frequently than alternative collocations obtained by substitution;
- Informativeness: the phrase is informative and specific; for example, "this is" is a negative example carrying no information;
- Completeness: the phrase must be a complete unit relative to its sub-phrases.
The AutoPhrase tag mining process is shown in Figure 3. First, part-of-speech tagging is used to screen high-frequency N-grams as candidate words; then the candidates are classified via distant supervision; finally, the four conditions above are used to filter out high-quality phrases (phrase quality re-estimation).
Figure 3 AutoPhrase tag mining process
High-quality phrases obtained from an external knowledge base form the Positive Pool, and all other phrases are treated as negative examples. According to the paper's statistics, about 10% of the phrases in the negative pool are actually high quality but were labeled negative only because they are missing from the knowledge base, so the paper uses the random forest ensemble classifier shown in Figure 4 to reduce the impact of this noise on classification. In industry applications, classifier training can also be cast as a BERT-based sentence-pair binary classification task[13]. A toy sketch of the ensemble idea follows Figure 4.
Figure 4 AutoPhrase tag word classification method
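A toy scikit-learn sketch of the ensemble idea, with purely illustrative features and labels: bagging means each tree sees only a bootstrap sample of the noisy negative pool, which limits the influence of mislabeled high-quality phrases on the overall score:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix: one row per candidate phrase, columns are statistical
# features (e.g. frequency, concordance, informativeness); values are made up.
X = np.array([[120, 4.1, 2.3], [15, 0.7, 1.1], [300, 5.6, 3.0], [8, 0.2, 0.9]])
y = np.array([1, 0, 1, 0])  # 1 = positive pool (knowledge base), 0 = noisy negative pool

# Bootstrap aggregation over many trees reduces the impact of label noise
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])  # phrase quality scores in [0, 1]
```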
4. Tag word classification
Supervised methods
NER sequence labeling models
Lattice LSTM[8] was the first work to introduce lexicon information into Chinese NER. A lattice is a directed acyclic graph in which the first and last characters of each matched word determine its position in the grid, so matching a sentence against a lexicon (dictionary) yields a lattice-like structure, as shown in Figure 5(a). Lattice LSTM fuses this lexicon information into a native LSTM, as shown in Figure 5(b): for the current character, it fuses all dictionary words that end with that character; for example, the character "store" fuses the information of the matched words "Renhe Drugstore" and "drugstore". For each character, Lattice LSTM uses an attention mechanism to fuse a variable number of word cells. Although Lattice LSTM effectively improves NER performance, the RNN structure cannot capture long-distance dependencies, introducing lexical information this way is lossy, and the dynamic lattice structure cannot be fully parallelized on GPUs. The FLAT[9] model effectively addresses these problems. As shown in Figure 5(c), FLAT captures long-distance dependencies with a Transformer structure and designs position encodings to integrate the lattice: the words matched against the characters are appended to the sentence, each character and word is assigned a Head Position Encoding and a Tail Position Encoding, and the lattice is thereby flattened from a directed acyclic graph into the flat Flat-Lattice Transformer structure. A sketch of this flattening follows Figure 5.
Figure 5 NER model introducing lexical information
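A minimal sketch of flattening a lattice into tokens with head/tail positions, as FLAT's position encodings require; the strings and spans are illustrative (an English stand-in for a Chinese character sequence):

```python
def flatten_lattice(chars, matched_words):
    """Every character and every lexicon-matched word becomes one token
    carrying (head, tail) character positions; characters have head == tail,
    matched words span their start and end characters."""
    tokens = [(c, i, i) for i, c in enumerate(chars)]
    tokens += [(w, s, e) for w, (s, e) in matched_words.items()]
    return tokens

chars = list("drugstore")
matched = {"drug": (0, 3), "store": (4, 8), "drugstore": (0, 8)}
for token, head, tail in flatten_lattice(chars, matched):
    print(f"{token}\thead={head}\ttail={tail}")
```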
Distant supervision methods
AutoNER
AutoNER[11] uses an external dictionary to construct training data for distantly supervised entity recognition: it first performs entity boundary recognition (Entity Span Recognition) and then entity classification (Entity Classification). The external dictionary can be taken directly from an external knowledge base, or tag words can first be mined offline with AutoPhrase and then incrementally updated with the AutoNER model. A sketch of dictionary-based distant label generation follows Figure 6.
Figure 6 AutoNER model structure diagram
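A minimal sketch of generating distant-supervision labels from a dictionary by greedy longest match; the dictionary and categories are illustrative, and AutoNER's actual tagging scheme (Tie or Break) is more refined than this:

```python
def distant_labels(text, dictionary):
    """Greedily match dictionary entries against the raw text and emit
    (start, end, type) spans as noisy training labels for span recognition
    and entity classification."""
    spans, i = [], 0
    entries = sorted(dictionary, key=len, reverse=True)  # prefer longest match
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                spans.append((i, i + len(entry), dictionary[entry]))
                i += len(entry) - 1
                break
        i += 1
    return spans

dictionary = {"air force one": "series", "nike": "brand", "shoes": "type"}
print(distant_labels("nike air force one shoes", dictionary))
```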
References
[1] Campos R, Mangaravite V, Pasquali A, et al. YAKE! Keyword extraction from single documents using multiple local features[J]. Information Sciences, 2020, 509: 257-289.
[2] Mihalcea R, Tarau P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[3] Bennani-Smires K, Musat C, Hossmann A, et al. Simple unsupervised keyphrase extraction using sentence embeddings[J]. arXiv preprint arXiv:1801.04470, 2018.
[4] KeyBERT, https://github.com/MaartenGr/KeyBERT
[5] Witten I H, Paynter G W, Frank E, et al. KEA: Practical automatic keyphrase extraction[C]//Proceedings of the Fourth ACM Conference on Digital Libraries. 1999: 254-255.
[6] Xiong L, Hu C, Xiong C, et al. Open domain web keyphrase extraction beyond language modeling[J]. arXiv preprint arXiv:1911.02671, 2019.
[7] Sun S, Xiong C, Liu Z, et al. Joint keyphrase chunking and salience ranking with BERT[J]. arXiv preprint arXiv:2004.13639, 2020.
[8] Zhang Y, Yang J. Chinese NER using lattice LSTM[C]. ACL, 2018.
[9] Li X, Yan H, Qiu X, et al. FLAT: Chinese NER using flat-lattice transformer[C]. ACL, 2020.
[10] Shang J, Liu J, Jiang M, et al. Automated phrase mining from massive text corpora[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(10): 1825-1837.
[11] Shang J, Liu L, Ren X, et al. Learning named entity tagger using domain-specific dictionary[C]. EMNLP, 2018.
[12] Liang C, Yu Y, Jiang H, et al. BOND: BERT-assisted open-domain named entity recognition with distant supervision[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1054-1064.
[13] Exploration and practice of NER technology in Meituan search, https://zhuanlan.zhihu.com/p/163256192