New word discovery algorithm based on CNN
Author | mczhao, senior R&D manager at Ctrip, focusing on natural language processing technology.
Overview
With the continuous emergence of consumer hot spots and new internet memes, NLP tasks on e-commerce platforms often encounter words that have not been seen before. These words are not in the system's existing vocabulary and are called "unregistered words" (out-of-vocabulary words).
On the one hand, words missing from the lexicon degrade the segmentation quality of lexicon-based word segmenters. This indirectly affects the quality of text recall and highlight prompts, that is, the accuracy of user text search and the interpretability of search results.
On the other hand, mainstream deep learning NLP algorithms such as BERT/Transformer often use character vectors instead of word vectors when processing Chinese. In theory, word vectors should work better, but because of unregistered words, character vectors work better in practice. If the vocabulary were more complete, word vectors would outperform character vectors.
To sum up, new word discovery is a problem we need to solve at the moment.
1. Traditional unsupervised method
The industry has a relatively mature solution to Chinese new word discovery. The input is a corpus. The text is split into n-grams to generate candidate fragments, some statistical features of these fragments are computed, and each fragment is then judged to be a word or not based on these features.
The mainstream approach is to compute and examine indicators of three kinds: popularity, cohesion, and the richness of left and right adjacent characters. Many articles online describe these three indicators, so only a brief introduction is given here. For details, refer to the new word discovery articles by Hello NLP and Smooth NLP.
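The candidate generation step described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `ngram_counts`, the `max_len` limit, and the three-sentence toy corpus are assumptions, and the Chinese strings are my rendering of the "Pudong Airport Ramada Hotel" example used later in the article.

```python
from collections import Counter

def ngram_counts(corpus, max_len=4):
    """Count every substring of 1..max_len characters in each sentence."""
    counts = Counter()
    for sentence in corpus:
        for start in range(len(sentence)):
            # end is exclusive, capped by max_len and by the sentence length
            for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
                counts[sentence[start:end]] += 1
    return counts

# Hypothetical toy corpus
corpus = ["浦东机场华美达酒店", "浦东机场", "华美达酒店"]
counts = ngram_counts(corpus)
print(counts["浦东"], counts["场华"])  # 2 1
```

Fragments that straddle a word boundary, such as "场华", already tend to be rarer than real words even in this tiny corpus.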
1.1 Popularity
Popularity is expressed by word frequency. Count the occurrences of every fragment across the whole corpus; high-frequency fragments are often words.
1.2 Cohesion
Cohesion is measured with pointwise mutual information (PMI):
For example, to determine whether "Hanting" is a word, we compute log(P("Hanting") / (P("Han") · P("Ting"))). The likelihood that "Hanting" is a word increases with the popularity of "Hanting" and decreases with the popularity of the single characters "Han" and "Ting". This is easy to understand. For example, the most common Chinese character is "的". Almost any Chinese character co-occurs with "的" very often, but that does not mean "x的" or "的x" is a word. Here the popularity of the single character "的" acts as a suppressing factor.
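The PMI formula above can be worked through numerically. The sketch below assumes probabilities are estimated as count divided by a total count; the function name `pmi` and all counts are made-up illustrative values, not statistics from the article.

```python
import math

def pmi(left, right, counts, total):
    """Pointwise mutual information of the fragment left+right."""
    p_joint = counts[left + right] / total
    p_left = counts[left] / total
    p_right = counts[right] / total
    return math.log(p_joint / (p_left * p_right))

# Hypothetical counts: "汉庭" (Hanting) is a real word, "的汉" is not
counts = {"汉": 120, "庭": 100, "汉庭": 90, "的": 5000, "的汉": 30}
total = 100_000

hanting = pmi("汉", "庭", counts, total)
de_han = pmi("的", "汉", counts, total)
print(hanting > de_han)  # True
```

With these numbers the very high frequency of "的" pulls the score of "的汉" down, even though "的汉" is not a rare pairing in absolute terms.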
1.3 The richness of left and right adjacent characters
Left and right adjacency entropy represents the richness of the left and right adjacent characters. It is the randomness of the distribution of characters that appear immediately to the left or right of the candidate fragment. You can keep the left entropy and the right entropy separate, or combine the two into a single indicator.
For example, the fragment "Shangri-La" has very high popularity and cohesion, and the popularity and cohesion of its sub-fragment "Shangri" are also very high. However, because "La" follows "Shangri" in most cases, the right adjacency entropy of "Shangri" is very low, which suppresses its word formation. We can therefore conclude that "Shangri" is not a word on its own.
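The Shangri-La example can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function name `neighbor_entropy` and the four-sentence corpus are invented for illustration, occurrences at a sentence boundary are simply skipped, and "香格里拉" / "香格里" are my renderings of "Shangri-La" / "Shangri".

```python
import math
from collections import Counter

def neighbor_entropy(fragment, corpus, side="right"):
    """Shannon entropy of the characters adjacent to each occurrence of fragment."""
    neighbors = Counter()
    for sentence in corpus:
        start = sentence.find(fragment)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(fragment)
            if 0 <= idx < len(sentence):  # skip occurrences at a sentence boundary
                neighbors[sentence[idx]] += 1
            start = sentence.find(fragment, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(n / total * math.log(n / total) for n in neighbors.values())

# Hypothetical toy corpus
corpus = ["香格里拉酒店", "入住香格里拉", "香格里拉很美", "去香格里拉的路"]
sub_fragment = neighbor_entropy("香格里", corpus)   # always followed by "拉"
full_word = neighbor_entropy("香格里拉", corpus)    # followed by "酒", "很", "的"
print(sub_fragment < full_word)  # True
```

The sub-fragment is always followed by the same character, so its right entropy is zero, while the full word is followed by several different characters.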
2. Limitations of the classic method
The problem with the classic method is that threshold parameters must be set manually. After an NLP expert studies the probability distribution of the fragments in the current corpus, the expert combines these indicators through formulas or uses them independently, and then sets a threshold as the decision criterion. Decisions made with this criterion can achieve high accuracy.
However, the probability distribution, or word frequency, is not static. As the corpus grows, or as the weighted popularity of the corpus (usually the popularity of the corresponding products) fluctuates, the parameters and thresholds in the expert's formulas must be adjusted continually. This wastes a lot of manpower and turns artificial intelligence engineers into mere parameter tuners.
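To make the maintenance burden concrete, a hand-tuned rule of the kind described above might look like the sketch below. The function name `is_word`, the way the indicators are combined, and every threshold value are hypothetical; the article does not give the formula used in practice.

```python
def is_word(freq, pmi, left_entropy, right_entropy,
            min_freq=5, min_pmi=3.0, min_entropy=1.0):
    """Accept a fragment only if every indicator clears its hand-set threshold."""
    return (freq >= min_freq
            and pmi >= min_pmi
            and min(left_entropy, right_entropy) >= min_entropy)

print(is_word(90, 6.6, 1.5, 1.2))  # True: passes all three checks
print(is_word(90, 6.6, 1.5, 0.0))  # False: right entropy too low
```

Each of the three defaults depends on corpus size and popularity weighting, so all of them have to be revisited whenever the corpus changes.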
3. New word discovery based on deep learning
3.1 Word frequency probability distribution chart
The three indicators used by the industry algorithms above share a single underlying source feature: word frequency. In statistics, simple but key statistics are often displayed as pictures, such as histograms and box plots. Even without a model, people can make the right judgment at a glance. We can cut the corpus into all fragments up to a limited length, normalize the word frequency of the fragments to 0-255, and map it into a two-dimensional matrix. The rows represent the starting characters and the columns represent the ending characters. Each pixel is one fragment, and the brightness of the pixel is the popularity of that candidate fragment.
The picture above is the word frequency probability distribution map of the short sentence "Pudong Airport Ramada Hotel". We were pleasantly surprised to find that, with the naked eye, we can roughly pick out some brighter blocks shaped like isosceles right triangles, such as "Pudong", "Pudong Airport", "Airport", and "Ramada Hotel". These blocks indicate that the corresponding fragments are the words we need.
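The mapping from fragment frequencies to a grayscale matrix can be sketched as follows. The article does not say how the 0-255 normalization is done, so linear scaling by the largest count is an assumption here, as are the function name `frequency_map` and the toy counts for "浦东机场" (my rendering of "Pudong Airport").

```python
def frequency_map(sentence, counts):
    """Matrix where cell [start][end] is the 0-255 brightness of sentence[start:end+1]."""
    n = len(sentence)
    raw = [[0] * n for _ in range(n)]
    for start in range(n):
        for end in range(start, n):  # only the upper triangle holds fragments
            raw[start][end] = counts.get(sentence[start:end + 1], 0)
    peak = max(max(row) for row in raw) or 1  # avoid dividing by zero
    return [[round(255 * value / peak) for value in row] for row in raw]

# Hypothetical fragment counts
counts = {"浦": 50, "东": 80, "机": 100, "场": 90,
          "浦东": 40, "东机": 2, "机场": 60,
          "浦东机": 1, "东机场": 2, "浦东机场": 30}
fmap = frequency_map("浦东机场", counts)
print(fmap[0][1], fmap[1][2], fmap[2][3])  # 102 5 153
```

The cells for "浦东" and "机场" are bright while the cell for the cross-boundary fragment "东机" is nearly black, which is the contrast the eye picks up in the picture.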
3.2 Classic image segmentation algorithm
By observing the word frequency probability distribution map, we can turn the problem of segmenting a short sentence into an image segmentation problem. Early image segmentation algorithms resemble the new word discovery algorithms above: they are also threshold-based algorithms that detect changes in grayscale at edges. As the technology developed, deep learning algorithms became the norm, and one of the best known is the U-Net image segmentation algorithm.
The first half of U-Net uses convolution and downsampling to extract several layers of features at different granularities. The second half upsamples, concatenates these features at the same resolution, and finally obtains pixel-level classification results through a final Softmax classification layer.
3.3 New word discovery algorithm based on convolutional network
Segmenting the word frequency probability distribution map is similar to segmenting an image: both cut out regions that are adjacent in position and similar in gray level. Therefore, to segment short sentences we can borrow from image segmentation algorithms and use a fully convolutional network. The reason for using convolution is that, whether we are cutting short sentences or images, we care most about local information, that is, the pixels close to the cutting edge. The reason for using a multi-layer network is that multiple layers can express threshold judgments on features at different levels. For example, when segmenting map terrain we must consider not only the slope (first derivative/difference) but also the change in slope (second derivative/difference). The two are thresholded separately, and they are combined not by simple linear weighting but through a serial network.
For the new word discovery scenario, we design the following algorithm:
- First, zero-pad the word frequency distribution map of the short sentence to 24x24;
- Then apply two 3x3 convolution layers, each outputting 4 channels;
- Concat the outputs of the two convolution layers, apply another 3x3 convolution, and output a single channel;
- The loss function uses logistic=T, so the last layer can be used for classification directly, without a softmax output;
Compared with U-Net, there are the following differences:
1) Downsampling and upsampling are dropped. The reason is that the short sentences to be segmented are generally short and the resolution of the word frequency distribution map is low, so the model is simplified accordingly.
2) U-Net performs three-class classification (block 1, block 2, on the edge), while this algorithm needs only two classes (whether the pixel is a word). The final outputs therefore also differ: U-Net outputs continuous blocks and dividing lines, whereas we only need to know whether a given pixel is positive.
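The wiring of the layer list above can be traced with a small forward pass. This is a dependency-free sketch for following shapes only, not the author's implementation: the ReLU activations, the bottom/right placement of the zero padding, the random untrained weights, and all function names are assumptions, and the convolutions are read as two stacked layers whose outputs are concatenated.

```python
import random

def conv3x3(channels, kernels, biases):
    """Zero-padded ("same") 3x3 convolution over a list of H x W grids.

    kernels[o][i] is the 3x3 kernel linking input channel i to output channel o.
    """
    h, w = len(channels[0]), len(channels[0][0])
    outputs = []
    for per_input, bias in zip(kernels, biases):
        grid = [[bias] * w for _ in range(h)]
        for src, k in zip(channels, per_input):
            for y in range(h):
                for x in range(w):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < h and 0 <= xx < w:
                                grid[y][x] += src[yy][xx] * k[dy + 1][dx + 1]
        outputs.append(grid)
    return outputs

def relu(channels):
    return [[[max(0.0, v) for v in row] for row in grid] for grid in channels]

def random_kernels(n_out, n_in, rng):
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
             for _ in range(n_in)] for _ in range(n_out)]

def pad_to(grid, size=24):
    """Zero-pad a square frequency map on the bottom and right (assumed placement)."""
    n = len(grid)
    return [[grid[y][x] if y < n and x < n else 0.0 for x in range(size)]
            for y in range(size)]

def forward(freq_map, params):
    x = [pad_to(freq_map)]                                # 1 channel, 24x24
    c1 = relu(conv3x3(x, params["k1"], params["b1"]))     # 1 -> 4 channels
    c2 = relu(conv3x3(c1, params["k2"], params["b2"]))    # 4 -> 4 channels
    merged = c1 + c2                                      # channel concat: 8 channels
    return conv3x3(merged, params["k3"], params["b3"])[0]  # 8 -> 1, raw logits

rng = random.Random(0)  # untrained, random weights
params = {
    "k1": random_kernels(4, 1, rng), "b1": [0.0] * 4,
    "k2": random_kernels(4, 4, rng), "b2": [0.0] * 4,
    "k3": random_kernels(1, 8, rng), "b3": [0.0],
}
logits = forward([[255, 102], [0, 255]], params)
print(len(logits), len(logits[0]))  # 24 24
```

The output is one logit per pixel, so a pixel would be predicted as a word when its logit is positive. In a framework such as TensorFlow, which the article uses, the same structure would normally be built from convolution and concatenation layers and trained with a binary cross-entropy loss computed on logits.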
The picture below shows the results predicted by the model after training. In the output, the pixels corresponding to the three words "Shanghai" (the row for "上" and the column for "海"), "Hongqiao", and "Business District" have been identified.
To explore how the model works, you can inspect the convolution kernels of the middle layer. We first reduce the number of convolution kernels in the model's convolutional layers from 4 to 1. After training, we inspect the middle layer through TensorFlow's API: model.get_layer('Conv2').__dict__. We found that the convolution kernel of the Conv2 layer is as follows:
You can see that the first and second rows have opposite effects on the model. They correspond to the (weighted) difference between the pixel in the previous row and the pixel in the current row. The larger the grayscale difference, the more likely the string represented by this pixel is a word.
You can also see that the value 0.04505884 in the first row, second column has a relatively small absolute value. This may be because the positive parameter for the first row minus the second row and the negative parameter for the third column minus the second column cancel each other out.
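The row-difference behaviour described above can be imitated by hand. In the sketch below the weights `w_prev=-1.0` and `w_cur=1.0` are hypothetical stand-ins chosen only to have opposite signs; they are not the trained kernel values, and the frequency map holds made-up counts for "浦东机场" (my rendering of "Pudong Airport").

```python
def row_contrast(freq_map, w_prev=-1.0, w_cur=1.0):
    """Weighted difference between each pixel and the pixel one row above it.

    The row above row 0 is treated as zero padding.
    """
    n = len(freq_map)
    return [[w_prev * (freq_map[y - 1][x] if y > 0 else 0) + w_cur * freq_map[y][x]
             for x in range(n)] for y in range(n)]

# Hypothetical counts; cell [start][end] is the fragment from start to end
freq = [[50, 40, 1, 30],
        [0, 80, 2, 2],
        [0, 0, 100, 60],
        [0, 0, 0, 90]]
contrast = row_contrast(freq)
print(contrast[2][3], contrast[1][2])  # 58.0 1.0
```

The pixel for "机场" differs sharply from the pixel above it ("东机场"), while the pixel for "东机" barely differs from "浦东机", which matches the observation that a large difference between adjacent rows signals a word.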
5. Optimization space
This article describes a fully convolutional network model with a very simple structure, and there is still a lot of room for improvement.
The first is to expand the range of input features. For example, the only input feature in this article is word frequency. If left and right adjacency entropy were also included as input features, the segmentation would be more accurate.
The second is to increase the depth of the network. Through model analysis, we found that the first convolution layer mainly handles the cases produced by zero-padded pixels, so only one convolution layer actually focuses on the real popularity values. With a 3x3 convolution kernel, it can only see first-order differences; the pixels two rows or two columns away from the current pixel are not taken into account. You can enlarge the convolution kernel or deepen the network to give the model a larger field of view. However, deepening the network also brings the risk of overfitting.
Finally, this model can be used not only to supplement the vocabulary and thereby improve word segmentation, but also directly as a reference for word segmentation. Its predictions can be applied in both the candidate word recall step and the segmentation path scoring step of the word segmentation process.