
New word discovery algorithm based on CNN


Author | mczhao, Senior R&D Manager at Ctrip, focusing on natural language processing.

Overview

With the continuous emergence of consumer hot spots and viral internet memes, NLP tasks on e-commerce platforms frequently encounter words that have never been seen before. These words are not in the system's existing vocabulary and are called "unregistered" (out-of-vocabulary) words.

On the one hand, missing entries in the lexicon degrade the segmentation quality of lexicon-based word segmenters, which in turn hurts text recall and highlighting, that is, the accuracy of user text search and the interpretability of search results.

On the other hand, mainstream deep learning models such as BERT/Transformer often use character vectors rather than word vectors when processing Chinese. In theory word vectors should perform better, but because of unregistered words, character vectors perform better in practice. If the vocabulary were more complete, word vectors would outperform character vectors.

In short, new word discovery is a problem we need to solve right now.

1. Traditional unsupervised method

The industry already has a relatively mature solution to Chinese new word discovery. The input is a corpus; the texts are cut into n-grams to generate candidate fragments. Statistical features are then computed for each fragment, and the fragment is judged to be a word or not based on those features.

The mainstream approach in the industry is to compute indicators along three dimensions: popularity, cohesion, and the richness of left and right adjacent characters. Many articles online describe these three indicators; here is a brief introduction. For details, refer to the new word discovery articles by Hello NLP and Smooth NLP.

1.1 Popularity

Word frequency is used to express popularity. Count the occurrences of every fragment across the corpus; high-frequency fragments are often words.
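To make this concrete, here is a minimal sketch (not code from the article) of how candidate fragments could be generated and counted. The function name, the maximum fragment length of 4 characters, and the tiny corpus are illustrative assumptions only.

```python
from collections import Counter

def count_ngrams(corpus, max_len=4):
    """Count every character n-gram (candidate fragment) of length 1..max_len."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence)):
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
                counts[sentence[i:j]] += 1
    return counts

# Illustrative mini-corpus; in practice this would be the full e-commerce corpus.
corpus = ["浦东机场华美达酒店", "上海虹桥商务区", "上海浦东机场"]
counts = count_ngrams(corpus)
print(counts.most_common(5))
```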

1.2 Cohesion

Pointwise mutual information (PMI) is used to measure cohesion:

PMI(x, y) = log( P(xy) / (P(x) · P(y)) )

For example, to determine whether "Hanting" is a word, we compute log(P("Hanting") / (P("Han")P("Ting"))). The probability that "Hanting" is a word is proportional to the popularity of "Hanting" itself and inversely proportional to the popularity of the single characters "Han" and "Ting". This is easy to understand: the most common Chinese character is "的", and the probability of any character co-occurring with "的" is very high, yet that does not make "x的" or "的x" a word. Here the high popularity of the single character "的" plays an inhibitory role.
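As a rough sketch (not the article's implementation), cohesion can be computed from the fragment counts above. Taking the minimum PMI over all internal split points is a common convention and an assumption here; the article only shows the two-character case.

```python
import math

def cohesion(fragment, counts, total):
    """Minimum pointwise mutual information over all internal splits of a fragment.

    counts: fragment -> frequency (e.g. from count_ngrams above)
    total:  total number of counted fragments, used to turn counts into probabilities
    """
    p_frag = counts[fragment] / total
    best = float("inf")
    for k in range(1, len(fragment)):
        p_left = counts[fragment[:k]] / total
        p_right = counts[fragment[k:]] / total
        best = min(best, math.log(p_frag / (p_left * p_right)))
    return best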

1.3 The richness of left and right adjacent characters

Left and right adjacency entropy represents the richness of the characters adjacent to a fragment: it measures the randomness of the distribution of characters appearing immediately to the left or right of the candidate fragment. The left and right entropies can be kept as two separate indicators or combined into one.

H_right(w) = - Σ_b P(b | w) · log P(b | w), where b ranges over the characters appearing immediately to the right of the fragment w (the left adjacency entropy is defined symmetrically).

For example, the fragment "Shangri-La" has very high popularity and cohesion, and its sub-fragment "Shangri" also has high popularity and cohesion; but because the character "La" follows "Shangri" in the vast majority of cases, the right adjacency entropy of "Shangri" is very low, which suppresses its word-hood. We can therefore judge that "Shangri" is not a standalone word.
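A minimal sketch of the right adjacency entropy follows (the left-side version mirrors it). This is illustrative code, not the article's implementation.

```python
import math
from collections import Counter

def right_adjacency_entropy(fragment, corpus):
    """Entropy of the characters that immediately follow the fragment in the corpus."""
    neighbors = Counter()
    for sentence in corpus:
        start = sentence.find(fragment)
        while start != -1:
            nxt = start + len(fragment)
            if nxt < len(sentence):
                neighbors[sentence[nxt]] += 1
            start = sentence.find(fragment, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    # Low entropy (e.g. "La" almost always follows "Shangri") suppresses word-hood.
    return -sum(c / total * math.log(c / total) for c in neighbors.values())
```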

2. Limitations of the classic method

The problem with the classic method is that it requires thresholds to be set manually. After an NLP expert studies the probability distribution of fragments in the current corpus, he or she combines these indicators into a formula or uses them independently, then sets a threshold as the decision criterion. Judgments made against this criterion can indeed reach high accuracy.

However, the probability distributions and word frequencies are not static. As the corpus grows richer, or as the weighted popularity of the corpus (usually the popularity of the corresponding products) fluctuates, the parameters and thresholds set by the experts must be adjusted again and again. This wastes a lot of manpower and turns artificial intelligence engineers into mere parameter tweakers.

3. New word discovery based on deep learning

3.1 Word frequency probability distribution chart

All three indicators used in the industry algorithms above derive from a single underlying feature: word frequency. In statistical methods, simple but key statistics are usually presented visually, as histograms, box plots and so on; even without a model, people can often make the correct judgment at a glance. We can cut the corpus into all fragments up to a limited length, normalize the fragment frequencies to 0-255, and map them into a two-dimensional matrix: the rows represent the starting character, the columns the ending character, each pixel is a fragment, and the brightness of the pixel is the popularity of that candidate fragment.
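A sketch of how such a map could be built is shown below (illustrative only; the article gives no code, and the max-based normalization here is an assumption).

```python
import numpy as np

def frequency_map(sentence, counts, size=24):
    """Pixel (i, j) holds the popularity of the fragment sentence[i..j]."""
    assert len(sentence) <= size
    img = np.zeros((size, size), dtype=np.float32)
    for i in range(len(sentence)):
        for j in range(i, len(sentence)):
            img[i, j] = counts.get(sentence[i:j + 1], 0)
    if img.max() > 0:
        img = img / img.max() * 255.0   # normalize brightness to the 0-255 range
    return img
```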

[Figure: word frequency probability distribution map of the short sentence "Pudong Airport Ramada Hotel"]

The picture above is the word frequency probability distribution map of the short sentence "Pudong Airport Ramada Hotel". We were pleasantly surprised to find that, with the naked eye, we can roughly separate some brighter, isosceles right-triangle blocks, such as "Pudong", "Pudong Airport", "Airport" and "Ramada Hotel". These blocks indicate that the corresponding fragments are the words we need.

3.2 Classic image segmentation algorithm

By observing the word frequency probability distribution map, we can transform the short sentence segmentation problem into an image segmentation problem. Early image segmentation algorithms resemble the new word discovery algorithms above: they are also threshold-based methods that detect changes in edge grayscale. As the technology developed, deep learning algorithms became the norm, the best known being the U-Net image segmentation network.

[Figure: U-Net architecture]

The first half of U-Net uses convolutional downsampling to extract multiple layers of features at different granularities. The second half upsamples and concatenates these features at matching resolutions, and finally produces pixel-level classification results through a fully connected layer with softmax.

3.3 New word discovery algorithm based on convolutional network

Segmenting the word frequency probability distribution map is similar to segmenting an image: both carve out regions that are adjacent in position and similar in grayscale. Therefore, to segment short sentences we can also borrow from image segmentation and use a fully convolutional network. Convolution is used because, whether we are cutting short sentences or images, we care mostly about local information, namely the pixels near the cutting edge. Multiple layers are used because multi-layer convolution and pooling allow threshold judgments over features of different orders: for example, when segmenting terrain on a map, we must consider not only the slope (the first derivative/difference) but also the change in slope (the second derivative/difference); each is thresholded separately, and the way they are combined is not a simple linear weighting but a serial network.

For the new word discovery scenario, we design the following algorithm:

  • Zero-pad the word frequency distribution map of the short sentence to 24x24;
  • Apply two 3x3 convolution layers, each outputting 4 channels;
  • Concatenate the outputs of the two convolution layers, apply another 3x3 convolution, and output a single channel;
  • The loss function is a logistic loss on the raw outputs (logistic=T), so the last layer can be used for classification directly without a softmax output.
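Below is one possible reading of this architecture as a Keras sketch. The article does not specify the activation function, optimizer, layer names, or whether the two 3x3 convolutions run in sequence or in parallel, so those choices are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(size=24):
    inp = layers.Input(shape=(size, size, 1))                  # zero-padded 24x24 frequency map
    c1 = layers.Conv2D(4, 3, padding="same", activation="relu", name="Conv1")(inp)
    c2 = layers.Conv2D(4, 3, padding="same", activation="relu", name="Conv2")(c1)
    merged = layers.Concatenate()([c1, c2])                     # concat the two conv layers
    out = layers.Conv2D(1, 3, padding="same", name="Conv3")(merged)  # single-channel logits
    model = Model(inp, out)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
    return model

model = build_model()
model.summary()
```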

[Figure: network architecture of the new word discovery model]

Compared with U-Net, there are the following differences:

1) Downsampling and upsampling are abandoned. The reason is that the short sentences to be segmented are generally short and the resolution of the word frequency distribution map is low, so the model can be simplified accordingly.

2) U-Net performs three-way classification (block 1, block 2, on the edge), while this algorithm only needs binary classification (whether the pixel is a word). The final outputs therefore also differ: U-Net outputs contiguous blocks and dividing lines, while we only need to know whether a given point is positive.

The picture below shows the model's predictions after training. We can see that in the output, the pixels corresponding to the three words "Shanghai" (the pixel in the row of "Shang" and the column of "Hai"), "Hongqiao" and "Business District" have been identified.

[Figure: model prediction for the phrase "Shanghai Hongqiao Business District"]

Feeding the landmark names from Ctrip's landmark library into the trained model, we can automatically segment them and discover new words, as shown below. Although there are some bad cases, the overall accuracy is good.

[Figure: new words discovered from Ctrip's landmark library]

After importing these words into the lexicon, the accuracy of search word segmentation increases and the lexicon coverage of the segmentation results rises. Because search segmentation generally prefers over-recall to missed recall, the industry takes a fairly aggressive approach to recall at the segmentation stage and leaves precision to subsequent ranking. So although segmentation accuracy improves, users do not perceive a significant improvement in search result accuracy; it does, however, fix some of the incorrect highlighting caused by segmentation errors.

4. Model internal analysis

To explore how the model works, we can inspect the convolution kernel of the middle layer. We first reduce the number of convolution kernels in the model's convolutional layers from 4 to 1. After training, the middle layer can be viewed through TensorFlow's API: model.get_layer('Conv2').__dict__. The convolution kernel of the Conv2 layer turned out to be as follows:

[Figure: the 3x3 convolution kernel weights of the Conv2 layer]

You can see that the first and second rows have opposite effects on the model: the kernel effectively subtracts (with weights) the row above a pixel from the pixel's own row. The larger this grayscale difference, the more likely the string represented by the pixel is a word.

You can also see that the absolute value of the weight 0.04505884 in the first row, second column is relatively small. This may be because the positive parameter from "first row minus second row" and the negative parameter from "third column minus second column" cancel each other out.
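For reference, here is a short sketch of reading these weights with the standard Keras API (assuming the hypothetical model above, rebuilt with a single filter per layer as described).

```python
# get_weights() returns [kernel, bias]; the kernel shape is (3, 3, in_channels, out_channels).
kernel, bias = model.get_layer("Conv2").get_weights()
print(kernel[:, :, 0, 0])   # the 3x3 weights applied to the first input channel
```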

5. Optimization space

This article describes a fully convolutional network model with a very simple structure, and there is still a lot of room for improvement.

The first is to expand the range of input features. The input feature in this article is only word frequency; if the left and right adjacency entropies are also included as input features, the segmentation should be more accurate.
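As a hypothetical illustration (reusing the helper sketches above, which are themselves assumptions rather than the article's code), the extra indicator could simply be stacked as a second input channel:

```python
import numpy as np

def feature_map(sentence, counts, corpus, size=24):
    """Stack word frequency and right adjacency entropy as two input channels."""
    freq = frequency_map(sentence, counts, size)             # sketch defined earlier
    ent = np.zeros((size, size), dtype=np.float32)
    for i in range(len(sentence)):
        for j in range(i, len(sentence)):
            ent[i, j] = right_adjacency_entropy(sentence[i:j + 1], corpus)
    return np.stack([freq, ent], axis=-1)                    # input shape becomes (24, 24, 2)
```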

The second is to increase the depth of the network. Model analysis shows that the first convolution layer mainly handles the cases created by the zero-padded pixels, leaving only one convolution layer that actually attends to the real popularity values. A 3x3 kernel can only see first-order difference results; the rows and columns two steps away from the current pixel are not taken into account. We could enlarge the kernel or deepen the network to give the model a larger receptive field, though deepening the network also brings the risk of overfitting.

Finally, this model can not only supplement the vocabulary to improve segmentation quality, but can also serve directly as a reference during segmentation itself: its predictions can be applied both to candidate word recall and to scoring segmentation paths.
