
Detailed explanation of usage examples of jieba Chinese word segmentation

巴扎黑 (Original) | 2017-07-23

Unlike English text, where words are already separated by spaces, Chinese text must first be segmented into words before it can be turned into feature vectors for classification. This is why word segmentation is needed.
Here we use jieba, a popular open-source Chinese word segmentation tool. It effectively extracts the words of a sentence one by one. The principles behind jieba's segmentation will not be covered here; the focus is on how to use it.
1. Installation
jieba ("stutter" in Chinese) is a Python library and is installed into the Python environment. The installation methods are as follows:
(1) Under Python 2.x
Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download the package, unzip it, then run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then reference it with import jieba
(2) Under Python 3.x
Currently the master branch only supports Python 2.x.
A Python 3.x branch is also basically usable:

git clone 
git checkout jieba3k
python setup.py install

2. Usage
Before use, import the library with import jieba. Since Chinese text may contain symbols besides the text itself, such as brackets, equals signs, or arrows, these also need to be matched with regular expressions and removed.
Because regular expressions are used, import re is also required to bring in the relevant library.
The specific code is as follows:

def textParse(sentence):
    import jieba
    import re
    # The next two lines filter out symbols other than Chinese characters and alphanumeric text
    r = re.compile(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+")
    sentence = r.sub('', sentence)
    seg_list = jieba.cut(sentence)
    #print("Default Mode:", ' '.join(seg_list))
    return [tok for tok in seg_list]

The textParse function receives a sentence as its parameter and returns a list of the words in that sentence.
The key function in jieba is jieba.cut. It splits the received sentence into words and returns a generator for iteration; the last line of code converts this generator into a list.
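As a quick illustration, here is a minimal sketch of this behavior, using the example sentence from jieba's own documentation:

import jieba

seg_gen = jieba.cut("我来到北京清华大学")  # returns a generator
words = list(seg_gen)                      # materialize the generator into a list
print("/".join(words))                     # e.g. 我/来到/北京/清华大学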

3. Stop words
Stop words are the modal particles and connectives that appear in Chinese text. If they are not removed, they blur the relationship between the core words and the classification, for example the equivalents of "of" and "and". You can also add stop words suited to your particular classification scenario. The Chinese stop word list used here covers 1598 stop words and can be obtained from GitHub.
The project improvements are as follows:
(1) Create a new stop word file, stopkey.txt, in the project and put all the Chinese stop words into this text file.
(2) Add a stop word filtering step to the Chinese word segmentation function (see the sketch below).
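A minimal sketch of step (2), assuming stopkey.txt holds one stop word per line in UTF-8:

import jieba

def load_stopwords(path="stopkey.txt"):
    # Read the stop word file into a set for O(1) membership tests
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

def cut_without_stopwords(sentence, stopwords):
    # Segment the sentence and drop every token found in the stop word set
    return [w for w in jieba.cut(sentence) if w not in stopwords]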

4. Custom dictionary
For a given classification scenario, define some common domain terms so that the segmenter treats each of them as a single word. For example, adding the database term "多对多" (many-to-many) to the dictionary prevents it from being split into "多", "对", and "多" during segmentation. Which terms to define depends on the classifier's application scenario.
The project improvements are as follows:
(1) Add a custom dictionary file, userdict.txt.
(2) Load the custom dictionary in the Chinese word segmentation function (see the sketch below).
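A minimal sketch of step (2). In jieba's documented dictionary format, each line of userdict.txt is "word [frequency] [POS tag]", with the last two fields optional:

import jieba

# userdict.txt might contain, for example, the single line: 多对多 3 n
jieba.load_userdict("userdict.txt")  # register the custom dictionary once
jieba.add_word("多对多")              # words can also be added at runtime
print("/".join(jieba.cut("数据库中的多对多关系")))
# With the dictionary loaded, 多对多 stays a single token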

5. Improved Chinese word segmentation function
The code is as follows (filtering of other common symbols has also been added):

# Chinese word segmentation
def textParse(sentence):
    import jieba
    import re

    # Filter out symbols other than Chinese characters and alphanumeric text
    r = re.compile(r"[\s+\.\!\/_\?【】\-()\[\]:,$%^*+\"\']+|[+——!,。?、~@#¥%……&*()]+")

    sentence = r.sub('', sentence)
    jieba.load_userdict("userdict.txt")  # load the custom dictionary
    # The stop word file is UTF-8 encoded; one stop word per line
    with open("stopkey.txt", 'r', encoding='utf-8') as f:
        stoplist = {line.strip() for line in f}
    seg_list = jieba.cut(sentence)
    seg_list = [word for word in seg_list if word not in stoplist]
    #print("Default Mode:", ' '.join(seg_list))
    return seg_list
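A hypothetical call to the improved function, assuming userdict.txt and stopkey.txt exist in the working directory (the input sentence is made up for illustration):

tokens = textParse("数据库中存在多对多的关系!")
print(tokens)  # punctuation and stop words removed; 多对多 kept as one token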
