Implement a small text classification system using Python
Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, and of using that knowledge to organize information better for future reference. In short, it is the process of discovering knowledge in unstructured text.
Text mining currently covers seven main areas, including:
· Search and information retrieval (IR)
· Text clustering: using clustering methods to group words, fragments, paragraphs, or documents
· Text classification: grouping and labeling fragments, paragraphs, or documents, using data-mining classification methods trained on labeled instances
The text classification workflow in this article follows six steps:
1. Preprocessing
2. Chinese word segmentation
3. Build the word vector space: count term frequencies and generate the word vector space of each text
4. Weight strategy - TF-IDF: use TF-IDF to discover feature words and extract them as features that reflect the document topic
5. Classifier: train a classifier with an algorithm
6. Evaluate the classification results
1. Preprocessing
a. Select the range of text to be processed
b. Establish a classified text corpus:
· Training set corpus
· Test set corpus
c. Text format conversion: use Python's lxml library to strip markup such as HTML/XML tags
d. Detect sentence boundaries: mark the ends of sentences
2. Chinese word segmentation
Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to certain rules. Chinese word segmentation splits a sequence of Chinese characters (a sentence) into individual words. It is a complicated task and, to some extent, not a purely algorithmic problem; in practice it has largely been addressed with probabilistic methods, most notably the conditional random field (CRF), a probabilistic graphical model.
Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy strongly affects every module built on top of it. The structured representation of text or sentences is the core task in language processing; current structured representations of text fall into four categories: the word vector space model, the topic model, the dependency-syntax tree representation, and the RDF graph representation.
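As a quick illustration (not part of the original article; it only assumes the jieba package is installed), the following minimal snippet shows what segmentation produces for a single sentence:

import jieba

sentence = "小明硕士毕业于中国科学院计算所"
words = jieba.cut(sentence)    # returns a generator of segmented words
print(" / ".join(words))       # exact output depends on jieba's dictionary and version,
                               # e.g. 小明 / 硕士 / 毕业 / 于 / 中国科学院 / 计算所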
The following is sample code for Chinese word segmentation:
# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    fp = open(savepath, "w", encoding='gb2312', errors='ignore')
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# corpus_path = "train_small/"  # path of the unsegmented training corpus
# seg_path = "train_seg/"       # path of the segmented training corpus
corpus_path = "test_small/"  # path of the unsegmented test (prediction) corpus
seg_path = "test_seg/"       # path of the segmented test corpus

catelist = os.listdir(corpus_path)  # get all category subdirectories under the corpus directory
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # build the path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # build the path of the segmented output directory
    if not os.path.exists(seg_dir):         # create the output directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()            # read the file content
        content = content.replace("\r\n", "").strip()   # remove newlines and extra whitespace
        content_seg = jieba.cut(content)                 # segment the text with jieba
        savefile(seg_dir + file_path, " ".join(content_seg))  # save the segmented, space-separated text
print("Word segmentation finished")
After segmentation, the segmented texts are packaged into a Bunch object and persisted with pickle:

import os
import pickle
from sklearn.datasets.base import Bunch
# The Bunch class provides a key/value object structure:
#   target_name: list of all category names
#   label: classification label of each file
#   filenames: file paths
#   contents: segmented content (word-vector form) of each file

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # save the category names into the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                          # save the category label of the current file
        bunch.filenames.append(fullname)                   # save the path of the current file
        bunch.contents.append(readfile(fullname).strip())  # save the segmented file content

# persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print("Finished building the text object")

3. Vector space model
Because text represented in a vector space model has very high dimensionality, certain words are automatically filtered out before classification in order to save storage space and improve search efficiency. These words or phrases are called stop words; a ready-made stop-word list can be downloaded.

4. Weight strategy: the TF-IDF method
If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good category-discriminating power and is well suited for classification.
Before giving this part of the code, let's first look at the concepts of term frequency and inverse document frequency.
Term frequency (TF): the frequency with which a given term appears in a document. The count is normalized to prevent a bias towards long documents. For a term in a particular document, its importance can be expressed as a ratio whose numerator is the number of occurrences of the term in the document and whose denominator is the total number of occurrences of all terms in that document.
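The original article's formula image is not reproduced here; the standard TF formula matching this description is

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$ and the denominator is the total number of term occurrences in $d_j$.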
Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of the quotient. Here |D| is the total number of documents in the corpus and the denominator counts the documents containing the term; if the term does not occur in the corpus the denominator would be zero, so 1 is usually added to it.
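Again reconstructing the missing formula image, the standard IDF formula consistent with this description is

$$\mathrm{idf}_{i} = \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}$$

and the TF-IDF weight is the product of the two:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$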
TF-IDF is thus the product of term frequency and inverse document frequency: a high term frequency within a particular document combined with a low document frequency across the whole collection yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones. The code is as follows:
import os
import pickle  # used for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # computes TF-IDF weights
from sklearn.feature_extraction.text import TfidfVectorizer   # generates the TF-IDF vector space

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF word-vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()  # this class computes the TF-IDF weight of every term

# convert the texts into a term matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# persist the word bag
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
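To make the two key outputs concrete, here is a minimal, self-contained sketch (not from the original project; the documents are made-up, already-segmented strings) of what fit_transform and vocabulary_ return:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "体育 比赛 足球",   # hypothetical segmented documents
    "艺术 绘画 展览",
    "足球 联赛 新闻",
]
vectorizer = TfidfVectorizer()        # the article additionally passes stop_words, sublinear_tf, and max_df
tdm = vectorizer.fit_transform(docs)  # sparse document-term matrix, one row per document
print(tdm.shape)                      # (3, number_of_distinct_terms)
print(vectorizer.vocabulary_)         # dict mapping each term to its column index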
5. Use the Naive Bayes classification module
Commonly used text classification methods include the kNN nearest-neighbor algorithm, the Naive Bayes algorithm, and the support vector machine (SVM) algorithm. Generally speaking:
· kNN is the simplest in principle and its classification accuracy is acceptable, but it is the slowest;
· Naive Bayes works best for short-text classification and has high accuracy;
· the advantage of the support vector machine is that it can handle linearly inseparable cases, while its accuracy is middling.
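As a hedged sketch (not in the original article), the following shows how the three classifiers could be tried on the same TF-IDF features; the toy documents and labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = ["足球 比赛 进球", "篮球 比赛 得分", "油画 展览 艺术", "雕塑 艺术 展览"]
labels = ["sports", "sports", "art", "art"]

vectorizer = TfidfVectorizer()
tdm = vectorizer.fit_transform(docs)              # training features
x_new = vectorizer.transform(["足球 得分 比赛"])   # a new, already-segmented document

for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=1)),
                  ("Naive Bayes", MultinomialNB(alpha=0.001)),
                  ("Linear SVM", LinearSVC())]:
    clf.fit(tdm, labels)
    print(name, "predicts:", clf.predict(x_new))  # most likely "sports" for this toy input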
The code above all operates on the training set. The same steps are now applied to the test set (drawn from the training data): first word segmentation, then generating the word-vector file, up to building the vector space model. The difference is that when building the test-set model, the training-set word bag must be loaded, and the word vectors produced from the test set are mapped into the vocabulary of the training-set word bag to generate the vector space model. The code is as follows:
import os
import pickle  # used for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # computes TF-IDF weights
from sklearn.feature_extraction.text import TfidfVectorizer   # generates the TF-IDF vector space

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# load the segmented word-vector Bunch object of the test set
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF vector space of the test set
testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# load the word bag of the training set
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# initialize the vector space with TfidfVectorizer, reusing the training vocabulary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5,
                             vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# persist the word bag
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)
Next, the multinomial Naive Bayes algorithm is run to classify the test texts and report the error rate. The code is as follows:
import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes algorithm

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

# load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# apply the Naive Bayes algorithm
# alpha = 0.001: additive (Laplace) smoothing parameter; smaller values mean less smoothing
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the classification results
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# error rate
print("error_rate:", float(rate) * 100 / float(total), "%")
6. Evaluation of the classification results
Algorithm evaluation in machine learning rests on three basic metrics:
· Recall: the ratio of relevant documents retrieved to all relevant documents in the collection; it measures the completeness of a retrieval system.
Recall = number of relevant documents retrieved / total number of relevant documents in the collection
· Precision: the ratio of relevant documents retrieved to the total number of documents retrieved; it measures the accuracy of a retrieval system.
Precision = number of relevant documents retrieved / total number of documents retrieved
Precision and recall influence each other. Ideally both are high, but in general higher precision comes with lower recall, and higher recall comes with lower precision.
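For example (an illustrative figure, not from the original article): if a system retrieves 8 documents, 6 of which are relevant, while the collection contains 10 relevant documents in total, then precision = 6/8 = 0.75 and recall = 6/10 = 0.6.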
· F-Score: a single measure combining the two, computed as

$$F_\beta = \frac{(1+\beta^2)\, P \cdot R}{\beta^2 P + R}$$

where P is precision and R is recall; this formula expresses the relationship among the three. When β = 1 it reduces to the most common form, the F1-Measure:

$$F_1 = \frac{2PR}{P + R}$$
The evaluation code is as follows:
from sklearn import metrics

# evaluation
def metrics_result(actual, predict):
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict)))
    print("recall: {0:0.3f}".format(metrics.recall_score(actual, predict)))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict)))

metrics_result(test_set.label, predicted)
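One added caveat, not in the original article: newer scikit-learn versions require an explicit average argument for precision_score, recall_score, and f1_score when the labels have more than two classes. A minimal adjusted sketch:

from sklearn import metrics

def metrics_result(actual, predict):
    # weighted averaging makes the multi-class case explicit
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall: {0:.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))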