Implement a small text classification system using Python
Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, and of using that knowledge to organize information better for future reference. In short, it is the process of discovering knowledge in unstructured text.
Text mining currently covers seven main areas, including:
· Search and information retrieval (IR)
· Text clustering: using clustering methods to group words, fragments, paragraphs, or documents
· Text classification: grouping and labeling fragments, paragraphs, or documents, using data-mining classification methods trained on labeled instances
The text classification workflow in this article follows six steps:
1. Preprocessing
2. Chinese word segmentation
3. Build the word vector space: count term frequencies and generate the word vector space of each text
4. Weight strategy - TF-IDF: use TF-IDF to discover feature words and extract them as features that reflect the document topic
5. Classifier: train a classifier with an algorithm
6. Evaluate the classification results
1. Preprocessing
a. Select the range of text to be processed
b. Establish a classified text corpus:
· Training set corpus
· Test set corpus
c. Text format conversion: use Python's lxml library to strip markup such as HTML/XML tags
d. Detect sentence boundaries: mark the ends of sentences
2. Chinese word segmentation
Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to certain rules. Chinese word segmentation splits a sequence of Chinese characters (a sentence) into individual words. It is a complicated task and, to some extent, not a purely algorithmic problem; in practice it has largely been addressed with probabilistic methods, most notably the conditional random field (CRF), a probabilistic graphical model.
Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy strongly affects every module built on top of it. The structured representation of text or sentences is the core task in language processing; current structured representations of text fall into four categories: the word vector space model, the topic model, the dependency-syntax tree representation, and the RDF graph representation.
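As a quick illustration (not part of the original article; it only assumes the jieba package is installed), the following minimal snippet shows what segmentation produces for a single sentence:

import jieba

sentence = "小明硕士毕业于中国科学院计算所"
words = jieba.cut(sentence)    # returns a generator of segmented words
print(" / ".join(words))       # exact output depends on jieba's dictionary and version,
                               # e.g. 小明 / 硕士 / 毕业 / 于 / 中国科学院 / 计算所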
The following is sample code for Chinese word segmentation:
# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    fp = open(savepath, "w", encoding='gb2312', errors='ignore')
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# corpus_path = "train_small/"  # path of the unsegmented training corpus
# seg_path = "train_seg/"       # path of the segmented training corpus
corpus_path = "test_small/"  # path of the unsegmented test (prediction) corpus
seg_path = "test_seg/"       # path of the segmented test corpus

catelist = os.listdir(corpus_path)  # get all category subdirectories under the corpus directory
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # build the path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # build the path of the segmented output directory
    if not os.path.exists(seg_dir):         # create the output directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()            # read the file content
        content = content.replace("\r\n", "").strip()   # remove newlines and extra whitespace
        content_seg = jieba.cut(content)                 # segment the text with jieba
        savefile(seg_dir + file_path, " ".join(content_seg))  # save the segmented, space-separated text
print("Word segmentation finished")
After segmentation, the segmented texts are packaged into a Bunch object and persisted with pickle:

import os
import pickle
from sklearn.datasets.base import Bunch
# The Bunch class provides a key/value object structure:
#   target_name: list of all category names
#   label: classification label of each file
#   filenames: file paths
#   contents: segmented content (word-vector form) of each file

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # save the category names into the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                          # save the category label of the current file
        bunch.filenames.append(fullname)                   # save the path of the current file
        bunch.contents.append(readfile(fullname).strip())  # save the segmented file content

# persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print("Finished building the text object")

3. Vector space model
Because text represented in a vector space model has very high dimensionality, certain words are automatically filtered out before classification in order to save storage space and improve search efficiency. These words or phrases are called stop words; a ready-made stop-word list can be downloaded.

4. Weight strategy: the TF-IDF method
If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good category-discriminating power and is well suited for classification.
Before giving this part of the code, let's first look at the concepts of term frequency and inverse document frequency.
Term frequency (TF): the frequency with which a given term appears in a document. The count is normalized to prevent a bias towards long documents. For a term in a particular document, its importance can be expressed as a ratio whose numerator is the number of occurrences of the term in the document and whose denominator is the total number of occurrences of all terms in that document.
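The original article's formula image is not reproduced here; the standard TF formula matching this description is

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$ and the denominator is the total number of term occurrences in $d_j$.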
Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of the quotient. Here |D| is the total number of documents in the corpus and the denominator counts the documents containing the term; if the term does not occur in the corpus the denominator would be zero, so 1 is usually added to it.
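Again reconstructing the missing formula image, the standard IDF formula consistent with this description is

$$\mathrm{idf}_{i} = \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}$$

and the TF-IDF weight is the product of the two:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$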
TF-IDF is thus the product of term frequency and inverse document frequency: a high term frequency within a particular document combined with a low document frequency across the whole collection yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones. The code is as follows:
import os
import pickle  # used for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # computes TF-IDF weights
from sklearn.feature_extraction.text import TfidfVectorizer   # generates the TF-IDF vector space

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF word-vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()  # this class computes the TF-IDF weight of every term

# convert the texts into a term matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# persist the word bag
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
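To make the two key outputs concrete, here is a minimal, self-contained sketch (not from the original project; the documents are made-up, already-segmented strings) of what fit_transform and vocabulary_ return:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "体育 比赛 足球",   # hypothetical segmented documents
    "艺术 绘画 展览",
    "足球 联赛 新闻",
]
vectorizer = TfidfVectorizer()        # the article additionally passes stop_words, sublinear_tf, and max_df
tdm = vectorizer.fit_transform(docs)  # sparse document-term matrix, one row per document
print(tdm.shape)                      # (3, number_of_distinct_terms)
print(vectorizer.vocabulary_)         # dict mapping each term to its column index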
5. Use the Naive Bayes classification module
Commonly used text classification methods include the kNN nearest-neighbor algorithm, the Naive Bayes algorithm, and the support vector machine (SVM) algorithm. Generally speaking:
· kNN is the simplest in principle and its classification accuracy is acceptable, but it is the slowest;
· Naive Bayes works best for short-text classification and has high accuracy;
· the advantage of the support vector machine is that it can handle linearly inseparable cases, while its accuracy is middling.
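As a hedged sketch (not in the original article), the following shows how the three classifiers could be tried on the same TF-IDF features; the toy documents and labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = ["足球 比赛 进球", "篮球 比赛 得分", "油画 展览 艺术", "雕塑 艺术 展览"]
labels = ["sports", "sports", "art", "art"]

vectorizer = TfidfVectorizer()
tdm = vectorizer.fit_transform(docs)              # training features
x_new = vectorizer.transform(["足球 得分 比赛"])   # a new, already-segmented document

for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=1)),
                  ("Naive Bayes", MultinomialNB(alpha=0.001)),
                  ("Linear SVM", LinearSVC())]:
    clf.fit(tdm, labels)
    print(name, "predicts:", clf.predict(x_new))  # most likely "sports" for this toy input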
The code above all operates on the training set. The same steps are now applied to the test set (drawn from the training data): first word segmentation, then generating the word-vector file, up to building the vector space model. The difference is that when building the test-set model, the training-set word bag must be loaded, and the word vectors produced from the test set are mapped into the vocabulary of the training-set word bag to generate the vector space model. The code is as follows:
import os
import pickle  # used for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # computes TF-IDF weights
from sklearn.feature_extraction.text import TfidfVectorizer   # generates the TF-IDF vector space

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# load the segmented word-vector Bunch object of the test set
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF vector space of the test set
testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# load the word bag of the training set
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# initialize the vector space with TfidfVectorizer, reusing the training vocabulary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5,
                             vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# persist the word bag
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)
Next, the multinomial Naive Bayes algorithm is run to classify the test texts and report the error rate. The code is as follows:
import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes algorithm

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

# load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# apply the Naive Bayes algorithm
# alpha = 0.001: additive (Laplace) smoothing parameter; smaller values mean less smoothing
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the classification results
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# error rate
print("error_rate:", float(rate) * 100 / float(total), "%")
6. Evaluation of the classification results
Algorithm evaluation in machine learning rests on three basic metrics:
· Recall: the ratio of relevant documents retrieved to all relevant documents in the collection; it measures the completeness of a retrieval system.
Recall = number of relevant documents retrieved / total number of relevant documents in the collection
· Precision: the ratio of relevant documents retrieved to the total number of documents retrieved; it measures the accuracy of a retrieval system.
Precision = number of relevant documents retrieved / total number of documents retrieved
Precision and recall influence each other. Ideally both are high, but in general higher precision comes with lower recall, and higher recall comes with lower precision.
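For example (an illustrative figure, not from the original article): if a system retrieves 8 documents, 6 of which are relevant, while the collection contains 10 relevant documents in total, then precision = 6/8 = 0.75 and recall = 6/10 = 0.6.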
· F-Score: a single measure combining the two, computed as

$$F_\beta = \frac{(1+\beta^2)\, P \cdot R}{\beta^2 P + R}$$

where P is precision and R is recall; this formula expresses the relationship among the three. When β = 1 it reduces to the most common form, the F1-Measure:

$$F_1 = \frac{2PR}{P + R}$$
The evaluation code is as follows:
from sklearn import metrics

# evaluation
def metrics_result(actual, predict):
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict)))
    print("recall: {0:0.3f}".format(metrics.recall_score(actual, predict)))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict)))

metrics_result(test_set.label, predicted)
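One added caveat, not in the original article: newer scikit-learn versions require an explicit average argument for precision_score, recall_score, and f1_score when the labels have more than two classes. A minimal adjusted sketch:

from sklearn import metrics

def metrics_result(actual, predict):
    # weighted averaging makes the multi-class case explicit
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall: {0:.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))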