
How Python sklearn performs feature extraction on text data

WBOY
2023-05-17 10:55:41

Text feature extraction

Purpose: convert text data into numerical features.

Text can be split into sentences, phrases, words, or letters; words are generally used as the feature values.

Method 1: CountVectorizer

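sklearn.feature_extraction.text.CountVectorizer(stop_words=None, ...)

stop_words: an optional list of words to exclude from the vocabulary (default None)

Returns a word-count matrix (the number of times each feature word appears in each sample)

CountVectorizer.fit_transform(X)

X: an iterable of text strings

Return value: a sparse matrix (call .toarray() to get a dense array)

CountVectorizer.inverse_transform(X)

X: an array or sparse matrix

Return value: for each sample, the terms with non-zero entries (word order and repeat counts are not restored)

CountVectorizer.get_feature_names_out()

Return value: the list of feature words (this replaces get_feature_names(), which was deprecated in scikit-learn 1.0 and removed in 1.2)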

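Code example:

from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    # Text feature extraction with word counts
    data = ["life is short, i like like python", "life is too long,i dislike python"]
    # 1. Instantiate the transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform(); the result is a sparse matrix
    result = transfer.fit_transform(data)
    print("result:\n", result.toarray())
    # get_feature_names() was removed in scikit-learn 1.2
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

if __name__ == "__main__":
    count_demo()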

Method 2: TfidfVectorizer

Keywords: words that occur frequently in articles of one category but rarely in articles of other categories.

TF-IDF text feature extraction

① The main idea of TF-IDF: if a word or phrase appears frequently in one article and rarely in other articles, it is considered to distinguish categories well and to be suitable for classification.

② What TF-IDF is for: evaluating how important a word is to one document within a document set or corpus.

Formula

① Term frequency (tf) is how often a given word appears in a document.

② Inverse document frequency (idf) measures how informative a word is across the whole corpus. To calculate the idf of a term, divide the total number of documents by the number of documents containing the term, then take the base-10 logarithm of the quotient. (This is the textbook definition; scikit-learn's TfidfVectorizer uses a smoothed natural-logarithm variant by default and normalizes each row, so its numbers differ.)

tfidf = tf * idf

The resulting value can be read as the importance of the word to that document: the higher the value, the more important the word.
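
The formula can be worked through by hand. The sketch below uses the textbook definition (tf as a relative count, idf with a base-10 logarithm); the function names and the toy corpus are made up for illustration, and the code assumes the term occurs in at least one document.

```python
import math

def term_frequency(term, doc_tokens):
    # Share of the document's tokens that are `term`
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    # log10(total documents / documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Toy corpus of 10 tokenized documents (hypothetical data):
# "python" occurs only in the first one, "is" occurs in all of them
corpus = [["python", "is", "python", "fun"]] + [["cooking", "is", "fun"] for _ in range(9)]
first_doc = corpus[0]

python_score = tf_idf("python", first_doc, corpus)  # tf = 2/4, idf = log10(10/1)
is_score = tf_idf("is", first_doc, corpus)          # tf = 1/4, idf = log10(10/10)

print(python_score, is_score)
```

A word that appears in every document gets an idf of 0 under this definition, so it contributes nothing however often it occurs.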

API

sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None, ...)

Returns the TF-IDF weight matrix of the words

TfidfVectorizer.fit_transform(X)

X: an iterable of text strings

Return value: a sparse matrix (call .toarray() to get a dense array)

TfidfVectorizer.inverse_transform(X)

X: an array or sparse matrix

Return value: for each sample, the terms with non-zero entries

TfidfVectorizer.get_feature_names_out()

Return value: the list of feature words (this replaces get_feature_names(), which was removed in scikit-learn 1.2)
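
The weights returned by TfidfVectorizer do not follow the base-10 formula literally. With its default settings (smooth_idf=True, norm="l2"), scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1 and then scales each row to unit length. The sketch below rebuilds the weights from raw counts under that assumption and compares them with the library output; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = ["life is short, i like like python",
        "life is too long,i dislike python"]

# Library result with default settings
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(data).toarray()

# Manual reconstruction from raw counts
counts = CountVectorizer().fit_transform(data).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf, natural log
manual = counts * idf                            # tf * idf with raw counts as tf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize rows

matches = np.allclose(weights, manual)
print(matches)
```

Because of the "+ 1" term, words that occur in every document keep a small non-zero weight instead of dropping to 0.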

Chinese word segmentation feature extraction

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation:
    # jieba.cut(text) returns a generator, so convert it to a list
    word = list(jieba.cut(text))
    # Join the words into one space-separated string
    words = " ".join(word)
    return words

def tfidf_demo():
    data = ["今天很残酷,明天更残酷,后天会很美好,但绝大多数人都死在明天晚上,却见不到后天的太阳,所以我们干什么都要坚持",
            "注重自己的名声,努力工作、与人为善、遵守诺言,这样对你们的事业非常有帮助",
            "服务是全世界最贵的产品,所以最佳的服务就是不要服务,最好的服务就是不需要服务"]
    data_new = []
    # Segment each Chinese sentence into words
    for sentence in data:
        data_new.append(cut_word(sentence))
    # 1. Instantiate the transformer
    transfer = TfidfVectorizer()
    # 2. Call fit_transform()
    result = transfer.fit_transform(data_new)  # TF-IDF weight matrix (sparse)
    print("result:\n", result.toarray())  # convert the sparse matrix to a 2-D array
    # get_feature_names() was removed in scikit-learn 1.2
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

if __name__ == "__main__":
    tfidf_demo()
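
One detail matters for segmented Chinese text: the vectorizer's default token_pattern only keeps tokens of two or more characters, so single-character words are silently dropped. The sketch below shows the effect on two hand-segmented sentences (hypothetical data standing in for the output of cut_word(), so jieba is not needed) and one possible pattern that keeps single characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Space-separated words, as cut_word() would produce (hand-written sample data)
segmented = ["我 爱 北京 天安门", "我 喜欢 学习 Python"]

# Default pattern r"(?u)\b\w\w+\b": tokens need at least two characters
default_vec = TfidfVectorizer()
default_vec.fit(segmented)
default_vocab = sorted(default_vec.vocabulary_)

# A pattern that also accepts single-character tokens
keep_single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_single.fit(segmented)
full_vocab = sorted(keep_single.vocabulary_)

# Words that only survive with the relaxed pattern
only_in_full = set(full_vocab) - set(default_vocab)

print(default_vocab)
print(full_vocab)
```

Whether single-character words should be kept depends on the task; many of them are function words that a stop-word list would remove anyway.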


Statement:
This article is reproduced from yisu.com. If there is any infringement, please contact admin@php.cn to have it deleted.