Detailed explanation of Python text feature extraction and vectorization algorithm learning examples-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

Detailed explanation of Python text feature extraction and vectorization algorithm learning examples

小云云

Dec 23, 2017 pm 05:05 PM

pythonextractQuantify

Suppose we have just watched Nolan's blockbuster "Interstellar", how can we let the machine automatically analyze whether the audience's evaluation of the movie is "positive" or "negative"? This type of problem is a sentiment analysis problem. The first step in dealing with this type of problem is to convert text into features. This article mainly introduces the Python text feature extraction and vectorization algorithm in detail. It has certain reference value. Interested friends can refer to it. I hope it can help everyone.

Therefore, in this chapter we only learn the first step, how to extract features from text and vectorize them.

Since the processing of Chinese involves word segmentation, this article uses a simple example to illustrate how to use Python's machine learning library to extract features from English.

1. Data preparation

Python's sklearn.datasets supports reading all classified texts from the directory. However, the directories must be placed according to the rules of one folder and one label name. For example, the data set used in this article has a total of 2 labels, one is "net" and the other is "pos", and there are 6 text files under each directory. The directory is as follows:

neg
1.txt
2.txt
......
pos
1.txt
2 .txt
....

The contents of the 12 files are summarized as follows:

neg: 
  shit. 
  waste my money. 
  waste of money. 
  sb movie. 
  waste of time. 
  a shit movie. 
pos: 
  nb! nb movie! 
  nb! 
  worth my money. 
  I love this movie! 
  a nb movie. 
  worth it!

2. Text features

How to extract emotional attitudes from these English words and classify them?

The most intuitive way is to extract words. It is generally believed that many keywords can reflect the speaker's attitude. For example, in the simple data set above, it is easy to find that anything that says "shit" must belong to the neg category.

Of course, the above data set is simply designed for convenience of description. In reality, a word often has ambiguous attitudes. But there is still reason to believe that the more a word appears in the neg category, the greater the probability that it expresses the neg attitude.

We also noticed that some words are meaningless for sentiment classification. For example, words such as "of" and "I" in the above data. This type of word has a name, called "
Stop_Word" (stop word). Such words can be completely ignored and not counted. Obviously by ignoring these words, the storage space of word frequency records can be optimized and the construction speed is faster. There is also a problem in using the word frequency of each word as an important feature. For example, "movie" in the above data appears 5 times in 12 samples, but the number of positive and negative occurrences is almost the same, and there is no distinction. And "worth" appears twice, but only in the pos category. It obviously has a strong strong color, that is, the distinction is very high.

Therefore, we need to introduce

TF-IDF (Term Frequency-Inverse Document Frequency, Term frequency and reverse document frequency) to further consider each word .

TF (Word Frequency) is calculated very simply, that is, for a document t, the frequency of a certain word Nt appearing in the document. For example, in the document "I love this movie", the TF of the word "love" is 1/4. If you remove the stop words "I" and "it", it is 1/2.

IDF (Inverse Document Frequency) means that for a certain word t, the number of documents Dt in which the word appears accounts for the proportion of all test documents D. Then find the natural logarithm. For example, the word "movie" appears 5 times in total, and the total number of documents is 12, so the IDF is ln(5/12).
Obviously, IDF is to highlight the words that appear rarely but have strong emotional color. For example, the IDF of a word like "movie" is ln(12/5)=0.88, which is much smaller than the IDF of "love"=ln(12/1)=2.48.

TF-IDF is simply multiplying the two together. In this way, finding the TF-IDF of each word in each document is the text feature value we extracted.

3. Vectorization

With the above foundation, the document can be vectorized. Let’s look at the code first, and then analyze the meaning of vectorization:

# -*- coding: utf-8 -*- 
import scipy as sp 
import numpy as np 
from sklearn.datasets import load_files 
from sklearn.cross_validation import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
 
&#39;&#39;&#39;&#39;&#39;加载数据集，切分数据集80%训练，20%测试&#39;&#39;&#39; 
movie_reviews = load_files(&#39;endata&#39;)  
doc_terms_train, doc_terms_test, y_train, y_test\ 
  = train_test_split(movie_reviews.data, movie_reviews.target, test_size = 0.3) 
   
&#39;&#39;&#39;&#39;&#39;BOOL型特征下的向量空间模型，注意，测试样本调用的是transform接口&#39;&#39;&#39; 
count_vec = TfidfVectorizer(binary = False, decode_error = &#39;ignore&#39;,\ 
              stop_words = &#39;english&#39;) 
x_train = count_vec.fit_transform(doc_terms_train) 
x_test = count_vec.transform(doc_terms_test) 
x    = count_vec.transform(movie_reviews.data) 
y    = movie_reviews.target 
print(doc_terms_train) 
print(count_vec.get_feature_names()) 
print(x_train.toarray()) 
print(movie_reviews.target)

运行结果如下：
[b'waste of time.', b'a shit movie.', b'a nb movie.', b'I love this movie!', b'shit.', b'worth my money.', b'sb movie.', b'worth it!']
['love', 'money', 'movie', 'nb', 'sb', 'shit', 'time', 'waste', 'worth']
[[ 0.          0.          0.          0.          0.          0.   0.70710678 0.70710678 0.        ]
[ 0.          0.          0.60335753 0.          0.          0.79747081   0.          0.          0.        ]
[ 0.          0.          0.53550237 0.84453372 0.          0.          0.   0.          0.        ]
[ 0.84453372 0.          0.53550237 0.          0.          0.          0.   0.          0.        ]
[ 0.          0.          0.          0.          0.          1.          0.   0.          0.        ]
[ 0.          0.76642984 0.          0.          0.          0.          0.   0.          0.64232803]
[ 0.          0.          0.53550237 0.          0.84453372 0.          0.   0.          0.        ]
[ 0.          0.          0.          0.          0.          0.          0.   0.          1.        ]]
[1 1 0 1 0 1 0 1 1 0 0 0]

python输出的比较混乱。我这里做了一个表格如下：

从上表可以发现如下几点：

1、停用词的过滤。

初始化count_vec的时候，我们在count_vec构造时传递了stop_words = 'english'，表示使用默认的英文停用词。可以使用count_vec.get_stop_words()查看TfidfVectorizer内置的所有停用词。当然，在这里可以传递你自己的停用词list（比如这里的“movie”）

2、TF-IDF的计算。

这里词频的计算使用的是sklearn的TfidfVectorizer。这个类继承于CountVectorizer，在后者基本的词频统计基础上增加了如TF-IDF之类的功能。
我们会发现这里计算的结果跟我们之前计算不太一样。因为这里count_vec构造时默认传递了max_df=1，因此TF-IDF都做了规格化处理，以便将所有值约束在[0,1]之间。

3. The result of count_vec.fit_transform is a huge matrix. We can see that there are a lot of 0's in the above table, so sklearn uses a sparse matrix for its internal implementation. The data in this example is small. If readers are interested, you can try real data used by machine learning researchers, from Cornell University: http://www.cs.cornell.edu/people/pabo/movie-review-data/. This website provides many data sets, including several databases of about 2M, with about 700 positive and negative examples. The scale of this kind of data is not large and can still be completed within 1 minute. I suggest you give it a try. However, be aware that these data sets may have illegal character issues. So when constructing count_vec, decode_error = 'ignore' is passed in to ignore these illegal characters.

The results in the above table are the results of training 8 features of 8 samples. This result can be classified using various classification algorithms.

Related recommendations:

Share Python text generation QR code example

Detailed explanation of edit distance for Python text similarity calculation

Example detailed explanation of Python implementation of simple web page image grabbing

The above is the detailed content of Detailed explanation of Python text feature extraction and vectorization algorithm learning examples. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

How do you create multi-dimensional arrays using NumPy?Apr 29, 2025 am 12:27 AM

Create multi-dimensional arrays with NumPy can be achieved through the following steps: 1) Use the numpy.array() function to create an array, such as np.array([[1,2,3],[4,5,6]]) to create a 2D array; 2) Use np.zeros(), np.ones(), np.random.random() and other functions to create an array filled with specific values; 3) Understand the shape and size properties of the array to ensure that the length of the sub-array is consistent and avoid errors; 4) Use the np.reshape() function to change the shape of the array; 5) Pay attention to memory usage to ensure that the code is clear and efficient.

Explain the concept of 'broadcasting' in NumPy arrays.Apr 29, 2025 am 12:23 AM

BroadcastinginNumPyisamethodtoperformoperationsonarraysofdifferentshapesbyautomaticallyaligningthem.Itsimplifiescode,enhancesreadability,andboostsperformance.Here'showitworks:1)Smallerarraysarepaddedwithonestomatchdimensions.2)Compatibledimensionsare

Explain how to choose between lists, array.array, and NumPy arrays for data storage.Apr 29, 2025 am 12:20 AM

ForPythondatastorage,chooselistsforflexibilitywithmixeddatatypes,array.arrayformemory-efficienthomogeneousnumericaldata,andNumPyarraysforadvancednumericalcomputing.Listsareversatilebutlessefficientforlargenumericaldatasets;array.arrayoffersamiddlegro

Give an example of a scenario where using a Python list would be more appropriate than using an array.Apr 29, 2025 am 12:17 AM

Pythonlistsarebetterthanarraysformanagingdiversedatatypes.1)Listscanholdelementsofdifferenttypes,2)theyaredynamic,allowingeasyadditionsandremovals,3)theyofferintuitiveoperationslikeslicing,but4)theyarelessmemory-efficientandslowerforlargedatasets.

How do you access elements in a Python array?Apr 29, 2025 am 12:11 AM

ToaccesselementsinaPythonarray,useindexing:my_array[2]accessesthethirdelement,returning3.Pythonuseszero-basedindexing.1)Usepositiveandnegativeindexing:my_list[0]forthefirstelement,my_list[-1]forthelast.2)Useslicingforarange:my_list[1:5]extractselemen

Is Tuple Comprehension possible in Python? If yes, how and if not why?Apr 28, 2025 pm 04:34 PM

Article discusses impossibility of tuple comprehension in Python due to syntax ambiguity. Alternatives like using tuple() with generator expressions are suggested for creating tuples efficiently.(159 characters)

What are Modules and Packages in Python?Apr 28, 2025 pm 04:33 PM

The article explains modules and packages in Python, their differences, and usage. Modules are single files, while packages are directories with an __init__.py file, organizing related modules hierarchically.

What is docstring in Python?Apr 28, 2025 pm 04:30 PM

Article discusses docstrings in Python, their usage, and benefits. Main issue: importance of docstrings for code documentation and accessibility.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks agoByDDD

How to fix KB5055523 fails to install in Windows 11?

2 weeks agoByDDD

InZoi: How To Apply To School And University

3 weeks agoByDDD

How to fix KB5055518 fails to install in Windows 10?

2 weeks agoByDDD

Roblox: Dead Rails – How To Summon And Defeat Nikola Tesla

4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

Atom editor mac version download

The most popular open source editor

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

SublimeText3 Linux new version

SublimeText3 Linux latest version

Hot Topics

Where is the login entrance for gmail email?

7817

1646

1402

1300

1238