Python crawler analyzes 'Wolf Warrior' movie review
Jan 05, 2024, 09:44 PM

Introduction

As of August 20, the 25th day after the release of "Wolf Warrior II", its box office had exceeded 5 billion yuan, making it the only Asian film to enter the top 100 box office list in world film history. This article uses a Python crawler to collect Douban movie reviews, analyzes them, and builds word clouds from them. Now, let's take a look at what interesting subtexts are hidden in the reviews of "Wolf Warrior II".


Box office aside, the movie also stirred up strong emotions in the audience. Some people even said harshly that anyone who dares to criticize "Wolf Warrior II" is either an idiot or a public enemy. It's as simple and crude as that.
Opinions on "Wolf Warrior II" are mixed, and viewers have left comments on Douban to express their views. With so many comments published and so much media coverage, it is still hard for the audience to tell which opinion is more reliable.

So far, there have been more than 150,000 comments. When you read them, you may find that for a stretch most of the comments are either all praise or all criticism, so it is hard to judge the overall opinion of the movie just by browsing. Now let's use data analysis to see what interesting things are in these comments!


Data acquisition

The data in this article was collected with a Python crawler built mainly on the requests package and the regular expression module re. The program does not handle verification codes. I had crawled Douban pages before, but the amount of content was small and I never ran into a verification code, so I assumed this crawler would not need to deal with one either. After about 15,000 comments had been crawled, however, a verification code popped up.
At first I thought: it is only 120,000 comments, so at worst I would have to enter the verification code a dozen or so times by hand, and there was no need to handle it in code. What happened next was puzzling. After entering the code at roughly 15,000 comments, I expected the crawler to reach about 30,000 before the next prompt, but it stopped after only about 3,000 more and I had to enter the code again.

It went on like this in fits and starts: sometimes the crawler ran for a long time before a verification code was required, sometimes not. In the end, though, the comments were all crawled. The fields collected are: user name, whether the user has seen the movie, the star rating of the comment, the time of the comment, the number of people who found it useful, and the comment text. The following is the code of the Python crawler:
    import requests
    import re
    import pandas as pd

    url_first = 'https://movie.douban.com/subject/26363254/comments?start=0'
    head = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/59.0.3071.109 Chrome/59.0.3071.109 Safari/537.36'}
    cookies = {'cookie': 'your own cookie'}  # That is, the cookie corresponding to your account
    # cookies must be defined before the first request uses it
    html = requests.get(url_first, headers=head, cookies=cookies)

    reg = re.compile(r'')  # Next page (the original pattern is missing here)
    ren = re.compile(r'<span>(.*?)</span>.*?comment">(.*?).*?.*?<span .>(.*?).*?<span> (.*?)</span>.*?title="(.*?)"></span>.*?title="(.*?)">.*?class=""> (.*?) \n', re.S)  # Comments and other content

    while html.status_code == 200:
        # Build the next page URL from the link found on the current page
        url_next = 'https://movie.douban.com/subject/26363254/comments' + re.findall(reg, html.text)[0]
        zhanlang = re.findall(ren, html.text)
        data = pd.DataFrame(zhanlang)
        # Write this page's rows to the CSV file; 'a' is append mode
        data.to_csv('/home/wajuejiprince/document/zhanlang/zhanlangpinglun.csv', header=False, index=False, mode='a')
        data = []
        zhanlang = []
        html = requests.get(url_next, cookies=cookies, headers=head)
Before running the code above, set your own User-Agent, cookie, and CSV output path. The crawled content is saved to a CSV file.
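The paging loop described above can also be sketched using only the standard library. This is a minimal sketch, not the author's crawler: the `NEXT_LINK` pattern, the `parse_rows` callback, and the function names are all assumptions made for illustration, and the real Douban markup may differ. The fetch function is passed in as a parameter so the loop can be tried without touching the network.

```python
import csv
import html as html_lib
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

BASE_URL = "https://movie.douban.com/subject/26363254/comments"

# Hypothetical pattern: assumes the next-page link looks like
# <a href="?start=20&amp;limit=20" class="next">. Check it against the real page.
NEXT_LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*class="next"')


def http_get(url, headers):
    """Fetch one page; returns (status_code, text)."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        return error.code, ""


def next_page_url(page):
    """Return the absolute URL of the next page, or None on the last page."""
    match = NEXT_LINK.search(page)
    if match is None:
        return None
    return urljoin(BASE_URL, html_lib.unescape(match.group(1)))


def crawl(parse_rows, csv_path, headers, fetch=http_get, max_pages=None):
    """Follow next-page links, appending each page's rows to csv_path."""
    url = BASE_URL + "?start=0"
    pages = 0
    while url is not None and (max_pages is None or pages < max_pages):
        status, page = fetch(url, headers)
        if status != 200:
            break  # e.g. a verification-code page: stop and resume by hand
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(parse_rows(page))
        pages += 1
        url = next_page_url(page)
    return pages
```

Stopping on any status other than 200 mirrors the `while html.status_code==200` condition in the original code. Passing `max_pages` is a convenient way to test on a few pages before a long run.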

Data Cleaning

This article uses the R language to process the data. Although we paid close attention to the structure of the content while crawling, some values inevitably end up where they do not belong. For example, comment text sometimes appears in the commenter field, so the data still needs to be cleaned.

First load all the packages to be used:
    library(data.table)
    library(plotly)
    library(stringr)
    library(jiebaR)
    library(wordcloud2)
    library(magrittr)
Import data and clean it:
dt
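As an illustration of the kind of cleaning described here, the following Python sketch drops rows whose fields have shifted. It is not the author's R code. The column order follows the list of crawled fields given earlier, while the column names, the English rating labels, and the two validity rules are assumptions made for this example.

```python
import csv

# Assumed column order, following the fields listed for the crawler.
COLUMNS = ["user", "seen", "stars", "time", "useful", "comment"]

# Assumed English labels for the five rating levels.
VALID_STARS = {"highly recommended", "recommended", "okay", "poor", "very poor"}


def clean_rows(rows):
    """Keep only rows whose fields look like they are in the right columns."""
    cleaned = []
    for row in rows:
        if len(row) != len(COLUMNS):
            continue  # wrong number of fields: the regex matched badly
        record = dict(zip(COLUMNS, (cell.strip() for cell in row)))
        if record["stars"] not in VALID_STARS:
            continue  # e.g. comment text landed in the rating column
        if not record["useful"].isdigit():
            continue  # the "useful" count must be a non-negative integer
        record["useful"] = int(record["useful"])
        cleaned.append(record)
    return cleaned


def load_clean(csv_path):
    """Read the crawler's header-less CSV file and return cleaned records."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return clean_rows(csv.reader(handle))
```

Rows are dropped rather than repaired because a shifted row gives no reliable way to tell which field is which.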

Data analysis

Let’s first take a look at the situation of comments based on the number of stars:
    plot_ly(my_dt[, .(.N), by = .(`five-star number`)], type = 'bar', x = ~`five-star number`, y = ~N)
[Figure: bar chart of the number of comments for each star rating]

The star rating has 5 levels: 5 stars means highly recommended, 4 stars means recommended, 3 stars means okay, 2 stars means poor, and 1 star means very poor.
Judging from the star ratings, we have good reason to believe that the vast majority of viewers were satisfied with this film.
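The same per-rating count can be sketched in Python. The function below is an illustration rather than the article's R code, and the English rating labels are assumed names for the five levels. It returns each level's count together with its share of all rated comments, which is what the bar chart compares.

```python
from collections import Counter

# Assumed English labels for the five levels, from 5 stars down to 1 star.
RATING_ORDER = ["highly recommended", "recommended", "okay", "poor", "very poor"]


def rating_distribution(ratings):
    """Return (label, count, share) for each level, in star order."""
    counts = Counter(ratings)
    total = sum(counts[label] for label in RATING_ORDER)
    return [
        (label, counts[label], counts[label] / total if total else 0.0)
        for label in RATING_ORDER
    ]


if __name__ == "__main__":
    sample = ["recommended", "highly recommended", "highly recommended", "poor"]
    for label, count, share in rating_distribution(sample):
        print(f"{label:20s} {count:3d} {share:.0%}")
```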

First we should segment the comments:
    wk

Overall comment cloud display:

    words … data.table()
    setnames(words, "N", "pinshu")
    words[pinshu > 1000]  # Remove lower frequency words (1,000 or fewer)
    wordcloud2(words[pinshu > 1000], size = 2, fontFamily = "Microsoft Yahei", color = "random-light", backgroundColor = "grey")
There was so much data that my worn-out computer froze, so I kept only words that appear more than 1,000 times when drawing the word cloud. The result is as follows:
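The counting and filtering step can be sketched in Python as follows. This is an illustration, not the article's R code: the function name and the `segment` parameter are assumptions. For Chinese text a real segmenter is needed, for example `jieba.lcut` from the jieba package, which plays the role that jiebaR plays in R. The default used here simply splits on whitespace so that the example runs without extra packages.

```python
from collections import Counter


def word_frequencies(comments, segment=str.split, threshold=1000):
    """Count words over all comments; keep those with count > threshold."""
    counts = Counter()
    for comment in comments:
        # Skip empty tokens that some segmenters emit for whitespace
        counts.update(token for token in segment(comment) if token.strip())
    # most_common() sorts from the most frequent word to the least frequent
    return [(word, n) for word, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    demo = ["action plot action", "action patriotism plot"]
    print(word_frequencies(demo, threshold=1))
```

The strict comparison matches `words[pinshu > 1000]` in the R code: a word that appears exactly 1,000 times is dropped.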
[Figure: word cloud of all comments]

Overall, everyone’s comments on this movie are pretty good! Topics such as plot, action, and patriotism are the focus of discussion.

Keywords in the reviews: Wu Jing, individual heroism, main theme, China, protagonist's halo, Secretary Dakang, and "ran" (燃, meaning fired up).

It can be seen that "ran" was not the most common reaction after watching the film. The audience was more interested in praising Wu Jing himself and in commenting on patriotism and individualism.

Word clouds for different rating levels

But what do the comments look like when reviewers with different ratings are shown separately? That is, we create a word cloud from the review content of each of the five levels (highly recommended, recommended, okay, poor, very poor). The same code is used for every level; only the level name "highly recommended" is changed to each of the others.
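Splitting the comments by rating before counting can be sketched in Python like this. It is an illustration under stated assumptions, not the article's R code: each record is assumed to be a `(rating, comment)` pair, and `segment` stands in for a real word segmenter such as `jieba.lcut`.

```python
from collections import Counter, defaultdict


def frequencies_by_rating(records, segment=str.split, top_n=50):
    """Build one word-frequency table per rating level.

    records is an iterable of (rating, comment) pairs. The result maps each
    rating to its top_n most frequent (word, count) pairs.
    """
    per_rating = defaultdict(Counter)
    for rating, comment in records:
        per_rating[rating].update(segment(comment))
    return {
        rating: counts.most_common(top_n)
        for rating, counts in per_rating.items()
    }


if __name__ == "__main__":
    demo = [
        ("poor", "slow slow plot"),
        ("highly recommended", "great action"),
        ("poor", "plot"),
    ]
    for rating, table in frequencies_by_rating(demo).items():
        print(rating, table)
```

Each per-level table can then be handed to whatever word-cloud tool is in use, one level at a time.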

1. Word cloud of "highly recommended" reviews

[Figure: word cloud]

2. Word cloud of "recommended" reviews

[Figure: word cloud]

3. Word cloud of "okay" reviews

[Figure: word cloud]

4. Word cloud of "poor" reviews

[Figure: word cloud]

5. Word cloud of "very poor" reviews

[Figure: word cloud]

Conclusion

Judging from the word segmentation results of different comments, they all have a common topic: patriotism.

In absolute terms, the "highly recommended" comments may contain more mentions of patriotic topics than the poorly rated comments do, but in the "highly recommended" comments people are also more willing to discuss other things. Most of the negative comments, by contrast, are about patriotic topics. The proportions are what is interesting: from the "highly recommended" comments down to the "very poor" ones, the share of patriotic topics gradually increases.
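The difference between the number of mentions and their proportion can be made concrete with a short Python sketch. It is an illustration only: the `(rating, comment)` record shape, the keyword list, and the made-up sample comments are assumptions, and matching by substring is a crude stand-in for counting segmented words.

```python
def topic_share_by_rating(records, keywords):
    """For each rating, return (mentions, total, share) for a topic.

    A comment counts as a mention if it contains any of the keywords.
    """
    totals = {}
    mentions = {}
    for rating, comment in records:
        totals[rating] = totals.get(rating, 0) + 1
        if any(keyword in comment for keyword in keywords):
            mentions[rating] = mentions.get(rating, 0) + 1
    return {
        rating: (mentions.get(rating, 0), total, mentions.get(rating, 0) / total)
        for rating, total in totals.items()
    }


if __name__ == "__main__":
    # Made-up sample: both levels have one mention, but different shares.
    demo = [
        ("highly recommended", "great action"),
        ("highly recommended", "patriotism done well"),
        ("highly recommended", "Wu Jing is impressive"),
        ("highly recommended", "the tank scene"),
        ("very poor", "too much patriotism"),
        ("very poor", "weak plot"),
    ]
    for rating, row in topic_share_by_rating(demo, ["patriotism"]).items():
        print(rating, row)
```

In this made-up sample both levels have one mention, yet the share is 25% for "highly recommended" and 50% for "very poor". The mention count and the proportion answer different questions.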

We cannot decide subjectively who is right and who is wrong. We can only say that people stand in different positions, so what they see is also different. When we disagree with others, it is often because we are looking from different perspectives. People who left negative comments may simply have been thinking more about patriotic topics (this is only about how much the topic is discussed, not about who loves or dislikes the country)!

To sum up the analysis, the fundamental reason why "Wolf Warrior 2" has been supported by so many people is that its production achieved American blockbuster-level scenes that "Wolf Warrior 1" did not have, while at the same time embodying patriotism that resonated with the audience and stirred people's hearts.


Statement
This article is reproduced from Linux就该这么学. If there is any infringement, please contact admin@php.cn to have it removed.