In day-to-day data mining work, Python is used not only for classification and prediction tasks but sometimes also for tasks related to recommendation systems.
Recommendation systems are used in many fields. Common examples include playlist generators for video and music services, product recommenders for online stores, and content recommenders for social media platforms. In this project, we create a movie recommender.
Collaborative filtering automatically predicts (filters) a user's interests by collecting preference or taste information from many users. Recommender systems have been under development for a long time, and their models are based on various techniques such as weighted averages, correlation, machine learning, and deep learning.
The MovieLens 20M dataset contains over 20 million movie ratings and tagging events recorded since 1995. In this article, we read data from the movies.csv and ratings.csv files and use the Python libraries Pandas, Seaborn, Scikit-learn, and SciPy to train a k-nearest neighbors model with cosine similarity.
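To make the similarity measure concrete, here is a minimal sketch of cosine similarity between the rating vectors of two movies. The three-user vectors and the helper name `cosine_similarity` are made up for illustration; they are not taken from the dataset or from the article's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy ratings from three users for three movies (0 means "not rated").
movie_a = [5.0, 4.0, 0.0]
movie_b = [4.0, 5.0, 0.0]
movie_c = [0.0, 0.0, 5.0]

sim_ab = cosine_similarity(movie_a, movie_b)  # rated alike by the same users
sim_ac = cosine_similarity(movie_a, movie_c)  # no users in common
print(sim_ab, sim_ac)
```

Movies rated alike by the same users score close to 1, and movies with no raters in common score 0. A nearest-neighbor model with the cosine metric works with the cosine distance, which is 1 minus this similarity.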
The following are the core steps of the project:
- Import and merge datasets and create Pandas DataFrame
- Add necessary features to analyze the data
- Use Seaborn to visualize and analyze data
- Filter invalid data by setting thresholds
- Create a pivot table with movies as the index and users as the columns
- Create a kNN model and output 5 recommendations similar to each movie
Import data
Import and merge datasets and create Pandas DataFrame
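The MovieLens 20M dataset contains over 20 million movie ratings and tagging activities collected since 1995. (The shapes printed later in this section, 9,742 movies and 100,836 ratings, suggest that the run shown here used a smaller MovieLens release.)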
# usecols selects the columns to load; dtype sets the type of each column
import pandas as pd
movies_df = pd.read_csv('movies.csv', usecols=['movieId', 'title'], dtype={'movieId': 'int32', 'title': 'str'})
movies_df.head()
ratings_df = pd.read_csv('ratings.csv', usecols=['userId', 'movieId', 'rating', 'timestamp'], dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})
ratings_df.head()
Check both DataFrames for null values and for the number of entries.
# Check for missing values
movies_df.isnull().sum()
movieId 0
title 0
dtype: int64
ratings_df.isnull().sum()
userId 0
movieId 0
rating 0
timestamp 0
dtype: int64
print("Movies:", movies_df.shape)
print("Ratings:", ratings_df.shape)
Movies: (9742, 2)
Ratings: (100836, 4)
Merge the two DataFrames on the 'movieId' column.
# movies_df.info()
# ratings_df.info()
movies_merged_df = movies_df.merge(ratings_df, on='movieId')
movies_merged_df.head()
The imported datasets have now been merged successfully.
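A quick sanity check after a merge is to see whether any rows were left out. The sketch below uses two tiny made-up frames (not the MovieLens files) and the `indicator=True` option of pandas' `merge` to find movies that have no ratings; a default inner merge leaves such movies out without any warning.

```python
import pandas as pd

# Tiny stand-ins for movies_df and ratings_df (made-up values).
toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A (1995)', 'B (1996)', 'C (1997)']})
toy_ratings = pd.DataFrame({'userId': [10, 11, 10],
                            'movieId': [1, 1, 2],
                            'rating': [4.0, 5.0, 3.0]})

# A left merge keeps every movie; the _merge column shows where each row came from.
checked = toy_movies.merge(toy_ratings, on='movieId', how='left', indicator=True)
unrated = checked.loc[checked['_merge'] == 'left_only', 'title'].tolist()
print(unrated)

# The inner merge (the default) only keeps movies that have at least one rating.
inner = toy_movies.merge(toy_ratings, on='movieId')
print(inner.shape)
```

For a recommender built purely on ratings, dropping unrated movies is usually what is wanted, but it is worth knowing how many were dropped.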
Add Derived Features
Add necessary features to analyze the data.
Create 'Average Rating' & 'Rating Count' columns by grouping user ratings by movie title.
movies_average_rating = (
    movies_merged_df.groupby('title')['rating']
    .mean()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={'rating': 'Average Rating'})
)
movies_average_rating.head()
movies_rating_count = (
    movies_merged_df.groupby('title')['rating']
    .count()
    .sort_values(ascending=True)  # use ascending=False to list the most-rated movies first
    .reset_index()
    .rename(columns={'rating': 'Rating Count'})
)
movies_rating_count_avg = movies_rating_count.merge(movies_average_rating, on='title')
movies_rating_count_avg.head()
Two new derived features have now been created.
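The two derived features can also be computed in a single pass with pandas' named aggregation. This is an alternative sketch on a small made-up frame, not the article's code; only the column names mirror the ones used above.

```python
import pandas as pd

# Made-up ratings for three movies.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4.0, 5.0, 3.0, 5.0, 2.0, 4.0],
})

# Named aggregation builds both columns in one groupby.
stats = (
    toy.groupby('title')
       .agg(**{'Rating Count': ('rating', 'count'),
               'Average Rating': ('rating', 'mean')})
       .reset_index()
)
print(stats)
```

In this toy data, movie B gets a perfect 5.0 average from a single rating, which is why the rating count matters when reading the average.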
Data Visualization
Using Seaborn to visualize data:
- The analysis shows that many movies have a perfect 5-star average rating in a dataset of roughly 100,000 user ratings. This points to outliers, which we need to confirm through visualization.
- Many movies have only a handful of ratings. Setting a rating-count threshold is recommended in order to generate useful recommendations.
Use seaborn and matplotlib to visualize the data so that it is easier to observe and analyze.
Plot a histogram of each newly created feature to view its distribution. The number of bins is set to 80; a suitable value depends on the data and should be chosen after inspecting it.
# Import the visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale=1)
plt.rcParams["axes.grid"] = False
plt.style.use('dark_background')
# Jupyter magic: only valid inside a notebook
%matplotlib inline
# Plot the histograms
plt.figure(figsize=(12, 4))
plt.hist(movies_rating_count_avg['Rating Count'], bins=80, color='tab:purple')
plt.ylabel('Ratings Count(Scaled)', fontsize=16)
plt.savefig('ratingcounthist.jpg')
plt.figure(figsize=(12, 4))
plt.hist(movies_rating_count_avg['Average Rating'], bins=80, color='tab:purple')
plt.ylabel('Average Rating', fontsize=16)
plt.savefig('avgratinghist.jpg')
Figure 1 Rating Count Histogram
Figure 2 Average Rating Histogram
Now create a jointplot, a 2D chart that visualizes these two features together.
plot = sns.jointplot(x='Average Rating', y='Rating Count', data=movies_rating_count_avg, alpha=0.5, color='tab:pink')
plot.savefig('joinplot.jpg')
Joint plot of Average Rating and Rating Count
Analysis
- Figure 1 confirms that most movies have received only a small number of ratings. Besides setting a fixed threshold, we could also use one of the higher quantiles in this use case.
- Figure 2 shows the distribution of 'Average Rating'.
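Data cleaning
Use the describe() function to get descriptive statistics for the dataset, such as quantiles and the standard deviation.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# rating_with_RatingCount is not defined earlier in the article; this merge is an assumed reconstruction (one row per rating, with the movie's Rating Count attached)
rating_with_RatingCount = movies_merged_df.merge(movies_rating_count, on='title')
print(rating_with_RatingCount['Rating Count'].describe())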
count   100836.000
mean        58.759
std         61.965
min          1.000
25%         13.000
50%         39.000
75%         84.000
max        329.000
Name: Rating Count, dtype: float64
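Set a threshold and keep only the data at or above it.
popularity_threshold = 50
popular_movies = rating_with_RatingCount[rating_with_RatingCount['Rating Count'] >= popularity_threshold]
popular_movies.head()
# popular_movies.shape
At this point the data has been cleaned by filtering out movies whose rating count is below the threshold.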
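As mentioned in the analysis, a quantile can be used in place of a hand-picked threshold. The following sketch shows the idea on made-up per-movie counts; the 0.75 quantile and the name `quantile_threshold` are arbitrary choices for illustration, not values from the article.

```python
import pandas as pd

# Made-up per-movie rating counts (one row per movie).
counts = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Rating Count': [1, 2, 3, 5, 8, 40, 120, 300],
})

# Use the 75th percentile of the counts as the popularity threshold.
quantile_threshold = counts['Rating Count'].quantile(0.75)
popular = counts[counts['Rating Count'] >= quantile_threshold]
print(quantile_threshold, popular['title'].tolist())
```

Note that this sketch takes the quantile over one row per movie. A describe() call on a frame with one row per rating counts heavily rated movies many times, so its quantiles will be higher.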
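Create a pivot table
Create a pivot table with movies as the index and users as the columns
To load the data into the model later, we need to create a pivot table, with 'title' as the index, 'userId' as the columns, and 'rating' as the values.
movie_features_df = popular_movies.pivot_table(index='title', columns='userId', values='rating').fillna(0)
movie_features_df.head()
# Writing .xlsx requires an Excel engine such as openpyxl to be installed
movie_features_df.to_excel('output.xlsx')
Next, load the pivot table into the model.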
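Before moving on, it can help to see what such a pivot table looks like on a small example. The frame below is made up, and `sparsity` is a name introduced here for illustration only.

```python
import pandas as pd

# Made-up ratings: three users, three movies.
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3],
    'title': ['A', 'B', 'A', 'B', 'C'],
    'rating': [4.0, 5.0, 3.0, 2.0, 5.0],
})

# Rows are movies, columns are users; missing ratings become 0.
toy_features = toy.pivot_table(index='title', columns='userId', values='rating').fillna(0)
print(toy_features)

# Fraction of cells that are zero, i.e. (movie, user) pairs with no rating.
sparsity = float((toy_features.values == 0).mean())
print(sparsity)
```

Filling missing cells with 0 treats "not rated" the same as a rating of 0, which is a simplification. On real rating data most cells are empty, which is why a sparse matrix is a natural fit for the next step.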
Build the kNN model
Build a kNN model and output 5 recommendations similar to each movie
Use csr_matrix from the scipy.sparse module to convert the pivot table into a sparse matrix for fitting the model.
from scipy.sparse import csr_matrix
movie_features_df_matrix = csr_matrix(movie_features_df.values)
Finally, use the matrix generated above to fit the NearestNeighbors algorithm from sklearn, with the parameters metric = 'cosine' and algorithm = 'brute'.
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(movie_features_df_matrix)
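The sketch below fits the same kind of model on a tiny made-up matrix, to show that the returned distances are cosine distances (1 minus cosine similarity) and that the nearest neighbor of a row is normally the row itself. The `toy_` names and the ratings are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are movies, columns are users (made-up ratings, 0 = not rated).
toy_matrix = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0, 0.0],   # movie 1: rated like movie 0
    [0.0, 0.0, 5.0, 4.0],   # movie 2: rated by different users
]))

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(toy_matrix)

# Ask for the 2 nearest neighbors of movie 0; the first hit is movie 0 itself.
toy_distances, toy_indices = toy_knn.kneighbors(toy_matrix[0], n_neighbors=2)
print(toy_indices, toy_distances)
```

Because the query row is part of the fitted data, asking for k recommendations generally means requesting k + 1 neighbors and discarding the query.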
Now pass an index to the model. As kneighbors requires, convert the data to a single-row array and set n_neighbors. The value 6 is used because the closest neighbor is typically the query movie itself, which leaves five recommendations.
import numpy as np
query_index = np.random.choice(movie_features_df.shape[0])
distances, indices = model_knn.kneighbors(movie_features_df.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)
Finally, print the movie recommendations for query_index.
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(movie_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, movie_features_df.index[indices.flatten()[i]], distances.flatten()[i]))
Recommendations for Harry Potter and the Order of the Phoenix (2007):
1: Harry Potter and the Half-Blood Prince (2009), with distance of 0.2346513867378235:
2: Harry Potter and the Order of the Phoenix (2007), with distance of 0.3396233320236206:
3: Harry Potter and the Goblet of Fire (2005), with distance of 0.4170845150947571:
4: Harry Potter and the Prisoner of Azkaban (2004), with distance of 0.4499547481536865:
5: Harry Potter and the Chamber of Secrets (2002), with distance of 0.4506162405014038:
At this point we have successfully built a recommendation engine based solely on user ratings.
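The query-and-print steps can be wrapped in a function that looks a movie up by title and skips the query movie by its index rather than by its position in the result. `recommend` is a hypothetical helper, not part of the article or of scikit-learn, and it is shown here on made-up data.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, features_df, model, n_recommendations=5):
    """Return (title, distance) pairs for the movies closest to `title`."""
    query_pos = features_df.index.get_loc(title)
    # Ask for one extra neighbor because the query itself is usually returned.
    n_neighbors = min(n_recommendations + 1, len(features_df))
    distances, indices = model.kneighbors(
        features_df.iloc[[query_pos]].values, n_neighbors=n_neighbors)
    results = []
    for dist, pos in zip(distances.flatten(), indices.flatten()):
        if pos == query_pos:
            continue  # skip the query movie itself
        results.append((features_df.index[pos], float(dist)))
    return results[:n_recommendations]

# Made-up movie-by-user rating matrix.
toy_features = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 0.0],
     [0.0, 1.0, 5.0, 4.0],
     [0.0, 0.0, 4.0, 5.0]],
    index=['A', 'B', 'C', 'D'], columns=[1, 2, 3, 4])

toy_model = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_features.values)
recs = recommend('A', toy_features, toy_model, n_recommendations=2)
print(recs)
```

Filtering by index means the query movie never appears among its own recommendations, whatever position it takes in the neighbor list.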
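Summary
The following is a summary of the steps we took to build the movie recommendation system:
- Import and merge the datasets and create a Pandas DataFrame
- Create derived features to analyze the data better
- Visualize the data with Seaborn
- Clean the data by setting a threshold
- Create a pivot table with movies as the index and users as the columns
- Build a kNN model and output the 5 recommendations most similar to each movie
Final thoughts
Here are some ways to extend the project:
- This dataset is not very large; the scope of the project can be extended by including the other files in the dataset.
- The timestamps in 'ratings.csv' can be used to analyze how ratings change over time, and ratings can be weighted by timestamp when building the model.
- This model performs far better than weighted-average or correlation models, but there is still room for improvement, for example by using advanced ML algorithms or even DL models.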
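One way to weight ratings by timestamp, as suggested in the list above, is exponential decay. The sketch below is only one possible scheme: the half-life of 365 days, the column names `weight` and `weighted_rating`, and the sample rows are all assumptions for illustration. It also assumes the timestamps are Unix times in seconds.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 2, 3],
    'title': ['A', 'A', 'A'],
    'rating': [5.0, 3.0, 4.0],
    # Made-up Unix timestamps in seconds: 0, 365 and 730 days after a start date.
    'timestamp': [1_000_000_000,
                  1_000_000_000 + 365 * 86_400,
                  1_000_000_000 + 730 * 86_400],
})

half_life_days = 365.0  # assumed value: a rating loses half its weight per year
age_days = (toy['timestamp'].max() - toy['timestamp']) / 86_400
toy['weight'] = np.power(0.5, age_days / half_life_days)

# Time-weighted average rating per movie: sum(w * r) / sum(w).
toy['weighted_rating'] = toy['weight'] * toy['rating']
sums = toy.groupby('title')[['weighted_rating', 'weight']].sum()
weighted_avg = sums['weighted_rating'] / sums['weight']
print(weighted_avg)
```

The most recent rating gets weight 1 and older ratings count progressively less, so the weighted average leans toward recent opinion.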