Implement behavioral analysis with Python, SciKit, and text classification
Introduction
Almost everyone shops. We buy all sorts of items, from basic necessities (like food) to entertainment products (like music albums). When we shop, we are not only finding things to use in our lives; we are also expressing our interest in certain social groups. Our online behavior and decisions shape our behavioral profiles.
Each product we buy has multiple attributes that make it similar to or different from other products. For example, a product's price, size, and type are distinct characteristics. In addition to these structured numerical or enumerated attributes, there are unstructured text attributes. For example, the text of a product description or a customer review is also a distinguishing feature.
Text analytics and other natural language processing (NLP) techniques are very helpful for extracting something meaningful from these unstructured text attributes, which in turn is valuable for tasks such as behavioral analysis.
This article shows how to use text classification to build a behavioral profile model. It uses SciKit, a powerful Python-based machine learning package, to construct and evaluate the model, and then applies the model to simulated customers and their product purchase histories. In this particular scenario, the model assigns each customer a music-listener profile, such as raver, goth, or metal fan. The assignment is based on the specific products each customer has purchased and the corresponding text product descriptions.
Music behavioral profile scenario
Consider the following scenario. You have a dataset containing many customer profiles. Each customer profile includes a list of concise, natural-language descriptions of all the products the customer has purchased. Below is a sample product description for a boot.
Description: These men's buckle boots are gothic boots with a darkwave subculture feel. The rivet heads on the boots reflect the latest fashion in the industry. The boots feature a synthetic faux-leather upper with cross-buckle laces on the front that continue all the way down the shaft, a rubber outsole with a treaded base, a combat-style toe, and an inside zipper that makes the boots easy to put on and take off. Shaft 13.5 inches, leg opening circumference approximately 16 inches. (Shoe size 9.5.) Style: Men's buckle boots.
Our goal is to classify every current and future user into a behavioral profile based on these product descriptions.
As shown below, the person in charge uses product examples to build behavioral profiles, then a behavioral model, then customer profiles, and finally customer behavioral profiles.
Figure 1. High-level approach to building customer behavioral profiles
The first step is to assume the role of the person in charge and give the system an understanding of each behavioral profile. One way to do this is to manually feed the system examples of products for each profile. The examples help define the behavioral profile. This discussion classifies users into one of the following musical behavioral profiles:
punk
goth
hip hop
metal
rave
For the punk profile, provide examples of products defined as punk, such as descriptions of punk albums and bands, for example, "Never Mind the Bollocks" by the Sex Pistols. Other items may include hair or footwear products, such as mohawks and Doc Martens leather boots.
Libraries, software, and data setup
All data and source code used in this article can be downloaded from the bpro project on JazzHub. After downloading and unpacking the tar file, make sure you have Python, SciKit Learn (the machine learning and text analysis package), and all of its dependencies (such as numpy and scipy). If you are using a Mac, the SciPy Superpack is probably your best option.
After unpacking the tar file, you will notice two YAML files containing the profile data. The product descriptions are generated artificially from a seed corpus (a body of documents). The generation process takes into account how frequently words occur in product descriptions. Listing 1 is an artificial product description.
Note: The following description is not a true natural-language description, but it resembles what you might encounter in a real situation.
Listing 1. Artificial product description
customer single clothes for his size them 1978 course group rhymes have master record-breaking group few starts heard blue ending company that the band the music packaged master kilmister not trousers got cult albums heart commentary cut 20.85 tour...
This analysis includes two data files:
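customers.yaml: Contains a list of customers. For each customer, it includes a list of product descriptions and the target label, or correct behavioral profile. The correct behavioral profile is the one that you know to be correct. For example, in a real scenario, you would inspect the profile data of a goth user to verify that the purchases indicate that the user is a goth.
behavioral_profiles.yaml: Contains a list of profiles (punk, goth, and so on), along with a sample set of product descriptions that define each profile.
You can generate your own simulated files by running the command python bpro.py -g.
Note: You must first populate the seed directory with some content that defines the genres of interest. Go into the seed directory, open any of the files, and read the instructions. You can manipulate parameters in the bpro.py file to change the product description length, the amount of noise, the number of training examples, or other parameters.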
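Building a behavioral profile model
First, use SciKit's CountVectorizer to build a simple term-count representation of the corpus. The corpus object is a plain list of strings containing the product descriptions.
Listing 2. Building a simple term count
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 keeps every term that occurs in at least one description
vectorizer = CountVectorizer(min_df=1)

# The corpus is a plain list of product description strings
corpus = []
for bp in behavioral_profiles:
    for pd in bp.product_descriptions:
        corpus.append(pd.description)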
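SciKit has other, more advanced vectorizers, such as TfidfVectorizer, which stores document terms using term frequency/inverse document frequency (TF/IDF) weighting. The TF/IDF representation helps give distinctive terms (such as Ozzy, raver, and Bauhaus) a higher weight than terms that recur everywhere (such as and, the, and for).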
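To make the weighting concrete, here is a minimal sketch of TF/IDF on a three-document toy corpus. The corpus is invented for illustration and is not part of the bpro data; the sketch assumes a reasonably recent scikit-learn with default TfidfVectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every document, band names only once.
toy_corpus = [
    "the bauhaus record",
    "the ozzy record",
    "the raver outfit",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_corpus)

# idf_ holds one inverse-document-frequency value per vocabulary column.
idf_by_term = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}
for term in sorted(idf_by_term, key=idf_by_term.get):
    print("%-8s idf=%.3f" % (term, idf_by_term[term]))

# Within the first document, compare the rare term with the common one.
first_doc = tfidf_matrix[0].toarray()[0]
weight_bauhaus = first_doc[tfidf.vocabulary_["bauhaus"]]
weight_the = first_doc[tfidf.vocabulary_["the"]]
print(weight_bauhaus > weight_the)
```

In this toy corpus the rare term ends up with the larger weight because each word occurs at most once per document; in longer texts a very frequent word can still accumulate a large term-frequency component.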
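Next, split the product descriptions into individual words and build a dictionary of terms. Each term that the analyzer finds during fitting is assigned a unique integer index that corresponds to a column in the resulting matrix:
fit_corpus = vectorizer.fit_transform(corpus)
Note: This tokenizer configuration also drops single-character words.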
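The term-to-column mapping can be inspected directly. The following is a small sketch on an invented two-document corpus (not the bpro data) that shows the dictionary built by fit_transform and the dropped single-character word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only for illustration.
toy_corpus = [
    "a black goth boot",
    "a loud punk album and a punk boot",
]

toy_vectorizer = CountVectorizer(min_df=1)
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index in the count matrix.
vocabulary = toy_vectorizer.vocabulary_
print(sorted(vocabulary))

# The single-character word "a" is dropped by the default token pattern.
print("a" in vocabulary)

# Count of "punk" in the second document (row index 1).
punk_count = toy_counts[1, vocabulary["punk"]]
print(punk_count)
```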
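You can print some of the features to see which words were tokenized by using print(vectorizer.get_feature_names()[200:210]). (In scikit-learn 1.0 and later, the method is named get_feature_names_out().) The output of this command is shown below.
Listing 3. Output of the print command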
[u'better', u'between', u'beyond', u'biafra', u'big', u'bigger', u'bill', u'billboard', u'bites', u'biting']
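Note that the current vectorizer does not stem words. Stemming is the process of reducing inflected or derived words to a common base or root form. For example, big is a common stem of bigger in the preceding list. SciKit does not handle more involved tokenization (such as stemming, lemmatizing, and compound splitting), but you can use custom tokenizers, such as those from the Natural Language Toolkit (NLTK) library. For examples of custom tokenizers, see scikit-learn.org.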
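As a rough sketch of how a custom tokenizer plugs into CountVectorizer, the example below uses a deliberately crude, hypothetical suffix stripper (toy_stem) in place of a real stemmer. It is not NLTK's algorithm; in practice you would likely swap in something like NLTK's PorterStemmer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(word):
    """Hypothetical, crude stemmer: strip one common suffix, undouble a final consonant."""
    for suffix in ("ing", "est", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "bigg" -> "big"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def stemming_tokenizer(text):
    """Split into words of two or more characters, then stem each word."""
    return [toy_stem(tok) for tok in re.findall(r"\b\w\w+\b", text.lower())]


# token_pattern=None because the custom tokenizer replaces the default pattern.
stem_vectorizer = CountVectorizer(tokenizer=stemming_tokenizer, token_pattern=None)
stem_counts = stem_vectorizer.fit_transform(
    ["big boots", "bigger boot", "the biggest albums"]
)

stem_vocabulary = stem_vectorizer.vocabulary_
big_total = stem_counts[:, stem_vocabulary["big"]].sum()
print(sorted(stem_vocabulary))
print(big_total)
```

All three forms (big, bigger, biggest) collapse into one column, so the model needs fewer examples to learn anything about that word.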
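Tokenization processes such as stemming help reduce the number of training examples required, because a word with multiple forms does not need a statistical representation for each form. You can use other tricks to reduce training needs, such as a dictionary of types. For example, if you have a list of the names of all goth bands, you can create a common word token, such as goth_band, and add it to your descriptions before generating features. With this approach, a band that appears in a description for the first time is handled by the model in the same way as other bands whose patterns the model already understands. For the simulated data in this article, reducing training needs is not a concern, so move on to the next step.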
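The type-dictionary idea can be sketched in a few lines of plain Python. The band list and the helper name add_type_tokens are assumptions made for this illustration; a real dictionary would be a curated list.

```python
import re

# Hypothetical type dictionary; the grouping of these bands as goth is an assumption.
GOTH_BANDS = ["Bauhaus", "Joy Division"]


def add_type_tokens(description, names=GOTH_BANDS, token="goth_band"):
    """Append one type token for each dictionary name found in the description."""
    hits = 0
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, description, flags=re.IGNORECASE):
            hits += 1
    if hits == 0:
        return description
    return description + " " + " ".join([token] * hits)


tagged = add_type_tokens(
    "Some black Bauhaus shoes to go with your Joy Division hand bag"
)
print(tagged)
```

Because the underscore counts as a word character, the default CountVectorizer token pattern keeps goth_band as a single term.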
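In machine learning, supervised classification problems like this one are posed by first defining a set of features and a corresponding target, or correct label, for a set of observations. The chosen algorithm then attempts to fit a model that best matches the data and minimizes error against the known data set. The next step is therefore to build the feature and target label vectors (see Listing 4). It is always a good idea to randomize the observations in case the validation technique does not do so.
Listing 4. Building the feature and target label vectors
from random import shuffle

# Pair each product description with the label of the profile it belongs to
data_target_tuples = []
for bp in behavioral_profiles:
    for pd in bp.product_descriptions:
        data_target_tuples.append((bp.type, pd.description))

# Randomize the order of the observations in place
shuffle(data_target_tuples)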
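Next, assemble the vectors, as shown in Listing 5.
Listing 5. Assembling the vectors
import numpy as np

X_data = []
y_target = []
for t in data_target_tuples:
    # t[1] is the description text; convert it to a dense term-count row
    v = vectorizer.transform([t[1]]).toarray()[0]
    X_data.append(v)
    # t[0] is the behavioral profile label
    y_target.append(t[0])

X_data = np.asarray(X_data)
y_target = np.asarray(y_target)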
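Now you are ready to choose a classifier and train your behavioral profile model. Before doing so, it is a good idea to evaluate the model, to make sure it works before you try it on your customers.
Evaluating the behavioral profile model
Start with a linear Support Vector Machine (SVM), which is a good, well-matched model for sparse vector problems like this one. Use the code linear_svm_classifier = SVC(kernel="linear", C=0.025).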
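Note: You can switch to other model types by modifying this model initialization code. If you want to try different model types, you can use the following classifier map, which sets up the initialization for several common options.
Listing 6. Using a map of classifiers
# LDA and QDA are the class names used by older scikit-learn releases;
# newer releases name them LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis
classifier_map = dict()
classifier_map["Nearest Neighbors"] = KNeighborsClassifier(3)
classifier_map["Linear SVM"] = SVC(kernel="linear", C=0.025)
classifier_map["RBF SVM"] = SVC(gamma=2, C=1)
classifier_map["Decision Tree"] = DecisionTreeClassifier(max_depth=5)
classifier_map["Random Forest"] = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
classifier_map["AdaBoost"] = AdaBoostClassifier()
classifier_map["Naive Bayes"] = GaussianNB()
classifier_map["LDA"] = LDA()
classifier_map["QDA"] = QDA()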
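One way to use such a map is to loop over it and score every candidate on the same data. The sketch below does this for a subset of the classifiers above on an invented eight-document corpus; it assumes a recent scikit-learn, where cross_val_score is imported from sklearn.model_selection, and the toy data and variable names are not part of bpro.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: four goth and four punk product descriptions.
toy_docs = [
    "black bauhaus boots", "black lace bauhaus dress",
    "black velvet cape", "bauhaus vinyl record",
    "leather pistols jacket", "pistols mohawk poster",
    "leather studded belt", "mohawk hair gel",
]
toy_labels = ["goth"] * 4 + ["punk"] * 4

# GaussianNB needs a dense array, so convert the sparse count matrix.
X_toy = CountVectorizer().fit_transform(toy_docs).toarray()

toy_classifier_map = {
    "Nearest Neighbors": KNeighborsClassifier(3),
    "Linear SVM": SVC(kernel="linear", C=0.025),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Score every classifier with the same two-fold split and keep the mean accuracy.
mean_scores = {}
for name, classifier in toy_classifier_map.items():
    fold_scores = cross_val_score(classifier, X_toy, toy_labels, cv=2)
    mean_scores[name] = fold_scores.mean()
    print("%-18s %.2f" % (name, mean_scores[name]))
```

With so little data the scores say nothing about the classifiers themselves; the point is only the pattern of swapping models behind a common interface.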
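Because this is a multiclass classification problem (that is, one in which you need to choose among more than two possible classes), you also need to specify a corresponding strategy. A common approach is one-versus-all classification. For example, product descriptions from the goth class are used to define one class, and the other class consists of example descriptions from all the other classes (metal, rave, and so on). Finally, as part of validation, you need to make sure that the data used to train the model is not the data used to test it. A common technique is cross-fold validation. Here you use five folds, which means making five passes through a five-part partition of the data. On each pass, four-fifths of the data is used for training and the remaining fifth is used for testing.
Listing 7. Cross-fold validation
# In scikit-learn 0.18 and later, cross_val_score lives in sklearn.model_selection
scores = cross_validation.cross_val_score(OneVsRestClassifier(linear_svm_classifier), X_data, y_target, cv=5)
print("Accuracy using %s: %0.2f (+/- %0.2f) and %d folds" % ("Linear SVM", scores.mean(), scores.std() * 2, len(scores)))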
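Even so, you get perfectly accurate results, which is a sign that the simulated data is a bit too perfect. In real life, of course, there is always noise, because perfect boundaries between groups do not always exist. For example, there is the problematic genre of goth punk, so a band like Crimson Scarlet might end up in the training examples for both goth and punk. You can experiment with the seed data in the bpro download package to get a better feel for this type of noise.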
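For readers on newer scikit-learn releases, the five-fold one-versus-all evaluation described above can be sketched as follows. The synthetic three-profile corpus is an assumption made so the example is self-contained; it stands in for the bpro data and is, like the simulated data, unrealistically clean.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical synthetic corpus: ten near-identical descriptions per profile.
profile_words = {
    "goth": "black lace bauhaus boots",
    "punk": "mohawk pistols leather jacket",
    "rave": "neon glow sticks techno",
}
docs, labels = [], []
for profile, words in profile_words.items():
    for i in range(10):
        docs.append("%s item %d" % (words, i))
        labels.append(profile)

X = CountVectorizer().fit_transform(docs)

# One binary linear SVM per profile: that profile versus all the others.
one_vs_rest = OneVsRestClassifier(SVC(kernel="linear", C=0.025))

# cv=5: five passes, each training on four-fifths and testing on one-fifth.
scores = cross_val_score(one_vs_rest, X, labels, cv=5)
print("Accuracy: %0.2f (+/- %0.2f) over %d folds"
      % (scores.mean(), scores.std() * 2, len(scores)))
```

For a classifier with string labels, cross_val_score uses stratified folds, so each fold keeps the same balance of profiles as the full data set.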
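Now that you understand how a behavioral profile model performs, you can circle back and train it on all of your data.
Listing 8. Training the behavioral profile model
behavioral_profiler = SVC(kernel="linear", C=0.025)
# Fit on the full data set, not just the cross-validation training folds
behavioral_profiler.fit(X_data, y_target)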
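Trying out the behavioral model
Now you can play with the model by typing in some made-up product descriptions to see how it behaves.
Listing 9. Trying out the model
# predict expects a 2-D array with one row per description
print(behavioral_profiler.predict(vectorizer.transform(['Some black Bauhaus shoes to go with your Joy Division hand bag']).toarray()))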
Notice that it does indeed return ['goth']. If you remove the word Bauhaus and rerun, you might notice that it now returns ['punk'].
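A compact way to experiment with made-up descriptions is to bundle the vectorizer and classifier into one object. The sketch below uses scikit-learn's make_pipeline, which is a convenience swapped in here and not what the bpro code does; the eight training descriptions are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training data, four descriptions per profile.
train_docs = [
    "black bauhaus boots", "black bauhaus dress",
    "black bauhaus cape", "black bauhaus record",
    "leather pistols jacket", "leather pistols poster",
    "leather pistols belt", "leather pistols patch",
]
train_labels = ["goth"] * 4 + ["punk"] * 4

# The pipeline vectorizes raw text and then classifies it in a single call.
toy_profiler = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=0.025))
toy_profiler.fit(train_docs, train_labels)

# Words the vectorizer never saw during fitting (some, shoes, vest) are ignored.
goth_guess = toy_profiler.predict(["Some black Bauhaus shoes"])[0]
punk_guess = toy_profiler.predict(["leather pistols vest"])[0]
print(goth_guess, punk_guess)
```

Because the pipeline accepts raw strings, there is no separate transform step to keep in sync with the classifier.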
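Applying the behavioral model to your customers
Go ahead and apply the trained model to the customers and the descriptions of the products they purchased.
Listing 10. Applying the trained model to our customers and their product descriptions
predicted_profiles = []
ground_truth = []
for c in customers:
    # Join all of the customer's product descriptions into one document
    customer_prod_descs = ' '.join(p.description for p in c.product_descriptions)
    predicted = behavioral_profiler.predict(vectorizer.transform([customer_prod_descs]).toarray())
    predicted_profiles.append(predicted[0])
    ground_truth.append(c.type)
    print("Customer %d, known to be %s, was predicted to be %s" % (c.id, c.type, predicted[0]))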
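Finally, compute the accuracy to see how often you profile shoppers correctly.
Listing 11. Computing the accuracy
# True where the predicted profile matches the known profile
a = [x1 == y1 for x1, y1 in zip(predicted_profiles, ground_truth)]
# Scale the fraction of matches to a percentage
accuracy = 100.0 * sum(a) / len(a)
print("Percent Profiled Correctly %.2f" % accuracy)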
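The same figure can be cross-checked with scikit-learn's accuracy_score. The short sketch below compares the manual computation with the library call on four invented predictions; the label lists are placeholders, not bpro output.

```python
from sklearn.metrics import accuracy_score

# Hypothetical predictions and known labels for four customers.
predicted = ["goth", "punk", "rave", "metal"]
truth = ["goth", "punk", "rave", "punk"]

# Manual computation: fraction of positions where the labels agree.
matches = [p == t for p, t in zip(predicted, truth)]
manual_accuracy = float(sum(matches)) / len(matches)

# Library computation; the known labels are passed first.
library_accuracy = accuracy_score(truth, predicted)

print("Percent Profiled Correctly %.2f" % (100 * manual_accuracy))
```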
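If you use the default profile data provided, the result should be 95 percent. If this were real data, that would be a pretty good accuracy rate.
Scaling the model
Now that you have built and tested the model, you can apply it to millions of customer profiles. You can use the MapReduce framework and send the trained behavioral profiler to the worker nodes. Each worker node then receives a batch of customer profiles with their purchase histories, applies the model, and saves the results. At this point the model has been applied, and each of your customers is assigned a behavioral profile. You can use these behavioral profile assignments in many ways. For example, you might decide to target customers with tailored promotions, or use the behavioral profile as input to a product recommendation system.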
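The map and reduce steps can be imitated in a single process to show the shape of the computation. This is only a sketch under stated assumptions: the model is serialized with pickle, a list comprehension stands in for the cluster, and the function names map_batch and reduce_counts, the training data, and the customer batches are all invented. A real MapReduce job would ship the serialized model to the workers through the framework's own mechanism.

```python
import pickle
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training data, four descriptions per profile.
train_docs = ["black bauhaus boots", "black bauhaus dress",
              "black bauhaus cape", "black bauhaus record",
              "leather pistols jacket", "leather pistols poster",
              "leather pistols belt", "leather pistols patch"]
train_labels = ["goth"] * 4 + ["punk"] * 4

model = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=0.025))
model.fit(train_docs, train_labels)

# Serialize the trained profiler once so it can be sent to every worker.
model_bytes = pickle.dumps(model)


def map_batch(serialized_model, batch):
    """Map step: a worker restores the model and profiles its batch of customers."""
    profiler = pickle.loads(serialized_model)
    ids = [customer_id for customer_id, _ in batch]
    texts = [text for _, text in batch]
    return [(cid, str(label)) for cid, label in zip(ids, profiler.predict(texts))]


def reduce_counts(mapped_batches):
    """Reduce step: count how many customers landed in each profile."""
    counts = Counter()
    for pairs in mapped_batches:
        counts.update(label for _, label in pairs)
    return counts


# Each batch is a list of (customer id, joined product descriptions).
batches = [
    [(1, "black bauhaus ring"), (2, "leather pistols vest")],
    [(3, "black bauhaus scarf")],
]
mapped = [map_batch(model_bytes, batch) for batch in batches]
profile_counts = reduce_counts(mapped)
print(dict(profile_counts))
```

Only unpickle models from sources you trust, because loading a pickle can execute arbitrary code.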