An introduction to four methods to implement machine learning functions in Python

不言

Apr 13, 2019 am 11:41 AM


This article introduces four methods of feature selection for machine learning in Python.

We will look at different ways of selecting features from a dataset, and discuss the types of feature selection algorithms and their implementation in Python using the Scikit-learn (sklearn) library:

Univariate feature selection
Recursive feature elimination (RFE)
Principal component analysis (PCA)
Feature importance

Univariate Feature Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The following example uses the chi-squared (chi^2) statistical test for non-negative features to select the four best features from the Pima Indians Diabetes dataset:

# Feature extraction with univariate statistical tests (chi-squared for classification)
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
# Import chi2 for performing the chi-squared test
from sklearn.feature_selection import chi2

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# Select the features using the chi-squared test
test = SelectKBest(score_func=chi2, k=4)
# Fit the selector to rank the features by score
fit = test.fit(X, Y)
# Summarize the scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
# Apply the transformation to the dataset
features = fit.transform(X)
# Summarize the selected features
print(features[0:5,:])

The output shows the score of each attribute and the four attributes selected (those with the highest scores): plas, test, mass and age.

Score per feature:

[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]

Features:

[[148. 0. 33.6 50. ]
[85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[89. 94. 28.1 21. ]
[137. 168. 43.1 33. ]]
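To make the scores above less of a black box, here is a minimal sketch of how a chi-squared score per feature can be computed by hand with numpy. It assumes the usual contingency-table formulation (observed per-class feature totals against the totals expected if feature and class were independent); the function name chi2_scores and the toy data are made up for illustration and are not part of the original example.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic between each non-negative feature and the class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Observed total of each feature within each class
    observed = Y.T @ X
    # Expected totals if the feature were independent of the class
    class_prob = Y.mean(axis=0)
    feature_total = X.sum(axis=0)
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 differs strongly between classes, feature 1 is constant
X_toy = np.array([[4, 1], [5, 1], [0, 1], [1, 1]])
y_toy = np.array([1, 1, 0, 0])
scores = chi2_scores(X_toy, y_toy)
# Index of the single best-scoring feature
top = np.argsort(scores)[::-1][:1]
print(scores, top)
```

A feature whose total is spread across the classes in proportion to the class sizes scores zero, which is why the constant column in the toy data is never selected.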

Recursive Feature Elimination (RFE)

RFE works by recursively removing attributes and building a model on the attributes that remain. It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much, as long as it is skillful and consistent:

# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
# Import LogisticRegression to use as the estimator inside RFE
from sklearn.linear_model import LogisticRegression

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# Feature extraction
# liblinear was the default solver before scikit-learn 0.22
model = LogisticRegression(solver='liblinear')
# Recent scikit-learn versions require n_features_to_select as a keyword
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

After execution, we will obtain:

Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose preg, mass and pedi as the top three features. These are marked True in the support_ array and given rank 1 in the ranking_ array.
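The elimination loop itself is short enough to sketch by hand. The version below is only an illustration of the idea, not what sklearn's RFE does internally: it swaps logistic regression for an ordinary least-squares fit, uses the absolute coefficient as the importance of a feature, and runs on synthetic data. The name recursive_elimination is hypothetical.

```python
import numpy as np

def recursive_elimination(X, y, n_keep):
    """Refit on the surviving features and drop the weakest until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Least-squares fit on the features that are still in play
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # The feature with the smallest absolute coefficient is removed
        weakest = int(np.argmin(np.abs(coef)))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data: only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.01 * rng.normal(size=200)
kept, dropped = recursive_elimination(X_demo, y_demo, n_keep=2)
print("Kept:", kept, "Dropped (in order):", dropped)
```

Comparing raw coefficients is only meaningful here because every synthetic column has the same scale; with real data the features would normally be standardized first.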

Principal Component Analysis (PCA)

PCA uses linear algebra to transform a dataset into a compressed form. It is generally considered a data reduction technique. One property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the following example, we use PCA and select three principal components:

# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# Summarize the components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663 0.06159078 0.02579012]

[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02 9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01 5.78614699e-02 9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01 2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
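The linear algebra behind these numbers can be sketched with a singular value decomposition of the centered data. The following is a minimal illustration on synthetic data, assuming the standard SVD formulation of PCA; the function name pca_svd and the column scales are made up for the example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Principal components via SVD of the mean-centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    # Project the data onto the retained directions
    projected = X_centered @ components.T
    return components, ratio[:n_components], projected

# Synthetic data whose columns have very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * np.array([10.0, 3.0, 1.0, 0.1])
components, ratio, projected = pca_svd(X_demo, n_components=3)
print("Explained variance ratio:", ratio)
print("Shape after projection:", projected.shape)
```

In the output above, the first component loads almost entirely on a single attribute (0.993 on test). PCA is sensitive to the scale of its inputs, so a column with a large numeric range can dominate the first component; standardizing the columns beforehand is a common way to avoid this.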

Feature Importance

Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's look at it in detail.

Random forests are among the most popular machine learning methods because of their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure by which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy, and for regression trees, it is variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.
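As a small worked example of the impurity measure described above, the sketch below computes Gini impurity and the weighted impurity decrease of a single split in plain Python. The function names and the toy labels are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["a", "a", "b", "b"]
# A split that separates the classes perfectly
perfect = impurity_decrease(parent, ["a", "a"], ["b", "b"])
# A split that leaves both children as mixed as the parent
useless = impurity_decrease(parent, ["a", "b"], ["a", "b"])
print(perfect, useless)
```

A feature that produces splits like the first one accumulates a large impurity decrease across the tree, while a feature that only produces splits like the second contributes nothing.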

Let us see how to use the Random Forest classifier for feature selection and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to get predictions for new products as an array of probabilities for each of the 10 classes and evaluate the model using a multi-class log loss (also known as cross-entropy).
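For reference, multi-class log loss can be sketched in a few lines of plain Python. This is a minimal illustration that assumes integer class indices and one list of probabilities per sample; the function name multiclass_log_loss and the clipping constant eps are choices made for this example.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for true_class, probs in zip(y_true, probabilities):
        # Clip to avoid taking the log of exactly 0
        p = min(max(probs[true_class], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A model that spreads its probability evenly over 10 classes
uniform = [[0.1] * 10 for _ in range(3)]
loss_uniform = multiclass_log_loss([0, 4, 9], uniform)

# A model that is fairly confident and correct on two binary samples
loss_confident = multiclass_log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])
print(loss_uniform, loss_confident)
```

An uninformed guess over 10 classes gives a loss of ln(10), roughly 2.30, which is a useful baseline when reading log loss values.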

We will start by importing all the libraries:

# Import the supporting libraries
# Import pandas to load the dataset from a CSV file
from pandas import read_csv
# Import numpy for array-based operations and calculations
import numpy as np
# Import the random forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
# Import the SelectFromModel feature selector class from sklearn
from sklearn.feature_selection import SelectFromModel
# Fix the random seed for reproducibility
np.random.seed(1)

Let's define a function to split our dataset into training and testing data; we will train our model on the training part, and the testing part will be used to evaluate the trained model:

# Function to create train and test sets from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    # Shuffle the rows in place before splitting
    np.random.shuffle(dataset)
    # Number of rows that go into the training set
    trainlength = int(np.floor(split * dataset.shape[0]))
    training = np.array(dataset[:trainlength])
    testing = np.array(dataset[trainlength:])
    return training, testing
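The function above shuffles the array it is given in place. If that side effect is unwanted, one possible alternative is to draw a random permutation of row indices and leave the input untouched. The sketch below is such a variant, not the article's method; the name split_train_test and the seed parameter are hypothetical.

```python
import numpy as np

def split_train_test(dataset, split, seed=0):
    """Split rows into train and test parts without modifying the input."""
    rng = np.random.default_rng(seed)
    # Random ordering of the row indices
    order = rng.permutation(len(dataset))
    cut = int(split * len(dataset))
    return dataset[order[:cut]], dataset[order[cut:]]

# Eight rows, 75% of them for training
data_demo = np.arange(16).reshape(8, 2)
train_demo, test_demo = split_train_test(data_demo, 0.75)
print(train_demo.shape, test_demo.shape)
```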

We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percent accuracy:

# Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count) / len(ytest)
    return acc
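The same calculation can be written without an explicit loop by comparing the two arrays element-wise and averaging the resulting booleans. This is a minimal sketch of that vectorized form; the function name accuracy and the sample labels are made up for illustration.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the actual label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Element-wise comparison gives booleans; their mean is the accuracy
    return float(np.mean(predicted == actual))

acc_demo = accuracy(["Class_1", "Class_2", "Class_2", "Class_3"],
                    ["Class_1", "Class_2", "Class_3", "Class_3"])
print("Accuracy: %.2f" % (100 * acc_demo))
```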

Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50,000 instances in our example, of which 35,000 instances are used to train the classifier and 15,000 instances to test its performance:

# Load the dataset as a pandas data frame
data = read_csv('train.csv')
# Extract attribute names from the data frame
feat = data.keys()
# Index.get_values() was removed in pandas 1.0; to_numpy() replaces it
feat_labels = feat.to_numpy()
# Extract data values from the data frame
dataset = data.values
# Shuffle the dataset
np.random.shuffle(dataset)
# We will use 50000 instances for this example
inst = 50000
# Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
# Create training and testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
# Split the data into input and output variables
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
# Print the size of the data in MB
print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes/1e6))

We take note of the data size here; our dataset contains 35,000 training instances with 94 attributes, so it is quite large. Let's take a look:

Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB

As you can see, our dataset has 35000 rows and 94 columns, which is over 26 MB of data.
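The reported size follows from simple arithmetic on the array shape. The sketch below reproduces it under the assumption that every element occupies 8 bytes (as it would for 64-bit floats, or for object references on a 64-bit Python build); the helper name array_megabytes is hypothetical.

```python
def array_megabytes(rows, cols, bytes_per_item=8):
    """Size of a dense rows x cols array in megabytes (1 MB = 1e6 bytes)."""
    return rows * cols * bytes_per_item / 1e6

# 35000 training rows with 94 columns, 8 bytes per element (assumed)
before = array_megabytes(35000, 94)
print("%.2f MB" % before)
```

Because the size is proportional to the number of columns, dropping columns shrinks the array by the same fraction.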

In the next code block, we will configure the random forest classifier; we will use 250 trees, the maximum depth is 30, and the number of random features is 7. The other hyperparameters will be the default values of sklearn:

# Select the test data for model evaluation
Xtest = test[:,0:94]
ytest = test[:,94]

# Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)

# Train the classifier and measure the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
# Note down the model training time
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

# Evaluate the model performance on the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f" % (100*acc))

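The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy, classifying almost 99% of the test data into the correct classes. This means we are classifying 14,823 instances out of 15,000 correctly.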

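So, now the question is: should we go for further improvement? Well, why not? We should certainly look for more improvement if we can; here, we will use feature importance to select features. As you know, in the tree-building process we use an impurity measure for node selection. The attribute value with the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ attribute of the trained sklearn classifier. Let's find out the importance of each feature:

# Once we have trained the model, we rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)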

('id', 0.33346650420175183)

('feat_1', 0.0036186958628801214)

('feat_2', 0.0037243050888530957)

('feat_3', 0.011579217472062748)

('feat_4', 0.010297382675187445)

('feat_5', 0.0010359139416194116)

('feat_6', 0.00038171336038056165)

('feat_7', 0.0024867672489765021)

('feat_8', 0.0096689721610546085)

('feat_9', 0.007906150362995093)

('feat_10', 0.0022342480802130366)

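As you can see here, each feature has a different importance based on its contribution to the final prediction.

We will use these importance scores to rank our features; in the following part, we will select for model training those features whose importance is greater than 0.01: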

# Select the features that contribute most to the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)

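Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we transform the dataset and then check the size and shape of the new dataset: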

# Transform the input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
# Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

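Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26 MB to 5.60 MB. That is a reduction of about 80% from the original dataset.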
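The thresholding step itself amounts to a boolean mask over the importance scores: SelectFromModel keeps the features whose importance is greater than or equal to the threshold. The sketch below applies that rule by hand to the eleven scores printed earlier, rounded to four decimal places for readability; it covers only those eleven values, not all 94 attributes.

```python
import numpy as np

# The eleven labels and importance scores printed earlier (rounded)
labels = np.array(["id", "feat_1", "feat_2", "feat_3", "feat_4", "feat_5",
                   "feat_6", "feat_7", "feat_8", "feat_9", "feat_10"])
importances = np.array([0.3335, 0.0036, 0.0037, 0.0116, 0.0103, 0.0010,
                        0.0004, 0.0025, 0.0097, 0.0079, 0.0022])

threshold = 0.01
# True for every feature that meets the threshold
mask = importances >= threshold
selected = labels[mask]
print(selected)
```

On a fitted selector, sfm.get_support() returns the corresponding mask for all input columns, which makes it possible to recover the names of the retained features.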

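In the next code block, we will train a new random forest classifier with the same hyperparameters as before and test it on the test dataset. Let's see what accuracy we get after modifying the training set: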

# Measure the model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
# Let's evaluate the model on the test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f" % (100*acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

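As you can see, we get 99.97% accuracy with the modified dataset, which means we are classifying 14,996 instances into the correct classes, whereas previously we classified only 14,823 instances correctly.

This is a big improvement obtained through the feature selection process; we can summarize all the results in the following table: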

Evaluation criterion	Before feature selection	After feature selection
Number of features	94	20
Size of dataset	26.32 MB	5.60 MB
Training time	2.91 s	1.71 s
Accuracy	98.82%	99.97%

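The table above shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which lowers the model complexity and the dimensionality of the dataset. Training time is shorter after the reduction in dimensions, and finally, we have overcome the overfitting issue and obtained higher accuracy than before.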


Statement

This article is reproduced from segmentfault. If there is any infringement, please contact admin@php.cn to have it removed.
