
Detailed explanation of machine learning exploration with Python and Scikit-Learn

黄舟 | 2017-10-17 10:24:29

This article introduces machine learning with Python and Scikit-Learn; it is shared here for learning and reference by anyone who needs it.

Hello, %username%!

My name is Alex, and I have experience in machine learning and network graph analysis (mainly theory). I was also developing a big data product for a Russian mobile operator. This is my first time writing an article online, so don't judge it too harshly if something is off.

Nowadays, many people want to develop efficient algorithms and participate in machine learning competitions, so they come to me and ask, "How do I get started?". Some time ago, I led the development of big data analysis tools for media and social networks in an agency affiliated with the Russian government. I still have some of the documentation my team used, and I would love to share it with you. The prerequisite is that the reader already has a good knowledge of mathematics and machine learning (my team mainly consists of graduates of MIPT, the Moscow Institute of Physics and Technology, and of the School of Data Analysis).

This article is an introduction to data science, a subject that has become very popular recently. The number of machine learning competitions is also growing (e.g., Kaggle, TunedIT), and the prizes are often substantial.

R and Python are two of the most commonly used tools available to data scientists. Each has its pros and cons, but lately Python has been winning in every respect (just my humble opinion, even though I use both). All this happened because of the advent of the Scikit-Learn library, which has complete documentation and a rich set of machine learning algorithms.
Please note that we will mainly discuss machine learning algorithms in this article. It is usually better to use the Pandas package to perform primary data analysis, and that is easy to do on your own. So let's focus on implementation. To be concrete, we assume the input is a feature-object matrix stored in a *.csv file.
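As a brief aside (not part of the walkthrough below), a first look at such a *.csv file with Pandas can be as simple as the following sketch; the file name dataset.csv is just a placeholder:

import pandas as pd

# a quick first look at a feature-object matrix stored in a *.csv file
# (the file name "dataset.csv" is a placeholder for your own data)
df = pd.read_csv("dataset.csv", header=None)
print(df.shape)       # number of objects and features
print(df.describe())  # basic per-feature statistics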

Data loading

First, the data must be loaded into memory before it can be processed. The Scikit-Learn library uses NumPy arrays in its implementation, so we will use NumPy to load the *.csv file. Let's download one of the datasets from the UCI Machine Learning Repository.


import numpy as np
import urllib.request  # Python 3; on Python 2 use `import urllib` and `urllib.urlopen`
# url with dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urllib.request.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data (the eight feature columns) from the target attribute
X = dataset[:, 0:8]
y = dataset[:, 8]

We will use this dataset in all the examples below; in other words, we will use the feature matrix X and the values of the target variable y.

Data normalization

We all know that most gradient-based methods (on which almost all machine learning algorithms rely) are sensitive to the scaling of the data. Therefore, before running an algorithm, we should perform either normalization or so-called standardization. Normalization involves replacing the nominal values of all features so that each of them lies between 0 and 1. Standardization involves preprocessing the data so that each feature has mean 0 and variance 1. The Scikit-Learn library provides corresponding functions for both.


from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)

Selection of features

There is no doubt that the most important thing in solving a problem is the ability to select, and even create, features appropriately. This is called feature selection and feature engineering. Although feature engineering is a quite creative process that sometimes relies more on intuition and domain knowledge, there are already many algorithms ready to use for feature selection. For example, tree-based algorithms can compute the informativeness of features.


from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

All other methods are based on an efficient search over feature subsets in order to find the best subset, meaning the subset on which the resulting model has the best quality. Recursive Feature Elimination (RFE) is one of these search algorithms and is also provided by the Scikit-Learn library.


from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

Algorithm development

As I said, the Scikit-Learn library has implementations of all the basic machine learning algorithms. Let's take a look at some of them.

Logistic regression

It is mostly used to solve classification problems (binary classification), but multi-class classification (via the so-called one-vs-all method) is also supported. The advantage of this algorithm is that for each output object it produces a probability of belonging to each class.


from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
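To make the per-class probabilities mentioned above concrete, here is a minimal sketch continuing from the model just fitted; predict_proba is the standard Scikit-Learn method for this:

# class membership probabilities for the first five objects;
# each row sums to 1 across the classes
print(model.predict_proba(X[:5]))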

Naive Bayes

It is also one of the best-known machine learning algorithms; its main task is to restore the density of the data distribution of the training samples. This method often performs well on multi-class classification problems.


from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

k-nearest neighbor

The kNN (k-nearest neighbors) method is often used as a component of a more complex classification algorithm. For instance, we can use its estimate as an additional feature of an object. Sometimes a simple kNN on well-chosen features already provides excellent quality.


from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
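As a sketch of the idea above (using kNN estimates as an extra feature for another model), one possible, simplified approach follows; in a real setup you would use out-of-fold predictions to avoid leaking the target into the new feature:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# use the kNN class-1 probability as an additional feature column
knn = KNeighborsClassifier().fit(X, y)
knn_feature = knn.predict_proba(X)[:, 1].reshape(-1, 1)
X_extended = np.hstack([X, knn_feature])
# fit a second model on the extended feature matrix
stacked = LogisticRegression().fit(X_extended, y)
print(stacked.score(X_extended, y))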

Decision Tree

Classification and Regression Trees (CART) are often used for problems in which objects have categorical features; they are used for both regression and classification. Decision trees are well suited to multi-class classification.


from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Support Vector Machines

SVM (Support Vector Machine) is one of the most popular machine learning algorithms, mainly used for classification problems. As with logistic regression, SVM can perform multi-class classification with the help of the one-vs-all method.


from sklearn import metrics
from sklearn.svm import SVC
# fit an SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Besides classification and regression algorithms, Scikit-Learn offers a huge number of more complex algorithms, including clustering, as well as techniques for building composite algorithms such as Bagging and Boosting.
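As a minimal sketch of what those composite techniques look like in Scikit-Learn (reusing the same X and y; the base estimator choices and parameter values below are arbitrary):

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: an ensemble of decision trees fitted on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)
bagging.fit(X, y)
# Boosting: trees added sequentially, each one correcting its predecessors
boosting = GradientBoostingClassifier(n_estimators=100)
boosting.fit(X, y)
print(bagging.score(X, y), boosting.score(X, y))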

How to optimize algorithm parameters

One of the hardest steps in building an efficient algorithm is choosing the right parameters. This is generally easier with experience, but one way or another we have to search for them. Fortunately, Scikit-Learn provides many functions to help with this problem.

As an example, let's look at selecting the regularization parameter, where a number of values are searched in turn:


import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Sometimes it is more efficient to pick a parameter randomly from a given range, estimate the algorithm's quality for each value, and then choose the best one.


import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search in older versions
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

At this point we have walked through the whole process of using the Scikit-Learn library, except for outputting the results back to a file; let that be an exercise for you. Compared with R, one big advantage of Python is its excellent documentation.
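If you would like a starting point for that exercise, one possible sketch writes the true labels and predictions side by side using NumPy; the file name predictions.csv is just a placeholder:

import numpy as np

# write true labels and model predictions to a CSV file
results = np.column_stack([y, model.predict(X)])
np.savetxt("predictions.csv", results, delimiter=",", fmt="%d")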

Summary

The above is the detailed content of machine learning exploration with Python and Scikit-Learn.
