A brief introduction to preprocessing and heatmaps in python

不言

Oct 11, 2018 pm 04:29 PM

This article gives a brief introduction to preprocessing and heat maps in Python. It is meant as a reference, and I hope it is useful to readers who need it.

Data analysis covers a lot of ground, so this is only a heuristic introduction. Once you know these tools exist, you can find a solution faster when you need one. I hope it is helpful to everyone.

This time we again use the iris data set from sklearn and display it as a heat map.

Preprocessing

sklearn.preprocessing is the preprocessing module of the scikit-learn machine learning library. It can standardize data, normalize it, and so on, depending on what you need. Here its standardization function is used to prepare the data; the other methods can be looked up in the documentation.
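As a rough orientation only, here is a minimal sketch of three functions from sklearn.preprocessing applied to the same small array. The array and the variable names are made up for illustration; only scale is used in the rest of this article.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data: each row is a sample, each column is a feature
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Standardize: every column ends up with mean 0 and variance 1
standardized = preprocessing.scale(X)

# Min-max scaling: every column is rescaled to the range [0, 1]
min_max = preprocessing.minmax_scale(X)

# Normalize: every row is rescaled to unit L2 norm
unit_rows = preprocessing.normalize(X)

print(standardized.mean(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
print(np.linalg.norm(unit_rows, axis=1))
```

Note that scale and minmax_scale work column by column, while normalize works row by row.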

Standardization: rescale each feature so that its mean is 0 and its variance is 1. This is often described as converting the data to a standard normal (Gaussian) distribution, but strictly speaking it only shifts and rescales the values; the shape of the distribution does not change.

The reason for standardizing is that a feature whose variance is much larger than that of the others can dominate the objective function and prevent the estimator from learning from the other features correctly.
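One way to see this effect is with a squared Euclidean distance, used here as a stand-in for an objective function. The sketch below uses made-up data (a hypothetical income column and age column) and a hypothetical helper, share_of_first_feature, to measure how much of the distance between two samples comes from the first column, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up data: column 0 is an income, column 1 is an age
X = np.array([[30000., 25.],
              [32000., 60.],
              [90000., 27.],
              [88000., 58.]])

def share_of_first_feature(data, i, j):
    """Fraction of the squared distance between rows i and j due to column 0."""
    diff_sq = (data[i] - data[j]) ** 2
    return diff_sq[0] / diff_sq.sum()

# Rows 0 and 1 have similar incomes but very different ages
raw_share = share_of_first_feature(X, 0, 1)
scaled_share = share_of_first_feature(scale(X), 0, 1)

print('share of income before scaling:', raw_share)
print('share of income after scaling: ', scaled_share)
```

On the raw data nearly all of the distance comes from the income column, even though the two samples differ mainly in age. After standardization the age difference carries almost all of the weight.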

Standardization takes two steps: centering (subtract the mean so that it becomes 0) and scaling (divide by the standard deviation so that the variance becomes 1).

sklearn.preprocessing provides a scale function that does both.
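Written out, each value x in a column becomes z = (x - mean) / std, where mean and std are computed over that column. The two steps can be sketched by hand with NumPy as follows; the array is illustrative, and the comparison with sklearn's scale is only a sanity check.

```python
import numpy as np
from sklearn.preprocessing import scale

# Illustrative data: each row is a sample, each column is a feature
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Step 1: centering, subtract each column's mean
centered = X - X.mean(axis=0)

# Step 2: scaling, divide by each column's standard deviation
# (np.std uses the population formula, ddof=0, by default)
standardized = centered / X.std(axis=0)

# Sanity check against sklearn's implementation
print(np.allclose(standardized, scale(X)))
```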

Let’s take an example:

from sklearn import preprocessing
import numpy as np
# Create some feature data: each row is a sample, each column is a feature
xx = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
# Standardize each column to mean 0 and variance 1; note that this is done per column
xx_scale = preprocessing.scale(xx)
xx_scale

The result after standardizing each column is:

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

You can see that the data has changed and that the values are now fairly small. Some people may recognize the pattern at a glance, but it does not matter if you cannot: Python can easily compute the relevant statistics.

# Check the mean and standard deviation of each column of xx_scale
print('mean:', xx_scale.mean(axis=0))  # axis=0 computes per column, axis=1 per row
print('std:', xx_scale.std(axis=0))

The definition of standardization was given above, and the results are indeed consistent with it. The mean and standard deviation computed by column are:

mean: [0. 0. 0.]
std: [1. 1. 1.]

Of course, centering and scaling do not have to be done together; sometimes you only want one of the two. The scale function has two parameters for this:

with_mean and with_std. Both are boolean parameters and both default to True, but either can be set to False: with_mean=False skips mean centering, and with_std=False skips scaling the variance to 1.
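A minimal sketch of the two switches, using an illustrative array and variable names of my own choosing:

```python
import numpy as np
from sklearn.preprocessing import scale

# Illustrative data: each row is a sample, each column is a feature
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Centering only: column means become 0, the spread is left unchanged
centered_only = scale(X, with_std=False)

# Scaling only: column standard deviations become 1, the data is not centered
scaled_only = scale(X, with_mean=False)

print('centered_only mean:', centered_only.mean(axis=0))
print('centered_only std: ', centered_only.std(axis=0))
print('scaled_only mean:  ', scaled_only.mean(axis=0))
print('scaled_only std:   ', scaled_only.std(axis=0))
```

The heat map example later in this article uses the first variant, with_std=False, so the values keep their original units and are only shifted to be centered on 0.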

Heat map

I will only mention heat maps briefly here, because there is already plenty of detailed material about them online.

In a heat map the data is laid out as a matrix, and the range of values is represented by a color gradient. Here pcolor is used to draw the heat map.
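To make the matrix-to-color idea concrete before the full iris example, here is a minimal sketch that draws a made-up 2 by 3 matrix with pcolor. The colorbar and the Agg backend line are my additions and are not part of the original example; the backend line only lets the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

# Made-up matrix: each cell becomes one colored rectangle
values = np.array([[0.0, 0.5, 1.0],
                   [1.5, 2.0, 2.5]])

fig, ax = plt.subplots()
# Larger values are drawn in darker green; edgecolors draws the cell borders
mesh = ax.pcolor(values, cmap=plt.cm.Greens, edgecolors='k')
# The colorbar shows which color corresponds to which value
fig.colorbar(mesh, ax=ax)
```

To view the result, remove the matplotlib.use line and call plt.show(), or write the figure to a file with fig.savefig.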

A small example

As before, start by importing the libraries, then load the data set, process the data, draw the image, and add some labels and decoration to it. I usually explain things in code comments. If there is anything you do not understand, you can leave a message and I will reply in time.

# Import the libraries used below
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt
# Load the data set
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']
# Preprocess the data
# Center each column on its mean (with_std=False skips the variance scaling)
x = scale(x, with_std=False)
x_ = x[1:26,] # take 25 samples (rows 1 to 25)
y_labels = range(1, 26)
# Draw the heat map
plt.close('all')
fig, ax = plt.subplots()
ax.pcolor(x_, cmap=plt.cm.Greens, edgecolors='k')
ax.set_xticks(np.arange(0, x_.shape[1])+0.5) # put the ticks at the center of each cell
ax.set_yticks(np.arange(0, x_.shape[0])+0.5)
ax.xaxis.tick_top() # show the x-axis ticks above the plot
ax.yaxis.tick_left() # show the y-axis ticks on the left of the plot
ax.set_xticklabels(col_names, minor=False, fontsize=10) # set the tick labels
ax.set_yticklabels(y_labels, minor=False, fontsize=10)
plt.show()

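So what does the drawn image look like?

[Figure: heat map of the 25 centered iris samples, one row per sample and one column per feature, in shades of green]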

With just these few simple steps, the data is turned into an intuitive image. Of course, real use will not be this simple, and more knowledge will be needed.

Statement

This article is reproduced from 博客园 (cnblogs). If there is any infringement, please contact admin@php.cn to have it removed.
