We have learned about loops in almost all programming languages. So, by default, whenever there is a repetitive operation, we start implementing loops. But when we're dealing with a lot of iterations (millions/billions of rows), using loops is a real pain, and you might get stuck for hours, only to realize later that it doesn't work. This is where implementing vectorization in Python becomes super critical.
What is vectorization?
Vectorization is a technique for implementing (NumPy) array operations on data sets. Behind the scenes, it operates on all elements of the array or series at once (unlike a 'for' loop, which operates one row at a time).
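As a minimal sketch of that difference, the same element-wise calculation can be written as a Python loop or as a single NumPy expression. The array `prices` and the 1.1 multiplier are made-up illustration values, not taken from the examples later in this post.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
prices = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: Python handles one element per iteration.
looped = []
for p in prices:
    looped.append(p * 1.1)

# Vectorized version: one expression that NumPy applies to the whole array.
vectorized = prices * 1.1

# Both approaches should produce the same numbers.
print(np.allclose(looped, vectorized))
```

The vectorized line contains no explicit Python loop; the iteration happens inside NumPy's compiled code, which is where the speedup comes from.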
In this blog, we will look at some use cases where we can easily replace Python loops with vectorization. This will help you save time and become more proficient at coding.
Use Case 1: Finding the Sum of Numbers
First, let’s look at a basic example of finding the sum of numbers in Python using loops and vectors.
Using loops
import time

start = time.time()

# Sum using iteration
total = 0
# Iterate over 1.5 million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))
end = time.time()
print(end - start)

# 1124999250000
# 0.14 seconds
Using vectorization
import time
import numpy as np

start = time.time()

# Vectorized sum using NumPy
# np.arange creates the sequence of numbers from 0 to 1499999
print(np.sum(np.arange(1500000)))

end = time.time()
print(end - start)

# 1124999250000
# 0.008 seconds
Compared with iterating over range, the vectorized version runs about 18 times faster. This difference becomes even more apparent when working with a Pandas DataFrame.
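A single time.time() measurement can be noisy, so the comparison may be easier to reproduce with the standard library's timeit module. The sketch below is one way to do that, not the method used for the numbers above. The helper names `loop_sum` and `vectorized_sum` and the smaller size `N` are assumptions chosen so the example runs quickly.

```python
import timeit

import numpy as np

N = 100_000  # smaller than the 1.5 million above so the sketch runs quickly


def loop_sum(n):
    """Sum 0..n-1 with a plain Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum 0..n-1 with NumPy; int64 is requested explicitly to avoid overflow."""
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches must agree before their timings are worth comparing.
assert loop_sum(N) == vectorized_sum(N)

# Take the best of several repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=10, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=10, repeat=3))
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

The exact ratio depends on the machine, the Python version, and the array size, so the printed timings will differ from the figures quoted above.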
Use Case 2: DataFrame Mathematical Operations
In data science, when using Pandas DataFrame, developers use loops to create new derived columns for mathematical operations.
In the example below, we can see that in such use cases, loops can easily be replaced by vectorization.
Create DataFrame
A DataFrame is tabular data in the form of rows and columns.
We are creating a pandas DataFrame with 5 million rows and 4 columns filled with random integers from 0 to 49 (the upper bound passed to randint is exclusive).
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a', 'b', 'c', 'd'))
df.shape
# (5000000, 4)
df.head()
We will create a new column 'ratio' to find the ratio of columns 'd' and 'c'.
Using loops
import time

start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column
    df.at[idx, 'ratio'] = 100 * (row["d"] / row["c"])

end = time.time()
print(end - start)

# 109 seconds
Using vectorization
start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])
end = time.time()
print(end - start)

# 0.12 seconds
We can see a significant improvement with the DataFrame: compared to the Python loop, vectorization is almost 1000 times faster.
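One detail worth keeping in mind: column 'c' can contain zeros, and the vectorized division does not raise an error in that case. The sketch below uses a tiny made-up frame (`df_small` is a hypothetical name) to show what pandas returns and one possible way to clean it up. The cleanup step is an assumption about what you might want, not part of the original example.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame that includes zeros in column 'c'.
df_small = pd.DataFrame({"c": [5, 0, 0], "d": [10, 7, 0]})

# Vectorized ratio: 7/0 becomes inf and 0/0 becomes NaN; no exception is raised.
ratio = 100 * (df_small["d"] / df_small["c"])

# One possible cleanup (an assumption, not from the original post):
# treat every undefined ratio as a missing value.
df_small["ratio"] = ratio.replace([np.inf, -np.inf], np.nan)
print(df_small)
```

Whether missing values, a sentinel number, or dropped rows is the right choice depends on how the ratio column is used afterwards.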
Use Case 3: If-else statement on DataFrame
Many operations require "if-else" type logic. We can easily replace this logic with vectorized operations in Python.
Have a look at the example below to understand it better (we will use the DataFrame created in use case 2).
Suppose we want to create a new column 'e' based on some conditions on the existing column 'a'.
Using loops
import time

start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c

end = time.time()
print(end - start)

# Time taken: 177 seconds
Using vectorization
start = time.time()

# Start with the "else" case, then overwrite the rows that match each condition
df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']

end = time.time()
print(end - start)

# 0.28007707595825195 seconds
Compared to Python loops with if-else statements, the vectorized operations are about 600 times faster.
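The same three-way logic can also be written with NumPy's np.select, which takes a list of conditions and a matching list of values. This is an alternative formulation, not the approach timed above. The frame `df_demo` is a small hypothetical stand-in with the same column names, used so the sketch runs quickly.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same column names as the DataFrame above.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 50, size=(1000, 4)), columns=list("abcd"))

# Conditions are checked in order; the first one that is True wins,
# so "a <= 25" only applies to rows where a is not 0.
conditions = [
    df_demo["a"] == 0,
    df_demo["a"] <= 25,
]
choices = [
    df_demo["d"],
    df_demo["b"] - df_demo["c"],
]

# Rows matching neither condition fall through to the default (b + c).
df_demo["e"] = np.select(conditions, choices, default=df_demo["b"] + df_demo["c"])
print(df_demo.head())
```

Listing the conditions in priority order keeps the rule readable in one place, instead of relying on the order of successive df.loc assignments.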
Use Case 4: Solving Machine Learning/Deep Learning Networks
Deep learning requires solving many complex equations, often across millions or billions of rows. Running Python loops over these equations is very slow, and this is where vectorization is the better solution.
For example, suppose you want to calculate the y values for millions of rows in a multiple linear regression equation of the following form:
y = m1*x1 + m2*x2 + m3*x3 + ... + c
We can use vectorization instead of looping.
The values of m1, m2, m3, ... are determined by solving the above equation using the millions of values corresponding to x1, x2, x3, ... (for simplicity, we look at only a single multiplication step).
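To make that multiplication step concrete, here is a tiny worked instance of a regression prediction of the form y = m1*x1 + m2*x2 + m3*x3 + c. The coefficient values, the intercept `c`, and the two input rows are all made up so the result can be checked by hand.

```python
import numpy as np

# Hypothetical coefficients and intercept, chosen so the result is easy to verify.
m = np.array([[1.0, 2.0, 3.0]])  # shape (1, 3): m1, m2, m3
c = 0.5                          # intercept (assumed for illustration)

# Two hypothetical input rows, each holding x1, x2, x3.
x = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 0.0, 1.0],
])

# One matrix product evaluates the equation for every row at once.
y = x @ m.T + c
print(y.ravel())
```

The first row gives 1 + 2 + 3 + 0.5 = 6.5 and the second gives 2 + 0 + 3 + 0.5 = 5.5, which matches what a row-by-row loop would compute.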
Create data
>>> import numpy as np
>>> # Set the initial values of m
>>> m = np.random.rand(1, 5)
>>> m
array([[0.49976103, 0.33991827, 0.60596021, 0.78518515, 0.5540753]])
>>> # Input values for 5 million rows
>>> x = np.random.rand(5000000, 5)
Using loops
import time
import numpy as np

m = np.random.rand(1, 5)
x = np.random.rand(5000000, 5)

# Array to hold one result per row (not defined in the original snippet)
zer = np.zeros(5000000)

tic = time.process_time()
for i in range(0, 5000000):
    total = 0
    for j in range(0, 5):
        total = total + x[i][j] * m[0][j]
    zer[i] = total
toc = time.process_time()

print("Computation time = " + str(toc - tic) + " seconds")

# Computation time = 28.228 seconds
Using vectorization
tic = time.process_time()

# Dot product of x with the transposed coefficient row
np.dot(x, m.T)

toc = time.process_time()
print("Computation time = " + str(toc - tic) + " seconds")

# Computation time = 0.107 seconds
np.dot performs the matrix multiplication with vectorized code in the backend. Based on the timings above, it is roughly 260 times faster than the Python loops.
Final Thoughts
Vectorization in Python is very fast. When dealing with very large data sets, prefer vectorization over loops wherever possible. Over time, you will become accustomed to writing code with vectorization in mind.
