Goodbye, Python loops! Vectorization is amazing
Almost every programming language teaches us loops, so whenever an operation has to be repeated, we reach for a loop by default. But when we are dealing with a very large number of iterations (millions or billions of rows), loops become a real pain: you might wait for hours, only to realize later that the approach doesn't work. This is where vectorization in Python becomes critical.
Vectorization is a technique for applying (NumPy) array operations to whole data sets. Behind the scenes, the operation is applied to all elements of the array or Series in optimized, compiled code, unlike a 'for' loop, which handles one row at a time in the Python interpreter.
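To make the contrast concrete, here is a minimal sketch that doubles every element of an array, first with a loop and then with a single NumPy expression. The array values and variable names are made up for illustration.

```python
import numpy as np

# Illustrative data; the values are arbitrary.
values = np.array([1, 2, 3, 4, 5])

# Loop version: Python handles one element per iteration.
doubled_loop = []
for v in values:
    doubled_loop.append(int(v) * 2)

# Vectorized version: one expression, applied to the whole array by NumPy.
doubled_vec = values * 2

print(doubled_loop)          # [2, 4, 6, 8, 10]
print(doubled_vec.tolist())  # [2, 4, 6, 8, 10]
```

Both versions produce the same numbers; the vectorized one simply moves the per-element work out of Python code.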
In this blog, we will look at some use cases where we can easily replace Python loops with vectorization. This will help you save time and become more proficient at coding.
First, let’s look at a basic example of finding the sum of numbers in Python using loops and vectors.
    import time

    start = time.time()

    # Sum by iteration
    total = 0
    # Iterate over 1.5 million numbers
    for item in range(0, 1500000):
        total = total + item

    print('sum is:' + str(total))
    end = time.time()

    print(end - start)

    # 1124999250000
    # 0.14 seconds

    import time
    import numpy as np

    start = time.time()

    # Vectorized sum, using NumPy for vectorization
    # np.arange creates the sequence of numbers from 0 to 1499999
    print(np.sum(np.arange(1500000)))

    end = time.time()
    print(end - start)

    ## 1124999250000
    ## 0.008 seconds
In this run, the vectorized version executed about 18 times faster than iterating with the range function. The difference becomes even more apparent when working with Pandas DataFrames.
In data science, developers working with Pandas DataFrames often use loops to create new derived columns from mathematical operations.
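A single time.time() measurement can be noisy, so the comparison is easier to reproduce with the standard timeit module. The following is only a sketch: the size N and the function names loop_sum and vectorized_sum are hypothetical, and the actual speedup will vary by machine.

```python
import timeit

import numpy as np

N = 100_000  # hypothetical size, kept small so the sketch runs quickly


def loop_sum(n):
    """Sum 0..n-1 with a plain Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum 0..n-1 with NumPy; int64 avoids overflow where the default int is 32-bit."""
    return int(np.arange(n, dtype=np.int64).sum())


# Take the best of several repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=5, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=5, repeat=3))

print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```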
In the example below, we can see that in such use cases, loops can easily be replaced by vectorization.
A DataFrame is tabular data organized in rows and columns.
We create a pandas DataFrame with 5 million rows and 4 columns, filled with random integers from 0 to 49.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a','b','c','d'))
    df.shape
    # (5000000, 4)
    df.head()

We will create a new column 'ratio' that holds column 'd' as a percentage of column 'c'.
    import time

    start = time.time()

    # Iterating through the DataFrame using iterrows
    for idx, row in df.iterrows():
        # creating a new column
        df.at[idx, 'ratio'] = 100 * (row["d"] / row["c"])

    end = time.time()
    print(end - start)
    ### 109 seconds

    start = time.time()

    # Vectorized: operate on the whole columns in one expression
    df["ratio"] = 100 * (df["d"] / df["c"])

    end = time.time()
    print(end - start)
    ### 0.12 seconds
We can see a significant improvement with the DataFrame: compared to the Python loop, vectorization is almost 1000 times faster.
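One detail worth checking in this example: column 'c' is filled with random integers that can include 0, so some rows divide by zero. The sketch below uses a small, hypothetical frame (df_small) to show what the vectorized division returns in that case and one possible way to mark those rows as missing. It is an illustration, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; 'c' contains a zero on purpose.
df_small = pd.DataFrame({"c": [2, 0, 5], "d": [1, 3, 10]})

# Plain vectorized division: pandas returns inf where a nonzero value is divided by zero.
raw_ratio = 100 * (df_small["d"] / df_small["c"])

# Series.where keeps values where the condition holds and puts NaN elsewhere,
# so rows with a zero divisor end up as missing values instead of inf.
safe_divisor = df_small["c"].where(df_small["c"] != 0)
safe_ratio = 100 * (df_small["d"] / safe_divisor)

print(raw_ratio.tolist())   # [50.0, inf, 200.0]
print(safe_ratio.tolist())  # [50.0, nan, 200.0]
```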
Many operations require "if-else" type logic. We can easily replace this logic with vectorized operations in Python.
Have a look at the example below to understand it better (we reuse the DataFrame created in the previous example).
Suppose we want to create a new column 'e' based on conditions on the existing column 'a'.
    import time

    start = time.time()

    # Iterating through the DataFrame using iterrows
    for idx, row in df.iterrows():
        if row.a == 0:
            df.at[idx, 'e'] = row.d
        elif (row.a <= 25) and (row.a > 0):
            df.at[idx, 'e'] = row.b - row.c
        else:
            df.at[idx, 'e'] = row.b + row.c

    end = time.time()
    print(end - start)
    ### Time taken: 177 seconds

    start = time.time()

    # Start with the 'else' case, then overwrite the rows matching each condition
    df['e'] = df['b'] + df['c']
    df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
    df.loc[df['a'] == 0, 'e'] = df['d']

    end = time.time()
    print(end - start)
    ## 0.28007707595825195 sec
Compared to the Python loop with if-else statements, the vectorized operations are about 600 times faster.
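The same if / elif / else chain can also be written as a single expression with NumPy's np.select. The following is a minimal sketch on a small, hypothetical frame (df_small) with the same column names. It is an alternative to the df.loc assignments, not the approach that was timed above.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same column names as the example.
df_small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

a = df_small["a"].to_numpy()

# Conditions are checked in order and the first match wins,
# which mirrors the if / elif / else chain.
conditions = [a == 0, a <= 25, a > 25]
choices = [
    df_small["d"].to_numpy(),
    (df_small["b"] - df_small["c"]).to_numpy(),
    (df_small["b"] + df_small["c"]).to_numpy(),
]

# The three conditions cover every row, so np.select's default value is never used.
df_small["e"] = np.select(conditions, choices)
print(df_small["e"].tolist())  # [9, 4, 4, 12]
```

Because the conditions are listed in priority order, this form does not depend on the order of overwrites the way the df.loc version does.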
Deep learning requires us to solve many complex equations, often over millions or billions of rows. Running Python loops to solve these equations is very slow, and this is where vectorization is the best solution.
For example, suppose you want to calculate the y values for millions of rows in a multiple linear regression equation of the form:
y = m1*x1 + m2*x2 + m3*x3 + ... + c, where c is the intercept.
We can use vectorization instead of looping.
The values of m1, m2, m3, ... are determined by solving the regression equation using millions of values for x1, x2, x3, ... (for simplicity, we look at only the multiplication step).
    >>> import numpy as np
    >>> # Set the initial values of m
    >>> m = np.random.rand(1,5)
    >>> m
    array([[0.49976103, 0.33991827, 0.60596021, 0.78518515, 0.5540753]])
    >>> # Input values for 5 million rows
    >>> x = np.random.rand(5000000,5)

Using a loop:

    import time
    import numpy as np

    m = np.random.rand(1,5)
    x = np.random.rand(5000000,5)

    # Output array for the computed y values
    zer = np.zeros(5000000)

    tic = time.process_time()

    for i in range(0,5000000):
        total = 0
        for j in range(0,5):
            total = total + x[i][j]*m[0][j]
        zer[i] = total

    toc = time.process_time()
    print("Computation time = " + str((toc - tic)) + " seconds")

    #### Computation time = 28.228 seconds

Using vectorization (matrix multiplication of the vectors is implemented in the backend using vectorization):
    tic = time.process_time()

    # Dot product
    y = np.dot(x, m.T)

    toc = time.process_time()
    print("Computation time = " + str((toc - tic)) + " seconds")

    #### Computation time = 0.107 seconds

np.dot implements vectorized matrix multiplication in the backend. It is far faster than the loop in Python: 28.228 seconds versus 0.107 seconds in this run, roughly 260 times.

Final thoughts

Vectorization in Python is very fast. When dealing with very large data sets, prioritize vectorization over loops. Over time, you will gradually become accustomed to thinking about code in vectorized terms.
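One habit that helps when converting a loop to vectorized code is to check both versions against each other on a small sample first. The sketch below does this for the regression example. The sample size of 1000 rows and the seeded generator are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
m = rng.random((1, 5))          # coefficients, same shape as in the example
x = rng.random((1000, 5))       # hypothetical small sample of input rows

# Loop version, following the structure of the example.
y_loop = np.zeros(x.shape[0])
for i in range(x.shape[0]):
    total = 0.0
    for j in range(x.shape[1]):
        total += x[i][j] * m[0][j]
    y_loop[i] = total

# Vectorized version; the result has shape (1000, 1), so flatten it to compare.
y_vec = np.dot(x, m.T).ravel()

# allclose tolerates the tiny floating-point differences between the two methods.
print(np.allclose(y_loop, y_vec))  # True
```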