How to convert Scikit-learn's IRIS dataset into a dataset with only two features in Python?
Iris is a multivariate flower dataset and one of the most widely used datasets bundled with Python's scikit-learn. It contains 150 instances, 50 for each of three iris species (Iris setosa, Iris versicolor and Iris virginica), and records four features per flower: sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
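As a quick check of the dataset's layout, the sketch below loads Iris and counts the instances per species. The variable name `class_counts` is only illustrative.

```python
import numpy as np
from sklearn import datasets

# Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()

# Rows are flowers, columns are the four measurements
print(iris.data.shape)

# iris.target holds integer class labels (0, 1, 2); count each label
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```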
We can use Principal Component Analysis (PCA) to transform the IRIS dataset into a new feature space with 2 features by following the steps given below:
Step 1 - First, import the necessary packages from scikit-learn. We need the datasets and decomposition modules.
Step 2 - Load the IRIS dataset.
Step 3 - Print detailed information about the dataset.
Step 4 - Initialize Principal Component Analysis (PCA) with 2 components and call fit() to fit it to the data.
Step 5 - Transform the dataset into the new 2-feature space.
In the example below, we will transform the scikit-learn IRIS plant dataset into 2 features via PCA using the above steps.
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)

# Transforming iris data to new dimensions(with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations: ', X_iris_pca2.shape)
It will produce the following output -
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Features size : (150, 4)
Target names : ['setosa' 'versicolor' 'virginica']
New Dataset size after transformations: (150, 2)
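The separate fit() and transform() calls can also be combined. The sketch below compares the two-step version with scikit-learn's fit_transform() and with a manual projection (center the data, then multiply by the transposed components). The variable names are illustrative, and the manual projection assumes the default PCA settings (no whitening).

```python
import numpy as np
from sklearn import datasets, decomposition

X_iris = datasets.load_iris().data

# Two-step version: fit the model, then project the data
pca_a = decomposition.PCA(n_components=2)
pca_a.fit(X_iris)
X_two_step = pca_a.transform(X_iris)

# One-step version: fit and project in a single call
pca_b = decomposition.PCA(n_components=2)
X_one_step = pca_b.fit_transform(X_iris)

# Manual projection: subtract the per-feature mean, then project
# onto the principal axes stored in components_
X_manual = (X_iris - pca_a.mean_) @ pca_a.components_.T

print(X_one_step.shape)
print(np.allclose(X_two_step, X_one_step))
print(np.allclose(X_two_step, X_manual))
```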
We can also transform the Iris dataset into a new feature space with 3 features using the same statistical method, Principal Component Analysis (PCA). PCA linearly projects the data onto a new set of orthogonal axes, the principal components, which are ordered by how much of the data's variance each one captures.
The main concept behind PCA is to find the directions along which the data varies most and to build new features as linear combinations of the original ones. Keeping only the first few components gives us a new dataset that has fewer features but retains most, though not all, of the variance in the original dataset.
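To see how much information a reduced dataset keeps, one option is to look at the cumulative explained variance ratio and at the error made when mapping the reduced data back to the original space. The following is a minimal sketch of that check. The use of inverse_transform() and the mean squared error as a measure of loss are choices made here for illustration.

```python
import numpy as np
from sklearn import datasets, decomposition

X_iris = datasets.load_iris().data

# Keep all 4 components to see how the variance is distributed
pca_full = decomposition.PCA(n_components=4).fit(X_iris)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance ratio:", cumulative)

# Project to 2 components, then map back to the 4-feature space
pca_2 = decomposition.PCA(n_components=2).fit(X_iris)
X_restored = pca_2.inverse_transform(pca_2.transform(X_iris))

# Mean squared reconstruction error; it is non-zero because the
# variance along the two dropped components is discarded
mse = np.mean((X_iris - X_restored) ** 2)
print("Reconstruction MSE with 2 components:", mse)
```

The first two components together account for roughly 97.8% of the variance, which is why the 2-feature version is a reasonable summary of the data even though it is not lossless.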
In the example below, we will use PCA, initialized with 3 components, to transform the scikit-learn Iris plant dataset and then inspect the attributes of the fitted model.
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))
X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_3 = decomposition.PCA(n_components=3)
pca_3.fit(X_iris)

# Transforming iris data to new dimensions (with 3 features)
X_iris_pca3 = pca_3.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations : ', X_iris_pca3.shape)
print('\n')

# Getting the directions of maximum variance in data
print("Components : ", pca_3.components_)
print('\n')

# Getting the amount of variance explained by each component
print("Explained Variance:",pca_3.explained_variance_)
print('\n')

# Getting the fraction of total variance explained by each component
print("Explained Variance Ratio:",pca_3.explained_variance_ratio_)
print('\n')

# Getting the singular values for each component
print("Singular Values :",pca_3.singular_values_)
print('\n')

# Getting the estimated noise variance
print("Noise Variance :",pca_3.noise_variance_)
It will produce the following output -
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Features size : (150, 4)
Target names : ['setosa' 'versicolor' 'virginica']
Target size : (150,)
New Dataset size after transformations : (150, 3)
Components : [[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
 [ 0.65658877 0.73016143 -0.17337266 -0.07548102]
 [-0.58202985 0.59791083 0.07623608 0.54583143]]
Explained Variance: [4.22824171 0.24267075 0.0782095 ]
Explained Variance Ratio: [0.92461872 0.05306648 0.01710261]
Singular Values : [25.09996044 6.01314738 3.41368064]
Noise Variance : 0.02383509297344944
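The printed attributes are related to each other. The sketch below checks a few of those relationships numerically on the Iris data. The helper names (`gram`, `from_singular`, `leftover`) are illustrative, and the checks rely on the behavior of scikit-learn's PCA with its default settings.

```python
import numpy as np
from sklearn import datasets, decomposition

X_iris = datasets.load_iris().data
n_samples = X_iris.shape[0]

pca_3 = decomposition.PCA(n_components=3).fit(X_iris)
pca_4 = decomposition.PCA(n_components=4).fit(X_iris)

# Rows of components_ are orthonormal directions in feature space
gram = pca_3.components_ @ pca_3.components_.T
print(np.allclose(gram, np.eye(3)))

# Explained variance follows from the singular values:
# explained_variance_ = singular_values_**2 / (n_samples - 1)
from_singular = pca_3.singular_values_ ** 2 / (n_samples - 1)
print(np.allclose(from_singular, pca_3.explained_variance_))

# Each ratio is a component's variance divided by the total variance
total_variance = X_iris.var(axis=0, ddof=1).sum()
ratio = pca_3.explained_variance_ / total_variance
print(np.allclose(ratio, pca_3.explained_variance_ratio_))

# With 3 of 4 components kept, the noise variance is the variance
# of the single component that was dropped
leftover = pca_4.explained_variance_[3:].mean()
print(np.isclose(leftover, pca_3.noise_variance_))
```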