Over the past decade, machine learning has moved from scientific research labs to everyday web and mobile applications. Machine learning enables your application to perform previously difficult tasks, such as detecting objects and faces in images, detecting spam and hate speech, and generating smart replies for email and messaging applications.
However, performing machine learning is fundamentally different from classical programming. In this article, you will learn the basics of machine learning and create a basic model that can predict flower species based on flower measurements.
Classical programming relies on well-defined problems that can be broken down into distinct classes, functions, and if-else statements. Machine learning, on the other hand, develops its behavior from experience. Instead of giving a machine learning model rules, you train it through examples.
There are different categories of machine learning algorithms, each of which can solve specific problems.
Supervised learning is suitable for problems where you want to map input data to a known outcome. A common feature of all supervised learning problems is the existence of a ground truth against which the model can be tested, such as labeled images or historical sales data.
Supervised learning models can solve regression or classification problems. A regression model predicts a quantity (e.g. the number of goods sold or the price of a stock), while a classification model attempts to determine the category of the input data (e.g. cat/dog/fish/bird, fraud/non-fraud).
Image classification, face detection, stock price prediction and sales prediction are examples of problems that supervised learning can solve.
Some popular supervised learning algorithms include linear regression and logistic regression, support vector machines, decision trees and artificial neural networks.
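To make the difference between regression and classification concrete, here is a minimal sketch using Scikit-learn. The toy numbers and the variable names are made up for illustration and do not come from any real dataset.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up toy data: one input feature per example.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

# Regression: the target is a quantity (here exactly y = 2x).
y_quantity = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
regressor = LinearRegression().fit(X, y_quantity)
predicted_quantity = regressor.predict([[7.0]])[0]

# Classification: the target is a category (0 = "small", 1 = "large").
y_category = [0, 0, 0, 1, 1, 1]
classifier = LogisticRegression().fit(X, y_category)
predicted_category = classifier.predict([[7.0]])[0]

print(predicted_quantity)  # a continuous value
print(predicted_category)  # a class label
```

Both models are trained with the same fit call on labeled examples. Only the kind of target (a number versus a category) differs.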
Unsupervised learning is suitable for problems where you have data but no known outcomes, and you are looking for patterns. For example, you might want to group your data points into segments based on their similarity. This is called clustering. Alternatively, you may want to detect malicious network traffic that deviates from the normal activity of your business. This is called anomaly detection, which is another unsupervised learning task. Unsupervised learning can also be used for dimensionality reduction, a technique that simplifies machine learning tasks by removing irrelevant features.
Some popular unsupervised learning algorithms include K-means clustering and principal component analysis (PCA).
Reinforcement learning is a branch of machine learning in which agents try to achieve their goals by interacting with their environment. Reinforcement learning involves actions, states, and rewards. An untrained reinforcement learning agent starts by taking random actions. Each action changes the state of the environment. If the agent finds itself in a desired state, it receives a reward. The agent tries to find the sequence of actions and states that generates the most reward.
Reinforcement learning is used in recommendation systems, robotics, and game-playing bots, such as Google's AlphaGo and AlphaStar.
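As a rough illustration of clustering and dimensionality reduction, the sketch below runs K-means and PCA on Scikit-learn's bundled Iris measurements while ignoring the labels. The choice of three clusters and two components is an assumption made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # measurements only; the labels are deliberately ignored

# Clustering: group the 150 flowers into 3 clusters by similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_ids[:5])                      # cluster index assigned to the first 5 flowers
print(X_2d.shape)                           # number of rows and reduced columns
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

Note that PCA builds new features as combinations of the original ones rather than dropping individual columns.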
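The loop of actions, states, and rewards can be sketched with tabular Q-learning, one of several reinforcement learning algorithms. The five-cell corridor environment, the step helper, and the hyperparameter values below are all hypothetical and chosen only to keep the example small.

```python
import random

N_STATES = 5          # states 0..4 in a corridor; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]

for _ in range(200):  # episodes
    state, done = 0, False
    while not done:
        # Act randomly when exploring or when both values tie; otherwise exploit.
        if random.random() < EPSILON or q[state][0] == q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = step(state, ACTIONS[a])
        # Move the estimate toward the reward plus the discounted best future value.
        target = reward + (0.0 if done else GAMMA * max(q[next_state]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

# Best action learned for each non-goal state.
policy = ["left" if values[0] > values[1] else "right" for values in q[:-1]]
print(policy)
```

The agent starts with random moves, receives a reward only when it reaches the goal, and gradually learns which action is worth more in each state.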
In this article, we will focus on supervised learning, as it is the most popular branch of machine learning and its results are easier to evaluate. We will use Python because it has many features and libraries that support machine learning applications. However, the general concept can be applied to any programming language with similar libraries.
(If you are not familiar with Python, freeCodeCamp provides a great crash course to get you started.)
One of the Python libraries commonly used in data science and machine learning is Scikit-learn, which provides implementations of popular machine learning algorithms. Scikit-learn is not part of a basic Python installation, so you have to install it manually.
macOS and Linux typically come with Python preinstalled. To install the Scikit-learn library, type the following command in a terminal window:
<code>pip install scikit-learn</code>
Or for Python 3:
<code>python3 -m pip install scikit-learn</code>
On Microsoft Windows, you must first install Python. You can get the latest Python 3 installer for Windows from the official website. After Python is installed, type the following command in a command line window:
<code>python -m pip install scikit-learn</code>
Alternatively, you can install the Anaconda distribution, which includes a standalone Python 3 as well as Scikit-learn and many other libraries for data science and machine learning, such as NumPy, SciPy, and Matplotlib. You can find the installation instructions for the free individual edition of Anaconda on its official website.
The first step in every machine learning project is to understand the problem you want to solve. Defining the problem will help you determine the type of data you need to collect and give you an idea of which machine learning algorithm to use.
In our example, we want to create a model that predicts the species of a flower based on measurements of its petal and sepal length and width.
This is a supervised classification problem. We need to collect a list of measurements of different flower specimens and their corresponding species. We will then use this data to train and test a machine learning model that can map measurements to species.
One of the trickiest parts of machine learning is collecting data to train your model. You must find a source from which you can collect the amount of data needed to train the model. You also need to verify the quality of your data, make sure it represents the different situations the model will handle, and avoid collecting data that contains hidden biases.
Luckily, Scikit-learn contains several toy datasets that can be used to try out different machine learning algorithms. The "Iris dataset" happens to contain exactly the data required for our problem. We just need to load it from the library.
The following code loads the Iris dataset:
<code>from sklearn.datasets import load_iris

iris = load_iris()</code>
The Iris dataset contains 150 observations, each with four measurements (iris.data) and a target flower species (iris.target). You can see the names of the data columns in iris.feature_names:
<code>print(iris.feature_names)
'''
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
'''</code>
iris.target contains a numerical index (0-2) for each of the three flower species registered in the dataset. The names of the flower species can be found in iris.target_names:
<code>print(iris.target_names)
'''
['setosa' 'versicolor' 'virginica']
'''</code>
Before starting training, you must split the data into a training set and a test set. You will use the training set to train a machine learning model and use the test set to verify its accuracy.
This is done to check whether your model has overfit the training data. Overfitting is when your machine learning model performs well on training examples but not on unseen data. Overfitting may be caused by choosing the wrong machine learning algorithm, misconfiguring the model, poor training data, or too few training examples.
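A small experiment can make overfitting visible. The sketch below is an illustration only: it uses synthetic data with deliberately noisy labels and an unconstrained decision tree, both chosen for this example rather than taken from the flower project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 20 features, only 3 of them informative, and noisy labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree can memorize every training example, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

print(train_accuracy)  # accuracy on examples the tree has seen
print(test_accuracy)   # accuracy on held-out examples
```

A large gap between the two numbers is the typical signature of overfitting, and it only becomes visible because part of the data was held out.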
Depending on the type of problem you are solving and the amount of data you have, you must decide how much data to assign to the test set. Usually, when you have a lot of data (tens of thousands of examples or more), even a small sample of about 1% is enough to test your model. For the Iris dataset, which contains a total of 150 records, we will use a 75-25 split.
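Scikit-learn has a train_test_split function that splits the dataset into a training dataset and a test dataset:
<code>from sklearn.model_selection import train_test_split

# random_state can be any fixed integer; 42 is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)</code>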
train_test_split takes the data and target datasets and returns two pairs of datasets, one for training (X_train and y_train) and one for testing (X_test and y_test). The test_size parameter determines the fraction of the data (between 0 and 1) to be assigned to the test set. The stratify parameter ensures that the training and test arrays contain a balanced number of samples from each category. The random_state parameter exists in many Scikit-learn functions and is used to control the random number generator and make results reproducible.
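To see what these parameters do in practice, the following sketch repeats the split and inspects the result. The random_state value of 0 and the use of NumPy's bincount to count samples per class are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=0
)

# Number of rows and columns in each part of the split.
print(X_train.shape, X_test.shape)

# Samples per species in each part; stratify keeps these counts balanced.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Running the script twice with the same random_state produces the same split, which makes experiments repeatable.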
Now that our data is ready, we can create a machine learning model and train it on the training set. There are many different machine learning algorithms that can solve the classification problem we are dealing with. In our case, we will use the "logistic regression" algorithm, which is very fast and is suitable for simple classification problems that do not contain too many dimensions.
Scikit-learn's LogisticRegression class implements this algorithm. After instantiating it, we will train it on our training set (X_train and y_train) by calling the fit method. This will adjust the parameters of the model to find the mapping between the measurements and the flower species.
<code>from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)</code>
Now that we have trained the model, we want to measure its accuracy. The LogisticRegression class has a score method that returns the accuracy of the model. First, we will measure the accuracy of the model on the training data:
<code>print(lr.score(X_train, y_train))</code>
This will return approximately 0.97, which means the model correctly predicts 97% of the training examples. That is pretty good considering that we only have about 37 training examples per species.
Next, we will check the accuracy of the model on the test set:
<code>print(lr.score(X_test, y_test))</code>
This will return about 0.95, slightly below the training accuracy, which is natural because these are examples the model has never seen before. By collecting a larger dataset or trying another machine learning algorithm (such as a support vector machine), we may be able to further improve the accuracy of our model and close the gap between training and test performance.
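Trying a different algorithm usually takes only a few changed lines, because Scikit-learn estimators share the same fit and score methods. The sketch below swaps in a support vector machine with default settings. The split seed and variable names are assumptions for this example, and the resulting accuracy may be higher or lower than that of logistic regression.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=0
)

# Support vector classifier with Scikit-learn's default settings.
svm = SVC()
svm.fit(X_train, y_train)

svm_train_accuracy = svm.score(X_train, y_train)
svm_test_accuracy = svm.score(X_test, y_test)
print(svm_train_accuracy, svm_test_accuracy)
```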
Finally, we want to see how to use the trained model on new examples. The LogisticRegression class has a predict method that takes an array of observations as input and returns the predicted categories. In the case of our flower classifier, we need to provide it with an array of four measurements (sepal length, sepal width, petal length, petal width), and it will return an integer representing the species of the flower:
<code># Illustrative measurements in cm: sepal length, sepal width, petal length, petal width
output = lr.predict([[4.4, 3.2, 1.3, 0.2]])
print(output)
print(iris.target_names[output])</code>
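Congratulations! You created your first machine learning model. We can now combine it into an app that takes measurements from the user and returns the flower species:
<code>sepal_length = float(input("Sepal length (cm): "))
sepal_width = float(input("Sepal width (cm): "))
petal_length = float(input("Petal length (cm): "))
petal_width = float(input("Petal width (cm): "))

output = lr.predict([[sepal_length, sepal_width, petal_length, petal_width]])
print("Predicted species:", iris.target_names[output[0]])</code>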
Hopefully, this is your first step toward mastering machine learning. From here, you can go on to study other machine learning algorithms, learn more about the basic concepts of machine learning, and move on to more advanced topics such as neural networks and deep learning. With some study and practice, you will be able to create remarkable applications that can detect objects in images, process voice commands, and hold conversations with users.
To start learning to use Python for machine learning, you need a basic understanding of Python programming. It is also beneficial to be familiar with libraries like NumPy, Pandas, and Matplotlib. Furthermore, a basic understanding of statistics and probability is crucial because they form the core of machine learning algorithms.
Python is one of the most popular machine learning languages due to its simplicity and readability. It has a wide range of libraries and frameworks such as Scikit-learn, TensorFlow, and PyTorch that simplify the development of machine learning models. Other languages like R and Java are also used in machine learning, but Python’s extensive ecosystem makes it the first choice for many.
Python's Scikit-learn library provides implementations of various machine learning algorithms. Some commonly used algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors. For deep learning, you can use libraries like TensorFlow and PyTorch.
You can use techniques such as cross-validation and train-test splitting to verify the performance of your model. Python's Scikit-learn library provides functions for both. Additionally, you can use metrics such as accuracy, precision, recall, and F1 score for classification problems, and mean squared error or R squared for regression problems.
Yes, Python supports both supervised and unsupervised learning. Libraries such as Scikit-learn can be used to implement supervised learning algorithms for regression and classification. For unsupervised learning, you can use clustering algorithms such as K-means, hierarchical clustering, and DBSCAN.
Techniques such as regularization, early stopping, and dropout (for neural networks) can be used to handle overfitting. You can also use ensemble methods such as bagging and boosting to reduce overfitting.
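As a minimal sketch of these evaluation tools, the code below runs 5-fold cross-validation on the Iris dataset and prints a per-class report. The number of folds and the max_iter setting are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score

iris = load_iris()
model = LogisticRegression(max_iter=1000)

# Accuracy on each of 5 folds; every fold is held out once for testing.
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores, scores.mean())

# Out-of-fold predictions for every sample, used for per-class metrics.
predictions = cross_val_predict(model, iris.data, iris.target, cv=5)
report = classification_report(
    iris.target, predictions, target_names=iris.target_names
)
print(report)  # precision, recall and F1 score for each species
```

Cross-validation gives a more stable estimate than a single split when the dataset is as small as this one.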
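Regularization can be tried directly on Scikit-learn's LogisticRegression, whose C parameter is the inverse of the regularization strength. The sketch below is only an illustration: the three C values, the split seed, and the max_iter setting are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=0
)

coef_sizes = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger regularization, which shrinks the coefficients.
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    coef_sizes[C] = float(np.abs(model.coef_).sum())
    print(C, round(coef_sizes[C], 2),
          model.score(X_train, y_train), model.score(X_test, y_test))
```

Comparing training and test accuracy across the C values shows how much regularization a given problem tolerates before the model becomes too simple.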
Data preprocessing is a key step in machine learning. It includes cleaning up data, processing missing values, encoding categorical variables, and scaling features. Python provides libraries such as Pandas and Scikit-learn, which can perform efficient data preprocessing.
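The preprocessing steps listed above can be sketched with Scikit-learn alone. The tiny arrays below are made up for illustration, and filling missing values with the column mean is just one of several possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numeric columns (one value missing) and one categorical column.
numeric = np.array([[5.1, 3.5], [4.9, np.nan], [6.2, 2.9], [5.9, 3.0]])
colors = np.array([["purple"], ["white"], ["purple"], ["blue"]])

# Fill missing numbers with the column mean, then scale each column
# to zero mean and unit variance.
numeric_steps = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
numeric_clean = numeric_steps.fit_transform(numeric)

# Encode each category as its own 0/1 column.
encoder = OneHotEncoder(handle_unknown="ignore")
colors_encoded = encoder.fit_transform(colors).toarray()

# Combine both parts into one feature matrix ready for a model.
features = np.hstack([numeric_clean, colors_encoded])
print(features.shape)
```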
You can use libraries such as Matplotlib and Seaborn to visualize the performance of your model. These libraries provide functions for plotting graphs such as confusion matrices, ROC curves, and learning curves.
Yes, Python provides libraries such as NLTK and spaCy for natural language processing. These libraries provide functions such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.
You can use web frameworks such as Flask or Django to deploy machine learning models. For large-scale deployments, you can use cloud platforms such as AWS, Google Cloud, or Azure. They provide services for model deployment, scaling and monitoring.