Exploring the application areas of semi-supervised learning and its related scenarios
With the growth of the Internet, enterprises can collect more and more data. This data helps companies understand their users better (for example, by building customer profiles) and improve the user experience. However, much of this data may be unlabeled. Labeling all of it manually raises two problems. First, manual labeling is time-consuming and inefficient: as the volume of data grows, more people must be hired, the work takes longer, and the cost rises. Second, as the user base grows, manual labeling struggles to keep pace with the growth of the data.
Semi-supervised learning refers to training a model using both labeled and unlabeled data. It usually constructs an attribute space from the labeled data and then extracts useful information from the unlabeled data to fill in (or reconstruct) that space. The initial training set is therefore usually divided into a labeled data set D1 and an unlabeled data set D2. The model is trained through basic steps such as preprocessing and feature extraction, and the trained model is then deployed to the production environment to serve users.
For the unlabeled data to supply "useful" information that supplements the labeled data, some assumptions must be made about the data distribution and related aspects. The basic assumption of semi-supervised learning is that p(x) contains information about p(y|x); that is, the unlabeled data should contain information that is useful for label prediction and that is either absent from the labeled data or difficult to extract from it. In addition, there are assumptions that serve specific algorithms. For example, the similarity hypothesis (smoothness hypothesis) states that in the attribute space constructed from the data samples, samples that are close or similar have the same label; the low-density separation hypothesis states that a decision boundary separating data with different labels exists in regions where there are few data samples.
The main purpose of the above assumptions is to establish that the labeled data and the unlabeled data come from the same data distribution.
There are many semi-supervised learning algorithms, which can be roughly divided into transductive learning (transductive models) and inductive learning (inductive models). The difference between the two lies in the choice of test data set used for model evaluation. In transductive semi-supervised learning, the data set whose labels need to be predicted is the unlabeled data set used for training, and the purpose of learning is to further improve the accuracy of those predictions. Inductive learning predicts labels for completely unseen data sets.
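The contrast between the two settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the greedy nearest-neighbor propagation, the function names transduce and induce, and the one-dimensional toy data are all hypothetical and do not come from the article. The transductive function returns labels only for the unlabeled points it was given, while the inductive function returns a predictor that can be applied to points it has never seen.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def transduce(X_labeled, y_labeled, X_unlabeled):
    """Transductive: label only the unlabeled points used in training.

    Hypothetical greedy rule: repeatedly take the unlabeled point that is
    closest to any already-labeled point and copy that neighbor's label.
    """
    known = list(zip(X_labeled, y_labeled))
    pending = list(X_unlabeled)
    labels = {}
    while pending:
        # Closest (pending point, already-labeled point) pair.
        point, (_, label) = min(
            ((p, k) for p in pending for k in known),
            key=lambda pair: sq_dist(pair[0], pair[1][0]),
        )
        labels[point] = label
        known.append((point, label))
        pending.remove(point)
    return labels


def induce(X_labeled, y_labeled, X_unlabeled):
    """Inductive: return a predictor usable on completely unseen points."""
    propagated = transduce(X_labeled, y_labeled, X_unlabeled)
    train = list(zip(X_labeled, y_labeled)) + list(propagated.items())

    def predict(point):
        # 1-nearest-neighbor over labeled plus propagated points.
        return min(train, key=lambda item: sq_dist(item[0], point))[1]

    return predict


# Toy one-dimensional data (made up for illustration).
X_labeled = [(0.0,), (10.0,)]
y_labeled = ["neg", "pos"]
X_unlabeled = [(1.0,), (2.0,), (3.0,), (8.5,)]

transductive_labels = transduce(X_labeled, y_labeled, X_unlabeled)
predict_new = induce(X_labeled, y_labeled, X_unlabeled)
unseen_label = predict_new((7.0,))  # a point that was not in the training data
```

The propagation step relies on the similarity hypothesis: a point inherits the label of its nearest labeled neighbor. Any other semi-supervised learner could take its place without changing the transductive/inductive distinction itself.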
In addition, a common semi-supervised learning algorithm proceeds as follows: first, train a model on the labeled data; next, use this model to pseudo-label the unlabeled data; then combine the pseudo-labeled data and the labeled data into a new training set and train a new model on it; finally, use this new model to label the prediction data set.
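The pseudo-labeling steps described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the nearest-centroid classifier, the helper names fit_centroids, predict, and self_train, and the toy data are assumptions chosen to keep the example dependency-free. The article does not specify which base model to use.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def fit_centroids(X, y):
    """Train a nearest-centroid classifier: return {label: centroid}."""
    sums, counts = {}, {}
    for point, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], point))


def self_train(X_labeled, y_labeled, X_unlabeled):
    # Step 1: train a model on the labeled data only.
    base_model = fit_centroids(X_labeled, y_labeled)
    # Step 2: use that model to pseudo-label the unlabeled data.
    pseudo_labels = [predict(base_model, p) for p in X_unlabeled]
    # Step 3: merge pseudo-labeled and labeled data into a new training set.
    X_all = list(X_labeled) + list(X_unlabeled)
    y_all = list(y_labeled) + pseudo_labels
    # Step 4: train a new model on the combined set.
    return fit_centroids(X_all, y_all)


# Toy data (made up): D1 is labeled, D2 is unlabeled.
D1_X = [(0.0, 0.0), (4.0, 4.0)]
D1_y = ["a", "b"]
D2_X = [(0.5, 1.0), (1.0, 0.5), (3.0, 3.5), (3.5, 3.0)]

model = self_train(D1_X, D1_y, D2_X)
prediction = predict(model, (0.8, 0.9))  # label a point from the prediction set
```

Practical variants often keep only the pseudo-labels the base model is confident about and repeat the loop several times. That refinement is omitted here for brevity.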
The biggest problem with semi-supervised learning is that in many cases the performance of the model depends on the labeled data set, and the quality requirements for that data set are relatively high. The prediction accuracy of a semi-supervised model may even be little different from that of a supervised model trained on the labeled data set alone, while the semi-supervised model consumes more resources in order to extract effective information from the unlabeled data. Therefore, the development direction of semi-supervised learning is to improve the robustness of the algorithms and the effectiveness of information extraction from data.
Currently, PU-Learning (positive and unlabeled learning) is a popular family of algorithms in the field of semi-supervised learning. This type of algorithm is mainly applied to data sets that contain only positive samples and unlabeled data. Its advantage is that in certain scenarios a reliable positive sample set can be obtained relatively easily and in relatively large volume. For example, in spam detection, a large amount of normal email data is easy to obtain.
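To make the positive-plus-unlabeled setting concrete, here is a rough sketch loosely in the spirit of two-step PU approaches: first pick "reliable negatives" from the unlabeled data, then train an ordinary classifier on positives versus those negatives. Everything specific here is an assumption for illustration, including the distance-to-centroid heuristic, the negative_fraction parameter, the function name pu_two_step, and the two made-up numeric features per email. The article does not prescribe a particular PU algorithm.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def pu_two_step(positives, unlabeled, negative_fraction=0.5):
    """Return a predicate that says whether a point looks positive.

    negative_fraction is a hypothetical knob: the share of unlabeled
    points, farthest from the positives, treated as reliable negatives.
    """
    pos_center = centroid(positives)
    # Step 1: unlabeled points farthest from the positive centroid are
    # assumed (heuristically) to be reliable negatives.
    ranked = sorted(unlabeled, key=lambda p: sq_dist(p, pos_center), reverse=True)
    n_neg = max(1, int(len(ranked) * negative_fraction))
    neg_center = centroid(ranked[:n_neg])

    # Step 2: nearest-centroid classifier, positives vs. reliable negatives.
    def is_positive(point):
        return sq_dist(point, pos_center) <= sq_dist(point, neg_center)

    return is_positive


# Made-up feature vectors: known normal emails are the positive class,
# the unlabeled pool mixes normal email and spam.
normal_emails = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
unlabeled_emails = [(1.5, 1.5), (2.0, 2.0), (8.0, 8.0),
                    (9.0, 8.0), (8.0, 9.0), (9.0, 9.0)]

is_normal = pu_two_step(normal_emails, unlabeled_emails)
flags = [is_normal(email) for email in unlabeled_emails]
```

The heuristic only works when positives and negatives are reasonably well separated in feature space, which ties back to the distribution assumptions discussed earlier.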