Classification is an important data mining technique. Its purpose is to construct a classification function or classification model (often called a classifier) from the characteristics of a data set; the classifier maps samples of unknown class to one of a given set of classes. Both classification and regression can be used for prediction. The difference is that classification outputs discrete class values, while regression outputs continuous or ordered values.
The process of constructing a model is generally divided into two stages: training and testing. Before the model is constructed, the data set is randomly divided into a training data set and a test data set. In the training phase, the model is built from the training data set by analyzing database tuples described by attributes; each tuple is assumed to belong to a predefined class, determined by an attribute called the class label attribute. A single tuple in the training data set is also called a training sample, and a sample can take the form (u1, u2, ..., un; c), where ui is an attribute value and c is the class. Because the class label of each training sample is provided, this stage is also called supervised learning. Typically, the model is expressed as classification rules, a decision tree, or a mathematical formula. In the testing phase, the test data set is used to evaluate the classification accuracy of the model; if the accuracy is deemed acceptable, the model can be used to classify other data tuples. Generally speaking, the cost of the testing phase is much lower than that of the training phase.
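The two-stage workflow can be illustrated with a minimal sketch. This example assumes scikit-learn and its bundled iris data set are available; the decision tree classifier is only one possible choice of model.

```python
# Minimal sketch of the train/test workflow, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # tuples (u1, ..., un) and their class labels c

# Randomly divide the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Training phase: build the classifier from labeled training samples.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Testing phase: estimate classification accuracy on unseen tuples.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```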
In order to improve the accuracy, effectiveness and scalability of classification, data is usually preprocessed before classification, including:
(1) Data cleaning. Its purpose is to eliminate or reduce data noise and deal with missing values.
(2) Correlation analysis. Since many attributes in the dataset may not be relevant to the classification task, including these attributes will slow down and potentially mislead the learning process. The purpose of correlation analysis is to remove these irrelevant or redundant attributes.
(3) Data transformation. Data can be generalized to higher-level concepts. For example, the values of the continuous attribute "income" can be generalized to the discrete values low, medium, and high, and the nominal attribute "city" can be generalized to the higher-level concept "province". The data can also be normalized, which scales the values of a given attribute into a smaller interval such as [0, 1].
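A small sketch of the transformation step follows, assuming pandas is available; the column name "income" and the bin boundaries are illustrative only.

```python
# Generalization (discretization) and min-max normalization of a continuous attribute.
import pandas as pd

df = pd.DataFrame({"income": [12000, 45000, 90000, 30000]})

# Generalize the continuous attribute "income" to discrete levels.
df["income_level"] = pd.cut(df["income"],
                            bins=[0, 30000, 60000, float("inf")],
                            labels=["low", "medium", "high"])

# Min-max normalization: scale "income" into the interval [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

print(df)
```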
Types and characteristics of classification algorithms
Classification models can be constructed with decision trees, statistical methods, machine learning methods, neural network methods, and so on. Broadly, the main approaches include decision trees, association rules, Bayesian methods, neural networks, rule learning, the k-nearest neighbor method, genetic algorithms, rough sets, and fuzzy logic techniques.
Decision tree classification algorithm
A decision tree is an example-based inductive learning algorithm. It infers classification rules, represented as a decision tree, from a set of unordered training tuples. Working top-down and recursively, it compares attribute values at the internal nodes of the tree and branches downward from each node according to the attribute value; the leaf nodes are the classes into which the samples are divided. A path from the root to a leaf node corresponds to a conjunctive rule, and the entire decision tree corresponds to a set of disjunctive rules. In 1986, Quinlan proposed the famous ID3 algorithm, and in 1993 he proposed the C4.5 algorithm based on it. To meet the needs of processing large-scale data sets, several improved algorithms were later proposed, among which SLIQ (supervised learning in quest) and SPRINT (scalable parallelizable induction of decision trees) are two of the more representative ones.
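As a rough illustration of how a splitting attribute is chosen at an internal node, the sketch below computes the information gain criterion used by ID3 on a made-up categorical data set; it is not a full tree-building implementation.

```python
# Information gain: entropy of the whole set minus the weighted entropy after splitting.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain obtained by splitting the samples on the attribute at attr_index."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Each row is (outlook, windy); the class label is whether to play.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["no", "yes", "no", "yes"]
print(information_gain(rows, labels, 0))  # splitting on "outlook" gives no gain
print(information_gain(rows, labels, 1))  # splitting on "windy" separates the classes
```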
Bayesian classification algorithm
Bayesian classification algorithms are statistical classification methods: they use knowledge of probability and statistics to assign classes. In many cases, the naïve Bayes (NB) classifier is comparable in accuracy to decision tree and neural network classifiers. The method can be applied to large databases, is simple, achieves high classification accuracy, and is fast.
Because naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes, and this assumption often does not hold in practice, its classification accuracy may suffer. For this reason, many Bayesian classification algorithms that relax the independence assumption have been derived, such as the TAN (tree augmented Bayes network) algorithm.
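The independence assumption can be seen directly in a toy sketch of the naive Bayes computation: the score for each class is the class prior multiplied by one conditional probability per attribute. The tiny categorical data set is made up, and a practical implementation would add Laplace smoothing to avoid zero counts.

```python
# Toy naive Bayes: P(c | u1..un) is proportional to P(c) * product of P(ui | c).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    prior = Counter(labels)
    cond = defaultdict(Counter)            # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return prior, cond

def predict_nb(prior, cond, row):
    total = sum(prior.values())
    scores = {}
    for c, count in prior.items():
        p = count / total                  # class prior P(c)
        for i, v in enumerate(row):
            # Each attribute contributes independently given the class.
            p *= cond[(i, c)][v] / count
        scores[c] = p
    return max(scores, key=scores.get)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
prior, cond = train_nb(rows, labels)
print(predict_nb(prior, cond, ("sunny", "mild")))
```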