Data mining techniques include: 1. statistical techniques; 2. association rules; 3. memory-based reasoning (MBR); 4. genetic algorithms; 5. clustering; 6. link analysis; 7. decision trees; 8. neural networks; 9. rough sets; 10. fuzzy sets; 11. regression analysis; 12. difference analysis; 13. concept description, among others.
Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data.
The task of data mining is to discover patterns in data sets. Many types of patterns can be discovered; by function they fall into two categories: predictive patterns and descriptive patterns.
There are many kinds of data mining techniques, and they can be classified in different ways. The following focuses on thirteen commonly used techniques: statistical techniques, association rules, memory-based reasoning, genetic algorithms, clustering, link analysis, decision trees, neural networks, rough sets, fuzzy sets, regression analysis, difference analysis, and concept description.
1. Statistical techniques
Data mining draws on many scientific fields and techniques, statistics among them. The main idea of statistical techniques is to assume a distribution or probability model (such as a normal distribution) for a given data set, and then mine the data with methods appropriate to that model.
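As a minimal sketch of this idea, the example below (using a made-up, synthetic one-dimensional data set) fits a normal model by estimating its mean and standard deviation, then flags values that are unlikely under that assumed model:

```python
import numpy as np

# Assume a normal model for a synthetic one-dimensional data set.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50.0, scale=5.0, size=200), [95.0, 4.0]])

mu, sigma = data.mean(), data.std()   # estimate the model's parameters
z_scores = (data - mu) / sigma        # standardize values under the fitted model

# Values far from the assumed distribution are candidates for closer inspection.
unlikely = data[np.abs(z_scores) > 3]
print(f"model: N({mu:.1f}, {sigma:.1f}^2), unlikely values: {unlikely}")
```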
2. Association rules
Data association is an important type of discoverable knowledge in a database. If there is some regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to find the hidden association networks in a database. The association function of the data is usually not known in advance, and even when it is known it is uncertain, so the rules produced by association analysis carry a confidence (credibility) measure.
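A minimal sketch, using a made-up list of shopping-basket transactions, of the two measures on which association-rule mining (for example, Apriori-style algorithms) is based: the support of an itemset and the confidence of a rule.

```python
# Hypothetical shopping-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: {bread} -> {milk}
antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)   # the rule's credibility measure
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")
```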
3. Memory-based reasoning (MBR)
MBR first looks for similar situations in past (empirical) data and then applies the information from those situations to the current example. This is the essence of memory-based reasoning: find the neighbors that are most similar to a new record, then use those neighbors to classify the new data or estimate a value for it. Using MBR raises three main issues: finding suitable historical data; deciding the most effective way to represent that data; and choosing the distance function, the combination function, and the number of neighbors.
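A minimal sketch of MBR as nearest-neighbor classification, with hypothetical historical records and a Euclidean distance function; the combination function is a simple majority vote over the k neighbors.

```python
import numpy as np

# Hypothetical historical records: [feature1, feature2] with known class labels.
history = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.8, 8.1]])
labels  = np.array(["low", "low", "high", "high"])

def mbr_classify(new_record, k=3):
    """Classify a new record by majority vote of its k nearest historical neighbors."""
    distances = np.linalg.norm(history - new_record, axis=1)   # distance function
    nearest = labels[np.argsort(distances)[:k]]                # k nearest neighbors
    values, counts = np.unique(nearest, return_counts=True)    # combination function
    return values[np.argmax(counts)]

print(mbr_classify(np.array([7.5, 7.9])))   # -> "high"
```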
4. Genetic algorithms (GA)
A genetic algorithm is an optimization technique based on evolutionary theory that uses design methods such as genetic recombination (crossover), mutation, and natural selection. The main idea is that, following the principle of survival of the fittest, a new population is formed from the fittest rules in the current population together with the offspring of those rules. Typically, the fitness of a rule is evaluated by its classification accuracy on a training sample set.
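A toy sketch of the GA loop (selection, crossover, mutation) on a made-up fitness function; in rule mining the fitness would instead be a rule's classification accuracy on the training set.

```python
import random

# Toy GA that maximizes f(x) = -(x - 7)^2 over integers 0..31 encoded as 5-bit strings.
def fitness(bits):
    x = int(bits, 2)
    return -(x - 7) ** 2

def crossover(a, b):
    point = random.randint(1, len(a) - 1)        # genetic recombination
    return a[:point] + b[point:]

def mutate(bits, rate=0.1):
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in bits)

population = ["{:05b}".format(random.randint(0, 31)) for _ in range(20)]
for _ in range(30):                              # generations
    population.sort(key=fitness, reverse=True)   # survival of the fittest
    parents = population[:10]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children              # fittest rules plus their offspring

best = max(population, key=fitness)
print(best, int(best, 2))                        # converges toward x = 7
```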
5. Clustering (aggregation detection)
The process of grouping a collection of physical or abstract objects into multiple classes composed of similar objects is called clustering. A cluster generated by clustering is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. The degree of dissimilarity is calculated from the attribute values that describe the objects, and distance is a commonly used measure.
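A minimal sketch using k-means (one common clustering algorithm, here via scikit-learn) on made-up two-dimensional points; Euclidean distance serves as the dissimilarity measure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points that form two groups.
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]], dtype=float)

# Group objects so that members of a cluster are close (similar) to each other
# and far (dissimilar) from members of other clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # representative center of each cluster
```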
6. Link analysis
The basic theory underlying link analysis is graph theory. The idea is to find an algorithm that produces good, though not necessarily perfect, results, rather than to search for an algorithm with a perfect solution. Link analysis takes the view that if imperfect results are usable, such an analysis is a good analysis. Using link analysis, patterns can be extracted from the behavior of some users, and the resulting concepts can then be applied to a wider user group.
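A minimal sketch of link analysis on a made-up graph of call records: build an adjacency structure, ask a simple connectivity question (who has the most links), and traverse the graph to find who is reachable from a given user.

```python
# Hypothetical call records represented as an undirected graph (adjacency sets).
calls = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("dave", "erin"), ("bob", "dave")]

graph = {}
for a, b in calls:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# A simple link-analysis question: who is most connected?
most_connected = max(graph, key=lambda node: len(graph[node]))
print("most connected:", most_connected, "with", len(graph[most_connected]), "links")

# Graph traversal to find everyone reachable from one user.
frontier, reached = ["alice"], {"alice"}
while frontier:
    node = frontier.pop()
    for neighbor in graph[node] - reached:
        reached.add(neighbor)
        frontier.append(neighbor)
print("reachable from alice:", reached)
```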
7. Decision tree
A decision tree provides a way of displaying rules of the form "under these conditions, this value is obtained".
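A minimal sketch using scikit-learn on made-up customer records; the fitted tree can be printed as readable "if condition then value" rules and then applied to a new record.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [age, income] -> whether the customer buys a product.
X = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y = ["no", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The tree itself is a readable set of condition -> value rules.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[40, 70]]))   # apply the rules to a new record
```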
8. Neural network
In terms of structure, a neural network can be divided into an input layer, an output layer, and hidden layers. Each node in the input layer corresponds to a predictor variable, and the nodes of the output layer correspond to the target variables (there may be more than one). Between the input layer and the output layer are the hidden layers, which are invisible to users of the network; the number of hidden layers and the number of nodes in each layer determine the complexity of the neural network.
Apart from the input-layer nodes, each node of the neural network is connected to many nodes in the preceding layer (called the input nodes of this node), and each connection carries a weight Wxy. The value of a node is obtained by taking the sum of the products of the values of all its input nodes and the corresponding connection weights as the input of a function; this function is called the activation function, or squashing function.
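A minimal sketch of how a single node computes its value: a weighted sum of its input nodes (hypothetical values and weights) passed through a sigmoid, one common choice of activation function.

```python
import numpy as np

def sigmoid(z):
    """A common activation (squashing) function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical node with three input nodes.
input_values = np.array([0.5, 0.2, 0.9])    # values of the input nodes
weights      = np.array([0.4, -0.6, 0.3])   # connection weights Wxy
bias         = 0.1

# Node value = activation of the weighted sum of its inputs.
node_value = sigmoid(np.dot(input_values, weights) + bias)
print(node_value)
```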
9. Rough set
Rough set theory is based on establishing equivalence classes within the given training data. All data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes that describe the data. In real-world data there are often classes that cannot be distinguished using the available attributes; rough sets are used to approximately, or "roughly", define such classes.
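A minimal sketch on a made-up decision table: samples with identical attribute values form an equivalence class, and a target class is then bounded by a lower approximation (certainly in the class) and an upper approximation (possibly in the class).

```python
# Hypothetical decision table: (attribute values) -> class label.
samples = [({"color": "red",  "size": "big"},   "good"),
           ({"color": "red",  "size": "big"},   "bad"),    # indiscernible from the first
           ({"color": "blue", "size": "small"}, "good"),
           ({"color": "blue", "size": "big"},   "bad")]

# Equivalence classes: samples with identical attribute values are indiscernible.
classes = {}
for attrs, label in samples:
    classes.setdefault(tuple(sorted(attrs.items())), []).append(label)

target = "good"
lower = [k for k, labels in classes.items() if all(l == target for l in labels)]
upper = [k for k, labels in classes.items() if any(l == target for l in labels)]
print("lower approximation:", lower)   # certainly 'good'
print("upper approximation:", upper)   # possibly 'good'
```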
10. Fuzzy set
Fuzzy set theory introduces fuzzy logic into data mining classification systems, allowing "fuzzy" domain values or boundaries to be defined. Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree to which a particular value belongs to a given set, rather than using exact cutoffs for classes or sets. Fuzzy logic provides the convenience of working at a high level of abstraction.
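A minimal sketch of a fuzzy membership function: instead of an exact cutoff for "tall", membership ramps from 0.0 to 1.0 (the 160-190 cm boundary here is an illustrative assumption, not a standard).

```python
def membership_tall(height_cm):
    """Degree (0.0-1.0) to which a height belongs to the fuzzy set 'tall'.
    The 160-190 cm boundary is an illustrative assumption."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0   # linear ramp instead of an exact cutoff

for h in (155, 170, 185, 195):
    print(h, "->", round(membership_tall(h), 2))
```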
11. Regression analysis
Regression analysis is divided into linear regression, multiple regression, and nonlinear regression. In linear regression, the data are modeled with a straight line; multiple regression is an extension of linear regression involving several predictor variables; nonlinear regression adds polynomial terms to the basic linear model to form a nonlinear model.
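A minimal sketch on made-up observations: a straight-line fit (linear regression) and a fit with an added quadratic term, both via NumPy's polynomial fitting.

```python
import numpy as np

# Hypothetical observations of a predictor x and a response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 9.2, 16.5, 24.8])

# Linear regression: model the data with a straight line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)

# Adding a polynomial (quadratic) term gives a nonlinear model.
c2, c1, c0 = np.polyfit(x, y, deg=2)

print(f"line:      y = {a:.2f}x + {b:.2f}")
print(f"quadratic: y = {c2:.2f}x^2 + {c1:.2f}x + {c0:.2f}")
```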
12. Difference analysis
The purpose of difference analysis is to find anomalies in the data, such as noisy data, fraudulent data, and other abnormal data, and thereby obtain useful information.
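A minimal sketch on made-up transaction amounts, flagging values that deviate strongly from the bulk of the data using the interquartile-range rule (one simple way to look for anomalies).

```python
import numpy as np

# Hypothetical transaction amounts, including a few suspicious values.
amounts = np.array([12, 15, 14, 13, 16, 15, 14, 500, 13, 12, 480])

# Flag values that deviate strongly from the bulk of the data (IQR rule).
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

anomalies = amounts[(amounts < low) | (amounts > high)]
print("anomalies:", anomalies)   # -> [500 480]
```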
13. Concept description
Concept description describes the connotation of a certain class of objects and summarizes the relevant characteristics of that class. Concept description is divided into characteristic description and discriminant description: the former describes the common characteristics of a class of objects, while the latter describes the differences between objects of different classes. Generating a characteristic description of a class involves only the characteristics shared by all objects in that class.
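A minimal sketch on made-up objects of two classes: a characteristic description collects the attribute values shared by every object of a class, and a discriminant description looks at attributes on which the classes differ.

```python
# Hypothetical objects of two classes, described by the same attributes.
birds = [{"wings": "yes", "legs": 2, "color": "brown"},
         {"wings": "yes", "legs": 2, "color": "white"}]
dogs  = [{"wings": "no",  "legs": 4, "color": "brown"},
         {"wings": "no",  "legs": 4, "color": "black"}]

def characteristic_description(objects):
    """Attributes whose value is shared by every object of the class."""
    return {k: v for k, v in objects[0].items()
            if all(o[k] == v for o in objects)}

bird_desc = characteristic_description(birds)
dog_desc = characteristic_description(dogs)
print("birds:", bird_desc)   # common characteristics of the class
print("dogs: ", dog_desc)
# Discriminant description: attributes on which the two classes differ.
print("differences:", {k for k in bird_desc if bird_desc.get(k) != dog_desc.get(k)})
```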