Relation between Data Mining and Machine Learning


This article is a repost. I will add some of my own ideas to it in the future.

I recently looked at programs at schools abroad, which separate machine learning from data mining. Data mining courses mainly deal with databases: learning what data warehouses are and using software such as Oracle. Machine learning seems to be closer to statistics.

Statistics departments and computer science departments do very different work in data mining. My feeling is that statistics departments bring a subset of statistical methods into data mining (you can see that most data mining books only cover a few dozen methods), while computer science departments do database mining (software and database algorithms).

As for machine learning, I quote Professor Ripley's joke:

To paraphrase provocatively, ‘machine learning is statistics minus any
checking of models and assumptions’. — Brian D. Ripley (about the
difference between machine learning and statistics)
useR! 2004, Vienna (May 2004)

Generally speaking, classical statistics derives formulas and proves the properties of various models. Machine learning disregards all this and aims for algorithms that predict better. These models (machines) have the characteristic that they can learn by themselves and improve their prediction performance. Taken literally, that is how the name should be interpreted, but in fact not all machine learning algorithms have this “learning” characteristic. So I think it is just statistics dressed up in a new jacket.

After reading an article, I think that machine learning refers to the specific algorithms, while data mining also includes building and maintaining databases, data cleaning, data visualization and the use of results. Data mining should therefore integrate techniques from databases, human-computer interaction, statistical analysis and machine learning.

Machine learning, data mining and statistics look similar on the surface, but there are also very big differences among them.

The similarity is that they are all tools for data analysis. All three fields have methods for analyzing the same data, and the basic principles are very simple.

The difference is that:

Statistics places more stringent requirements on models. As Xie cited, we must consider the various properties of a model: its large-sample and small-sample behavior, whether the estimator is unbiased, how large its variance is, whether the Cramér-Rao (C-R) bound is achieved, whether it is consistent, and, ideally, the model should be checked as well. Machine learning seldom cares how its models behave in large samples, nor does it care about the traditional properties of estimators. This may be because its models are too complex for such properties to be proven mathematically (which also explains why statistics leans so heavily on the normal distribution: its properties make many models easy to study). It may also be because these models are usually applied to large amounts of data.

Machine learning is concerned with something else: error, including empirical error and structural error. As a simple example, neural networks and support vector machines are two very popular models, easy to understand and useful, but many people do not know where they come from, why they can be applied to such a wide range of data, or why their accuracy is so high. The reason behind this is simple: they optimize these two kinds of error respectively. Machine learning focuses on studying these two kinds of error, and through this research it has turned into a subject with a strong mathematical flavor that uses a lot of analysis. This is also the essential difference between machine learning and data mining.

Data mining only needs to design a fishing net (an algorithm) and cast it into a large amount of data to catch the patterns it needs. In many cases this really is much like fishing: it takes luck. So many people say it is a fisherman's job.
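The two kinds of error mentioned above can be illustrated with a small sketch. It assumes that "empirical error" means the average loss on the training sample and that "structural error" means that loss plus a penalty on model complexity, which is one common reading and not necessarily what the original author had in mind. The toy data, the function names and the penalty weight `lam` are all made up for illustration.

```python
# Illustrative sketch only: toy data and a hypothetical penalty weight `lam`.

def empirical_risk(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the training sample."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def structural_risk(w, b, xs, ys, lam):
    """Empirical risk plus a complexity penalty on the slope (ridge-style)."""
    return empirical_risk(w, b, xs, ys) + lam * w ** 2


def fit_line(xs, ys, lam=0.0):
    """Closed-form minimiser of structural_risk; lam=0 is ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    w = cov_xy / (var_x + lam)   # the penalty shrinks the slope towards zero
    b = mean_y - w * mean_x
    return w, b


# Invented training sample, roughly following y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
lam = 1.0

w_erm, b_erm = fit_line(xs, ys)        # minimises empirical risk only
w_srm, b_srm = fit_line(xs, ys, lam)   # minimises empirical risk + penalty

print("empirical-risk fit: ", w_erm, b_erm, empirical_risk(w_erm, b_erm, xs, ys))
print("structural-risk fit:", w_srm, b_srm, structural_risk(w_srm, b_srm, xs, ys, lam))
```

The penalised fit accepts a slightly larger training error in exchange for a simpler (flatter) model, which is the trade-off that a structural view of error is meant to capture.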

There is not much difference between statistical learning and machine learning, but there are some differences between statistical modeling and machine learning. Breiman (2001) wrote an article called "Statistical Modeling: The Two Cultures", which introduced the differences between them. Statistical modeling is based on the probability distribution of the data, so inference is very important in statistical models. Inferences such as hypothesis tests and confidence intervals rest on distributional assumptions. The central problem of machine learning is to minimize some measure of prediction error. The two approaches differ in how they see the world. The ultimate goal of statistical modeling is to obtain the probability distribution of the data: if the distribution that generated the data is known, all is well. Statistical modeling holds that the world can be approximated by probability distributions. Machine learning doesn't think so. It doesn't care what distribution the data come from, and it holds that the way the world operates can't be explained purely by probability distributions; neural networks are an example. Therefore, its goal is predictive accuracy. These are two ways of modeling and, in the final analysis, two ways of understanding the world.
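The contrast between the two approaches can be sketched in a few lines. The first half assumes the data are roughly normal and reports an approximate confidence interval for the mean, which is inference resting on a distributional assumption. The second half makes no such assumption and simply keeps whichever candidate predictor has the lower error on held-out data. All data, names and candidate predictors here are invented for illustration, and the interval uses the normal quantile 1.96 rather than a t quantile as a simplification.

```python
import math
import statistics

# --- Statistical-modeling view: inference under a distributional assumption ---
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.0]  # invented measurements


def normal_mean_ci(data, z=1.96):
    """Approximate 95% confidence interval for the mean, assuming normal data."""
    centre = statistics.mean(data)
    std_err = statistics.stdev(data) / math.sqrt(len(data))
    return centre - z * std_err, centre + z * std_err


ci_low, ci_high = normal_mean_ci(sample)
print("approximate 95% CI for the mean:", ci_low, ci_high)

# --- Machine-learning view: pick whatever predicts best on held-out data ---
train = [(0.0, 0.0), (1.0, 1.1), (2.0, 3.9), (3.0, 9.2), (4.0, 15.8)]
test = [(0.4, 0.2), (1.6, 2.6), (2.4, 5.8), (3.6, 13.0)]


def mean_predictor(x):
    """Ignore x and always predict the average training response."""
    return statistics.mean(y for _, y in train)


def nearest_neighbour(x):
    """Predict the response of the training point whose x is closest."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def holdout_mse(predict, data):
    """Mean squared prediction error on data not used for fitting."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


candidates = {"mean": mean_predictor, "nearest_neighbour": nearest_neighbour}
errors = {name: holdout_mse(f, test) for name, f in candidates.items()}
best_name = min(errors, key=errors.get)
print("held-out errors:", errors, "-> chosen:", best_name)
```

Neither half is the "right" one. The first answers a question about the data-generating distribution, the second a question about predictive accuracy.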

Statistical learning tends to be model-based, usually starting from a known model. Machine learning tends to mine information from the data itself using algorithms (decision trees, clustering, support vector machines, neural networks, etc.).

According to the Encyclopedia of Machine Learning, statistical learning is a subclass of machine learning:

Inductive Learning

Synonyms Statistical learning

Definition Inductive learning is a subclass of machine learning that
studies algorithms for learning knowledge based on statistical
regularities. The learned knowledge typically has no deductive
guarantees of correctness, though there may be statistical forms of

Of course, this classification means little, since the methods of these fields permeate each other. Statisticians like to call it statistical learning; computer science people like to call it machine learning, even though they do almost the same thing.