Chapter 6: Spark MLlib Machine Learning (1)


MLlib is the machine learning library provided by Spark. By calling the algorithms that MLlib encapsulates, you can build machine learning applications easily. It provides a wealth of machine learning algorithms, such as classification, regression, clustering and recommendation. In addition, MLlib standardizes the API for machine learning algorithms, which makes it easier to combine multiple algorithms into a single pipeline or workflow. In this article you will learn:

  • What is machine learning
  • Big data and machine learning
  • Machine learning classification
  • Introduction to Spark MLlib

Machine learning is a branch of artificial intelligence. It is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, computational complexity theory and more. Machine learning theory is mainly concerned with designing and analyzing algorithms that let computers “learn” automatically. Because learning algorithms draw heavily on statistical theory, machine learning is closely related to inferential statistics and is also known as statistical learning theory. In terms of algorithm design, machine learning theory focuses on learning algorithms that are both realizable and efficient.

Source: Mitchell, T. (1997). Machine learning. McGraw Hill

What is machine learning


Machine learning has been widely used in various branches of artificial intelligence, such as expert systems, automated reasoning, natural language understanding, pattern recognition, computer vision and intelligent robotics. Its main goal is to let machines learn from past experience, model the uncertainty in data and predict the future. Typical applications include search, recommendation systems, spam filtering, face recognition and speech recognition.

Big data and machine learning

In the era of big data, data is generated at an astonishing rate. The Internet, the mobile Internet, the Internet of Things, GPS and other sources produce data continuously. The storage and computing capacity required to process this data is also growing geometrically. As a result, a series of big data technologies represented by Hadoop emerged, providing a reliable foundation for storing and processing this data.

Data, information and knowledge form three levels, each smaller in volume and more refined than the one before. Raw data alone rarely explains a problem; experience has to be applied to convert it into information. Information is what eliminates uncertainty: the familiar phrase “information asymmetry” describes a situation where some uncertainty cannot be eliminated because sufficient information is not available. Knowledge is the highest level, which is why data mining is also called knowledge discovery.

The task of machine learning is to apply algorithms to big data and mine the knowledge hidden in it. The more training data is available, the more machine learning can show its advantages. With big data technology, problems that machine learning could not solve before become tractable, and performance on tasks such as speech recognition and image recognition improves greatly.

Machine learning classification

Machine learning is mainly divided into the following categories:

  • Supervised learning

    Supervised learning is often treated as nearly synonymous with classification, although it also covers regression. The supervision comes from the labeled examples in the training data set. For example, in the postcode recognition problem, a set of handwritten postcode images and their corresponding machine-readable transcriptions are used as training examples to supervise the learning of a classification model. Common supervised learning algorithms include linear regression, logistic regression, decision trees, naive Bayes, support vector machines and so on.

  • Unsupervised learning

    Unsupervised learning is essentially a synonym of clustering. The learning process is unsupervised because the input instances have no class labels. The task of unsupervised learning is to mine potential structure from a given data set. For example, photos of cats and dogs are given to the machine without any labels, in the hope that the machine can sort them. In the end the machine divides the photos into two groups, but it does not know which group contains cats and which contains dogs. For the machine, this is simply a split into two categories, A and B. Common unsupervised learning algorithms include K-means clustering, principal component analysis (PCA), etc.

  • Semi-supervised learning

    Semi-supervised learning is a machine learning technique that uses both labeled and unlabeled instances when learning a model. It lets the learner use unlabeled samples to improve learning performance without relying on external interaction.

    The practical demand for semi-supervised learning is very strong, because in real applications it is easy to collect a large number of unlabeled samples, but labeling them costs manpower and material resources. For example, in computer-aided medical image analysis, a large number of medical images can be obtained from hospitals, but it is unrealistic to expect medical experts to mark every lesion in those images. This situation of “little labeled data, much unlabeled data” is even more obvious in Internet applications. For example, web page recommendation requires users to mark the pages they are interested in, but few users are willing to spend much time providing labels. As a result there are few labeled pages, while countless web pages on the Internet can serve as unlabeled samples.

  • Reinforcement learning

    Also known as evaluative learning, reinforcement learning is an important machine learning method with many applications in intelligent robot control and in analysis and prediction. The common model for reinforcement learning is the standard Markov decision process (MDP).
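The difference between the first two categories can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy points in GROUP_1 and GROUP_2, the “cat”/“dog” labels and all function names are hypothetical, and the nearest-centroid classifier and the small K-means loop stand in for the richer algorithms listed above. It uses the standard library only, not Spark.

```python
import random

# Toy 2-D points forming two well-separated groups (hypothetical data).
GROUP_1 = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (0.9, 1.1)]
GROUP_2 = [(5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2)]


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit_nearest_centroid(labeled):
    """Supervised: labels are given, so every centroid keeps its class name."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: mean(points) for label, points in by_label.items()}


def predict(centroids, point):
    """Return the label of the closest class centroid."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


def kmeans(points, k, iterations=20, seed=0):
    """Unsupervised: no labels, clusters are identified only by an index."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: sq_dist(centers[c], p)) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return assignment


labeled = [(p, "cat") for p in GROUP_1] + [(p, "dog") for p in GROUP_2]
centroids = fit_nearest_centroid(labeled)
print(predict(centroids, (1.1, 1.0)))   # supervised: the answer is a named class

clusters = kmeans(GROUP_1 + GROUP_2, k=2)
print(clusters)                          # unsupervised: only anonymous cluster ids
```

The supervised model can answer with a class name because the names were in the training data. The K-means result separates the same two groups but can only call them cluster 0 and cluster 1, which corresponds to the “A and B” split described above.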

Introduction to Spark MLlib

MLlib is Spark’s machine learning library, which simplifies the engineering practice of machine learning. MLlib contains a wealth of machine learning algorithms: classification, regression, clustering, collaborative filtering, principal component analysis and so on. Currently, MLlib is divided into two code packages, spark.mllib and Spark ML, described in turn below.


Spark MLlib is an important part of Spark and was the machine learning library it provided initially. The library has a drawback: if the data set is complex and has to be processed in several passes, or if new data has to be scored by several trained single models, programs written with Spark MLlib become structurally complex and hard to understand and implement.

spark.mllib is the original algorithm API based on RDDs and is currently in maintenance mode. It covers four common kinds of machine learning algorithms: classification, regression, clustering and collaborative filtering. Note that no new functionality is being added to the RDD-based API.

Spark 1.2 introduced the ML pipeline API. After several releases, Spark ML overcomes some of MLlib’s shortcomings in handling machine learning problems (a complex and unclear workflow) and gives users a machine learning library based on the DataFrame API, which makes the whole process of building a machine learning application simpler and more efficient.

“Spark ML” is not an official name; it is used to refer to the MLlib library based on the DataFrame API. DataFrames provide a friendlier API than RDDs. Their benefits include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and a unified API across languages.

The Spark ML API provides many feature processing functions, such as feature selection, feature transformation, categorical encoding, normalization and dimensionality reduction. In addition, the DataFrame-based ML library supports building machine learning pipelines, which arrange the tasks of a machine learning workflow in order so that they are easy to run and migrate. It is the library officially recommended by Spark.
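The pipeline idea can be sketched in plain Python, assuming nothing beyond the standard library. This is not the Spark API: the classes Lowercase, Tokenize, VocabularyIndexer, VocabularyModel, Pipeline and PipelineModel are hypothetical names invented for this illustration. They only mirror the general pattern in which some stages simply transform data, other stages must first be fitted to data, and a pipeline runs them in a fixed order.

```python
class Lowercase:
    """Stateless stage: it only transforms its input."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class Tokenize:
    """Stateless stage: splits each string on whitespace."""

    def transform(self, rows):
        return [row.split() for row in rows]


class VocabularyIndexer:
    """Stage that has to learn from the data (fit) before it can transform."""

    def fit(self, rows):
        vocab = sorted({token for row in rows for token in row})
        return VocabularyModel({token: i for i, token in enumerate(vocab)})


class VocabularyModel:
    """Result of fitting VocabularyIndexer: maps known tokens to integer ids."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Tokens that were not seen during fitting are dropped.
        return [[self.index[t] for t in row if t in self.index] for row in rows]


class Pipeline:
    """Runs the stages in order, fitting those that need it."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # replace the stage by its fitted model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """A fitted pipeline: every stage can now transform new data."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = Pipeline([Lowercase(), Tokenize(), VocabularyIndexer()])
model = pipeline.fit(["Spark MLlib", "Spark ML pipeline"])
print(model.transform(["ML pipeline with Spark"]))
```

Because the whole workflow is captured in one fitted object, the same sequence of steps can be applied to new data without repeating the individual calls, which is the convenience the pipeline support described above is meant to provide.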

Data transformation

Data transformation is an important part of data preprocessing and includes normalization, discretization, deriving new indicators and so on. Spark ML provides a wealth of data transformation algorithms. For details, please refer to the official website; they are summarized as follows:

[Table: data transformation algorithms provided by Spark ML]

Among the transformation algorithms above, TF-IDF, Word2Vec and PCA are the most common. If you have done any text mining, they will be familiar.
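As a reminder of what TF-IDF computes, here is a minimal sketch in plain Python. It assumes raw term counts for TF and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is one common variant; the function name tf_idf and the sample documents are hypothetical, and the code does not use Spark.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed variant).
    idf = {
        term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()
    }
    # Term frequency is the raw count of the term inside the document.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]


docs = [
    ["spark", "mllib", "spark"],
    ["spark", "pipeline"],
    ["kmeans", "clustering"],
]
for weights in tf_idf(docs):
    print(weights)
```

In the first document “spark” occurs twice, yet “mllib” receives the higher weight because it appears in fewer documents. Down-weighting terms that are common across the corpus is the purpose of the IDF factor.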

Data reduction

Big data is the basis of machine learning and provides it with sufficient training data. When the amount of data is very large, data reduction techniques are needed to delete or reduce redundant dimensions and attributes and thereby shrink the data set. The idea is similar to sampling: the volume of data is reduced, but its integrity is preserved. The feature selection and dimensionality reduction methods provided by Spark ML are shown in the following table:

[Table: feature selection and dimensionality reduction methods provided by Spark ML]

Feature selection and dimensionality reduction are commonly used in machine learning. These methods can reduce the number of features, eliminate noise and preserve the structural characteristics of the original data. Principal component analysis (PCA) in particular plays a very important role in both statistics and machine learning.
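To make the idea of dimensionality reduction concrete, the following plain-Python sketch finds the first principal component of a small data set and projects the points onto it. It is an illustration under assumptions: it uses power iteration on the sample covariance matrix, which is a simple textbook technique and not necessarily how Spark computes PCA, and the function names and sample data are hypothetical.

```python
import math


def center(data):
    """Subtract the per-column mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue.

    Assumes the all-ones starting vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce every row to a single coordinate along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 2-D data lying roughly along the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
centered = center(data)
component = first_component(covariance(centered))
print(component)                     # direction of greatest variance
print(project(centered, component))  # the same points described by one number each
```

Each two-dimensional point is replaced by one coordinate along the direction of greatest variance, so the data set shrinks while most of its structure is kept.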

Machine learning algorithm

Spark supports classification, regression, clustering, recommendation and other common machine learning algorithms. See the table below:

[Table: machine learning algorithms supported by Spark]


This article gives a general introduction to machine learning, covering its basic concepts, its main categories and the Spark machine learning library. In the next part, I will share a machine learning application built on the Spark ML library, mainly involving the LDA topic model and K-means clustering.
