Mars is a matrix-based unified distributed computing framework. The previous article introduced what Mars is and how its distributed execution works, and Mars has since been open-sourced on GitHub. After reading that introduction you may ask what Mars can do. The answer depends almost entirely on what you want to do, because Mars, as an underlying computation library, implements 70% of the common numpy interfaces. This article shows how to use Mars to do what you want.
Singular value decomposition (SVD)
When dealing with complex data, the first thing a data processor thinks about is dimension reduction. SVD is one of the more common dimension reduction methods, and the numpy.linalg module provides an svd method. When we have 20,000 samples of 100-dimensional data to process, we call the SVD interface:
    In [1]: import numpy as np
    In [2]: a = np.random.rand(20000, 100)
    In [3]: %time U, s, V = np.linalg.svd(a)
    CPU times: user 4min 3s, sys: 10.2 s, total: 4min 13s
    Wall time: 1min 18s
Even though numpy uses MKL acceleration, the call takes more than one minute to run. With a larger data volume, the data no longer fits in the memory of a single machine.
Mars also implements SVD, and it is much faster than numpy here because its blocked matrix algorithm lets the computation run in parallel:
    In [1]: import mars.tensor as mt
    In [2]: a = mt.random.rand(20000, 100, chunk_size=100)
    In [3]: %time U, s, V = mt.linalg.svd(a).execute()
    CPU times: user 5.42 s, sys: 1.49 s, total: 6.91 s
    Wall time: 1.87 s
On the same amount of data, Mars is dozens of times faster (1.87 s versus 1 min 18 s of wall time), so decomposing the 20,000 samples takes under two seconds. Imagine the value of distributed matrix operations when decomposing a matrix built from Taobao user data.
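To make the idea of blocked computation concrete, here is a minimal sketch in plain numpy of one well-known blocked approach for tall matrices (a tall-skinny QR followed by a small SVD). The function name `blocked_svd` and the chunk size are illustrative assumptions, and this is not necessarily what Mars does internally.

```python
import numpy as np

def blocked_svd(a, chunk_rows):
    """SVD of a tall matrix via per-block QR (a TSQR-style sketch)."""
    # Step 1: QR-factor each row block independently; these could run in parallel.
    blocks = [a[i:i + chunk_rows] for i in range(0, a.shape[0], chunk_rows)]
    qs, rs = zip(*(np.linalg.qr(b) for b in blocks))
    # Step 2: QR-factor the stacked small R factors.
    q2, r = np.linalg.qr(np.vstack(rs))
    # Step 3: the small R has the same singular values as `a`.
    u_r, s, vt = np.linalg.svd(r)
    # Step 4: rebuild the left singular vectors block by block.
    offsets = np.cumsum([0] + [r_i.shape[0] for r_i in rs])
    u = np.vstack([
        q @ q2[offsets[i]:offsets[i + 1]] @ u_r for i, q in enumerate(qs)
    ])
    return u, s, vt

rng = np.random.default_rng(0)
a = rng.random((2000, 50))
u, s, vt = blocked_svd(a, chunk_rows=500)
print(np.allclose(u @ np.diag(s) @ vt, a))
```

Only the per-block QR step touches the full data, which is why the work splits naturally across processes or machines.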
Principal component analysis (PCA)
Principal component analysis is another important means of dimensionality reduction. PCA projects the data onto the directions that contain the most information, and the choice of projection direction can be understood in two ways: maximizing variance or minimizing projection error. That is, the original high-dimensional vector can be approximately reconstructed from its low-dimensional representation and the eigenvector matrix. The main formula is as follows:
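A standard way to write the variance-maximization objective, assuming centered samples and a unit-length projection direction, is sketched here. The sample count n is notation introduced for this sketch, and some texts divide by n - 1 instead of n, which does not change the resulting directions.

```latex
\max_{\mu_j}\ \frac{1}{n}\sum_{i=1}^{n}\left(x_i^{\top}\mu_j\right)^{2}
  = \mu_j^{\top} C\,\mu_j,
\qquad
C = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top},
\qquad
\text{subject to } \mu_j^{\top}\mu_j = 1 .
```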
Here x_i is the data of each sample and μ_j is the new projection direction. Our goal is to maximize the projection variance so as to find the principal features. The matrix C in the above formula is, mathematically, the covariance matrix. The input samples must of course be centered first. We can use a randomly generated array to see how PCA dimension reduction is done with numpy:
    import numpy as np

    a = np.random.randint(0, 256, size=(10000, 100))
    # Center each feature: rows are samples, columns are dimensions
    a_mean = a.mean(axis=0, keepdims=True)
    a_new = a - a_mean
    # 100 x 100 covariance matrix of the features
    cov_a = a_new.T.dot(a_new) / (a.shape[0] - 1)

    # Use SVD to find the top 20 eigenvectors of the covariance matrix
    U, s, V = np.linalg.svd(cov_a)
    V = V.T
    vecs = V[:, :20]

    # Represent the centered data with the low-dimensional eigenvectors
    a_transformed = a_new.dot(vecs)
Because randomly generated data has no strong structure, keeping the first 20 of the 100 dimensions here is purely illustrative. In practice, one usually keeps enough components for their eigenvalues to account for 99% of the total.
Here is the same computation in Mars:
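The 99% rule can be sketched in a few lines of numpy. The helper name `components_for_ratio` is hypothetical and introduced only for illustration; its input would be the eigenvalues (or the singular values `s`) of the covariance matrix.

```python
import numpy as np

def components_for_ratio(eigenvalues, ratio=0.99):
    """Smallest k whose top-k eigenvalues reach `ratio` of the total."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    # Small tolerance guards against rounding in the cumulative sum.
    k = int(np.searchsorted(cumulative, ratio - 1e-12)) + 1
    return min(k, len(vals))

# Two components already carry 99% of the total here.
k = components_for_ratio([90.0, 9.0, 0.5, 0.5], ratio=0.99)
print(k)
```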
    import mars.tensor as mt

    a = mt.random.randint(0, 256, size=(10000, 100))
    # Center each feature: rows are samples, columns are dimensions
    a_mean = a.mean(axis=0, keepdims=True)
    a_new = a - a_mean
    # 100 x 100 covariance matrix of the features
    cov_a = a_new.T.dot(a_new) / (a.shape[0] - 1)

    # Use SVD to find the top 20 eigenvectors of the covariance matrix
    U, s, V = mt.linalg.svd(cov_a)
    V = V.T
    vecs = V[:, :20]

    # Represent the centered data with the low-dimensional eigenvectors
    a_transformed = a_new.dot(vecs).execute()
Apart from the import, the only difference is the call to the execute method on the final variable whose data is needed. Once eager mode is finished, even execute can be omitted. Algorithms previously written with numpy can be converted almost seamlessly into multi-process and distributed programs, with no more need to hand-write MapReduce.
With the basic algorithm implemented in Mars, it can be applied to a real scenario. The best-known applications of PCA are face feature extraction and face recognition. A single face image has a very large number of dimensions, which makes it difficult for a classifier to handle. The well-known early face recognition algorithm, the eigenface algorithm, is based on PCA. This article uses a simple face recognition program as an example to show how Mars implements the algorithm.
The face database used here is the ORL face database. It contains 400 face images of 40 different people, each a grayscale image of 92 * 112 pixels. The first picture of each person is selected as the test picture, and the other nine are used as the training set.
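The split described above can be sketched with synthetic arrays standing in for the real images (loading the actual ORL files is omitted, and all variable names are illustrative). Each image is flattened into one row, so the training matrix has 360 rows of 92 * 112 = 10304 gray values.

```python
import numpy as np

# Synthetic stand-in for the ORL images: 40 people x 10 images of 112 x 92 pixels.
rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(40, 10, 112, 92), dtype=np.uint8)

# First image of each person is the test image; the other nine are for training.
test_images = faces[:, 0]
train_images = faces[:, 1:]

# Flatten each image into one row of gray values.
test_mat = test_images.reshape(40, -1).astype(np.float64)
train_mat = train_images.reshape(360, -1).astype(np.float64)

# Person index for each row, useful later for checking predictions.
test_labels = np.arange(40)
train_labels = np.repeat(np.arange(40), 9)

print(train_mat.shape, test_mat.shape)
```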
First, we use Python's OpenCV library to read all the pictures into one large matrix of size 360 * 10304. Each row holds the gray values of one face, for 360 training samples in total. The function below trains PCA on this data: data_mat is the input matrix, and k is the number of dimensions to keep.
    import mars.tensor as mt
    from mars.session import new_session

    session = new_session()

    def cov(x):
        # Rows are treated as variables, as in np.cov; the scaling
        # factor does not affect the eigenvectors
        x_new = x - x.mean(axis=1, keepdims=True)
        return x_new.dot(x_new.T) / (x_new.shape[1] - 1)

    def pca_compress(data_mat, k):
        data_mean = mt.mean(data_mat, axis=0, keepdims=True)
        data_new = data_mat - data_mean
        cov_data = cov(data_new)
        U, s, V = mt.linalg.svd(cov_data)
        V = V.T
        vecs = V[:, :k]
        data_transformed = vecs.T.dot(data_new)
        # Submit all three results in a single run
        return session.run(data_transformed, data_mean, vecs)
Because prediction and recognition come later, the function returns the mean and the low-dimensional basis vectors in addition to the transformed data. The average face appears as an intermediate result. The "average faces" of various regions that were popular a few years ago can be obtained this way. With so few dimensions and samples here, of course, the result is only just recognizable as a face.
The eigenfaces stored in data_transformed can also be viewed as images once each one is rearranged into its pixel layout. The picture shows 15 eigenfaces, which are enough to serve as a face classifier.
In addition, the function calls session.run because the three results to be returned are not independent of each other. In the current lazy execution mode, submitting them as three separate operations would repeat the shared computation, whereas submitting them together does not. We are also working on eager execution and on pruning the parts of the graph that have already been computed.
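Why a joint submission saves work can be illustrated with a toy lazy graph in plain Python. This is only a conceptual model under simplifying assumptions, not Mars's implementation; the `Node` class and `run` function are hypothetical names.

```python
class Node:
    """A toy lazy computation node (not Mars's implementation)."""
    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.runs = 0  # how many times this node was actually computed

def run(*outputs):
    """Evaluate several outputs in one submission, sharing intermediates."""
    cache = {}

    def evaluate(node):
        if id(node) not in cache:
            args = [evaluate(n) for n in node.inputs]
            node.runs += 1
            cache[id(node)] = node.func(*args)
        return cache[id(node)]

    return tuple(evaluate(o) for o in outputs)

shared = Node(lambda: sum(range(1000)))  # expensive upstream step
out_a = Node(lambda x: x + 1, shared)
out_b = Node(lambda x: x * 2, shared)
out_c = Node(lambda x: x - 3, shared)

# Three separate submissions: the shared step is recomputed each time.
run(out_a)
run(out_b)
run(out_c)
separate_runs = shared.runs

# One joint submission: the shared step is computed once.
shared.runs = 0
results = run(out_a, out_b, out_c)
joint_runs = shared.runs
print(separate_runs, joint_runs)
```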
After training, we can use the reduced-dimension data for face recognition. Each test image, which was held out from the training samples, is transformed into the reduced-dimension representation. We then use a simple Euclidean distance to measure its difference from each face in the training samples, and the face with the smallest distance is the recognized one. We can also set a threshold: if the minimum distance exceeds the threshold, recognition fails. On this data set the resulting accuracy is 92.5%, which means a simple face recognition algorithm has been built.
    # Calculate the Euclidean distance between two vectors
    def compare(vec1, vec2):
        distance = mt.linalg.norm(vec1 - vec2)
        return distance.execute()
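The matching rule described above, including the optional threshold, can be sketched in plain numpy as a nearest-neighbor search. The function name `recognize` and its parameters are illustrative assumptions, not part of the original program.

```python
import numpy as np

def recognize(train_features, train_labels, query, threshold=None):
    """Return the label of the nearest training sample, or None if too far."""
    # Euclidean distance from the query to every training sample.
    distances = np.linalg.norm(train_features - query, axis=1)
    best = int(np.argmin(distances))
    if threshold is not None and distances[best] > threshold:
        return None  # recognition fails
    return train_labels[best]

train_features = np.array([[0.0, 0.0], [10.0, 10.0]])
train_labels = np.array([0, 1])
print(recognize(train_features, train_labels, np.array([1.0, 1.0])))
```

Accuracy is then the fraction of test images whose returned label matches the true person.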
The above shows, step by step, how to use Mars to build a small face recognition algorithm. The numpy-like interface of Mars is very friendly to algorithm developers. When the scale of an algorithm exceeds the capacity of a single machine and it has to be extended to a distributed environment, you no longer need to care about the parallel logic behind it.
Of course, Mars still has a lot to improve. For example, the decomposition of the covariance matrix in PCA can be done by computing eigenvalues and eigenvectors directly, which requires far less computation than the SVD method. At present, however, the linear algebra module does not yet implement eigenvector computation. We will add this step by step, along with various higher-level algorithm interfaces from SciPy. If you have a need, you can raise an issue on GitHub or help us build Mars.
As Mars is a newly open-sourced project, we very much welcome any other ideas and suggestions. We need everyone to join us to make Mars better and better.
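For comparison, this is roughly what the eigenvalue route looks like in plain numpy, where `np.linalg.eigh` handles symmetric matrices. It is a sketch in numpy only, since the corresponding Mars interface is not yet available as noted above. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((500, 20))
x_centered = x - x.mean(axis=0, keepdims=True)
cov = x_centered.T @ x_centered / (x.shape[0] - 1)

# eigh is specialized for symmetric matrices; eigenvalues come back ascending,
# so reverse them to put the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# For a symmetric positive semi-definite matrix, the singular values
# equal the eigenvalues, so both routes give the same spectrum.
_, s, _ = np.linalg.svd(cov)
print(np.allclose(eigvals, s))
```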
This article is original content of the Yunqi Community and may not be reproduced without permission.