How Spark works with deep learning frameworks to handle unstructured data


As big data and AI services continue to converge, more and more business scenarios apply deep learning to unstructured data (such as images, audio, and text) during big data analysis and processing. This article describes how Spark works with deep learning frameworks to process unstructured data in big data workloads.

Spark introduction

Spark is the de facto standard for large-scale data processing, including machine learning workloads. It aims to unify big data processing and machine learning pipelines.

Spark uses the functional programming paradigm to extend the MapReduce model so that it supports more types of computation and covers a wide range of workflows. Spark caches data in memory to improve performance, which makes interactive analysis fast (much like working with a cluster from a Python interpreter). Caching also speeds up iterative algorithms, which makes Spark well suited to machine learning.

Spark provides APIs in Python, Scala, and Java, along with built-in modules for machine learning, stream processing, graph algorithms, and SQL-like queries. It has rapidly become one of the most important distributed computing frameworks today. Combined with YARN, Spark complements existing Hadoop clusters rather than replacing them. Recent Spark versions add support for Kubernetes (k8s), which makes it easier to integrate Spark with AI capabilities.

Introduction to deep learning framework


TensorFlow was originally developed by the Google Brain team in Google's machine intelligence research organization, building on DistBelief, the deep learning infrastructure Google developed in 2011. Thanks to Google's influence and promotion in the deep learning field, TensorFlow attracted great attention as soon as it was launched and quickly became the deep learning framework with the largest number of users.

TensorFlow is a low-level system, so it can be applied to many fields. However, its system design is complex, and learning its underlying mechanisms can be a painful process. Its interface has iterated rapidly without much regard for backward compatibility, so a lot of open source code no longer runs on newer versions of TensorFlow. This has also indirectly led to bugs in many third-party frameworks built on TensorFlow.


Keras was first released in March 2015. It offers an “API designed for humans rather than machines” and is supported by Google. It is a high-level neural network library for quickly building deep learning prototypes, written in pure Python. It uses TensorFlow, CNTK, Theano, or MXNet as the underlying engine and provides a simple, easy-to-use API that can greatly reduce users' workload in common applications.

Strictly speaking, Keras is not a deep learning framework. It is more of a deep learning interface built on top of third-party frameworks. Its drawback is clear: heavy encapsulation costs flexibility. Keras began as a high-level API for Theano, and TensorFlow and CNTK were later added as backends. Keras is easy to learn, but users soon hit a bottleneck because of this lack of flexibility. In addition, using Keras mostly means calling its interface, so it is hard to learn the inner workings of deep learning from it.


PyTorch, released in October 2016, is a lower-level API that works directly with array expressions. Its predecessor is Torch (a deep learning library based on the Lua language). Facebook AI Research (FAIR) provides strong support for PyTorch. PyTorch supports dynamic computation graphs and offers lower-level methods and more flexibility for mathematically inclined users. Many newly published papers now use PyTorch for their implementations, and it has become the preferred choice for academic research.


Caffe stands for Convolutional Architecture for Fast Feature Embedding. It is a clear and efficient deep learning framework, developed at the University of California, Berkeley at the end of 2013. Its core language is C++, and it supports command-line, Python, and MATLAB interfaces. An important feature of Caffe is the ability to train and deploy models without writing code. If you are proficient in C++ and comfortable with CUDA programming, Caffe is worth considering.

Using deep learning frameworks in Spark big data processing

A pre-trained model can be used in a Spark program and applied in parallel to large data sets. For example, take an image classification model that has been trained on a standard data set (such as ImageNet). A framework such as TensorFlow or Keras can be invoked in a Spark program for distributed prediction. Calling the pre-trained model during big data processing lets unstructured data be processed directly.
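One common way to organize distributed prediction is to load the model once per partition and reuse it for every record in that partition. The sketch below illustrates that idea in plain Python so it runs without Spark or Keras. The names `predict_partition`, `fake_load_model`, and `fake_predict` are hypothetical and are not part of any library.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def predict_partition(
    paths: Iterable[str],
    load_model_fn: Callable[[], object],
    predict_fn: Callable[[object, str], List],
) -> Iterator[Tuple[str, List]]:
    """Score every path in one partition, loading the model at most once."""
    model = None
    for path in paths:
        if model is None:
            # Lazy load: an empty partition never pays the loading cost.
            model = load_model_fn()
        yield path, predict_fn(model, path)


# Stand-ins for real framework calls (assumptions for illustration only).
load_count = {"n": 0}


def fake_load_model():
    load_count["n"] += 1
    return {"name": "fake-model"}


def fake_predict(model, path):
    return [("class-id", "placeholder-label", 1.0)]


results = list(
    predict_partition(["a.jpg", "b.jpg", "c.jpg"], fake_load_model, fake_predict)
)
```

In a Spark job, a generator like this could be passed to `rdd.mapPartitions`, with `load_model_fn` wrapping a real loader such as Keras's `load_model`. That wiring is left out here and would need to be adapted to the actual cluster setup.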

We focus on using Keras + TensorFlow for model inference in a Spark program.

The first step in using deep learning to process images is to load them. The ImageSchema added in Spark 2.3 provides utility functions that load millions of images into a Spark DataFrame and decode them automatically in a distributed manner, so the operation scales.

Use Spark's ImageSchema:

from pyspark.ml.image import ImageSchema
image_df = ImageSchema.readImages("/data/myimages")

You can also use Keras's image processing library:

from keras.preprocessing import image
img = image.load_img("/data/myimages/daisy.jpg", target_size=(299, 299))

A Spark DataFrame can be constructed from image paths:

import os
from pyspark.sql.types import StringType

def get_image_paths_df(sqlContext, dirpath, colName):
    # Collect the absolute paths of all .jpg files in dirpath
    files = [os.path.abspath(os.path.join(dirpath, f)) for f in os.listdir(dirpath) if f.endswith('.jpg')]
    return sqlContext.createDataFrame(files, StringType()).toDF(colName)
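The helper above matches only file names ending in lower-case '.jpg'. As a hedged sketch, the path-listing step could be factored into a plain Python function that accepts several extensions and ignores case. The name `list_image_paths` and its default extension list are assumptions for illustration.

```python
import os
from typing import List, Sequence


def list_image_paths(
    dirpath: str,
    extensions: Sequence[str] = (".jpg", ".jpeg", ".png"),
) -> List[str]:
    """Return sorted absolute paths of regular files in dirpath with a matching extension."""
    exts = tuple(e.lower() for e in extensions)
    paths = []
    for name in os.listdir(dirpath):
        full = os.path.abspath(os.path.join(dirpath, name))
        # Skip directories and compare extensions case-insensitively.
        if os.path.isfile(full) and name.lower().endswith(exts):
            paths.append(full)
    return sorted(paths)
```

The returned list could then be passed to `createDataFrame` in the same way as the `files` list in the helper above.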

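Load the pre-trained model using the Keras interface:

from keras.applications import InceptionV3
from keras.models import load_model
model = InceptionV3(weights="imagenet")
model.save('/tmp/model-full.h5')  # persist the full model (architecture + weights)
model = load_model('/tmp/model-full.h5')  # reload it from the saved file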

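Define the image recognition inference method:

import numpy as np
from keras.applications.inception_v3 import preprocess_input, decode_predictions
from keras.models import load_model
from keras.preprocessing import image

def iv3_predict(fpath):
    # Load the saved model (note: this reloads it on every call)
    model = load_model('/tmp/model-full.h5')
    img = image.load_img(fpath, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)  # add the batch dimension
    x = preprocess_input(x)
    preds = model.predict(x)
    # One list of (class, description, probability) tuples per input image
    preds_decode_list = decode_predictions(preds, top=3)
    res_list = []
    for pred in preds_decode_list[0]:
        # Convert the probability to a plain Python float
        res_list.append([pred[0], pred[1], float(pred[2])])
    return res_list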

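Define the schema of the inference result:

from pyspark.sql.types import ArrayType, FloatType, StringType, StructType

def get_labels_type():
    # Each prediction is a struct of (class, description, probability)
    ele_type = StructType()
    ele_type.add("class", data_type=StringType())
    ele_type.add("description", data_type=StringType())
    ele_type.add("probability", data_type=FloatType())
    return ArrayType(ele_type)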
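The value returned by iv3_predict has to line up with this schema: a list whose entries each hold a class string, a description string, and a float probability, in that order. The following pure-Python check is a sketch of that contract and can be used in a unit test before the function is registered with Spark. The helper name `matches_labels_type` is hypothetical, and the strict float check is an assumption (NumPy scalars are deliberately rejected, which is presumably why iv3_predict calls float()).

```python
def matches_labels_type(value):
    """Return True if value looks like a list of [class, description, probability] entries."""
    if not isinstance(value, list):
        return False
    for entry in value:
        if not isinstance(entry, (list, tuple)) or len(entry) != 3:
            return False
        class_id, description, probability = entry
        if not (isinstance(class_id, str) and isinstance(description, str)):
            return False
        # Assumption: require a plain Python float rather than a NumPy scalar.
        if type(probability) is not float:
            return False
    return True
```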

Register the inference method as a Spark UDF:

spark.udf.register("iv3_predict", iv3_predict, returnType=get_labels_type())

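Register the image paths as a data table:

df = get_image_paths_df(spark, "/data/myimages", "fpath")  # directory and column name match the other snippets
df.createOrReplaceTempView("_test_image_paths_df")  # table name used by the SQL query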

Use SQL statements to process pictures:

df_images = spark.sql("select fpath, iv3_predict(fpath) as predicted_labels from _test_image_paths_df")
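Each row of the query result pairs a file path with an array of (class, description, probability) entries. As a minimal sketch of how such rows might be summarized after they are collected, the helper below picks the most probable label per image. The name `top_label` and the sample rows are hypothetical values for illustration, not real model output.

```python
def top_label(predicted_labels):
    """Return (description, probability) of the most likely label, or None if empty."""
    if not predicted_labels:
        return None
    # Each entry is (class, description, probability). Index access works for
    # plain tuples and lists, and for pyspark Row objects (tuple subclasses).
    best = max(predicted_labels, key=lambda entry: entry[2])
    return best[1], best[2]


# Hypothetical collected rows: (fpath, predicted_labels)
sample_rows = [
    ("/data/myimages/daisy.jpg", [("id-1", "daisy", 0.95), ("id-2", "bee", 0.02)]),
    ("/data/myimages/empty.jpg", []),
]
summary = {fpath: top_label(labels) for fpath, labels in sample_rows}
```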


In the Spark big data engine, there are many scenarios where a deep learning framework loads a pre-trained model to process unstructured data. However, there are many deep learning frameworks, and models are tightly coupled to the framework they were built with. Installing and deploying deep learning frameworks and their dependencies in a big data environment is complex. It also complicates the management and maintenance of big data clusters and increases labor costs.

Huawei Cloud DLI service adopts a serverless big data architecture, so users do not need to be aware of the underlying physical cluster. DLI also has AI computing frameworks and their underlying dependency libraries (Keras, TensorFlow, scikit-learn, pandas, NumPy, etc.) built into the big data cluster. The latest version of DLI supports the k8s + Docker ecosystem and opens custom Docker images to users, so they can extend it with their own AI frameworks, models, and algorithm packages. On top of serverless, it gives users more open customization and extension capabilities.

DLI supports multiple engine modes. Enterprises can easily perform batch and stream processing on heterogeneous data sources using SQL or programs, mine and explore their data, reveal the patterns in it, and discover its potential value.
