Big data Hadoop — spark SQL + spark streaming



1、 Spark SQL overview

Spark SQL is the Spark module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and also acts as a distributed SQL query engine. Under the hood, Spark SQL is a further encapsulation of RDDs. For background, see the earlier article "Big data Hadoop: computing engine Spark" and the official Spark documentation.

2、 Spark SQL versions

1) Evolution of Spark SQL

  • Before 1.0: Shark (entry points: SQLContext and HiveContext)

    1. SQLContext: mainly used to construct and execute DataFrames. SQLContext is the program entry point of the SQL module in Spark.
    2. HiveContext: a subclass of SQLContext used specifically for Hive integration, such as reading Hive metadata, writing data into Hive tables, and using Hive window/analytic functions.
  • From 1.1.x: Spark SQL (for testing only)

  • 1.3.x: Spark SQL (stable release) + DataFrame

  • 1.5.x: Spark SQL + Project Tungsten

  • 1.6.x: Spark SQL + DataFrame + Dataset (beta)

  • 2.x:

    1. Entry point: SparkSession (the unified entry point of a Spark application), which combines SQLContext and HiveContext
    2. Spark SQL + DataFrame + Dataset (stable release)
    3. Spark Streaming -> Structured Streaming (Dataset)

2) Comparison between Shark and Spark SQL

  • Shark
    1. Plan optimization depends entirely on Hive, so adding new optimization strategies is inconvenient.
    2. Spark uses thread-level parallelism, while MapReduce uses process-level parallelism.
    3. Spark has thread-safety problems in its Hive-compatible implementation, which forced Shark to use a separate, independently maintained and patched branch of the Hive source code.
  • Spark SQL
    1. Continues to develop as a member of the Spark ecosystem and is no longer limited by Hive.
    2. Only maintains compatibility with Hive. Hive on Spark, by contrast, makes Spark one of Hive's underlying engines.
    3. Hive can use MapReduce, Tez, Spark and other engines.


  • SparkSession is a new concept introduced in Spark 2.0. It gives users a unified entry point to the various features of Spark.
  • In early versions of Spark, SparkContext was the main entry point. Since the RDD was the main API, SparkContext was used to create and operate on RDDs. Every other API needed its own context.

[Example] Streaming required StreamingContext, SQL required SQLContext, and Hive required HiveContext. As the Dataset and DataFrame APIs gradually became the standard APIs, they needed an entry point of their own. Spark 2.0 therefore introduced SparkSession as the entry point of the Dataset and DataFrame APIs. SparkSession encapsulates SparkConf, SparkContext and SQLContext. For backward compatibility, SQLContext and HiveContext are still available.

  • SparkSession is essentially a combination of SQLContext and HiveContext (StreamingContext may be added in the future), so the APIs available on SQLContext and HiveContext can also be used on SparkSession. The SparkContext is wrapped inside the SparkSession. In Spark 2.x, reading data through SparkSession is recommended over using the SparkContext object directly.
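
As a minimal sketch of the unified entry point described above, the following builds (or reuses) a SparkSession and reaches the wrapped contexts through it. The application name "session-demo" and the local[*] master are illustrative assumptions; inside spark-shell, getOrCreate() simply returns the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Build a session, or reuse the one that already exists (for example in spark-shell).
// "session-demo" and local[*] are placeholder values chosen for this sketch.
val session = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")
  .getOrCreate()

// The older entry points are reachable from the session.
val context = session.sparkContext   // SparkContext, for the RDD API
val sqlCtx  = session.sqlContext     // SQLContext, kept for backward compatibility

// An RDD created through the wrapped SparkContext.
val rdd = context.parallelize(Seq(1, 2, 3))

// A Dataset created directly through the session.
val ids = session.range(3)
```

New code normally only needs the session itself; the two context handles are shown to make the encapsulation visible.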

3、 RDD, DataFrame and Dataset

1) Relationship between the three

DataFrame and Dataset are RDD-based structured data abstractions provided by Spark SQL. They keep the characteristics of RDDs (immutability, partitioning, stored dependencies) and add structured information similar to that of a relational database. Programs written against the DataFrame and Dataset APIs are therefore optimized automatically, so developers do not need to hand-tune code at the underlying RDD API level, which greatly improves development efficiency. However, the RDD API still has advantages for processing unstructured data, such as text streams, and is more convenient for low-level operations.


An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection whose elements can be computed in parallel. RDDs have the characteristics of a data flow model: automatic fault tolerance, locality-aware scheduling and scalability. An RDD lets users explicitly cache a working set in memory when running multiple queries, so that subsequent queries can reuse it, which greatly improves query speed.

1. Core concepts

  • A list of partitions: the partition is the basic unit of a data set. Each partition of an RDD is processed by one computing task, so the number of partitions determines the granularity of parallelism. The number of partitions can be specified when the RDD is created; otherwise a default is used, which is the number of CPU cores allocated to the program.

  • A function that computes each partition: computation in Spark works partition by partition, and each RDD implements a compute function for this purpose. The compute function composes iterators, so the result of each intermediate step does not need to be saved.

  • Dependencies between RDDs: each transformation produces a new RDD, so RDDs form a pipeline-like chain of dependencies. When the data of some partitions is lost, Spark can use these dependencies to recompute only the lost partitions instead of recomputing every partition of the RDD.

  • A partitioner: the partitioning function of the RDD. Spark currently implements two kinds, the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a partitioner; for other RDDs the partitioner is None. The partitioner determines the number of partitions of the RDD itself and also the number of partitions of the shuffle output of its parent RDD.

  • A list of preferred locations: stores the preferred location for accessing each partition. For an HDFS file, this list holds the locations of the blocks that make up each partition. Following the principle that "moving computation is cheaper than moving data", Spark tries to schedule each task on a node that stores the data block to be processed.
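
The five properties above can be inspected directly on an RDD. The following is a small sketch that assumes it is run in spark-shell, where the SparkContext is available as `sc`; the partition counts (4 and 3) are arbitrary values chosen for illustration.

```scala
import org.apache.spark.HashPartitioner

// Assumes spark-shell, where the SparkContext is available as `sc`.

// 1. Partitions: ask for 4 partitions explicitly.
val nums = sc.parallelize(1 to 10, 4)
val numPartitions = nums.getNumPartitions

// 4. Partitioner: a plain (non key-value) RDD has none.
val plainPartitioner = nums.partitioner

// A key-value RDD can be given a HashPartitioner with 3 partitions.
val pairs = nums.map(n => (n % 3, n)).partitionBy(new HashPartitioner(3))
val pairPartitioner = pairs.partitioner

// 3. Dependencies: print the lineage that Spark would use for recomputation.
println(pairs.toDebugString)

// 5. Preferred locations of the first partition
//    (typically empty for an in-memory collection, populated for HDFS files).
val locations = nums.preferredLocations(nums.partitions(0))
```

The compute function (property 2) is internal to each RDD implementation and is not called by user code, which is why it does not appear in the sketch.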

2. Simple RDD operations

Start spark-shell. Under the hood, spark-shell calls spark-submit. Spark should be configured first; the settings can also be passed on the command line, but this is not recommended. The configuration below is for reference only (YARN mode is used here):

$ cat spark-defaults.conf

Start spark-shell (explained in detail below)

$ spark-shell

[Problem] A warning appears: WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Cause] When Spark submits a job to the YARN cluster, it needs to upload the relevant Spark jar packages to HDFS.
[Solution] Upload the jars to HDFS in advance and specify their path in the Spark configuration file. This avoids uploading the files again every time a job is submitted to YARN. The steps are as follows:

### Package the jars. Options of the jar command:
# -c creates a jar package
# -t lists the contents of the jar
# -x extracts the jar package
# -u adds files to an existing jar package
# -f specifies the file name of the jar package
# -v prints verbose output to standard output
# -m specifies a manifest file (MANIFEST.MF) to include; settings for the jar package and its contents can be placed in this file
# -0 stores the files without compressing them
# -M does not generate a manifest file (MANIFEST.MF); with this option the -m setting is ignored
# -i creates an index file for the specified jar file
# -C changes to the given directory before running the jar command, equivalent to cd-ing into that directory and running jar without -C
$ cd /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2
$ jar cv0f spark-libs.jar -C ./jars/ .
$ ll
### Create a directory on HDFS for the jar package
$ hdfs dfs -mkdir -p /spark/jars
## Upload the jar to HDFS
$ hdfs dfs -put spark-libs.jar /spark/jars/
## Then point spark.yarn.archive (or spark.yarn.jars) in spark-defaults.conf at the uploaded file

Then start spark shell

In spark-shell, a SparkContext has already been created for you under the variable name sc. A SparkContext that you create yourself will not work.

$ spark-shell

// Create an RDD from an existing Scala collection
val array = Array(1,2,3,4,5)
// Spark uses the parallelize method to create the RDD
val rdd = sc.parallelize(array)

This is just a simple example of creating an RDD. More RDD operations are demonstrated later.


Spark supports two types of operations (operators): transformations and actions.


A transformation generates a new RDD from an existing RDD. Transformations are lazy (evaluation is deferred): the code of a transformation operator is not executed right away. It runs only when the program reaches an action operator. This design lets Spark run more efficiently.
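
Lazy evaluation can be observed with a counter. The sketch below assumes spark-shell (SparkContext available as `sc`) and uses a long accumulator, named "map-calls" here purely for illustration, to record how many times the function passed to map() has run.

```scala
// Assumes spark-shell, where the SparkContext is available as `sc`.

// Accumulator that counts invocations of the map function.
val calls = sc.longAccumulator("map-calls")

// Transformation only: nothing is executed yet.
val doubled = sc.parallelize(1 to 5).map { x =>
  calls.add(1)
  x * 2
}
val callsBeforeAction: Long = calls.value   // still 0, map has not run

// Action: collect() triggers the actual computation.
val values = doubled.collect()
val callsAfterAction: Long = calls.value    // the function has now run for every element
```

Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this counter is a teaching device rather than a reliable metric.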

Common transformations:

Transformation Meaning
map(func) Returns a new RDD formed by applying func to each input element
filter(func) Returns a new RDD formed by the input elements for which func returns true
flatMap(func) Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element)
mapPartitions(func) Similar to map, but runs separately on each partition of the RDD, so on an RDD of type T, func must be of type Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func) Similar to mapPartitions, but func also takes an integer parameter representing the index of the partition, so on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]
sample(withReplacement, fraction, seed) Samples the data at the proportion given by fraction. withReplacement selects sampling with or without replacement, and seed sets the seed of the random number generator
union(otherDataset) Returns a new RDD containing the union of the source RDD and the argument RDD
intersection(otherDataset) Returns a new RDD containing the intersection of the source RDD and the argument RDD
distinct([numTasks]) Returns a new RDD with the duplicates of the source RDD removed
groupByKey([numTasks]) Called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs
reduceByKey(func, [numTasks]) Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set through the optional second parameter
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) Aggregates within each partition first (seqOp) and then across partitions (combOp), starting from the initial value zeroValue in each partition. For example: aggregateByKey(0)(_ + _, _ + _). Operates on K/V RDDs
sortByKey([ascending], [numTasks]) Called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key
sortBy(func, [ascending], [numTasks]) Similar to sortByKey but more flexible. The first parameter is the function that produces the sort key, the second is the sort direction (false for descending). By default the number of partitions after sorting is the same as in the original RDD
join(otherDataset, [numTasks]) Called on RDDs of types (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key. Equivalent to an inner join (intersection of keys)
cogroup(otherDataset, [numTasks]) Called on RDDs of types (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W]))
cartesian(otherDataset) Returns the Cartesian product of the two RDDs as an RDD of pairs
pipe(command, [envVars]) Pipes the RDD through an external program
coalesce(numPartitions, [shuffle]) Repartitions the RDD. The first parameter is the target number of partitions, the second is whether to shuffle (false by default). Increasing the number of partitions requires shuffle = true; decreasing it works with shuffle = false
repartition(numPartitions) Repartitions with shuffle always set to true, so the number of partitions can be either decreased or increased
repartitionAndSortWithinPartitions(partitioner) Repartitions and sorts in one step, which is more efficient than repartitioning first and sorting afterwards. Operates on K/V RDDs
foldByKey(zeroValue)(func) Folds and merges the values of each key in a K/V RDD. Similar to aggregateByKey: the value in the first parameter list is the initial value applied to the V values, and the function in the second parameter list does the aggregation, for example _ + _
combineByKey(createCombiner, mergeValue, mergeCombiners) Merges the values of the same key. For example: rdd1.combineByKey(x => x, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)
partitionBy(partitioner) Partitions the RDD with the given partitioner, for example new HashPartitioner(2)
cache/persist Caches the RDD to avoid repeated computation and save time. cache calls persist internally; cache uses the MEMORY_ONLY storage level by default, while persist lets you choose the storage level
subtract(rdd) Returns an RDD with the elements of the source RDD that are not in the argument RDD
leftOuterJoin Similar to a left outer join in SQL. The result is driven by the source RDD, and records with no match are empty (None). It joins only two RDDs at a time; to join more RDDs, apply it several times
rightOuterJoin Similar to a right outer join in SQL. The result is driven by the argument RDD, and records with no match are empty (None). It joins only two RDDs at a time; to join more RDDs, apply it several times
subtractByKey Similar to subtract among the basic transformations, except that it compares keys: it returns the elements whose key appears in the source RDD but not in the other RDD
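
To make a few of the key-value transformations in the table concrete, here is a short sketch that assumes spark-shell (SparkContext available as `sc`). The sample pairs and the names `sales` and `labels` are made up for illustration.

```scala
// Assumes spark-shell, where the SparkContext is available as `sc`.
// Made-up (key, value) sample data spread over 2 partitions.
val sales = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)), 2)

// reduceByKey: sum the values of each key.
val totals = sales.reduceByKey(_ + _)

// aggregateByKey: build (sum, count) per key, first inside each partition (seqOp),
// then across partitions (combOp).
val sumAndCount = sales.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
)

// join: inner join on the key; "c" has no label and is dropped.
val labels = sc.parallelize(Seq(("a", "apple"), ("b", "banana")))
val joined = totals.join(labels)

// coalesce without shuffle can only reduce the number of partitions.
val merged = sales.coalesce(1)

val totalsMap = totals.collect().toMap            // Map(a -> 4, b -> 6, c -> 5)
val sumAndCountMap = sumAndCount.collect().toMap  // a -> (4,2), b -> (6,2), c -> (5,1)
val joinedMap = joined.collect().toMap            // a -> (4,apple), b -> (6,banana)
```

All of the lines up to the collect() calls are transformations, so no job runs until the last three statements.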

To trigger execution, a Spark program needs at least one action.

Common actions:

Action Meaning
reduce(func) Aggregates all elements of the RDD with func. The function must be commutative and associative so that it can be computed in parallel
collect() Returns all elements of the data set to the driver as an array
count() Returns the number of elements in the RDD
first() Returns the first element of the RDD (similar to take(1))
take(n) Returns an array with the first n elements of the data set
takeSample(withReplacement, num, [seed]) Returns an array of num elements sampled at random from the data set. withReplacement selects sampling with or without replacement, and seed sets the seed of the random number generator
takeOrdered(n, [ordering]) Returns an array of the first n elements of the RDD after sorting (ascending by default)
saveAsTextFile(path) Saves the elements of the data set as a text file in HDFS or another supported file system. Spark calls toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path) Saves the elements of the data set as a Hadoop SequenceFile in the given directory, which can be on HDFS or any other file system supported by Hadoop
saveAsObjectFile(path) Serializes the elements of the RDD as objects and stores them in a file. Used in the same way as saveAsTextFile
countByKey() For an RDD of type (K, V), returns a map of (K, Long) with the number of elements for each key
foreach(func) Runs func on each element of the data set
aggregate(zeroValue)(seqOp, combOp) Aggregates within each partition first and then combines the results of all partitions
reduceByKeyLocally(func) Aggregates the values of the same key and returns the result to the driver as a Map
lookup(key) For an RDD of type (K, V), returns all V values for the given key
top(num) Returns the first num elements of the RDD according to the default (descending) order or a specified ordering
fold(zeroValue)(op) A simplified aggregate in which seqOp and combOp use the same function op
foreachPartition(func) Runs func once on each partition of the RDD, passing it an iterator over the elements of that partition

4. Hands-on operations

1. Transformations applied to each element

The most commonly used transformations are map() and filter(). map() takes a function, applies it to each element of the RDD, and uses the return value as the corresponding element of the resulting RDD. filter() takes a function and returns a new RDD containing only the elements for which the function returns true.

Here is a simple example that uses map() to square every number in an RDD:

// Create the RDD with parallelize
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)

2. Basic transformation (de-duplication) on an RDD containing {1, 2, 3, 3}

var rdd = sc.parallelize(List(1,2,3,3))
// Remove duplicate elements
rdd.distinct().collect()

3. Set-like transformations on two RDDs containing {1, 2, 3} and {3, 4, 5}

var rdd = sc.parallelize(List(1,2,3))
var other = sc.parallelize(List(3,4,5))
// Generate an RDD that contains all the elements of both RDDs
rdd.union(other).collect()
// Find the elements common to both RDDs
rdd.intersection(other).collect()

4. Actions

The reduce() action takes a function as its parameter. This function operates on two elements of the RDD's element type and returns a new element of the same type. A simple example is the + function, which can be used to sum the RDD. With reduce() it is easy to compute the sum of all elements; count() gives the number of elements, and aggregate() covers other kinds of aggregation.

var rdd = sc.parallelize(List(1,2,3,4,5,6,7))
// Sum of all elements
var sum = rdd.reduce((x, y) => x + y)
// Number of elements
var count = rdd.count()

// Aggregation: compute (sum, count) in one pass, then the average
var rdd = sc.parallelize(List(1,2,3,4,5,6,7))
var result = rdd.aggregate((0,0))((acc,value) => (acc._1 + value,acc._2 + 1),(acc1,acc2) => (acc1._1 + acc2._1 , acc1._2 + acc2._2))
var avg = result._1/result._2.toDouble

These are only a few simple examples. For more RDD operations, refer to the official documentation.


In Spark, a DataFrame provides a domain-specific language (DSL) and SQL for working with structured data. A DataFrame is a distributed data set built on RDDs, similar to a two-dimensional table in a traditional database.

  • RDD: because Spark has no way of knowing the internal structure of the stored elements, Spark Core can only apply simple, general pipeline optimization at the stage level.
  • DataFrame: its underlying layer is a distributed data set based on RDDs. The main difference from an RDD is that an RDD carries no schema information, while every row of a DataFrame follows a schema. DataFrame = RDD + schema

1. DSL-style syntax

1) Creating a DataFrame

There are two basic ways to create a DataFrame:

  • Call the toDF() method on an existing RDD.
  • Read a data source through Spark and create the DataFrame directly.

Creating a DataFrame object directly

When a SparkSession is used to create a DataFrame, spark.read can load data from different types of files. The spark.read methods are as follows.

Method name Description
spark.read.text("people.txt") Reads a TXT file and creates a DataFrame
spark.read.csv("people.csv") Reads a CSV file and creates a DataFrame
spark.read.json("people.json") Reads a JSON file and creates a DataFrame
spark.read.parquet("people.parquet") Reads a Parquet file and creates a DataFrame

1. Create a local person.txt text file to read, then run spark-shell:

# person.txt holds one record per line with the fields name,age,height
# Start spark-shell; a SparkSession object named spark is created by default
$ spark-shell
// Define a variable. [Note] every node must have this person.txt file, otherwise a node without it reports an error when a task is scheduled there
var inputFile = "file:///opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/test/person.txt"
// Read the local file
val personDF = spark.read.text("file:///opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/test/person.txt")
val personDF = spark.read.text(inputFile)
// Put the file on HDFS, then read it from HDFS (recommended)
val personDF = spark.read.text("hdfs:///person.txt")

2. Converting an RDD to a DataFrame

Method Meaning
show() Displays the contents of the DataFrame
printSchema() Displays the schema of the DataFrame
select() Selects some columns of the DataFrame, optionally renaming them
filter() Runs a conditional query and keeps the matching rows
groupBy() Groups records
sort() Sorts by specific fields
toDF() Converts an RDD to a DataFrame

// Read the text file and split each line on commas
val lineRDD = sc.textFile("hdfs:///person.txt").map(_.split(","))
case class Person(name:String, age:Int, height:Int)
// Map each record of the RDD to the case class
val personRDD = lineRDD.map(x => Person(x(0).toString, x(1).toInt, x(2).toInt))
// Convert the RDD to a DataFrame
val personDF = personRDD.toDF()
// Show the table
personDF.show
// Show the schema
personDF.printSchema
// Show the name column
personDF.select(col("name")).show
// Keep rows with age >= 25 (drops those younger than 25)
personDF.filter(col("age") >= 25).show

Common Spark DataFrame methods are listed here:

Method name Meaning
collect() Returns an Array containing all rows of the DataFrame
collectAsList() Returns a Java List containing all rows of the DataFrame
count() Returns the number of rows in the DataFrame as a number
describe(cols: String*) Returns a DataFrame of summary statistics (count, mean, stddev, min and max). Multiple column names can be passed, separated by commas. Null values are excluded from the calculation, and the statistics are mainly intended for numeric fields. For example: df.describe("age", "height").show()
first() Returns the first row, of type Row
head() Returns the first row, of type Row
head(n:Int) Returns the first n rows, of type Row
show() Displays the first 20 rows of the DataFrame by default; the return type is Unit
show(n:Int) Displays n rows; the return type is Unit
take(n:Int) Returns the first n rows, of type Row
cache() Caches the DataFrame with the default storage level
columns Returns an array of String containing the names of all columns
dtypes Returns an array of (String, String) pairs containing the names and types of all columns
explain() Prints the physical execution plan
explain(extended:Boolean) Takes true or false, returns Unit, and defaults to false. With true, both the logical and the physical plans are printed
isLocal Returns a Boolean: true if collect and take can be run locally, otherwise false
persist(newLevel:StorageLevel) Persists the DataFrame with the given storage level and returns this.type
printSchema() Prints the field names and types as a tree
registerTempTable(tableName:String) Returns Unit. Registers the DataFrame as a temporary table that is dropped together with the session that created it (deprecated since Spark 2.0 in favor of createOrReplaceTempView)
schema Returns a StructType describing the field names and types
toDF() Returns a new DataFrame
toDF(colNames:String*) Returns a new DataFrame whose columns are renamed to the given names
unpersist() Removes the data from the cache and returns this.type
unpersist(blocking:Boolean) Same as unpersist(); blocking controls whether the call waits until all blocks are removed. Returns this.type
agg(exprs: Column*) Returns a DataFrame with the given aggregate expressions evaluated
agg(exprs: Map[String, String]) Returns a DataFrame with the aggregates evaluated; the map goes from column name to aggregate function name
agg(aggExpr: (String, String), aggExprs: (String, String)*) Returns a DataFrame with the aggregates evaluated; each pair is (column name, aggregate function name)
apply(colName: String) Returns the Column object for the given column name
as(alias: String) Returns a new DataFrame with the given alias
col(colName: String) Returns the Column object for the given column name
cube(col1: String, cols: String*) Returns a grouped data object (GroupedData) that summarizes over combinations of the given fields
distinct Returns a DataFrame with duplicate rows removed
drop(col: Column) Drops a column and returns a DataFrame
dropDuplicates(colNames: Array[String]) Removes rows that are duplicates with respect to the given columns and returns a DataFrame
except(other: DataFrame) Returns a DataFrame with the rows that exist in the current DataFrame but not in the other one
filter(conditionExpr: String) Filters rows by a condition expression and returns a DataFrame
groupBy(col1: String, cols: String*) Groups by the given fields and returns a grouped data object (GroupedData)
intersect(other: DataFrame) Returns a DataFrame with the rows that exist in both DataFrames
join(right: DataFrame, joinExprs: Column, joinType: String) The first parameter is the DataFrame to join with, the second is the join condition, and the third is the join type: inner, outer, left_outer, right_outer, leftsemi
limit(n: Int) Returns a DataFrame with the first n rows
orderBy(sortExprs: Column*) Sorts by the given expressions; an alias of sort
sort(sortExprs: Column*) Sorts by the given expressions, for example df.sort(df("age").desc).show(). The default order is ascending
select(cols: Column*) Selects fields and returns a DataFrame, for example df.select($"colA", $"colB" + 1)
withColumnRenamed(existingName: String, newName: String) Renames a column, for example df.withColumnRenamed("name", "names").show()
withColumn(colName: String, col: Column) Adds a column, for example df.withColumn("aa", df("name")).show()

The methods listed here cover most everyday operations. For the rest, refer to the official documentation.
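
The following sketch strings several of the DSL methods above together. It assumes spark-shell, where the SparkSession is available as `spark`; the three rows and the column name `age_next_year` are made up for illustration.

```scala
import org.apache.spark.sql.functions.{avg, col}

// Assumes spark-shell, where the SparkSession is available as `spark`.
// Made-up sample rows: (name, age, height).
val people = spark.createDataFrame(Seq(
  ("zhangsan", 18, 170),
  ("lisi", 28, 175),
  ("wangwu", 35, 180)
)).toDF("name", "age", "height")

// Inspect the structure.
people.printSchema()

// filter + select + sort: people aged 25 or more, oldest first.
val adults = people
  .filter(col("age") >= 25)
  .select("name", "age")
  .sort(col("age").desc)
adults.show()

// agg: average age over the whole DataFrame.
val avgAge = people.agg(avg("age")).first().getDouble(0)

// withColumnRenamed + withColumn: rename one column and derive another.
val renamed = people
  .withColumnRenamed("name", "names")
  .withColumn("age_next_year", col("age") + 1)
renamed.show()
```

Each call returns a new DataFrame, so the original `people` is left unchanged, in line with the immutability inherited from RDDs.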

2. SQL-style syntax

One of the strengths of the DataFrame is that it can be treated as a relational table, so SQL queries can be run with spark.sql() and the result comes back as a DataFrame. Because SparkSession includes the HiveContext functionality, spark.sql() automatically starts a connection to Hive. By default Hive runs in local mode (embedded Derby).

Start spark-shell

$ spark-shell

Two entries are created in the directory where spark-shell is run: derby.log and metastore_db

Now you can happily write SQL. Here are a few commands. As with Hive earlier, put the SQL statement inside the spark.sql() method. If you are not familiar with Hive SQL, see my previous article: Big data Hadoop: data warehouse Hive

// There is a default database
spark.sql("show databases").show
// The current database is default
spark.sql("show tables").show
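
To query a DataFrame itself with SQL, it first has to be registered as a temporary view. The sketch below assumes spark-shell (SparkSession available as `spark`); the sample rows and the view name `people` are made up for illustration.

```scala
// Assumes spark-shell, where the SparkSession is available as `spark`.
// Made-up sample rows: (name, age).
val people = spark.createDataFrame(Seq(
  ("zhangsan", 18),
  ("lisi", 28)
)).toDF("name", "age")

// Register the DataFrame as a session-scoped temporary view.
people.createOrReplaceTempView("people")

// Run SQL against the view; the result is again a DataFrame.
val adults = spark.sql("SELECT name FROM people WHERE age >= 25")
adults.show()

val adultNames = adults.collect().map(_.getString(0)).toSeq
```

The view exists only for the lifetime of the session and does not create anything in the Hive metastore.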

Using the spark-sql command-line shell

This works even more like plain SQL and is almost the same as Hive. A few commands are demonstrated below to make this clear.

$ spark-sql
show databases;
create database test007;

As before, two entries are created automatically in the current directory: derby.log and metastore_db


A Dataset is a distributed data set. It provides strong typing by adding a type constraint to each row of data in an RDD. Dataset was introduced in Spark 1.6. It combines the advantages of RDDs (strong typing and powerful lambda functions) with the optimized execution engine of Spark SQL. A Dataset can be built from JVM objects and manipulated with functional transformations (map / flatMap / filter).

1. Create a Dataset from a collection with spark.createDataset

val ds1 = spark.createDataset(1 to 10)

2. Build a Dataset from an existing RDD

val ds2 = spark.createDataset(sc.textFile("hdfs:///person.txt"))

3. Create a Dataset from a case class

case class Person(name:String,age:Int)
val personDataList = List(Person("zhangsan",18),Person("lisi",28))
val personDS = personDataList.toDS

4. Convert a DataFrame to a Dataset
The Music.json file is as follows:

{"name": "Shanghai beach", "singer": "Ye Liyi", "album": "theme song of Hong Kong TV series", "path": "mp3/shanghaitan.mp3"}
{"name": "what do you want in life", "singer": "Chen Baiqiang", "album": "theme song of Hong Kong TV series", "path": "mp3/shanghaitan.mp3"}
{"name": "red sun", "singer": "Li Keqin", "album": "nostalgic album", "path": "mp3/shanghaitan.mp3"}
{"name": "love is like tide", "singer": "Zhang Xinzhe", "album": "nostalgic album", "path": "mp3/airucaoshun.mp3"}
{"name": "red teahouse", "singer": "Chen Huixian", "album": "nostalgic album", "path": "mp3/redteabar.mp3"}

case class Music(name:String,singer:String,album:String,path:String)
// Upload Music.json to HDFS before running this
val jsonDF = spark.read.json("hdfs:///Music.json")
val jsonDS = jsonDF.as[Music]

Conversions between RDD, DataFrame and Dataset
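
As a sketch of how the three abstractions convert into one another, the snippet below walks through each direction. It assumes spark-shell, where `sc` and `spark` already exist; the case class `Staff` and its sample values are hypothetical and only serve the illustration.

```scala
// Assumes spark-shell, where `sc` and `spark` already exist.
import spark.implicits._

// Hypothetical case class used only for this illustration.
case class Staff(name: String, age: Int)

val staffRDD = sc.parallelize(Seq(Staff("zhangsan", 18), Staff("lisi", 28)))

// RDD -> DataFrame and RDD -> Dataset
val staffDF = staffRDD.toDF()
val staffDS = staffRDD.toDS()

// DataFrame -> Dataset (attach the case class type) and Dataset -> DataFrame (drop it)
val typedDS = staffDF.as[Staff]
val untypedDF = staffDS.toDF()

// DataFrame -> RDD[Row] and Dataset -> RDD[Staff]
val rowRDD = staffDF.rdd
val backToStaffRDD = staffDS.rdd

val namesFromRows = rowRDD.map(_.getAs[String]("name")).collect().toSet
val namesFromStaff = backToStaffRDD.map(_.name).collect().toSet
```

Going from a DataFrame back to an RDD yields generic Row objects, while going back from a Dataset keeps the case class type.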

4、 Similarities and differences among RDD, DataFrame and Dataset

  • RDD[Person]: takes Person as its type parameter, but Spark does not know the internal structure of Person.

  • DataFrame: provides detailed structure information, that is, the names and types of the schema columns. It looks like a table.

  • Dataset[Person]: carries not only the schema (structure) information but also the type information.

1) Commonalities

  • All three are resilient distributed data sets on the Spark platform and make it convenient to process very large data.
  • All three are lazy. Creation and transformations (such as map) are not executed immediately; computation starts only when an action operator (such as foreach) is reached. In the extreme case where the code contains only creation and transformations and no later action uses the results, the transformations are never executed.
  • All three have the concept of partitions and support cache and checkpoint operations.
  • All three share many common functions (such as map, filter and sort).
    Many DataFrame and Dataset operations require the implicit conversions (import spark.implicits._).

2) Differences

  • DataFrame: a DataFrame is a special case of Dataset, namely an alias of Dataset[Row]; DataFrame = RDD + schema
    1. Every row of a DataFrame has the fixed type Row, so the value of each field can only be obtained by parsing the row
    2. DataFrame and Dataset are usually used together with Spark ML
    3. Both DataFrame and Dataset support Spark SQL operations such as select and groupBy, and both can be registered as temporary tables for SQL statements
    4. DataFrame and Dataset support convenient ways of saving data, such as CSV with a header row, so that the field name of each column is clear at a glance
  • Dataset: Dataset = RDD + case class
    1. Dataset and DataFrame have the same member functions; only the data type of each row differs.
    2. Each row of a Dataset is an instance of a case class. Once the case class is defined, the fields of each row can be accessed easily
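
The difference in row types can be seen by reading the same field both ways. This is a small sketch that assumes spark-shell (`spark` available); the case class `Member` and the sample rows are hypothetical.

```scala
// Assumes spark-shell, where the SparkSession is available as `spark`.
import spark.implicits._

// Hypothetical case class used only for this illustration.
case class Member(name: String, age: Int)

val memberDF = spark.createDataFrame(Seq(("zhangsan", 18), ("lisi", 28))).toDF("name", "age")

// DataFrame: each row is a generic Row, so fields are looked up by name
// and the expected type must be stated explicitly.
val agesFromRows = memberDF.collect().map(row => row.getAs[Int]("age")).toSet

// Dataset: each row is a Member, so fields are ordinary, type-checked members.
val memberDS = memberDF.as[Member]
val agesFromMembers = memberDS.map(m => m.age).collect().toSet
```

A misspelled field name fails only at run time in the Row version, whereas the case class version is rejected by the compiler.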

5、 Spark shell

Spark’s shell, as a powerful interactive data analysis tool, provides a simple way to learn API. It can use Scala (a good way to run existing Java libraries on a Java virtual machine) or python. The essence of spark shell is to call spark submit script in the background to start the application. A sparkcontext object named SC will be created in spark shell.

[note] spark shell can only be started in client mode.

view help

$ spark-shell --help

Spark shell common options

--master MASTER_ URL assignment mode( spark://host:port ,  mesos://host:port , yarn,
                              k8s://https://host:port, or local (Default: local[*]))
--Executor memory MEM specifies the memory of each executor. The default is 1GB
--Total executor cores num specifies the number of cores occupied by all executors
--Num executors num specifies the number of executors
--Help, - H displays help information
--Version displays the version number

From the help above, spark has five operation modes: spark, mesos, yarn, k8s and local. Here we mainly talk about the local and yarn modes

Master URL | Meaning
local | Run locally with a single worker thread, so there is no parallelism
local[K] | Run locally with K worker threads; K is usually set to the number of CPU cores of the machine
local[*] | Run locally with as many worker threads as the machine has CPU cores
spark://HOST:PORT | Run in standalone mode, the cluster mode provided by Spark itself. The default port is 7077
mesos://HOST:PORT | Run on a Mesos cluster. With --deploy-mode cluster, both the driver and the executors run on the Mesos cluster
yarn | Run on a YARN cluster, which depends on a Hadoop cluster. The application is submitted to the YARN resource scheduling framework, which schedules resources on the cluster and starts executors to run tasks. In cluster deploy mode, the driver runs inside the ApplicationMaster (roughly comparable to the Master in standalone mode)
k8s://https://HOST:PORT | Run on a Kubernetes (k8s) cluster
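
To make the master URL forms concrete, here is a minimal sketch in plain Python (not Spark code) that classifies a --master value into one of the modes listed above. The function name `classify_master` and the returned tuples are assumptions made for illustration, and only the forms from the table are handled.

```python
import re


def classify_master(master):
    """Map a --master value to (mode, detail) for the forms listed in the table above."""
    if master == "local":
        return ("local", 1)  # a single worker thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match:
        k = match.group(1)
        # "*" means one worker thread per CPU core; otherwise K worker threads
        return ("local", "all cores" if k == "*" else int(k))
    if master == "yarn":
        return ("yarn", None)  # cluster location comes from the Hadoop configuration
    for prefix, mode in (("spark://", "standalone"),
                         ("mesos://", "mesos"),
                         ("k8s://", "k8s")):
        if master.startswith(prefix):
            return (mode, master[len(prefix):])  # remainder is the cluster address
    raise ValueError(f"unrecognized master URL: {master!r}")
```

For example, `classify_master("spark://hadoop-node1:7077")` yields the standalone mode together with the address `hadoop-node1:7077`.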


In spark-shell, a SparkContext has already been created for you in the variable sc. A SparkContext that you create yourself will not work. Use the --master option to set the cluster that the SparkContext connects to, and --jars to add JAR packages to the classpath; separate multiple JAR packages with commas. For example, to run spark-shell locally on two cores, use:

# --master decides where the application runs and which resource manager schedules it
# local runs a single thread, local[*] uses all available CPU cores, local[2] uses two CPU cores
$ spark-shell --master local[2]
# --master defaults to local[*]
# --executor-memory sets the memory of each executor (default 1G)
# --total-executor-cores sets the total number of cores used by all executors
$ spark-shell --master local[*] --executor-memory 1g --total-executor-cores 1
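
The rule that multiple JAR packages are joined with commas can be sketched as a small helper. This is plain Python that only assembles the argument list; the function name `spark_shell_command` and the JAR file names in the usage note are hypothetical.

```python
def spark_shell_command(master="local[*]", jars=()):
    """Build the spark-shell argument list; multiple JARs become one comma-separated value."""
    cmd = ["spark-shell", "--master", master]
    if jars:
        # --jars takes a single argument, with the individual paths joined by commas
        cmd += ["--jars", ",".join(jars)]
    return cmd
```

Calling `spark_shell_command("local[2]", ["a.jar", "b.jar"])` produces a list ending in `"--jars", "a.jar,b.jar"`, which could be passed to `subprocess.run` on a machine where spark-shell is on the PATH.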

Web UI address: http://hadoop-node1:4040

Then you can use Scala in spark-shell to carry out operations. Here is a simple one; if you are interested, explore further on your own.

val textFile = sc.textFile("file:///opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/")

Here, textFile.count() returns the total number of elements (lines) in the RDD, and textFile.first() returns the first line of the RDD.

2) On YARN (recommended)

# On YARN; you can also set the spark.master field in the configuration file
$ spark-shell --master yarn

The --master option sets the cluster manager that the context connects to and uses. Its value can be the address of a Spark standalone cluster, yarn, the URL of a Mesos cluster, or a local mode.

6、 Integration of sparksql and hive (spark on hive)

1) Create soft link

$ ln -s /opt/bigdata/hadoop/server/apache-hive-3.1.2-bin/conf/hive-site.xml /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/conf/hive-site.xml

2) Copy the MySQL connector JAR from Hive's lib directory to Spark's jars directory

$ cp /opt/bigdata/hadoop/server/apache-hive-3.1.2-bin/lib/mysql-connector-java-5.1.49-bin.jar /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/jars/

3) Configuration

# Create the HDFS directory that stores the Spark event logs
$ hadoop fs -mkdir -p /tmp/spark
$ cd /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/conf
$ cp spark-defaults.conf.template spark-defaults.conf

Add the following configuration to spark-defaults.conf:

# Use YARN mode
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop-node1:8082/tmp/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512m
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
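
Each line of the file is a key followed by whitespace and a value, and the value itself may contain spaces. The sketch below is plain Python that reads this layout into a dictionary; it is a simplified illustration rather than Spark's own loader, and the function name `parse_spark_defaults` is an assumption.

```python
def parse_spark_defaults(text):
    """Parse 'key <whitespace> value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split only on the first run of whitespace so values keep their inner spaces
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1] if len(parts) > 1 else ""
    return conf


SAMPLE = """\
# Use YARN mode
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop-node1:8082/tmp/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512m
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
"""

conf = parse_spark_defaults(SAMPLE)
```

Splitting on the first whitespace run only is what keeps the multi-word value of `spark.executor.extraJavaOptions` in one piece.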

4) Start the Spark shell to operate on Hive (local)

Start the Metastore service to support multiple users

$ nohup hive --service metastore &
$ ss -atnlp|grep 9083
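
The same check, whether something is listening on the Metastore port, can also be done from a script. The following is a minimal sketch in plain Python; the helper name `is_port_open` is hypothetical, and the host name and port in the usage note are taken from the commands in this article.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False
```

For instance, `is_port_open("hadoop-node1", 9083)` would report whether the Metastore service started above is accepting connections.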

Add the following configuration to hive-site.xml:


Start spark-sql

# YARN mode; --master yarn can be omitted because spark.master is already set to yarn in the configuration file above
$ spark-sql --master yarn
show databases;

As the figure above shows, the database I created earlier is listed, which indicates that the integration works.

7、 Spark beeline

Spark Thrift Server is a Thrift service implemented by the Spark community on the basis of HiveServer2, and it is designed to be seamlessly compatible with HiveServer2. Because its interface and protocol are identical to those of HiveServer2, after deploying Spark Thrift Server you can use Hive's beeline directly to connect to it and execute statements. Spark Thrift Server is only intended to replace HiveServer2, so it still interacts with the Hive Metastore to obtain Hive metadata.

1) Comparison between spark thrift server architecture and hiveserver2 architecture

2) Differences between spark thrift server and hiveserver2

Item | Hive on Spark | Spark Thrift Server
Task submission mode | Each session creates a RemoteDriver, that is, one application per session. The SQL is then parsed into a physical plan, serialized and sent to the RemoteDriver for execution | The server itself is a driver that receives SQL and executes it directly, so all sessions share one application
Performance | Average | If the storage format is ORC or Parquet, performance is several times higher than Hive, and for some statements dozens of times higher. For other formats the difference is small, and Hive is sometimes faster
Concurrency | If task execution is not asynchronous, it runs in a Thrift worker thread and is limited by the number of worker threads. If it is asynchronous, it runs in a thread pool and concurrency is limited by the size of the asynchronous thread pool | Tasks are processed in the same way as in Hive
SQL compatibility | Mainly supports ANSI SQL 2003; it does not comply fully, but covers most of it, and it adds many syntax extensions of its own | Spark SQL has its own implementation standards, so it is not fully compatible with Hive. Which statements are incompatible has to be found by testing
HA | HA can be implemented through ZooKeeper (ZK) | There is no built-in HA implementation, but the Spark community has opened an issue with a patch that can be used:

[Summary] Spark Thrift Server is a lightly modified HiveServer2, and the amount of code is small. Although its interface is identical to HiveServer2's, its approach of running on the cluster as a single application is rather unusual. Perhaps the maintainers chose simplicity over further optimization.

3) Configure and start spark thrift server

1. Configure hive-site.xml


2. Start the Spark Thrift Server service (do not start HiveServer2 (HS2) at the same time: the configuration is the same, so the two would conflict)

$ cd /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/sbin
$ ./
$ ss -tanlp|grep 11000

3. Start beeline

# To distinguish it from Hive's beeline, start it from Spark's bin directory
$ cd /opt/bigdata/hadoop/server/spark-3.2.0-bin-hadoop3.2/bin
# Usage is the same as with Hive, but the computing engine is Spark instead
$ ./beeline
!connect jdbc:hive2://hadoop-node1:11000
show databases;

To view the running application, open the YARN Web UI: http://hadoop-node1:8088/cluster/apps

8、 Spark streaming

Like other big data frameworks such as Storm and Flink, Spark Streaming is a framework for real-time computation built on Spark Core. It works by slicing the input stream by time and then processing the resulting data blocks in parallel as offline-style batch jobs. The principle is shown in the figure below:

It supports obtaining data from multiple data sources:

Spark itself processes batch (offline) data. Spark Streaming does not process records one at a time as Storm does; instead, it splits the received real-time stream into batches at a fixed time interval and hands each batch to the Spark engine, which produces the results batch by batch.
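
The idea of cutting a stream into time slices and processing each slice as a batch can be sketched in a few lines. This is plain Python that imitates the principle only, not the Spark Streaming API; the function name `micro_batches`, the event tuples and the one-second interval are assumptions for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive time slices of `interval` seconds."""
    slices = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval land in the same batch
        slices[int(timestamp // interval)].append(value)
    # Return the batches in time order; intervals without events are skipped
    return [slices[index] for index in sorted(slices)]


# Hypothetical stream: (arrival time in seconds, record)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1)

# Each batch is then processed as a whole, here by simply counting its records
batch_sizes = [len(batch) for batch in batches]
```

A shorter interval lowers latency but produces more, smaller batches, which is the main trade-off of this time-sliced approach.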

Because this article is already long, Spark Streaming is only touched on here. When I have time, I will continue to add more on Spark Streaming; please be patient.

Official documents:
