Spark SQL is a Spark module for structured data processing. Unlike the Spark RDD API, the Spark SQL interfaces provide more information about the structure of the data and the computation being performed, and Spark SQL uses this extra information internally to perform additional optimizations. Spark SQL integrates Spark's functional programming API with SQL queries, supporting queries written in SQL or HiveQL. It also provides two abstractions for structured data, the DataFrame and the Dataset, with corresponding DataFrame and Dataset APIs for interacting with Spark SQL.
Spark SQL features
Spark SQL has the following characteristics:
- Easy integration
SQL queries mix seamlessly with Spark programs.
- Unified data access
The same interface (DataFrames) accesses many different data sources, including Hive, Avro, Parquet, JSON, and JDBC.
- Hive compatibility
SQL and HiveQL queries run directly against existing Hive warehouses.
- Standard database connectivity
Spark SQL can be connected to via JDBC or ODBC.
Spark SQL uses the same engine for both interactive and long-running queries.
- Performance optimization
In Spark SQL, the query optimization engine converts each SQL statement into a logical plan and then into a physical execution plan. At execution time it selects the optimal plan, which generally yields faster execution than the equivalent Hive query.
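As a sketch of what this optimization pipeline exposes, the `explain` method prints the plans the optimizer generates (this assumes a spark-shell session, where `spark` is the automatically created SparkSession, and the `people.json` sample file used later in this article):

```scala
// Hypothetical spark-shell session.
scala> val df = spark.read.json("./examples/src/main/resources/people.json")

// explain(true) prints the parsed and analyzed logical plans, the optimized
// logical plan, and the physical plan selected for execution.
scala> df.filter($"age" > 27).explain(true)
```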
In Spark, a DataFrame is a distributed data set built on top of RDDs, similar to a two-dimensional table in a traditional relational database. A DataFrame can be constructed from many kinds of sources, such as structured data files, Hive tables, external databases, or existing RDDs. The DataFrame API supports several programming languages, including Scala, Java, Python, and R. The main difference between a DataFrame and an RDD is that the former carries schema metadata: every column of the two-dimensional table a DataFrame represents has a name and a type. This gives Spark SQL insight into the structure of the data, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, greatly improving runtime efficiency. By contrast, because Spark Core has no way of knowing the internal structure of the elements stored in an RDD, it can only perform simple, general pipeline optimizations at the stage level.
The characteristics of dataframe are as follows:
- It can process data from kilobytes to petabytes on a single node or across a cluster
- It supports different data formats (Avro, CSV, Elasticsearch, etc.) and different storage systems (HDFS, Hive tables, MySQL, etc.)
- Code is optimized and generated by Spark SQL's Catalyst optimizer
- It integrates easily with big data tools and frameworks via Spark Core
- It provides API interfaces in multiple languages, including Python, Java, and Scala
In Spark Core, executing an application requires first building the context object, a SparkContext. Spark SQL can be understood as an encapsulation of Spark Core, both in its programming model and in its context object.
SparkSession is the entry point for Spark SQL queries. A SparkSession encapsulates a SparkContext internally, so the actual computation is still carried out by the SparkContext. When we use spark-shell, the Spark framework automatically creates a SparkSession object named "spark", which we can use directly. As shown in the following figure:
DataFrame code practice
Create a DataFrame from a data source
The Spark installation directory ships with sample data. Taking people.json as an example, the path is "/spark-x.x.x-bin-hadoop2.7/examples/src/main/resources/people.json", and the content is:
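The file stores one JSON object per line; note that the first record has no age field, which is why a null appears in the output below:

```json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```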
Use the following code to load the JSON file:
```scala
scala> val df = spark.read.json("./examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```
Print out the structure information of dataframe, and the code is as follows:
```scala
scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```
You can see that the first attribute of each row, "age", is of type long, and the second attribute, "name", is of type string.
We can operate on the data in two syntax styles: SQL statements and DSL (domain-specific language) syntax. The DSL style does not require creating temporary views. Both are introduced below.
For example, if we want to select the specified “age” column, we can use the select method. The code is as follows:
```scala
scala> df.select("age").show()
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+
```
You can see that only the “age” column is output.
Use the filter operation to filter the data. For example, to keep only the people older than 27, the code is as follows:
```scala
scala> df.filter(df("age") > 27).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

scala> df.filter($"age" > 27).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

scala> df.filter('age > 27).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
```
The above code shows three writing methods, and the results are the same.
Use the groupBy operator to group the data by a column and count the entries in each group. The code is as follows:
```scala
scala> df.groupBy("name").count.show()
+-------+-----+
|   name|count|
+-------+-----+
|Michael|    1|
|   Andy|    1|
| Justin|    1|
+-------+-----+
```
The above code groups the data according to the attribute “name” and counts the number of each group.
SQL syntax style means we query the data with SQL statements. Queries in this style must go through a temporary or global view; that is, the SQL operates on views, not directly on DataFrames.
Take the DataFrame created above as an example. To query the data with SQL syntax, we first create a temporary view. The code is as follows:
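The view is registered from the DataFrame with createOrReplaceTempView; the view name "people" is what the SQL statements below query:

```scala
scala> df.createOrReplaceTempView("people")
```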
Next, we can run SQL queries against this view. First select everything from the view and print it. The code is as follows:
```scala
scala> val sqldf = spark.sql("SELECT * FROM people")
sqldf: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> sqldf.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```
Select only the "name" column:
```scala
scala> spark.sql("SELECT name FROM people").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
```
Filter the data, selecting the people aged 27 or older:
```scala
scala> spark.sql("SELECT * FROM people WHERE age >= 27").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
```
Group the data according to name, and then count the number of each group:
```scala
scala> spark.sql("SELECT name, COUNT(*) FROM people GROUP BY name").show()
+-------+--------+
|   name|count(1)|
+-------+--------+
|Michael|       1|
|   Andy|       1|
| Justin|       1|
+-------+--------+
```
An ordinary temporary view is scoped to the current SparkSession. To make a view visible across the whole application, use a global temporary view instead. Global temporary views must be accessed by their full path, e.g. global_temp.people.
A global temporary view is created as follows:
```scala
scala> df.createGlobalTempView("people1")

scala> spark.sql("SELECT * FROM global_temp.people1").show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.newSession().sql("SELECT * FROM global_temp.people1").show()  // query from a new session
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```
If you try to access the ordinary temporary view "people" created earlier from a new session, an error is reported, as follows:
```scala
scala> spark.newSession().sql("SELECT * FROM people").show()
org.apache.spark.sql.AnalysisException: Table or view not found: people; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [people]
```
Error: no corresponding table found.
A Dataset is a distributed data set, introduced as a new interface in Spark 1.6. It combines the benefits of RDDs (strong typing and the ability to use powerful lambda functions) with Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.). A Dataset is strongly typed, so the corresponding type information must be provided when it is constructed.
To create a Dataset in Scala, you need to define a Scala case class, a class with the following characteristics:
- Class attributes can be extracted through decomposable pattern matching
- Comparisons are structure-based rather than reference-based
- Easy to use and operate on
- Immutability removes the need to track when and where objects change
- Comparison by value lets you compare instances directly, as if they were primitive values, which avoids the uncertainty of confusing value-based with reference-based comparison of class instances
- Pattern matching simplifies branching logic, which leads to fewer bugs and more readable code
To create a Dataset, we first need to define a case class for it to use. Taking the employees.json file as an example, define the case class as follows:
```scala
scala> case class employee(name: String, salary: Long)
defined class employee
```
The case class represents a single record in the data, and we can access field values directly through its properties. Here is the code to create a Dataset from a file:
```scala
// Note: spark.read.json returns a DataFrame by default; the as method
// converts it to a Dataset of the specified row type.
scala> val employeeDS = spark.read.json("./examples/src/main/resources/employees.json").as[employee]
employeeDS: org.apache.spark.sql.Dataset[employee] = [name: string, salary: bigint]

scala> employeeDS.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+
```
Filter the data, keeping only the rows with a salary greater than 4000:
```scala
scala> employeeDS.filter(line => line.salary > 4000).show
+----+------+
|name|salary|
+----+------+
|Andy|  4500|
+----+------+
```
Map the data, and only take the name field of each row of data, as follows:
```scala
scala> employeeDS.map(line => line.name).show
+-------+
|  value|
+-------+
|Michael|
|   Andy|
| Justin|
|  Berta|
+-------+
```
Group and aggregate data by name:
```scala
scala> val fc = employeeDS.groupBy("name").count()
fc: org.apache.spark.sql.DataFrame = [name: string, count: bigint]

scala> fc.show
+-------+-----+
|   name|count|
+-------+-----+
|Michael|    1|
|   Andy|    1|
|  Berta|    1|
| Justin|    1|
+-------+-----+
```
Group by a key derived from each row using groupByKey:
```scala
scala> val ec = employeeDS.groupByKey(x => x.name).count()
ec: org.apache.spark.sql.Dataset[(String, Long)] = [key: string, count(1): bigint]

scala> ec.show
+-------+--------+
|    key|count(1)|
+-------+--------+
|Michael|       1|
|   Andy|       1|
|  Berta|       1|
| Justin|       1|
+-------+--------+
```
Two different Datasets can be joined into one. Taking the Dataset formed above as an example, we construct a synthetic Dataset and then connect the two with the join method.
```scala
// Define a case class for the new Dataset
scala> case class EmployeeMetadata(name: String, number: BigInt)
defined class EmployeeMetadata

scala> val nameList = List("Michael","Andy","Berta","Justin")
nameList: List[String] = List(Michael, Andy, Berta, Justin)

scala> val metaData = spark.range(4).map(x => (nameList(x.toInt), x)).withColumnRenamed("_1", "name").withColumnRenamed("_2", "number").as[EmployeeMetadata]
metaData: org.apache.spark.sql.Dataset[EmployeeMetadata] = [name: string, number: bigint]

scala> metaData.join(fc, metaData.col("name") === fc.col("name")).show
+-------+------+-------+-----+
|   name|number|   name|count|
+-------+------+-------+-----+
|Michael|     0|Michael|    1|
|   Andy|     1|   Andy|    1|
|  Berta|     2|  Berta|    1|
| Justin|     3| Justin|    1|
+-------+------+-------+-----+
```
Converting among DataFrame, RDD, and Dataset
When developing real programs, we often need to convert among RDD, DataFrame, and Dataset. To do so, we first need the import: import spark.implicits._
However, if we use spark-shell, this import has already been done for us by default, so we don't need to add the line ourselves. The "spark" here is not a Scala package name but the variable name of the SparkSession object we created, so a SparkSession object must exist before the import. The spark object cannot be declared with var, because Scala only supports importing members of objects declared with val.
Taking the employees.json file above as an example, we first create a DataFrame, as follows:
```scala
scala> val df = spark.read.json("./examples/src/main/resources/employees.json")
df: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> df.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+
```
Convert it to an RDD using the rdd method:
```scala
scala> val rdd = df.rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD at rdd at <console>:25

scala> val array = rdd.collect
array: Array[org.apache.spark.sql.Row] = Array([Michael,3000], [Andy,4500], [Justin,3500], [Berta,4000])

scala> array(0)
res157: org.apache.spark.sql.Row = [Michael,3000]

scala> array(0)(0)
res158: Any = Michael

scala> array(0)(1)
res159: Any = 3000
```
Define a case class and use the as method to convert the DataFrame into a Dataset:
```scala
scala> case class Employee(name: String, salary: BigInt)
defined class Employee

scala> val ds = df.as[Employee]
ds: org.apache.spark.sql.Dataset[Employee] = [name: string, salary: bigint]

scala> ds.show
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+
```
Conversely, you can use the toDF() method to convert the Dataset back into a DataFrame:
```scala
scala> val ddf = ds.toDF()
ddf: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> ddf.show
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+
```
Convert the Dataset to an RDD with the rdd method:
```scala
scala> val rddd = ds.rdd
rddd: org.apache.spark.rdd.RDD[Employee] = MapPartitionsRDD at rdd at <console>:25

scala> rddd.first()
res165: Employee = Employee(Michael,3000)
```
Let's define another RDD and convert it to a DataFrame:
```scala
scala> val rdd = sc.makeRDD(List(("halo", 27), ("ice", 18)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD at makeRDD at <console>:24

scala> rdd.collect
res169: Array[(String, Int)] = Array((halo,27), (ice,18))
```
Use the toDF() method to convert it into a DataFrame. Be sure to pass in the column names; otherwise the default names _1, _2, … are used, as follows:
```scala
scala> rdd.toDF().show()
+----+---+
|  _1| _2|
+----+---+
|halo| 27|
| ice| 18|
+----+---+

scala> rdd.toDF("name", "age").show
+----+---+
|name|age|
+----+---+
|halo| 27|
| ice| 18|
+----+---+
```
Spark SQL can automatically convert an RDD containing case class instances into a Dataset. The case class defines the structure of the table, and its attributes become the table's column names through reflection. Case classes can contain complex types such as Seq or Array.
We first define the case class, and then use the toDS() method to convert the RDD to a Dataset:
```scala
scala> case class User(name: String, age: Int)
defined class User

scala> val ds = rdd.map(t => User(t._1, t._2)).toDS()
ds: org.apache.spark.sql.Dataset[User] = [name: string, age: int]

scala> ds.show
+----+---+
|name|age|
+----+---+
|halo| 27|
| ice| 18|
+----+---+
```
Use a diagram to illustrate the conversion relationship between the three:
Relationship among DataFrame, RDD, and Dataset
Before comparing the three, first review the definitions of these three data types in Spark:
- Spark RDD APIs
RDD stands for Resilient Distributed Dataset, a read-only, partitioned collection of records. RDD is Spark's most basic data structure; it lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. For more information, refer to the official Spark documentation.
- Spark DataFrame APIs
Unlike an RDD, a DataFrame's data is organized into named columns, similar to a table in a relational database. It is likewise an immutable distributed data set. DataFrames allow programmers to attach a structure to a distributed data set, enabling a higher level of abstraction.
- Spark DataSet APIs
Datasets are an extension of the DataFrame APIs that provide a type-safe, object-oriented programming interface. Datasets take full advantage of Spark's Catalyst optimizer. For more information, refer to the official Spark documentation.
The following compares and summarizes the three along these aspects:
1. Release time
- RDD: since Spark 1.0
- DataFrame: since Spark 1.3
- Dataset: since Spark 1.6
2. Data representation
An RDD is a distributed collection of data elements spread across the machines of a cluster.
A DataFrame is also a distributed data collection, but it organizes the data through named columns, similar to a table in a relational database.
A Dataset is an extension of DataFrame; it is strongly typed and object-oriented, and takes full advantage of the Catalyst optimizer and the off-heap storage mechanism.
3. Data format
RDD can easily handle both structured and unstructured data. However, an RDD does not impose any particular structure on the data; the user must specify it explicitly.
DataFrame is only applicable to structured or semi-structured data. It organizes data as named columns and lets Spark manage the schema.
Dataset can also handle structured and unstructured data effectively, representing data as JVM objects or collections of Row objects.
4. Data source API
RDD can read data from any source, such as text files or databases reached through JDBC, and can process the data without any predefined structure.
The DataFrame APIs let programmers work with different formats, including Avro, CSV, and JSON, and different storage systems, including HDFS, Hive tables, and MySQL. A DataFrame can both read from and write to these data sources.
The Dataset APIs also support reading from different data sources.
5. Immutability and interoperability
An RDD consists of partitioned data. The basic unit of parallelism in an RDD is the partition, a logical division of the data. Partitions are immutable and are created by transforming existing partitions; this immutability aids the consistency of computation. We can use the toDF() method to convert an RDD into a DataFrame object; conversely, the rdd method converts it back.
Once domain objects are converted into a DataFrame, they cannot be regenerated. For example, if you create testDF from testRDD, you cannot recover the original typed RDD object.
Dataset overcomes this limitation of DataFrame and can regenerate an RDD. Dataset lets you convert both existing RDDs and DataFrames into Datasets.
6. Compile-time type safety
RDD provides a familiar object-oriented programming style along with compile-time type safety.
If you try to access a column that does not exist, a DataFrame cannot raise the error at compile time; it is detected only at runtime.
Dataset provides compile-time type safety.
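A quick sketch of the difference, using the df DataFrame and employeeDS Dataset from earlier (salar is a deliberate typo):

```scala
// DataFrame: columns are addressed by string name, so a typo compiles
// and only fails at runtime with an AnalysisException.
df.select("salar")

// Dataset: fields are checked by the Scala compiler, so the same typo
// is rejected at compile time ("value salar is not a member of employee"):
// employeeDS.map(e => e.salar)   // does not compile
employeeDS.map(e => e.salary)     // compiles and runs
```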
7. Optimization
RDD has no built-in optimization engine. When dealing with structured data, RDDs cannot take advantage of Spark's built-in optimizers, such as the Catalyst optimizer and the Tungsten execution engine; developers must optimize each RDD by hand based on its characteristics.
A DataFrame is optimized by the Catalyst tree-transformation framework in four phases: a) analyzing the logical plan to resolve references; b) logical plan optimization; c) physical planning; d) code generation, which compiles parts of the query to Java bytecode. The overall process is shown in the figure "Spark SQL optimization".
Dataset, like DataFrame, is also optimized by Catalyst.
8. Serialization
Whenever Spark needs to distribute data within the cluster or write it to disk, it uses Java serialization. Serializing individual Java and Scala objects is expensive and requires sending both the data and its structure information between nodes.
Spark DataFrames can serialize data into off-heap storage in a binary format and then perform transformations directly on that off-heap memory, because Spark understands the format. This removes the need to encode the data via Java serialization.
For serialization, the Dataset APIs introduce the concept of an encoder, which handles conversion between JVM objects and their tabular representation. Table data is stored in Spark's internal Tungsten binary format. A Dataset can operate directly on serialized data to improve memory efficiency, and the encoder allows on-demand access to a single attribute without deserializing the entire object.
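As a small illustration of encoders (a sketch; Encoders.product derives an encoder for a case class via reflection):

```scala
import org.apache.spark.sql.Encoders

case class employee(name: String, salary: Long)

// The encoder knows both the JVM type and the tabular schema it maps to.
val enc = Encoders.product[employee]
println(enc.schema)  // prints the StructType derived from the case class fields
```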
9. Garbage collection
Creating and destroying individual RDD objects incurs garbage collection overhead.
DataFrame avoids this overhead by not creating an individual object for each row in the data set.
Dataset uses off-heap serialization, so no GC is needed when an object is destroyed.
10. Efficiency / memory usage
RDDs serialize individual Java and Scala objects, which takes considerable time and reduces efficiency.
DataFrame serializes into off-heap memory, reducing the load on the GC, and can dynamically generate bytecode to perform many operations directly on the serialized data; some small operations can run without deserialization at all.
Dataset likewise allows operations on serialized data, which improves memory efficiency.
11. Lazy evaluation
RDDs are evaluated lazily, meaning they are not computed immediately. Instead, Spark first records the chain of transformations applied to the original RDD and only starts computing when an action requires a result to be returned to the driver. The schematic is as follows:
Spark also evaluates DataFrames lazily; real computation starts only when an action occurs.
Dataset behaves the same as RDD and DataFrame.
12. Programming language support
The RDD APIs support Java, Scala, Python, and R.
DataFrame: same as RDD.
The Dataset API currently supports only Java and Scala.
13. Aggregation operation
The RDD APIs are relatively slow at simple grouping and aggregation operations.
The DataFrame API is very easy to use and is faster on large data sets.
Dataset aggregation operations are likewise faster on large data sets.
14. Application areas
- RDD
1) Use RDDs when performing low-level transformations and actions on your data set.
2) RDDs are also appropriate when you do not need the higher-level abstractions that come with imposing a schema.
- DataFrame and Dataset
1) Use DataFrame and Dataset when you need a higher level of abstraction.
2) For unstructured data, such as media streams and text streams.
3) Use DataFrame and Dataset when you need domain-specific APIs.
4) Consider DataFrame and Dataset when you need high-level expressions such as filter, map, aggregation, and SQL.
5) If you need a higher degree of compile-time type safety, use Dataset.