1、 Introduction to Spark SQL
Spark SQL is Spark's module for processing structured data. It provides a programming abstraction called the DataFrame and can act as a distributed SQL query engine.
Why learn Spark SQL? We have already learned Hive, which translates HiveQL into MapReduce jobs and submits them to the cluster for execution, greatly simplifying the work of writing MapReduce programs by hand. However, MapReduce is a slow computing model, and Spark SQL was created to address this: it translates SQL into RDD operations and submits them to the cluster, so execution is much faster. Spark SQL also supports reading data from Hive.
2、 Features of Spark SQL
- Seamless integration: mix SQL queries with Spark programs. Spark SQL lets you query structured data inside Spark programs using either SQL or the familiar DataFrame API, available in Java, Scala, Python, and R.
- Uniform data access: connect to any data source in the same way. DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
- Hive integration: run SQL or HiveQL queries on existing warehouses. Spark SQL supports HiveQL syntax as well as Hive SerDes and UDFs, giving you access to existing Hive warehouses.
- Standard connectivity through JDBC or ODBC. Server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.
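As a sketch of the first two points, here is what mixing the DataFrame API and SQL over the same data source looks like in a spark-shell session (the file path people.json is a hypothetical example input, not from this article):

```scala
// Read a JSON data source; the same code would work for Parquet, ORC, etc.
val df = spark.read.json("/input/people.json")

// Query with the DataFrame API
df.filter($"age" > 21).select("name").show()

// The same query in SQL, after registering a temporary view
df.createOrReplaceTempView("people")
spark.sql("select name from people where age > 21").show()
```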
3、 Core concepts: DataFrames and Datasets
A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood. DataFrames can be built from a variety of sources, such as:
- structured data files
- tables in Hive
- external databases or existing RDDs
The DataFrame API is available in Scala, Java, Python, and R.
Compared with an RDD, a DataFrame carries additional structural information about the data, namely a schema. An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides offering richer operators than the RDD API, the more important advantage of the DataFrame is execution efficiency: the schema lets Spark reduce the amount of data read and optimize the execution plan.
A Dataset is a distributed collection of data. Introduced in Spark 1.6, the Dataset is a higher-level abstraction on top of the DataFrame. It combines the advantages of RDDs (strong typing and the ability to use powerful lambda functions) with the optimized execution engine of Spark SQL. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java; Python does not support the Dataset API.
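A short spark-shell sketch of the Dataset API (the Person class and its values are made-up examples, not from this article):

```scala
// Build a strongly typed Dataset from JVM objects; toDS() relies on
// the implicits that spark-shell imports automatically
case class Person(name: String, age: Int)

val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

// Lambda-based transformations are type-checked at compile time
ds.filter(p => p.age > 21).map(p => p.name).show()
```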
4、 Creating DataFrames
- Test data: an employee table stored as emp.csv on HDFS.
- Define a case class (equivalent to the table structure, i.e. the schema):
case class Emp(empno: Int, ename: String, job: String, mgr: Int, hiredate: String, sal: Int, comm: Int, deptno: Int)
- Read the data from HDFS into an RDD:
val lines = sc.textFile("hdfs://bigdata111:9000/input/emp.csv").map(_.split(","))
- Map each array to an Emp object, associating the RDD with the case class:
val emp = lines.map(x => Emp(x(0).toInt,x(1),x(2),x(3).toInt,x(4),x(5).toInt,x(6).toInt,x(7).toInt))
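The mapping step above can be checked outside Spark; here is a minimal plain-Scala sketch of the same line-to-object conversion (the sample record is made-up test data, not a row from the real emp.csv):

```scala
// Plain-Scala check of the CSV-line-to-case-class mapping used above.
case class Emp(empno: Int, ename: String, job: String, mgr: Int,
               hiredate: String, sal: Int, comm: Int, deptno: Int)

// Made-up sample record in the same 8-column layout as emp.csv
val line = "7369,SMITH,CLERK,7902,1980/12/17,800,0,20"
val x = line.split(",")
val emp = Emp(x(0).toInt, x(1), x(2), x(3).toInt, x(4), x(5).toInt, x(6).toInt, x(7).toInt)
// emp.ename is "SMITH", emp.deptno is 20
```

Note that `toInt` will throw on empty fields, so columns such as mgr or comm must contain a value in every row for this simple mapping to work.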
- Generate the DataFrame:
val allEmpDF = emp.toDF
- Query data through the DataFrame:
- Register the DataFrame as a temporary view:
allEmpDF.createOrReplaceTempView("emp")
- Execute an SQL query:
spark.sql("select * from emp").show
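The same data can also be queried with the DataFrame API instead of SQL; a spark-shell sketch (the specific projections and filter are illustrative examples):

```scala
// DataFrame API equivalents of simple SQL queries
allEmpDF.show()                          // select * from emp
allEmpDF.select("ename", "sal").show()   // projection of two columns
allEmpDF.filter($"deptno" === 10).show() // where deptno = 10
```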