Spark: The Definitive Guide – What Is Spark? (qbit)



  • These are study notes on Spark: The Definitive Guide
#English original
《Spark: The Definitive Guide》
By Bill Chambers / Matei Zaharia
First edition, February 2018

#Chinese translation
Spark: The Definitive Guide (Chinese edition)
Translated by Zhang Yanfeng / Wang Fangjing / Chen Jingjing
First edition, April 2020
  • Most of Spark: The Definitive Guide was written against Spark 2.2


Part I Overview of Big Data and Spark
Chapter 1 What Is Spark? (this article)
Chapter 2 A Gentle Introduction to Spark
Chapter 3 A Tour of Spark's Toolset

Part II Structured APIs -- DataFrames, SQL, and Datasets
Chapter 4 Structured API Overview
Chapter 5 Basic Structured Operations
Chapter 6 Working with Different Types of Data
Chapter 7 Aggregations
Chapter 8 Joins
Chapter 9 Data Sources
Chapter 10 Spark SQL
Chapter 11 Datasets

Part III Low-Level APIs
Chapter 12 Resilient Distributed Datasets (RDDs)
Chapter 13 Advanced RDDs
Chapter 14 Distributed Shared Variables

Part IV Production Applications
Chapter 15 How Spark Runs on a Cluster
Chapter 16 Developing Spark Applications
Chapter 17 Deploying Spark
Chapter 18 Monitoring and Debugging
Chapter 19 Performance Tuning

Part V Streaming
Chapter 20 Stream Processing Fundamentals
Chapter 21 Structured Streaming Basics
Chapter 22 Event-Time and Stateful Processing
Chapter 23 Structured Streaming in Production

Part VI Advanced Analytics and Machine Learning
Chapter 24 Advanced Analytics and Machine Learning Overview
Chapter 25 Preprocessing and Feature Engineering
Chapter 26 Classification
Chapter 27 Regression
Chapter 28 Recommendation
Chapter 29 Unsupervised Learning
Chapter 30 Graph Analytics
Chapter 31 Deep Learning

Part VII Ecosystem
Chapter 32 Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
Chapter 33 Ecosystem and Community

Chapter 1 What Is Spark?

  • Most of the book was written against Spark 2.2, so you should download version 2.2 or later.
  • Running Spark on a cloud platform

If you would like a simpler, interactive way to learn Spark, you may prefer Databricks Community Edition. As mentioned earlier, Databricks is a company founded by the Berkeley team that created Spark, and it offers a free, cloud-based Community Edition as a learning environment. Databricks Community Edition includes all the data and code samples in this book, so you can get them running quickly. To use Databricks Community Edition, follow the… You can run Spark programs in Scala, Python, SQL, or R through a web interface, and you can also visualize the results.

  • Data used in this book

We use several datasets as examples in this book. If you want to run the code locally, you can… download them. Download the data first, put it in a folder, and then run the code snippets in this book.
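
Before running the snippets locally, it can help to confirm the two prerequisites mentioned above: a Spark version of 2.2 or later, and a folder that actually holds the downloaded data. The following is only a minimal sketch in plain Python, not something taken from the book. The helper names `version_at_least` and `data_folder_ready` and the `./data` path are hypothetical, chosen for illustration.

```python
import os
import re


def version_at_least(version, minimum=(2, 2)):
    """Return True if a Spark version string such as '2.4.5' is >= minimum."""
    # Compare only the leading major.minor numbers, so a patch level or a
    # suffix such as '-SNAPSHOT' does not affect the result.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return (int(match.group(1)), int(match.group(2))) >= minimum


def data_folder_ready(path):
    """Return True if the folder exists and contains at least one entry."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


if __name__ == "__main__":
    # In a live PySpark session, the string would come from `spark.version`.
    print(version_at_least("2.2.0"))
    # "./data" is a placeholder for wherever you saved the datasets.
    print(data_folder_ready("./data"))
```

The version check compares integer tuples rather than raw strings, so a version such as 10.0 would still sort after 2.2.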

This article is from qbit snap