Through DataFrames, Spark can interact with a wide variety of data sources (files, databases, big-data systems). We have also seen that a DataFrame can be registered as a view, which is a very useful feature.
The simple read-write process is as follows:
The read method returns a DataFrameReader object. Similarly, the write method of a DataFrame returns a DataFrameWriter object, whose save method persists the data to a file or database.
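As a minimal sketch of that round trip (the file path, output path, and session settings are illustrative assumptions, not from the original post):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWriteSketch {
    public static void main(String[] args) {
        // Local session for illustration; appName and master are assumptions.
        SparkSession session = SparkSession.builder()
                .appName("read-write-demo")
                .master("local[*]")
                .getOrCreate();

        // read() returns a DataFrameReader; json() produces a Dataset<Row> (a DataFrame).
        Dataset<Row> df = session.read().json("people.json");

        // write() returns a DataFrameWriter; save() persists the data
        // (in Parquet format by default, unless a format is specified).
        df.write().save("people-out");

        session.stop();
    }
}
```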
The data formats officially supported by Spark include:
- Parquet, an Apache serialization format (I haven't used it)
- CSV, text delimited by commas or another separator
- ORC, also an Apache data format (I haven't used it)
- Avro, also an Apache data format (I haven't used it)
- JDBC; Spark runs on the JVM, so it naturally supports JDBC data sources
- Hive, which we have already worked with
Let’s try a few examples.
Our JSON file is still in the earlier non-standard format. I expect it to come out in standard format after being read into a DataFrame:
Dataset<Row> json = session.read().json("spark-core/src/main/resources/people.json");
Written back this way, an error is reported saying that the file already exists.
So I changed the output to people1.json; this generates a folder named people1.json, and the task reports an error:
I found an answer on Stack Overflow for the error: Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z. Following it, I downloaded hadoop.dll from https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin into Hadoop's bin directory. After that, the job runs without error (this appears to be a Windows-only issue; Linux should not report it). A folder is generated:
It's strange that Spark insists on Parquet by default. But how do we use the data after saving it like this?
According to Parquet Files – Spark 3.2.0 Documentation (apache.org), Parquet is Apache's columnar storage file format. Spark automatically parses its schema (which fields it contains) and treats every column as nullable. It is mainly used in Hadoop-related environments.
The Parquet file generated above can be read back directly. Just like reading JSON files, Spark provides a parquet() method:
In addition to the save method, Spark also supports saving directly through the parquet() method:
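A sketch of both directions, reusing the folder written earlier (the output path is an assumption):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetDemo {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("parquet-demo")
                .master("local[*]")
                .getOrCreate();

        // Read the folder written earlier; Spark infers the schema
        // from the Parquet file footer, with every column nullable.
        Dataset<Row> df = session.read().parquet("people1.json");

        // Save directly as Parquet, without going through format()/save().
        df.write().parquet("people-parquet");

        session.stop();
    }
}
```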
This is probably the pattern we will use most: read data from the database, process it, and write it back.
There are two ways to connect via JDBC. The first passes the connection parameters through option():
DataFrameReader jdbc = session.read().format("jdbc");
Dataset<Row> jdbcDf = jdbc.load();
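Filled in with connection parameters, the option-based read might look like this (the MySQL URL, table name, and credentials are placeholder assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcOptionDemo {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("jdbc-option-demo")
                .master("local[*]")
                .getOrCreate();

        // All connection parameters go in as option() key/value pairs.
        Dataset<Row> jdbcDf = session.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://localhost:3306/test") // assumed instance
                .option("dbtable", "people")                       // assumed table
                .option("user", "root")                            // assumed credentials
                .option("password", "123456")
                .load();

        jdbcDf.show();
        session.stop();
    }
}
```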
Running this directly reports an error because the database driver cannot be found:
The driver can be imported through Maven (if you are not using a Maven project in real development, put the driver jar on the server and add it to the classpath):
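For MySQL, the Maven dependency could look like this (the version number is an illustrative assumption):

```xml
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.27</version>
</dependency>
```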
In addition to the option-based parameters, Spark also provides an explicit jdbc() method for producing the DataFrame, so there is no load() call:
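A sketch of the jdbc() variant, with the MySQL URL, table, and credentials again as placeholder assumptions:

```java
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcMethodDemo {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("jdbc-method-demo")
                .master("local[*]")
                .getOrCreate();

        // Credentials travel in a java.util.Properties object.
        Properties props = new Properties();
        props.put("user", "root");       // assumed credentials
        props.put("password", "123456");

        // jdbc(url, table, properties) returns the DataFrame directly;
        // no separate load() step is needed.
        Dataset<Row> df = session.read()
                .jdbc("jdbc:mysql://localhost:3306/test", "people", props);

        df.show();
        session.stop();
    }
}
```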
As you can see, the code is shorter and more object-oriented, so the second approach is recommended.
In addition, the database name can go either in the URL or as a prefix on the table name; the following works too. This is a capability of the driver and has nothing to do with our code:
Now, to save a DataFrame to the database, use write:
Note that the target table must not exist in advance, otherwise Spark complains that the table already exists. How does Spark create the table itself? It creates one whose column types are inferred from the DataFrame:
What if you want to append data instead? You cannot create a new table every time. The behavior can be specified with the mode method; here you can see the rows have been inserted twice:
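A sketch of appending with the mode method (SaveMode.Append is the relevant Spark API; connection details and table names are placeholder assumptions):

```java
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JdbcWriteDemo {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("jdbc-write-demo")
                .master("local[*]")
                .getOrCreate();

        Properties props = new Properties();
        props.put("user", "root");       // assumed credentials
        props.put("password", "123456");

        Dataset<Row> df = session.read()
                .jdbc("jdbc:mysql://localhost:3306/test", "people", props);

        // SaveMode.Append inserts into the existing table instead of failing;
        // the default mode, ErrorIfExists, aborts when the table is already there.
        df.write()
          .mode(SaveMode.Append)
          .jdbc("jdbc:mysql://localhost:3306/test", "people_copy", props);

        session.stop();
    }
}
```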
Another problem is the encoding of Chinese characters; we need to specify it explicitly:
An existing table is used here; its definition was copied from the original table:
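With MySQL Connector/J, the character encoding is usually declared through JDBC URL parameters; a small sketch (host and database name are assumptions):

```java
public class JdbcUrlEncoding {
    public static void main(String[] args) {
        // useUnicode and characterEncoding are standard MySQL Connector/J
        // URL parameters; they keep Chinese characters intact on write.
        String url = "jdbc:mysql://localhost:3306/test"
                + "?useUnicode=true&characterEncoding=utf8";
        System.out.println(url);
    }
}
```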