The term ETL appears most often in the context of data warehousing, but its scope is not limited to data warehouses. ETL refers to the process of extracting, transforming, and loading data from a source system into a target data store. Common ETL tools include Oracle Data Integrator, Informatica PowerCenter, DataStage, Kettle, and DataSpider.
In big data applications, handling massive data volumes and supporting potential future applications are critical concerns, and they mark a major difference from traditional application development. Therefore, when choosing an ETL tool, beyond the correctness and integrity of data processing, ease of use, and support for different data formats, one must also consider processing efficiency, scalability of processing capacity, and fault tolerance.
Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab, and a new engine for big data processing. Its main feature is a distributed in-memory abstraction over the cluster. Compared with Hadoop, Spark keeps intermediate data in memory to avoid frequent disk writes, which makes it more efficient and better suited to iterative computation; it offers a richer set of operators and broader development-language support; and it achieves fault tolerance in distributed dataset computation through checkpointing. Moreover, thanks to Spark's distributed nature, expanding processing capacity is easier and more economical. As an ETL tool, Spark can therefore help enterprises win on both the technical and the financial front.
SequoiaDB is a new-generation NewSQL database and a typical representative of distributed document databases. SequoiaDB Enterprise Edition deeply integrates the latest Spark in-memory computing framework to provide batch analysis, stream processing, and other application-facing functions. This two-tier architecture, which separates the storage layer from the computing layer, is technically complementary and is the mainstream of the new big data architecture in Silicon Valley, letting distributed computing and distributed storage each play to their full strength. In recent versions, Spark SQL's support for standard SQL keeps improving, reflecting the maturity of the product. In SequoiaDB applications, using Spark for data processing and analysis is therefore an ideal choice.
2. Function overview
An ETL tool must support multiple data sources, such as HDFS, HBase, Amazon S3, and MongoDB. Spark can connect to a wide range of data sources, including HDFS, Cassandra, HBase, Hive, Alluxio (formerly Tachyon), and Amazon S3; it can also read from and write to Elasticsearch, a full-text search engine. As an ETL tool, Spark therefore meets the requirement of versatility.
The data flow with Spark as the ETL engine is shown in Figure 1.
Figure 1: Data flow diagram with Spark as ETL
In the data flow above, stock data stored in HDFS, Cassandra, and other systems is extracted into Spark through the interfaces Spark provides, processed with Spark's fast in-memory engine (for example, deduplication and updates), and the resulting structured data is finally stored in SequoiaDB. Throughout the process there is no need to persist intermediate data as files, which speeds up processing.
Data already stored in SequoiaDB can likewise be processed in Spark, with the processed results written back to the database.
3. Environment construction
3.1 Spark environment construction
Spark's deployment modes include standalone, Spark on YARN, and Spark on Mesos. The main difference among the three is the resource management and scheduling framework they use. Here we take the standalone mode as an example.
Before deployment, configure passwordless SSH trust between the machines that will run Spark, and install and configure a JDK that matches the requirements of the chosen Spark version. Then you can install Spark itself.
First, download the latest Spark release from the official website. After the download completes, extract it to the target directory:
tar -zxvf spark-2.0.0-bin-hadoop2.6.tgz
Looking at the extracted directory, version 2.0 has some slight changes in layout compared with 1.6: the lib directory has been removed and a jars directory has been added.
Next, modify the configuration files. Usually only spark-env.sh needs changes, but for convenience later we also modify or add hive-site.xml, spark-defaults.conf, and log4j.properties. Each is explained below.
spark-env.sh configures Spark's environment variables, including:

SPARK_MASTER_IP: the IP address of the Spark cluster's master node;
SPARK_MASTER_PORT: the port of the master node, 7077 by default;
SPARK_WORKER_INSTANCES: the number of worker processes started on each node;
SPARK_WORKER_CORES: the number of cores a worker may use on its machine;
SPARK_WORKER_MEMORY: the total memory a worker can allocate to executors;
SPARK_WORKER_DIR: the worker's working directory;
SPARK_LOCAL_DIRS: the directory for shuffle data on each node;
SPARK_CLASSPATH: Spark's default classpath.
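Putting these variables together, a minimal conf/spark-env.sh for a small standalone cluster might look like the sketch below. All addresses, paths, and sizes here are illustrative assumptions, not values from the original setup:

```shell
# conf/spark-env.sh -- illustrative values for a small standalone cluster
export SPARK_MASTER_IP=192.168.111.10       # master node address (assumed)
export SPARK_MASTER_PORT=7077               # master port; 7077 is the default
export SPARK_WORKER_INSTANCES=1             # one worker process per node
export SPARK_WORKER_CORES=4                 # cores each worker may use
export SPARK_WORKER_MEMORY=8g               # memory a worker can hand to executors
export SPARK_WORKER_DIR=/opt/spark/work     # worker working directory
export SPARK_LOCAL_DIRS=/data/spark/local   # shuffle data directory
```

Since workers read these values at startup, the same file should be distributed to every node in the cluster.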
The slaves file lists the nodes that run workers in the Spark cluster; its content is a list of hostnames, one per line.
hive-site.xml is mainly used to configure the metastore database. By default Spark uses Derby to manage metadata; to use another database as the metastore, add or modify this configuration file. An example is as follows:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://192.168.111.129:5432/metastore</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mypassword</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
</configuration>
spark-defaults.conf holds Spark's default configuration, such as spark.master, spark.driver.memory, and spark.executor.extraJavaOptions. When we need to use Spark SQL through JDBC, the thrift server must be started first, and a master URL has to be specified at startup; setting the spark.master parameter in spark-defaults.conf saves us from typing the master URL on the command line every time.
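A minimal spark-defaults.conf along these lines might look like the sketch below; the master address and memory sizes are assumptions for illustration only:

```properties
# conf/spark-defaults.conf -- illustrative values
spark.master                     spark://192.168.111.10:7077
spark.driver.memory              2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails
```

With spark.master set here, the thrift server and other Spark tools pick up the master URL automatically instead of requiring it on the command line.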
log4j.properties configures Spark's logging.
Finally, start the Spark cluster. Once the configuration files have been modified, Spark can be started. Since the master and worker information is already configured, the whole cluster can be brought up with the standard start script, typically sbin/start-all.sh under the Spark home directory.
3.2 Configure the connection between Spark and SequoiaDB
SequoiaDB provides an open-source Spark connector; the code is available on GitHub (https://github.com/SequoiaDB/…). After packaging, the connector jar is named spark-sequoiadb-2.0.0.jar. Copy the connector together with the SequoiaDB Java driver package sequoiadb.jar into the jars directory, and configure SPARK_CLASSPATH in spark-env.sh so that the full paths of the connector and driver jars are on Spark's classpath, for example:
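A SPARK_CLASSPATH entry in spark-env.sh could look like the following; the installation paths are assumptions and should be replaced with the actual locations on your machines:

```shell
# in conf/spark-env.sh -- installation paths are illustrative assumptions
SPARK_CLASSPATH=/opt/spark/jars/spark-sequoiadb-2.0.0.jar:/opt/sequoiadb/java/sequoiadb.jar
export SPARK_CLASSPATH
```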
After configuration, start the thrift server, typically with the sbin/start-thriftserver.sh script.
After a successful startup, the thriftserver-related processes can be seen with the jps command.
Figure 2: Thriftserver-related processes
At this point the Spark and SequoiaDB environment is fully set up, and you can start using Spark to process data.
4. SequoiaDB data processing
4.1 Processing flow
Once the SequoiaDB and Spark environments are ready, mappings for different data sources can be created in Spark, each in the way appropriate to that source. Spark then bridges the data source and SequoiaDB to complete the ETL processing.
This section uses Spark SQL to run ETL over data already in SequoiaDB as an example of combining Spark and SequoiaDB. Similar processing logic applies when the source data lives elsewhere.
To create a table in Spark that maps to a collection in SequoiaDB, the syntax is as follows:
create table tablename
(
  f1 string,
  f2 string,
  f3 int,
  f4 double,
  f5 string,
  f6 long
) using com.sequoiadb.spark
OPTIONS
(
  host 'sdbserver1:11810,sdbserver2:11810,sdbserver3:11810',
  collectionspace 'foo',
  collection 'bar'
);
Here, host is the access address of SequoiaDB in the format hostname:svcname and may contain multiple addresses; collectionspace and collection name the collection space and collection in SequoiaDB, respectively.
This example covers the scenario of updating existing stock data with daily incremental data. The tables involved are: the account information table acct_info, which is the result table; the account information transfer table repo_acct_info, which holds the daily incremental data; and acct_info_his, which holds the existing stock data. Since Spark SQL does not support update and delete operations, scenarios involving updates and deletes can be handled by saving the result data into a new table. The update is done in two steps:
Step 1: insert the latest data from the transfer table into the result table. This step ensures that both newly arrived records and updated versions of existing records land in the result table. The statement is:
insert into table dst.acct_info
select * from src.repo_acct_info where tx_date = '2016-06-23';
Step 2: insert the stock data that received no update into the result table. The statement is:
insert into table dst.acct_info
select distinct a.*
from src.acct_info_his a
left join src.repo_acct_info b
  on a.id = b.id and b.tx_date = '2016-06-23'
where b.id is null;
Here id is the primary key of the acct_info table and uniquely identifies a record. After these two steps, the result table acct_info contains the fully updated data.
A delete operation similarly only needs to insert the rows that do not meet the deletion condition into the new table.
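The two-step update logic above can be sketched outside of Spark with plain files: step 1 takes every incremental row, and step 2 appends only the stock rows whose id has no match in the increment (the left join ... where b.id is null pattern, emulated here with join -v1). The file names and sample rows are made up for illustration:

```shell
# work in a throwaway directory
cd "$(mktemp -d)"

# stock data (acct_info_his): id,value
printf '1,old1\n2,old2\n3,old3\n' > his.txt
# daily increment (repo_acct_info): id 2 is an update, id 4 is brand new
printf '2,new2\n4,new4\n' > repo.txt

# step 1: every incremental row goes into the result table
cp repo.txt result.txt

# step 2: append stock rows whose id has no partner in the increment
# (join -v1 prints unmatched lines of the first file -- an anti join;
#  both inputs must be sorted on the join field)
sort -t, -k1,1 his.txt  > his.sorted
sort -t, -k1,1 repo.txt > repo.sorted
join -t, -v1 his.sorted repo.sorted >> result.txt

sort -t, -k1,1 result.txt   # 1,old1  2,new2  3,old3  4,new4
```

The result keeps the updated row 2,new2 and drops the stale 2,old2, exactly as the two SQL statements do.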
4.2 Performance results
1. System configuration
2. Update scenario
3. Delete scenario
Spark can read many data sources with ease. As a mature new framework, Spark supports not only relatively new data sources such as HDFS, Cassandra, Hive, and Amazon S3, but also the databases common in traditional industries, such as Oracle, DB2, and Teradata. Spark supports SQL:2003, so applications built on it can draw fully on traditional enterprises' strengths in SQL processing. In big data applications, using Spark as the ETL tool brings out the performance advantages of Spark as a distributed computing framework.
As one of the 14 publishers certified by Databricks worldwide, SequoiaDB Enterprise Edition deeply integrates the latest Spark in-memory computing framework. Its architecture separates the storage layer from the computing layer, with the two technically complementary; this is the mainstream of the new big data architecture in Silicon Valley, and it lets distributed computing and distributed storage each play to their full strength. Today Spark is widely used in fields such as real-time stream processing and analytics, and backend data processing can likewise be implemented with Spark.