Efficient data exchange between hashdata and HDFS

Date: 2021-07-29

Background and challenges

Before object storage technology emerged and became widespread, HDFS (Hadoop Distributed File System) was one of the few open-source, free, cost-effective PB-scale storage systems on the market (compared with expensive SAN systems), and it was widely used for enterprise data archiving. At the same time, a variety of distributed computing frameworks targeting different technical requirements were built on top of HDFS, making it suitable for cleaning and normalizing unstructured data, stream computing, machine learning, and other workloads. The data warehouse, on the other hand, serves as the master data system: it keeps the most valuable historical data within the enterprise and supports day-to-day business analysis and decision-making. In many large organizations the two systems coexist, so efficient data exchange with an HDFS-based big data platform is a problem every modern data warehouse product needs to address.

Greenplum Database (hereinafter GPDB), a leading open-source enterprise data warehouse, offers two main approaches: PXF and gphdfs. Both rely on GPDB's external table mechanism, but PXF additionally requires installing and deploying a separate PXF server process; in a complex IT environment this is cumbersome, error-prone, and delivers a poor end-user experience. When planning and implementing HDFS access for the hashdata data warehouse, we therefore followed the gphdfs technical route: an external table protocol is added for accessing HDFS, and each compute node connects directly to the HDFS cluster without any intermediate node or system. This greatly lowers the barrier to use while keeping data exchange between the two systems efficient.

Before going into the implementation details, let’s briefly review the challenges of the gphdfs built into GPDB in actual use (based on real feedback from a large bank customer):

1. Additional software installation is required

  • Install Java on each node;
  • Install Kerberos client on each node;
  • Install Hadoop client on each node;

2. Configuration is complex and error-prone

  • Configure the Java environment variables for the gpadmin user;
  • Modify database parameters;
  • For each HDFS cluster, configure Hadoop's core-site.xml, yarn-site.xml, and hdfs-site.xml on every node;

3. Cannot access multiple HDFS systems at the same time

  • Each database session can access only one HDFS system (a consequence of the environment variable settings) and cannot access several HDFS systems at the same time (for example, to join data residing on different HDFS clusters).

The gphdfs implementation in hashdata

Inheriting from GPDB, hashdata natively supports a variety of external table protocols. Besides gphdfs, these include file (file system), gpfdist (file server), and OSS (object storage), all of which can be used for high-speed data loading and unloading. The following is a schematic diagram of a gphdfs external table:

[Figure: schematic diagram of the gphdfs external table]

At the architecture level, hashdata's gphdfs implementation is consistent with GPDB's; the differences lie in the implementation details. First, we use libhdfs3, a native C++ library, as the client for accessing HDFS. This avoids the complex and error-prone steps of installing, deploying, and configuring a Java runtime and Hadoop client, and it also reduces the CPU and memory consumption of the system.

Second, we introduced a gphdfs.conf file, similar in spirit to an Oracle data source configuration file, which centralizes the access information for multiple HDFS systems and simplifies configuration management. We also adjusted the syntax for defining HDFS external tables, moving a large number of configuration options into gphdfs.conf and greatly reducing the burden on users. Because the coupling between the Hadoop client (including its environment variable configuration) and a specific HDFS system is removed, the new gphdfs can access external tables on multiple HDFS systems (potentially provided by different Hadoop vendors) within the same SQL statement, which greatly facilitates multi-source data fusion in complex big data environments.
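
To illustrate what this decoupling enables, here is a minimal sketch of a cross-cluster join. It assumes two clusters named hadoop_cluster1 and hadoop_cluster2 have been declared in gphdfs.conf (as in the configuration examples later in this article); the paths, table names, and column lists are hypothetical.

-- Two external tables, each backed by a different HDFS cluster declared in gphdfs.conf
CREATE READABLE EXTERNAL TABLE ext_orders(id int, amount numeric)
LOCATION('gphdfs://tmp/orders/ hdfs_cluster_name=hadoop_cluster1') FORMAT 'csv';

CREATE READABLE EXTERNAL TABLE ext_customers(id int, name text)
LOCATION('gphdfs://tmp/customers/ hdfs_cluster_name=hadoop_cluster2') FORMAT 'csv';

-- Both HDFS systems are read within a single SQL statement
SELECT c.name, sum(o.amount)
FROM ext_orders o
JOIN ext_customers c ON o.id = c.id
GROUP BY c.name;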

Finally, thanks to the flexible and elegant extension framework of PostgreSQL (and the external table framework of GPDB built on top of it), this new gphdfs functionality can be delivered as an extension plug-in that replaces the original implementation without modifying the database kernel code.
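
If the plug-in is packaged as a standard PostgreSQL extension (the package name below is an assumption for illustration, not a documented name), enabling it in a database could look as simple as:

-- hypothetical extension name; no kernel changes are required
CREATE EXTENSION IF NOT EXISTS gphdfs;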

Application practice

Hadoop cluster configured with Kerberos authentication

  • Install kinit (on each node):
yum install krb5-libs krb5-workstation
  • Configure krb5.conf (per node configuration):
[realms] 
HADOOP.COM = { 
admin_server = host1 
kdc = host1 
kdc = host2 
}
  • Copy the keytab file of the Kerberos-authenticated user to each node:
gpscp -f hostfile user.keytab =:/home/gpadmin/key_tab/ 
  • Configure the gphdfs.conf file (on each node):
hadoop_cluster1:
hdfs_namenode_host: pac_cluster_master 
hdfs_namenode_port: 9000
hdfs_auth_method: kerberos 
krb_principal: gpadmin/[email protected] 
krb_principal_keytab: /home/gpadmin/hadoop.keytab 
hadoop_rpc_protection: privacy 
is_ha_supported: true
dfs.nameservices: mycluster
dfs.ha.namenodes.mycluster: nn1,nn2 
dfs.namenode.rpc-address.mycluster.nn1: 192.168.111.70:8020 
dfs.namenode.rpc-address.mycluster.nn2: 192.168.111.71:8020 
dfs.namenode.http-address.mycluster.nn1: 192.168.111.70:50070 
dfs.namenode.http-address.mycluster.nn2: 192.168.111.71:50070 
dfs.client.failover.proxy.provider.mycluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailover... 
hadoop_cluster2: 
... 

Hadoop cluster configured without Kerberos authentication

  • Configure the gphdfs.conf file (on each node):
hadoop_cluster1: 
hdfs_namenode_host: pac_cluster_master 
hdfs_namenode_port: 9000
hdfs_auth_method: simple
krb_principal: gpadmin/[email protected] 
krb_principal_keytab: /home/gpadmin/hadoop.keytab
hadoop_rpc_protection: privacy
is_ha_supported: true 
dfs.nameservices: mycluster 
dfs.ha.namenodes.mycluster: nn1,nn2 
dfs.namenode.rpc-address.mycluster.nn1: 192.168.111.70:8020
dfs.namenode.rpc-address.mycluster.nn2: 192.168.111.71:8020 
dfs.namenode.http-address.mycluster.nn1: 192.168.111.70:50070 
dfs.namenode.http-address.mycluster.nn2: 192.168.111.71:50070
dfs.client.failover.proxy.provider.mycluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailover... 
hadoop_cluster2: 
... 


Accessing the hadoop_cluster1 cluster:

Write data to HDFS:

CREATE WRITABLE EXTERNAL TABLE ext_w_t1(id int, name text) LOCATION('gphdfs://tmp/test1/ hdfs_cluster_name=hadoop_cluster1') FORMAT 'csv';
INSERT INTO ext_w_t1 VALUES(1,'hashdata');

Read HDFS data:

CREATE READABLE EXTERNAL TABLE ext_r_t1(id int, name text) LOCATION('gphdfs://tmp/test1/ hdfs_cluster_name=hadoop_cluster1') FORMAT 'csv';
SELECT * FROM ext_r_t1; 

To access the hadoop_cluster2 cluster, set hdfs_cluster_name=hadoop_cluster2 when creating the external table.
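
As a minimal sketch (the path /tmp/test2/ and the column list are assumptions for illustration):

CREATE READABLE EXTERNAL TABLE ext_r_t2(id int, name text) LOCATION('gphdfs://tmp/test2/ hdfs_cluster_name=hadoop_cluster2') FORMAT 'csv';
SELECT * FROM ext_r_t2;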

hashdata gphdfs in production

Before 2019, a large state-owned bank, which was GPDB's largest, most complex, and most heavily loaded customer in the world, ran dozens of GPDB clusters of various versions on x86 physical servers, alongside Hadoop clusters supplied by a single vendor. Starting in 2019, with the rollout of its big data cloud platform project, the customer began gradually migrating its big data analytics workloads to cloud-based Hadoop and a cloud data warehouse system (the hashdata data warehouse). So far, more than 20 hashdata computing clusters have gone live, together with several Hadoop clusters supplied by at least two different vendors. Using the new gphdfs capability provided by hashdata, the customer easily, quickly, and efficiently runs thousands of jobs every day that access multiple HDFS systems across nearly 100 MPP production clusters (both the original GPDB clusters and the new hashdata clusters).