Summary of sqoop learning

Time: 2020-10-27

Sqoop is an Apache open source project in the Hadoop family. Its core function is converting and transferring data between relational databases and Hadoop, and the name itself hints at that: SQL + Hadoop = Sqoop. There are currently two main versions, sqoop1 and sqoop2. sqoop1 is versioned 1.4.x and sqoop2 is versioned 1.99.x; the numbering is a bit curious, and it is not clear why it was chosen this way. Sqoop2 is completely incompatible with sqoop1, and the two differ entirely in both software architecture and usage.

Sqoop1 has no notion of a server; it acts as a client of the Hadoop cluster and calls MapReduce to do the work. Sqoop2 adds a server layer with an embedded Tomcat. Users log in to the sqoop2 shell through the client and create links there. A link is based on a connector; connectors do not need to be created, since sqoop2 ships with them. Common connectors include the HDFS connector and the JDBC connector, corresponding to HDFS and relational database connections respectively; the latest 1.99.6 release also supports Kafka and other connectors. Creating a link is essentially instantiating a connector, specifying the address, user name, password and other connection information. Once links exist, you can create a job. A job ties together two links: one to the data source and one to the data destination. Creating a job does not transfer any data by itself; a successfully created job can be started at any time. For a simple walkthrough of sqoop2, see the official demo: Sqoop5MinutesDemo.
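As a rough illustration of the workflow just described, a sqoop2 shell session looks more or less like the sketch below. It loosely follows the 5-minute demo; the exact flags, and whether connectors, links and jobs are referenced by id or by name, vary between 1.99.x releases, and the server host and port here are assumptions.

# start the client shell (bin/sqoop2-shell, or bin/sqoop.sh client in older releases)
sqoop2-shell

# inside the shell: point the client at the sqoop2 server
sqoop:000> set server --host localhost --port 12000 --webapp sqoop

# list the connectors shipped with sqoop2 (generic JDBC connector, HDFS connector, ...)
sqoop:000> show connector

# instantiate connectors into links; the shell then prompts for
# the JDBC URL, user name, password and so on
sqoop:000> create link -c 1
sqoop:000> create link -c 2

# a job ties a "from" link to a "to" link; only starting it actually moves data
sqoop:000> create job -f 1 -t 2
sqoop:000> start job -j 1
sqoop:000> status job -j 1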

My understanding is that the sqoop2 server mainly exists to store and manage data source connection information, data transfer jobs and so on; it also provides advanced features such as driving jobs through a REST interface. Compared with sqoop1, the extra server layer makes sqoop2 more troublesome to configure. Starting the server often runs into strange errors, and even when it starts successfully the subsequent steps can still fail. The problem I hit was this: the link was created successfully, the job was created successfully, but after starting, the job simply hung, with no error on screen and nothing in the log files. I never got to the bottom of it, so I fell back to sqoop1.

Sqoop1 is much simpler. Unpack it into a directory of your choice and set the HADOOP_COMMON_HOME and HADOOP_MAPRED_HOME environment variables in /etc/profile; if you use HBase and Hive, you also need to set the corresponding environment variables. Using sqoop1 is just as simple and works like any ordinary shell command: type sqoop plus some parameters on the command line. My scenario is importing all the data of a PostgreSQL database into HDFS; the shell script for that follows the environment sketch below.
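A minimal sketch of the environment configuration just mentioned: /etc/profile might contain something like the following. The installation paths are assumptions and need to be adjusted to your own layout.

# /etc/profile (example paths, not the actual layout used here)
export SQOOP_HOME=/opt/sqoop-1.4.6
export HADOOP_COMMON_HOME=/opt/hadoop-2.7.1
export HADOOP_MAPRED_HOME=/opt/hadoop-2.7.1
# only needed when importing into HBase / Hive
# export HBASE_HOME=/opt/hbase
# export HIVE_HOME=/opt/hive
export PATH=$PATH:$SQOOP_HOME/bin

With those variables in place (after re-sourcing /etc/profile), the script below walks through every database on the PostgreSQL server and imports its tables into HDFS: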

#!/bin/bash
# transport data from PG to HDFS

# parse database names 
psql -h 10.192.33.55 -p 5432 -U postgres -c '\l' > ./database_info
cut -f 1 -d '|'  ./database_info | grep '^\s[a-zA-Z]' | sed s/[[:space:]]//g > database_names
rm -f  ./database_info

# get one database name every time and conduct the transformation by sqoop
cat ./database_names | while read DBNAME
do
  if [[ $DBNAME != "postgres" && $DBNAME != "template0" && $DBNAME != "template1" ]]
  then
    # make dir on HDFS for each database
    hadoop fs -mkdir /pgdata/$DBNAME
    # make code dir for each database
    mkdir ./gencode/$DBNAME
    SCHEMA=`echo $DBNAME | tr a-z A-Z`
    echo "start working on the database $DBNAME ********************************** "
    sqoop-import-all-tables --connect jdbc:postgresql://10.192.33.55:5432/$DBNAME \
          --direct --username postgres --password postgres --fields-terminated-by '|' \
          --enclosed-by "'" --escaped-by '\' --warehouse-dir /pgdata/$DBNAME -m 1 \
          --outdir ./gencode/$DBNAME   -- --schema $SCHEMA
    echo "finished working on database $DBNAME ==================================="
  fi
done

What I used here is sqoop-import-all-tables, which imports every table of a database into Hadoop. If you only need a single table, use sqoop import and specify the table with --table <tablename>. Besides importing whole tables, sqoop can also filter the imported rows by a condition, which is done with the --where option; a single-table sketch is shown after the parameter list. The parameters used in the script are explained below:

--connect: the JDBC connection string.
--username: the database user name.
--password: the database password, specified explicitly here. You can also enter the password interactively with -P, or load it from a file with --password-file.
--direct: where possible, bypass the normal JDBC path and use the database's own bulk tool to move the data, such as mysqldump for MySQL or psql for PostgreSQL. Not all databases support this.
--fields-terminated-by: the separator between fields; the default is a comma.
--enclosed-by: the character used to wrap each field; the default is empty (no enclosing character).
--escaped-by: the escape character. Here the separator is '|' and the escape character is '\', so a '|' occurring inside a field is written out as '\|'.
--warehouse-dir: the parent directory on HDFS under which the imported tables are stored.
-m: the number of map tasks. Some tables cannot be split across several map tasks (no primary key, for example), so for simplicity I used a single map task.
--outdir: the directory where the Java classes generated during the import are saved.
--schema: the PostgreSQL schema. Note that this kind of parameter is passed to the underlying database tool (psql here), not to sqoop itself, which is why it comes after a separate "--".
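As mentioned above, a single-table import with a row filter looks roughly like the following. The database, table, column and condition are made up for illustration; the connection details mirror the script above.

sqoop import --connect jdbc:postgresql://10.192.33.55:5432/mydb \
      --username postgres --password postgres \
      --table orders --where "create_time >= '2016-05-01'" \
      --fields-terminated-by '|' --target-dir /pgdata/mydb/orders \
      -m 1 -- --schema MYDB

Here --target-dir names one directory for the single table, whereas --warehouse-dir in the script above creates one sub-directory per table.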

For more parameters, please refer to the official document: SqoopUserGuide.
The import result looks like this, with each record in the database corresponding to one line in the file:

'3663684'|'2016-05-12 08:06:00'|'2016-05-13 11:00:00'|'3'|'669'|'62.24'|'2016051200'|'187'
'3663685'|'2016-05-12 08:06:00'|'2016-05-13 17:00:00'|'3'|'669'|'9.71'|'2016051200'|'187'
'3663686'|'2016-05-12 08:06:00'|'2016-05-13 10:00:00'|'3'|'669'|'72.50'|'2016051200'|'187'
'3663687'|'2016-05-12 08:06:00'|'2016-05-13 04:00:00'|'3'|'669'|'1.00'|'2016051200'|'187'
'3663688'|'2016-05-12 08:06:00'|'2016-05-13 00:00:00'|'3'|'669'|'1.00'|'2016051200'|'187'
'3663689'|'2016-05-12 08:06:00'|'2016-05-13 09:00:00'|'3'|'669'|'110.57'|'2016051200'|'187'
'3663690'|'2016-05-12 08:06:00'|'2016-05-13 22:00:00'|'3'|'669'|'13.86'|'2016051200'|'187'
'3663691'|'2016-05-12 08:06:00'|'2016-05-13 08:00:00'|'3'|'669'|'109.19'|'2016051200'|'187'
'3663692'|'2016-05-12 08:06:00'|'2016-05-13 07:00:00'|'3'|'669'|'104.67'|'2016051200'|'187'
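The resulting files can be inspected directly on HDFS, for example as below. The database and table name placeholders are mine, and the part file name assumes the default naming of a map-only import (typically part-m-00000).

# list what was imported for one database
hadoop fs -ls /pgdata/<database_name>
# peek at the first lines of one table's output
hadoop fs -cat /pgdata/<database_name>/<table_name>/part-m-00000 | head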

That is how data is imported into Hadoop; exporting out of Hadoop works in the opposite direction with the sqoop export command.
