Apache Sqoop

Time: 2021-12-4

Source: dark horse big data


From Sqoop's standpoint, data transfer is divided into import and export:

Import: data import. RDBMS -----> Hadoop

Export: data export. Hadoop -----> RDBMS

1.2 sqoop installation

The prerequisite for installing sqoop is that you already have a Java and Hadoop environment.

Latest stable version: 1.4.6
    1. Modify the configuration file
cd $SQOOP_HOME/conf
mv sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Modify the following three locations:
export HADOOP_COMMON_HOME=/export/servers/hadoop-2.7.5
export HADOOP_MAPRED_HOME=/export/servers/hadoop-2.7.5
export HIVE_HOME=/export/servers/hive
    2. Add the MySQL JDBC driver package
cp /hive/lib/mysql-connector-java-5.1.32.jar $SQOOP_HOME/lib/
    3. Verify the installation:
bin/sqoop list-databases \
 --connect jdbc:mysql://localhost:3306/ \
 --username root --password hadoop
 
This command lists all the databases on the MySQL server. If they are displayed, the Sqoop installation is complete.
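
A similar sanity check, once the userdb database from the next section exists, is to list its tables (a hedged example; not required for the installation itself):
bin/sqoop list-tables \
 --connect jdbc:mysql://localhost:3306/userdb \
 --username root --password hadoop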

2. Sqoop full import

The import tool imports a single table from an RDBMS into HDFS. Each row of the table is treated as a record in HDFS, and all records are stored as text data in text files.
  • Import syntax: bin/sqoop import (generic-args) (import-args)

Sqoop test table data:

Create the database userdb in MySQL, and then execute the SQL script provided in the course resources:

    It creates three tables: emp (employee table), emp_add (employee address table), and emp_conn (employee contact table).
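
For reference, a minimal sketch of the emp table as inferred from the sample data shown later in this document; the column types are assumptions, and the authoritative DDL is the SQL script in the resources:

CREATE DATABASE IF NOT EXISTS userdb;
USE userdb;
-- columns inferred from the sample rows (id, name, deg, salary, dept); types are assumed
CREATE TABLE emp (
  id INT PRIMARY KEY,
  name VARCHAR(20),
  deg VARCHAR(20),
  salary INT,
  dept VARCHAR(10)
);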

2.1 import full MySQL table data to HDFS

The following command imports the emp table from the MySQL database server into HDFS.
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--delete-target-dir \
--target-dir /sqoopresult \
--table emp --m 1

Attribute description:
import: perform the import
--connect: the JDBC connection string of the database
--username: the user name
--password: the password
--delete-target-dir: if the directory specified by --target-dir already exists, delete it first
--target-dir: the HDFS path to import into
--table: the database table to import
--m: the number of map tasks to run; if more than one map task is used, it must be combined with --split-by
--split-by: the column used to split the data among the map tasks
--fields-terminated-by: the field separator used in the output; the default is a comma
To verify the data imported in HDFS, use the following command to view the imported data:

    hdfs dfs -cat /sqoopresult/part-m-00000

It can be seen that, on HDFS, the fields of the emp table data are separated by commas by default. A different delimiter can be specified with

    --fields-terminated-by '\t'
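
As a hedged sketch of how these options combine, the same import can be run with two map tasks, splitting on id and writing tab-separated output (the target directory /sqoopresult_tab is illustrative, not from the source):
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--delete-target-dir \
--target-dir /sqoopresult_tab \
--table emp \
--split-by id \
--fields-terminated-by '\t' \
--m 2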

2.2 import full MySQL table data to hive

There are two ways to import MySQL data into Hive:

    One is to copy the table structure first and then import the data;

    The other is to import the table structure and data in one step.

2.2.1 copy the table structure before importing

    1. Copy the table structure from the relational database into Hive
bin/sqoop create-hive-table \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--table emp_add \
--username root \
--password hadoop \
--hive-table test.emp_add_sp

Attribute description:
create-hive-table: the sub-command that creates the Hive table
--hive-table: the Hive table to create (note: it is recommended to write it as database.table, otherwise the table is placed in the default database)
    2. Import the data from the relational database into Hive
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--username root \
--password hadoop \
--table emp_add \
--hive-table test.emp_add_sp \
--hive-import \
--m 1

--hive-import: marks this as an import into Hive

2.2.2 directly copy table structure and data

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table emp_conn \
--hive-import \
--m 1 \
--hive-database test;

With this import scheme, the Hive table is created automatically and its name defaults to the name of the relational database table.
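A quick way to confirm the result, assuming the Hive CLI is available on the node (the LIMIT clause is only there to keep the output short):
hive -e "SELECT * FROM test.emp_conn LIMIT 5;"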

2.3 import table data subset (where filtering)

**--where can specify query criteria** when importing data from a relational database. It executes the corresponding SQL query on the database server and stores the result in the target directory on HDFS.
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--username root \
--password hadoop \
--where "city ='sec-bad'" \
--target-dir /wherequery \
--table emp_add \
--m 1


2.4 import table data subset (query query)

Notes:

  • When the --query SQL statement is used for the search, the --table parameter cannot be added;

  • The SQL statement must contain a WHERE clause;

  • The WHERE clause must include the string $CONDITIONS;

  • The SQL statement must be wrapped in single quotation marks, not double quotation marks;

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /wherequery12 \
--query 'select id,name,deg from emp WHERE  id>1203 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2

3. Sqoop incremental import

In real work, data imports often only need to bring in the incremental data; re-importing the whole table into Hive or HDFS every time is unnecessary and leads to duplicated data. A column is therefore usually chosen to drive the incremental import, and Sqoop supports incremental imports directly.

Incremental import is a technique that imports only the newly added or newly modified rows of a table.

Parameters involved in incremental import:

--check-column (col)
    Specifies the column(s) used to decide which rows count as incremental data during the import, similar to an auto-increment field or a timestamp in a relational database.
    Note: the specified columns cannot be of character types such as CHAR or VARCHAR. --check-column can also specify multiple columns.

--incremental (mode)
    append: append mode, e.g. append and import the records whose check column is greater than the value given in --last-value. lastmodified: based on the last-modified time; the records whose timestamp is later than the value given in --last-value are imported.

--last-value (value)
    Specifies the maximum value of the check column from the previous import (only rows greater than this value are imported); it can also be set manually.

3.1 append mode

Operation steps:

    1. First, execute the following command to import all of the existing data:
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /appendresult \
--table emp --m 1
    2. Use hdfs dfs -cat to view the generated data file and confirm that the data has been imported into HDFS.
    3. Then insert 2 incremental records into MySQL's emp table:
insert into `userdb`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) values ('1206', 'allen', 'admin', '30000', 'tp');
insert into `userdb`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) values ('1207', 'woon', 'admin', '40000', 'tp');
    4. Execute the following command to perform the incremental import:
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root  --password hadoop \
--table emp --m 1 \
--target-dir /appendresult \
--incremental append \
--check-column id \
--last-value  1205


--incremental: the incremental mode; the options are append and lastmodified
--check-column: the column used to detect new or updated data
--last-value: the last imported value of that column
    5. Finally, check the target directory again; there is one more file, which contains the incremental data (as shown below).
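A minimal verification sketch; the name of the new part file depends on the run, part-m-00001 being the usual name of the second file written by a single-mapper append:
hdfs dfs -ls /appendresult
# the file produced by the incremental run (name assumed)
hdfs dfs -cat /appendresult/part-m-00001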

3.2 lastmodified mode

Operation steps:

    1. First, create a customertest table with a timestamp field:
create table customertest(id int,name varchar(20),last_mod timestamp default current_timestamp on update current_timestamp);
The timestamp here is set automatically when a row is inserted and refreshed whenever the row is updated.
    2. Insert the following records one by one:
insert into customertest(id,name) values(1,'neil');
insert into customertest(id,name) values(2,'jack');
insert into customertest(id,name) values(3,'martin');
insert into customertest(id,name) values(4,'tony');
insert into customertest(id,name) values(5,'eric');

Note: do not execute them all at once; run them one by one so that there is a time interval between them.
    3. Insert one more record into the customertest table:
insert into customertest(id,name) values(6,'james');
    4. Perform the incremental import:
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2019-05-28 18:42:06" \
--m 1 \
--append

Attribute description:
--append: appears only in lastmodified mode; the choice is between --append and --merge-key
Only the last inserted record was expected to be imported, yet two rows were brought in here. Why?
This is because, in lastmodified mode, the rows whose check column is greater than or equal to --last-value are treated as incremental data.

3.3 lastmodified mode: append and merge-key

When using lastmodified mode for incremental processing, you must specify whether the incremental data is added in **append** mode or **merge-key** mode.

The following demonstrates an incremental update using the merge-key mode. Let's update the name field of the row with id 1:
update customertest set name = 'Neil' where id = 1;
After the update, the timestamp of this row is refreshed to the system time at which the update happened.
Execute the following command, using the id field as the merge key:
bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2019-05-28 18:42:06" \
--m 1 \
--merge-key id

Because the merge-key mode runs a complete MapReduce job, the generated file in the /lastmodifiedresult directory is named part-r-00000. In it, the name of the row with id = 1 has been modified and the row with id = 6 has been added.
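The merged output file named above can be inspected directly to confirm both changes:
hdfs dfs -cat /lastmodifiedresult/part-r-00000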

4. Sqoop export

Before exporting data from the Hadoop ecosystem to an RDBMS, the target table must already exist in the target database.

Export has three modes:

Default (insert) mode: the data from the files is inserted into the table using INSERT statements.

Update mode: sqoop will generate an update statement to replace existing records in the database.

Call mode: sqoop will create a stored procedure call for each record.

The following is the export command syntax:

$ bin/sqoop export (generic-args) (export-args)

4.1 exporting HDFS data in default mode

By default, sqoop export converts each row of input into an INSERT statement and adds it to the target database table. If the table has constraints (for example, a primary key column whose values must be unique) and already contains data, care must be taken not to insert records that violate these constraints; if an INSERT statement fails, the export fails. **This mode is mainly intended for exporting records into a new, empty table.** It is usually used for full-table exports.

When exporting, all records of a Hive table or an HDFS data set (all fields or a subset of fields) can be exported to the MySQL target table.
    1. Prepare the HDFS data
Create a file emp_data.txt under the /emp/ directory of the HDFS file system.
The data contents are as follows:

1201,gopal,manager,50000,TP
1202,manisha,preader,50000,TP
1203,kalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
1206,satishp,grpdes,20000,GR
    2. Manually create the target table in MySQL
mysql> USE userdb;
mysql> CREATE TABLE employee ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));
    3. Execute the export command
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table employee \
--export-dir /emp/emp_data
    4. Related configuration parameters (a combined sketch follows after this list)
--input-fields-terminated-by '\t'
    Specifies the field separator used in the input files.

--columns
    Selects the columns to export and controls their order. It can be omitted when the column order of the data file exactly matches the target table; otherwise list the columns, separated by commas. Columns of the target table that are not listed after --columns must either have default values or accept NULL; otherwise the database rejects the data exported by Sqoop and the Sqoop job fails.

--export-dir
    The HDFS directory to export. This parameter is required, and it must be accompanied by exactly one of --table or --call; --table names the target table in the database.

--call
    Names a stored procedure to invoke for each record.
--input-null-string --input-null-non-string
    If the first parameter is not specified, the string "null" is interpreted as NULL for string-type columns. If the second parameter is not specified, both the string "null" and an empty string are interpreted as NULL for non-string columns. For example:
--input-null-string "\\N" --input-null-non-string "\\N"
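
A hedged sketch that combines these parameters with the default-mode export from section 4.1; the column list and the comma delimiter simply mirror the layout of emp_data.txt, and none of these options is strictly required when the file already matches the table exactly:
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table employee \
--export-dir /emp/emp_data \
--input-fields-terminated-by ',' \
--columns "id,name,deg,salary,dept" \
--input-null-string "\\N" \
--input-null-non-string "\\N"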

4.2 update export (updateonly)

    1. Parameter description
--update-key
    The update key, i.e. the column(s) used to match rows for update, such as id. Multiple key columns can be specified, separated by commas.

--update-mode
    Set to updateonly (the default mode) to update existing records only, without inserting new ones.
    2. Prepare the HDFS data
Create a file updateonly_1.txt under the /updateonly_1/ directory of HDFS:
1201,gopal,manager,50000
1202,manisha,preader,50000
1203,kalil,php dev,30000
    3. Manually create the target table in MySQL
mysql> USE userdb;
mysql> CREATE TABLE updateonly ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT);
    4. Perform a full export first
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table updateonly \
--export-dir /updateonly_1/
    5. View the data in MySQL at this point
    6. Add a new file
Create updateonly_2.txt, in which the first three records are modified and one new record is added, and upload it to the /updateonly_2/ directory:
1201,gopal,manager,1212
1202,manisha,preader,1313
1203,kalil,php dev,1414
1204,allen,java,1515
    7. Perform the update export
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table updateonly \
--export-dir /updateonly_2/ \
--update-key id \
--update-mode updateonly
    8. View the final results

    Although the export log shows 4 records exported:

    in the end only the update operations were applied; the new record with id 1204 was not inserted. A quick check follows below.
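A hedged check in MySQL; per the note above, the first three salaries should now hold the updated values and no row with id 1204 should appear:
mysql> SELECT * FROM userdb.updateonly;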

4.3 update export (allowinsert)

    1. Parameter description
--update-key: the update key, i.e. the column(s) used to match rows for update, such as id. Multiple key columns can be specified, separated by commas.
--update-mode: set to allowinsert to update existing records and insert new records at the same time. It is essentially an insert-or-update (upsert) operation.
    2. Prepare the HDFS data
Create a file allowinsert_1.txt under the /allowinsert_1/ directory of HDFS:
1201,gopal,manager,50000
1202,manisha,preader,50000
1203,kalil,php dev,30000
    3. Manually create the target table in MySQL
mysql> USE userdb;
mysql> CREATE TABLE allowinsert ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT);
    4. Perform a full export first
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table allowinsert \
--export-dir /allowinsert_1/
    5. View the data in MySQL at this point
    6. Add a new file
Create allowinsert_2.txt, in which the first three records are modified and one new record is added, and upload it to the /allowinsert_2/ directory:
1201,gopal,manager,1212
1202,manisha,preader,1313
1203,kalil,php dev,1414
1204,allen,java,1515
    7. Perform the update export
bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table allowinsert \
--export-dir /allowinsert_2/ \
--update-key id \
--update-mode allowinsert
    8. View the final results

    The export log shows 4 records:

    The existing data is updated and the new record is inserted at the same time.

5. Sqoop job

A Sqoop job lets a frequently executed import or export task be saved under a name; later, when the task needs to run again, you simply invoke the job instead of writing out all the parameters every time.
    1. Basic syntax of job
$ sqoop job (generic-args) (job-args)
   [-- [subtool-name] (subtool-args)]

$ sqoop-job (generic-args) (job-args)
   [-- [subtool-name] (subtool-args)]
    2. Create a job

    Here we create a job named itcastjob, which imports data from an RDBMS table into HDFS.

    The following command creates a job that imports the emp table from the userdb database into HDFS:

bin/sqoop job --create itcastjob -- import --connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /sqoopresult333 \
--table emp --m 1

Note that there must be a space between the standalone -- and the word import.
    3. Verify the saved jobs

    The --list parameter is used to verify saved jobs. The following command lists the saved Sqoop jobs.

bin/sqoop job --list
    4. Inspect a job

    The --show parameter is used to inspect or verify a specific job and its details. The following command inspects the job named itcastjob.

bin/sqoop job --show itcastjob
    5. Execute a job

    The --exec option is used to execute a saved job. The following command executes the saved job named itcastjob.

bin/sqoop job --exec itcastjob
    6. Execute a job without entering a password

    When creating a Sqoop job, use the --password-file parameter to avoid typing the MySQL password. If --password is used instead, a warning appears and the password has to be entered manually every time the job is executed. Sqoop requires the password file to be stored on HDFS with permission 400 (a sketch for preparing the file follows the configuration below). Check whether sqoop-site.xml contains the following configuration:

    <property>
        <name>sqoop.metastore.client.record.password</name>
        <value>true</value>
        <description>If true, allow saved passwords in the metastore.
        </description>
    </property>
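
    Before creating such a job, the password file itself has to be prepared. A minimal sketch, assuming the HDFS path used in the command below and the password hadoop (echo -n avoids a trailing newline, which would otherwise become part of the password):

    echo -n "hadoop" > itcastmysql.pwd
    hdfs dfs -mkdir -p /input/sqoop/pwd
    hdfs dfs -put itcastmysql.pwd /input/sqoop/pwd/itcastmysql.pwd
    # sqoop requires the password file permission to be 400
    hdfs dfs -chmod 400 /input/sqoop/pwd/itcastmysql.pwd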
    

    Execute command:

    bin/sqoop job --create itcastjob1 -- import --connect jdbc:mysql://cdh-1:3306/userdb \
    --username root \
    --password-file /input/sqoop/pwd/itcastmysql.pwd \
    --target-dir /sqoopresult333 \
    --table emp --m 1