Solving the problem of Hive having more rows than MySQL after a Sqoop import

Time: 2022-7-31

Original import command:


bin/sqoop import --connect jdbc:mysql://192.168.169.128:3306/yubei --username root --password 123456 --table yl_city_mgr_evt_info --split-by rec_id -m 4 --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite --create-hive-table --delete-target-dir --hive-database default --hive-table yl_city_mgr_evt_info

Cause analysis: the MySQL fields likely contain '\n' (and other delimiter characters). When importing into Hive, '\n' is treated as the record delimiter by default, so Hive ends up with more records than MySQL.
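A quick way to confirm the symptom is to compare row counts on both sides (a minimal sketch, assuming the mysql and hive command-line clients are available and the connection details match the command above):

# Row count in the MySQL source table
mysql -h 192.168.169.128 -uroot -p123456 -e "SELECT COUNT(*) FROM yubei.yl_city_mgr_evt_info;"

# Row count in the Hive table; if it is larger, embedded '\n' in the source fields is the likely culprit
hive -e "SELECT COUNT(*) FROM default.yl_city_mgr_evt_info;"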

Solution:

Adding the --hive-drop-import-delims option to the import drops \n, \r, and \01 from the fields.

Final import command:


bin/sqoop import --connect jdbc:mysql://192.168.169.128:3306/yubei --username root --password 123456 --table yl_city_mgr_evt_info --split-by rec_id -m 4 --hive-drop-import-delims --fields-terminated-by "\t" --lines-terminated-by "\n" --hive-import --hive-overwrite --create-hive-table --delete-target-dir --hive-database default --hive-table yl_city_mgr_evt_info

Refer to official documents: https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
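If the embedded characters should be kept rather than dropped, Sqoop also provides --hive-delims-replacement, which replaces \n, \r, and \01 in string fields with a string of your choice. A hedged variant of the same import, replacing the delimiters with a space, might look like this:

bin/sqoop import --connect jdbc:mysql://192.168.169.128:3306/yubei \
--username root --password 123456 \
--table yl_city_mgr_evt_info --split-by rec_id -m 4 \
--hive-delims-replacement " " \
--fields-terminated-by "\t" --lines-terminated-by "\n" \
--hive-import --hive-overwrite --create-hive-table --delete-target-dir \
--hive-database default --hive-table yl_city_mgr_evt_info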

Addendum: pitfalls encountered when importing MySQL data into Hive with Sqoop

1. Importing into HDFS with Sqoop

1.1 After the Sqoop job executes, the last-value is updated automatically

#Sqoop incremental import script
bin/sqoop job --create sqoop_hdfs_test02 -- import \
--connect jdbc:mysql://localhost:3306/pactera_test \
--username root \
--password 123456 \
--table student \
--target-dir /user/sqoop/test002/ \
--fields-terminated-by "\t" \
--check-column last_modified \
--incremental lastmodified \
--last-value "2018-12-12 00:03:00" \
--append

Note: the --append parameter is required; without it, an error is reported when the job is run a second time.

With that, the Sqoop job is set up!
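To run and inspect the job, the standard Sqoop job subcommands apply (a sketch; the job name matches the one created above):

bin/sqoop job --list                     # list saved jobs
bin/sqoop job --show sqoop_hdfs_test02   # show the stored options, including the saved last-value
bin/sqoop job --exec sqoop_hdfs_test02   # execute the incremental import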

2. Creating a Hive table to read the data imported by Sqoop


create external table if not exists student_hive (SId int, Sname string, Sage string, Ssex string, last_modified timestamp)
row format delimited fields terminated by '\t' location 'hdfs://node01:8020/user/sqoop/test002/';

Note: the time column in Hive is declared as timestamp here; if it is declared as date, the data cannot be loaded correctly.

The first full load works end to end, and the data can be queried from the Hive table.
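A simple verification of the first load (a sketch, assuming the hive CLI is available on the cluster node):

hive -e "SELECT COUNT(*) FROM student_hive;"
hive -e "SELECT * FROM student_hive ORDER BY SId LIMIT 10;"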

----------------------- Key point -----------------------

*With Sqoop's lastmodified incremental mode, the saved last-value is the system time at which the job executed. If a row's check column in the source database is earlier than that saved value (i.e. the previous run's execution time), the row will not be loaded by the next run.

The row with sid=6 was not loaded for this reason, so its timestamp was changed to the current time (2018-12-26 17:05) for the test, and it was then loaded successfully!
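To reproduce the fix, one approach is to bump the row's check column to the current time in MySQL and re-run the job (a sketch; the table and column names follow the job definition and Hive DDL above, and SId=6 is the row from the test):

# Make the row newer than the saved last-value
mysql -h localhost -uroot -p123456 -e "UPDATE pactera_test.student SET last_modified = NOW() WHERE SId = 6;"

# Re-run the incremental job; the row is now appended under /user/sqoop/test002/
bin/sqoop job --exec sqoop_hdfs_test02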

Summary:

When using the lastmodified mode for Sqoop incremental imports:

1. Remember to use --append;

2. The last-value is the system time when the job runs; when testing, make sure the check column data is accurate and keeps increasing.

3. When results look wrong, check the data itself to pinpoint exactly where the problem lies in your system.

The above is my personal experience. I hope it can serve as a reference, and I hope you will continue to support developeppaer. If anything is wrong or not fully considered, please don't hesitate to leave a comment.