That little thing about distcp

Time: 2021-02-23

[TOC]

A soul-searching question: do you really understand distcp? That is what this article is about.

Background

While sorting through my notes today, I found several scattered ad-hoc records about copying files between clusters. Although the notes and their key points differed, the core topic was always distcp, so I felt it was worth a summary. This article is mainly about small details, with the emphasis on how to find your answer quickly when in doubt.
The reference links are collected at the end of this article.

Summary

First of all, what is distcp? Literally, a distributed copy: work that would otherwise be done by one worker is divided among many workers and processed in parallel. The granularity of this division is file-based; in other words, if there is only one file, the copy is done by at most one worker.

Basic Usage

# You can use the HDFS protocol (same Hadoop version on both ends) or the HFTP protocol (works across different Hadoop versions)
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
# You can also specify multiple source paths to copy at the same time
hadoop distcp hdfs://nn1:8020/foo/bar1 hdfs://nn2:8020/foo/bar2 hdfs://nn3:8020/bar/foo
# You can also use -f, which stands for "file": the sources are given in a file as a list of absolute paths
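# For example (assuming /srclist is a file on nn1 that contains absolute source paths, one per line):
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo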

Parameter description

I will not translate every parameter one by one; most of them read literally. Here I mainly discuss the details that I think need attention in actual use.

[user@host ~]$ hadoop distcp --help
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                Reuse existing data in target files and append new
                        data to them if possible
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -diff <arg>            Use snapshot diff report to identify the
                        difference between source and target
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugpcaxt)(replication,
                        block-size, user, group, permission,
                        checksum-type, ACL, XATTR, timestamps). If -p is
                        specified with no <arg>, then preserves
                        replication, block size, user, group, permission,
                        checksum type and timestamps. raw.* xattrs are
                        preserved when both the source and destination
                        paths are in the /.reserved/raw hierarchy (HDFS
                        only). raw.* xattr preservation is independent of
                        the -p flag. Refer to the DistCp documentation for
                        more details.
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missing files or
                        directories
  1. The relationship among -append, -overwrite and -update
| parameter | explanation | remarks |
| --- | --- | --- |
| append | Append: reuse the data that already exists in the target file and append the new data to it where possible | exact criterion for appending: todo |
| overwrite | Overwrite: the target file is rebuilt unconditionally, whether or not it existed before | |
| update | Update: copy only when the source and target file sizes differ | |
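
As a quick illustration (the paths are hypothetical):

# With -update, files that already exist in the target with the same size are skipped
hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
# With -overwrite, target files are rewritten unconditionally
hadoop distcp -overwrite hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
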
  2. -m

This one is easy to explain. distcp follows the MapReduce model, a bit like sqoop, so -m means "map", i.e. the degree of parallelism: at most how many maps copy at the same time. Why "at most"? Because the copy is split per file (strictly speaking, per block), and splitting one file among several workers is harder. Therefore, if the source has only one file, no matter what -m specifies, only one map task performs the copy.
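
For instance (illustrative paths):

# At most 20 map tasks copy in parallel; with a single source file, only one map actually works
hadoop distcp -m 20 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo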

  3. -i

Ignore failures. If the copy job is heavy and resources are tight, some tasks are likely to fail midway. Instead of redoing a full copy every time the job is restarted, you can ignore the failures and do an incremental copy on the next run, as sketched below.
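
A sketch of this restart pattern (illustrative paths):

# First run: ignore per-file failures so the job finishes as much as it can
hadoop distcp -i hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
# Re-run with -update so only the missing or changed files are copied
hadoop distcp -update hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo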

  4. -strategy

The copy strategy. By default, the work is divided according to file sizes, i.e. each copy task gets roughly the same number of bytes. The optional values are dynamic | uniformsize, and the default is uniformsize.
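
For example (illustrative paths):

# dynamic strategy: faster maps pull more files from a shared queue instead of getting a fixed byte share
hadoop distcp -strategy dynamic hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo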

  5. -p

The literal meaning is to preserve the status of the source files on the target system: replication factor, block size, user, group, permissions and so on. By default (without -p), these attributes follow the target system's own settings.
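
For example (illustrative paths):

# Preserve user, group and permission of the source files on the target
hadoop distcp -pugp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo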

  6. -bandwidth

Obviously, this is the bandwidth limit. distcp has no computation logic; it is an IO-intensive job, and during a cluster migration the bandwidth usage must be strictly controlled. This parameter limits the bandwidth used by each map, so limiting the number of distcp jobs and the number of maps per job controls the bandwidth used by the whole migration.
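
For example (illustrative paths and numbers):

# Each map is capped at 10 MB/s, so the whole job uses at most about -m x -bandwidth = 200 MB/s
hadoop distcp -bandwidth 10 -m 20 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo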

QA

Here I record the small problems I ran into while using distcp. They are not questions of principle or optimization, just places where usage may be doubtful or ambiguous.

Q1: what happens if there is a data conflict during the copy?
A1: If the same file name appears more than once among the sources, the distcp job fails and prints an error log. If a file to be copied already exists in the target directory, the copy of that source file is skipped by default; you can also configure the job to report an error instead. If another process writes to the target file, the job likewise fails and prints an error log.

Q2: is there any requirement on where the distcp job is deployed?
A2: The only requirement is that the node (or job) running distcp can talk to both the upstream and the downstream cluster; beyond that, the location does not matter. In practice, it is usually deployed on a node of the target cluster.

Q3: what should a distcp job pay attention to in a large data migration?
A3: distcp is a big IO job, so bandwidth is the limiting factor. You can write a script to monitor the bandwidth of the cluster machines (shell/Python preferred) and then kick off the migration during idle periods; a sketch follows.
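
As a rough starting point, a minimal shell sketch that samples a NIC's throughput from /sys; the interface name and the 60-second interval are assumptions to adapt:

#!/usr/bin/env bash
# Print the average rx/tx throughput of an interface once per minute
IFACE=${1:-eth0}   # assumed interface name; pass the real one as the first argument
while true; do
    rx1=$(cat /sys/class/net/"$IFACE"/statistics/rx_bytes)
    tx1=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)
    sleep 60
    rx2=$(cat /sys/class/net/"$IFACE"/statistics/rx_bytes)
    tx2=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)
    echo "$(date '+%F %T') rx=$(( (rx2 - rx1) / 60 )) B/s tx=$(( (tx2 - tx1) / 60 )) B/s"
done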

Appendix

# Example 1: when copying a directory, the target directory is created automatically; there is no need to create it by hand
time hadoop distcp hdfs://nn1:8020/user/hive/warehouse/${database}.db/${table}/dt=${partition} hdfs://nn2:8020/user/hive/warehouse/${database}.db/${table} >> /logs/distcp/${database}.log

# Example 2: a copy with multiple parameters. What the options mean:
#   mapred.jobtracker.maxtasks.per.job   max number of map tasks for the job (the data is split into multiple map tasks)
#   mapred.job.max.map.running           max number of concurrently running maps
#   distcp.bandwidth                     bandwidth limit
#   dfs.replication                      replication factor (two replicas)
#   distcp.skip.dir                      directories to filter out (not copied)
#   mapred.map.max.attempts              max attempts per task
#   mapred.fairscheduler.pool            pool in which the job runs
#   -pugp                                preserved attributes (user, group, permission)
#   -i                                   ignore failed tasks
#   -skipcrccheck                        skip CRC checks (prevents failures caused by different HDFS versions on the source and target clusters)
hadoop distcp \
    -Dmapred.jobtracker.maxtasks.per.job=1800000 \
    -Dmapred.job.max.map.running=4000 \
    -Ddistcp.bandwidth=150000000 \
    -Ddfs.replication=2 \
    -Ddistcp.skip.dir=$skippath \
    -Dmapred.map.max.attempts=9 \
    -Dmapred.fairscheduler.pool=distcp \
    -pugp \
    -i \
    -skipcrccheck \
    hdfs://clusterA:9000/AAA/data \
    hdfs://clusterB:9000/BBB/data
# The last two arguments are the source address and the destination address

# Example 3: cross-version copy; the hftp port must match dfs.http.address on the source cluster
hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hftp://nn1:50070/source hdfs://cluster2/target

Reference links:

  1. Hadoop migration
  2. Distributed copy

These notes were written in some haste; please point out any mistakes.