Using the juicefs sync command to migrate and synchronize data across clouds

Time:2022-5-14

In recent years, cloud computing has become mainstream. Whether to avoid being locked in to a single cloud provider, to add redundancy for business and data, or to optimize costs, enterprises are migrating some or all of their workloads from on-premises data centers to the cloud, or from one cloud platform to another. Migrating workloads means migrating data. Conveniently, JuiceFS already integrates with the APIs of many object storage services and implements data synchronization logic on top of them. Let's take a look at the sync command of JuiceFS.

What is juicefs sync

The sync subcommand of JuiceFS is a full-featured data synchronization utility that can concurrently synchronize or migrate data between all object storage services supported by JuiceFS. It supports data migration between object storage and JuiceFS, as well as cross-cloud and cross-region migration between one object store and another. Similar to rsync, it is not limited to object storage: it can also synchronize local directories and remote directories accessed through SSH, HDFS, WebDAV, and so on, and it provides advanced features such as full synchronization, incremental synchronization, and pattern matching.

Basic Usage

Command format

juicefs sync [command options] SRC DST

That is, synchronize SRC to DST. Both directories and files can be synchronized.

Where:

  • SRC is the address and path of the data source
  • DST is the address and path of the destination
  • [command options] are optional synchronization options; see the Command Reference for details

The address format is [NAME://][ACCESS_KEY:SECRET_KEY@]BUCKET[.ENDPOINT][/PREFIX]

Where:

  • NAME is the storage type, such as s3 or oss. See All Supported Storage Services for the full list
  • ACCESS_KEY and SECRET_KEY are the API access keys of the object storage
  • BUCKET[.ENDPOINT] is the access address of the object storage
  • PREFIX is optional and defines the prefix of the directory names to be synchronized.

The following is an example address for an Amazon S3 object store:

s3://ABCDEFG:[email protected]
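As a rough sketch of how the parts of the address format fit together, the shell function below assembles an address from its components. build_addr is a hypothetical helper written for this illustration (it is not part of juicefs), and the keys and bucket host are placeholder values.

```shell
# build_addr NAME ACCESS_KEY SECRET_KEY BUCKET_HOST [PREFIX]
# Hypothetical helper: prints NAME://ACCESS_KEY:SECRET_KEY@BUCKET_HOST[PREFIX].
build_addr() {
  name=$1 access_key=$2 secret_key=$3 bucket_host=$4 prefix=${5:-}
  printf '%s://%s:%s@%s%s\n' "$name" "$access_key" "$secret_key" "$bucket_host" "$prefix"
}

# Placeholder values, for illustration only.
build_addr s3 MY_ACCESS_KEY MY_SECRET_KEY mybucket.s3.us-west-1.amazonaws.com /movies/
```

The printed string can then be passed to juicefs sync as SRC or DST.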

Note that if SRC or DST ends with /, it is treated as a directory, for example movies/. If it does not end with /, it is treated as a prefix and matched by prefix-matching rules. For example, suppose the current directory contains two directories, test and text. The following command synchronizes both of them to the target path ~/mnt/:

juicefs sync ./te ~/mnt/te

Here the sync command treats te as a prefix and matches every directory or file in the current path whose name starts with it, namely test and text. The te in the target path ~/mnt/te is also a prefix, and it replaces the source prefix in the names of all synchronized directories and files. In this example te is replaced with te, so the names stay unchanged. If you change the prefix of the target path, for example to ab:

juicefs sync ./te ~/mnt/ab

After synchronization, the test directory is named abst in the target path, and text is named abxt.
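The renaming can be imitated with plain shell string operations. The sketch below does not call juicefs; map_prefix is a hypothetical helper that only mimics the prefix replacement described above, under the assumption that the source prefix is simply swapped for the target prefix.

```shell
# map_prefix SRC_PREFIX DST_PREFIX NAME
# Hypothetical helper: if NAME starts with SRC_PREFIX, print NAME with that
# prefix replaced by DST_PREFIX; otherwise return 1 (the name is not matched).
map_prefix() {
  src_prefix=$1 dst_prefix=$2 name=$3
  case $name in
    "$src_prefix"*) printf '%s%s\n' "$dst_prefix" "${name#"$src_prefix"}" ;;
    *) return 1 ;;
  esac
}

map_prefix te ab test   # prints abst
map_prefix te ab text   # prints abxt
```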

Resource list

The following storage resources are assumed:

  1. Object storage A

    • Bucket name: AAA
    • Endpoint: https://aaa.s3.us-west-1.amazonaws.com
  2. Object storage B

    • Bucket name: BBB
    • Endpoint: https://bbb.oss-cn-hangzhou.aliyuncs.com
  3. JuiceFS file system

    • Metadata storage: redis://10.10.0.8:6379/1
    • Object storage: https://ccc-125000.cos.ap-beijing.myqcloud.com

The access keys for all of these storage services are:

  • ACCESS_KEY: ABCDEFG
  • SECRET_KEY: HIJKLMN

Synchronization between object storage and JuiceFS

Synchronize the movies directory of object storage A to the JuiceFS file system:

# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Perform synchronization
juicefs sync s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/movies/ /mnt/jfs/movies/

Synchronize the images directory of the JuiceFS file system to object storage A:

# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Perform synchronization
juicefs sync /mnt/jfs/images/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/images/

Synchronization between object storage and object storage

Synchronize all data in object storage A to object storage B:

juicefs sync s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com

Advanced Usage

Incremental synchronization and full synchronization

The sync command works in incremental mode by default: it first compares the source path with the target path and then synchronizes only the parts that differ. You can use the --update or -u option to update the mtime of files.

For full synchronization, that is, resynchronizing everything regardless of whether the same file already exists at the target path, use --force-update or -f. For example, to fully synchronize the movies directory of object storage A to the JuiceFS file system:

# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Perform full synchronization
juicefs sync --force-update s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/movies/ /mnt/jfs/movies/
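If both modes are used regularly, the choice can be wrapped in a small script. The following is only a sketch: sync_with_mode is a hypothetical wrapper, the paths are placeholders, and by default it prints the command it would run instead of running it (set RUN=1 to execute).

```shell
# sync_with_mode MODE SRC DST
# Hypothetical wrapper: MODE is "incremental" (the default behavior of
# juicefs sync) or "full" (adds --force-update).
sync_with_mode() {
  mode=$1 src=$2 dst=$3
  case $mode in
    incremental) set -- juicefs sync "$src" "$dst" ;;
    full)        set -- juicefs sync --force-update "$src" "$dst" ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  # Print the command unless RUN=1 is set in the environment.
  if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

sync_with_mode full /data/movies/ /mnt/jfs/movies/
```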

Pattern matching

The pattern matching of the sync command is similar to rsync. You can exclude or include certain kinds of files with rules, and select any set of files by combining multiple rules. The rules are as follows:

  • A pattern ending with / matches only directories; otherwise it matches files, links, or devices;
  • A pattern containing *, ?, or [ is matched as a wildcard pattern; otherwise it is matched as a plain string;
  • * matches any non-empty path component and stops matching at /;
  • ? matches any single character except /;
  • [ matches a set of characters, such as [a-z] or [[:alpha:]];
  • In a wildcard pattern, a backslash can be used to escape a wildcard character; in a pattern without wildcards, the backslash is matched literally;
  • Patterns are always matched recursively, with the pattern as a prefix.
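The wildcard syntax is the same one the shell uses in case patterns, so single file names can be tried out without running a synchronization. This is only an approximation: matches is a hypothetical helper, and unlike the rules above the shell's * also matches /, so the sketch is limited to names without a slash.

```shell
# matches NAME PATTERN
# Hypothetical helper: succeeds if NAME matches the glob PATTERN,
# using the shell's own case-pattern matching.
matches() {
  case $1 in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

matches 4.png '*.png'       && echo "4.png matches *.png"
matches a.txt '?.txt'       && echo "a.txt matches ?.txt"
matches b7 '[a-z][0-9]'     && echo "b7 matches [a-z][0-9]"
matches .bashrc '.*'        && echo ".bashrc matches .*"
```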

Exclude files / directories

Use the --exclude option to set the directories or files to exclude. For example, synchronize the entire JuiceFS file system to object storage A, but skip hidden files and directories:

On Linux, every name beginning with . is treated as a hidden file.

# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Synchronize everything except hidden files and directories
juicefs sync --exclude '.*' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/

You can repeat this option to add more rules, for example to exclude all hidden files, the pic/ directory, and the 4.png file:

juicefs sync --exclude '.*' --exclude 'pic/' --exclude '4.png' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com

Include files / directories

Use the --include option to set the directories or files to include (that is, not to exclude). For example, synchronize only the pic/ directory and the 4.png file and exclude everything else:

juicefs sync --include 'pic/' --include '4.png' --exclude '*' /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com

When include and exclude rules are combined, the rule that appears earlier takes priority. The --include options should therefore come first: if --exclude '*' comes first and excludes all files, the --include 'pic/' --include '4.png' rules after it never take effect.
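The effect of rule order can be shown with a small first-match-wins evaluator. This is a simplified model of the behavior described above, not the actual juicefs implementation: decide is a hypothetical helper, rules are written as +PATTERN (include) or -PATTERN (exclude), and only plain file names are handled.

```shell
# decide NAME RULE...
# Hypothetical helper: walk the rules in order and print the verdict of the
# first rule whose pattern matches NAME. Each RULE is "+PATTERN" or "-PATTERN".
decide() {
  name=$1; shift
  for rule in "$@"; do
    pattern=${rule#?}            # strip the leading + or -
    case $name in
      $pattern)
        case $rule in
          +*) echo include ;;
          -*) echo exclude ;;
        esac
        return 0 ;;
    esac
  done
  echo include                   # no rule matched: the file is synchronized
}

decide 4.png '+4.png' '-*'   # include: the include rule comes first
decide 5.png '+4.png' '-*'   # exclude: only the catch-all rule matches
decide 4.png '-*' '+4.png'   # exclude: the catch-all rule shadows the include
```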

Multithreading and bandwidth limitation

By default, juicefs sync uses 10 threads to perform synchronization. You can set the --threads option to increase or decrease the number of threads.

In addition, to limit the bandwidth used by a synchronization task, set the --bwlimit option. The unit is Mbps, and the default is 0, which means no limit.
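To get a feel for what a given limit means in practice, the following sketch estimates a lower bound on transfer time. min_seconds is a hypothetical helper, and it assumes that Mbps means 10^6 bits per second and ignores protocol overhead and request latency.

```shell
# min_seconds BYTES MBPS
# Hypothetical helper: lower bound, in whole seconds (rounded up), on the time
# needed to move BYTES at a limit of MBPS, assuming 1 Mbps = 1,000,000 bits/s.
min_seconds() {
  bytes=$1 mbps=$2
  if [ "$mbps" -le 0 ]; then
    echo "a limit of 0 means unlimited" >&2
    return 1
  fi
  echo $(( (bytes * 8 + mbps * 1000000 - 1) / (mbps * 1000000) ))
}

# 1 GB (10^9 bytes) under a limit such as: juicefs sync --bwlimit 100 SRC DST
min_seconds 1000000000 100   # prints 80
```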

Directory structure and file permissions

By default, the sync command synchronizes only file objects and the directories that contain them; empty directories are not synchronized. To synchronize empty directories as well, use the --dirs option.

In addition, when synchronizing between file systems such as local, SFTP, and HDFS, use the --perms option if you need to preserve file permissions.
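Before deciding whether --dirs is needed, it can help to check whether a local source tree contains empty directories at all. The sketch below is one way to do that; list_empty_dirs is a hypothetical helper, and it relies on the -mindepth and -empty extensions found in GNU and BSD find.

```shell
# list_empty_dirs DIR
# Hypothetical helper: print the directories under DIR that have no entries.
list_empty_dirs() {
  find "$1" -mindepth 1 -type d -empty
}

# Demo on a throwaway tree: data/ holds a file, logs/ is empty.
src=$(mktemp -d)
mkdir -p "$src/data" "$src/logs"
echo hello > "$src/data/a.txt"
list_empty_dirs "$src"    # prints only the path of the logs directory
rm -rf "$src"
```

If the function prints anything and those directories matter at the destination, add --dirs to the juicefs sync command.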

Copying symbolic links

When synchronizing between local directories, juicefs sync supports the --links option, which copies a symbolic link itself rather than the object it points to. The path stored in the copied link is the original path stored in the source link; it is not converted, regardless of whether that path is reachable before or after synchronization.

A few other details to note:

  1. The mtime of a symbolic link itself is not copied;
  2. The --check-new and --perms options are ignored for symbolic links.
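The "original path stored in the link" is what readlink prints. The sketch below lists the links in a directory together with their stored paths, which shows what would be carried over unchanged; show_links is a hypothetical helper, and the demo runs on a temporary directory without calling juicefs.

```shell
# show_links DIR
# Hypothetical helper: print "name -> stored path" for each symbolic link
# directly inside DIR. The stored path is printed as-is, reachable or not.
show_links() {
  for entry in "$1"/*; do
    if [ -L "$entry" ]; then
      printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
    fi
  done
}

work=$(mktemp -d)
echo data > "$work/real.txt"
ln -s real.txt "$work/relative.lnk"        # relative target
ln -s /no/such/path "$work/dangling.lnk"   # target does not exist
show_links "$work"
rm -rf "$work"
```

A relative path such as real.txt only resolves at the destination if the file it names is synchronized alongside the link.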

Multi-machine concurrent synchronization

In essence, synchronizing data between two object stores means pulling data from one end and pushing it to the other. As shown in the figure below, the efficiency of synchronization depends on the bandwidth between the client and the cloud.

When the bandwidth of a single machine is saturated, juicefs sync supports concurrent synchronization across multiple machines, as shown in the figure below.

The manager, acting as the master, runs the sync command and uses the --worker option to specify multiple worker hosts. JuiceFS dynamically splits the synchronization workload according to the total number of workers and distributes it to the hosts, which run it at the same time. In other words, the work that would otherwise be handled by one host is divided into several parts and processed by several hosts in parallel, so more data is handled per unit of time and the total bandwidth is multiplied.

Before configuring a multi-machine synchronization task, set up passwordless SSH login from the manager host to each worker host so that the client and the tasks can be distributed to the workers.

The manager distributes the JuiceFS client program to the worker hosts. To avoid client compatibility problems, make sure the manager and the workers run the same operating system type and CPU architecture.

For example, to synchronize object storage A to object storage B using multiple hosts in parallel:

juicefs sync --worker [email protected],[email protected] s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com

The current host and the two worker hosts, [email protected] and [email protected], share the task of synchronizing data between the two object stores.

If the SSH service on a worker host does not listen on the default port 22, set that host's SSH port in the .ssh/config configuration file on the manager host.
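A minimal sketch of such an entry, written from the shell: add_ssh_port is a hypothetical helper that appends a Host block with a Port setting to an SSH client configuration file. The host name and port in the commented example are made up for illustration.

```shell
# add_ssh_port CONFIG HOST PORT
# Hypothetical helper: append a Host block to the SSH client config file
# CONFIG so that connections to HOST use PORT.
add_ssh_port() {
  config=$1 host=$2 port=$3
  mkdir -p "$(dirname "$config")"
  cat >> "$config" <<EOF
Host $host
    Port $port
EOF
  chmod 600 "$config"
}

# Example with a made-up host and port; uncomment to apply on the manager host:
# add_ssh_port "$HOME/.ssh/config" worker1.example.internal 2222
```

The Host value has to match the host name used in the --worker option for the setting to take effect.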

Application scenarios

Off-site disaster recovery backup of data

Off-site disaster recovery backup works at the level of the files themselves, so the files stored in JuiceFS should be synchronized to another object storage service. For example, to synchronize the files in the JuiceFS file system to object storage A:

# Mount JuiceFS
sudo juicefs mount -d redis://10.10.0.8:6379/1 /mnt/jfs
# Perform synchronization
sudo juicefs sync /mnt/jfs/ s3://ABCDEFG:HIJKLMN@aaa.s3.us-west-1.amazonaws.com/

After synchronization, all files can be seen directly in object storage A.

Creating a copy of JuiceFS data

Unlike disaster recovery backup, which works at the level of files, a JuiceFS data copy is a mirror of the JuiceFS data storage with exactly the same content and structure. If the object storage in use fails, you can switch to the data copy by changing the configuration and continue working. Note that only the data of the JuiceFS file system is copied here, not the metadata, so the metadata engine still needs its own backup.

This requires operating directly on the object storage underlying JuiceFS and synchronizing it with the target object storage. For example, to use object storage B as a data copy of the JuiceFS file system:

juicefs sync cos://ABCDEFG:HIJKLMN@ccc-125000.cos.ap-beijing.myqcloud.com oss://ABCDEFG:HIJKLMN@bbb.oss-cn-hangzhou.aliyuncs.com

After synchronization, the content and structure of object storage B are exactly the same as those of the object storage used by JuiceFS.

If you found this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)