Huawei Cloud FusionInsight MRS HDFS Component Data Storage Policy Configuration Guide

Time: 2022-07-30

Operation scenario

By default, the HDFS NameNode automatically selects DataNodes to store data replicas. In actual business, the following scenarios may exist:

  • A DataNode may contain different storage devices, and data needs to be placed on a suitable device for tiered storage.
  • Data in different directories has different importance, and an appropriate DataNode needs to be selected according to a directory label.
  • The DataNode cluster uses heterogeneous servers, and key data needs to be stored in a rack group with high reliability.

Impact on the system

Configuring the HDFS data storage policy requires restarting the HDFS service, and the service cannot be accessed during the restart.

Prerequisites

  • The administrator has planned the data storage policy according to business needs.
  • The HDFS client has been installed. For details, see the "Installing a Client" section in the "Administrator Guide".
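Before starting, it can help to confirm that the installed client actually works. The path and user name below are examples rather than values from this guide; a typical FusionInsight client installation provides a bigdata_env script:

```bash
# Example path: substitute the actual client installation directory.
source /opt/hadoopclient/bigdata_env
# In a security-mode (Kerberos) cluster, authenticate first; "hdfsuser" is a placeholder.
kinit hdfsuser
# Verify that the client can reach HDFS.
hdfs dfs -ls /
```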

Configure DataNode to use tiered storage

The HDFS heterogeneous tiered storage framework provides four storage types: RAM_DISK, DISK, ARCHIVE, and SSD, corresponding to the different storage media that may exist on a DataNode.

  • RAM_DISK is a virtual disk backed by memory. It has the highest read/write performance, but its capacity is limited by the memory size and is usually very small, and data may be lost after a power failure.
  • SSD (solid-state disk) has high read/write performance, but its capacity is usually small and its unit storage cost is higher than that of an ordinary mechanical hard disk.
  • DISK refers to an ordinary mechanical hard disk, which is the main storage type used by HDFS to save data.
  • ARCHIVE represents high-density, low-cost storage media with relatively poor read/write performance. It is usually deployed on nodes with low computing power for large-capacity storage of non-hotspot data.

By combining the four storage types appropriately, storage policies suitable for different scenarios can be formed. The storage policies currently supported by HDFS are shown in the following table:

| Policy ID | Name | Block placement (n replicas) | Fallback storage for creation | Fallback storage for replication |
| --- | --- | --- | --- | --- |
| 15 | LAZY_PERSIST | RAM_DISK: 1, DISK: n-1 | DISK | DISK |
| 12 | ALL_SSD | SSD: n | DISK | DISK |
| 10 | ONE_SSD | SSD: 1, DISK: n-1 | SSD, DISK | SSD, DISK |
| 7 | HOT (default) | DISK: n | <none> | ARCHIVE |
| 5 | WARM | DISK: 1, ARCHIVE: n-1 | ARCHIVE, DISK | ARCHIVE, DISK |
| 2 | COLD | ARCHIVE: n | <none> | <none> |

Take policy 15 (LAZY_PERSIST) as an example. If the number of block replicas is 3, the first replica of a file configured with this policy is written to RAM_DISK, and the remaining replicas are written to DISK. If writing the first replica to the RAM_DISK storage medium fails, the system falls back to the storage type specified in "Fallback storage for creation"; if writing any replica other than the first fails, the system falls back to the storage type specified in "Fallback storage for replication".
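The policies in the table can also be listed directly from a client. This is a standard HDFS command, so its output should match the table above:

```bash
# List every storage policy known to the cluster, with its ID,
# placement rule, and fallback storage types.
hdfs storagepolicies -listPolicies
```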

  1. On FusionInsight Manager, choose "Cluster > Name of the target cluster > Services > HDFS > Configurations > All Configurations".
  2. Check whether the value of "dfs.storage.policy.enabled" is the default "true". If it is not, change it to "true".
  3. Modify the value of "dfs.datanode.data.dir". By default, the system assumes that every data storage device is DISK, so the value needs to be modified according to the actual storage device types. The value format is "[storage type]storage directory", with multiple directories separated by commas. For example:
"[RAM_DISK]/home/hadoop/dfs/ram,[SSD]/home/hadoop/dfs/ssd,/home/hadoop/dfs/hd,[ARCHIVE]/home/hadoop/dfs/archive"
  4. Modify the value of "dfs.datanode.max.locked.memory". It must be greater than the value of "dfs.blocksize" and smaller than the available space of the mounted RAM_DISK.
  5. Click "Save", click "OK" in "Save Configuration", and after saving choose "More > Restart Service" to restart the HDFS service. When the interface displays "Operation succeeded.", click "Finish"; HDFS has started successfully.
  6. On the HDFS client, run "hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy name>" to make the directory <path> store data in tiers according to the policy <policy name>. For example, to store the test directory under the root path according to the LAZY_PERSIST policy, run the following command:
hdfs storagepolicies -setStoragePolicy -path /test -policy LAZY_PERSIST
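Setting a policy only affects blocks written afterwards; blocks that already exist are not moved automatically. A minimal follow-up sketch using standard HDFS tools, reusing the /test path from the example above:

```bash
# Confirm which policy is now attached to the directory.
hdfs storagepolicies -getStoragePolicy -path /test
# Migrate blocks that were written before the policy change
# so that they also satisfy the new policy.
hdfs mover -p /test
```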

Configure DataNode to use rack group storage

In actual business, key data needs to be stored on highly reliable nodes, so the DataNodes form a heterogeneous cluster. By modifying the DataNode storage policy, the system can be forced to save this data in a specified rack group.

A rack group is a collection of multiple racks. After this storage policy is configured, replicas of key data are forcibly and preferentially saved on the DataNodes of this rack group.
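After racks have been assigned to hosts (step 1 in "Operation steps" below), the rack-to-DataNode mapping that the NameNode actually sees can be checked from a client with a standard HDFS command:

```bash
# Print the rack topology currently registered with the NameNode.
hdfs dfsadmin -printTopology
```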

Usage constraints

  1. File write:
  • The first replica is selected from the mandatory rack group. If no node in the mandatory rack group is available, the write fails.
  • The second replica is selected from the client's local machine or from a random node in its rack group (when the client's rack group is not the mandatory rack group).
  • The third replica is selected from other rack groups.
  • Each replica should be stored in a different rack group. If the required number of replicas is greater than the number of available rack groups, the extra replicas are stored in random rack groups.
  2. If a replica that belongs in the mandatory rack group is lost or cannot be stored there, it is not re-replicated elsewhere when the replica count increases or a data block is damaged. Instead, the system keeps retrying until a healthy node in the mandatory rack group becomes available again.
  3. If the rack group policy is configured, the Balancer moves data blocks only within the same rack group.
  4. If the rack group policy is configured, the Mover moves data blocks only within the same rack group.
  5. When a file is written, nodes are selected strictly according to the storage policy. Therefore, when other replicas use the same storage type as the mandatory rack group replica, for example after the policy is changed once the file has been written, or while replicas are being deleted, only the mandatory rack group replica is guaranteed not to be deleted.
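To check where the replicas of a particular file actually landed under this policy, the standard fsck tool prints each block's DataNode locations together with their rack paths (the /test path is just an example):

```bash
# Show files, blocks, and the rack-qualified location of every replica.
hdfs fsck /test -files -blocks -locations
```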

Operation steps

  1. On FusionInsight Manager, click "Hosts", select the target hosts, choose "More > Set Rack", enter the new rack name in "Set Rack", and click "OK" to save the rack configuration.
  2. On FusionInsight Manager, choose "Cluster > Name of the target cluster > Services > HDFS > Configurations > All Configurations".
  3. Modify the value of "dfs.block.replicator.classname". The default value is "org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy", which means the NameNode uses the default algorithm to place data replicas in HDFS.
    Selecting "org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithRackGroup" means the DataNodes are forced to use the specified rack group when saving data.
  4. Set "dfs.use.dfs.network.topology" to "false", which means DFSNetworkTopology is no longer used in the rack group block placement policy.
  5. Set "net.topology.impl" to "org.apache.hadoop.net.NetworkTopologyWithRackGroup", which means that under the rack group block placement policy, the cluster is organized according to a tree-structured network topology.
  6. Modify "dfs.blockplacement.mandatory.rackgroup.name" to specify the mandatory rack group. There can be only one mandatory rack group. If this item is left blank or not configured, the mandatory rack group feature is not enabled.
  7. Click "Save", click "OK" in "Save Configuration", and after saving choose "More > Restart Service" on the overview page to restart the HDFS service.

When the interface displays "Operation succeeded.", click "Finish"; HDFS has started successfully.
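On FusionInsight the parameters above are edited through Manager, but for reference, an equivalent hdfs-site.xml fragment would look roughly as follows. The rack group name "RG1" is a placeholder; the parameter names and class values are those given in the steps above:

```xml
<!-- Use the rack group block placement policy instead of the default. -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithRackGroup</value>
</property>
<!-- Stop using DFSNetworkTopology for this placement policy. -->
<property>
  <name>dfs.use.dfs.network.topology</name>
  <value>false</value>
</property>
<!-- Organize the cluster as a tree topology that understands rack groups. -->
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithRackGroup</value>
</property>
<!-- Placeholder value: the single mandatory rack group for key data. -->
<property>
  <name>dfs.blockplacement.mandatory.rackgroup.name</name>
  <value>RG1</value>
</property>
```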

Suggestions on using data storage policies

  1. For the two data storage policies described in this chapter, plan the data before use and select the appropriate storage policy for each usage scenario.
  2. Tiered storage selects storage media, such as SSDs and SAS disks; the other two storage policies select DataNodes. The two belong to different conceptual levels.
  3. Label storage and the mandatory rack group are mutually exclusive, so only one of them can be chosen when selecting a storage policy. Either one can be combined with tiered storage.
  4. All data storage policies can support the following control modes at the same time:
  • Using "dfs.block.replicator.classname" to control which replica placement policy is used; the available policies are mutually exclusive (default placement policy, NodeLabel placement policy, available-space placement policy, rack group placement policy).
  • Using tiered storage.

If all three controls (node label, replica placement policy, and tiered storage) are activated, HDFS first selects the node range according to NodeLabel, then filters the nodes according to the replica placement policy, and finally uses tiered storage to select the corresponding nodes and disks within the selected range.

This article was published by Huawei Cloud.
