1. Namenode data directory
Specifies the local file system path where the NN stores its fsimage and edit log files. Multiple paths can be specified, separated by commas. At present, our production environment configures only one directory, stored on a disk with RAID 1 or RAID 5.
2. Datanode data directory
Specify the local disk path where DN stores block data. You can specify multiple paths separated by commas. In a production environment, multiple disks may be mounted on one DN.
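As a minimal sketch of how the comma-separated directory lists end up in hdfs-site.xml, the Python snippet below renders two property elements. The property names dfs.namenode.name.dir and dfs.datanode.data.dir follow Hadoop 2+ naming and are an assumption here, since the text does not name them. The directory paths are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def make_property(name, value):
    """Build one <property> element in the hdfs-site.xml layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


# Hypothetical paths: one RAID-backed NN directory, one directory per DN disk.
nn_dirs = ["/data/nn"]
dn_dirs = ["/data1/dn", "/data2/dn", "/data3/dn"]

# Assumed property names (Hadoop 2+ naming); multiple paths are comma-joined.
nn_prop = make_property("dfs.namenode.name.dir", ",".join(nn_dirs))
dn_prop = make_property("dfs.datanode.data.dir", ",".join(dn_dirs))

for prop in (nn_prop, dn_prop):
    print(ET.tostring(prop, encoding="unicode"))
```

Each mounted disk on a DN contributes one entry to the DN list, while the NN list here keeps the single directory described above.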
3. Number of data block replicas
The number of replicas kept for each data block. The default value is 3.
4. Data block size
The HDFS data block size is 128 MB by default. Our production environment currently uses 1 GB.
5. Maximum bandwidth used for HDFS balancing
The maximum bandwidth used for HDFS balancing is 1048576 bytes per second by default, that is, 1 MB/s, which is too small for most clusters with Gigabit or even 10-Gigabit networks. However, this value can be set when the balancer script is started, so the cluster-level default does not have to be modified. Our production environment currently uses 50 MB/s to 100 MB/s.
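To make the numbers concrete, here is a small Python sketch that converts a bandwidth limit in MB/s to the bytes-per-second value the setting expects, and estimates how long moving a given amount of data would take at that limit. It treats 1 MB as 1048576 bytes, matching the default quoted above. The 1 TiB data volume is a hypothetical figure chosen only for illustration.

```python
MIB = 1024 * 1024  # the default 1048576 corresponds to 1 MB/s


def mb_per_s_to_bytes(mb_per_s):
    """Convert a bandwidth limit in MB/s to bytes per second."""
    return int(mb_per_s * MIB)


def hours_to_move(data_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time in hours at a fixed bandwidth limit."""
    return data_bytes / bandwidth_bytes_per_s / 3600


one_tib = 1024 ** 4  # hypothetical amount of data to rebalance

for mb in (1, 50, 100):
    bw = mb_per_s_to_bytes(mb)
    hours = hours_to_move(one_tib, bw)
    print(f"{mb:>3} MB/s = {bw} bytes/s, 1 TiB takes about {hours:.1f} h")
```

The estimate ignores protocol overhead and concurrent transfers, so it only shows the order of magnitude.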
6. Number of failed disks tolerated
The number of disks that may fail on a DN before it stops serving. The default value is 0, that is, as soon as any disk fails, the DN shuts down. For clusters with many disks per node (for example, 12 disks per DN), disk failure is normal. This value can generally be set to 1 or 2 to avoid DNs going offline frequently.
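The rule described above can be sketched in a few lines of Python. This is a simplified model of the behaviour, not the DataNode's actual code, and the function name is hypothetical.

```python
def datanode_keeps_running(failed_disks, tolerated):
    """Simplified model: the DN stays up while failures do not exceed the tolerated count."""
    return failed_disks <= tolerated


# With the default of 0, a single failed disk already stops the DN.
for tolerated in (0, 1, 2):
    for failed in range(4):
        state = "running" if datanode_keeps_running(failed, tolerated) else "shut down"
        print(f"tolerated={tolerated} failed={failed}: {state}")
```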
7. Number of data transmission connections
The number of data transfer connections that the DataNode can handle at the same time, that is, the maximum number of threads used to transfer data into and out of the DataNode. The official name of this parameter has been changed to dfs.datanode.max.transfer.threads. The default value is 4096, the recommended value is 8192, and our production environment also uses 8192.
8. Number of NameNode threads processing RPC calls
The number of threads used to process RPC calls in the NameNode. The default is 10. For larger clusters and better-equipped servers, this value can be increased appropriately to improve the concurrency of the NameNode RPC service. The recommended value is the natural logarithm of the cluster size multiplied by 20, where N is the number of nodes (800 is used here as an example):
python -c 'import math; N = 800; print(int(math.log(N) * 20))'
Our production environment, with more than 800 nodes, is configured between 200 and 500.
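The same rule of thumb can be written as a small Python function, which makes it easier to compare cluster sizes. The function name is hypothetical, and the result is only a starting point: for 800 nodes the formula gives a value below the 200 to 500 range used in the production environment described here.

```python
import math


def recommended_handler_count(num_nodes):
    """Rule of thumb from the text: ln(cluster size) * 20, truncated to an integer."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return int(math.log(num_nodes) * 20)


for nodes in (100, 800, 1000):
    print(nodes, recommended_handler_count(nodes))
```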
9. Number of NameNode threads processing DataNode block reports and heartbeats
The number of threads used to handle block reports and heartbeats from DataNodes. It is calculated in the same way as dfs.namenode.handler.count.
10. Number of DataNode threads processing RPC calls
The number of threads used to process RPC calls in the DataNode. The default is 3. This value can be increased moderately to improve the concurrency of the DataNode RPC service. More threads increase the DataNode's memory requirements, so the value should not be raised excessively. Our production environment uses 10.
11. Maximum number of DataNode transfer threads
Specifies the maximum number of threads used to transfer data into and out of the DataNode. This value determines the maximum number of files that the DataNode can serve at the same time, and it is recommended to increase it. The default value is 256 and the maximum configurable value is 65535. Our production environment uses 8192.
12. Buffer size when reading and writing data
Sets the buffer size used when reading and writing data, which should be a multiple of the hardware page size.
Our production environment uses 65536 (64 KB).
13. Redundant data block deletion
During routine maintenance of our Hadoop cluster, we came across the following situation:
If the NameNode judges a node dead because of a network failure or the death of the DataNode process, HDFS immediately starts re-replicating the affected data blocks for fault tolerance. When the node rejoins the cluster, the data on it is undamaged, so some blocks on HDFS end up with more replicas than the configured replication factor. We observed that these redundant blocks are not fully deleted for a long time. What does this time depend on?
The length of this time is related to the interval between block reports. The DataNode regularly reports all block information on the current node to the NameNode, and the parameter dfs.blockreport.intervalMsec controls the reporting interval.
The hdfs-site.xml file contains this parameter:
<property>
<name>dfs.blockreport.intervalMsec</name>
<value>3600000</value>
<description>Determines block reporting interval in milliseconds.</description>
</property>
Here 3600000 is the default setting: 3600000 milliseconds, that is, 1 hour. Because block reports are sent only once an hour, it takes a long time for the redundant blocks to be deleted. In an actual test, reducing the parameter (to 60 seconds) did cause the redundant blocks to be deleted quickly.
14. Delayed reporting of new blocks
When a new block is written on a DataNode, it is reported to the NameNode immediately by default. On a large Hadoop cluster, data is being written all the time, so blocks are constantly written on DataNodes and reported to the NameNode. The NameNode therefore has to process block report requests from DataNodes very frequently and holds its lock frequently, which greatly affects the processing and response time of other RPCs.
Configuring delayed block reports reduces the number of block reports sent after DataNodes write blocks, and improves the response time and processing speed of the RPCs handled by the NameNode.
On our production HDFS cluster, this parameter is set to 500 ms. When a new block is written on a DataNode, it is not reported to the NameNode immediately. Instead, the DataNode waits 500 ms, and all blocks written during that period are reported to the NameNode in a single report.
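The effect of the delay can be illustrated with a simplified Python model. This is only a sketch of the batching idea, not the DataNode's actual reporting logic: the function name, the windowing rule, and the write timestamps are all assumptions made for illustration.

```python
def count_reports(write_times_ms, delay_ms):
    """Count reports sent to the NameNode in a simplified batching model.

    With delay_ms == 0 every new block triggers its own report. Otherwise the
    first block opens a window of delay_ms, and every block written inside
    that window is sent in the same report.
    """
    reports = 0
    window_end = None
    for t in sorted(write_times_ms):
        if delay_ms <= 0:
            reports += 1
        elif window_end is None or t >= window_end:
            reports += 1
            window_end = t + delay_ms
    return reports


# Hypothetical block completion times on one DataNode, in milliseconds.
writes = [0, 100, 200, 450, 600, 1200]

print("immediate reporting:", count_reports(writes, 0))
print("500 ms delay:", count_reports(writes, 500))
```

In this model the six writes collapse into three reports, which is the kind of reduction in NameNode RPC load the delay is meant to achieve.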
15. Increase the upper limit of simultaneously opened file descriptors and network connections
Use the ulimit command to raise the maximum number of file descriptors that can be open at the same time to an appropriate value. At the same time, set the kernel parameter net.core.somaxconn, which limits the number of pending network connections, to a sufficiently large value.
Supplement: the role of net.core.somaxconn
net.core.somaxconn is a Linux kernel parameter that sets the upper limit of the backlog for a listening socket. What is a backlog? It is the socket's listening queue: a request that has not yet been processed or established waits in the backlog. The socket server can process all requests in the backlog at one time, and processed requests leave the listening queue. When the server processes requests so slowly that the listening queue fills up, new requests are rejected.
In Hadoop 1.0, the parameter ipc.server.listen.queue.size controls the length of the server socket's listening queue, that is, the backlog length. Its default value is 128, and the default value of the Linux parameter net.core.somaxconn is also 128. For a busy server such as the NameNode or JobTracker, 128 is not enough, so the backlog needs to be increased. For example, our cluster sets ipc.server.listen.queue.size to 32768. For this setting to take full effect, the kernel parameter net.core.somaxconn must also be set to a value greater than or equal to 32768.
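The interaction between the two settings can be checked with a short Python sketch, assuming a Linux host. It prints the current open-file limits of the process, reads net.core.somaxconn from /proc, and shows the backlog that would actually apply, since the kernel caps the backlog requested by a listening socket at net.core.somaxconn. The helper names are hypothetical, and 32768 is the queue size quoted above.

```python
import resource


def effective_backlog(requested, somaxconn):
    """The kernel caps a listening socket's backlog at net.core.somaxconn."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the current net.core.somaxconn, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


# Open-file limits of this process (what `ulimit -n` reports for the soft limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

requested = 32768  # listen queue size from the text
somaxconn = read_somaxconn()
if somaxconn is None:
    print("net.core.somaxconn could not be read on this system")
else:
    print(f"requested={requested}, somaxconn={somaxconn}, "
          f"effective={effective_backlog(requested, somaxconn)}")
```

If the effective value is lower than the requested one, the kernel parameter is the limiting factor and should be raised as well.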