Apache Doris FE metadata failure: operations and maintenance

Time: 2021-10-16

Doris is an excellent MPP data warehouse.

A couple of days ago I ran into a problem. I had three FEs; one of them hung, and then it could not be restarted.

After backing up the FE metadata, I cleared the metadata on this node and ran the following:

-- Delete first

ALTER SYSTEM DROP FOLLOWER "FE:9010";

-- Add it back

ALTER SYSTEM ADD FOLLOWER "FE:9010";
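As a rough sketch, the two statements above can be generated from a shell and piped to the master FE. The function name `follower_readd_sql` is hypothetical, and the usage line assumes that the MySQL client is installed and that the master FE accepts connections on the default query port 9030 as `root`; adjust these for your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the statements that drop and re-add one follower FE.
# $1 = FE host, $2 = edit log port (defaults to 9010, as in the statements above).
follower_readd_sql() {
  local host="$1" port="${2:-9010}"
  printf 'ALTER SYSTEM DROP FOLLOWER "%s:%s";\n' "$host" "$port"
  printf 'ALTER SYSTEM ADD FOLLOWER "%s:%s";\n' "$host" "$port"
}
```

Usage, with `<master_fe_host>` as a placeholder: `follower_readd_sql 172.22.197.238 | mysql -h <master_fe_host> -P 9030 -uroot`. Printing the statements first makes it easy to review them before they reach the cluster.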

I dropped the problem node and then used the --helper option to add it back to the cluster as a new FE, but startup reported an error and caused the master FE to hang as well. The exception information was as follows:

[email protected] props={REFRESH_VLSN=17921230, PORT=9010, HOSTNAME=172.22.197.238, P_NODETYPE1=ELECTABLE, NODE_NAME=172.22.197.238_9010_1611290318143, P_NODETYPE0=SECONDARY, P_NODENAME1=172.22.197.240_9010_1608972313975, P_PORT1=9010, P_NODENAME0=172.22.197.238_9010_1611290318143, P_PORT0=9010, P_HOSTNAME1=172.22.197.240, GROUP_NAME=PALO_JOURNAL_GROUP, P_HOSTNAME0=172.22.197.238, ENV_DIR=/hdd_data01/doris-meta/bdb, P_NUMPROVIDERS=2}

at com.sleepycat.je.rep.InsufficientLogException.wrapSelf(InsufficientLogException.java:315) ~[je-7.3.7.jar:7.3.7]

at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766) ~[je-7.3.7.jar:7.3.7]


Later, with the help of the community (sister Miao and Chen Mingyu), we made various attempts to locate the cause. We suspected that the exception came from metadata synchronization during startup, possibly because my load tasks were modifying metadata at the same time. So, in the early morning, after the production runs had completed, we stopped all load tasks, deleted the metadata of the problem FE node, and started it again with --helper. The same error was still reported. As a last resort, we copied the FE metadata from the master node to the problem node: delete the metadata directory of the problem FE, recreate it, and copy the master's metadata into it.

Specific steps:

Stop all load tasks

Delete the metadata directory and rebuild the directory

Copy the metadata from the master node to the problem FE node. It is very important to delete the configuration item metadata_failure_recovery = true from fe.conf on the problem node, or set it to false. Note that the name in image/VERSION must be modified: the copied file contains the master's name, which has to be changed to the name of this node

Execute ALTER SYSTEM DROP FOLLOWER to remove the problem node

Use --helper to start the service on the problem node

Execute ALTER SYSTEM ADD FOLLOWER from a MySQL client to add the FE node

Normal startup
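The file-level part of these steps might look like the sketch below, run on the problem node. It is not the exact set of commands used in the incident. The install directory `/opt/doris/fe`, the placeholder `master-fe-host`, the `DRY_RUN` switch, the function names, and the use of rsync over SSH are all assumptions; the metadata path is taken from `ENV_DIR=/hdd_data01/doris-meta/bdb` in the log above. The sketch moves the old directory aside rather than deleting it, so a backup is kept.

```shell
#!/usr/bin/env bash
# Sketch only: run on the problem FE node. All values below are assumptions.
FE_HOME="${FE_HOME:-/opt/doris/fe}"             # hypothetical install dir
META_DIR="${META_DIR:-/hdd_data01/doris-meta}"  # parent of the bdb dir in the log
MASTER="${MASTER:-master-fe-host}"              # placeholder for the master FE host
DRY_RUN="${DRY_RUN:-1}"                         # 1 = only print the commands

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

rebuild_problem_fe() {
  local backup="${META_DIR}.bak.$(date +%Y%m%d%H%M%S)"
  run mv "$META_DIR" "$backup"                  # keep a backup instead of deleting
  run mkdir -p "$META_DIR"                      # rebuild the empty directory
  run rsync -a "$MASTER:$META_DIR/" "$META_DIR/" # copy the master's metadata
  # Manual steps before starting: fix the node name in the copied metadata,
  # make sure metadata_failure_recovery is not true in fe.conf, and run
  # ALTER SYSTEM DROP FOLLOWER on the master (ADD FOLLOWER follows the start).
  run "$FE_HOME/bin/start_fe.sh" --helper "$MASTER:9010" --daemon
}
```

With the default `DRY_RUN=1`, calling `rebuild_problem_fe` only prints the commands, which allows a review before anything touches the metadata directory; set `DRY_RUN=0` to execute them.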

Notes:

1. Problem node: delete the metadata_failure_recovery = true configuration item from fe.conf, or set it to false.

2. Master node: start it with metadata_failure_recovery = true to recover. Once it starts normally, delete the configuration item or set it to false, stop the master FE, restart it, and then confirm that queries and imports on the master work normally.
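Switching the flag by hand is error-prone, so here is a minimal sketch of a helper that sets it in fe.conf. It assumes the file uses plain `key = value` lines; the function name `set_failure_recovery` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set metadata_failure_recovery in an fe.conf file.
# $1 = path to fe.conf, $2 = true|false
set_failure_recovery() {
  local conf="$1" value="$2"
  if grep -Eq '^[[:space:]]*metadata_failure_recovery[[:space:]]*=' "$conf"; then
    # Rewrite the existing (uncommented) line in place; a .bak copy is kept.
    sed -i.bak -E "s/^[[:space:]]*metadata_failure_recovery[[:space:]]*=.*/metadata_failure_recovery = ${value}/" "$conf"
  else
    # Key not present: append it.
    printf 'metadata_failure_recovery = %s\n' "$value" >> "$conf"
  fi
}
```

For example, on the problem node one might run `set_failure_recovery /opt/doris/fe/conf/fe.conf false`. On the master, set it to `true` for the recovery start, then back to `false` before the follow-up restart.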

After the above steps are complete, start the FE on the problem node in --helper mode. This time it starts normally and the problem is solved

Restart all load tasks

However, one question remains. In theory, after deleting the metadata of the problem FE node, it should be possible to treat it as a new FE node and add it with --helper; after startup, the metadata should then be synchronized automatically from the master FE. In our case the synchronization failed, even though no load task was modifying the metadata at that time.