Using pgBackRest parallel archiving to solve the problem of WAL accumulation

Time: 2021-06-10

Author: Wen Zhiyin

Problem description:

Recently, a PostgreSQL virtual machine triggered a disk space alarm. Checking the size of each directory showed that the pg_wal directory took up 600 GB+ while the data directory took up 300 GB.
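(For reference, a quick way to confirm where the space is going; the /rds/postgresql path is an assumption taken from the pgBackRest configuration later in this article.)

# Check which directories are consuming the disk (paths assumed).
du -sh /rds/postgresql/pg_wal    # ~600 GB+ in this case
du -sh /rds/postgresql/base      # the user data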

Existing architecture:

RDS virtual machine with one primary and one standby
Disk size: 1.2 TB
The data disk is SSD; archives and backups are stored on KS3 object storage

Solutions:

1. Find out why WAL logs are not being released

First, we need to know which parameters affect how much WAL is generated and how large the pg_wal directory can grow:

max_wal_size (integer): The maximum size the WAL is allowed to grow to between automatic WAL checkpoints. This is a soft limit; the WAL size can exceed max_wal_size under special circumstances, such as heavy load, a failing archive_command, or a high wal_keep_segments setting. The default is 1 GB. Increasing this parameter can increase the time needed for crash recovery. This parameter can only be set in postgresql.conf or on the server command line.

wal_keep_segments (integer): Specifies the minimum number of past WAL segment files kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. Each segment is normally 16 MB. If a standby connected to the sending server falls behind by more than wal_keep_segments segments, the replication connection will be dropped, and downstream connections will eventually fail as well (however, if WAL archiving is in use, the standby can recover by fetching the segments from the archive). This sets only the minimum number of segments retained in pg_wal; the system might need to retain more segments for WAL archiving or to recover from a checkpoint. If wal_keep_segments is zero (the default), the system does not keep any extra segments for standby purposes, so the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and the status of WAL archiving. This parameter can only be set in postgresql.conf or on the server command line.

As explained above, the amount of WAL retained is mainly affected by these two parameters. Check their values on this instance:

postgres=# show max_wal_size; 
max_wal_size 
-------------- 
10GB 
postgres=# show wal_keep_segments ; 
wal_keep_segments 
------------------- 
600 

In general, under normal conditions the pg_wal directory should be around 10 GB, so why has it now grown to 600 GB+? The parameter descriptions above already give the answer: in special cases the WAL size can exceed max_wal_size, for example under heavy load, with a failing archive_command, or with a large wal_keep_segments setting. Checking the pg_wal/archive_status directory, we find a large number of xxx.ready files, which means WAL segments are not being archived in time. Combined with the existing architecture, archiving to KS3 object storage is far slower than WAL generation on the local SSD, so WAL cannot be archived in time and keeps accumulating, eventually triggering the disk alarm. The cause has been found; the next step is the solution.
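(A quick way to confirm the archiving backlog; the pg_wal path is an assumption taken from the pg1-path used later in this article.)

# Every .ready file under pg_wal/archive_status marks a WAL segment that has not yet been archived successfully.
ls /rds/postgresql/pg_wal/archive_status/ | grep -c '\.ready$'

# The server-side archiver statistics tell the same story.
psql -c "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal FROM pg_stat_archiver;"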

2. Solution

Speed up WAL archiving

View the current archive_command:

archive_command = 'DIR=/ks3data/wal/v3/d5ddd1d7-f458-4c50-ae23-012abc6e0570/723f790f-b66b-4312-bf08-02f8d0f083e5/`date +%F`; sudo test ! -d $DIR && sudo mkdir $DIR; sudo test ! -f $DIR/%f && sudo cp %p $DIR/%f'

This archive command copies WAL segments serially. Because each segment is copied to S3-compatible storage, performance suffers and the archiving speed falls far behind the rate of WAL generation. That raised the question of whether WAL could be archived in parallel. After researching the options, I decided to test the pgBackRest tool.

pgBackRest

pgBackRest aims to be a reliable, easy-to-use backup and restore solution that can seamlessly scale up to the largest databases and workloads by using algorithms optimized for database-specific requirements.

Product features:
1. Parallel backup and restore
2. Local or remote operation
3. Full, incremental, and differential backups
4. Backup rotation and archive expiration
5. Backup integrity
6. Page checksums
7. Backup resume
8. Stream compression and checksums
9. Incremental restore
10. Parallel, asynchronous WAL push & get
11. Tablespace and link support
12. S3 and Azure compatible object storage support
13. Encryption
14. Compatibility with PostgreSQL >= 8.3

Related links:
https://pgbackrest.org/
https://github.com/pgbackrest…

Refer to the linked documentation for installation details. Here we mainly use feature 10, parallel asynchronous WAL push & get, by configuring the pgBackRest parameter file:

[demo] 
pg1-path=/rds/postgresql/ 
pg1-port=5432 
pg1-socket-path=/tmp 
pg1-host-user=postgres 

[global] 
repo1-path=/ks3data/wal/v3/c2880a97-c981-4962-a25b-568ce07fbe80/7ab5da8d-ef99-4783-85ac-628730ea0124/2021-01-29 
repo1-retention-full=2 
archive-async=y 
log-level-file=detail 
spool-path=/rds/pgbackrest 

[global:archive-push] 
compress-level=3 
process-max= 3 

[global:archive-get] 
process-max=3 
compress-level=3 

Here, archive-async=y enables asynchronous mode; if it is turned off, process-max has no effect and effectively stays at 1. process-max=3 sets the maximum number of archiving processes.
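For completeness, a minimal sketch of how archive_command hands WAL segments to pgBackRest once the stanza has been created (the stanza name demo comes from the configuration above; in this environment the call is actually wrapped in the date-directory script visible in the process listings below):

# Create and verify the stanza defined in the configuration above.
pgbackrest --stanza=demo stanza-create
pgbackrest --stanza=demo check

# postgresql.conf (sketch): let pgBackRest do the upload. With archive-async=y,
# archive-push returns quickly and the background archive-push:async process
# uploads segments with up to process-max parallel workers, tracking state in spool-path.
# archive_command = 'pgbackrest --stanza=demo archive-push %p'

# Reload so the new archive_command takes effect.
psql -c "SELECT pg_reload_conf();"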

Effect with archive-async turned off:

# ps -ef | grep arch 
postgres 736 729 0 Jan26 ? 00:00:00 postgres: archiver archiving 0000000100000004000000EE 
postgres 225852 736 0 21:31 ? 00:00:00 sh -c DIR=/ks3data/binlogs/v3/c2880a97-c981-4962-a25b-568ce07fbe80/7ab5da8d-ef99-4783-85ac-628730ea0124/`date +%F`; sudo test ! -d $DIR && sudo mkdir $DIR; sudo test ! -f $DIR/0000000100000004000000EE && sudo pgbackrest --stanza=demo archive-push pg_wal/0000000100000004000000EE 
root 225867 225852 0 21:31 ? 00:00:00 sudo pgbackrest --stanza=demo archive-push pg_wal/0000000100000004000000EE 
root 225868 225867 9 21:31 ? 00:00:00 pgbackrest --stanza=demo archive-push pg_wal/0000000100000004000000EE 
root 225877 217995 0 21:31 pts/1 00:00:00 grep arch 

Effect with archive-async turned on:

# ps -ef | grep arch 
postgres 736 729 0 Jan26 ? 00:00:00 postgres: archiver archiving 000000010000000500000043 
root 227107 1 0 21:32 ? 00:00:00 pgbackrest --log-level-console=off --log-level-stderr=off --stanza=demo archive-push:async /rds/postgresql/pg_wal 
root 227108 227107 7 21:32 ? 00:00:07 pgbackrest --host-id=1 --log-level-console=off --log-level-file=off --log-level-stderr=error --process=1 --remote-type=repo --stanza=demo archive-push:local 
root 227109 227107 7 21:32 ? 00:00:07 pgbackrest --host-id=1 --log-level-console=off --log-level-file=off --log-level-stderr=error --process=2 --remote-type=repo --stanza=demo archive-push:local 
root 227110 227107 7 21:32 ? 00:00:07 pgbackrest --host-id=1 --log-level-console=off --log-level-file=off --log-level-stderr=error --process=3 --remote-type=repo --stanza=demo archive-push:local 
root 228624 217995 0 21:33 pts/1 00:00:00 grep arch 
postgres 228626 736 0 21:33 ? 00:00:00 sh -c DIR=/ks3data/binlogs/v3/c2880a97-c981-4962-a25b-568ce07fbe80/7ab5da8d-ef99-4783-85ac-628730ea0124/`date +%F`; sudo test ! -d $DIR && sudo mkdir $DIR; sudo test ! -f $DIR/000000010000000500000043 && sudo pgbackrest --stanza=demo archive-push pg_wal/000000010000000500000043

Comparing the two, there are now three pgBackRest worker processes. Check the relevant log:

-------------------PROCESS START------------------- 
2021-01-29 21:32:29.735 P00 INFO: archive-push:async command begin 2.30: [/rds/postgresql/pg_wal] --archive-async --compress-level=3 --log-level-console=off --log-level-file=detail --log-level-stderr=off --pg1-path=/rds/postgresql --process-max=3 --repo1-path=/ks3data/binlogs/v3/c2880a97-c981-4962-a25b-568ce07fbe80/7ab5da8d-ef99-4783-85ac-628730ea0124/2021-01-27 --spool-path=/rds/pgbackrest --stanza=demo 
2021-01-29 21:32:29.736 P00 INFO: push 72 WAL file(s) to archive: 00000001000000050000000A...000000010000000500000051 
2021-01-29 21:32:32.877 P02 DETAIL: pushed WAL file '00000001000000050000000B' to the archive 
2021-01-29 21:32:33.508 P03 DETAIL: pushed WAL file '00000001000000050000000C' to the archive 
2021-01-29 21:32:34.445 P01 DETAIL: pushed WAL file '00000001000000050000000A' to the archive 
2021-01-29 21:32:36.955 P01 DETAIL: pushed WAL file '00000001000000050000000F' to the archive 
2021-01-29 21:32:37.764 P02 DETAIL: pushed WAL file '00000001000000050000000D' to the archive 
2021-01-29 21:32:39.259 P03 DETAIL: pushed WAL file '00000001000000050000000E' to the archive 
2021-01-29 21:32:41.799 P01 DETAIL: pushed WAL file '000000010000000500000010' to the archive 
2021-01-29 21:32:42.598 P02 DETAIL: pushed WAL file '000000010000000500000011' to the archive 
2021-01-29 21:32:43.564 P03 DETAIL: pushed WAL file '000000010000000500000012' to the archive 
2021-01-29 21:32:45.864 P01 DETAIL: pushed WAL file '000000010000000500000013' to the archive 
2021-01-29 21:32:46.533 P02 DETAIL: pushed WAL file '000000010000000500000014' to the archive 
2021-01-29 21:32:47.608 P03 DETAIL: pushed WAL file '000000010000000500000015' to the archive 
2021-01-29 21:32:50.496 P01 DETAIL: pushed WAL file '000000010000000500000016' to the archive
............ 
2021-01-29 21:34:22.864 P00 INFO: archive-push:async command end: completed successfully (113129ms) 

The log shows that P00 is the main process: it enumerates the WAL files waiting to be archived and monitors the status of the worker processes, while P01, P02, and P03 push the WAL segments to the archive asynchronously and in parallel.
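To confirm that archiving is now keeping up, a few checks can be run (a sketch; stanza name and paths as above):

# The backlog of .ready files should shrink steadily.
ls /rds/postgresql/pg_wal/archive_status/ | grep -c '\.ready$'

# pgBackRest's view of the WAL archive for this stanza.
pgbackrest --stanza=demo info

# PostgreSQL should report recent successful archives.
psql -c "SELECT last_archived_wal, last_archived_time FROM pg_stat_archiver;"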

Results & conclusion:
After switching to pgBackRest parallel archiving, WAL segments are archived in time and the WAL accumulation problem is solved.