1 热备库报错信息:
2024-05-20 16:33:43.914 CST,,,1061876,,664b0ae7.1033f4,1,,2024-05-20 16:33:43 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,"","postmaster",,0
2024-05-20 16:33:43.914 CST,,,1061876,,664b0ae7.1033f4,2,,2024-05-20 16:33:43 CST,,0,LOG,00000,"starting PostgreSQL 14.10 (Ubuntu 14.10-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit",,,,,,,,,"","postmaster",,0
2024-05-20 16:33:43.914 CST,,,1061876,,664b0ae7.1033f4,3,,2024-05-20 16:33:43 CST,,0,LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432",,,,,,,,,"","postmaster",,0
2024-05-20 16:33:43.914 CST,,,1061876,,664b0ae7.1033f4,4,,2024-05-20 16:33:43 CST,,0,LOG,00000,"listening on IPv6 address ""::"", port 5432",,,,,,,,,"","postmaster",,0
2024-05-20 16:33:43.916 CST,,,1061876,,664b0ae7.1033f4,5,,2024-05-20 16:33:43 CST,,0,LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432""",,,,,,,,,"","postmaster",,0
2024-05-20 16:33:43.923 CST,,,1061878,,664b0ae7.1033f6,1,,2024-05-20 16:33:43 CST,,0,LOG,00000,"database system was interrupted; last known up at 2024-05-20 16:33:42 CST",,,,,,,,,"","startup",,0
2024-05-20 16:33:44.716 CST,,,1061878,,664b0ae7.1033f6,2,,2024-05-20 16:33:43 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,"","startup",,0
2024-05-20 16:33:45.975 CST,,,1061878,,664b0ae7.1033f6,3,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"redo starts at 0/60000028",,,,,,,,,"","startup",,0
2024-05-20 16:33:45.977 CST,,,1061878,,664b0ae7.1033f6,4,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/60000100",,,,,,,,,"","startup",,0
2024-05-20 16:33:45.978 CST,,,1061876,,664b0ae7.1033f4,6,,2024-05-20 16:33:43 CST,,0,LOG,00000,"database system is ready to accept read-only connections",,,,,,,,,"","postmaster",,0
2024-05-20 16:33:46.605 CST,,,1061905,,664b0aea.103411,1,,2024-05-20 16:33:46 CST,,0,LOG,00000,"started streaming WAL from primary at 0/61000000 on timeline 2",,,,,,,,,"","walreceiver",,0
2024-05-20 16:33:46.676 CST,,,1061878,,664b0ae7.1033f6,5,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"recovery stopping before commit of transaction 26511, time 2024-05-20 16:33:45.693183+08",,,,,,,,,"","startup",,0
2024-05-20 16:33:46.676 CST,,,1061878,,664b0ae7.1033f6,6,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"redo done at 0/61002910 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.70 s",,,,,,,,,"","startup",,0
2024-05-20 16:33:46.676 CST,,,1061905,,664b0aea.103411,2,,2024-05-20 16:33:46 CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,"","walreceiver",,0
2024-05-20 16:33:47.321 CST,,,1061878,,664b0ae7.1033f6,7,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"selected new timeline ID: 3",,,,,,,,,"","startup",,0
2024-05-20 16:33:47.398 CST,,,1061878,,664b0ae7.1033f6,8,,2024-05-20 16:33:43 CST,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,"","startup",,0
2024-05-20 16:33:48.071 CST,,,1061876,,664b0ae7.1033f4,7,,2024-05-20 16:33:43 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,"","postmaster",,0
表现:
热备库的流复制可以搭建成功,但是随即马上又开始报错,状态从t->f
排查:
1 看主备库的pg_wal目录下的文件,发现,两边文件的时间线不一致
2 查看关于热备库的配置中所有的配置
select * from pg_settings where name like '%recovery%'
发现recovery_target_time参数的值竟然停留在20240506
解决:
把主备库的recovery_target_time置空后,重启库,再重新搭建流复制,搭建成功,没有之前的报错了。
recovery_target_time(时间戳)
此参数指定将继续进行恢复的时间戳。
最多可以指定recovery_target_time、recovery_target_name或recovery_target_xid中的一个。
默认情况是恢复到WAL日志的末尾。
精确的停止点也会受到recovery_target_including的影响。
recovery_target_enclusive(布尔值)
指定是在指定的恢复目标之后停止(true),还是在恢复目标之前停止(false)。
同时适用于recovery_target_time和recovery_target_xid,无论为该恢复指定哪一个。
这指示分别具有完全目标提交时间或ID的事务是否将包括在恢复中。
默认值为true。
分析:
1 大概率是之前用pgbackrest工具做过一次时间点恢复,在主库恢复时指定来要恢复到的时间,导致recovery_target_time参数发生变化
pgbackrest --stanza=pg1-data --delta --type=time
--target="2024-05-06 16:20:54.825649+08"
--log-level-console=info --target-action=promote restore
确实如此。
所以在用pgbackrest使用了时间点恢复之后,要手动将这个参数置空
2热备库报错信息
requested timeline 4 is not a child of this server's history",
"Latest checkpoint is at 8/81000060 on timeline 3,
but in the history of the requested timeline, the server forked
off from that timeline at 8/68000138
请求的时间线4不是该服务器历史的子级”,“最新检查点位于时间线3上的8/8100060(000000030000000800000081),但在请求的时间轴的历史中,服务器在8/68000138(000000030000000800000068)从该时间轴分叉
主库的pg_wal下时间线是00000003
备库的pg_wal下时间线是00000003.history和00000004.history
可能原因:
用了pgbackrest的expire命令后,恢复主库,新的时间线(00000004)的wal日志丢失了
解决方法:
改主库的参数
alter system set recovery_target='';
alter system set recovery_target_action='pause';
alter system set recovery_target_timeline='latest';
把热备库的
alter system set recovery_target_timeline = '00000003.history';
后问题解决,最好把热备库的pg_wal下关于00000004.history的数据删除掉
3 把已经断开的备库改成一个读写都可以的库
/usr/lib/postgresql/14/bin/pg_ctl -D /etc/postgresql/14/main promote
4 pg_wal目录误清空之后数据库报错
invalid primary checkpoint record
could not locate a valid checkpoint record
解决方法:
/usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
5 常规关闭数据库方式关闭数据库失败
/usr/lib/postgresql/14/bin/pg_ctl -D /etc/postgresql/14/main stop
则可以采用以下命令快速关闭数据库
/usr/lib/postgresql/14/bin/pg_ctl -D /etc/postgresql/14/main stop -m f
smart:智能关闭模式。但凡有外部连接未结束,就需要等待这些连接结束后才会开始关闭数据库。
fast:快速关闭模式。通过向所有活动会话发送SIGTERM信号来结束这些会话。
immediate:直接关闭模式。向所有子进程发送SIGQUIT信号。通过这种方式关闭数据库,完整性是不可靠的。再次启动数据库时将会重放WAL日志进行恢复。
三种模式均不允许有新的连接接入数据库。
参考文章:
https://blog.csdn.net/Hehuyi_In/article/details/102645026
https://blog.csdn.net/songyundong1993/article/details/132270115
https://www.pianshen.com/article/28651011137/
https://www.modb.pro/db/334412