Case Study: A Misconfigured Listener Parameter Undermines Database High Availability
Syntax:
ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.28.224.10)(PORT=1521))','(ADDRESS=(PROTOCOL=TCP)(HOST=10.28.224.10)(PORT=10000))' SCOPE=MEMORY SID='+ASM1';
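1 Case Summary
We had just taken over a customer whose 12.2.0.1 RAC environment runs five databases. Compute node dm01dbadm01 developed a memory fault and had to be taken down for hardware maintenance. The on-site engineer planned to cleanly shut down every database instance on that node first, then stop CRS on the node, and only then power off the host to replace the hardware.
While the engineer was shutting down the instances on dm01dbadm01, the first four stopped normally with no effect on the applications. Soon after the PPO instance was shut down, however, the business units reported that the applications could no longer connect to the database. PPO is the core database, so the outage affected every application.
The engineer immediately restarted the PPO instance on dm01dbadm01, and the applications recovered at once. The customer then questioned the procedure: in the past they had simply stopped CRS and powered off the host for hardware maintenance, and the applications had never been affected.
Because the root cause could not be identified on the spot and the customer feared another outage, the memory replacement was called off.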
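2 Fault Analysis
The engineer reported the incident to the company right away. After hearing the full story, I suspected the listener and asked the engineer to collect the following information immediately.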
[oracle@dm01dbadm01 ~]$ export ORACLE_SID=PPO1
[oracle@dm01dbadm01 ~]$ sqlplus / as sysdba
SQL> show parameter local_listener
local_listener string (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))
[oracle@dm01dbadm02 ~]$ export ORACLE_SID=PPO2
[oracle@dm01dbadm02 ~]$ sqlplus / as sysdba
SQL> show parameter local_listener
local_listener string (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))
[grid@dm01dbadm02 ~]$ lsnrctl status listener
...(omitted)
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.62.4)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.61.2)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
...(omitted)
Service "qys" has 1 instance(s).
  Instance "qys2", status READY, has 1 handler(s) for this service...
The command completed successfully
[grid@dm01dbadm02 ~]$
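The collected output made the cause clear at once. On both compute nodes the PPO instance had local_listener set to node 1's VIP address, so the local listener on node 2 could not register the PPO instance, while the local listener on node 1 registered its instance normally. With both PPO instances up, connections succeeded; once the PPO instance on node 1 was shut down, connections failed.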
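The symptom described here, two instances advertising the same listener host, can be detected mechanically. The following is a minimal Python sketch, not part of the original investigation: the function names `listener_hosts` and `shared_listener_hosts` are hypothetical, and the input is assumed to be the `local_listener` values collected by hand from each instance. It only flags a host that more than one instance points at; it does not query the database itself.

```python
import re
from collections import defaultdict

# Matches the HOST=... element of an Oracle Net address string.
HOST_RE = re.compile(r"\(HOST=([^)]+)\)", re.IGNORECASE)


def listener_hosts(local_listener):
    """Return every HOST value found in a local_listener string."""
    return HOST_RE.findall(local_listener)


def shared_listener_hosts(params):
    """Map each HOST used by more than one instance to those instances.

    params: {instance_name: local_listener_value} (collected manually).
    """
    by_host = defaultdict(set)
    for instance, value in params.items():
        for host in listener_hosts(value):
            by_host[host].add(instance)
    return {host: sorted(names) for host, names in by_host.items() if len(names) > 1}


# Values copied from the `show parameter local_listener` output of PPO1 and PPO2.
params = {
    "PPO1": "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))",
    "PPO2": "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))",
}

print(shared_listener_hosts(params))
```

A non-empty result is only a hint that the configuration deserves a closer look; confirming the problem still requires checking the listener status on each node, as done in this case.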
3 Reproducing the Fault
To confirm this conclusion, I reproduced the fault in a test environment.
[oracle@19crac1 ~]$ sqlplus / as sysdba
SQL> show parameter local_listener
local_listener string (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))
[oracle@19crac2 ~]$ sqlplus / as sysdba
SQL> show parameter local_listener
local_listener string (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))
[grid@19crac1 ~]$ lsnrctl status listener
...(omitted)
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.190)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.191)(PORT=1521)))
Services Summary...
Service "racdb" has 1 instance(s).
  Instance "racdb1", status READY, has 1 handler(s) for this service...
...(omitted)
The command completed successfully
[grid@19crac1 ~]$
[grid@19crac1 ~]$ lsnrctl status listener_scan1
...(omitted)
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.194)(PORT=1521)))
Services Summary...
Service "racdb" has 2 instance(s).
  Instance "racdb1", status READY, has 1 handler(s) for this service...
  Instance "racdb2", status READY, has 1 handler(s) for this service...
...(omitted)
The command completed successfully
[grid@19crac1 ~]$
[grid@19crac2 ~]$ lsnrctl status listener
...(omitted)
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_DATA" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_GRID" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
The command completed successfully
[grid@19crac2 ~]$
As the output shows, the racdb service is not registered with the local listener on node 2.
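Reading long `lsnrctl status` output by eye is error-prone, so the check can be scripted. The sketch below is a hedged Python example that parses the "Services Summary" section from captured text and reports which services and instances a listener has registered. The function name `registered_services` and the constant `NODE2_STATUS` are hypothetical; the sample text is the node 2 listener output from the test environment, and the parser assumes the English message format shown in this article.

```python
import re

# Line formats as they appear in the `lsnrctl status` output in this article.
SERVICE_RE = re.compile(r'^Service "([^"]+)" has \d+ instance\(s\)\.')
INSTANCE_RE = re.compile(r'^Instance "([^"]+)", status (\w+)')


def registered_services(status_output):
    """Parse `lsnrctl status` text into {service: [(instance, status), ...]}."""
    services = {}
    current = None
    for raw in status_output.splitlines():
        line = raw.strip()
        match = SERVICE_RE.match(line)
        if match:
            current = match.group(1)
            services.setdefault(current, [])
            continue
        match = INSTANCE_RE.match(line)
        if match and current is not None:
            services[current].append((match.group(1), match.group(2)))
    return services


# Node 2 local listener output from the test environment (misconfigured state).
NODE2_STATUS = """
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_DATA" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_GRID" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
The command completed successfully
"""

services = registered_services(NODE2_STATUS)
print("racdb" in services)  # False: racdb is missing from node 2's listener
```

In practice the text would come from running `lsnrctl status listener` as the grid user on each node and saving the output; how it is captured is left to the reader.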
With the database instances on both nodes running, try connecting through the SCAN IP:
[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb
SQL>
With both instances running, the connection through the SCAN IP succeeds.
To reproduce the fault, shut down the instance on node 1:
[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1 -o immediate
[oracle@19crac1 ~]$
Now simulate an application connecting to the database again and check whether the connection still succeeds.
[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb
ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol stack
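The connection fails, so the fault has been reproduced.
The fix is to correct the local_listener parameter so that each instance points at its own node's VIP, as follows: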
alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))' scope=both sid='racdb1';
alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.193)(PORT=1521))' scope=both sid='racdb2';
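With more than two nodes, typing these statements by hand invites exactly the kind of copy-and-paste mistake behind this incident. As an illustration only, the Python sketch below generates one statement per instance from an instance-to-VIP mapping. The helper `local_listener_sql` and the `node_vips` dictionary are hypothetical names; the addresses are the test-environment VIPs used in the statements above, and the generated text is meant to be reviewed before it is run in SQL*Plus.

```python
def local_listener_sql(instance, vip, port=1521):
    """Build the ALTER SYSTEM statement that points one instance at its own VIP."""
    address = f"(ADDRESS=(PROTOCOL=TCP)(HOST={vip})(PORT={port}))"
    return f"alter system set local_listener='{address}' scope=both sid='{instance}';"


# Instance -> VIP of the node that instance runs on (test-environment values).
node_vips = {
    "racdb1": "192.168.56.191",
    "racdb2": "192.168.56.193",
}

for instance, vip in node_vips.items():
    print(local_listener_sql(instance, vip))
```

Keeping the mapping in one place makes it easy to see at a glance that no two instances share a VIP.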
Check the local listener on node 2 again.
[grid@19crac2 ~]$ lsnrctl status listener
...(omitted)
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))
...(omitted)
Service "racdb" has 1 instance(s).
  Instance "racdb2", status READY, has 1 handler(s) for this service...
The command completed successfully
[grid@19crac2 ~]$
With local_listener set correctly, the local listener on node 2 now registers the database service.
Shut down the instance on node 1 once more to verify that connections succeed after the change.
[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1 -o immediate
[oracle@19crac1 ~]$
Simulate an application connection again.
[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb
SQL>
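After the local_listener change, the connection succeeds.
The customer maintained that stopping CRS directly and then powering off the host for maintenance is the only way to avoid affecting the applications. We tested that claim as well.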
[root@19crac1 ~]# crsctl stop crs
[grid@19crac2 admin]$ ip addr
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 192.168.56.192/24 brd 192.168.56.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.56.193/24 brd 192.168.56.255 scope global secondary enp0s3:1
       valid_lft forever preferred_lft forever
    inet 192.168.56.194/24 brd 192.168.56.255 scope global secondary enp0s3:2
       valid_lft forever preferred_lft forever
    inet 192.168.56.191/24 brd 192.168.56.255 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
[grid@19crac2 ~]$ lsnrctl status listener
...(omitted)
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_DATA" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
Service "+ASM_DG_GRID" has 1 instance(s).
  Instance "+ASM2", status READY, has 1 handler(s) for this service...
The command completed successfully
[grid@19crac2 ~]$
[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb
ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol stack
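As the output shows, after CRS is stopped on node 1 the local listener on node 2 still has not registered the racdb service, so applications cannot connect in this case either. The customer's earlier maintenance, which stopped CRS directly and then powered off the host, did not affect the applications. That can only mean the local_listener parameter was changed at some later point, which is why the earlier maintenance had no impact.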
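4 Stopping CRS Directly vs. Stopping the Database Instances First and Then CRS
Next, let us compare the two approaches: stopping CRS directly, and stopping the database instances first and then CRS.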
[root@19crac1 ~]# crsctl stop crs
Database alert log:
Shutting down ORACLE instance (abort) (OS id: 21287)
Shutdown is initiated by [email protected] (TNS V1-V3).
[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1
Database alert log:
Shutting down ORACLE instance (immediate) (OS id: 27525)
Shutdown is initiated by [email protected] (TNS V1-V3).
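The alert log shows that when CRS is stopped directly with crsctl stop crs, the database instance is shut down in abort mode. This is too forceful and may cause database problems; it is generally not recommended unless the database is completely hung. When the instance is stopped with srvctl stop instance, it is shut down in immediate mode, which is the safe way to shut down a database.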
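To check after the fact how an instance was actually shut down, the mode can be extracted from the alert log. The following Python sketch is a simple illustration under the assumption that the log contains lines in the "Shutting down ORACLE instance (mode)" format quoted in this article; the function name `shutdown_modes` and the sample text are hypothetical helpers, not an Oracle-provided tool.

```python
import re

# Captures the mode in "Shutting down ORACLE instance (<mode>) ...".
SHUTDOWN_RE = re.compile(r"Shutting down ORACLE instance \(([^)]+)\)")


def shutdown_modes(alert_log_text):
    """Return the shutdown modes recorded in alert log text, in order."""
    return [match.group(1) for match in SHUTDOWN_RE.finditer(alert_log_text)]


# The two alert log lines quoted in this section.
sample = """
Shutting down ORACLE instance (abort) (OS id: 21287)
Shutting down ORACLE instance (immediate) (OS id: 27525)
"""

for mode in shutdown_modes(sample):
    note = "forced shutdown" if mode == "abort" else "clean shutdown"
    print(f"{mode}: {note}")
```

Run against the real alert log, a check like this makes it easy to confirm that a maintenance procedure used the intended shutdown mode.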
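5 Resolution
The customer obtained a maintenance window and scheduled the memory replacement again. Beforehand, the local_listener parameter of the PPO database was corrected on all compute nodes, and the local listeners were checked and found healthy: the listener on every compute node now registered the PPO service.
This time the procedure was again to shut down the database instances first, then stop CRS on the node, and finally power off the host to replace the memory. The applications were not affected.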