
Oracle Cluster Management: Database Unreachable Due to a local_listener Misconfiguration

A case in which an improperly set listener parameter undermined database high availability

Syntax:

ALTER SYSTEM SET local_listener=' (ADDRESS=(PROTOCOL=TCP)(HOST=10.28.224.10)(PORT=1521))','(ADDRESS=(PROTOCOL=TCP)(HOST=10.28.224.10)(PORT=10000))' SCOPE=MEMORY SID='+ASM1';

1 Case Summary

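A customer we had just taken over runs five databases on a 12.2.0.1 RAC environment. Compute node dm01dbadm01 developed a memory fault and had to be shut down for hardware maintenance. The on-site colleague planned to cleanly shut down every database instance on that node first, then stop the CRS stack, and only then power off the host to replace the hardware.

While the database instances on dm01dbadm01 were being shut down, the first four stopped cleanly and the business systems were unaffected. After the PPO instance was shut down, however, the business side quickly reported that applications could no longer connect to the database. PPO is the core database, so the outage affected every business system.

The colleague immediately restarted the PPO instance on dm01dbadm01, and the business systems recovered at once and could connect again. The customer then questioned the procedure, arguing that in the past they had simply stopped the CRS stack and powered off the host for hardware maintenance, and that the business systems had never been affected.

Because the root cause could not be identified in time and the customer was worried about a second outage, the memory replacement was called off.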

2 Fault Analysis

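The colleague promptly reported the fault to the company. After hearing the full story, I suspected the listener and asked the on-site colleague to collect the relevant information right away.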

[oracle@dm01dbadm01 ~]$ export ORACLE_SID=PPO1

[oracle@dm01dbadm01 ~]$ sqlplus / as sysdba

SQL> show parameter local_listener

local_listener    string    (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))

[oracle@dm01dbadm02 ~]$ export ORACLE_SID=PPO2

[oracle@dm01dbadm02 ~]$ sqlplus / as sysdba

SQL> show parameter local_listener

local_listener    string    (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))

[grid@dm01dbadm02 ~]$ lsnrctl status listener

...(omitted)

Listening Endpoints Summary...

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.62.4)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.61.2)(PORT=1521)))

Services Summary...

Service "+ASM" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

...(omitted)

Service "qys" has 1 instance(s).

  Instance "qys2", status READY, has 1 handler(s) for this service...

The command completed successfully

[grid@dm01dbadm02 ~]$

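The collected information made the cause clear at once. The local_listener parameter of the PPO instance on both compute nodes was set to node 1's VIP address, so the local listener on node 2 could not register the PPO instance, while the local listener on node 1 registered it normally. With the PPO instances on both nodes running, clients could connect. Once the PPO instance on node 1 was shut down, they could not.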
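The check that exposed the problem (every instance pointing at the same listener host) can be sketched as a small shell filter. This is only an illustration: the function names extract_host and shared_listener_hosts are hypothetical, and the "SID ADDRESS" input format is an assumption. In practice the pairs might be gathered per instance with show parameter, or from a query on gv$parameter.

```shell
# Hypothetical helper: print the HOST value found in a listener address string.
extract_host() {
    printf '%s\n' "$1" | sed -n 's/.*(HOST=\([^)]*\)).*/\1/p'
}

# Reads "SID ADDRESS" lines on stdin (an assumed input format) and prints
# every host that more than one instance uses as its local_listener.
shared_listener_hosts() {
    while read -r sid addr; do
        [ -n "$sid" ] || continue
        extract_host "$addr"
    done | sort | uniq -d
}

# Values taken from the show parameter output above.
printf '%s\n' \
    'PPO1 (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))' \
    'PPO2 (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.62.2)(PORT=1521))))' |
    shared_listener_hosts
```

Any host printed by the filter is shared by at least two instances, which in a RAC where each instance should register with its own node's VIP is a sign worth investigating.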

3 Reproducing the Fault

To confirm this conclusion, I reproduced the fault in a test environment.

[oracle@19crac1 ~]$ sqlplus / as sysdba

SQL> show parameter local_listener

local_listener    string    (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))

[oracle@19crac2 ~]$ sqlplus / as sysdba

SQL> show parameter local_listener

local_listener    string    (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))

[grid@19crac1 ~]$ lsnrctl status listener

...(omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.190)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.191)(PORT=1521)))

Services Summary...

Service "racdb" has 1 instance(s).

  Instance "racdb1", status READY, has 1 handler(s) for this service...

...(omitted)

The command completed successfully

[grid@19crac1 ~]$

[grid@19crac1 ~]$ lsnrctl status listener_scan1

...(omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.194)(PORT=1521)))

Services Summary...

Service "racdb" has 2 instance(s).

  Instance "racdb1", status READY, has 1 handler(s) for this service...

  Instance "racdb2", status READY, has 1 handler(s) for this service...

...(omitted)

The command completed successfully

[grid@19crac1 ~]$

[grid@19crac2 ~]$ lsnrctl status listener

...(omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))

Services Summary...

Service "+ASM" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

Service "+ASM_DG_DATA" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

Service "+ASM_DG_GRID" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

The command completed successfully

[grid@19crac2 ~]$

As the output shows, the racdb service was not registered with the local listener on node 2.
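Scanning the listener output by eye works for a single node, but the same check can be expressed as a filter over the text of lsnrctl status. The sketch below is a minimal example: the function name service_registered is hypothetical, and it assumes the "Service "name" has N instance(s)." line format seen in the output above.

```shell
# Hypothetical helper: succeeds (exit status 0) if the named service appears
# in `lsnrctl status` output supplied on stdin.
service_registered() {
    grep -q "^Service[[:space:]]*\"$1\" has "
}

# Intended use on a node (not executed here, since it needs a live listener):
#   lsnrctl status listener | service_registered racdb \
#       || echo "racdb is NOT registered with the local listener"
```

Because the function only reads text, it can also be run against saved listener output collected from each node.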

At this point the database instances on both nodes are running normally. Try connecting to the database through the SCAN IP:

[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb

SQL>

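As shown, when the instances on both nodes are running, the database can be reached through the SCAN IP.

To reproduce the fault, shut down the instance on node 1: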

[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1 -o immediate

[oracle@19crac1 ~]$

Now simulate a business-system connection again and check whether it succeeds.

[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb

ERROR:

ORA-12516: TNS:listener could not find available handler with matching protocol stack

The connection fails, so the fault has been reproduced.

The fix is to set local_listener on each instance to its own node's VIP, as follows:

alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.191)(PORT=1521))' scope=both sid='racdb1';

alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.193)(PORT=1521))' scope=both sid='racdb2';
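On clusters with more nodes, typing these statements by hand invites exactly the copy-and-paste mistake this case is about. One option is to generate them from a list of instance and VIP pairs and review the output before running it. The sketch below is illustrative only: local_listener_sql is a hypothetical helper, and the default port of 1521 is an assumption matching the test cluster above.

```shell
# Hypothetical helper: print the ALTER SYSTEM statement that points one
# instance at its own node's VIP. The port defaults to 1521 (an assumption).
local_listener_sql() {
    sid=$1
    vip=$2
    port=${3:-1521}
    printf "alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=%s)(PORT=%s))' scope=both sid='%s';\n" \
        "$vip" "$port" "$sid"
}

# Instance-to-VIP pairs from the test cluster above.
local_listener_sql racdb1 192.168.56.191
local_listener_sql racdb2 192.168.56.193
```

The function only prints SQL text. Running the statements is left to the DBA, who can first confirm that each SID is paired with the VIP of the node it runs on.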

Check the local listener on node 2 again.

[grid@19crac2 ~]$ lsnrctl status listener

...(omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))

...(omitted)

Service "racdb" has 1 instance(s).

  Instance "racdb2", status READY, has 1 handler(s) for this service...

The command completed successfully

[grid@19crac2 ~]$

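With local_listener set correctly, the local listener on node 2 has now registered the database service.

Shut down the instance on node 1 again to verify that the database is still reachable after the local_listener change.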

[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1 -o immediate

[oracle@19crac1 ~]$

Simulate the business-system connection once more.

[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb

SQL>

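After the local_listener change, the database can be connected to normally.

The customer argued that the CRS stack should be stopped directly before powering off for hardware maintenance, and that only this approach leaves the business systems unaffected. We tested that claim in a simulation.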

[root@19crac1 ~]# crsctl stop crs

[grid@19crac2 admin]$ ip addr

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

    inet 192.168.56.192/24 brd 192.168.56.255 scope global enp0s3

       valid_lft forever preferred_lft forever

    inet 192.168.56.193/24 brd 192.168.56.255 scope global secondary enp0s3:1

       valid_lft forever preferred_lft forever

    inet 192.168.56.194/24 brd 192.168.56.255 scope global secondary enp0s3:2

       valid_lft forever preferred_lft forever

    inet 192.168.56.191/24 brd 192.168.56.255 scope global secondary enp0s3:3

       valid_lft forever preferred_lft forever

[grid@19crac2 ~]$ lsnrctl status listener

...(omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.192)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.193)(PORT=1521)))

Services Summary...

Service "+ASM" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

Service "+ASM_DG_DATA" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

Service "+ASM_DG_GRID" has 1 instance(s).

  Instance "+ASM2", status READY, has 1 handler(s) for this service...

The command completed successfully

[grid@19crac2 ~]$

[oracle@19crac2 ~]$ sqlplus test/[email protected]:1521/racdb

ERROR:

ORA-12516: TNS:listener could not find available handler with matching protocol stack

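As shown, after CRS is stopped on node 1, the local listener on node 2 still has not registered the racdb service, so the business systems cannot connect in this case either. The customer's earlier maintenance, in which CRS was stopped directly and the host powered off, did not affect the business. That can only mean the local_listener parameter was changed at some later point, which is why the earlier hardware maintenance caused no impact.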
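The ip addr and lsnrctl status output above shows that node 1's VIP (192.168.56.191) moved to node 2, yet node 2's local listener still listens only on 192.168.56.192 and 192.168.56.193. That comparison can be sketched as a filter over the listener output. The function names listener_hosts and listens_on are hypothetical, and the parsing assumes the (HOST=...) endpoint format shown above.

```shell
# Hypothetical helper: print every HOST value found in `lsnrctl status`
# output supplied on stdin, one per line.
listener_hosts() {
    sed -n 's/.*(HOST=\([^)]*\)).*/\1/p'
}

# Succeeds if the listener output on stdin includes an endpoint on the
# given address (exact, whole-line match).
listens_on() {
    listener_hosts | grep -qxF "$1"
}

# Intended use on a node (not executed here, since it needs a live listener):
#   lsnrctl status listener | listens_on 192.168.56.191 \
#       || echo "local listener has no endpoint on 192.168.56.191"
```

An address being present on the interface is therefore not enough. The instance's local_listener value has to match an endpoint that a listener on that node actually serves.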

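4 Stopping CRS Directly vs. "Stop the Database Instance First, Then Stop CRS"

Next, let us look at the difference between stopping the CRS stack directly and stopping the database instance first and then stopping CRS.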

[root@19crac1 ~]# crsctl stop crs

Database alert log:

Shutting down ORACLE instance (abort) (OS id: 21287)

Shutdown is initiated by [email protected] (TNS V1-V3).

[oracle@19crac1 ~]$ srvctl stop instance -db racdb -i racdb1

Database alert log:

Shutting down ORACLE instance (immediate) (OS id: 27525)

Shutdown is initiated by [email protected] (TNS V1-V3).

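The alert logs show that when crsctl stop crs is used to stop the CRS stack directly, the database instance is shut down in abort mode. This is too aggressive and can cause database problems, so it is generally not recommended unless the database is completely hung. When srvctl stop instance is used, the instance is shut down in immediate mode, which is the safe way to shut down a database.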
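To see which shutdown mode past maintenance actually used, the alert log can be filtered for these messages. The following is a minimal sketch: shutdown_modes is a hypothetical helper, and it assumes the "Shutting down ORACLE instance (mode)" line format shown in the two excerpts above.

```shell
# Hypothetical helper: print the shutdown mode (abort, immediate, ...) from
# each "Shutting down ORACLE instance (...)" line read on stdin.
shutdown_modes() {
    sed -n 's/.*Shutting down ORACLE[[:space:]]*instance (\([a-z]*\)).*/\1/p'
}

# The two alert log lines quoted above.
printf '%s\n' \
    'Shutting down ORACLE instance (abort) (OS id: 21287)' \
    'Shutting down ORACLE instance (immediate) (OS id: 27525)' |
    shutdown_modes
```

Against a real system the input would be the instance's alert log file, whose location depends on the diagnostic destination configured for that database.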

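5 Resolution

The customer obtained a maintenance window and scheduled the memory replacement again. Beforehand, the local_listener parameter of the PPO database was corrected on all compute nodes, and the local listeners were checked and found to be back to normal, with the PPO service registered on every compute node.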

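For this replacement we again shut down the database instances first, then stopped the CRS stack on the node, and finally powered off the host to replace the memory. The business systems were not affected.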
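The order of operations used here can be written down as a dry-run plan that prints the commands without executing them, so the sequence can be reviewed before the maintenance window. This is a sketch under stated assumptions: node_shutdown_plan and its "db:instance" argument format are hypothetical, and PPO:PPO1 assumes the database name is PPO with instance PPO1 on the node being serviced. The srvctl and crsctl invocations are the ones used earlier in this article.

```shell
# Hypothetical helper: print, in order, the commands for taking one node down.
# Each argument is a "db:instance" pair for an instance running on that node.
# Nothing is executed; the output is meant to be reviewed first.
node_shutdown_plan() {
    for pair in "$@"; do
        printf 'srvctl stop instance -db %s -i %s -o immediate\n' \
            "${pair%%:*}" "${pair#*:}"
    done
    # The CRS stack is stopped last, as root, once every instance is down.
    printf '%s\n' 'crsctl stop crs'
}

node_shutdown_plan PPO:PPO1
```

Before following such a plan, it is worth confirming that the services are registered with the local listeners on the surviving nodes, since that was the missing precondition in this case.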
