Starting Database Instance Using srvctl Fails With Errors PRCR-1013 CRS-2674 CRS-2678 CRS-5802

Recently we had an issue on one of the Exadata compute nodes where the database instances could not be controlled with srvctl. When we used srvctl to start a database instance, it did nothing and simply returned to the prompt. The same database instance could be started and stopped using sqlplus.

[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ srvctl status database -db ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is not running on node exaaskmdb02
Instance ASKMDB3 is not running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06
[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ srvctl start database -db ASKMDB
[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$
[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ srvctl status database -db ASKMDB
Instance ASKMDB1 is not running on node exaaskmdb01
Instance ASKMDB2 is not running on node exaaskmdb02
Instance ASKMDB3 is not running on node exaaskmdb03
Instance ASKMDB4 is not running on node exaaskmdb04
Instance ASKMDB5 is not running on node exaaskmdb05
Instance ASKMDB6 is not running on node exaaskmdb06

We then tried with sqlplus, which worked:

SQL> startup
ORACLE instance started.
Total System Global Area 1.4160E+10 bytes
Fixed Size 8636920 bytes
Variable Size 1.2119E+10 bytes
Database Buffers 2013265920 bytes
Redo Buffers 19566592 bytes
Database mounted.
Database opened.
SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit

The cluster utility crsctl showed the state of this database resource as UNKNOWN on exaaskmdb03.

[grid@exaaskmdb04 ~]$ crsctl status res ora.askmdb.db -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.askmdb.db
      1        ONLINE  UNKNOWN      exaaskmdb03              STABLE
      2        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
      3        ONLINE  OFFLINE                               STABLE
      4        ONLINE  OFFLINE                               STABLE
      5        ONLINE  OFFLINE                               STABLE
      6        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------
[grid@exaaskmdb04 ~]$
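
A quick way to spot instances stuck in UNKNOWN is to filter the crsctl table. The following is a minimal sketch, not part of the original troubleshooting; the function name unknown_instances is hypothetical, and it assumes the row layout shown above (instance number, Target, State, Server).

```shell
# Print "<instance number> <server>" for every row whose State is UNKNOWN.
# Reads the output of `crsctl status res <resource> -t` on stdin.
# Assumption: instance rows start with a number, followed by Target and State.
unknown_instances() {
    awk '$1 ~ /^[0-9]+$/ && $3 == "UNKNOWN" { print $1, $4 }'
}
```

It could be used as: crsctl status res ora.askmdb.db -t | unknown_instances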

We tried the action plan described in my previous blog post, Exadata Database UNKNOWN issue, but it did not work.

So something else was going on.

How do we handle this situation? We needed more logging to see what happens during the srvctl call and why srvctl could not start or stop the database instance.

srvctl has an option to enable tracing. We set the environment variable SRVM_TRACE=true and ran srvctl against the database instance again.

[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ export SRVM_TRACE=true
[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ srvctl stop instance -db ASKMDB -i ASKMDB3
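
Because the trace is long, it can help to save it to a file instead of reading it on the terminal. This is a small sketch under the assumption that the trace is written to the command's standard output and error; the helper name run_with_srvm_trace and the log file path are hypothetical.

```shell
# Run a command with SRVM_TRACE=true set for that command only,
# saving everything it prints (stdout and stderr) to a log file.
# Usage: run_with_srvm_trace <logfile> <command> [args...]
run_with_srvm_trace() {
    trace_log=$1
    shift
    SRVM_TRACE=true "$@" > "$trace_log" 2>&1
}
```

For example: run_with_srvm_trace /tmp/srvctl_stop.trc srvctl stop instance -db ASKMDB -i ASKMDB3
Setting the variable on the command line this way avoids leaving tracing enabled in the shell session afterwards.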

The trace captured the following errors.

NOTE: The output is truncated.

[main] [ 2019-12-09 13:02:25.187 EST ] [CRSNative.stopResources:681] [MAJOR EVENT] About to stop resources: filter (((((NAME starts ora.askmdb.) AND (NAME ends .db)) AND (TYPE == ora.database.type)) AND ((STATE != OFFLINE) OR (TARGET != OFFLINE))) AND (LAST_SERVER == exaaskmdb03)), force false, keep false
[main] [ 2019-12-09 13:09:25.278 EST ] [CRSNative.stopResourcesHelper:661] [MAJOR EVENT] Failed to stop resources: filter (((((NAME starts ora.askmdb.) AND (NAME ends .db)) AND (TYPE == ora.database.type)) AND ((STATE != OFFLINE) OR (TARGET != OFFLINE))) AND (LAST_SERVER == exaaskmdb03)), msg CRS-2675: Stop of 'ora.askmdb.db' on 'exaaskmdb03' failed
CRS-2678: 'ora.askmdb.db' on 'exaaskmdb03' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-5802: Unable to start the agent process
[main] [ 2019-12-09 13:09:25.280 EST ] [CRSCache.getAttributesFromCache:229] CRS cache: ora.askmdb.db [<DATABASE_TYPE:RAC>]
[main] [ 2019-12-09 13:09:25.288 EST ] [InterruptHandler.unRegisterInterruptHandler:76] UNRegistering shutdown hook.....
[main] [ 2019-12-09 13:09:25.288 EST ] [InterruptHandler.unRegisterInterruptHandler:81] UnRegistered shutdown hook.....
[main] [ 2019-12-09 13:09:25.288 EST ] [OPSCTLDriver.main:247] OPSCTL execute() failed. Unregistered OPSCTL driver's interrupt handler
[main] [ 2019-12-09 13:09:25.288 EST ] [OPSCTLDriver.main:254] exiting abnormally due to FrameworkException
PRCD-1131 : Failed to stop database ASKMDB and its services on nodes exaaskmdb03
PRCR-1133 : Failed to stop database ASKMDB and its running services
PRCR-1132 : Failed to stop resources using a filter
CRS-2675: Stop of 'ora.askmdb.db' on 'exaaskmdb03' failed
CRS-2678: 'ora.askmdb.db' on 'exaaskmdb03' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-5802: Unable to start the agent process
[main] [ 2019-12-09 13:09:25.301 EST ] [OPSCTLDriver.main:256] PRCD-1131 : Failed to stop database ASKMDB and its services on nodes exaaskmdb03
PRCR-1133 : Failed to stop database ASKMDB and its running services
PRCR-1132 : Failed to stop resources using a filter
CRS-2675: Stop of 'ora.askmdb.db' on 'exaaskmdb03' failed
CRS-2678: 'ora.askmdb.db' on 'exaaskmdb03' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-5802: Unable to start the agent process
oracle.ops.opsctl.StopAction.executeInstance(StopAction.java:599)
oracle.ops.opsctl.Action.execute(Action.java:426)
oracle.ops.opsctl.OPSCTLDriver.execute(OPSCTLDriver.java:507)
oracle.ops.opsctl.OPSCTLDriver.main(OPSCTLDriver.java:236)
[main] [ 2019-12-09 13:09:25.301 EST ] [SRVMContext.term:160] Performing SRVM Context Term. Term counter is 2
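
The same error stack is repeated several times in the trace. To get the distinct error codes to search for, a filter like the one below can be applied to a saved trace. It is only a sketch; the function name extract_error_codes is hypothetical, and the pattern covers just the CRS-nnnn and PRCx-nnnn codes seen in this trace.

```shell
# Print the distinct CRS-nnnn and PRCx-nnnn error codes found on stdin,
# one per line, in sorted order.
extract_error_codes() {
    grep -oE '(CRS|PRC[A-Z])-[0-9]+' | LC_ALL=C sort -u
}
```

For example: extract_error_codes < /tmp/srvctl_stop.trc (the file name is an assumption).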

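We then went through the cluster logs and found the following errors in crsd_oraagent_oracle.trc. This log file is located at /u01/app/grid/diag/crs/exaaskmdb03/crs/trace/crsd_oraagent_oracle.trc. The agent for the oracle user fails to connect to CRSD over the IPC socket CRSD_IPC_SOCKET_11 (connection refused) and exits, which is consistent with CRS-5802.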

2019-12-09 13:58:48.424 :CLSFRAME:698249344: New Framework state: 2
2019-12-09 13:58:48.424 :CLSFRAME:698249344: M2M is starting...
2019-12-09 13:58:48.424 : CRSCOMM:698249344: Ipc: Starting send thread
2019-12-09 13:58:48.425 :GIPCXCPT:698249344: gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
2019-12-09 13:58:48.425 :GIPCXCPT:698249344: gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ] failed sync connect endp 0x1df73e0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1df9fd0, sendp 0x1df9d80 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1df87f0 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2019-12-09 13:58:48.425 : CRSCOMM:698249344: IpcC: gipcConnect() failed, rc= 29
2019-12-09 13:58:48.425 : CRSCOMM:698249344: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29
2019-12-09 13:58:48.425 :CLSFRAME:698249344: Failure at IPC connect to server:2
2019-12-09 13:58:48.425 :CLSFRAME:698249344: Unable to start module-to-module comms: 1
2019-12-09 13:58:48.425 : CRSCOMM:494253824: Ipc: sendWork thread started.
2019-12-09 13:58:48.425 : AGENT:698249344: Created alert : (:CRSAGF00120:) : Agent Framework failed to start:1
2019-12-09 13:58:48.425 : AGENT:698249344: Agfw calling user exitCB, will exit on return
2019-12-09 13:58:48.425 : AGENT:698249344: returned from user exitCB, exiting
2019-12-09 13:58:48.425 : AGENT:698249344: Agent is exiting with exit code: 1
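
The important detail in this trace is the IPC key that the agent could not connect to. A short filter can pull it out of a large trace file. This is a sketch that assumes the "Could not connect to" line format shown above; the function name refused_ipc_keys is hypothetical.

```shell
# Print the distinct IPC keys from "Could not connect to (...(KEY=...))" lines.
# Reads an agent trace such as crsd_oraagent_oracle.trc on stdin.
refused_ipc_keys() {
    grep 'Could not connect to' \
        | sed -n 's/.*(KEY=\([^)]*\)).*/\1/p' \
        | LC_ALL=C sort -u
}
```

For example: refused_ipc_keys < /u01/app/grid/diag/crs/exaaskmdb03/crs/trace/crsd_oraagent_oracle.trc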

The first thing we did was to verify the permissions of the following files. All of them had the correct permissions.

[root@exaaskmdb03 output]# ls -lrt crsd_oraagent*
-rw-r--r-- 1 grid oinstall 63443 Nov 4 02:33 crsd_oraagent_gridOUT.trc
-rw-r--r-- 1 grid oinstall 7 Nov 4 02:33 crsd_oraagent_grid.pid
-rw-r--r-- 1 oracle oinstall 190495 Dec 9 13:58 crsd_oraagent_oracleOUT.trc
-rw-r--r-- 1 oracle oinstall 7 Dec 9 13:58 crsd_oraagent_oracle.pid

Putting all this trace and log information together, we found MOS note 2335214.1.

We then verified the socket files. Only the UI sockets were present; the CRSD_IPC_SOCKET_11 socket named in the agent trace was missing.

[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$ ls -l *SOCKET*
srwxrwxrwx 1 root root 0 Nov 4 02:31 sCRSD_UI_SOCKET
srwxrwxrwx 1 root root 0 Nov 4 02:29 sOHASD_UI_SOCKET
[oracle@exaaskmdb03.askmlabs.com:ASKMDB3]$
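
The same check can be scripted for a specific IPC key. The sketch below rests on two assumptions that are not confirmed by the listing itself: that the socket files live in /var/tmp/.oracle (the directory used in the MOS steps), and that the file name is the IPC key prefixed with "s", as with sCRSD_UI_SOCKET. The function name ipc_socket_exists is hypothetical.

```shell
# Return 0 if a Unix domain socket file for the given IPC key exists in the
# given directory, non-zero otherwise.
# Assumption: the on-disk name is the key prefixed with "s".
# Usage: ipc_socket_exists <directory> <ipc key>
ipc_socket_exists() {
    socket_dir=$1
    ipc_key=$2
    [ -S "$socket_dir/s$ipc_key" ]
}
```

For example: ipc_socket_exists /var/tmp/.oracle CRSD_IPC_SOCKET_11 || echo "socket missing"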

The MOS note gives the following solution (crsctl stop crs and crsctl start crs must be run as root).

crsctl stop crs -f
cd /var/tmp/.oracle
rm *
crsctl start crs

After the complete cluster restart, we were able to start and stop the database using srvctl, and we could see the IPC socket files.

Hope this information helps....

askMLabs