Tuesday, February 18, 2014

Cluster Ready Services not starting on one of the nodes: CRS-2674: Start of 'ora.ctssd' on 'rac1' failed

While trying to start the cluster on one of my nodes, I got the following errors:
[root@rac1 ~]# crsctl start cluster
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2674: Start of 'ora.ctssd' on 'rac1' failed
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2674: Start of 'ora.cluster_interconnect.haip' on 'rac1' failed
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2674: Start of 'ora.ctssd' on 'rac1' failed
CRS-4000: Command Start failed, or completed with errors.
[root@rac1 ~]#
Weird. I did some investigation to find out what was causing the issue.
First I checked the status of the cluster on all nodes:
[root@rac1 ~]# crsctl check cluster -all
**************************************************************
rac1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
[root@rac1 ~]# 
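This showed that everything was working on my other node, rac2, while rac1 could not communicate with Cluster Ready Services or the Event Manager.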

Next I checked the status of the OCR from the second node. It was healthy, which is good:
[root@rac2 ~]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2900
         Available space (kbytes) :     259220
         ID                       :  488817485
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
[root@rac2 ~]#
The next thing I did was go to the cssd log directory under the Grid home, $ORACLE_HOME/log/rac1/cssd/, and tail ocssd.log, which gave me the clue as to why it was not starting: my disk was full of logs. Note the "No space left on device (28)" message on the mkdir operation:
2014-02-18 10:50:25.020: [    CSSD][4090484480]clssnmSendingThread: sending status msg to all nodes
2014-02-18 10:50:25.020: [    CSSD][4090484480]clssnmSendingThread: sent 5 status msgs to all nodes
2014-02-18 10:50:25.624: [GIPCXCPT][286312192] gipcmodClsaAuthInit: failed on clsaauthstart ret clsaretOSD (8), endp 0x7f750819a940 [0000000000000cfb] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac1_)(GIPCID=29c28405-72e1b71f-3076))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac1_)(GIPCID=72e1b71f-29c28405-3134))', numPend 5, numReady 1, numDone 0, numDead 3, numTransfer 0, objFlags 0x0, pidPeer 3134, flags 0x603710, usrFlags 0x14000 }
2014-02-18 10:50:25.624: [GIPCXCPT][286312192] gipcmodClsaAuthInit: slos op  :  mkdir
2014-02-18 10:50:25.624: [GIPCXCPT][286312192] gipcmodClsaAuthInit: slos dep :  No space left on device (28)
2014-02-18 10:50:25.624: [GIPCXCPT][286312192] gipcmodClsaAuthInit: slos loc :  authprep6
2014-02-18 10:50:25.624: [GIPCXCPT][286312192] gipcmodClsaAuthInit: slos info:  failed to make dir /u01/app/11.2.0/grid/auth/css/rac1/A2256404
2014-02-18 10:50:25.625: [GIPCXCPT][286312192] gipcmodMuxTransferAccept: internal accept request failed endp 0x7f75080099e0, child 0x7f750819a940, ret gipcretAuthFail (22)
2014-02-18 10:50:25.625: [ GIPCMUX][286312192] gipcmodMuxTransferAccept: EXCEPTION[ ret gipcretAuthFail (22) ]  error during accept on endp 0x7f75080099e0
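Rather than scrolling through the whole log, the out-of-space message can be searched for directly. This is a minimal sketch, assuming a POSIX shell and grep. The function name find_enospc is my own and not part of any Oracle tooling.

```shell
# Hypothetical helper (not an Oracle tool): print every line, with its
# line number, that reports the OS error "No space left on device".
find_enospc() {
    log_file="$1"
    grep -n 'No space left on device' "$log_file"
}

# Example call, using the log location from this post:
#   find_enospc "$ORACLE_HOME/log/rac1/cssd/ocssd.log"
```

If the function prints nothing, the log does not contain that message and the cause is probably something else.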
After I freed up space on the disk, the cluster started.
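To confirm the space problem and see what is using the disk, something like the following can help. It is only a sketch, assuming a POSIX shell with df, du, awk, sort and head available. The function names fs_use_pct and largest_entries are hypothetical, and which files are safe to remove is a decision for you to make on your own system.

```shell
# Hypothetical helper: print the use percentage (without the % sign)
# of the filesystem that holds the given path.
fs_use_pct() {
    # df -P prints one header line and one data line; field 5 is the capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: list the largest entries directly under a directory,
# biggest first, with sizes in kilobytes. The second argument is the
# number of entries to show (default 10).
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example calls, using the Grid home path that appears in the log above:
#   fs_use_pct /u01/app/11.2.0/grid
#   largest_entries /u01/app/11.2.0/grid/log/rac1
```

A value of 100 from fs_use_pct matches the "No space left on device" error. Once enough space has been freed, rerun crsctl start cluster as root on the affected node.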

2 comments:

Anonymous said...

How did you fix the issue?

Harvey said...

The issue was fixed by freeing up some space on the disk.