Thursday, April 3, 2014

Oracle Voting Disk Explained

What is an Oracle voting disk, why do we use it, and how do the nodes use it?

There can be a lot of scenarios for Oracle RAC nodes and voting disks, but here I am trying to explain the logic behind the voting disk and its use.

In simple terms, an Oracle voting disk is a file which holds at least two types of information: one, whether a node can see its own instance running; and two, whether a node can see the other instances running.

Every second each node sends out a heartbeat to the other nodes and to the voting disk, to check whether it can see them, and records the result in the voting disk. Let's divide this into two main scenarios: one where the interconnect breaks down for a node, and a second where a node cannot write to the voting disk.
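To make this concrete, here is a minimal Python sketch of the two heartbeats under a simplified model of my own. The class, the names, and the timeout values are illustrative assumptions, not Oracle's internals (CSS has real tunables such as misscount and disktimeout, whose defaults depend on version and platform):

```python
import time

# Illustrative timeouts (assumptions for this sketch only).
NET_TIMEOUT_SECS = 30
DISK_TIMEOUT_SECS = 200

class Node:
    def __init__(self, name):
        self.name = name
        self.last_net_beat = {}   # peer name -> time of last heartbeat seen
        self.last_disk_beat = {}  # voting disk name -> time of last write

    def beat(self, reachable_peers, writable_disks):
        """Runs every second: ping the peers over the interconnect and
        record a heartbeat in every voting disk we can write to."""
        now = time.time()
        for peer in reachable_peers:
            self.last_net_beat[peer] = now
        for disk in writable_disks:
            self.last_disk_beat[disk] = now

    def unreachable_peers(self):
        """Peers whose network heartbeat has gone stale."""
        now = time.time()
        return [p for p, t in self.last_net_beat.items()
                if now - t > NET_TIMEOUT_SECS]

    def lost_disks(self):
        """Voting disks we have not managed to write to recently."""
        now = time.time()
        return [d for d, t in self.last_disk_beat.items()
                if now - t > DISK_TIMEOUT_SECS]
```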

Scenario 1: understanding the interconnect breakdown. Image 1 shows the cluster when all is good.

Image 1
In image 1 each node can see every other node and there are no issues. Let's see what happens if the interconnect for node 2 breaks, as shown in image 2.
Image 2
In image 2, because of the breakdown of the interconnect for node 2, there are now two cluster groups: one with node2 and the other with node1, node3 and node4. This situation is called split brain. In this state both sub-clusters could work independently and update the same block, causing corruption.

To avoid this, the sub-cluster with the highest number of nodes elects a master and gets hold of the control file. The newly elected master then checks the voting disk to see which node should be poisoned, puts the poison bit for that node, and evicts it from the cluster. In this case it will be node 2 that is evicted.
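As a toy model of this decision, the sketch below simply picks the largest sub-cluster as the survivor and marks everyone else for eviction. It illustrates the logic described above and is not Oracle's actual code; the function name is my own:

```python
def resolve_split_brain(sub_clusters):
    """sub_clusters: list of lists of node names, e.g. [['node2'], ['node1', 'node3', 'node4']].
    Returns (survivors, evicted)."""
    survivors = max(sub_clusters, key=len)  # biggest sub-cluster wins
    evicted = [n for group in sub_clusters if group is not survivors
               for n in group]
    return survivors, evicted

# Image 2: node 2 is cut off from nodes 1, 3 and 4 by the interconnect failure.
survivors, evicted = resolve_split_brain([["node2"], ["node1", "node3", "node4"]])
print(survivors)  # ['node1', 'node3', 'node4'] -> this side elects the master
print(evicted)    # ['node2'] -> the master writes its poison bit in the voting disk
```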

Scenario 2: nodes breaking communication with the voting disks.

Suppose we are using more than one voting disk, say three as in the example below, and things are going smoothly.
Image 3
Now suppose that, because of some load or any other reason, node 3 is not able to write to voting disks 1 and 2. Please note that the interconnect is all good here.
Image 4
If a node is not able to communicate with the voting disks, the rule is that only a node which can talk to more than half of the voting disks will survive. In the case of image 4 there is again a split brain between two sub-clusters: one with node3 and the other with node1, node2 and node4. One of the nodes from the sub-cluster with node1, node2 and node4 will elect a master, get hold of the control file, and then update the voting disk to evict node 3, and node 3 will be evicted.
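The "more than half" rule is easy to express as code. Here is a minimal sketch of the rule as described above:

```python
def survives(accessible_disks: int, total_disks: int) -> bool:
    """A node survives only if it can still talk to a strict
    majority of the voting disks (more than half)."""
    return accessible_disks > total_disks // 2

# Image 4: node 3 has lost voting disks 1 and 2, so it sees only 1 of 3.
print(survives(1, 3))   # False -> node 3 is evicted
# node1, node2 and node4 still see all three disks.
print(survives(3, 3))   # True  -> they survive
```

This strict-majority rule is also why an odd number of voting disks is recommended: with an even count, losing exactly half is already enough to lose the majority.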

Let's discuss two scenarios which will make this exercise rather interesting.

Image 4 shows that on a perfectly working cluster node3 somehow cannot communicate with votingdisk2. Remember, when communication between a node and the voting disks breaks down, a node must still be able to access more than half of the voting disks to survive; a node accessing only one voting disk out of three would be evicted. Here node 3 can still reach two of the three disks, so eviction will not occur, but you will notice some error messages in the alert log.

Image 4
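Plugging this case into the survives() sketch above: node 3 can still reach two of the three voting disks, and survives(2, 3) returns True, so the node keeps its majority and stays in the cluster.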


Why did I choose a 4-node cluster for my example? I wanted to show what will happen if somehow there are two sub-clusters with an equal number of nodes in them, as in image 5.

Image 5
Interesting: what will happen now?
The breakdown of the interconnect has caused a split brain, which means that from either sub-cluster one node will try to become the master and evict the other side. Usually the surviving sub-cluster would be the one with the maximum number of nodes, but in this case both sub-clusters have an equal number. This tie is broken by choosing the sub-cluster that contains the lowest node number. Hence the surviving sub-cluster will be the one with node 1 and node 2; it will choose the master node, the master will set the poison bit for the other nodes in the voting disk, and those nodes will be evicted.
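Here is a toy sketch of the combined rule as described above: prefer the larger sub-cluster, and break ties in favour of the sub-cluster containing the lowest node number. The function name and the representation are my own, for illustration only:

```python
def surviving_sub_cluster(sub_clusters):
    """sub_clusters: list of lists of node numbers, e.g. [[1, 2], [3, 4]]."""
    # Larger size wins; on a tie, the group holding the lowest node
    # number wins (hence -min(group) in the sort key).
    return max(sub_clusters, key=lambda group: (len(group), -min(group)))

# Image 5: the interconnect split leaves two equal halves.
print(surviving_sub_cluster([[1, 2], [3, 4]]))  # [1, 2] survives; 3 and 4 get the poison bit
# For comparison, an unequal split: size still dominates the tie-break.
print(surviving_sub_cluster([[2], [1, 3, 4]]))  # [1, 3, 4]
```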

