Problem: Both DRBD servers are in Primary/Unknown state. The site is down. All hell has broken loose!

[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:272314733 nr:15766224 dw:88314479 dr:218442557 al:24340 bm:12960 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:23841424 misses:24646 starving:0 dirty:306 changed:24340  
[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:15766224 nr:272314733 dw:288180421 dr:6902987 al:4734 bm:13160 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:3961688 misses:4941 starving:0 dirty:207 changed:4734  

I had assigned db2 to be the primary DB server a week ago, for monitoring purposes.

This means DB2 was Primary at the time of failure. I have stopped heartbeat and DRBD on db1.

[root@db1 ~]# chkconfig --list | grep heartbeat  
heartbeat       0:off   1:off   2:on    3:on    4:on    5:on    6:off  
[root@db1 ~]# chkconfig --list | grep drbd  
drbd            0:off   1:off   2:off   3:on    4:on    5:on    6:off  


chkconfig --level 35 drbd off  
chkconfig --level 35 heartbeat off  
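The running services themselves were also stopped with the usual init scripts. A minimal sketch of what was run on DB1 (the exact transcript was not captured, and the order is assumed):

service heartbeat stop   # release the cluster resources on db1 first  
service drbd stop        # then stop DRBD and unload its module  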

The rescue operation started at 18:55 PST, 14 Mar 2008.

The current status is as follows:

The webmail servers are stopped.

DB1 and DB2 are powered on.

DB2 was Primary before the failure, so heartbeat and DRBD services are stopped on DB1.

/etc/ha.d/ha.cf has been updated on both servers to add eth0 as an additional bcast interface. The deadtime has also been changed from 10 seconds to 30 seconds.

The /etc/ha.d/ha.cf now looks like this:

logfacility     local0
keepalive 2
deadtime 30 # USE THIS!!!
#deadtime 10
bcast   eth0
bcast   eth1
#serial /dev/ttyS0
#baud 19200
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
node db1.example.com db2.example.com
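Before relying on the extra bcast path, it is worth a quick sanity check that the interfaces named in ha.cf actually exist and are up on both nodes. A minimal sketch, assuming the interface names shown above:

awk '/^bcast/ {print $2}' /etc/ha.d/ha.cf   # list the bcast interfaces heartbeat will use  
ip link show eth0                           # confirm eth0 exists and is UP  
ip link show eth1                           # confirm eth1 exists and is UP  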

After these changes, heartbeat and DRBD are kept OFF on DB1, and heartbeat (not DRBD) is restarted on DB2.
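The restart on DB2 was done with the standard init script (shown as a sketch; the exact transcript was not captured):

service heartbeat start   # on DB2 only; DRBD was already running there  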

The status of DRBD on DB2 is :-

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:15766224 nr:272314733 dw:288877845 dr:23063634 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4135979 misses:5006 starving:0 dirty:207 changed:4799  

I have to start DRBD on DB1 manually :-

[root@db1 ~]# modprobe drbd  
  
[root@db1 ~]# lsmod | grep -i drbd  
drbd                  181896  0  
  
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
[root@db1 ~]#               

DB2 is still in StandAlone mode:-

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:15766224 nr:272314733 dw:288891985 dr:23063646 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4139514 misses:5006 starving:0 dirty:207 changed:4799  

Back on DB1 :-

[root@db1 ~]# drbdadm up all  
  
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---  
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  
[root@db1 ~]#  

Notice that the status of DRBD on DB1 is now “WFConnection” => Waiting For Connection.
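While waiting for the peer, the connection state can be followed continuously instead of re-running cat by hand. A small sketch:

watch -n 2 cat /proc/drbd   # refresh the DRBD status every 2 seconds  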

Meanwhile, the status on DB2 is unchanged:-

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:15766224 nr:272314733 dw:288891985 dr:23063646 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4139514 misses:5006 starving:0 dirty:207 changed:4799  

You cannot bring the resource up with drbdadm on this node, since the device is already attached to its backing disk. If you run it anyway, you get:-

[root@db2 ~]# drbdadm up all  
Failure: (124) Device is attached to a disk (use detach first)  
Command 'drbdsetup /dev/drbd0 disk /dev/sda6 /dev/sda5 flexible --set-defaults --create-device' terminated with exit code 10  
[root@db2 ~]#         

Since we don’t need the disk to be mounted while we do this, and we don’t want any other service interfering with our /data, let’s shut down heartbeat (not DRBD) on DB2. This will automatically unmount /data as well.

[root@db2 ~]#  service heartbeat stop  
  
So after this step, all NFS, MySQL, and other services are stopped and /data is unmounted.  
  
[root@db2 ~]# mount  
/dev/sda1 on / type ext3 (rw)  
proc on /proc type proc (rw)  
sysfs on /sys type sysfs (rw)  
devpts on /dev/pts type devpts (rw,gid=5,mode=620)  
tmpfs on /dev/shm type tmpfs (rw)  
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)  
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)  
nfsd on /proc/fs/nfsd type nfsd (rw)  
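Had /data still appeared in that mount list, the next step would have been to find whatever was holding it. A sketch, assuming fuser (from psmisc) is available:

mount | grep /data   # should print nothing once heartbeat has released the mount  
fuser -vm /data      # if it is still mounted, list the processes using it  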

The status of both servers is as follows:-

heartbeat and DRBD stopped on DB1.

heartbeat stopped on DB2.

DRBD running on DB2.

data unmounted on both servers.

At this time, both servers have acquired the status of Secondary/Unknown.

Let’s try connecting both servers:-

drbdadm -- connect all

first on DB2 (the server that was Primary before the failure)

[root@db2 ~]#  drbdadm -- connect all  
  
[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---  
ns:0 nr:0 dw:288894573 dr:23063650 al:4799 bm:13585 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4140161 misses:5006 starving:0 dirty:207 changed:4799  

Doesn’t help. It is still StandAlone.

[root@db1 ~]#  drbdadm -- connect all  
  
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---  
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  
[root@db1 ~]#                                                                         

Doesn’t help. It is still waiting for connection.

There is no other way but to force DB2 to become Primary:

Perform this step ONLY on DB2 (which is [supposed to be] the current Primary). This will start the sync process and may take hours, depending on the I/O speed of your setup:-

drbdadm -- --overwrite-data-of-peer primary all
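As an aside, how long a resync takes is capped by the syncer rate in /etc/drbd.conf. The relevant section looks roughly like this (a sketch; the value is an example, not our live setting):

syncer {
  rate 10M;   # upper bound on resync bandwidth for this resource
}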

[root@db2 ~]# drbdadm -- --overwrite-data-of-peer primary all  
  
This command has made DB2 Primary, all right. But the other side is still not connected.  
  
[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:0 nr:0 dw:288894573 dr:23063650 al:4799 bm:13585 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4140161 misses:5006 starving:0 dirty:207 changed:4799  
[root@db2 ~]#            
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---  
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  

Here is what I see in the log on DB1 after I issue the “connect all” or “overwrite data” commands:

Mar 14 07:36:15 db1 kernel: drbd0: conn( Unconnected -> WFConnection )
Mar 14 07:40:12 db1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 07:40:12 db1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 07:40:12 db1 kernel: drbd0: Split-Brain detected, dropping connection!
Mar 14 07:40:12 db1 kernel: drbd0: self 35B46A95F04C53C0:01028196C77A0C52:3D69CB5573788ED9:0000000000000004
Mar 14 07:40:12 db1 kernel: drbd0: peer 1DB708067D86D985:01028196C77A0C53:3D69CB5573788ED8:0000000000000004
Mar 14 07:40:12 db1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Mar 14 07:40:12 db1 kernel: drbd0: error receiving ReportState, l: 4!
Mar 14 07:40:12 db1 kernel: drbd0: asender terminated
Mar 14 07:40:12 db1 kernel: drbd0: tl_clear()
Mar 14 07:40:12 db1 kernel: drbd0: Connection closed
Mar 14 07:40:12 db1 kernel: drbd0: conn( Disconnecting -> StandAlone )
Mar 14 07:40:12 db1 kernel: drbd0: receiver terminated

============================= Help Start =============================

helpful link: http://lists.linbit.com/pipermail/drbd-user/2007-June/006950.html

Split-brains are mainly caused by a loss of communication between the
two nodes, with the Secondary node becoming Primary while the other node remains in
the Primary state (both nodes think their peer is dead, so they both become Primary).

Check your syslog to see what happened before the Split-Brain occurred (eth0 link down, etc.)

  1. How to correct the situation?

You have to decide which node you want to “sacrifice”, and tell it to
discard its data.

Run the following command on the node on which you want to discard the data :

root@bad-data# drbdadm -- --discard-my-data connect all

Then, simply connect the other machine :

root@good-data# drbdadm connect all

The nodes will start to resynchronize, transferring data from the good-data
node (which will appear as SyncSource in /proc/drbd) to the bad-data node
(SyncTarget).

============================= Help  END =============================

My logs show exactly this problem. The connection/handshake is successful, and both nodes identify each other as valid nodes with unique IDs, but because of the split brain they are not syncing. One of them has to lose data, and I want DB1 to lose data.
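A quick way to pull these markers out of the logs on either node (a sketch; /var/log/messages is the default kernel log location on CentOS):

grep -i 'split-brain' /var/log/messages        # find the split-brain detection lines  
grep 'drbd0:' /var/log/messages | tail -n 40   # recent DRBD state transitions for context  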

So before I move on, heartbeat on both servers is again stopped to minimize any impact.

[root@db1 ~]#  service  heartbeat stop  
  
  
[root@db2 ~]# service heartbeat stop  

I make DB2 primary before moving on:-

[root@db2 ~]# drbdadm -- --overwrite-data-of-peer primary all

Let’s make DB1 lose its data. This is the bad-data node.

NOTE / ALARM / WARNING: Only issue the following command on the bad-data node. In my case, DB1 is the bad-data node.
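Right before running it, a final sanity check that this shell really is on the node to be sacrificed, and that it is Secondary. A minimal sketch:

hostname                # make sure this really is db1  
grep 'st:' /proc/drbd   # expect st:Secondary/Unknown on the bad-data node  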

[root@db1 ~]# drbdadm -- --discard-my-data connect all  
  
After issuing this command on DB1, I see the status as follows on both servers:  
  
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---  
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  

Still the old status.

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---  
ns:0 nr:0 dw:288900033 dr:23068879 al:4799 bm:13675 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208  
act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799  

Still the old status.

Now issue the “connect all” on the good-data node, which is DB2.

[root@db2 ~]# drbdadm connect all

Immediately, I see the following on DB2 :-

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---  
ns:51552 nr:0 dw:288900033 dr:23120431 al:4799 bm:13678 lo:0 pe:0 ua:0 ap:0  
[======>.............] sync'ed: 33.4% (107572/159124)K  
finish: 0:00:08 speed: 12,888 (12,888) K/sec  
resync: used:0/31 hits:12491638 misses:12212 starving:0 dirty:0 changed:12212  
act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799  
[root@db2 ~]#  
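While the resync runs, its progress can be followed from either node. A small sketch of a wait loop (assumes GNU grep):

until grep -q 'cs:Connected' /proc/drbd; do   # loop until DRBD reports Connected again  
  grep -o "sync'ed:.*" /proc/drbd             # print the current sync percentage, if any  
  sleep 10  
done  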

Alhumdulillah.

Let’s check DB1. It should have finished syncing by now.

[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---  
ns:0 nr:159124 dw:159124 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  
You have new mail in /var/spool/mail/root  
[root@db1 ~]#    

Alhumdulillah !!!!!

So now here is the final status of both DB servers:

DB1, which was supposed to be Secondary:-

[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---  
ns:0 nr:159124 dw:159124 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  

And DB2, which was supposed to be Primary :-

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---  
ns:159124 nr:0 dw:288900033 dr:23228003 al:4799 bm:13804 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12499650 misses:12338 starving:0 dirty:0 changed:12338  
act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799  

And the log files from both servers show the following, after I issued the final “connect all” command (following the “discard data” command).

DB1: /var/log/messages :-

Mar 14 08:01:18 db1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 08:01:18 db1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 08:01:18 db1 kernel: drbd0: Split-Brain detected, manually solved. Sync from peer node
Mar 14 08:01:18 db1 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 14 08:01:18 db1 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:18 db1 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 14 08:01:18 db1 kernel: drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Mar 14 08:01:18 db1 kernel: drbd0: Began resync as SyncTarget (will sync 159124 KB [39781 bits set]).
Mar 14 08:01:18 db1 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:34 db1 kernel: drbd0: Resync done (total 16 sec; paused 0 sec; 9944 K/sec)
Mar 14 08:01:34 db1 kernel: drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 14 08:01:34 db1 kernel: drbd0: Writing meta data super block now.

DB2: /var/log/messages :-

Mar 14 08:01:18 db2 kernel: drbd0: conn( StandAlone -> Unconnected )
Mar 14 08:01:18 db2 kernel: drbd0: receiver (re)started
Mar 14 08:01:18 db2 kernel: drbd0: conn( Unconnected -> WFConnection )
Mar 14 08:01:18 db2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 08:01:18 db2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 08:01:18 db2 kernel: drbd0: Split-Brain detected, manually solved. Sync from this node
Mar 14 08:01:18 db2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Mar 14 08:01:18 db2 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:18 db2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Mar 14 08:01:18 db2 kernel: drbd0: Began resync as SyncSource (will sync 159124 KB [39781 bits set]).
Mar 14 08:01:18 db2 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:34 db2 kernel: drbd0: Resync done (total 16 sec; paused 0 sec; 9944 K/sec)
Mar 14 08:01:34 db2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Mar 14 08:01:34 db2 kernel: drbd0: Writing meta data super block now.

Great. Alhumdulillah.

Now is the time to start the heartbeat and DRBD services on both nodes and also configure them to start at system boot. Recall that I had disabled the heartbeat and DRBD services on DB1 (Secondary), so that if it rebooted accidentally, it would not halt at the famous “[yes]” question on the terminal, waiting forever. Had that happened, I could not have done anything about it, since I am here in Pakistan and the servers are located in the US, and the service provider’s technical support is as shi**y as it can get. (Sorry to say.)
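Incidentally, how long the DRBD init script sits at that boot-time prompt is governed by the startup section of /etc/drbd.conf. Something like the following caps the wait instead of blocking forever (a sketch with example values, not our live config):

startup {
  wfc-timeout      120;   # wait at most 120 seconds for the peer at normal boot
  degr-wfc-timeout  60;   # shorter wait when the cluster was already degraded
}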

Remember to start heartbeat service on DB2 (Primary) first.

So here it goes.

FIRST on DB2:-

[root@db2 ~]# service heartbeat start  
Starting High-Availability services:  
[  OK  ]  
[root@db2 ~]# service drbd start  

Starting DRBD resources:  
  
  
[root@db2 ~]# chkconfig --level 35 heartbeat on  
  
[root@db2 ~]# chkconfig --level 35 drbd on  

THEN on DB1:-

[root@db1 ~]# service heartbeat start  
Starting High-Availability services:  
[  OK  ]  
[root@db1 ~]# service drbd  start  

Starting DRBD resources:  
  
[root@db1 ~]# chkconfig --level 35 heartbeat on  
  
[root@db1 ~]# chkconfig --level 35 drbd on  

Just for peace of mind, here is the status of DRBD on both servers one more time:

[root@db2 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---  
ns:161868 nr:0 dw:288902777 dr:23233232 al:4799 bm:13804 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:12499650 misses:12338 starving:0 dirty:0 changed:12338  
act_log: used:0/127 hits:4142212 misses:5006 starving:0 dirty:207 changed:4799  
[root@db1 ~]# cat /proc/drbd  
version: 8.0.6 (api:86/proto:86)  
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51  
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---  
ns:0 nr:161868 dw:161868 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0  
resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129  
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0  

Alhumdulillah.

Time to start both webmail servers too. (They have been sitting idle far too long, haven’t they?)