drbd split-brain problems.

As part of a high-availability Linux firewall that I have set up, I use DRBD to share a filesystem so that we don't lose too much information in the event of a failure. However, the primary server has been pushed into production while the secondary is still being moved around and configured.
As a result, the secondary lost its network connection for longer than the timeout period, which led the primary server to decide that it was the only node (StandAlone), and drbd on the secondary wouldn't come back up: a form of split-brain, from what I've read.
Before continuing, make sure that you have a look at the documentation. I won't be held responsible if you completely fry your drbd or (even worse) overwrite your primary with the wrong data.
Apparently, this is a known problem with 0.7.x and has been fixed in 0.8.x (feel free to correct me). In my case I am using 0.7.21. On the primary server, I get:

primary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajra, 2007-07-09 16:39:51
0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:534782720 nr:6220 dw:534944920 dr:291509157 al:33812 bm:47 lo:0 pe:0 ua:0 ap:0

and on the secondary server, I get:

secondary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajraSecondary, 2007-07-12 09:31:09
0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
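
The fields worth watching in that status line are cs (connection state), st (roles, shown as local/peer) and, as far as I understand it, ld (consistency of the local data). As a rough sketch, they can be pulled out with a small shell helper. The name drbd_field is made up for this example, and the parsing assumes the 0.7-style /proc/drbd layout shown above:

```shell
# Hypothetical helper: print one field (cs, st or ld) for a given device
# number, reading /proc/drbd-style text from stdin.
# Assumes the 0.7.x layout, e.g.
#   " 0: cs:StandAlone st:Primary/Unknown ld:Consistent"
drbd_field() {
    # $1 = device number (e.g. 0), $2 = field name (cs, st or ld)
    awk -v dev="$1:" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                n = index($i, ":")
                if (substr($i, 1, n - 1) == key) print substr($i, n + 1)
            }
        }'
}

# Usage on a live system (not run here):
#   drbd_field 0 cs < /proc/drbd
```

Fed the two outputs above, this would report StandAlone for cs on the primary and WFConnection on the secondary, which is exactly the mismatch that needs fixing.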

I tried to reconfigure this setup by running

primary:~# drbdadm primary all

on the primary server and

secondary:~# drbdadm secondary all

on the secondary server in order to tell the system which one was the primary and which was the secondary, then running

primary:~# drbdadm connect all

in order to specify that they should reconnect.
When I did this, I got an error that read like this:

Jan 25 11:16:18 secondary kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Jan 25 11:16:18 secondary kernel: drbd0: sock was shut down by peer
Jan 25 11:16:18 secondary kernel: drbd0: drbd0_receiver [4175]: cstate BrokenPipe --> BrokenPipe
Jan 25 11:16:18 secondary kernel: drbd0: short read expecting header on sock: r=0
Jan 25 11:16:18 secondary kernel: drbd0: worker terminated
Jan 25 11:16:18 secondary kernel: drbd0: drbd0_receiver [4175]: cstate BrokenPipe --> Unconnected
Jan 25 11:16:18 secondary kernel: drbd0: Connection lost.
Jan 25 11:16:18 secondary kernel: drbd0: drbd0_receiver [4175]: cstate Unconnected --> WFConnection

In the end, I decided that the only way I was going to get this back to a reasonable state was to flush the data on the secondary server and resync everything from the primary server. The correct command to do this is invalidate or invalidate-remote (depending on which machine you want to invalidate). Make sure that you run this on the correct server!
When I tried to run this command on the secondary server, I got another cryptic message:

secondary:~# drbdadm invalidate all
can not open /dev/drbd0: No such file or directory
Command 'drbdsetup /dev/drbd0 invalidate' terminated with exit code 20
drbdsetup exited with code 20
secondary:~#

After a bit of hunting, I found the solution on an archived mailing list ( http://archives.free.net.ph/message/20060619.131041.fd07cb48.en.html ). I was able to resync the two filesystems with the following steps (there is another, slightly more destructive method in the thread). First, stop drbd on the _secondary_ server:

secondary:~# /etc/init.d/drbd stop

Now, tell drbd on the primary server to reconnect:

primary:~# drbdadm connect all
primary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajra, 2007-07-09 16:39:51
0: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:537240380 dr:291534637 al:34125 bm:360 lo:0 pe:0 ua:0 ap:0

This puts the primary server back into a state where it is waiting for a connection. Then, back on the secondary server:

secondary:~# rmmod drbd
ERROR: Module drbd does not exist in /proc/modules
secondary:~#  modprobe drbd
secondary:~# drbdadm attach r0
secondary:~# drbdadm invalidate r0
secondary:~# drbdadm adjust r0
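
If you want to rehearse the secondary-side steps before touching anything, here is a minimal sketch that wraps them in a dry run. The names resync_secondary and DRY_RUN are my own inventions, not part of drbd; the commands themselves are the ones shown above. Remember that invalidate throws away the data on the node it runs on, so this belongs on the secondary only:

```shell
# Sketch only: replay the secondary-side recovery steps for one resource.
# resync_secondary and DRY_RUN are illustrative names, not part of drbd.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
resync_secondary() {
    res="${1:-r0}"    # resource name as defined in /etc/drbd.conf
    run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    # Stop at the first step that fails.
    run modprobe drbd &&
    run drbdadm attach "$res" &&
    run drbdadm invalidate "$res" &&
    run drbdadm adjust "$res"
}

# Print the four commands for resource r0 without running them.
resync_secondary r0
```

Setting DRY_RUN=0 would run the commands for real, stopping at the first one that fails. Pass your own resource name if it is not r0.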

After a couple of minutes, you should be able to see the following output if you cat /proc/drbd on both the servers:

primary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajra, 2007-07-09 16:39:51
0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:24768676 nr:0 dw:537576684 dr:315968437 al:34170 bm:1852 lo:0 pe:0 ua:0 ap:0
        [=======>............] sync'ed: 37.5% (39870/63773)M
        finish: 1:04:55 speed: 10,476 (9,448) K/sec
primary:~#
secondary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajraSecondary, 2007-07-12 09:31:09
0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:24778832 dw:24778832 dr:0 al:0 bm:5479 lo:6 pe:206 ua:6 ap:0
        [=======>............] sync'ed: 37.5% (39860/63773)M
        finish: 1:04:59 speed: 10,412 (9,448) K/sec
1: cs:Unconfigured
secondary:~#
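
Rather than eyeballing the progress bar, the percentage can be scraped out of that output. This is a sketch that assumes the "sync'ed: NN.N%" wording shown above; sync_percent is a hypothetical name:

```shell
# Hypothetical helper: print the resync percentage found in
# /proc/drbd-style text on stdin, or nothing if no resync is running.
sync_percent() {
    sed -n "s/.*sync'ed: *\([0-9.]*\)%.*/\1/p"
}

# Usage on a live system (not run here):
#   sync_percent < /proc/drbd
```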

Once that's done, you should get the following on each of the servers:

primary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajra, 2007-07-09 16:39:51
0: cs:Connected st:Primary/Secondary ld:Consistent
secondary:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@vajraSecondary, 2007-07-12 09:31:09
0: cs:Connected st:Secondary/Primary ld:Consistent
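
A quick way to script that final check, again only as a sketch: drbd_healthy is a made-up name, it looks at device 0 only, and it assumes the status line format shown above.

```shell
# Sketch: succeed only when device 0 reports cs:Connected and ld:Consistent.
# drbd_healthy is an illustrative name; $1 is the status file to read
# (defaults to /proc/drbd).
drbd_healthy() {
    grep -Eq '^ *0: cs:Connected .*ld:Consistent' "${1:-/proc/drbd}"
}

# Usage on a live system (not run here):
#   drbd_healthy && echo "drbd0 is connected and consistent"
```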

Now is _not_ the time to work out which is the SyncTarget and which is the SyncSource! Hopefully this fixes the problem for someone else; it worked for me, but individual mileage may vary.


8 comments to drbd split-brain problems.

  • Wessel

    Thank you! Those 4 simple commands helped me recover from a split-brain in a test setup.

    Wessel did not rate this post.
  • I’m glad this post saved you some time ;)

    andrewb did not rate this post.
  • Fabulous — I didn’t realize that you can run all this without disrupting the primary at all. I managed to do this gracefully on a production setup without any outage.

    Thank you muchly!

    ryan did not rate this post.
  • alvarock!

    thank you very much. this post saved my ass from being kicked.

    alvarock! did not rate this post.
  • harryztybetu

    worked for me too, great help - much appreciated.

    harryztybetu did not rate this post.
  • nanetto

    Hi,
    I’m sorry, but I don’t understand what ‘r0’ is in your config.
    If I run these commands:
    secondary:~# drbdadm attach r0
    secondary:~# drbdadm invalidate r0
    secondary:~# drbdadm adjust r0
    my server says: ‘r0 is not defined in your config’.

    Can someone help me?

    thx a lot.

    bye

    nanetto did not rate this post.
  • Hi nanetto,

    The r0 refers to the resource defined for drbd. Check your /etc/drbd.conf (or wherever is relevant for your distro). You’ll find a block that starts with:
    resource rX {
    startup {…}
    disk {…}
    net {…}
    syncer {…}
    on Primary {…}
    on Secondary {…}
    }

    Your resource may not be defined as r0, but rather as rX (as in /etc/drbd.conf).

    I hope this helps. If you have further problems, don’t hesitate to repost.

    andrewb did not rate this post.
  • Anastasios Zafeiropoulos

    That is definitely a nice approach to resolving a split-brain problem.
    Thanks a lot.

    Anastasios Zafeiropoulos did not rate this post.
