Getting Started with Linux-HA (heartbeat)

Intro

Let me preface this document by saying most of this is _not_ original work.  My purpose for writing this document is just trying to contribute in some way to possibly help those who REALLY get things done.  The "work" I am contributing is mostly compiling bits and pieces from other HA documents (such as Volker Wiegand's Hardware Installation Guide) into a document that can help novices (like myself!) get started on HA without pestering Alan (like I did!).

Hope this helps someone :-).  [Editor's Note: Alan says it's already helped him a lot by cutting down his email load!]
 

Getting Started

The first thing you'll need is two computers.  You need not have identical hardware in both machines (or amount of memory, etc.), but if you did, it would make your life that much easier when a component fails.

Now you have to decide on some of your implementation.  Your "cluster" is established via a "heartbeat" between the two computers (nodes) generated by the software package of the same name.  However, this heartbeat needs one or more media paths between the nodes and one path must be serial port to serial port over a null modem cable (see the HA Hardware Installation guide for instructions on how to build a null modem cable).

At this point, you're actually ready to begin hardware-wise.  Of course, since you're looking into HA, you'll most likely want to avoid a single point of failure.  In this case, that would be your null modem cable or serial port.  So, you need to decide whether to add a second serial/null modem connection or a second network interface card (NIC) to each node, connected via a crossover cable.  See Appendix A for instructions on how to build a crossover cable.  My setup goes the 2 NIC route because I only had one null modem cable, had plenty of NICs on hand, and thought it was good to have two medium types.

Once your hardware is in order, you must install your OS and configure your networking (I used Red Hat 6.0).  Assuming you have 2 NICs, one should be configured for your "normal" network and the other as a private network between your clustered nodes.  As an example, we will assume that our cluster has the following addresses:

Node 1 (linuxha1):   192.168.85.1  (normal 192x net)
                     10.0.0.1 (private 10x net)
Node 2 (linuxha2):   192.168.85.2  (192x)
                     10.0.0.2  (10x)

Red Hat makes this easy during installation (please don't think I'm carrying their banner, it's just what I use).  However, if you use another distribution or are having any problems, refer to the Ethernet HOWTO.  To check your configuration, type:

         ifconfig

This will show your network interfaces and their configuration.  For a more compact view of which networks are reached through which interface, you can look at the routing table with "netstat -nr".

If it looks good, make sure you can ping between both nodes on all interfaces.
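
The ping check can be scripted so it is easy to repeat after every cabling change.  What follows is only a sketch: the function name check_reachable is made up for illustration, the addresses are the example ones from this document, and the "-c 1" option assumes a Linux-style ping.

```shell
#!/bin/sh
# check_reachable ADDR...: ping each address once and report the result.
# Returns non-zero if any address failed to answer.
check_reachable() {
    failed=0
    for addr in "$@"; do
        if ping -c 1 "$addr" >/dev/null 2>&1; then
            echo "ok   $addr"
        else
            echo "FAIL $addr"
            failed=1
        fi
    done
    return "$failed"
}

# Example, run on linuxha1 to reach linuxha2 on both networks:
#   check_reachable 192.168.85.2 10.0.0.2
```

Run it from each node against the other node's two addresses; any FAIL line points at the interface to look into.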

Next, you need to test your serial connection.  On one node, which will be the receiver, type:
           cat </dev/ttyS0

On the other node, type:
           echo hello >/dev/ttyS0

If it works, change their roles and try again.  If it doesn't, it may be as simple as having the wrong device file.  Volker's HA Hardware Guide and the Serial HOWTO are two good resources for troubleshooting your serial connection.

Installing Heartbeat

You can now install the heartbeat package.  If you're reading this, you already have it, but in any case it's available at:

       http://www.henge.com/~alanr/ha/download

Untar it into your favorite source directory.  If you prefer RPM, a pre-built RPM is available at the web site, or you can build your own by typing "make rpm" and then installing the result with rpm.  Otherwise, you can simply type "make install".

NOTE:  If you want to run PPP over a serial connection and do NOT use RPM, you will need to manually install an additional script, which can be found in the README.  [TODO: describe exactly how and where to install this script.]

Configuring Heartbeat

There are two files you will need to configure before starting up heartbeat.  The first is ha.cf.  This will be placed in the /etc/ha.d directory that is created after installation.  It tells heartbeat what types of media paths to use and how to configure them.  The ha.cf in the source directory contains all the various options you can use.  I'll go through it line by line...

serial /dev/ttyS0
       Mandatory.  Replace /dev/ttyS0 with the appropriate dev file for your required serial heartbeat.

watchdog /dev/watchdog
       Optional.  The watchdog function provides a way to have a system that is still minimally functioning, but not providing a heartbeat, reboot itself after a minute of being sick.  This could help to avoid a scenario where the machine recovers its heartbeat after being pronounced dead.  If that happened and a disk mount failed over, you could have two nodes mounting a disk simultaneously.  If you wish to use this feature, then in addition to this line, you will need to load the "softdog" kernel module and create the actual device file.  To do this, first type "insmod softdog" to load the module.  Then type "grep misc /proc/devices" and note the number it reports (should be 10).  Next, type "cat /proc/misc | grep watchdog" and note that number (should be 130).  Now you can create the device file with that information by typing "mknod /dev/watchdog c 10 130".

udp eth1
       Specifies a udp heartbeat over the 10x eth1 interface (replace with eth0, if that's what you use).

keepalive 2
       Sets the time between heartbeats to 2 seconds.

deadtime 10
       Node is pronounced dead after 10 seconds.

hopfudge 1
       Optional.  For ring topologies, number of hops allowed in addition to the number of nodes in the cluster.

baud 19200
       Speed at which to run the serial line (bps).

udpport 1001
       Use port number 1001 for udp.

ppp-udp /dev/ttyS1 10.0.0.3
       Specifies that you want a PPP/UDP heartbeat across /dev/ttyS1 and that its address should be 10.0.0.3.  Note that this address is required to be a local IP address.  Note also that a given tty port cannot be used for both a ppp-udp heartbeat and a serial heartbeat.

node linuxha1
       Mandatory.  Hostname of machine in cluster.

node linuxha2
       Mandatory.  Hostname of machine in cluster.

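
Putting the directives above together, a complete ha.cf for the two-node example might look like the output of the sketch below.  The function name example_ha_cf is made up for illustration, the directives are listed in the same order as in the walk-through, and the optional ppp-udp line is left out.  Drop the watchdog line as well if you don't plan to load softdog.

```shell
#!/bin/sh
# example_ha_cf: print a ha.cf for the two-node example in this document.
# Only directives described in the walk-through are used.
example_ha_cf() {
    cat <<'EOF'
serial /dev/ttyS0
watchdog /dev/watchdog
udp eth1
keepalive 2
deadtime 10
hopfudge 1
baud 19200
udpport 1001
node linuxha1
node linuxha2
EOF
}

# On a real node (as root) this could be installed with:
#   example_ha_cf > /etc/ha.d/ha.cf
```

Printing to standard output first lets you look the file over, or diff it against the other node's copy, before anything under /etc/ha.d is touched.
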
Once you've got your ha.cf set up, you need to configure haresources.  This file specifies the services for the cluster and who the default owner is.
For our example, we'll assume the high availability services are Apache and Samba (the IP for the cluster is mandatory!).  The haresources will need one line:
                  linuxha1 192.168.85.3 httpd smb
So, this line dictates that on startup, have linuxha1 serve the IP 192.168.85.3 and start apache and samba as well.
On shutdown, heartbeat will first stop smb, then apache, then give up the IP.  This assumes that the command "uname -n" spits out "linuxha1" - yours may well produce "linuxha1.domain.com" and if it does, use that instead!
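
Since a mismatch between the name in haresources and the output of "uname -n" is easy to miss, a quick check like the following may help.  It is only a sketch: the function names are made up, and it assumes that lines starting with "#" are comments and that the owner is always the first field on a line.

```shell
#!/bin/sh
# haresources_owners FILE: print the first field (the default owner) of
# every non-blank, non-comment line in a haresources file.
haresources_owners() {
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$1"
}

# owner_is_local FILE: succeed if this machine's "uname -n" is one of the owners.
owner_is_local() {
    haresources_owners "$1" | grep -qxF "$(uname -n)"
}

# Example:
#   owner_is_local /etc/ha.d/haresources || echo "no line names $(uname -n)"
```

On the node that is supposed to own the services, the check should succeed.  On the other node it is expected to fail, which is fine.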

Note:  httpd and smb are the names of the startup scripts for Apache and Samba, respectively.  Heartbeat will look for startup scripts of the same name in the following paths:
    /etc/ha.d/resource.d
    /etc/rc.d/init.d

These scripts must start services via "scriptname start" and stop them via "scriptname stop".
So you can use any services as long as they conform to the above standard.

Should you need to pass arguments to a custom script, the format would be:

                scriptname::argument
So, if we added a service "maid" which needed the argument "vacuum", our haresources line would modify to the following:
                linuxha1 192.168.85.3 httpd smb maid::vacuum
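
To make the convention concrete, here is a sketch of what the hypothetical "maid" script could look like.  The calling order (argument first, then start or stop) is my inference from the IPaddr lines in the log excerpt later in this document, which read "IPaddr 192.168.85.3 start".  The function name maid_main and the messages it prints are invented for illustration.

```shell
#!/bin/sh
# Sketch of a resource script for the hypothetical "maid" service.
# Assumed calling convention (inferred from the IPaddr log lines):
#   maid <argument> start|stop
maid_main() {
    task=$1
    action=$2
    case "$action" in
        start) echo "maid: starting $task" ;;   # the real start command goes here
        stop)  echo "maid: stopping $task" ;;   # the real stop command goes here
        *)     echo "usage: maid <task> {start|stop}" >&2
               return 1 ;;
    esac
}

# Saved as an executable file in /etc/ha.d/resource.d/maid, the script
# would end with this line:
#   maid_main "$@"
```

You can try such a script by hand ("maid vacuum start", then "maid vacuum stop") before listing it in haresources.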


This brings us to some added flexibility with the service IP address.  We are actually using a shorthand notation above.  The actual line could read (we've canned the maid):

                linuxha1 IPaddr::192.168.85.3 httpd smb
Where IPaddr is the name of our service script, taking the argument 192.168.85.3.  Sure enough, if you look in the directory /etc/ha.d/resource.d, you will find a script called IPaddr.  This script will also allow you to manipulate the netmask and broadcast address of this IP service.  To specify a subnet with 32 addresses, you could define the service as (leaving off the IPaddr because we can!):
                linuxha1 192.168.85.3/27 httpd smb
This sets the IP service address to 192.168.85.3, the netmask to 255.255.255.224 and the broadcast address would default to 192.168.85.31 (which is the highest address on the subnet).  The last parameter you can set is the broadcast address.  To override the default  and set it to 192.168.85.16, your entry would read:
                linuxha1 192.168.85.3/27/192.168.85.16 httpd smb
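
If you want to double-check the netmask and default broadcast address quoted for the /27 example, the arithmetic can be sketched in shell as below.  The function names are made up, and this only reproduces the calculation described in the text (the broadcast defaults to the highest address on the subnet).  It is not how the IPaddr script itself is written.

```shell
#!/bin/sh
# prefix_to_netmask BITS: turn a prefix length (0-32) into a dotted netmask.
prefix_to_netmask() {
    bits=$1
    mask=""
    for i in 1 2 3 4; do
        if [ "$bits" -ge 8 ]; then
            octet=255
            bits=$((bits - 8))
        else
            octet=$((256 - (1 << (8 - bits))))
            bits=0
        fi
        mask="${mask}${mask:+.}${octet}"
    done
    echo "$mask"
}

# default_broadcast IP BITS: highest address on the subnet containing IP.
default_broadcast() {
    ip=$1
    mask=$(prefix_to_netmask "$2")
    oldifs=$IFS
    IFS=.
    set -- $ip $mask      # $1-$4 are the IP octets, $5-$8 the mask octets
    IFS=$oldifs
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
}

# Example:
#   prefix_to_netmask 27               prints 255.255.255.224
#   default_broadcast 192.168.85.3 27  prints 192.168.85.31
```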
You may be wondering whether any of the above is necessary for you.  It depends.  If you've properly established a net route (independent of heartbeat) for the service's IP address, with the correct netmask and broadcast address, then no, it's not necessary for you.  However, this case won't fit everybody and that's why the option's there!  In addition, you may have more than one possible interface that could be used for the service IP.  Read on to see how heartbeat treats this...

Once you straighten out your haresources file, copy ha.cf and haresources to /etc/ha.d and you're ready to start!
 

Selecting an Interface

One important aspect of configuring the haresources file for a machine which has multiple ethernet interfaces is to know how heartbeat selects which interface will wind up supporting the service addresses that are configured in haresources.  After all, no interface was specified in the haresources file.

Heartbeat decides which interface will be used by looking at the routing table.  It tries to select the lowest cost route to the IP address to be taken over.  In the case of a tie, it chooses the first route found.  For most configurations this means the default route will be least preferred.

If you don't specify a netmask for the IP address in the haresources file, the netmask associated with the selected route will be used.
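
To get a preview of which interface the routing table favors for the service address, you can ask the kernel directly.  This is only a sketch and rests on a few assumptions: it needs the iproute2 "ip" command, the helper name route_dev is made up, and it shows the kernel's choice rather than running heartbeat's own selection code.  Run it on a node that does not currently hold the service address, since a locally configured address is routed via the loopback interface.

```shell
#!/bin/sh
# route_dev: read "ip route get" output on stdin and print the interface
# name that follows the word "dev".
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Example on a live node:
#   ip route get 192.168.85.3 | route_dev
```

If the interface printed is not the one you expected, check your routes before blaming heartbeat.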

Starting and testing heartbeat

On Red Hat, or other distributions which use System V style init files, simply type "/etc/rc.d/init.d/heartbeat start" on both nodes.  I would recommend starting on the system master (linuxha1 in our example) first.

If you want heartbeat to run on startup, what you need to do depends on your distribution.  For Red Hat (again, sorry) and Mandrake, you will need to place links to the startup script in the appropriate init level directories.  I have heartbeat start last and only care about the 0 (halt), 6 (reboot), 3 (text-only) and 5 (X) run levels.
So, I needed to type in the following (as root, of course):

    cd /etc/rc.d/rc0.d ; ln -s ../init.d/heartbeat K01heartbeat
    cd /etc/rc.d/rc3.d ; ln -s ../init.d/heartbeat S99heartbeat
    cd /etc/rc.d/rc5.d ; ln -s ../init.d/heartbeat S99heartbeat
    cd /etc/rc.d/rc6.d ; ln -s ../init.d/heartbeat K01heartbeat

The last time I ran Slackware, there was no /etc/rc.d/init.d directory (this may have changed by now), and to do the same thing I would have placed the following in /etc/rc.d/rc.local:
    /etc/ha.d/heartbeat start
Note:  This assumes you copy the file ha.rc to /etc/ha.d/heartbeat.  If you can't find /etc/rc.d/init.d with your distribution and you're unsure of how processes start, you can use the rc.local method.  But you're on your own for shutdown, I just don't remember...

Note:  If you use the watchdog function, you'll need to load its module at bootup as well.  For Red Hat, I put the following command at the bottom of the /etc/rc.d/rc.sysinit file:
    /sbin/insmod softdog
For the rc.local method, just put the same line right above where you start heartbeat.
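
If you'd rather not hard-code the watchdog's minor number when creating the device file, it can be read from /proc/misc once the module is loaded.  A sketch, assuming /proc/misc lists one "minor name" pair per line (the helper name watchdog_minor is made up):

```shell
#!/bin/sh
# watchdog_minor [FILE]: print the minor number registered for "watchdog".
# FILE defaults to /proc/misc.
watchdog_minor() {
    awk '$2 == "watchdog" { print $1; exit }' "${1:-/proc/misc}"
}

# After "insmod softdog", the device file could then be created (as root) with:
#   [ -c /dev/watchdog ] || mknod /dev/watchdog c 10 "$(watchdog_minor)"
```

The "[ -c /dev/watchdog ]" test skips mknod when the device file already exists, so the line is safe to leave in a boot script.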
 

Once you've started heartbeat, take a peek at your log file (default is /var/log/ha-log) before testing it.  If all is peachy, the service owner's log (linuxha1 in our example) should look something like this:
1999/09/16_09:13:25 INFO: heartbeat starting.
1999/09/16_09:13:25 Starting serial heartbeat on tty /dev/ttyS0
1999/09/16_09:13:25 UDP heartbeat started on port 1001 interface eth1
1999/09/16_09:13:25 Using watchdog device: /dev/watchdog
1999/09/16_09:13:35 Requesting resource group 192.168.85.3
1999/09/16_09:13:35 INFO: heartbeat initialization complete.
1999/09/16_09:13:35 INFO: Running /etc/ha.d/resource.d/IPaddr 192.168.85.3 status
1999/09/16_09:14:06 Acquiring resource group: ha1.iso-ne.com 192.168.85.3 httpd smb
1999/09/16_09:14:06 INFO: Running /etc/rc.d/init.d/smb  start
1999/09/16_09:14:07 INFO: Running /etc/rc.d/init.d/httpd  start
1999/09/16_09:14:08 INFO: Running /etc/ha.d/resource.d/IPaddr 192.168.85.3 start
1999/09/16_09:14:08 INFO: ifconfig eth0:0 192.168.85.3 netmask 255.255.255.0  broadcast 192.168.85.255
1999/09/16_09:14:08 Sending Gratuitous Arp for 192.168.85.3 on eth0:0 [eth0]
 


OK, now try to ping your cluster's IP (192.168.85.3 in the example). If this works, telnet to it and verify you're on linuxha1.
Next, make sure your services are tied to the .3 address.  Bring up Netscape and type in 192.168.85.3 for the URL.  For Samba, try to map the drive "\\192.168.85.3\test", assuming you set up a share called "test".  See the Samba docs to get that going.  As an aside, however, you'll want to use the "netbios name" parameter to have your Samba share listed under the cluster name and not the hostname of your cluster member!

If this all works, you've got availability.  Now let's see if we have High Availability :-)

Take down linuxha1.  Kill power, kill heartbeat, whatever you have the stomach for, but don't just yank both the serial and eth1 heartbeat cables.  If you do that, you'll have services running on both nodes and, when you re-connect the heartbeat, a bit of chaos....
Now ping the cluster IP. Approximately 5-10 seconds later it should start responding again. Telnet again and verify you're on linuxha2.  If it happens but takes more like 30 seconds, something is wrong.

If you get this far, it's probably working, but you should check each of your heartbeat paths, too.
First, check your serial heartbeat.  Unplug the crossover cable from your eth1 NIC that you're using for your udp heartbeat.  Wait about 10 seconds.
Now, look at /var/log/ha-log on linuxha2 and make sure there's no line like this:
    1999/08/16_12:40:58 node ha1.iso-ne.com: is dead
If you get that, your serial heartbeat isn't working and your second node is taking over.  Test your null modem cable.  Run the above serial tests again.

If your log is clean, great.  Re-connect the crossover cable.  Once that's done, disconnect the serial cable, wait 10 seconds and check the ha2 log again.
If it's clean, congrats!  If not, you can check /var/log/ha-log and /var/log/ha-debug for more clues.
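
Rather than scanning the log by eye after each cable test, you could pull out the "is dead" lines with a small helper.  This is a sketch that assumes the line format shown in the example above ("<timestamp> node <name>: is dead").  The function name dead_nodes is made up.

```shell
#!/bin/sh
# dead_nodes [LOGFILE]: print the name of each node the log reports as dead.
# LOGFILE defaults to /var/log/ha-log.
dead_nodes() {
    awk '$2 == "node" && /is dead[ \t]*$/ { sub(/:$/, "", $3); print $3 }' \
        "${1:-/var/log/ha-log}"
}

# Example, on linuxha2 after unplugging one heartbeat cable:
#   dead_nodes /var/log/ha-log
```

No output means no node was declared dead, which is what you want to see while one heartbeat path is still connected.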
 

Appendix A - Crossover Cable Construction

Your cable diagram should be as follows:

    Connector A     Connector B
    Pin #           Pin #
    -----------     -----------
    1               3
    2               6
    3               1
    6               2
    4               7
    5               8
    7               4
    8               5

Rev 0.0.6
(c) 1999 Rudy Pawul
rpawul@iso-ne.com