TruCluster Server High Availability Case Study Overview

The kitche cluster is a production-level, no-single-point-of-failure (NSPOF), high availability cluster. Although we use a number of clusters as production systems to test prerelease software, we wanted to design and build an NSPOF cluster that we could manage extremely conservatively to determine what level of uptime was possible. In addition, when things went wrong, as they always do in real life, we wanted to learn what caused the failure and how to ameliorate or avoid it next time.

We learned that we could significantly improve uptime by:

  • Planning before doing. The corollary is logging all administrative operations (a logging sketch follows this list).
  • Monitoring the hardware/software environment carefully.
  • Scheduling downtime before it becomes unscheduled downtime.
  • Testing changes on a test bed before implementing them.
  • Keeping the number of people with access to the root account to a minimum.
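
As an illustration of the first point, here is a minimal sketch of logging administrative operations: a wrapper that records the operator, a timestamp, and the command in a log file before running the command. The script name and the log location are hypothetical; they are not part of the cluster's actual tooling.

    #!/usr/bin/env python
    # adminlog.py -- hypothetical wrapper: log an administrative command, then run it.
    import getpass
    import subprocess
    import sys
    import time

    LOGFILE = "/var/adm/admin-actions.log"   # assumed log location

    def main():
        if len(sys.argv) < 2:
            sys.exit("usage: adminlog.py command [args...]")
        entry = "%s %s %s\n" % (
            time.strftime("%Y-%m-%d %H:%M:%S"),
            getpass.getuser(),
            " ".join(sys.argv[1:]),
        )
        with open(LOGFILE, "a") as log:          # record the action before doing it
            log.write(entry)
        sys.exit(subprocess.call(sys.argv[1:]))  # then run the command unchanged

    if __name__ == "__main__":
        main()

Running routine operations through such a wrapper leaves a replayable record of what was changed, by whom, and when, which is exactly the information needed when a failure has to be traced back to an administrative action.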

This page provides the following information:

  • Design Goal
  • Assumptions and Restrictions
  • Hardware Overview

Design Goal

The priorities for the cluster were availability and reliability of services. Performance was important, but was subordinate to availability and reliability.

The stated goal was: Create a two-node cluster with complete hardware redundancy and software failover of key operating system and application components. The services that the cluster provides must be available to users and client systems a minimum of 99.999% of the time.
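
To make the 99.999% ("five nines") target concrete, the short sketch below converts an availability percentage into an annual downtime budget; the numbers follow from the arithmetic alone and are not figures taken from the case study.

    # Convert an availability target into an annual downtime budget.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for target in (0.999, 0.9999, 0.99999):
        downtime = (1 - target) * MINUTES_PER_YEAR
        print("%.3f%% availability allows %.1f minutes of downtime per year"
              % (target * 100, downtime))

    # 99.999% availability allows roughly 5.3 minutes of downtime per year,
    # or about 26 seconds per month.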

Although the stated availability goal was extremely ambitious, the cluster achieved 100% uptime during the first six months of 2000. This means that the four critical ASE services that the cluster provided were always available during that period. The cluster was upgraded to TruCluster Server Version 5.0A in October 2000. When availability metrics are agreed upon and gathered, they will be posted on this site.

Assumptions and Restrictions

The cluster design incorporated the following assumptions and restrictions:

  • We would control only the network wires from the systems to the network box on the busbar and to the FDDI hub closest to the systems. After that, site services would control the network, including all network connections to the office areas, to other systems in the laboratory, and to remote sites.

  • We would control only the electrical power from the breaker box on the busbar through the UPSs to the systems. Site services would control all electrical power beyond the circuit breaker box.

  • TruCluster Server software would provide high availability for services.

  • We would install officially released versions of software products. We would install beta versions only when we needed to and no other option existed; for example, we would install a beta version of a patch only when it fixed a severe problem.

Hardware Overview

The first step toward ensuring high availability is redundant hardware that eliminates any single point of failure. In addition, each system in the cluster must be capable of carrying a full load without a significant performance degradation if the other system fails.
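
One way to express that sizing rule is as a simple headroom check: the combined load of both nodes must fit on a single node with margin to spare. The sketch below uses hypothetical utilization figures purely for illustration; they are not measurements from the kitche cluster.

    # Hypothetical failover headroom check for a two-node cluster.
    # Each value is a node's average utilization as a fraction of one node's capacity.
    node_load = {"kitche1": 0.35, "kitche2": 0.40}   # assumed figures

    surviving_node_limit = 0.90   # leave some margin even after a failover
    combined = sum(node_load.values())

    if combined <= surviving_node_limit:
        print("OK: one node can absorb the full load (%.0f%% of capacity)" % (combined * 100))
    else:
        print("WARNING: a single node would be overloaded after failover"
              " (%.0f%% of capacity)" % (combined * 100))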

The cluster consists of two identical AlphaServer® 4000 systems, each with two 5/533 CPUs, 512 MB memory, and the 16-slot PCI option. The systems are named kitche1 and kitche2. (See the hardware table for a full list of the cluster hardware.)

One 4 GB disk contains the system disk (root, swap, and /usr); in the Version 1.6 cluster, these partitions were LSM-encapsulated and mirrored to another 4 GB disk in the same shelf. When we upgraded to TruCluster Server Version 5.0A, we used hardware RAID to mirror the shared clusterwide file systems. Additional disks contain /usr/local, /tmp, and /var/adm. There is one spare bootable system disk that we can use to recover the system disk in case of a catastrophe.

The first BA370 contains five RAID5 sets (8 GB) for the staff and IMAP areas, one 4 GB disk that serves as an automatic replacement if a disk in a RAID5 set fails, and the remaining disks configured as JBODs. The second BA370 duplicates the storage arrangement of the first, and LSM mirrors the disks between the two BA370s.
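
The no-single-point-of-failure property of that layout depends on every mirrored volume having one copy in each cabinet, so the loss of an entire BA370 never takes out both halves of a mirror. The following sketch illustrates such a consistency check; the volume names and plex placements are invented for illustration and are not LSM output.

    # Hypothetical check: every mirrored volume should have a plex in each BA370 cabinet.
    volumes = {
        "staff_vol": ["BA370-1", "BA370-2"],   # assumed plex placement
        "imap_vol":  ["BA370-1", "BA370-2"],
        "scratch":   ["BA370-1", "BA370-1"],   # deliberately wrong, to show the warning
    }

    required_cabinets = {"BA370-1", "BA370-2"}

    for name, plexes in sorted(volumes.items()):
        if set(plexes) >= required_cabinets:
            print("%-10s OK: mirrored across both cabinets" % name)
        else:
            print("%-10s WARNING: all plexes in %s -- that cabinet is a single point of failure"
                  % (name, ", ".join(sorted(set(plexes)))))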

Both the systems and the ESA10000 cabinets are fed by dual power sources. Each system has three power supplies, and each storage shelf has dual power supplies.

All components used in the cluster are attached to UPSs. The UPSs draw power from the same busbar, but are attached to separate breakers on different phases. If the power source fails, the UPSs can supply the systems with battery power for approximately 20 minutes before cleanly shutting them down.
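
A simplified sketch of that power-failure policy follows: poll the UPS state and begin a clean shutdown well before the battery is exhausted. The status file, the polling interval, and the 15-minute threshold are assumptions chosen to leave margin within the roughly 20 minutes of battery capacity; the actual UPSs are managed by their own software.

    import os
    import time

    STATUS_FILE = "/var/run/ups.status"   # hypothetical file updated by a UPS monitor
    ON_BATTERY_LIMIT = 15 * 60            # shut down after 15 minutes on battery
    POLL_INTERVAL = 30                    # seconds between checks

    def on_battery():
        # Assumed convention: the file contains "online" or "on-battery".
        try:
            with open(STATUS_FILE) as f:
                return f.read().strip() == "on-battery"
        except IOError:
            return False

    battery_since = None
    while True:
        if on_battery():
            if battery_since is None:
                battery_since = time.time()
            elif time.time() - battery_since >= ON_BATTERY_LIMIT:
                # Give logged-in users one minute of warning, then halt cleanly.
                os.system("/sbin/shutdown -h +1 'UPS on battery, shutting down cleanly'")
                break
        else:
            battery_since = None
        time.sleep(POLL_INTERVAL)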
