The kitche cluster is a production-level, no-single-point-of-failure
(NSPOF), high-availability cluster. Although we use a number
of clusters as production systems to test prerelease software,
we wanted to design and build an NSPOF cluster that we could
manage extremely conservatively to determine what level of
uptime was possible. In addition, when things went wrong,
as they always do in real life, we wanted to learn what caused
the failure and how to ameliorate or avoid it next time.
We learned to significantly improve uptime by:
- Planning before doing. The corollary is logging all administrative
operations.
- Carefully monitoring the hardware/software environment.
- Scheduling downtime before it becomes unscheduled downtime.
- Testing changes on a test bed before implementing them.
- Keeping the number of people with access to the root
account to a minimum.
Design Goal
The
priorities for the cluster were availability and reliability
of services. Performance was important, but was subordinate
to availability and reliability.
The stated goal was: Create a two-node cluster with complete
hardware redundancy and software failover of key operating
system and application components. The services that the cluster
provides must be available a minimum of 99.999% of the time to
users and client systems.
Although the stated availability goal was extremely ambitious,
the cluster achieved 100% uptime during the first six months
of 2000. This means that the four critical ASE
services that the cluster provided were always available
during that period. The cluster was upgraded to TruCluster
Server Version 5.0A in October 2000. When availability metrics
are agreed upon and gathered, they will be posted on this
site.
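To put the 99.999% goal in concrete terms, the following minimal sketch converts an availability percentage into a downtime budget. The function name and the choice of seconds as the unit are assumptions made for illustration; they are not part of the original availability metrics, which had not yet been agreed upon.

```python
from datetime import date

SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float, days: int) -> float:
    """Return the maximum downtime, in seconds, allowed over `days` days.

    Hypothetical helper: it only does the arithmetic implied by an
    availability percentage and does not measure anything.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * days * SECONDS_PER_DAY


# The first six months of 2000 (a leap year): January 1 through June 30.
first_half_2000_days = (date(2000, 7, 1) - date(2000, 1, 1)).days

if __name__ == "__main__":
    periods = (
        ("one 365-day year", 365),
        ("first six months of 2000", first_half_2000_days),
    )
    for label, days in periods:
        budget = downtime_budget_seconds(99.999, days)
        print(f"{label}: {budget:.1f} s ({budget / 60:.2f} min)")
```

At 99.999%, the budget is a little over five minutes of downtime in a 365-day year, and under three minutes for the first six months of 2000.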
Assumptions and Restrictions
The
cluster design incorporated the following assumptions and
restrictions:
- We would control only the network wires from the systems to
the network box on the busbar and to the FDDI hub closest
to the systems. After that, site services would control
the network, including all network connections to the office
areas, to other systems in the laboratory, and to remote
sites.
- We would control only the electrical power from the breaker
box on the busbar through the UPSs to the systems. Site
services would control all electrical power beyond the circuit
breaker box.
- TruCluster Server software would provide high availability for services.
- We would install officially released versions of software
products. We would install beta versions of products only
if we needed to and no other option existed. For example,
we would install beta versions of patches only when the
patch would fix a severe problem.
Hardware Overview
The
first step toward ensuring high availability is redundant
hardware that eliminates any single point of failure. In addition,
each system in the cluster must be capable of carrying a full
load without a significant performance degradation if the
other system fails.
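The full-load requirement can be expressed as a simple headroom check. The sketch below is illustrative only: the function name, the load figures, and the assumption that load can be moved freely between identical nodes are hypothetical and are not taken from the cluster's actual capacity planning.

```python
from typing import Sequence


def survives_single_node_failure(loads: Sequence[float], node_capacity: float) -> bool:
    """Return True if the surviving nodes can carry the total load after one node fails.

    loads: normal-operation load on each node, in the same units as node_capacity.
    node_capacity: load one node can carry without significant degradation.
    Assumes identical nodes and freely redistributable load (a simplification).
    """
    if len(loads) < 2:
        # A single node has nothing to fail over to.
        return False
    surviving_capacity = (len(loads) - 1) * node_capacity
    return sum(loads) <= surviving_capacity


if __name__ == "__main__":
    # Hypothetical utilization figures for a two-node cluster.
    print(survives_single_node_failure([0.40, 0.45], node_capacity=1.0))
    print(survives_single_node_failure([0.60, 0.55], node_capacity=1.0))
```

For a two-node cluster, the check reduces to requiring that the combined load of both systems fit on one of them.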
The
cluster consists of two identical AlphaServer® 4000 systems,
each with two 5/533 CPUs, 512 MB memory, and the 16-slot PCI
option. The systems are named kitche1 and kitche2.
(See the hardware table
for a full list of the cluster hardware.)
One 4 GB disk contains the system disk (root, swap,
and /usr). In the Version 1.6 cluster, these were LSM encapsulated
and mirrored to another 4 GB disk in the shelf.
When we upgraded to TruCluster Server Version 5.0A, we used
hardware RAID to mirror the shared clusterwide file systems.
Additional disks contain /usr/local, /tmp, and
/var/adm. There is one spare bootable system disk that
we can use to recover the system disk in case of a catastrophe.
The first BA370 contains five RAID5 sets (8 GB) for the staff
and IMAP areas and one 4 GB disk for automatic replacement if
a disk fails in a RAID5 set. The rest of the disks are
JBODs. The second BA370 is an exact duplicate of the first
BA370 in storage arrangement. LSM is used to mirror the disks
between the first BA370 and the second BA370.
The
systems and ESA10000 cabinets have dual power supplies feeding
the cabinet. Each system has three power supplies. Each storage
shelf has dual power supplies.
All
components used in the cluster are attached to UPSs. The UPSs
get their source from the same busbar, but are attached to
separate breakers on different phases. If the power source
fails, the UPSs can supply the systems with battery power
for approximately 20 minutes before cleanly shutting them
down.