| DESCRIPTION The ERP identified in the Engineering
Advisory contains numerous fixes for device-related hangs, panics,
and boot issues.
Descriptions of the fixes follow:
- This patch fixes a configuration issue found in non-CAM
devices and CD-ROM devices.
- This patch improves the reliability of the Tru64 Cluster DRD
subsystem when faced with tape devices and tape device failures.
- There was a timing hole where two opens would be sent
down at the same time to the tape driver. Before the tape
driver would check to determine if it was already open, the
paths could be changed, which would result in a kernel
memory fault panic. A typical stack trace for the panic
would be:
THREAD 1 drd_open() drd_set_tape_changer_server()
drd_check_path() drd_issue_local_ioctl() ctape_ioctl()
ccmn_path_setup3 ccmn_alloc_path3() ccmn_reg_hier_path3
THREAD 2 drd_open() drd_local_open() drd_local_device_open()
drd_issue_local_ioctl() ctape_ioctl() ctape_verify_path()
ccmn_path_setup3 ccmn_del_stale_paths3() ccmn_destroy_invalid_paths()
ccmn_reg_hier_path3
- When a device is deleted via hwmgr while an open is in
progress, the open can hang. This patch removes the timing
hole that allows the open to progress to the point where it
hangs.
- When a device fails, all current IOs are returned with an
appropriate error status code. If the upper layers continue
to send IOs after the device has been marked as failed, those
IOs can hang in DRD.
- This patch also fixes barrier issues when devices fail
and a barrier is in progress.
Symptoms for items 2, 3, and 4 are:
Status of a drd disk with stalled IOs:
drd_disk            d_hwid  d_state  d_flags     d_type  errno   eei     d_bp_cnt
0xfffffc00f4fe0e00  0x0086  0x0003   0x0a800081  0x0000  0x0013  0x0000  1
DRD_FAILED
DRD_DISK_BLOCKED
DK_DAIO_DISK
DRD_DISK_NOT_USABLE bp 0xfffffc00291b3500 00:02:24.180
DRD_DRAINED_FLAGS
DRD_DISK_FAILED
DRD_STOP_SERVER
DRD_DO_NOT_DELETE
DRD_IS_BARRIERABLE
Typical thread trace for vold threads at the time of hung
IOs:
0 thread_block
1 volsiowait
4 volsioctl_rea
5 spec_ioctl
6 vn_ioctl
7 ioctl_base
8 syscall
9 _Xsyscall
- This patch fixes an error in the DRD subsystem wherein
uninitialized disk attributes can cause a system panic.
- A typical stack trace is as follows:
4 panic
5 trap
6 _XentMM
7 free
8 drd_release_bp_resources
9 drd_ics_io
10 drd_ics_read
11 svr_drd_ics_read
12 icssvr_daemon_from_pool
This problem appears when open/read is attempted on deleted
XCR disks.
- This patch also fixes an error during failback of a
tape device wherein the character devt is not restored properly.
- Corrects a problem where the DRD event thread may loop
indefinitely while responding to a bid server transaction.
- This patch fixes a problem whereby the DRD subsystem may
cause a system panic because routines may be called from a
lightweight context (LWC). This could result in a system panic
with the following or a similar stack trace:
0 boot
1 panic
2 thread_block
3 lock_wait
4 lock_write
5 (source file cannot be determined)
6 (source file cannot be determined)
7 (source file cannot be determined)
8 drd_restart_io
9 drd_io_barrier_complete_timeout
10 softclock_scan
11 lwc_schedule
12 exception_exit
- Fixes a hang in disklabel(8) that occurred when a local open
of the same disk failed at the same time.
- Corrects reference counting issues within the DRD subsystem
that can prevent the deletion of hwids.
- Fixes a disk I/O hang in DRD that could hang commands such as
disklabel and showfdmn, or any file system I/O. A typical stack
trace is as follows:
0 thread_block
1 sleep_prim
2 mpsleep
3 drd_reopen_partitions
4 drd_change_server_node
5 drd_complete_failback
6 drd_handle_event_io_drained
7 drd_handle_one_event
8 drd_handle_events
9 drd_event_thread
- DRD now plays an active role in the device deletion callback
and voting. In the past, DRD was notified via an EVM event only
after the device deletion had occurred. This caused numerous
panics and hung devices, because DRD could attempt to access a
deleted device. With this fix, DRD no longer accesses a device
that has a deletion pending or in progress.
- This patch fixes an issue of DRD returning incorrect device
information when the hwid is not found.
- Provides a fix for a Kernel Memory Fault in drd disk code. A
typical stack trace of the problem is as follows:
0 boot
1 panic
2 trap
3 _XentMM
4 simple_lock_D
5 drd_add_server
6 drd_find_local_disks
7 drd_config_thread
- Fixes DRD_IOCTL_ERROR handling for tape devices.
- Fixes a Kernel Memory Fault in IO Path for Served Disks and
for stalled IOs. A typical stack trace of the problem is as
follows:
0 stop_secondary_cpu
1 panic
2 event_timeout
3 printf
4 panic
5 trap
6 _XentMM
7 drd_ics_get_disk
8 drd_ics_io
9 drd_ics_read
10 svr_drd_ics_read
11 icssvr_daemon_from_pool
- Fixes disk access issues that show up early in the boot
process.
This problem could result in a system panic with the following
or a similar stack trace:
PANIC: "CNX MGR: Invalid configuration for cluster seq disk"
0 boot
1 panic
2 init_globals
3 init_cnx
4 cnx_subsys_configure
5 cnx_callback
6 dispatch_callback
7 main
8 main
- Fixes a hang during cluster bootup caused by early
reservation conflicts. During cluster bootup, the following
warning message appears and the node hangs until another node
comes up:
"WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed
to mount"
- Fixes a cluster hang issue during cluster boot-up, when
local disk open operations fail while disklabel is in progress.
- This patch corrects an erroneous error message that can be
displayed by drdmgr when relocating a device. For example:
drdmgr: Error, Uknown error -1431655766 for device 'tape0'
attribute DRD_SERVER
- Handles reservation conflict errors to address a cluster node
hang during boot. During cluster boot, the following warning
message appears and the node may hang until the second node
comes up. A typical message that appears on the console when the
node hangs is:
"WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed
to mount"
This error message is due to the path being configured later in
the boot process resulting in a reservation conflict.
- Allows retries of a disk open at boot time if the device is in
the MUNSA reject state. A disk open can fail if the device is
currently in the MUNSA reject state. This can result in boot hang
conditions while the system is being booted.
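The retry behavior described in the last item can be illustrated with a small user-space sketch. This is not the DRD kernel code, which is not shown in this advisory. The names DeviceRejectError, open_with_retry, and make_flaky_open, as well as the attempt count and delay, are hypothetical and chosen only for illustration.

```python
import time


class DeviceRejectError(Exception):
    """Hypothetical stand-in for an open that fails because the
    device is temporarily in a reject state."""


def open_with_retry(open_fn, attempts=5, delay=0.0, sleep=time.sleep):
    """Call open_fn until it succeeds or the attempts are used up.

    attempts and delay are assumed values, not taken from the patch.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except DeviceRejectError as err:
            last_error = err
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                sleep(delay)
    # Every attempt was rejected: report the failure instead of hanging.
    raise last_error


def make_flaky_open(rejects):
    """Build a fake open routine that is rejected `rejects` times
    and then succeeds (for demonstration only)."""
    state = {"calls": 0}

    def _open():
        state["calls"] += 1
        if state["calls"] <= rejects:
            raise DeviceRejectError("device in reject state")
        return "opened after %d calls" % state["calls"]

    return _open
```

The point of the bounded loop is that a transient reject state no longer turns the first failed open into a permanent failure, while a device that never leaves the reject state still produces an error rather than an indefinite wait.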
|
| RESOLUTION HP is releasing the following ERP kits
publicly for use by any customer. The ERP kits use dupatch to
install and will not install over any Customer Specific Patches (CSPs)
that have file intersections with the ERP. Contact your service
provider for assistance if the installation of an ERP is blocked by
any of your installed CSPs.
The fixes contained in the ERP kit are available in the following
mainstream patch kit:
HP Tru64 UNIX v 5.1B-4
The kit distributes the following files:
- /usr/opt/TruCluster/sys/drd.mod
- /sys/BINARY/cam_disk.mod
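Because the ERP will not install over a CSP that has file intersections with it, the file list above can be used to reason about which installed CSPs might block the installation. The following is a minimal sketch of that intersection check, not a description of how dupatch performs it. The function name blocking_csps is hypothetical, as are the CSP names and the extra file path in the usage example.

```python
# Files distributed by the ERP kit, as listed in this advisory.
ERP_FILES = {
    "/usr/opt/TruCluster/sys/drd.mod",
    "/sys/BINARY/cam_disk.mod",
}


def blocking_csps(erp_files, installed_csps):
    """Return {csp_name: sorted list of shared files} for every CSP
    whose file list intersects the ERP file list.

    installed_csps maps a CSP name to the files it delivers. How that
    mapping is obtained on a real system is outside this sketch.
    """
    blocked = {}
    for name, files in installed_csps.items():
        shared = set(erp_files) & set(files)
        if shared:
            blocked[name] = sorted(shared)
    return blocked


if __name__ == "__main__":
    # Hypothetical CSPs, for illustration only.
    installed = {
        "CSP-A": ["/sys/BINARY/cam_disk.mod", "/sys/BINARY/example.mod"],
        "CSP-B": ["/sys/BINARY/example.mod"],
    }
    print(blocking_csps(ERP_FILES, installed))
```

In this example only the hypothetical CSP-A shares a file with the ERP, so it is the only one reported as a potential blocker.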
Early Release Patches
HP Tru64 UNIX version: 5.1B-3
ERP Kit Name: TCRKIT1001020-V51BB26-E-20061205
Kit Location:
http://www.itrc.hp.com/service/patch/patchDetail.do?patchid=TCRKIT1001020-V51BB26-E-20061205
|