Keeping NFS running and what to do if it isn't.

This page offers NFS knowledge that anyone who supports NFS systems should have. NFS is a key component of many UNIX systems and is often the prime consumer of network services.

When NFS doesn't work, people tend to say they have "NFS problems," not "Unix problems" or "network problems." NFS exercises much of the Unix kernel, on both the client and the server, so triaging problems can be tricky. We offer this insight, hints, and, we hope, help!

» Introductory notes
Basic notes on architecture, network sniffing, and mantras.
» NFS doesn't work at all
This is the easiest, but there are some configuration traps that often catch the unwary. Even the skilled become frustrated at times.
» NFS works, but it's slower than a dial-up modem
This is the main target for this Web page. Knowing what to look for is vital. The challenge is that the problem can be almost anywhere and isn’t easy to spot.
» GbE is great - until I transfer a big file
Few Gigabit switches can handle the network load Tru64 UNIX systems generate.
» It was working, but it just hung
Sometimes this is just a variant of the first case but it can involve problems deep inside file systems or virtual memory.
» Summary
A brief look back at the big picture.
 

Introductory notes

The key thing to keep in mind when dealing with NFS is that it is two products: a client file system, and a server that uses file systems much like other commands and programs do. It is vital to remember that the client and server are separate entities. When reporting problems, always include system and version information for both the client and the server.

Diagnose using tcpdump

Tcpdump is a vital tool for diagnosing NFS problems. Configure the system for it, install it, and use it every day. Tcpdump's philosophy is "one frame, one line"; it generally prints one line per network message. This is great when you're trying to get an overview of what's going on, but there are times you need to see everything about a message. The "-mv" options enable some multiline and verbose output, which is sometimes enough. To see everything, the best options are snoop on a Solaris system or Ethereal, an open-source X11-based protocol analyzer.

The NFS Mantra

Very few "NFS problems" wind up being software problems within Tru64 UNIX or even other Unices. A kindred spirit commented:

The number one cause [of NFS performance problems] is network configuration. So are numbers 2 thru 99.
Lon Stowell
Posted in comp.protocols.nfs

When it doesn't work at all

When the question is "Why isn't it working?", the first thing to do is see what the client can coax from the server. Start with the basics and work up to more complex cases. Do the following from the client.

  • ping server
    Ping sends ICMP echo request messages to a server. The ICMP responder is implemented within the kernel and relies on next to no user-level daemons. If ping doesn't print replies, you have a network configuration problem, not an NFS problem. If you haven't checked for unplugged power and data cables, now is the time to do so. Then check the ifconfig lines and make sure the entries in /etc/hosts are consistent.
     
  • rpcinfo -p server
    Rpcinfo is a simple ONC RPC test tool. The -p option queries the portmap daemon on the server for a list of all RPC services currently registered there. If it times out, portmap probably isn't running or is sick.
     
  • rpcinfo -u server nfs
    This calls the NULL procedure, which every ONC RPC service implements. In addition to -u (UDP), you can use -t (TCP). (The NULL-procedure probe is easy to reproduce in a few lines of C; see the sketch after this list.)
  • rpcinfo -u server mountd
    NFS V2 uses MOUNT V1. NFS V3 uses MOUNT V3. Don't worry about rpcinfo's complaints as long as MOUNT V1 and V3 are available. You should see:

    program 100005 version 1 ready and waiting
    rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 1
    program 100005 version 2 is not available
    program 100005 version 3 ready and waiting
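
If you want to see exactly what these probes do, the NULL-procedure call is easy to reproduce with the ONC RPC library. The following is a minimal sketch, not a replacement for rpcinfo; it assumes the classic clnt_create()/clnt_call() interfaces and uses the well-known NFS program number 100003.

/* nullping.c -- a sketch of what "rpcinfo -u server nfs" does:
 * call procedure 0 (the NULL procedure) of the NFS program over UDP.
 * Illustrative only; error handling is minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <rpc/rpc.h>

#define NFS_PROG 100003L        /* well-known NFS program number */
#define NFS_VERS 3L             /* try version 3 */

int main(int argc, char **argv)
{
    CLIENT *clnt;
    struct timeval tv = { 10, 0 };      /* 10 second timeout */
    enum clnt_stat stat;

    if (argc != 2) {
        fprintf(stderr, "usage: %s server\n", argv[0]);
        exit(1);
    }
    clnt = clnt_create(argv[1], NFS_PROG, NFS_VERS, "udp");
    if (clnt == NULL) {
        clnt_pcreateerror(argv[1]);     /* e.g. portmapper unreachable */
        exit(1);
    }
    stat = clnt_call(clnt, NULLPROC,
                     (xdrproc_t)xdr_void, NULL,
                     (xdrproc_t)xdr_void, NULL, tv);
    if (stat != RPC_SUCCESS)
        clnt_perror(clnt, argv[1]);     /* e.g. timeout, version mismatch */
    else
        printf("%s: NFS V%ld ready and waiting\n", argv[1], NFS_VERS);
    clnt_destroy(clnt);
    return stat == RPC_SUCCESS ? 0 : 1;
}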


Mount problems
If all of the above works but the mount command doesn't, the problem is often a restriction mountd is applying. Mountd typically returns little information and often doesn't make a syslog entry.

Mountd's options offer a spectrum of privacy (i.e., opportunities for mount failure). The options are better described in mountd(8), but the general range, from most permissive to least, is:

  • mountd -n
    This accepts requests from anyone at any site. It is only appropriate outside a firewall for file systems exported read-only that contain no private data. If this option is not used, requests from non-root users will be rejected with a complaint about weak credentials. Some non-Unix clients may attempt to use AUTH_NONE, which will always result in a weak-credentials complaint.
     
  • mountd -i
    When ONC RPC programs receive requests, they generally get the IP address of the sender from the OS and the host name from the authentication header of the request. If mountd can't match the host name and IP address, it will reject the request. Often the /etc/hosts files on client and server are mismatched, confusing both the systems and the system administrator who made the change just before the end of the workday. (A small consistency-check sketch follows this list.)
     
  • mountd -d
    This disallows requests from clients in other DNS domains, rejecting the request with a rejected-credentials error (or similar).
     
  • mountd -s
    This is claimed to enable subdomain checking, but it may do the same thing as -d.
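
One quick way to check for the host name/address mismatch that trips up mountd -i is a small resolver test on the server: look up the client's name, reverse-map the resulting address, and see whether you get the same name back. The sketch below is only a rough approximation of the consistency check mountd relies on; it uses the classic gethostbyname()/gethostbyaddr() calls and ignores multi-homed hosts and aliases.

/* hostcheck.c -- do forward and reverse name resolution agree?
 * A rough approximation of the check mountd -i depends on.
 * Ignores multi-homed hosts and aliases; illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    struct hostent *fwd, *rev;
    struct in_addr addr;

    if (argc != 2) {
        fprintf(stderr, "usage: %s client-hostname\n", argv[0]);
        exit(1);
    }
    fwd = gethostbyname(argv[1]);               /* name -> address */
    if (fwd == NULL) {
        fprintf(stderr, "no address for %s\n", argv[1]);
        exit(1);
    }
    memcpy(&addr, fwd->h_addr_list[0], sizeof(addr));
    rev = gethostbyaddr((char *)&addr, sizeof(addr), AF_INET);  /* address -> name */
    if (rev == NULL) {
        fprintf(stderr, "no name for %s\n", inet_ntoa(addr));
        exit(1);
    }
    printf("%s -> %s -> %s\n", argv[1], inet_ntoa(addr), rev->h_name);
    /* note: a short name vs. a fully qualified name also shows up here */
    if (strcasecmp(argv[1], rev->h_name) != 0)
        printf("mismatch: mountd -i would likely reject this client\n");
    return 0;
}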


Further narrowing down the problem
Try to reproduce the problem using common Unix commands. If you can, write a minimal C program to use for the final study. The problem with relying on Unix commands is that they are rife with surprises: cp does not simply read one file and write another; it does various stat()s and other calls before it reads any data at all. Consider this report:

   
A couple other odd items. If we "dd if=foo.txt of=/dev/ttypn" on the
client, the mtime does change. If we "cat foo.txt >> /dev/ttypn" on
the client, the mtime does not change. [Later] Using "dd of=/dev/ttyp2
conv=notrunc" does NOT update mtime.

First, dd is not a simple program. Second, there are at least two ways to truncate a file: one is to include O_TRUNC in the open(2) call, another is to use ftruncate() or truncate(). Instead of digging through dd to see which it uses, it may be easier to write a pair of test programs to see whether either or both of these change mtime. That information, combined with a tcpdump trace, will put you two-thirds of the way to a diagnosis.
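
For example, a small test program along these lines separates the two truncation paths. The name and structure are just for illustration; run it against an NFS file, stat the file before and after, and capture the exchange with tcpdump.

/* trunctest.c -- does truncation update mtime over NFS?
 * "trunctest o file" truncates via open(O_TRUNC);
 * "trunctest f file" truncates via ftruncate().
 * Illustrative sketch only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;

    if (argc != 3 || (argv[1][0] != 'o' && argv[1][0] != 'f')) {
        fprintf(stderr, "usage: %s o|f file\n", argv[0]);
        exit(1);
    }
    if (argv[1][0] == 'o') {
        /* truncation requested in the open() itself */
        fd = open(argv[2], O_WRONLY | O_TRUNC);
    } else {
        /* open without truncating, then truncate explicitly */
        fd = open(argv[2], O_WRONLY);
        if (fd >= 0 && ftruncate(fd, 0) < 0) {
            perror("ftruncate");
            exit(1);
        }
    }
    if (fd < 0) {
        perror(argv[2]);
        exit(1);
    }
    close(fd);
    return 0;
}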

When it works, but poorly

This is a more complicated problem. Whereas before you couldn't even get to NFS, now you can, at least some of the time or even nearly all of the time. Now the whole array of software NFS touches comes into question: network, file system, VM/UBC. Where do you start?

As always, start with the simple things. This time, look for places where messages were lost or delayed. If you can find a packet loss rate of more than 1%, fixing that will usually solve the problem.

  • netstat -is
    This prints MAC level statistics, information about the performance of your network interfaces. For example:
tu0 Ethernet counters at Thu Oct  9 21:46:37 1997
       65535 seconds since last zeroed
  4294967291 bytes received
  4294967290 bytes sent
    53202599 data blocks received
    21993784 data blocks sent
  3256128485 multicast bytes received
    26289560 multicast blocks received
     5819084 multicast bytes sent
       59623 multicast blocks sent
     1474661 blocks sent, initially deferred
     2296844 blocks sent, single collision
     2982245 blocks sent, multiple collisions
       34985 send failures, reasons include:
                Excessive collisions
           0 collision detect check failure
          65 receive failures, reasons include:
                Block check error
                Framing Error
                Frame too long
           0 unrecognized frame destination
           0 data overruns
           0 system buffer unavailable
           0 user buffer unavailable

Three counters have reached their upper bounds in the week the system has been up. Look at the number of data blocks sent and received and compare that to the send and receive failures. On a clean Ethernet with no one plugging and unplugging cables, the error counts should be zero, except in a grossly overloaded network; in that case, expect several "Excessive collisions" errors. Here about 0.16% of the blocks sent failed with excessive collisions (34985 send failures out of 21993784 data blocks sent), high enough to have a measurable effect on NFS. The only solution is a faster network or breaking it into multiple subnets.

Any other error is bad. Period. Communications theory says that no communication channel is perfect, but a properly configured Ethernet is astoundingly good. Output errors are especially worrisome. If you see "Remote failure to defer," it means that the Ethernet is too long or that a transceiver is not noticing that another station has already won the wire. Other errors suggest hardware failures of some sort. A high receive or transmit error rate will cause bad performance.

Receive errors are less exciting than transmit errors, but those 65 errors are higher than you will see on a clean net. The error types are typical, maybe even the full set. On the same machine, tu1 is apparently talking to a cleaner subnet:

tu1 Ethernet counters at Thu Oct  9 21:46:37 1997
       65535 seconds since last zeroed
  3940414467 bytes received
     5796189 bytes sent
    21453036 data blocks received
       62309 data blocks sent
  2404736887 multicast bytes received
    19653590 multicast blocks received
     4402854 multicast bytes sent
       44561 multicast blocks sent
        2882 blocks sent, initially deferred
         396 blocks sent, single collision
         590 blocks sent, multiple collisions
           0 send failures
           0 collision detect check failure
           0 receive failures
           0 unrecognized frame destination
           0 data overruns
           0 system buffer unavailable
           0 user buffer unavailable

Half as many packets received, no errors. That's the way it should be.

Don't let yourself be lulled into thinking that the CRC check is catching everything. Keep in mind that even a good 16-bit check will, on average, let one corrupted message in 65536 through, and there are situations where it may not even be that effective. It's well worthwhile to keep your network as error-free as possible so that any hardware or configuration problems will stand out when they do occur.

Many people would worry over the high collision rate on tu0. While there are some issues, a high collision rate ties up very little bandwidth. It's fascinating to watch 10Base2 Ethernet on an oscilloscope. You can see collisions (the signal is twice as strong) and they take something like 51 usecs. Large Ethernet packets take about 300 times that, so even a 50% collision rate may not harm performance. You can often pick out NFS reads and writes, as those show up as multiple packets with the minimum 9.6 usec separation time.

On the other hand, a high collision rate means a heavily loaded Ethernet. Given the tools we ship, that may be the best clue that someone is flooding the net with junk. Tcpdump is the best tool to find out who and what.

  • netstat -s
    Netstat also reports statistics for IP and its users. It produces a lot of output, so the initial reaction is "Too much!" The key things to look at are marked with asterisks below, and their importance is discussed afterwards. Sections without interesting data for NFS are not included.
ip:
*       112277645 total packets received
*       166 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
        13407455 fragments received
        1 fragment dropped (dup or out of space)
*       3692 fragments dropped after timeout
        12749963 packets forwarded
        159 packets not forwardable
        0 packets denied access
        512 redirects sent
        0 packets with unknown or unsupported protocol
        90412125 packets consumed here
        103655445 total packets generated here
        18 lost packets due to resource problems
        4291912 total packets reassembled ok
        2737452 output packets fragmented ok
        12951497 output fragments created
        3 packets with special flags set
icmp:
        410713 calls to icmp_error
        0 errors not generated 'cuz old ip message was too short
        0 errors not generated 'cuz old message was icmp
        Output histogram:
                echo reply: 857067
                destination unreachable: 410701
                routing redirect: 512
                time exceeded: 12
                address mask reply: 3
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply: 215897
                destination unreachable: 425916
                source quench: 1650
                echo: 857067
                time exceeded: 400
                address mask request: 3
        857070 message responses generated
igmp:
        10239 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
        10239 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        0 membership reports sent
tcp:
*       40524497 packets sent
                35825129 data packets (1599924796 bytes)
*               57176 data packets (56122576 bytes) retransmitted
                4213355 ack-only packets (3847882 delayed)
                191 URG only packets
                11406 window probe packets
                302538 window update packets
                114946 control packets
*       25716962 packets received
                18207501 acks (for 1596631807 bytes)
                304232 duplicate acks
                0 acks for unsent data
                15185777 packets (1815409364 bytes) received in-sequence
*               18979 completely duplicate packets (997570 bytes)
                3604 packets with some dup. data (87720 bytes duped)
*               264573 out-of-order packets (30522274 bytes)
                475 packets (16 bytes) of data after window
                16 window probes
                155467 window update packets
                1379 packets received after close
*               17 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
       51766 connection requests
       18648 connection accepts
       66686 connections established (including accepts)
       79278 connections closed (including 1364 drops)
       6683 embryonic connections dropped
       14267537 segments updated rtt (of 14308046 attempts)
*      14195 retransmit timeouts
                8 connections dropped by rexmit timeout
       1973 persist timeouts
       20841 keepalive timeouts
               9156 keepalive probes sent
               199 connections dropped by keepalive
udp:
*       61642016 packets sent
*       63183812 packets received
        0 incomplete headers
        0 bad data length fields
*       3 bad checksums
        146557 full sockets
        824421 for no port (413723 broadcasts, 0 multicasts)
        0 input packets missed pcb cache

    ip: 103655445 total packets generated here
    ip: 112277645 total packets received
    tcp: 40524497 packets sent
    tcp: 25716962 packets received
    udp: 61642016 packets sent
    udp: 63183812 packets received
    These just set baselines for the amount of activity to compare against the various error counters below.

    ip: 166 bad header checksums
    tcp: 17 discarded for bad checksums
    udp: 3 bad checksums

    If you look at Ethernet and FDDI specs and take the time to read up on the subject, you will gain great respect for the error checking that a good CRC offers. If you do the same with the IP checksum, you'll note that it’s an elegant checksum, but still just a checksum. Its goal was largely to provide a warning of software corruption of IP messages in end nodes and routers. If you look at netstat output long enough, especially if you work on weird problems, you'll be amazed at how much manages to evade the CRC check.
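
    For reference, the Internet checksum is just the ones-complement sum of the data taken 16 bits at a time (RFC 1071), which is why it misses whole classes of errors; for example, swapping two 16-bit words leaves the sum unchanged. A minimal reference sketch (real network stacks use heavily optimized versions):

/* in_cksum.c -- ones-complement Internet checksum over a buffer,
 * in the style of RFC 1071.  Reference sketch only. */
#include <stddef.h>
#include <stdint.h>

uint16_t in_cksum(const uint16_t *buf, size_t len_bytes)
{
    uint32_t sum = 0;

    while (len_bytes > 1) {             /* sum 16-bit words */
        sum += *buf++;
        len_bytes -= 2;
    }
    if (len_bytes == 1)                 /* odd trailing byte */
        sum += *(const uint8_t *)buf;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;              /* ones complement of the sum */
}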

    Every so often someone innocently posts a suggestion to comp.protocols.nfs that perhaps the IP checksum has outlived its usefulness. The last time someone did that, within a day engineers from most of the major Unix vendors posted very different answers for why maintaining the IP checksum is critical and other people posted accounts of NFS data corruption when the checksum on UDP messages was disabled. An interesting software engineering thesis would be to study these messages and try to uncover what happened to them.

    ip: 3692 fragments dropped after timeout
    tcp: 57176 data packets (56122576 bytes) retransmitted
    tcp: 18979 completely duplicate packets (997570 bytes)
    tcp: 14195 retransmit timeouts
    These are hallmarks of messages lost in the net. While the NFS timeout rate (see below) is NFS-specific, these counters are helpful because they show that more than just NFS is having trouble.

  • ping server
    If NFS traffic between nodes on the same subnet is fine but traffic crossing a router is not, there is a very high probability that the router is dropping packets. Router statistics are generally available only to the high priest in charge of guarding the router, and he's probably too busy to help out. Ping is a useful check here. Before testing across the router, run ping between pairs of nodes on each subnet to verify that no packets are lost there. Then start a ping that crosses the router; you should see an echo for each packet you send. You can also run tcpdump on both client and server to see whether messages that reach the router show up on the other side. If they don't, that is a very strong sign that the router is swamped or congested.
     
  • nfsstat
    The server statistics aren't too useful, but the client RPC portion is. (nfsstat -cr will print just that.) For example:

alingo 26% nfsstat -cr
Client rpc:
tcp: calls      badxids    badverfs   timeouts   newcreds
     9          0          0          0          0         
     creates    connects   badconns   inputs     avails     interrupts
     2          2          0          20         9          0         

udp: calls      badxids    badverfs   timeouts   newcreds   retrans
     125955     3          0          342        0          0         
     badcalls   timers     waits
     343        229        0          

The key numbers are the UDP calls and timeouts. A timeout rate of more than 1% generally results in awful performance; in the example above, 342 timeouts out of 125955 calls is about 0.27%, below that threshold. If badxids is incrementing, that usually means the server's file system is overloaded; the client has retransmitted the original request, and duplicate replies are coming back.

  • nfsd and nfsiod
    Nfsd must be running for NFS to work at all. The number of server threads you should run is a function of the load and the speed of the exported file systems; however, even at the recommended value of 8, you should see good performance. Nfsiod is not needed for NFS to work, but it provides a big boost to client performance by helping to shepherd multiple read and write requests at a time. Again, the recommended number of threads, 7, should provide decent performance for casual use.

GbE is great - until I transfer a really big file

Gigabit Ethernet came out at what should have been the perfect time: Tru64 could easily saturate 100 Mb/s media with uniprocessor systems, and a 10X increase would give us new headroom. Besides, there is a lot to be said for not saturating the media. One big challenge was keeping up with the packet load; a 1500-byte packet takes only 12 usec of wire time. Some vendors created "jumbo frames," a 9000-byte alternative that still takes only 72 usec. One reason that size was chosen is that it can hold an 8 KB NFS I/O message.

When Gigabit hardware became available, Tru64 NFS supported only double-buffered reads; that is, when an application read a file sequentially, NFS would send one read request ahead of the data the application asked for. This performed poorly on SMP systems, as it didn't keep all the CPUs busy handling reads and had no chance of keeping up with Gigabit loads. NFS was changed to issue two readaheads for each application read request, with a ceiling of eight outstanding reads. The first tests on Gigabit showed this let 4-CPU ES40s saturate Gigabit when reading cached files on the server. Unlike 10Base2 and FDDI, it wouldn't take years of hardware and software work to swamp the new medium.
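
The policy change is easy to picture with a toy model. The sketch below is purely illustrative, not the Tru64 kernel code: each application read may add up to two new readaheads, with at most eight read RPCs outstanding at once.

/* Toy model of the readahead policy described above -- not the actual
 * Tru64 kernel code.  Each application read may trigger up to two new
 * readaheads, with at most RA_CEILING read RPCs outstanding at once. */
#define RA_CEILING 8

static int  reads_in_flight;    /* read RPCs currently on the wire */
static long next_ra_offset;     /* next offset to read ahead */

static void issue_read(long offset)
{
    (void)offset;               /* placeholder: queue to an nfsiod-style thread */
    reads_in_flight++;
}

void read_reply_arrived(void)   /* called as each reply comes back */
{
    reads_in_flight--;
}

void app_sequential_read(long iosize)
{
    int i;

    /* each application read may trigger up to two readaheads, capped */
    for (i = 0; i < 2 && reads_in_flight < RA_CEILING; i++) {
        issue_read(next_ra_offset);
        next_ra_offset += iosize;
    }
}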

Unfortunately, things aren't quite that simple. Those experiments were in 2001, and it appears that few Gigabit switches can keep up. Customers and benchmarking people keep running into problems, but it's natural to look to the computers as the source of the problem instead of the infrastructure. Such "NFS problems" may simply be network congestion issues. Eventually switch vendors will increase their buffering, but in the meantime there is a lot of hardware that can't handle some rather simple configurations.

A Tru64 NFS client limits the network traffic it generates to one request per application thread plus the number of nfsiod threads. The nfsiod threads assist by doing read-aheads and write-behinds, allowing the application requesting the I/O to return to user level. A similar thing is done with disk I/O, but NFS I/O is complicated by retransmissions and the like, so clients generally have full-fledged threads handling the work.

Tru64's default number of nfsiod threads is 7, so a program reading or writing a file can have up to 8 requests outstanding at once. With the standard I/O size of 48 KB, that means up to 384 KB may be in flight. 384 KB appears to be enough to swamp various infrastructures, even though 384 KB is 3,072 Kb, merely 3 msec of wire time at 1 Gb/s. Consider two simple cases.

You can do a simple test to see how a switch handles two fast streams flowing into a single stream: have one client read files from two servers, with all three connected to a Gigabit switch. The switch will see 2 Gb/s of data arriving and has to squeeze it out a 1 Gb/s wire. Using separate programs (e.g., cp) to read the files, there will be no more than nine outstanding reads. (The client doesn't try to allocate nfsiod threads evenly to individual programs; while that might be a problem, it's a separate problem and shouldn't affect this test at all.) If the result is less than what you got with just one reader, you probably have a congestion problem. The client's "fragments dropped after timeout" counter will have incremented, which says that not all fragments of the read replies reached the client, and nfsstat on the client will report timeouts and retransmissions. The servers won't be at fault, as their load is less than in the single-server case. If you can change to another vendor's switch and get a different fragment loss, that is more evidence that the switches can't handle the load. Another useful experiment is to reduce the number of nfsiod threads, or remount the file systems with a mount option like -o rsize=16384, and see if performance improves.

One frustrating aspect of all this is that while network switches track a huge amount of statistics, expect to have trouble finding information about frames discarded due to congestion. This makes it very hard to understand what is going on without a lot more effort or discussions with the switch vendors.

In summary, if NFS seems to work okay as long as you don't read or write files, be sure to consider the infrastructure. Things to check include:

  • On the client, verify that retransmits are happening (nfsstat -cr).
    Look under retrans for that.
  • On the client (if reading) or server (if writing), see if there are any IP "fragments dropped after timeout".
    This is a very strong indication that the infrastructure is losing fragments.
  • Experiment with various numbers of nfsiod threads or with the rsize mount option.
    You can kill and restart nfsiod on the fly. Change the argument to change the number of helper threads that are started. If performance jumps up when you decrease the number of threads below a certain point, that's a very clear sign you've crossed a congestion threshold.
  • If you can enable flow control on your switches, that may greatly improve matters. Flow control throttles fast senders, so if a system is sending to both fast and slow receivers, the fast receiver may see its throughput go down.

It was working, but it just hung

When NFS hangs it can be a challenge to figure out exactly why. Before diving into the kernel and hunting for hung threads, the very first thing to do is to go back to the beginning and check to see if anything works.

Assuming that everything else still works, the important thing is to find the threads that are involved and get their stack traces. You already know how to find the user processes on the client, but there's more to look at, and most people, even senior OS engineers, don't know where to find it. Sometimes the important culprits are daemons or other user-level code; you have to consider any process that might be reading or writing any file! If the system is low on memory, a system call may invoke code that flushes NFS pages out to make space for its own I/O.

The easiest way to find a non-kernel thread stuck in NFS I/O is to look for processes in the uninterruptible (U) state. Of course, processes pass through the U state all the time. Consider this from a mail hub:

% ps ax | grep ' U '
  372 ??        U       59:31.20 /usr/sbin/ypserv
10748 ??>       U        0:01.13 imapd:
17604 ??>       U        0:05.49 imapd:
21332 ??>       U        0:02.91 imapd:
26344 ??>       U        0:00.19 -AA26344 quarry.zk3.dec.com: DATA (sendmail)
27291 ??>       U        0:00.38 imapd:
27829 ??>       U        0:00.21 -AA27829 quarry.zk3.dec.com: DATA (sendmail)
% ps ax | grep ' U '
26043 ??>       U        0:00.83 imapd:
26850 ??>       U        0:05.71 perl /var/adm/ues/bin/cklocks
29186 ??>       U        0:00.17 procmail -f ...

There was no repeat between the two runs, so none of these processes is hung. The most likely user process to hang is the update daemon, which does a sync(2) every thirty seconds. If your system is running multithreaded applications, you may want to use "ps axm | more" and look for U flags.

Nfsiod and nfsd spawn several kernel threads, threads that belong to Pid 0 (or its equivalent on a cluster member). After that, the nfsiod and nfsd processes themselves perform only bookkeeping duties and are rarely involved in "NFS problems". While their kernel threads are the standard places to find hangs, be sure to check all the other kernel threads too, especially those involved with managing swapping, VM, and the UBC.

You can get a glimpse of the kernel threads via ps mlp 0 or this alternative:

# ps -p 0 -m -o  wchan,state,time
WCHAN    S           TIME
*        R <      1:23.44
-        R N      0:00.00
malloc_  U <      0:00.04
4f027c   U <      0:04.89
4f0464   U <      0:00.06
isp_rq   S <      0:08.00
isp_abo  I <      0:00.00
isp_fm   I <      0:00.02
ss_tmo   S <      0:00.00
isp_rq   I <      0:00.00
isp_abo  I <      0:00.00
isp_fm   I <      0:00.00
ss_tmo   S <      0:00.00
624298   U <      0:00.00
623df8   U <      0:00.00
netisr   S <      1:07.23
87e13a28 S <      0:00.34
4c90b0   U <      0:00.00
4ca900   U <      0:00.00
4caac0   U <      0:00.00
624510   U <      0:00.00
ubc_dir  U        0:00.00
648718   U <      0:00.00
648728   U <      0:00.05
623ae8   U <      0:01.12
648748   U <      0:00.00
648758   U <      0:00.00
61a8f8   U <      0:00.00
6486e8   U <      0:00.00
250670   U <      0:00.00
nfsiod_  I        0:00.00
nfsiod_  I        0:00.28
nfsiod_  I        0:00.01
nfsiod_  I        0:00.24
nfsiod_  I        0:00.00
nfsiod_  I        0:00.55
nfsiod_  I        0:00.54
5abf570  U <      0:00.00
5abf330  U <      0:00.00
5abea30  U <      0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.00
nfs_tcp  I        0:00.04
nfs_tcp  I        0:00.00
nfs_udp  I        0:00.00
nfs_udp  I        0:00.00
nfs_udp  I        0:00.00
nfs_udp  I        0:00.00
nfs_udp  I        0:00.02
nfs_udp  I        0:00.00
nfs_udp  I        0:00.00
nfs_udp  I        0:00.01

NFS wchan names were chosen to make it easy to pick out the NFS threads. When the threads are busy they will not show those wchan names, but they will appear on the same lines of the ps output. That output is useful mainly for seeing whether some threads are hung. It's not easy to go from a ps line to a thread address, but it is easy to get a list of all kernel threads from the running system or a crash dump:

# dbx -k /vmunix
(dbx) set $pid=0
(dbx) tstack

On the client, the nfsiod program starts several kernel threads (part of Pid 0) that take over reads and writes to NFS files and make sure they happen. This includes retransmitting requests when replies are not received in a timely fashion. Often you will find that the application Pid is waiting for I/O to complete and that one or more nfsiod threads are doing the I/O but are waiting for a reply or for transmit-done processing to complete. Unfortunately, both of these show similar stack traces:

(dbx) set $pid=4362
(dbx) tstack

Thread 0xfffffc0014ebd8c0:
>  0 thread_block()_
   1 mpsleep(0x0, 0xfffffc000ee31400, 0xfffffc0014f32680,
             0x1004, 0x20000000001)
   2 clntkudp_callit_addr(h = 0xfffffc000ee31408, procnum = 7_
           

On the server, things are more variable. Usually one server thread is stuck deep inside file system code waiting for something to happen, and all the other threads have identical stacks. Those latter threads are not interesting; all they show is that the client has given up waiting for a reply and has retransmitted the request, and the server has picked up the retransmission and called code that is waiting for the original request to finish and release whatever SMP or other locks its thread holds.

Again, look at the stacks of the kernel threads; the NFS server threads will stand out by their names and sizes. From the stack trace you can generally decide whether the problem lies with the file system, VM, or NFS.

Summary: Gotchas and other things to watch for

It's very easy to leap past the hundreds of KB of kernel code to the conclusion that NFS is broken. This is due in part to the several warning messages NFS prints on the console and on users' terminals, and in part to the importance of NFS in many environments. While HTTP may win out sometimes, many web servers return pages that are fetched from NFS servers.

ASE V4.0x systems in particular are extremely heavy NFS consumers, and even access local file systems via NFS. When an ASE system is having problems, people generally don't realize that the system is doing little but NFS, so the conclusion is that NFS is at fault when there are actually many possibilities.

Learn to analyze the many clues available to diagnose a problem. Once you learn them, you will find that

  • You can often diagnose a problem in a few minutes.
     
  • You will have enough evidence to convince your support group that you need their help.
     
  • You will develop a reputation for bringing the right problem to the right people. When you do ask for help, they will immediately accept that you may have a serious problem. Having good data can greatly decrease the time to solve the problem; just a few minutes collecting data can save you days in getting it resolved.

A final reminder: collect tcpdump traces, netstat output, tcpdump traces, appropriate stack traces, and good tcpdump traces.