Continuous Profiling
(It's 10:43; Do You Know Where Your Cycles Are?)
Jennifer Anderson, Lance Berc, Jeff Dean,
Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites,
Mark Vandevoorde, Carl Waldspurger, and William E. Weihl
Digital Systems Research Center
Palo Alto, CA 94301 USA
Processors are getting faster (600 MHz and climbing) and issue widths are increasing
(4- and 8-way becoming common), yet application performance is not keeping pace. On large
commercial applications, average CPI (cycles-per-instruction) numbers may be as high as 4
or 5. With 8-way issue, a CPI of 5 means that only 1 issue slot in every 40 is being put
to good use!
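
The arithmetic behind that figure can be sketched as follows. This is only an
illustration; the function name issue_slot_utilization is hypothetical and does not come
from the profiling tools.

```python
def issue_slot_utilization(cpi: float, issue_width: int) -> float:
    """Fraction of issue slots that carry a useful instruction.

    Each cycle offers `issue_width` slots, and on average one instruction
    completes every `cpi` cycles, so one slot in (cpi * issue_width) is used.
    """
    return 1.0 / (cpi * issue_width)


# Numbers from the text: 8-way issue with a CPI of 5.
utilization = issue_slot_utilization(cpi=5, issue_width=8)
print(f"{utilization:.3f}")  # 0.025, i.e. 1 slot in 40
```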
It is common to blame such problems on memory, and in fact many applications spend many
cycles waiting for memory. But other problems -- e.g., branch mispredicts -- also waste
cycles, and independent of the general causes, if one hopes to improve the performance of
a particular application, one needs to know which instructions are stalling and why.
The Digital Continuous Profiling Infrastructure provides an efficient and accurate way
of answering such questions. It uses the Alpha hardware performance counters to obtain
high-frequency samples of various events (cycles, imisses, branch mispredicts, etc.).
Samples are then processed by a suite of analysis tools that accurately characterize where
the time is being spent in a complex workload, from the fraction of cycles spent in each
executable image to the CPI for each instruction and the reasons for any static or dynamic
stalls.
Both the data collection subsystem and the analysis tools have interesting novel
features. The data collection subsystem uses the hardware performance counters to sample
program counters periodically, recording the samples in on-disk profile files. The system
is designed to run continuously in the background on production systems; for this to be
practical, the overhead must be very low. The system as currently implemented imposes an
average overhead ranging from 1 to 3% depending on the workload, yet sustains a high
sampling rate (about one sample every 62K cycles on average when monitoring cycles,
or about 5200 samples per second on a 333-MHz processor). This permits continuous
operation, and improves the quality of the profiles by minimizing the perturbation of the
system induced by profiling.
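
As a rough check on these figures, the sketch below recomputes the sampling rate from the
clock speed and the sampling period. It assumes that "62K" means 62 * 1024 cycles, which
the text does not state. It also derives a per-sample cycle budget under the further
assumption that all profiling overhead is incurred per sample. Both function names are
hypothetical.

```python
def samples_per_second(clock_hz: float, cycles_per_sample: float) -> float:
    """Average sampling rate when one sample is taken every `cycles_per_sample` cycles."""
    return clock_hz / cycles_per_sample


def cycle_budget_per_sample(overhead_fraction: float, cycles_per_sample: float) -> float:
    """Cycles available per sample if all overhead is per-sample cost (an assumption)."""
    return overhead_fraction * cycles_per_sample


CLOCK_HZ = 333e6    # 333-MHz processor, from the text
PERIOD = 62 * 1024  # assumption: "62K" is 62 * 1024 cycles

rate = samples_per_second(CLOCK_HZ, PERIOD)
low = cycle_budget_per_sample(0.01, PERIOD)   # 1% overhead
high = cycle_budget_per_sample(0.03, PERIOD)  # 3% overhead

print(f"about {rate:.0f} samples per second")
print(f"budget of {low:.0f} to {high:.0f} cycles per sample")
```

Under these assumptions the computed rate is within about one percent of the 5200 samples
per second quoted above.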
The data collection system is transparent: it works with unmodified executables, with
no need to recompile, relink, or make any other changes. It is also comprehensive: it
collects profiles for all code that runs on the system, including applications, shared
libraries, and the kernel. (PALcode on the Alpha is uninterruptible; events that occur in
PALcode are still counted, but the samples show up elsewhere, adding a small amount of
noise to the sample data.)
Identifying and classifying processor stalls at the level of individual instructions is
also a major challenge. The Alpha performance counters, like those in many other modern
processors, can count a variety of events. However, the interrupts for an event that
causes a performance-counter overflow are delivered several cycles after the event happens
(6 cycles on the 21164), causing the samples to land on an instruction some time after the
one relevant to the event. This makes the samples for most events less useful for the kind
of fine-grained analysis we want to produce.
Fortunately, the counter-overflow interrupts for a few events (e.g., instruction-cache
misses) do land on the relevant instruction, and in particular counting cycles yields
sample counts that give a reasonable statistical picture of the total time each
instruction spent waiting to issue: the sample count for an individual instruction when
monitoring cycles is proportional to the total time that instruction spent at the head of
the issue queue. These time-biased samples alone are useful in pinpointing which
instructions in a workload consume the most time, but they do not directly tell why. A
suite of analysis tools uses a detailed machine model and a set of heuristics to convert
time-biased samples into the average CPI for each individual instruction, the number of
times the instruction was executed, and explanations for any static or dynamic stalls.
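
To make the idea concrete, here is a deliberately simplified sketch of how time-biased
samples might be turned into per-instruction CPI estimates within one basic block. It is
not the heuristic used by the actual tools. It assumes a hypothetical machine model that
supplies, for each instruction, the minimum number of cycles it spends at the head of the
issue queue (min_cycles), and it assumes that at least one instruction in the block issues
with no dynamic stall. All names and sample counts are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    samples: int     # cycle samples attributed to this instruction
    min_cycles: int  # hypothetical model: head-of-queue cycles with no dynamic stall


def analyze_block(block, period):
    """Estimate the execution count and per-instruction CPI for one basic block.

    Every instruction in a basic block executes the same number of times, so
    samples / min_cycles is lowest for instructions that never stall dynamically.
    That lowest ratio approximates (execution count / sampling period).
    """
    ratios = [i.samples / i.min_cycles
              for i in block if i.min_cycles > 0 and i.samples > 0]
    if not ratios:
        raise ValueError("no instruction with usable samples in this block")
    samples_per_cycle = min(ratios)
    executions = samples_per_cycle * period
    report = []
    for i in block:
        cpi = i.samples / samples_per_cycle  # average cycles at head of queue
        dynamic_stall = cpi - i.min_cycles   # cycles beyond the static minimum
        report.append((i.name, cpi, dynamic_stall))
    return executions, report


# Illustrative block; the sample counts are made up for this example.
block = [
    Instr("ldq", samples=200, min_cycles=1),
    Instr("addq", samples=1800, min_cycles=1),  # waits on the load result
    Instr("subq", samples=0, min_cycles=0),     # issues together with addq
    Instr("stq", samples=400, min_cycles=2),
]
executions, report = analyze_block(block, period=62 * 1024)
for name, cpi, stall in report:
    print(f"{name:5s} CPI={cpi:.1f} dynamic stall={stall:.1f}")
```

In this made-up block, addq shows a CPI of 9 against a static minimum of 1, so the extra 8
cycles would be flagged as a dynamic stall that needs explaining.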
Other tools -- e.g., Intel's VTune and SGI's Speedshop -- use performance counters to
sample the occurrences of various events. However, they suffer from the same problem as
the performance counters on the Alpha: samples for most events land on nearby
instructions, not the ones that caused the events. As a result, they cannot give an
accurate picture of the CPI for each instruction, the number of times each instruction was
executed, or the reasons for stalls. Such information is available from simulators, but
simulators have serious limitations for analyzing the performance of real systems, not
least of which is their massive overhead.
Our profiling system has been running on Digital Alpha processors under Tru64 UNIX since
September 1996, and was publicly released in December 1996. A port to Alpha/NT is
complete, and a port to OpenVMS is in progress. The system has already been used to
analyze and improve the performance of a wide range of complex commercial applications,
including graphics systems, databases, industry benchmark suites, and compilers. For
example, our tools pinpointed a performance problem in a commercial database system;
fixing the problem reduced the response time of an SQL query from 180 to 14 hours. In
another example, our tools' fine-grained instruction-level analyses identified
opportunities to improve optimized code produced by Digital's compiler, speeding up the
mgrid SPECfp benchmark by 15%.
Our tools can be used directly by programmers; they are also intended to be used to drive
profile-based optimizations in compilers, linkers, post-linkers, and run-time
optimization tools. Work is underway to feed the output of our tools into Digital's
optimizing backend and into the OM post-linker optimization framework. In addition, we are
beginning to explore new optimizations that leverage the detailed instruction-level
information provided by our tools.