This chapter contains information about the Alpha, PA-RISC and the Itanium®
processor families.
It also discusses processor-related development issues
and provides examples using both C and assembly language for typical processor-dependent
code.
Most of the APIs and techniques discussed in this chapter are highly
nonportable.
As a result, be careful when using these practices as they will
result in significantly more work when supporting additional platforms or
operating systems.
11.1 Itanium® Processor Family Overview
The Itanium® architecture was developed by HP and Intel® as a 64-bit processor for engineering workstations and e-commerce servers. The Itanium® processor family uses a new architecture called Explicitly Parallel Instruction Computing (EPIC).
Some of the most important goals in the design of EPIC were instruction-level parallelism, software pipelining, speculation, predication, and large register files. These allow Itanium® processors to execute multiple instructions simultaneously. EPIC exposes many of these parallel features to the compiler (hence "Explicitly" in the name). This increases the complexity of the compiler, but it allows the compiler to directly take advantage of these features.
EPIC is designed so future generations of processors can utilize different levels of parallelism, which increases the flexibility of future designs. This allows the machine code to express where the parallelism exists without forcing the processor to conform to previous machine widths.
For more information on assembly programming for the Itanium® processor family, see the Intel® Itanium® Architecture Assembly Language Reference Guide, available at:
http://developer.intel.com/software/products/opensource/tools1/tol_whte2.htm
11.2 PA-RISC Processor Family Overview
The PA-RISC architecture was developed by HP and first introduced in 1986. It was one of the first commercially available RISC processors, including the fixed-size, hardware-inspired instructions common to RISC designs, but also including additional instructions for manipulating strings and other data types used in commercial processing. This enhancement to the typical RISC instruction set was a direct result of research by HP for a processor design optimized over the full range of technical and commercial applications.
The PA-RISC architecture included several features to increase parallelism and speed program execution. Instruction pipelining, delayed branching (delay slot), instruction cache prefetch based on branch prediction, and a Shift-and-Add instruction to speed integer multiplication are a few of the key features in the design. These features require a complex compiler with a detailed understanding of the architecture, but permit a tremendous reduction in instruction path length over conventional processor designs.
The PA-RISC architecture has been through several enhancements since it was introduced. The current revision (PA-RISC 2.0) is a 64-bit implementation and includes instructions for multimedia processing and features to improve cache performance. More detailed information on the PA-RISC architecture can be found in PA-RISC 2.0 Architecture by Gerry Kane, Prentice Hall, ISBN 0-13-182734-0.
Data access can be accomplished using 8-, 16-, 32-, or 64-bit elements.
The terms byte and word are used to provide non-numerical terms for describing the data sizes.
Elements larger than these basics are defined in terms of either bytes or words.
A byte is defined to be 8 bits.
The size of a word has never been formally defined.
As a result there are a number of definitions used.
The IA-32 processor line and the Alpha processor define a word as 2 bytes (16 bits).
Both the PA-RISC and the Itanium® architectures define a word as 32 bits.
As a result of the differences in the definitions of word-sized elements, elements which are defined as composed of multiple word elements are also affected.
Table 11-1 provides a mapping of common terminology for element sizes between the Alpha, Itanium®, and PA-RISC architectures.
Table 11-1: Multi-byte Element Terminology
| Bit Size | Byte Size | Alpha Architecture | Itanium® Architecture | PA-RISC Architecture |
| 8 | 1 | byte | byte | byte |
| 16 | 2 | word | halfword | halfword |
| 32 | 4 | longword | word | word |
| 64 | 8 | quadword | longword | longword |
| 128 | 16 | octaword | quadword | quadword |
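Because the meaning of word differs between these architectures, code that must build on more than one of them can avoid the term and name sizes in bits instead. The following is a minimal sketch, assuming a C99 compiler that provides <stdint.h>. The type names elem8_t through elem64_t and the helper bytes_for_bits() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size-named aliases; none of these names come from a system header. */
typedef uint8_t  elem8_t;   /* byte on all three architectures            */
typedef uint16_t elem16_t;  /* Alpha word; Itanium/PA-RISC halfword       */
typedef uint32_t elem32_t;  /* Alpha longword; Itanium/PA-RISC word       */
typedef uint64_t elem64_t;  /* Alpha quadword; Itanium/PA-RISC longword   */

/* Returns the size in bytes of an element that is 'bits' wide
 * (the Byte Size column of Table 11-1). */
static size_t bytes_for_bits(unsigned bits)
{
    assert(bits % 8 == 0);
    return bits / 8;
}
```

Declaring a structure field as elem32_t rather than "word" keeps its size the same regardless of which architecture's terminology a reader has in mind.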
Data alignment is important on the Alpha, Itanium®, and PA-RISC processors. Unaligned data access can be extremely expensive or prohibited. When permitted, the high cost of correcting unaligned data access is because multiple memory accesses are actually performed and the appropriate bytes are then reconstructed to the form requested. Depending on the architecture, this operation may be performed by the hardware but more frequently is done in software. Often this alignment fixup requires hundreds of cycles for each unaligned access, significantly impacting the performance of the application.
Section 7.6.3
contains more information and examples
on data alignment management on HP-UX 11i.
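One common way to avoid an unaligned load in portable C is to copy the bytes into a properly aligned local variable. The sketch below uses only standard C and does not rely on any of the HP-UX facilities described in Section 7.6.3. The function names load_u32_unaligned() and is_aligned() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value, in host byte order, from an address that may
 * not be 4-byte aligned. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* byte copy; places no alignment requirement on p */
    return v;
}

/* Returns nonzero if p is aligned to 'align' bytes.
 * 'align' must be a power of two. */
static int is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A caller can test is_aligned() first and use a direct load on the fast path, falling back to load_u32_unaligned() only when the address is not aligned.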
11.5 Determining Processors and Configurations
There are several ways to identify the version of the processor(s) in
a system.
For HP-UX 11i, the recommended way is to use sysconf(_SC_CPU_VERSION) from within a program or getconf from the command line as shown in Example 11-1.
Example 11-1: Identifying the CPU From the Shell
% getconf _SC_CPU_VERSION
768
The values for each type of processor are listed in unistd.h; a portion of these definitions is shown in Table 11-2.
Table 11-2: _SC_CPU_VERSION Values
| Processor | C define | Hex Value | Decimal Value |
| Motorola MC68020 | CPU_HP_MC68020 | 0x20C | 524 |
| Motorola MC68030 | CPU_HP_MC68030 | 0x20D | 525 |
| Motorola MC68040 | CPU_HP_MC68040 | 0x20E | 526 |
| HP PA-RISC 1.0 | CPU_PA_RISC1_0 | 0x20B | 523 |
| HP PA-RISC 1.1 | CPU_PA_RISC1_1 | 0x210 | 528 |
| HP PA-RISC 1.2 | CPU_PA_RISC1_2 | 0x211 | 529 |
| HP PA-RISC 2.0 | CPU_PA_RISC2_0 | 0x214 | 532 |
| Itanium® Revision 0 | CPU_IA64_ARCHREV_0 | 0x300 | 768 |
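When the value comes from somewhere other than the running system (a log file or the getconf output of another machine, for example), it can be decoded without the HP-UX headers. The following sketch hard-codes the numeric values from Table 11-2. The function cpu_version_name() is a hypothetical helper, not part of any HP-UX library, and on HP-UX itself the CPU_* defines from unistd.h are preferable to literal numbers.

```c
/* Hypothetical helper: maps a CPU version value to the processor name
 * listed in Table 11-2. The numeric values are copied from that table. */
static const char *cpu_version_name(long version)
{
    switch (version) {
    case 0x20B: return "HP PA-RISC 1.0";
    case 0x20C: return "Motorola MC68020";
    case 0x20D: return "Motorola MC68030";
    case 0x20E: return "Motorola MC68040";
    case 0x210: return "HP PA-RISC 1.1";
    case 0x211: return "HP PA-RISC 1.2";
    case 0x214: return "HP PA-RISC 2.0";
    case 0x300: return "Itanium Revision 0";
    default:    return "unrecognized";   /* includes values not shown in the table */
    }
}
```

Because Table 11-2 shows only a portion of the definitions, the default case should be expected for processors that are valid but not listed.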
Example 11-2
provides a C code fragment to identify
the current processor.
Example 11-2: Determining Processor Type Using C
#include <unistd.h> /* for CPU_PA_RISC definitions */
switch(sysconf(_SC_CPU_VERSION))
{
case CPU_PA_RISC1_0: /* should not see this case much */
....
case CPU_PA_RISC1_1: /* still popular! */
....
case CPU_PA_RISC2_0: /* Current Machines */
....
case CPU_IA64_ARCHREV_0: /* Rev 0 of Itanium */
....
default:
.... /* Unrecognized */
}
If you just need to determine if the current processor is a PA-RISC machine, the following code fragment will work:
#include <unistd.h>
if (CPU_IS_PA_RISC(sysconf(_SC_CPU_VERSION)))
Sometimes additional processor information is required.
One such case is in determining if the processor is an Itanium® or an Itanium® 2 processor.
The sysconf() and getconf arguments to access these additional sources of information are shown in Table 11-3.
Table 11-3: Parameters for Additional Processor Information
| Argument | Behavior |
| _SC_CPU_CHIP_TYPE | Returns information from Itanium® processor family CPUID[3]. Figure 11-1 shows the regions defined for CPUID[3]. See the Itanium® processor family architecture definition for the meaning of each bit. |
| _SC_CPU_KEYBITS1 | Returns information from Itanium® processor family CPUID[4], and the processor extension information on PA-RISC. See the architecture manuals for the exact meaning of each bit. |
The Itanium® processor family uses a processor identification register file (CPUID) which contains five 64-bit registers in a fixed region, registers 0 to 4, and a variable region, register 5 and above. The variable region of the CPUID register file is currently not implemented, but may be in future members of the Itanium® processor family.
The name of the supplier of the CPU is stored in the 16 bytes of CPUID[0] and CPUID[1]. They are stored so the first letter is stored in byte 0, the second letter in byte 1, and so on. For example, for "IntelCorporation" the "I" is stored in byte 0 and the final "n" is in byte 15.
CPUID[2] is currently not used. CPUID[3] contains 40 bits that identify the processor, while CPUID[4] contains general application-level information about the features supported in the processor.
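The byte layout of the supplier name can be illustrated with a short C routine that unpacks two 64-bit register values into a string. This is a sketch only: cpuid_vendor() is a hypothetical helper, reading the CPUID registers themselves is outside its scope, and it assumes that byte 0 is the least significant byte of each register value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unpacks the 16 supplier-name bytes held in
 * CPUID[0] and CPUID[1] into a NUL-terminated string.
 * Assumption: byte 0 is the least significant byte of each value. */
static void cpuid_vendor(uint64_t cpuid0, uint64_t cpuid1, char out[17])
{
    int i;
    assert(out != 0);
    for (i = 0; i < 8; i++) {
        out[i]     = (char)((cpuid0 >> (8 * i)) & 0xFF);  /* bytes 0..7  */
        out[i + 8] = (char)((cpuid1 >> (8 * i)) & 0xFF);  /* bytes 8..15 */
    }
    out[16] = '\0';
}
```

Under that assumption, the values 0x726f436c65746e49 and 0x6e6f697461726f70 unpack to the "IntelCorporation" string used as the example above.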
|
The Itanium®
processor family CPUID[3] register contains the processor
architecture revision that the processor implements ( |
Figure 11-1: CPUID[3] Bit Positions
In addition to the
|
System Configurations
Another useful type of information is the number of processors in a given system.
Use pstat_getdynamic(3), as shown in Example 11-3.
Example 11-3: Number of CPUs Using C
#include <stdio.h>
#include <sys/param.h>
#include <sys/pstat.h>
#include <sys/unistd.h>
int main(void) {
    struct pst_dynamic psd;
    /* obtain number of CPUs */
    if (pstat_getdynamic(&psd, sizeof(psd), (size_t)1, 0) != -1)
    {
        long nspu = (long)psd.psd_proc_cnt; /* Number of CPUs */
        printf("Number of CPUs: %ld\n", nspu);
        return 0; /* success */
    } else {
        perror("pstat_getdynamic");
        return 1; /* failure */
    }
}
To determine the processor clock speed, use pstat_getprocessor(2).
Example 11-4 uses sysconf() and pstat() to get several pieces of information, including:
Clock ticks per second
Number of CPUs in the system or partition
Speed of each CPU
Example 11-4: Collecting System Information in C
#include <sys/param.h>
#include <sys/pstat.h>
#include <sys/unistd.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h> /* for malloc(), free(), exit() */
int main(void)
{
    struct pst_dynamic psd;
    struct pst_processor *psp;
    long ticks_per_sec;
    //obtain clock ticks per second
    if ( (ticks_per_sec = sysconf(_SC_CLK_TCK)) == -1)
    {
        perror("sysconf");
        exit(EXIT_FAILURE);
    }
    //obtain number of spus
    if (pstat_getdynamic(&psd, sizeof(psd), (size_t)1, 0) != -1)
    {
        size_t nspu = psd.psd_proc_cnt; /* Number of CPUs */
        psp = (struct pst_processor *)malloc(nspu * sizeof(struct pst_processor));
        if (psp == NULL)
        {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        //for each processor, get cycles per clock tick
        if (pstat_getprocessor(psp, sizeof(struct pst_processor), nspu, 0) == (int)nspu)
        {
            size_t i;
            int64_t cycles_pertick, mhz;
            //calculate mhz as below
            for (i = 0; i < nspu; i++)
            {
                cycles_pertick = psp[i].psp_iticksperclktick;
                mhz = (cycles_pertick * ticks_per_sec) / 1000000;
                (void)printf("%" PRId64 " MHz #%lu\n", mhz, (unsigned long)i);
            }
        }
        else
            perror("pstat_getprocessor");
        free(psp);
    }
    else
        perror("pstat_getdynamic");
    return 0;
}
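The clock speed arithmetic in Example 11-4 can be checked separately from the pstat calls. The sketch below isolates it in a hypothetical helper, clock_mhz(), which is not an HP-UX function. The sample figures in the note that follows are illustrative, not measurements from a real system.

```c
#include <assert.h>
#include <stdint.h>

/* Converts processor cycles per clock tick and clock ticks per second
 * into a clock speed in MHz. Integer division truncates any fraction. */
static int64_t clock_mhz(int64_t cycles_per_tick, long ticks_per_sec)
{
    assert(cycles_per_tick >= 0 && ticks_per_sec > 0);
    return (cycles_per_tick * ticks_per_sec) / 1000000;
}
```

For instance, 10,000,000 cycles per tick at 100 ticks per second is 1,000,000,000 cycles per second, which the function reports as 1000 MHz.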
11.6 Determining the Current Processor
HP-UX 11i provides two system APIs to determine the current processor on which your code is running. This information can be useful when using timers or determining the processor affinity of a process.
The
pstat_getproc(2)pstat_getproc()
will always return
the same value.
Multithreaded applications should use pthread_processor_id_np(3T).
When it is called with the PTHREAD_GETCURRENTSPU_NP request, the processor the thread is currently running on is returned.
This interface is not portable across platforms and should be used with caution.
Example 11-5
provides an example
of its use.
Example 11-5: Determining the Current CPU
#include <stdio.h>
#include <pthread.h>
pthread_spu_t spu_id;
pthread_processor_id_np(PTHREAD_GETCURRENTSPU_NP,&spu_id,0);
printf("%d On CPU: %d\n", pthread_self(),spu_id);
11.7 Instruction Pointer (Program Counter)
The Instruction Pointer (IP) or Program Counter
(PC) is accessible to the application programmer as a read-only register.
In the Itanium®
architecture, instructions are grouped into sets of three
instructions called a bundle.
When the IP register is read, it points to
the next bundle of instructions to be executed.
As a result, the IP is aligned
on 16-byte boundaries.
This register is accessed using the
|
|
For example, to access the IP use this assembly code:
mov r32 = ip ;; // Access the IP
For more information, see the Intel® Itanium® Architecture Software Developer's Manual, Volume 1, Section 3.
11.8 Processor Interval Time Counter
Like Alpha processors, both the PA-RISC and Itanium® processor families provide a register that counts at a fixed relationship to the processor clock frequency. This is a read-only register that lets you directly access timing information to determine application performance. This register is Application Register 44 (AR44) on the Itanium® processor family and Control Register (CR16) on PA-RISC.
There are several limitations in using this type of timer. The high granularity of the timer means that other factors, such as cache hits and unrelated I/O interrupts, may significantly affect the time measured. Additionally, each processor in a multiple processor system has its own interval timer and they are not required to be synchronized to each other. Use of the interval timer is only valid on a uniprocessor system or where the application can guarantee that it is executing on the same CPU. These limitations generally limit the use of this type of timer to performance testing during development; for example, how much does optimization increase the performance of this calculation loop.
Example 11-6 uses the TICKS preprocessor macro to access the ITC on Alpha, PA-RISC, or Itanium® processors.
This example is also available for download from the DSPP site, http://www.hp.com/go/dspp, under the Itanium®-based solutions >> technical tips >> performance tools section.
Example 11-6: Reading the ITC
/*
*
* The TICKS macro can be used to access the high-resolution counters
* on the CPU for timing. These counters increment at a rate based on
* the CPU cycle clock. This does not imply a 1:1 ratio.
*
* Caution needs to be used, particularly when using these macros as
* there is no consideration to overflow or changing CPUs. The counters
* in multi-CPU systems are not required to be in sync and may be
* significantly different.
*
* These timers should only be used for VERY small windows, 1-2
* milliseconds to reduce the risk of these problems. These counters
* return 64-bit integer values and should not be stored in int type
* variables, whenever possible the C99 uint64_t type should be used.
*
*/
#ifndef __TICKS_H
#define __TICKS_H
#include <inttypes.h>
/* Check for HP Tru64 UNIX compilers */
#if defined(__DECC_VER) || defined(__DECCXX_VER)
/*
* Handle the Alpha Processor Family
*/
#if defined(__alpha)
#include <alpha/builtins.h>
#define __TICKS __RPCC()
#endif /* __alpha */
/* Check for the HP-UX Compilers */
#elif defined(__HP_aCC) || defined(__HP_cc)
#if defined (__STDC_32_MODE__)
#error "Must use +e or -Ae for 64-bit integer support"
#endif /* __STDC_32_MODE__ */
/*
* Handle the Intel Itanium Processor Family
*/
#if defined(__ia64)
#include <machine/sys/inline.h>
#define __TICKS _Asm_mov_from_ar(_AREG_ITC)
/*
* Handle the PA-RISC Processor Family
*/
#elif defined(__hppa)
#if defined (__HP_aCC)
#error "Inline assembler not supported in C++, compile using C"
#endif /* __HP_aCC */
#include <machine/inline.h>
#include <machine/reg.h>
#ifdef __cplusplus
inline
#else /* __cplusplus */
#pragma INLINE __TICKS_f
#endif /* __cplusplus */
static uint64_t __TICKS_f(void) {
register uint64_t _ticks;
_MFCTL(CR16,_ticks);
return _ticks;
}
#define __TICKS __TICKS_f()
#else
#error "Unknown chip for compiler."
#endif /* arch && ( __HP_aCC || __HP_cc ) */
#else
#error "Unknown Compiler"
#endif /* Compiler Tests */
#endif /* __TICKS_H */
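Two readings of the counter are normally subtracted and then scaled to a time unit. The following sketch shows one way to do that in portable C. The helpers ticks_elapsed() and ticks_to_usec() are hypothetical, and the counter rate (ticks_per_sec) is an assumption that must be obtained separately, since the counter does not necessarily run at the CPU clock rate.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two counter readings taken on the same CPU.
 * Unsigned 64-bit subtraction still gives the right answer if the
 * counter wrapped around once between the readings. */
static uint64_t ticks_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Converts ticks to microseconds for a counter that advances
 * ticks_per_sec times per second. Intended for short windows only:
 * ticks * 1000000 can overflow for very long intervals. */
static uint64_t ticks_to_usec(uint64_t ticks, uint64_t ticks_per_sec)
{
    assert(ticks_per_sec != 0);
    return (ticks * 1000000u) / ticks_per_sec;
}
```

With the header from Example 11-6, a measurement would read the macro before and after the code of interest and pass both values to ticks_elapsed(). The result is meaningful only if the thread stayed on one CPU between the two readings.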
The stack pointer in the Itanium®
processor programming
convention is stored in General Register 12 (GR12); the alias of
|
|
On Itanium®-based systems, the HP-UX 11i library
|
The PA-RISC versions of HP-UX 11i use General
Register 30 (GR30) for the stack pointer; the alias of
|
|
On PA-RISC-based systems,
http://devresource.hp.com/drc/STK/docs/archive/unwind.pdf or Chapter 8: Stack Unwind Library of The 32-bit PA-RISC Run-time Architecture Document at: http://devresource.hp.com/drc/STK/docs/archive/rad_11_0_32.pdf
The Itanium® processor uses stacked register files to improve performance. The processor can allocate up to 96 general registers for use as a register stack. The registers in the stack can be renamed in a process called register renaming. This mechanism is used to pass parameters for function calls. This allows the parameters to be in the same physical registers in both the calling and the called function's register stack frame. In other words, the output parameters of the calling function become the input arguments to the called function and are stored in the same registers. The registers are automatically renamed in the called function to facilitate this operation. This reduces the overhead of function calls.
The register stack can be configured as a circular buffer with register renaming. If all the registers are used when a function is called, then the contents of previously used registers are spilled to memory by the Register Stack Engine (RSE). As the stack is unwound, the previous contents of the registers are restored by the RSE.
The register stack can also be used in software pipelining and loop unrolling. The various uses of register renaming allow the compiler to increase the efficiency of a section of code. The renaming of registers is transparent to the application programmer, but knowledge of this allows you to facilitate the efficient use of these resources.
For more information, see the Intel® Itanium® Architecture Software Developer's Manual, Volume 1, Sections 2 and 5.
Both the Itanium® and PA-RISC processors provide hardware support for two floating-point data types as well as instructions for converting between the types:
Single-precision floating-point
Double-precision floating-point
The Itanium® processor also provides hardware support for two additional types:
|
Additionally the double-extended real (IEEE real type, 128-bit) floating-point format is supported. Depending on the implementation of the architecture, the IEEE-style quad-precision (128-bit) data type may be supported in either hardware or software.
Both methods of IEEE rounding precision are supported: converting the result to the double-extended exponent range and converting results to the destination precision. There are four rounding modes supported:
Nearest (or even)
-infinity (down)
+infinity (up)
Zero (truncate/chop)
There are two functions in HP-UX 11i to select rounding modes:
| fegetround(3) | Gets the current rounding mode. |
| fesetround(3) | Sets the rounding mode. |
The macros to define the mode are contained in /usr/include/fenv.h.
The default mode is round to nearest.
The similar functions in Tru64 UNIX are read_rnd(3) and write_rnd(3).
The Itanium® architecture allows lower-precision operands to produce higher precision results as suggested in the IEEE standard. The architecture also allows higher-precision operands to produce lower-precision results. This is an extension to IEEE standard suggestions in this area. (The standard suggests that this type of operation not be allowed.) When these types of operations produce values beyond the ability of the lower-precision to represent, the floating-point status registers are updated to reflect this.
Itanium® processors use 82 bits internally to store floating-point values; Alpha processors use 64-bit internal representations. The Itanium® processor transforms the internal representation to the desired output representation when the value is stored from a register to a memory location. This allows higher-precision while manipulating values in registers, and the desired precision is achieved in the result stored in memory. In addition to the IEEE formats, Itanium® processors support IA-32 formats and can also store values using the register format in memory. The floating-point registers contain three fields:
|
|
The floating-point registers are used for all floating-point operations and for integer multiply and divide operations (unless the compiler realizes it can use shifts). The Itanium® processor does multiplication in hardware, but uses software to perform division, square roots, and remainders. Intel® provides the algorithms and machine code to implement these operations. For applications, these operations are automatically provided by the compiler/libraries.
Using 82 bits in the floating-point registers allows both reals and integers to be stored in the 64-bit significand. The 17-bit exponent, instead of 15 bits as in the IEEE format, allows the Newton-Raphson approximations used in software divide/sqrt to be seeded with reciprocal approximation instructions that call handlers if they are not conformant 80-bit numbers (that is, a 15-bit exponent cannot represent the value). This allows the calculations to occur without conditionals in the Newton-Raphson. These extra bits can also be used to simplify the implementation of some basic numerical algorithms.
There are several advantages to implementing the divide/sqrt operations in software. One advantage is that because the complex operations are implemented by a series of simpler instructions, these instructions can be scheduled along with other instructions. This allows the opportunity for increasing the parallel execution of instructions. Another advantage is the ability to choose the algorithm used for the calculation. It is possible to optimize these operations for either speed or accuracy, depending on the requirements of the application.
Itanium® processors also have a fused multiply-add instruction. This allows increased precision when chaining several operations together. The processors also implement several IEEE suggested operations in hardware, including maximum, minimum, absolute maximum, and absolute minimum.
Applications that require a large amount of memory should consider the significant performance benefits available from using an increased pagesize. The Alpha, PA-RISC, and Itanium® designs are different in this area, as are the designs of HP-UX 11i and Tru64 UNIX. As the amount of memory available to applications continues to increase, changes will continue to occur in this area.
For proper operation of applications that use pagesize information (mmap(2), for example), use sysconf(3); the _SC_PAGESIZE argument will return the pagesize currently used by the application.
The Alpha processor uses an 8 KB pagesize; most other processors currently use a 4 KB pagesize.
The Itanium® processor allows for variable pagesizes (see Table 11-4).
HP-UX 11i further extends this by allowing the pagesize to be controlled on a per-application basis.
Use the +pd and +pi linker switches or the chatr tool (see Section 5.3.2) to select the pagesize you want for your application.
Table 11-4 shows the architecture supported pagesizes.
Not all pagesizes supported by an architecture are necessarily supported by individual implementations of that architecture or by the operating systems running on them.
Table 11-4: Architecture Supported Pagesizes
| Pagesize | Alpha Architecture | PA-RISC Architecture | Itanium® Architecture |
| 4 KB | Not Available | X | X |
| 8 KB | X | X | X |
| 16 KB | X | X | X |
| 32 KB | X | Not Available | Not Available |
| 64 KB | X | X | X |
| 256 KB | Not Available | X | X |
| 1 MB | Not Available | X | X |
| 4 MB | Not Available | X | X |
| 16 MB | Not Available | X | X |
| 64 MB | Not Available | X | X |
| 256 MB | Not Available | Not Available | X |
While the architectures support multiple pagesizes, the operating systems do not.
The core pagesize on HP-UX 11i is 4 KB; Tru64 UNIX uses 8 KB.
Under HP-UX 11i the memory-management system can use SuperPages to manage multiple, adjacent 4 KB pages as a single page.
This size is dynamically adjusted during the program's execution, and the starting size used is 4 KB unless the +pi or +pd options or the chatr tool is used to change the default size.
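Because the pagesize differs between systems, code that sizes mappings should query it at run time rather than hard-code 4 KB or 8 KB. The following sketch uses the POSIX sysconf() call with _SC_PAGESIZE. The helper names current_pagesize() and round_up_to_page() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the pagesize reported by sysconf(), or 0 if the query fails. */
static size_t current_pagesize(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : 0;
}

/* Rounds len up to a multiple of pagesize.
 * pagesize must be a nonzero power of two. */
static size_t round_up_to_page(size_t len, size_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    return (len + pagesize - 1) & ~(pagesize - 1);
}
```

A length passed to mmap(2) can be prepared with round_up_to_page(len, current_pagesize()), which gives a correct result whichever pagesize the application is running with.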
11.13 Atomic Memory Operations
An application executing on a multiprocessor system may need a method for coordinating data access across multiple instruction streams to ensure data consistency. For example, you may have a shared memory segment that is manipulated by multiple processes and needs to be protected by some kind of voluntary lock. Typically semaphores are used to provide this lock. The semaphore permits processes to coordinate their access to the shared memory structure, which prevents data corruption.
HP-UX 11i provides standard
semop(2)lockf(3)ioctl(3)
A spinlock is a special class of semaphore.
When a process attempts
to acquire a lock via
semop(), the process becomes blocked
and is later rescheduled.
This incurs all of the system overhead associated
with putting a process to sleep and later waking it up.
Processes acquiring
a lock via spinlock attempt the lock, and if blocked spin a few machine cycles
and try again.
These processes may eventually become voluntarily blocked
and go to sleep, but only after many rapid attempts to acquire the lock.
This time spent spinning is wasted time, but the cost of spinning is
much lower than the cost of blocking and rescheduling the process.
Also, since
a waiting process is spinning on a lock, it can acquire the lock immediately
when it is released.
Furthermore, a spinlock can be implemented in user code,
while a semaphore call is a system call, which has more overhead and causes
a context switch.
Thus, on both uniprocessor and multiprocessor systems,
the performance of many applications that need to hold a lock for very quick
memory-to-memory operations can be greatly enhanced by replacing semaphores
with spinlocks.
11.13.1 High-Level Language Support
Tru64 UNIX provides a series of compiler built-ins to provide low-level
memory access for implementation of spin locks and semaphores.
Similar constructs
are not provided by default but may be created on HP-UX 11i using
the assembler.
11.13.2 Assembly Spinlocks for PA-RISC
User spinlocks in PA-RISC are accomplished by using
the load-and-clear instructions ( |
Example 11-7: PA-RISC Spinlocks in Assembly Language
;
; spin.s: Example assembly language routine for spinlock support.
;
.code
; .level 2.0W ; use this option for 64-bit assembly
.export load_and_clear,entry,priv_lev=3,rtnval=gr
.proc
load_and_clear
.callinfo no_calls
.enter
; create a 16 byte aligned pointer to the load+clear word area
addi 15,%arg0,%arg2 ; add 15 to pointer to round up
; Mask off the lower 4 bits
; Choose one of these statements and comment out the other:
depi 0,31,4,%arg2 ; (32-bit version)
; depdi 0,63,4,%arg2 ; (64-bit version)
; load and clear the spinlock. If locked, return 0
stbys,e 0,(%arg2) ; scrub everyone else's cache
; important for performance
ldcws (%arg2),%ret0 ; load and clear the spinlock word
nop ; 3 No-Op instructions
; needed for older
nop ; HP-PA chips
nop
bv,n (%r2)
.leave
.procend
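The pointer rounding performed by the addi and depi instructions at the start of Example 11-7 can be expressed in C as follows. This is an illustrative sketch, and align_up_16() is a hypothetical helper rather than part of the white paper's routines.

```c
#include <stdint.h>

/* Rounds p up to the next 16-byte boundary: add 15, then clear the
 * low 4 bits. The caller must reserve enough space beyond p that the
 * rounded pointer still addresses valid storage (for example, a
 * 4-byte-aligned 16-byte area for a 4-byte lock word). */
static void *align_up_16(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + 15) & ~(uintptr_t)15;
    return (void *)addr;
}
```

A pointer that is already on a 16-byte boundary is returned unchanged, so the routine can be applied unconditionally.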
A complete set of example routines, including high-level access routines, test programs and instructions, are included in the Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC white paper. This paper describes the implementation of spinlocks and fences to guarantee memory consistency for shared memory applications on the Itanium® processor family, as well as PA-RISC platforms.
Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC is available from the DSPP Web site:
http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf
11.13.3 Assembly Spinlocks for the Itanium® Processor Family
Example 11-8
shows the assembly
code to test and set a lock at a specified address on Itanium®
processors.
For the purposes of this example,
|
Example 11-8: Spinlock Code for Itanium® Processors
// Tests the lock at the specified address ([lock]),
// if it is 0 it is available. If it is 1, another
// process is in the critical section.
//
spin_lock:
mov ar.ccv = 0 // cmpxchg looks for avail (0)
mov r2 = 1 // cmpxchg sets to held (1)
spin:
ld8 r1 = [lock] ;; // get lock in shared state
cmp.ne p1, p0 = r1, r0 // is lock held (lock != 0)?
(p1) br.cond.spnt spin ;; // yes continue spinning
// attempt to grab lock
cmpxchg8.acq r1 = [lock], r2 ;;
cmp.ne p1, p0 = r1, r0 // was lock already held (old value != 0)?
(p1) br.cond.spnt spin ;; // yes continue spinning
cs_begin:
// critical section goes here
cs_end:
st8.rel [lock] = r0 ;; // release the lock (r0 always reads as 0)
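For readers more comfortable with C, the test-then-compare-and-exchange pattern of Example 11-8 can be sketched with C11 atomics. This is not the white paper's implementation and it assumes a compiler that provides <stdatomic.h>. The type spin_lock_t and the three functions are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical lock type: 0 means available, 1 means held. */
typedef struct { atomic_long value; } spin_lock_t;

/* Makes a single attempt; returns 1 if the lock was acquired, 0 otherwise. */
static int spin_trylock(spin_lock_t *l)
{
    long expected = 0;
    /* Plain read first, so waiters do not contend for the cache line
     * while the lock is held. */
    if (atomic_load_explicit(&l->value, memory_order_relaxed) != 0)
        return 0;
    /* Compare against 0 and set to 1 with acquire ordering,
     * the role played by cmpxchg8.acq. */
    return atomic_compare_exchange_strong_explicit(
        &l->value, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Spins until the lock is acquired. */
static void spin_lock(spin_lock_t *l)
{
    while (!spin_trylock(l))
        ;   /* spin */
}

/* Stores 0 with release ordering, the role played by st8.rel. */
static void spin_unlock(spin_lock_t *l)
{
    atomic_store_explicit(&l->value, 0, memory_order_release);
}
```

The lock must be initialized to 0 (for example with atomic_init()) before first use. A production lock would usually bound the spinning and fall back to blocking, as described at the start of Section 11.13.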
A complete set of example routines, including high-level access routines, test programs and instructions, are included in the Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC white paper, located at:
http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf
For more complex algorithms to achieve data access synchronization, see Section 13 of the Intel® Itanium® Architecture Software Developer's Manual, Volume 2:
http://developer.intel.com/design/itanium/manuals.htm
11.14 References
Intel® Itanium® Architecture Software Developer's Manual, Volumes 1 & 2, revision 2.1, October 2002:
http://developer.intel.com/design/itanium/manuals.htm
Intel® Itanium® Architecture Assembly Language Reference Guide, 2000-2001:
http://developer.intel.com/software/products/opensource/tools1/tol_whte2.htm
HP-UX Floating-Point Guide, Edition 5, November 1997:
http://docs.hp.com/en/dev.html#Performance%20Tools%20and%20Libraries
Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC, Tor Ekquist and David Graves:
http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf
PA-RISC 2.0 Architecture, Gerry Kane, Prentice Hall; ISBN: 0-13-182734-0; 1st Edition, 1995.
HP-UX Assembler Reference Manual:
http://docs.hp.com/en/dev.html#Assembler
PA-RISC 2.0 Architecture, Hewlett-Packard Company:
http://www.hp.com/dspp >> Technical Resources >> References and Manuals >> PA-RISC section