11    Hardware Considerations

This chapter contains information about the Alpha, PA-RISC and the Itanium® processor families. It also discusses processor-related development issues and provides examples using both C and assembly language for typical processor-dependent code. Most of the APIs and techniques discussed in this chapter are highly nonportable. As a result, be careful when using these practices as they will result in significantly more work when supporting additional platforms or operating systems.

11.1    Itanium® Processor Family Overview

 

The Itanium® architecture was developed by HP and Intel® as a 64-bit processor for engineering workstations and e-commerce servers. The Itanium® processor family uses a new architecture called Explicitly Parallel Instruction Computing (EPIC).

 

Some of the most important goals in the design of EPIC were instruction-level parallelism, software pipelining, speculation, predication, and large register files. These allow Itanium® processors to execute multiple instructions simultaneously. EPIC exposes many of these parallel features to the compiler (hence "Explicitly" in the name). This increases the complexity of the compiler, but it allows the compiler to directly take advantage of these features.

 

EPIC is designed so future generations of processors can utilize different levels of parallelism, which increases the flexibility of future designs. This allows the machine code to express where the parallelism exists without forcing the processor to conform to previous machine widths.

 

For more information on assembly programming for the Itanium® processor family, see the Intel® Itanium® Architecture Assembly Language Reference Guide, available at:

 

http://developer.intel.com/software/products/opensource/tools1/tol_whte2.htm

11.2    PA-RISC Processor Family Overview

 

The PA-RISC architecture was developed by HP and first introduced in 1986. It was one of the first commercially available RISC processors, including the fixed-size, hardware-inspired instructions common to RISC designs, but also including additional instructions for manipulating strings and other data types used in commercial processing. This enhancement to the typical RISC instruction set was a direct result of research by HP for a processor design optimized over the full range of technical and commercial applications.

 

The PA-RISC architecture included several features to increase parallelism and speed program execution. Instruction pipelining, delayed branching (delay slot), instruction cache prefetch based on branch prediction, and a Shift-and-Add instruction to speed integer multiplication are a few of the key features in the design. These features require a complex compiler with a detailed understanding of the architecture, but permit a tremendous reduction in instruction path length over conventional processor designs.

 

The PA-RISC architecture has been through several enhancements since it was introduced. The current revision (PA-RISC 2.0) is a 64-bit implementation and includes instructions for multimedia processing and features to improve cache performance. More detailed information on the PA-RISC architecture can be found in PA-RISC 2.0 Architecture by Gerry Kane, Prentice Hall, ISBN 0-13-182734-0.

11.3    Defining a "word"

Data access can be accomplished using 8-, 16-, 32-, or 64-bit elements. The terms byte and word are used to provide non-numerical terms for describing the data sizes. Elements larger than these basics are defined in terms of either bytes or words.

A byte is defined to be 8 bits. The size of a word has never been formally defined, so several definitions are in use. The IA-32 processor line and the Alpha processor define a word as 2 bytes (16 bits). Both the PA-RISC and the Itanium® architectures define a word as 32 bits. Because the definitions of a word differ, the names of elements composed of multiple words differ as well. Table 11-1 provides a mapping of common terminology for element sizes between the Alpha, Itanium®, and PA-RISC architectures.

Table 11-1:  Multi-byte Element Terminology

Bit Size  Byte Size  Alpha Architecture  Itanium® Architecture  PA-RISC Architecture
8         1          byte                byte                   byte
16        2          word                halfword               halfword
32        4          longword            word                   word
64        8          quadword            longword               longword
128       16         octaword            quadword               quadword
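
Because the same name denotes different sizes on different architectures, portable source code is usually clearer when it names sizes explicitly. The following is a minimal sketch, assuming a C99 compiler that provides the exact-width types in <stdint.h>; the typedef names are hypothetical and exist only to illustrate the rows of Table 11-1.

```c
#include <stdint.h>

/* Hypothetical typedef names that spell out the terms in Table 11-1.
 * The exact-width C99 types carry the size in their own name, so the
 * architecture-specific meaning of "word" no longer matters. */
typedef uint16_t alpha_word;        /* Alpha "word"             = 16 bits */
typedef uint32_t alpha_longword;    /* Alpha "longword"         = 32 bits */
typedef uint64_t alpha_quadword;    /* Alpha "quadword"         = 64 bits */

typedef uint16_t ia64_halfword;     /* Itanium/PA-RISC halfword = 16 bits */
typedef uint32_t ia64_word;         /* Itanium/PA-RISC word     = 32 bits */
typedef uint64_t ia64_longword;     /* Itanium/PA-RISC longword = 64 bits */
```

With these definitions an Alpha longword and an Itanium® word are visibly the same 32-bit quantity, which is the relationship Table 11-1 describes.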

11.4    Data Alignment

Data alignment is important on the Alpha, Itanium®, and PA-RISC processors. Unaligned data access can be extremely expensive or prohibited. When it is permitted, the cost is high because the access is actually performed as multiple memory accesses, and the requested value is then reconstructed from the appropriate bytes. Depending on the architecture, this fixup may be performed by the hardware, but more frequently it is done in software. The fixup often requires hundreds of cycles for each unaligned access, which significantly degrades application performance.
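
One common way to avoid an unaligned access in C is to copy the bytes into an aligned variable instead of dereferencing a pointer that may be misaligned. The following is a minimal sketch of that idea, not an HP-UX interface; the helper names read_u32_unaligned() and is_aligned() are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from an address that may not
 * be 4-byte aligned.  memcpy() copies the bytes into an aligned local
 * variable, so the code never dereferences a misaligned uint32_t
 * pointer. */
uint32_t read_u32_unaligned(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Hypothetical helper: returns nonzero if ptr is aligned to 'alignment'
 * bytes.  'alignment' must be a power of two. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

A caller that parses a packed buffer could test is_aligned(p, 4) and take a fast path when the test succeeds, falling back to read_u32_unaligned(p) otherwise.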

Section 7.6.3 contains more information and examples on data alignment management on HP-UX 11i.

11.5    Determining Processors and Configurations

There are several ways to identify the version of the processor(s) in a system. For HP-UX 11i, the recommended way is to use sysconf(_SC_CPU_VERSION) from within a program or getconf from the command line as shown in Example 11-1.

Example 11-1:  Identifying the CPU From the Shell

% getconf _SC_CPU_VERSION
768

The values for each type of processor are listed in unistd.h; a portion of these definitions is shown in Table 11-2.

Table 11-2:  _SC_CPU_VERSION Values

Processor            C define            Hex Value  Decimal Value
Motorola MC68020     CPU_HP_MC68020      0x20C      524
Motorola MC68030     CPU_HP_MC68030      0x20D      525
Motorola MC68040     CPU_HP_MC68040      0x20E      526
HP PA-RISC 1.0       CPU_PA_RISC1_0      0x20B      523
HP PA-RISC 1.1       CPU_PA_RISC1_1      0x210      528
HP PA-RISC 1.2       CPU_PA_RISC1_2      0x211      529
HP PA-RISC 2.0       CPU_PA_RISC2_0      0x214      532
Itanium® Revision 0  CPU_IA64_ARCHREV_0  0x300      768

Example 11-2 provides a C code fragment to identify the current processor.

Example 11-2:  Determining Processor Type Using C

#include <unistd.h>  /* for the CPU_* definitions */
  switch(sysconf(_SC_CPU_VERSION))
  {
    case CPU_PA_RISC1_0:      /* should not see this case much */
      ....
      break;
    case CPU_PA_RISC1_1:      /* still popular! */
      ....
      break;
    case CPU_PA_RISC2_0:      /* Current Machines */
      ....
      break;
    case CPU_IA64_ARCHREV_0:  /* Rev 0 of Itanium */
      ....
      break;
    default:
      ....                    /* Unrecognized */
      break;
  }

 

If you just need to determine if the current processor is a PA-RISC machine, the following code fragment will work:

#include <unistd.h>
if (CPU_IS_PA_RISC(sysconf(_SC_CPU_VERSION)))

Sometimes additional processor information is required. One such case is in determining if the processor is an Itanium® or an Itanium® 2 processor. The sysconf() and getconf arguments to access these additional sources of information are shown in Table 11-3.

Table 11-3:  Parameters for Additional Processor Information

Argument Behavior
_SC_CPU_CHIP_TYPE Returns information from Itanium® processor family CPUID[3]. Figure 11-1 shows the regions defined for CPUID[3]. See the Itanium® processor family architecture definition for the meaning of each bit.
_SC_CPU_KEYBITS1 Returns information from Itanium® processor family CPUID[4], and the processor extension information on PA-RISC. See the architecture manuals for the exact meaning of each bit.

 

The Itanium® processor family uses a processor identification register file (CPUID) which contains five 64-bit registers in a fixed region, registers 0 to 4, and a variable region, register 5 and above. The variable region of the CPUID register file is currently not implemented, but may be in future members of the Itanium® processor family.

 

The name of the supplier of the CPU is stored in the 16 bytes of CPUID[0] and CPUID[1]. The first letter is stored in byte 0, the second letter in byte 1, and so on. For example, for "IntelCorporation" the "I" is stored in byte 0 and the final "n" is in byte 15.
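
The byte ordering described above can be sketched in C as follows. This is an illustration only: the helper name cpuid_vendor_string() is hypothetical, the sketch assumes that byte 0 is the least significant byte of CPUID[0] and byte 15 is the most significant byte of CPUID[1], and it does not show how the two register values are obtained.

```c
#include <stdint.h>

/* Hypothetical helper: unpack the 16-byte vendor string from the values
 * of CPUID[0] and CPUID[1].
 * Assumption: byte 0 is the least significant byte of CPUID[0] and
 * byte 15 is the most significant byte of CPUID[1]. */
void cpuid_vendor_string(uint64_t cpuid0, uint64_t cpuid1, char out[17])
{
    int i;

    for (i = 0; i < 8; i++) {
        out[i]     = (char)((cpuid0 >> (8 * i)) & 0xFF);  /* bytes 0-7  */
        out[i + 8] = (char)((cpuid1 >> (8 * i)) & 0xFF);  /* bytes 8-15 */
    }
    out[16] = '\0';   /* terminate so the result is a C string */
}
```

Under that assumption, the register values 0x726F436C65746E49 and 0x6E6F697461726F70 unpack to the example string "IntelCorporation".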

 

CPUID[2] is currently not used. CPUID[3] contains 40 bits that identify the processor, while CPUID[4] contains general application-level information about the features supported in the processor.

 

The Itanium® processor family CPUID[3] register contains the processor architecture revision that the processor implements (archrev), processor family (family), processor model number within the processor family (model), revision number (revision) or stepping of the processor, and the index of the largest implemented CPUID registers (number). The number of registers implemented is one greater than the value of number. Figure 11-1 shows the information contained in CPUID[3].

Figure 11-1:  CPUID[3] Bit Positions
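
The fields of CPUID[3] can be separated with shifts and masks. The following is a minimal sketch, assuming the layout used in the Intel® Itanium® architecture manuals, in which the five 8-bit fields occupy the low 40 bits in the order number, revision, model, family, and archrev, starting from bit 0. Confirm the bit positions against the architecture definition before relying on them. The structure and function names are hypothetical; the input value would typically come from sysconf(_SC_CPU_CHIP_TYPE).

```c
#include <stdint.h>

/* Hypothetical container for the CPUID[3] fields named in the text. */
struct cpuid3_fields {
    unsigned number;    /* index of the largest implemented CPUID register */
    unsigned revision;  /* revision (stepping) of the processor            */
    unsigned model;     /* model number within the processor family        */
    unsigned family;    /* processor family                                */
    unsigned archrev;   /* architecture revision                           */
};

/* Assumed layout: number = bits 0-7, revision = bits 8-15,
 * model = bits 16-23, family = bits 24-31, archrev = bits 32-39. */
struct cpuid3_fields decode_cpuid3(uint64_t cpuid3)
{
    struct cpuid3_fields f;

    f.number   = (unsigned)( cpuid3        & 0xFF);
    f.revision = (unsigned)((cpuid3 >> 8)  & 0xFF);
    f.model    = (unsigned)((cpuid3 >> 16) & 0xFF);
    f.family   = (unsigned)((cpuid3 >> 24) & 0xFF);
    f.archrev  = (unsigned)((cpuid3 >> 32) & 0xFF);
    return f;
}
```

As noted above, the number of implemented CPUID registers is f.number + 1.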

 

In addition to the sysconf() and getconf interfaces, HP-UX 11i v2 provides the machinfo(1) command. The /usr/contrib/bin/machinfo tool reports detailed processor information along with other useful machine information.

System Configurations

Another useful type of information is the number of processors in a given system. Use the pstat_getdynamic(2) system call to obtain this information. See Example 11-3.

Example 11-3:  Number of CPUs Using C

#include <sys/param.h>
#include <sys/pstat.h>
#include <sys/unistd.h>
#include <stdio.h>
 
int main(void)
{
   struct pst_dynamic psd;
 
   /* obtain number of CPUs */
   if (pstat_getdynamic(&psd, sizeof(psd), (size_t)1, 0) != -1)
   {
     long ncpu = (long)psd.psd_proc_cnt;    /* Number of CPUs */
     printf("Number of CPUs: %ld\n", ncpu);
     return 0;                              /* success */
   } else {
     perror("pstat_getdynamic");
     return 1;                              /* failure */
   }
}

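To determine the processor clock speed, use the pstat_getprocessor(2) system call. Example 11-4 uses sysconf() and the pstat() interfaces to collect several pieces of information: the number of clock ticks per second, the number of processors, and the clock speed of each processor in MHz.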

Example 11-4:  Collecting System Information in C

#include <sys/param.h>
#include <sys/pstat.h>
#include <sys/unistd.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
 
int main(void)
{
   struct pst_dynamic psd;
   struct pst_processor *psp;
   long ticks_per_sec;
   size_t nspu, i;
 
   /* obtain clock ticks per second */
   if ((ticks_per_sec = sysconf(_SC_CLK_TCK)) == -1)
   {
      perror("sysconf");
      exit(EXIT_FAILURE);
   }
 
   /* obtain number of CPUs */
   if (pstat_getdynamic(&psd, sizeof(psd), (size_t)1, 0) == -1)
   {
      perror("pstat_getdynamic");
      exit(EXIT_FAILURE);
   }
   nspu = (size_t)psd.psd_proc_cnt;
 
   /* one pst_processor entry per CPU */
   psp = malloc(nspu * sizeof(struct pst_processor));
   if (psp == NULL)
   {
      perror("malloc");
      exit(EXIT_FAILURE);
   }
 
   /* for each processor, get cycles per clock tick */
   if (pstat_getprocessor(psp, sizeof(struct pst_processor),
         nspu, 0) != (int)nspu)
   {
      perror("pstat_getprocessor");
      free(psp);
      exit(EXIT_FAILURE);
   }
 
   /* MHz = (cycles per tick * ticks per second) / 1,000,000 */
   for (i = 0; i < nspu; i++)
   {
      int64_t cycles_pertick = psp[i].psp_iticksperclktick;
      int64_t mhz = (cycles_pertick * ticks_per_sec) / 1000000;
      (void)printf("%" PRId64 " MHz  #%lu\n", mhz, (unsigned long)i);
   }
 
   free(psp);
   return 0;
}

11.6    Determining the Current Processor

HP-UX 11i provides two system APIs to determine the current processor on which your code is running. This information can be useful when using timers or determining the processor affinity of a process.

The pstat_getproc(2) function returns information specific to a particular process, including the currently running processor. However, in multithreaded applications, the process may be running on multiple processors at the same time. In this case, pstat_getproc() will always return the same value.

Multithreaded applications should use the pthread_processor_id_np(3T) function to determine the current processor. When called with the PTHREAD_GETCURRENTSPU_NP argument, it returns the processor on which the calling thread is currently running. This interface is not portable across platforms and should be used with caution. Example 11-5 provides an example of its use.

Example 11-5:  Determining the Current CPU

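#include <stdio.h>
#include <pthread.h>
 
int main(void)
{
   pthread_spu_t spu_id;
 
   /* Ask which processor the calling thread is running on. */
   if (pthread_processor_id_np(PTHREAD_GETCURRENTSPU_NP, &spu_id, 0) == 0)
      printf("%ld On CPU: %d\n", (long)pthread_self(), (int)spu_id);
   return 0;
}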

11.7    Instruction Pointer (Program Counter)

 

The Instruction Pointer (IP) or Program Counter (PC) is accessible to the application programmer as a read-only register. In the Itanium® architecture, instructions are grouped into sets of three instructions called a bundle. When the IP register is read, it holds the address of the bundle that contains the currently executing instruction. Because bundles are 16 bytes long and 16-byte aligned, the IP is always aligned on a 16-byte boundary. This register is accessed using the mov instruction.

 

For example, to access the IP use this assembly code:

  mov  r32 = ip  ;;  // Access the IP

 

For more information, see the Intel® Itanium® Architecture Software Developer's Manual, Volume 1, Section 3.

11.8    Processor Interval Time Counter

Like Alpha processors, both the PA-RISC and Itanium® processor families provide a register that counts at a fixed relationship to the processor clock frequency. This is a read-only register that lets you directly access timing information to determine application performance. This register is Application Register 44 (AR44) on the Itanium® processor family and Control Register 16 (CR16) on PA-RISC.

There are several limitations in using this type of timer. Because the timer has such fine granularity, other factors, such as cache hits and unrelated I/O interrupts, may significantly affect the time measured. Additionally, each processor in a multiple processor system has its own interval timer, and the timers are not required to be synchronized with each other. Use of the interval timer is valid only on a uniprocessor system or where the application can guarantee that it is executing on the same CPU. These limitations generally restrict this type of timer to performance testing during development; for example, measuring how much optimization improves the performance of a calculation loop.

Example 11-6 uses the __TICKS preprocessor macro to access the ITC on Alpha, PA-RISC, or Itanium® processors. This example is also available for download from the DSPP site: http://www.hp.com/go/dspp under the Itanium®-based solutions >> technical tips >> performance tools section.

Example 11-6:  Reading the ITC

/*
 *
 * The __TICKS macro can be used to access the high-resolution counters
 * on the CPU for timing. These counters increment at a rate based on
 * the CPU cycle clock. This does not imply a 1:1 ratio. 
 *
 * Use these macros with caution: they make no allowance for counter
 * overflow or for the thread moving to another CPU. The counters
 * in multi-CPU systems are not required to be in sync and may be 
 * significantly different.
 *
 * These timers should only be used for VERY small windows (1-2
 * milliseconds) to reduce the risk of these problems. These counters
 * return 64-bit integer values and should not be stored in int type
 * variables; whenever possible the C99 uint64_t type should be used.
 *
 */
 
#ifndef __TICKS_H
#define __TICKS_H
 
#include <inttypes.h>
 
/* Check for HP Tru64 UNIX compilers */
#if defined(__DECC_VER) || defined(__DECCXX_VER)
/*
 * Handle the Alpha Processor Family
 */
#if defined(__alpha)
#include <alpha/builtins.h>
#define __TICKS __RPCC()
#endif /* __alpha */
 
/* Check for the HP-UX Compilers */
#elif defined(__HP_aCC) || defined(__HP_cc)
 
#if defined (__STDC_32_MODE__) 
#error "Must use +e or -Ae for 64-bit integer support"
#endif /* __STDC_32_MODE__ */
 
/*
 * Handle the Intel Itanium Processor Family
 */
#if defined(__ia64)
#include <machine/sys/inline.h>
#define __TICKS _Asm_mov_from_ar(_AREG_ITC)
 
/*
 * Handle the PA-RISC Processor Family
 */
#elif defined(__hppa)
#if defined (__HP_aCC)
#error "Inline assembler not supported in C++, compile using C"
#endif /* __HP_aCC */
 
#include <machine/inline.h>
#include <machine/reg.h>
 
#ifdef __cplusplus
inline
#else /* __cplusplus */
#pragma INLINE __TICKS_f
#endif /* __cplusplus */
 
static uint64_t __TICKS_f(void) {
 register uint64_t _ticks;
 _MFCTL(CR16,_ticks);
 return _ticks;
}
#define __TICKS __TICKS_f()
#else 
#error "Unknown chip for compiler."
#endif /* arch && ( __HP_aCC || __HP_cc ) */
 
#else
#error "Unknown Compiler"
#endif /* Compiler Tests */
 
#endif /* __TICKS_H */
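
Two counter readings taken on the same CPU are normally turned into an interval by subtracting them. The following is a minimal sketch of that step; the helper names are hypothetical, and ticks_per_second is an assumption that the caller must supply, because the counter rate is platform specific and is not necessarily the CPU clock rate.

```c
#include <stdint.h>

/* Hypothetical helper: number of ticks between two counter readings.
 * Unsigned 64-bit subtraction is performed modulo 2^64, so the result
 * is still correct if the counter wrapped once between the readings. */
uint64_t ticks_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Hypothetical helper: convert a tick count to microseconds.
 * ticks_per_second is supplied by the caller (an assumption of this
 * sketch); it must be nonzero. */
double ticks_to_microseconds(uint64_t ticks, uint64_t ticks_per_second)
{
    return (double)ticks * 1.0e6 / (double)ticks_per_second;
}
```

A measurement would store two readings of the counter in uint64_t variables before and after the code being timed, and then pass them to ticks_elapsed().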

11.9    Application Stack

 

The stack pointer in the Itanium® processor programming convention is stored in General Register 12 (GR12); the alias of sp can also be used to access the register. The stack grows downward (from high memory to low memory), the same direction the stack grows on Tru64 UNIX. Offsets into the stack frame are positive relative to the current stack pointer. The stack pointer is always 16-byte aligned.

 

On Itanium®-based systems, the HP-UX 11i library libunwind.so lets you unwind the procedure call stack. For more information on this library, see unwind(5).

 

The PA-RISC versions of HP-UX 11i use General Register 30 (GR30) for the stack pointer; the alias of sp can also be used to access the register. PA-RISC-based systems use a stack which grows to higher addresses (up). Offsets into the stack frame are negative relative to the current stack pointer. The stack pointer is always 64-byte aligned.

 

On PA-RISC-based systems, libcl provides a stack unwinding interface. For more information on this library, see the UNWIND PA64 Functional Specification at:

http://devresource.hp.com/drc/STK/docs/archive/unwind.pdf

or Chapter 8: Stack Unwind Library of The 32-bit PA-RISC Run-time Architecture Document at:

http://devresource.hp.com/drc/STK/docs/archive/rad_11_0_32.pdf

11.10    Register Stack

 

The Itanium® processor uses stacked register files to improve performance. The processor can allocate up to 96 general registers for use as a register stack. The registers in the stack can be renamed, a process called register renaming, and this mechanism is used to pass parameters for function calls. The parameters occupy the same physical registers in both the calling and the called function's register stack frame. In other words, the output parameters of the calling function become the input arguments to the called function and are stored in the same registers. The registers are automatically renamed in the called function to facilitate this operation, which reduces the overhead of function calls.

 

The register stack can be configured as a circular buffer with register renaming. If all the registers are used when a function is called, then the contents of previously used registers are spilled to memory by the Register Stack Engine (RSE). As the stack is unwound, the previous contents of the registers are restored by the RSE.

 

The register stack can also be used in software pipelining and loop unrolling. The various uses of register renaming allow the compiler to increase the efficiency of a section of code. The renaming of registers is transparent to the application programmer, but knowledge of it helps you make efficient use of these resources.

 

For more information, see the Intel® Itanium® Architecture Software Developer's Manual, Volume 1, Sections 2 and 5.

11.11    Floating Point

Both the Itanium® and PA-RISC processors provide hardware support for two floating-point data types as well as instructions for converting between the types:

  • Single-precision (IEEE real type, 32-bit)

  • Double-precision (IEEE real type, 64-bit)

 

The Itanium® processor also provides hardware support for two additional types:

  • 82-bit floating-point

  • SIMD or parallel floating-point

Additionally the double-extended real (IEEE real type, 128-bit) floating-point format is supported. Depending on the implementation of the architecture, the IEEE-style quad-precision (128-bit) data type may be supported in either hardware or software.

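Both methods of IEEE rounding precision are supported: converting the result to the double-extended exponent range and converting results to the destination precision. There are four rounding modes supported:

  • Round to nearest

  • Round toward zero

  • Round toward positive infinity

  • Round toward negative infinity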

There are two functions in HP-UX 11i to select rounding modes:

fegetround(3) Gets the current rounding mode.
fesetround(3) Sets the rounding mode.

The macros to define the mode are contained in /usr/include/fenv.h. The default mode is round to nearest. The similar functions in Tru64 UNIX are read_rnd(3) and write_rnd(3). Rounding modes can also be selected with the HP-UX 11i -fpeval compiler option, which is similar to the Tru64 UNIX -fprm option on both Compaq C and Compaq Fortran.

 

The Itanium® architecture allows lower-precision operands to produce higher-precision results as suggested in the IEEE standard. The architecture also allows higher-precision operands to produce lower-precision results. This is an extension to IEEE standard suggestions in this area. (The standard suggests that this type of operation not be allowed.) When these types of operations produce values that the lower-precision format cannot represent, the floating-point status registers are updated to reflect this.

 

Itanium® processors use 82 bits internally to store floating-point values; Alpha processors use 64-bit internal representations. The Itanium® processor transforms the internal representation to the desired output representation when the value is stored from a register to a memory location. This allows higher-precision while manipulating values in registers, and the desired precision is achieved in the result stored in memory. In addition to the IEEE formats, Itanium® processors support IA-32 formats and can also store values using the register format in memory.

The floating-point registers contain three fields:

  • A significand in the lower 64 bits (b0-b63), composed of an explicit integer bit (b63) and 63 bits for the fractional part of the significand (b0-b62).

  • A 17-bit exponent (b64-b80). The exponent is biased by 65535 (0xFFFF). An exponent of all ones encodes IEEE signed infinities and NaNs. An exponent field of all zeros and a significand field of all zeros encodes IEEE signed zeros. An exponent field of all zeros and a nonzero significand field encodes the double-extended real denormals and double-extended real pseudo-denormals.

  • A one-bit sign (b81).

When floating-point values are stored from registers to memory, they are formatted using the standard IEEE memory formats: single (4 bytes), double (8 bytes), and double-extended (10 bytes).
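
The three register fields can be combined into a value as sign × significand × 2^(exponent − bias − 63), where the bias is 65535 (0xFFFF) and the 63 accounts for the integer bit sitting at b63. The following is a minimal sketch of that arithmetic, not a description of how the hardware converts values. The structure and function names are hypothetical, only zeros and encodings with a nonzero, non-maximum exponent are handled, and the conversion to double may lose low-order significand bits or overflow the double range.

```c
#include <math.h>
#include <stdint.h>

#define FR_EXP_BIAS 0xFFFF     /* exponent bias, 65535                 */
#define FR_EXP_MAX  0x1FFFF    /* 17-bit exponent field with all ones  */

/* Hypothetical container for the three register fields. */
struct fr_fields {
    unsigned sign;             /* b81                                  */
    uint32_t exponent;         /* b64-b80, biased                      */
    uint64_t significand;      /* b0-b63, explicit integer bit in b63  */
};

/* Returns 0 and stores the value in *out for zeros and for encodings
 * with an exponent field that is neither all zeros nor all ones.
 * Returns -1 for encodings this sketch does not handle (infinities,
 * NaNs, and the denormal encodings). */
int fr_to_double(const struct fr_fields *f, double *out)
{
    double magnitude;

    if (f->exponent == FR_EXP_MAX)
        return -1;                          /* infinity or NaN */
    if (f->exponent == 0) {
        if (f->significand != 0)
            return -1;                      /* denormal encodings */
        magnitude = 0.0;                    /* signed zero */
    } else {
        /* significand * 2^(exponent - bias - 63) */
        magnitude = ldexp((double)f->significand,
                          (int)f->exponent - FR_EXP_BIAS - 63);
    }
    *out = f->sign ? -magnitude : magnitude;
    return 0;
}
```

For example, a sign of 0, an exponent field of 0xFFFF, and a significand of 0x8000000000000000 represent the value 1.0.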

 

The floating-point registers are used for all floating-point operations and for integer multiply and divide operations (unless the compiler realizes it can use shifts). The Itanium® processor does multiplication in hardware, but uses software to perform division, square roots, and remainders. Intel® provides the algorithms and machine code to implement these operations. For applications, these operations are automatically provided by the compiler and libraries.

 

Using 82 bits in the floating-point registers allows both reals and integers to be stored in the 64-bit significand. The 17-bit exponent, instead of 15 bits as in the IEEE format, allows the Newton-Raphson approximations used in software divide/sqrt to be seeded with reciprocal approximation instructions that call handlers if they are not conformant 80-bit numbers (that is, a 15-bit exponent cannot represent the value). This allows the calculations to occur without conditionals in the Newton-Raphson iterations. These extra bits can also be used to simplify the implementation of some basic numerical algorithms.

 

There are several advantages to implementing the divide/sqrt operations in software. One advantage is that because the complex operations are implemented by a series of simpler instructions, these instructions can be scheduled along with other instructions. This allows the opportunity for increasing the parallel execution of instructions. Another advantage is the ability to choose the algorithm used for the calculation. It is possible to optimize these operations for either speed or accuracy, depending on the requirements of the application.

 

Itanium® processors also have a fused multiply-add instruction. This allows increased precision when chaining several operations together. The processors also implement several IEEE suggested operations in hardware, including maximum, minimum, absolute maximum, and absolute minimum.

11.12    Pagesize Considerations

Applications that require a large amount of memory should consider the significant performance benefits available from using an increased pagesize. The Alpha, PA-RISC, and Itanium® designs are different in this area, as are the designs of HP-UX 11i and Tru64 UNIX. As the amount of memory available to applications continues to increase, changes will continue to occur in this area.

For proper operation of applications that use pagesize information (mmap(2), private memory managers, and so on), the information should not be hard-coded into the application. POSIX.1 defines the sysconf(3) API to retrieve information from the operating system during execution. Calling sysconf() with the _SC_PAGESIZE argument returns the pagesize currently used by the application.
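
The following is a minimal sketch of using sysconf(_SC_PAGESIZE) instead of a hard-coded constant, for example when sizing a region for mmap(2) or a private memory manager. The helper name round_up_to_pagesize() is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: round 'length' up to a whole number of pages,
 * using the pagesize reported by the operating system at run time.
 * Returns 0 if the pagesize cannot be determined. */
size_t round_up_to_pagesize(size_t length)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t page;

    if (pagesize <= 0)
        return 0;                       /* sysconf() failed */
    page = (size_t)pagesize;
    return ((length + page - 1) / page) * page;
}
```

Because the pagesize is queried on every call, the same binary behaves correctly whether it runs with a 4 KB, 8 KB, or larger pagesize.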

The Alpha processor uses an 8 KB pagesize; most other processors currently use a 4 KB pagesize. The Itanium® processor allows for variable pagesizes (see Table 11-4). HP-UX 11i further extends this by allowing the pagesize to be controlled on a per-application basis. Use the +pd and +pi linker switches or the chatr tool (see Section 5.3.2) to select the pagesize you want for your application.

Table 11-4 shows the pagesizes supported by each architecture. Not every pagesize supported by an architecture is necessarily supported by individual implementations of that architecture or by the operating systems running on them.

Table 11-4:  Architecture Supported Pagesizes

Pagesize  Alpha Architecture  PA-RISC Architecture  Itanium® Architecture
4 KB      Not Available       X                     X
8 KB      X                   X                     X
16 KB     X                   X                     X
32 KB     X                   Not Available         Not Available
64 KB     X                   X                     X
256 KB    Not Available       X                     X
1 MB      Not Available       X                     X
4 MB      Not Available       X                     X
16 MB     Not Available       X                     X
64 MB     Not Available       X                     X
256 MB    Not Available       Not Available         X

While the architectures support multiple pagesizes, the operating systems do not. The core pagesize on HP-UX 11i is 4 KB; Tru64 UNIX uses 8 KB. Under HP-UX 11i the memory-management system can use SuperPages to manage multiple, adjacent 4 KB pages as a single page. This size is dynamically adjusted during the program's execution, and the starting size used is 4 KB unless the +pi or +pd options or the chatr tool is used to change the default size.

11.13    Atomic Memory Operations

An application executing on a multiprocessor system may need a method for coordinating data access across multiple instruction streams to ensure data consistency. For example, you may have a shared memory segment that is manipulated by multiple processes and needs to be protected by some kind of voluntary lock. Typically semaphores are used to provide this lock. The semaphore permits processes to coordinate their access to the shared memory structure, which prevents data corruption.

HP-UX 11i provides standard semop(2) semaphores, file-system locking (lockf(3) and ioctl(3)) and memory-mapped semaphores. These methods work well for locks that are held for a long period of time, such as when writing to disk. For locks held for a short period of time, for example when updating a few words in a shared memory area, these types of locks can become a severe performance bottleneck. A spinlock is a much more appropriate construct for mutual exclusion in a multiprocessor environment when the lock is to be held for very short periods of time.

A spinlock is a special class of semaphore. When a process attempts to acquire a lock that is already held via semop(), the process becomes blocked and is later rescheduled. This incurs all of the system overhead associated with putting a process to sleep and later waking it up. Processes acquiring a lock via a spinlock attempt the lock and, if it is held, spin for a few machine cycles and try again. These processes may eventually become voluntarily blocked and go to sleep, but only after many rapid attempts to acquire the lock.

This time spent spinning is wasted time, but the cost of spinning is much lower than the cost of blocking and rescheduling the process. Also, since a waiting process is spinning on a lock, it can acquire the lock immediately when it is released. Furthermore, a spinlock can be implemented in user code, while a semaphore call is a system call, which has more overhead and causes a context switch. Thus, on both uniprocessor and multiprocessor systems, the performance of many applications that need to hold a lock for very quick memory-to-memory operations can be greatly enhanced by replacing semaphores with spinlocks.
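
The spin-then-give-up behavior described above can be sketched in portable C. This sketch is an illustration of the idea only: it assumes a C11 compiler with <stdatomic.h> and the POSIX sched_yield() call, which are not the facilities this chapter uses on HP-UX 11i, and it yields the processor rather than sleeping. The type name, function names, and the SPIN_TRIES value are hypothetical.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_TRIES 1000            /* arbitrary value for illustration */

/* Hypothetical lock type: the flag is set while the lock is held. */
typedef struct { atomic_flag held; } spin_lock_t;
#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt: returns nonzero if the lock was acquired. */
int spin_try_acquire(spin_lock_t *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Spin for a bounded number of attempts, then give up the processor
 * voluntarily and start spinning again. */
void spin_acquire(spin_lock_t *lock)
{
    for (;;) {
        int i;
        for (i = 0; i < SPIN_TRIES; i++) {
            if (spin_try_acquire(lock))
                return;            /* lock acquired */
        }
        sched_yield();             /* many failed attempts: yield */
    }
}

/* Release the lock so that a spinning process can acquire it. */
void spin_release(spin_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

A critical section is bracketed by spin_acquire() and spin_release(). For locks shared between processes, the spin_lock_t object would have to live in the shared memory segment.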

11.13.1    High-Level Language Support

Tru64 UNIX provides a series of compiler built-ins that give low-level memory access for implementing spinlocks and semaphores. HP-UX 11i does not provide similar constructs by default, but they can be created using the assembler.

11.13.2    Assembly Spinlocks for PA-RISC

 

User spinlocks on PA-RISC are implemented with the load-and-clear instructions (ldcw or ldcws), which load the spinlock word and clear it in a single machine instruction. Example 11-7 provides a sample implementation of spinlocks in PA-RISC assembly language.

Example 11-7:  PA-RISC Spinlocks in Assembly Language

;
;  spin.s:  Example assembly language routine for spinlock support.
;
    .code
;   .level 2.0W           ; use this option for 64-bit assembly
    .export load_and_clear,entry,priv_lev=3,rtnval=gr
    .proc
load_and_clear
    .callinfo no_calls
    .enter
 
; create a 16 byte aligned pointer to the load+clear word area
    addi        15,%arg0,%arg2  ; add 15 to pointer to round up
 
; Mask off the lower 4 bits
; Choose one of these statements and comment out the other:
    depi        0,31,4,%arg2    ; (32-bit version)
;   depdi       0,63,4,%arg2    ; (64-bit version)
 
; load and clear the spinlock.  If locked, return 0
    stbys,e     0,(%arg2)       ; scrub everyone else's cache
                                ; important for performance
    ldcws       (%arg2),%ret0   ; load and clear the spinlock word
    nop                         ; 3 No-Op instructions
                                ; needed for older
    nop                         ; HP-PA chips
    nop
    bv,n        (%r2)
    .leave
    .procend

A complete set of example routines, including high-level access routines, test programs, and instructions, is included in the Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC white paper. This paper describes the implementation of spinlocks and fences to guarantee memory consistency for shared memory applications on the Itanium® processor family, as well as PA-RISC platforms.

Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC is available from the DSPP Web site:

http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf

11.13.3    Assembly Spinlocks for the Itanium® Processor Family

 

Example 11-8 shows the assembly code to test and set a lock at a specified address on Itanium® processors. For the purposes of this example, [lock] indicates the address of lock. In complete code, the address would be loaded into a register that would then be used in Example 11-8.

Example 11-8:  Spinlock Code for Itanium® Processors

// Tests the lock at the specified address ([lock]).
// If it is 0, the lock is available. If it is 1, another
// process is in the critical section.
//
spin_lock:
      mov ar.ccv = 0         // cmpxchg looks for avail (0)
      mov r2 = 1             // cmpxchg sets to held (1) 
 
spin:
      ld8 r1 = [lock] ;;     // get lock in shared state
      cmp.ne p1, p0 = r1, r0 // is lock held (lock != 0)?
(p1)  br.cond.spnt  spin ;;  // yes, continue spinning
 
      // attempt to grab lock
      cmpxchg8.acq r1 = [lock], r2 ;; 
      cmp.ne p1, p0 = r1, r0 // was lock already held?
(p1)  br.cond.spnt spin ;;   // yes, continue spinning
 
cs_begin:
      // critical section goes here
cs_end:
      st8.rel [lock] = r0 ;; // release the lock

A complete set of example routines, including high-level access routines, test programs, and instructions, is included in the Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC white paper, located at:

http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf

For more complex algorithms to achieve data access synchronization, see Section 13 of the Intel® Itanium® Architecture Software Developer's Manual, Volume 2:

http://developer.intel.com/design/itanium/manuals.htm

11.14    References

Intel® Itanium® Architecture Software Developer's Manual, Volumes 1 & 2, revision 2.1, October 2002:

http://developer.intel.com/design/itanium/manuals.htm

Intel® Itanium® Architecture Assembly Language Reference Guide, 2000-2001:

http://developer.intel.com/software/products/opensource/tools1/tol_whte2.htm

HP-UX Floating-Point Guide, Edition 5, November 1997:

http://docs.hp.com/en/dev.html#Performance%20Tools%20and%20Libraries

Implementing Spinlocks on the Intel® Itanium® Architecture and PA-RISC, Tor Ekquist and David Graves:

http://h21007.www2.hp.com/dspp/files/unprotected/Itanium/spinlocks.pdf

PA-RISC 2.0 Architecture, Gerry Kane, Prentice Hall; ISBN: 0-13-182734-0; 1st Edition, 1995.

HP-UX Assembler Reference Manual:

http://docs.hp.com/en/dev.html#Assembler

PA-RISC 2.0 Architecture, Hewlett-Packard Company:

http://www.hp.com/dspp >> Technical Resources >> References and Manuals >> PA-RISC section