Dr. Dobb's Journal August, 2005
During the development of the AMD Opteron and Athlon64 (AMD64) family of 64-bit x86-compatible processors, AMD engineers realized they would need a robust program-development environment optimized for the extended 64-bit instruction set of these processors running 64-bit versions of Linux and Microsoft Windows. AMD required a strategy to satisfy science and engineering power users who need highly optimizing vector/parallel compilers and development tools. With this goal in mind, AMD engaged The Portland Group (the company we work for) to port its PGI Workstation compiler suite, including the PGF95 Fortran compiler, PGCC C and C++ compilers, PGDBG debugger, and PGPROF profiler, to this 64-bit architecture. The project involved both retargeting the compilers and tools at these new processors and rehosting the compilers and tools themselves to take advantage of the new features. During this process, The Portland Group (PGI) encountered and resolved many of the same 32-bit to 64-bit porting issues that our customers are working through today as they port multimillion-line applications to these new high-volume, high-performance 64-bit platforms.
The PGI compilers are a suite of standard-conforming F95, C, and C++ compilers, with a long history of having been ported to many processors, including 64-bit architectures such as the Intel i860. More recently, the PGI compilers have been optimized for Pentium-class 32-bit x86 processors, specifically in the Linux and Microsoft Windows environments. The compilers have a relatively standard design, with a front end for each supported language, language- and target-independent high-level and low-level optimizations, and target-dependent final code generation. They include techniques such as aggressive function inlining, SIMD vectorization, OpenMP parallelization, interprocedural analysis, and profile-feedback optimizations to deliver the highest possible performance. Taken as a whole, the PGI compilers, tools, and runtime support libraries comprise well over a million lines of ANSI C source code.
A compiler is a large program (several hundred thousand lines of code) that manages many interconnected data structures. One of the first decisions one must make when porting a compiler to 64 bits is which data structure fields must be increased from 32 to 64 bits. In this case, because a shared source-code base is used for all of the PGI compilers on all platforms, we had to determine whether this increase would affect the version of the compilers that targets 32-bit platforms.
As it turns out, many of the fields in our case did not need to be enlarged. For instance, compiling for AMD64 targets does not increase the number of Fortran or C language statement types, or the number of data types in these languages, so fields for these values don't need to grow. Many other fields are array indices or pointers to other data structures. Compiling for 64-bit targets doesn't mean we are compiling larger programs, so array indices, such as symbol table numbers, don't need to increase in size either. Fortunately, one of the key features AMD built into the AMD64 architecture is that it preserves full native support for 32-bit data. There is no penalty for using 32-bit indices where appropriate, and the resulting space savings for large data structures can be significant. We highly recommend that any 32-bit to 64-bit porting project to AMD64 or compatible processors take this measured approach, rather than promoting all data to 64 bits by default.
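The space argument can be made concrete with a small sketch. The two structures below are hypothetical (they are not PGI's actual symbol table layout): one links entries by 32-bit table index, the other by pointer. On an AMD64 target the index-linked form is half the size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol-table entry that refers to other entries by
   32-bit index. Its size is the same on 32-bit and 64-bit targets. */
typedef struct {
    int32_t next;   /* index of the next symbol in the table */
    int32_t type;   /* index into a (hypothetical) type table */
    int32_t scope;  /* index of the enclosing scope entry */
} SymByIndex;

/* The same entry using pointers. Every link doubles in size when
   pointers grow from 32 to 64 bits. */
typedef struct SymByPointer {
    struct SymByPointer *next;
    void *type;
    void *scope;
} SymByPointer;

/* Index-based access: the table base is the only pointer needed. */
SymByIndex *sym_at(SymByIndex *table, int32_t n_syms, int32_t idx)
{
    assert(table != NULL && idx >= 0 && idx < n_syms);
    return &table[idx];
}

/* Bytes needed for n entries of each layout. */
size_t index_table_bytes(size_t n)   { return n * sizeof(SymByIndex); }
size_t pointer_table_bytes(size_t n) { return n * sizeof(SymByPointer); }
```

For a table of a million symbols, the index-linked layout needs 12 MB regardless of target, while the pointer-linked layout needs 24 MB once pointers are 64 bits wide.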
However, we obviously did need to enlarge pointer fields. One of the most common issues encountered during the port was the storing of a pointer value in an integer. This is a common practice when designing data structures and trying to save space. On 32-bit platforms, pointers and integers are both 32 bits; on 64-bit platforms, pointers are all 64 bits, but the default int datatype is still 32 bits. On 64-bit Windows, the long datatype is also 32 bits, so simply redeclaring all int data as long is insufficient. Our solution was to use typedefs and macros to define ISZ_T, an integer datatype that is large enough to hold a pointer value.
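To see why a pointer stored in a 32-bit integer field breaks, the following sketch simulates the round trip. Addresses are modeled as uint64_t values so that the example behaves identically on any host; the function names and the sample addresses are ours, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Store an address in a 32-bit integer field and read it back.
   The upper 32 bits are discarded by the narrowing conversion. */
uint64_t through_int32_field(uint64_t addr)
{
    uint32_t field = (uint32_t)addr;
    return (uint64_t)field;
}

/* The same round trip through a field as wide as the address. */
uint64_t through_wide_field(uint64_t addr)
{
    uint64_t field = addr;
    return field;
}

/* A low address survives the 32-bit field; a high one does not. */
void check_truncation(void)
{
    const uint64_t low  = 0x00401000ULL;          /* fits in 32 bits */
    const uint64_t high = 0x00007f3a00401000ULL;  /* needs more than 32 bits */

    assert(through_int32_field(low) == low);
    assert(through_int32_field(high) != high);
    assert(through_wide_field(high) == high);
}
```

Code with this bug often appears to work in testing, because small programs may be loaded and allocate memory at low addresses, and then fails only when the address space in use grows.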
A related problem arises when using fprintf to print the value of a variable of type ISZ_T. Depending on the host and target, ISZ_T may be int, long, or long long, and the format for such an item needs to be "%d", "%ld", or "%lld", respectively. We resorted to the inelegant but portable mechanism of defining a string macro ISZ_PF that holds the length modifier for ISZ_T (empty, "l", or "ll"), which the compiler concatenates with the adjacent string literals. An fprintf statement for an ISZ_T item then looks something like this:
fprintf( outfil, "val=%" ISZ_PF "d\n", isz_type_item );
With appropriate definitions of ISZ_T and ISZ_PF, this works on any combination of 32-bit and 64-bit host and target. While the AMD64 architecture is little-endian, this mechanism also works when cross-compiling on a big-endian architecture (such as a Sun SPARC) targeted at AMD64.
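The article names ISZ_T and ISZ_PF without showing their definitions, so the following is only a sketch of what such a pair might look like. The preprocessor test and the choice of long long for 64-bit builds are our assumptions, and format_isz is a hypothetical helper added to make the result easy to check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed definitions, not PGI's actual ones. long long is 64 bits on
   both 64-bit Linux and 64-bit Windows, unlike long. */
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef long long ISZ_T;
#define ISZ_PF "ll"      /* "%" "ll" "d" concatenates to "%lld" */
#else
typedef int ISZ_T;
#define ISZ_PF ""        /* "%" "" "d" concatenates to "%d" */
#endif

/* Hypothetical helper: format an ISZ_T value into buf.
   Returns the number of characters snprintf would have written. */
int format_isz(char *buf, size_t len, ISZ_T v)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "val=%" ISZ_PF "d", v);
}
```

The standard header inttypes.h offers PRIdPTR and related macros for intptr_t that use the same string-concatenation technique. A project-specific pair such as ISZ_T and ISZ_PF remains useful when the type must follow the target rather than the host.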
Even though memory is relatively inexpensive, and AMD64 architecture machines have large address spaces and usually large physical memories, it is still, of course, wise to take care regarding the size of data structures. When increasing an integer or pointer field from 32 bits to 64, the field needs to be aligned on an 8-byte boundary. This may introduce extra padding, which is best avoided. In the port of the PGI compilers, we carefully reviewed each data structure to ensure the fields are properly aligned.
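The padding cost is easy to demonstrate. In this sketch the two hypothetical structures hold the same three fields; only the order differs. The sizes in the comments assume an AMD64 target, where a 64-bit integer is aligned on an 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 64-bit field between two 32-bit fields: the compiler inserts
   4 bytes of padding before size and 4 bytes after flags. */
typedef struct {
    int32_t kind;
    int64_t size;
    int32_t flags;
} Padded;                 /* 24 bytes on AMD64 */

/* Widest field first: the two 32-bit fields share one 8-byte slot. */
typedef struct {
    int64_t size;
    int32_t kind;
    int32_t flags;
} Reordered;              /* 16 bytes on AMD64 */

/* Bytes in a structure that hold no data. */
size_t wasted_bytes(size_t struct_size, size_t payload_size)
{
    assert(struct_size >= payload_size);
    return struct_size - payload_size;
}
```

Checking sizeof and offsetof for each widened structure, as the test values here do, is a cheap way to carry out the kind of review described above.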
The aspect of the port that consumed most of the effort was retargeting the compilers to the AMD64 instruction set. Most of the PGI compiler infrastructure is relatively independent of the target; for instance, parsing, symbol table management, data-flow analysis, and many optimizations do not depend on the target machine.
Some optimizations change depending on tradeoffs between microarchitectural details. As an example, the AMD Opteron and Athlon64 perform floating-point arithmetic much more efficiently using SSE registers and instructions, as opposed to the legacy x87 floating-point stack. This is true even for cases where so-called "scalar" SSE instructions are used rather than the packed or "vector" SSE SIMD instructions. This is a detail of the implementation and has little to do with the additions to the instruction set; yet, a myriad of such subtle differences are important parts of the retargeting and optimization process for a new architecture.
This particular detail is one that has affected many PGI users due to the numerical differences that arise between calculations performed using the 80-bit IEEE x87 instructions versus 32-bit or 64-bit IEEE SSE instructions. While it is not widely recognized, all single-precision (32-bit) and double-precision (64-bit) floating-point data is promoted to 80-bit IEEE format for processing by the x87 floating-point unit. Many users, especially those with applications that process single-precision floating-point data, have become unknowingly dependent on the increased precision afforded by x87 instructions. Attempts to port these applications to AMD64 processors, where it is beneficial to use the default SSE instructions for best performance, generally have to be performed in phases by temporarily recasting certain variables to double precision during precision-critical parts of certain algorithms. The PGI compilers include a number of features designed to facilitate this phased process, including but not limited to options that alter the default precision used in the x87 unit or promote variables globally from single to double precision.
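The effect of intermediate precision can be illustrated without any x87 code. The sketch below sums the same single-precision value using a single-precision accumulator and a double-precision one. The wider accumulator stands in for the extra precision that x87 intermediates provide; it is an analogy under our own assumptions, not a reproduction of x87 behavior, and the function names are ours.

```c
#include <assert.h>

/* Accumulate n copies of x in single precision. With SSE code
   generation, every partial sum is rounded to 32 bits. */
float sum_in_float(float x, long n)
{
    float s = 0.0f;
    long i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        s += x;
    return s;
}

/* Accumulate the same values in double precision, as a phased port
   might do after recasting a precision-critical variable. */
double sum_in_double(float x, long n)
{
    double s = 0.0;
    long i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        s += (double)x;
    return s;
}

/* Absolute difference, written out to avoid depending on math.h. */
double abs_error(double got, double want)
{
    double d = got - want;
    return d < 0.0 ? -d : d;
}
```

Summing 0.1f one million times shows the gap: the double accumulator tracks the product of n and x closely, while the single-precision accumulator drifts as the running sum grows and each addition rounds away more of the small addend. Code compiled for x87 may keep the float accumulator in an 80-bit register and hide the drift, which is how applications come to depend on it.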
A major difference encountered during retargeting was the use of 64-bit integer arithmetic for array subscripting and other address computations. This required changes throughout the PGI compiler infrastructure. Given the nature of many of the optimization phases of the compilers, some of which are subtly dependent on the size of certain data structures, extreme care had to be taken in enabling certain of these optimizations. In the end, we resorted to an incremental approach that allowed us to ship fully functional 64-bit compilers on an accelerated timeline with various optimizations enabled in phases over several successive releases.
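A short sketch shows why address computations have to move to 64-bit arithmetic once arrays can exceed 4 GB. The helper names are hypothetical. The 32-bit version uses unsigned arithmetic so that the wraparound is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of element i computed in 32-bit unsigned arithmetic.
   The product wraps modulo 2^32 when it does not fit. */
uint32_t offset32(uint32_t i, uint32_t elem_size)
{
    return i * elem_size;
}

/* The same computation in 64-bit arithmetic. */
int64_t offset64(int64_t i, int64_t elem_size)
{
    assert(i >= 0 && elem_size > 0);
    return i * elem_size;
}
```

For element 600,000,000 of an array of 8-byte doubles, offset64 yields 4,800,000,000, while offset32 wraps to 505,032,704 and would silently address the wrong element.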
In the end, most of the new development for retargeting the PGI compilers to the AMD64 architecture went into the code generator, including register allocation, instruction scheduling, loop unrolling, instruction selection, SIMD vectorization, and various peephole optimizations. Rather than attempting to retarget our existing x86/x87 (IA32) code generator, we chose to start from scratch. The old code generator was written with many constraints, not the least of which was the lack of a useful integer register file.
There are also many differences in the application binary interface (ABI) for Linux between the IA32 and AMD64 architectures. The calling sequence for IA32 passes arguments on the stack; for AMD64, the first six integer/pointer arguments and first eight floating-point arguments are passed in registers. This required appropriate changes in the compiler, as well as a significant rewrite of the runtime support libraries. AMD64 also supports RIP-relative addressing, which we use extensively for more efficient addressing of static data.
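The register-passing rule can be written down as a simplified model. This sketch covers only scalar integer/pointer and floating-point arguments under the System V AMD64 ABI used on Linux; it ignores structures, variadic functions, and long double, and the function names are ours.

```c
#include <assert.h>

enum { INT_ARG_REGS = 6, FP_ARG_REGS = 8 };

/* Integer/pointer argument registers, in order, for the System V
   AMD64 ABI. Floating-point arguments use xmm0 through xmm7. */
static const char *const INT_REG_NAMES[INT_ARG_REGS] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Where the integer argument at a zero-based position is passed. */
const char *int_arg_location(int position)
{
    assert(position >= 0);
    return position < INT_ARG_REGS ? INT_REG_NAMES[position] : "stack";
}

/* Number of scalar arguments that spill to the stack. */
int stack_args(int n_int, int n_fp)
{
    int spilled_int, spilled_fp;
    assert(n_int >= 0 && n_fp >= 0);
    spilled_int = n_int > INT_ARG_REGS ? n_int - INT_ARG_REGS : 0;
    spilled_fp  = n_fp  > FP_ARG_REGS  ? n_fp  - FP_ARG_REGS  : 0;
    return spilled_int + spilled_fp;
}
```

A function taking six pointers and eight doubles therefore touches no stack slots for its arguments, whereas on IA32 all fourteen would be pushed. The 64-bit Windows calling convention is different and passes only the first four arguments in registers.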
The task of rehosting the PGI compilers and tools to 64-bit AMD64 processors, which is closely related to the 32- to 64-bit porting efforts encountered by most of our customers, turned out to be quite straightforward. The ability to continue using 32-bit data structures where it makes sense, with absolutely no performance penalty, is a key feature of the AMD64 architecture and approach that should be leveraged wherever possible. In addition, the ability to run existing 32-bit executables at full speed was leveraged in several instances. For example, given that there was no pressing need for the PGDBG or PGPROF graphical user interfaces (GUIs) to be ported to 64 bits, we were able to continue to use 32-bit GUIs in combination with all new underlying fully 64-bit debugging and profiling tools for the first several releases of our products. We expect this capability to port applications incrementally to be one of the key factors that drives a rapid migration from 32-bit x86 to 64-bit AMD64 architecture and compatible systems, including EM64T processor-based systems from Intel for which the PGI compilers are fully supported and optimized.
The task of retargeting and optimizing the PGI compilers for AMD64 architecture was much more time consuming and is in fact ongoing. This is an aspect of 64-bit porting that most developers are shielded from by a good compiler. But, as we have noted, there are differences such as x87 to SSE migration that can and often do result in porting issues that can be difficult to diagnose and correct. For these cases, we have simply tried to supply some obvious porting aids in the compilers and a complete toolset, including 32-bit and 64-bit debugging and profiling tools, to assist in the process.
DDJ