Inside Flash Memory

Direct execution increases performance, lowers cost

Brian L. Dipert

Brian, an applications manager for high-density flash memory components at Intel, is co-author of Designing With Flash Memory and author of The PCI Handbook (both published by Annabooks Press). Brian can be contacted at brian_l_dipert@ccm.hf.intel.com or bdipert@aol.com.

Computer-hardware architectures have traditionally been defined by the need to accommodate several levels of cache, DRAM main memory, boot nonvolatile memory, and magnetic mass storage. This hardware approach has driven the design and implementation of software architectures, including UNIX, DOS, Windows, OS/2, and System 7--each of which is stored on hard disk drives (HDDs) or ROM and paged into DRAM for execution.

But what about embedded-systems applications that don't require (or can't accommodate) gigabyte hard drives or 16/64-Mbit-based DRAM arrays? To meet cost, performance, power, size, weight, reliability, and other requirements, these systems typically execute operating systems and applications directly, instead of downloading them from hard drives. Flash-memory component arrays, cards, and drives make such direct execution possible. Moreover, the current generation of Embedded Flash RAMs delivers read performance that matches or exceeds that of DRAM, eliminating the redundancy of slow, nonvolatile storage memory and fast, volatile execution RAM.

Direct-Execute Compilers

Although flash memory is optimized for updatable code storage/execution and mass-storage applications, RAM is still the memory solution of choice for very-often-updated temporary data (video memory, stack, interrupt vector tables, and the like). This means that for code storage and execution, the compiler must be able to partition code and data and direct them to different areas of the system memory map.

A number of compilers can place code segments in nonvolatile flash memory; see Table 1(a). This locating function is accomplished either via command-line options or in a configuration file accessed by the linker/locater. In Example 1, for instance, a TARGET.LD locator configuration file for the GNU/i960 toolset interfaces the Intel i960JF processor to Intel 28F016XS Embedded Flash RAM. This hardware architecture is common in laser printers, datacom hubs and routers, RAID controllers, graphical X-terminals, and similar applications.

Figure 1 and Figure 2 present a system-block diagram and system memory map, respectively. The upper 4 MBs of the system memory map (corresponding to the CPU boot location), made up of flash memory, contain system code segments and static (non-updated) data tables and constants. System RAM stores the stack and temporary data tables. RAM also contains initialized data tables, stored in flash memory but copied to RAM and updated during system operation. Software provided by compiler vendors lets you copy these initialized data tables from flash memory to RAM on system boot. Software with data and code-structure addresses that have not been hardcoded can be dynamically placed by the linker/locator for the specific target system. This flexibility also enables porting of legacy code to new architectures.

Direct-Execute Operating Systems and Applications

A variety of direct-execute operating systems are optimized for applications that include embedded PC systems, real-time environments, and handheld-computing devices; see Table 1(b). These systems are not constrained to the standardized computer disk/DRAM memory architecture.

Figure 3, for instance, shows the Intel 386EX CPU interface to an Intel 28F400BX BootBlock flash memory. This hardware architecture can be utilized in embedded PC designs, such as industrial controllers, point-of-sale terminals, and handheld computers. The techniques I'll discuss here are based on Datalight's ROM-DOS but apply to other operating systems as well. The memory maps in Figure 4 and Figure 5 indicate that the BIOS, ROM-DOS, and direct-execute RXE applications in the ROMDISK all reside within and are executed directly from the flash-memory device. The system thereby requires only a small amount of system RAM for scratchpad memory, stack, and the interrupt-vector table. Because the operating system and applications are stored in flash memory, they can be easily updated.

If the system does not require video RAM, the 4-Mbit flash memory can directly map from addresses 80000H-FFFFFH (Figure 4). Alternatively, half of the flash memory can map to extended memory, freeing up the A0000H-BFFFFH memory segment (Figure 5).

Normally, the system boots BIOS and ROM-DOS directly from the BootBlock flash-memory main blocks. Parameter blocks store nonvolatile system data and are useful for integrating EEPROM functionality in many applications.

If the system is reset or loses power during flash-memory update, main-block contents may be left in an undetermined state. Hardware inversion of flash-memory-block locations, as shown in Figure 4 and Figure 5 (recovery operation), enables system boot and recovery from the kernel code stored in the hardware-locked, flash-memory boot block. This inversion can be implemented via motherboard jumper, back-panel switch, or special keyboard sequence.

How Does ROM-DOS Work?

When a flash memory containing ROM-DOS is placed in a DOS-based computing system that might otherwise contain a floppy or hard disk, ROM-DOS will boot from the flash memory.

In a traditional, DOS-based computer boot sequence, the BIOS follows these steps:

Initializes itself.

Performs a Power On Self Test (POST).

Initializes the interrupt vector table.

Initializes all devices.

Searches upper memory (segments C000-FFFF) for BIOS extensions.

Boots the system via an INT 19H.

The BIOS has already pointed INT 19H to an internal BIOS routine that will load the beginning of the operating system from a floppy or a hard disk.

ROM-DOS, on the other hand, contains a BIOS extension (ROM scan), a small routine that the BIOS detects and executes after running the POST. The ROM-DOS BIOS extension routine (only about 15 lines of assembly-language code) changes the interrupt 19H vector to point to the entry point for ROM-DOS, rather than the internal BIOS routine. Then the ROM-DOS BIOS extension returns and the BIOS again takes control. When the BIOS is ready to boot, it executes an interrupt 19H, causing a direct jump to the ROM-DOS entry point and booting ROM-DOS on the target system.

One of ROM-DOS's features, RXE, is an executable file format that minimizes RAM usage by executing the code directly out of flash memory. An EXE file is a single program block loaded sequentially into RAM to execute. An RXE program is like an EXE, but has two distinct program blocks: the code block and the data block. The code block is "fixed up" to reside and execute in an absolute area of flash memory. The data block is relocatable and is loaded into RAM by DOS before execution begins.

A DOS EXE file created using a high-level language such as Microsoft C, Borland C, or QuickBasic can be directly converted to an RXE file using utilities from Datalight. Assembly-language portions of a program may require some modification to run as an RXE. The source code for the EXE must be available, and in some cases, the language startup source-code files or library source-code files may be necessary.

To convert code to an RXE file, the DATA segment must have at least one fixup and there must not be any self-modifying code or overlays. In addition, the program must not assume that the data segment immediately follows the code segment, and all of the program code must be addressable in memory at all times. This last restriction invalidates the use of RXE programs from a paged-extended memory diskette.

During the conversion process, the user (or utility program) specifies the absolute segment address where the RXE file will reside and the relative offset of the data block within the RXE. The division between the code and data blocks is typically specified as a segment name or segment class (from the .MAP file), or as an absolute number.

The conversion is performed as follows:

All of the code-block fixups are resolved and then removed from the fixup list.

The code block is placed at the provided fixed address.

All data-block fixups in the code block are emulated or flagged as errors. The data block's run-time address is unknown until DOS loads the data block into RAM, so references to the data-block segments are unknown at conversion time. If an instruction that loads a register with data-block segment values is encountered in the code block, the converter emulates this load with an INT 18H call. INT 18H loads the correct register with the appropriate segment value at run time and returns to the program. For instance, Example 2(a) is changed to Example 2(b). If the code block references a data-block segment but does not load a register with it, these references cannot be emulated and are flagged as errors; see Example 2(c).
Data-block fixups are adjusted and left in the fixup list, completing the conversion.

The loading and execution of an RXE follows the standard operation of a direct-executable program; see Figure 6. The code block executes out of flash memory and the program data resides in RAM. If any of the data is initialized, it is copied from flash memory to RAM before execution. This offers considerable RAM savings over an EXE file format, which must execute completely out of RAM. Direct execution also preserves some additional flash-memory space because the EXE header is considerably reduced when converted to an RXE.

Datalight's RXE utility provides the ability to execute a program in place: "XIP" (PCMCIA standards-group lingo for "eXecute In Place").

Flash Translation Layer (FTL)

For mass-storage applications, flash-memory file-system software comprehends and effectively compensates for flash memory's large blocking structure compared to a 512-byte sector size for hard drives. It also spreads file writes across the entire flash-memory media to eliminate excessive erasing (cycling) of a subset of the available flash-memory blocks. This technique, which is called "wear leveling," extends flash-memory life.

Since flash memory does not behave like traditional, magnetic media, special software is necessary for managing stored files. This flash-file-system software comprehends flash memory's bit-level program (1-->0), block-level erase (0-->1), and wear-leveling requirements. Table 1(c) lists a number of flash-file-system options available for various operating-system alternatives, including, but not limited to, DOS.

For example, Figure 7 interfaces 4 MB of resident flash memory to the system DRAM controller and uses FTL software for disk-drive emulation. Applications with low-density mass-storage requirements include data loggers, medical instrumentation, fax machines and scanners, ruggedized terminals, and audio recorders.

Intel's 28F016XD Embedded Flash RAM includes a DRAM-compatible hardware interface for easy system integration. By being located on the DRAM bus, 28F016XD accesses can also be cached for highest-effective performance, just like DRAM accesses. The system memory map is a variation of the ROM-DOS design described earlier; see Figure 8. FTL is approximately 20-30 KBs in size, plus optional PCMCIA drivers. Note that the EMS window is optional; FTL disk drives can be accessed either through a sliding window or linearly, in protected mode; see Figure 9(c).

In the DOS world, the resident flash disk can interface to the operating system in one of several implementations; see Figure 9. Card Services is the PCMCIA software layer responsible for allocating system resources, such as the memory space for removable flash-memory cards. The Card Services layer allows any PCMCIA memory or I/O card to be supported. The implementation in Figure 9(a) might be used in systems that require both a Resident Flash Disk (RFD) and PCMCIA cards. FTL can be used to communicate with both the removable flash-memory cards and RFD. However, each would require its own Socket Services or low-level driver, plus unique hardware-interface logic.

Socket Services, a PCMCIA software standard originally developed for PCMCIA sockets, configures the window that accesses the RFD. There's no "socket" as such for a RFD, but many FTL developers still refer to this driver as "Socket Services." Either an 82365SL-compliant PCMCIA-card-interface controller or a hardware-page register can be used to implement the RFD window. The implementation in Figure 9(b) might be used in systems that require both RFD and PCMCIA memory cards, but not PCMCIA I/O cards.

How Does FTL Work?

FTL uses the existing operating system for upper-level file-handling capabilities, such as translating file-based operations to sector-based versions. By translating received, sector-based requests, an FTL driver appears to the upper-layer software as a sector-based, magnetic hard drive.

A typical hard drive has a sector size of 512 bytes. Upper-layer software expects to be able to fully modify these sectors at any given time. It is a requirement of flash-memory technology that the block containing the sector be fully erased in order to change any stored 0s back to 1s.

When upper-layer software tries to modify a stored sector, FTL remaps the request to an available, fully erased 512-byte area within the flash-memory array; see Figure 10. These remapped sectors are called "virtual small blocks" (VSBs). FTL subdivides each 64-KB flash-memory block into smaller VSBs. For example, each 28F016XD Embedded Flash RAM contains 32 64-KB blocks, or 4000 VSBs (each 512 bytes in size).

Depending on the types of files stored to the RFD, some natural wear leveling may occur. For example, data files are often deleted and updated, and therefore tend to migrate throughout the RFD. Configuration and executable files, on the other hand, are read frequently but deleted/updated rarely (if at all). For this reason, FTL software solutions supplement natural "passive" wear leveling with "active" erase-cycling allocation schemes. These vendor-proprietary techniques keep track of cycle counts for each RFD block and relocate static files at vendor-defined block-to-block cycle differential points. With effective wear leveling, flash memory's 10,000-100,000 block erase-cycling specifications exceed the requirements of most system-lifetime file-update profiles. RFD reliability far exceeds that of alternative magnetic hard drives, given reasonable lifetime-cycling usage.

When a file is updated or deleted, the FTL file system marks the old file as "dirty." As FTL updates files, more and more of the total RFD area transitions from clean/available to dirty/unavailable.

At some vendor-specific point, FTL will determine that a block with many dirty virtual blocks should be erased and converted back to clean space. The file system first copies the remaining clean VSBs in the dirty block to another free or "spare" block before erasing and reclaiming the block. The more spare blocks reserved for garbage collection, the quicker the process, but the less available flash-memory storage capacity for a given array size.

FTL provides an effective solution for interfacing legacy operating systems (originally intended for magnetic disk drives) to flash memory with its unique benefits and characteristics. FTL's efficient usage of flash-memory eliminates hard-drive-like block-size requirements, enabling use of FTL with a range of flash-memory technologies optimized for lowest silicon cost.

Flash versus Other Memory Approaches

The redundancy of traditional memory hierarchies does little to optimize the performance of today's fast microprocessors. Both system boot time and application task-switching response bog down because of disk drive spin-up time and nonvolatile-memory-to-RAM file-load delays. Furthermore, DRAMs must be constantly refreshed by the memory controller to preserve their stored data. The hard drive draws current with every motor rotation and may actually draw more average current if it is "parked" too frequently, since spin-up causes high current draw.

Magnetic media contain moving parts and have narrow operating-temperature ranges. Though improving, these devices have unacceptably low tolerance to shock, vibration, and movement during read/write in ruggedized and mobile environments.

All in all, multiple levels of memory mean multiple sources of system-memory cost and multiple levels of potential component failure. Excessive-heat generation also impacts system lifetime, and the many board traces required are a significant manufacturing challenge and reliability headache. DRAMs are subject to single-bit errors caused by alpha-particle radiation, and mission-critical systems subsequently add costly EDAC circuitry to circumvent this problem.

But just as today's embedded-system designs are revolutionary, so too are their memory architectures. Incremental improvements to the traditional approach no longer measure up to the potential of a flash-memory-based alternative. Intel's ETOX flash-memory technology, for example, enables fast reads (as fast as an effective 30 ns) and writes (as fast as 30 MB/sec burst and 500 KB/sec sustained per component). A range of densities, from 256 Kbit to 32 Mbit, in x8 and x16 interface options and various blocking schemes, address a variety of system configurations.

Flash memory provides both the high read performance of DRAM and SRAM, and the nonvolatility of ROM and hard-drive-like write performance. Lengthy software-load overhead and task-switching delays are eliminated. Code runs as fast or faster than that in DRAM with less system-hardware complexity.

The price of a high-density flash-memory array is comparable to that of a DRAM/ROM or DRAM/small hard drive combination (including interface and control logic). Unlike ROM, flash memory is in-system updatable to keep system costs low both initially and throughout system lifetime.

Being nonvolatile, flash memory requires no periodic refresh and no constant application of power to retain stored information. The redundancy of multiple memory technologies in the traditional memory architecture results in multiple sources of power consumption. These multiple memories must be summed to determine the true memory subsystem power draw. With flash memory, there is only one very efficient memory technology, consuming little system power.

Compact flash memory TSOP packaging provides up to 18.4 MB/in² density capability, with components mounted on both sides of the system board. Eliminating memory duplication saves substantial board space and yields higher system ruggedness and reliability. Very-low power consumption reduces the size and weight of system batteries and power supplies.

--B.L.D.

Figure 1: Intel i960r JF microprocessor interface to Intel 28F016XS Embedded Flash RAM memory delivers 0 wait state burst read performance.

Figure 2: System memory map for Intel i960 JF/28F016XS Embedded Flash RAM memory design.

Figure 3: Intel386EX interface to Intel 28F400BX BootBlock flash memory enables high-performance, upgradable, direct-execute code.

Figure 4: System memory map for Intel 386EX/28F400BX BootBlock flash-memory design.

Figure 5: Alternative system memory map for Intel 386EX/28F400BX BootBlock flash-memory design.

Figure 6: Datalight's RXE utility and file format segments code and data portions of DOS applications to enable direct execution out of flash memory.

Figure 7: The 28F016XD Embedded Flash RAM memory enables DRAM SIMM pinout compatibility and a no-glue interface to system DRAM controllers.

Figure 8: System conventional memory map for Intel 28F016XS Flash RAM RFD design (28016XD array located at/above 1-MB address).

Figure 9: Implementation options for resident flash-disk-interface software (a) for systems that require both an RFD and PCMCIA memory and I/O cards; (b) for those that require an RFD and PCMCIA memory card, but not PCMCIA I/O cards; (c) for systems that require only an RFD.

Figure 10: FTL enables usage with a wide range of flash-memory block sizes and minimizes per-block cycling by storing new/updated file versions in available virtual small blocks.

Table 1: (a) Direct-execute software compiler/linker vendors; (b) direct-execute operating-system vendors; (c) flash-memory file-system vendors.

Application Category     Vendor                 Application
(a)
Compiler/linker          Cygnus
                         Embedded Performance
                         GreenHills
                         MetaWare
                         Microtec Research
                         Software Development Systems
Linking locators         Phar Lap
 and libraries           Systems & Software
 for x86 compilers

(b)
Direct-execute
 DOS/DOS-based OSs       General Software       Embedded DOS
                         Datalight              ROM-DOS
Handheld Computing OS    GeoWorks               GeoWorks
                         General Magic          MagicCap
                         Apple Computer         Newton OS
MS-DOS ROM Executable,
 MS-Windows ROM
 Executable              Annabooks Press
Real-time, embedded OSs  Chorus Systems         Chorus Nucleus
                         Lynx Real-Time Systems LynxOS
                         Microware Systems      OS/9
                         Integrated Systems     pSOSystem
                         JMI Software Systems   PSX
                         QNX Software Systems   QNX
                         Spectron Microsystems  SPOX
                         VentureCom             Venix
                         Microtec Research      VRTX
                         WindRiver Systems      VxWorks

(c)
Flash File System (FFS)  Datalight              CardTrick FFS
                         Annabooks              Microsoft MS-FFS2
                         SystemSoft             Microsoft MS-FFS2
Flash Translation Layer
 (FTL)                   Datalight              CardTrick FTL
                         SCM Microsystem        FTL-FFS
MS-FFS2 (Enhanced)       SystemSoft             SystemSoft FTL
                         M-Systems              TrueFFS

Example 1: TARGET.LD file segments code (into flash memory) and data (into RAM).

MEMORY
{
    FlashRAM:   o=0xFFC00000, 1=0x400000
    DataRAM:    o=0x00000000, 1=0x10000
}
SECTIONS
{
    .text:  ;Code Segment
    {
    } >FlashRAM
    .data:  ;Data Segment
    {       ;Initialized Data Variables
    _ram = .;
    } >DataRAM
    .bss:       ;Data Segment
    {       ;Uninitialized Data Variables
    } >DataRAM
}

Example 2: Converting a DOS EXE file to an RXE file.

(a)

MOV     AX,SEG DGROUP
(b)
INT     18H
DB      0

(c)
MOV     AX,DS_var
    .
    .
    .
DS_var: DW  DGROUP