February 1995/Programming Flash Memory

Real-Time/Embedded Systems

Programming Flash Memory

Mike Cepek

Mike is a Senior Software Design Engineer in the Research and Development department of Management Graphics, Inc., Minneapolis. Mike started programming in the 7th grade, and later earned his B.S. in Computer Science at the University of Minnesota/Institute of Technology in 1984. Mike can be reached on the Internet at cepek@mgi.com.

Introduction
Flash memory is an increasingly popular choice over EPROMs for system designers, especially for embedded systems. The devices available today allow in-circuit reprogramming of megabytes of "ROM" with few or no additional parts. This article describes the additional software required to program these devices.

Why We Like Flash
One major benefit from flash is faster software development turnaround time. Burning and installing EPROMs for each new version of firmware slows down development. While downloading the firmware to RAM can eliminate these steps, it assumes that we have adequate (i.e. more) RAM for the task, which is not always a good assumption. This is because in an embedded system, code and static data usually reside in ROM. Note also that the ROM and RAM versions of the code are not identical — the ROM and RAM address spaces will be different, leading to different firmware relocation addresses. Using flash memory avoids all these development issues.
Our embedded systems currently use two types of flash memory devices made by AMD: the Am29F010 and Am29F040. These devices are single-voltage (5 V), byte-wide memories with 128 KB and 512 KB capacity, respectively. High storage capacity is just one of the many device features available in flash memories (see sidebar). For example, both of these devices divide their memory into eight equal-size sectors, each of which can be independently erased and reprogrammed.
These two flash devices are compatible with the JEDEC pinout standard for EPROM devices. This allows us to use flash during the development process and to use off-the-shelf EPROMs in production where desired.
Our customers also appreciate flash because it enables them to install firmware updates without disassembling the product to replace EPROMs. We have even provided "media-less" firmware updates to customers via e-mail, which they can download to the product and save in the flash memory.
References [1] and [2] discuss other applications for flash, along with common problems in embedded system flash implementations and their solutions.

Implementation Background
Code written for embedded systems is often highly specific to one platform. However, by applying principles of generality and modularization, I believe we have produced code that is reasonably portable and adaptable.
Our implementation uses two byte-wide flash chips in parallel, allowing direct 16-bit reads and writes with minimum access time (see Figure 1). This leads to some interesting program timing constraints, to be covered later in this article.
The remainder of this article discusses the production code we use with these devices. I describe the data structures defined in the file fmutl.h first, and follow with a bottom-up presentation of the routines, defined in fmutl.c. (These two source files are available on this month's code disk. The disk includes another source module which uses these routines for diagnostic and unit testing.)

Data Structures
Listing 1 shows the header file, fmutl.h. It contains the following:

typedefs for the FMINFO and FMERR structures

an extern for an FMERR instance

prototypes for the global routines in fmutl.c
The FMINFO structure holds information describing the characteristics of the flash devices. This data allows the code to adapt to different flash devices still having similar overall characteristics (e.g., programming algorithm). The fm_status routine fills in the FMINFO member data.
The FMERR structure provides error information to the calling routine. A global instance of this structure is defined in the fmutl.c module, and fmutl.h provides an extern declaration.
Listing 2 shows the manifest constants and static variable definitions for the program. (On the code disk, Listing 2 through Listing 7 comprise fmutl.c.)
Because our code accesses two byte-wide devices in parallel, we've duplicated the command code bytes in the command constants — the commands will be written to, and data will be read back from, both devices simultaneously.
The constants FM_DATA1 and FM_DATA2 are offsets from the beginning (base) of the flash memory space, and will be resolved to actual system addresses in the flash memory address space at run time. The global definition of the fm_err structure allows fmutl.c to export error information to calling routines.
DevTable is an array of FMINFO structures that describes the flash devices supported by the program. The code uses this data at runtime to accommodate a variety of devices. (One of our product models actually does use two sizes of flash devices in the same unit.)

Flash Programming Routines
The routines in fmutl.c are ordered in traditional bottom-up-first fashion. Accordingly, I begin my discussion with the low-level routines, followed by mid- and high-level routines.
The fm_error routine (Listing 3) needs little explanation. It simply stores its parameters in the global fm_err structure. We make this routine global to allow calling modules to handle flash-related errors without duplication of code. For example, a diagnostic test verifying data written to the flash could handle a verification error by calling fm_error with its own unique error code.

Picking the lock
In normal operation, flash memory behaves like ROM: values may be read from it, but writes are ignored. Attempts to write to ROM are usually programming errors, and are thus rare. Flash devices take advantage of this rarity by allowing a series of special writes to act as a "key." When a flash device recognizes the predetermined sequence, it becomes available for erasing, writing, and other operations.
The fm_cmd routine (Listing 3) recites the magic incantation to unlock the devices into the special mode selected by the third parameter. If any of the writes contain unexpected data, the devices revert to read-only mode.
We use the static variable pBase to simplify parameter passing. pBase is initialized by the two global routines fm_status and fm_write, to point to the base address of the flash memory space. (In our products with two sets of flash devices, one set starts at system address 0x00000000, the other at 0x64000000.)
Because flash memory is implemented in the system as normal read/write memory, there are no special timing considerations during normal operation. We can use normal memory read and write accesses from high-level code to perform all the special device functions.
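As a rough sketch of what such an unlock routine might look like (this is not the code from Listing 3), the fragment below issues the standard AMD three-write sequence through a base pointer. The names BOTH, UNLOCK_ADDR1, UNLOCK_ADDR2, and unlock_cmd are hypothetical, and the sketch assumes two byte-wide devices on a 16-bit bus, so every command byte is duplicated in both byte lanes and offsets are in words.

```c
#include <stddef.h>
#include <stdint.h>

/* Replicate an 8-bit command into both byte lanes of the 16-bit bus. */
#define BOTH(b) ((uint16_t)(((uint16_t)(b) << 8) | (uint16_t)(b)))

/* Word offsets of the two unlock addresses (AMD command definitions). */
#define UNLOCK_ADDR1 0x5555u
#define UNLOCK_ADDR2 0x2AAAu

/* Mode-select commands written as the third cycle. */
#define CMD_AUTOSELECT BOTH(0x90)
#define CMD_PROGRAM    BOTH(0xA0)
#define CMD_ERASE      BOTH(0x80)
#define CMD_RESET      BOTH(0xF0)

/*
 * Issue the unlock sequence, ending with the command that selects the
 * special mode. "base" points at the start of the flash address space.
 */
static void unlock_cmd(volatile uint16_t *base, uint16_t cmd)
{
    base[UNLOCK_ADDR1] = BOTH(0xAA);  /* first key write  */
    base[UNLOCK_ADDR2] = BOTH(0x55);  /* second key write */
    base[UNLOCK_ADDR1] = cmd;         /* mode command     */
}
```

Calling unlock_cmd(base, CMD_AUTOSELECT) would place both devices in AutoSelect mode; writing CMD_RESET to the devices returns them to normal read mode.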

Mid-Level Routines
Listing 4 shows the routines fm_protected and fm_status. The former uses a special mode of the devices to query if any of the memory sectors are protected. Protected sectors cannot be erased or reprogrammed, and typically store bootstrap or BIOS type code. (We could have implemented the fields in FMINFO for reporting protected sector information as bits within a word or a long, but this would place an artificial limit on the number of sectors in a device. Some flash devices have as many as 1024 sectors.)
FMINFO represents the protected sectors using a start sector and number of sectors. This method assumes that the case of non-contiguous protected sectors is not interesting, since bootstrap and BIOS code is usually in a single area located at the highest or lowest area of memory. We use the shift-right operator, as found in the first two lines, throughout this module to convert byte lengths to word lengths, since all pointer arithmetic is word-based.
To sense which sectors are protected, fm_protected calls fm_cmd to place the devices in AutoSelect mode. In this mode, reads from certain offsets in the flash address space will return chip status information instead of real data. The constants with the FM_AO_ prefix identify these special address offsets.
Inside the loop, the routine reads a word from each sector. The FM_PROT_SECT_MASK bits within the word indicate whether that sector is protected in either device.
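The scan can be illustrated with a small sketch that works on an array of status words, one per sector, as they would be read back in AutoSelect mode. The struct prot_range type, the find_protected name, and the mask value 0x0101 (bit 0 of each byte lane) are assumptions for illustration; the real constants live in fmutl.c.

```c
#include <stdint.h>

/* Assumed: bit 0 of each byte lane reports sector protection. */
#define PROT_SECT_MASK 0x0101u

/* Hypothetical result: one contiguous run of protected sectors. */
struct prot_range {
    unsigned first;  /* index of the first protected sector */
    unsigned count;  /* number of protected sectors, 0 if none */
};

/*
 * status[i] is the word read from sector i while in AutoSelect mode.
 * A sector counts as protected if either device reports it so.
 * Like the article, this assumes protected sectors are contiguous;
 * for a non-contiguous set, count is simply the total found.
 */
static struct prot_range find_protected(const uint16_t *status,
                                        unsigned nsect)
{
    struct prot_range r = { 0, 0 };
    unsigned i;

    for (i = 0; i < nsect; i++) {
        if (status[i] & PROT_SECT_MASK) {
            if (r.count == 0)
                r.first = i;
            r.count++;
        }
    }
    return r;
}
```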
The fm_status routine is global. It uses AutoSelect mode to read back ID bytes indicating the device manufacturer and the device type. The routine scans the DevTable array of known device information for a match, and initializes the static FMInfo structure with device-specific information. fm_status returns a non-zero value if the devices are known, which can be used as a quick check for functioning flash.
Also, fm_status can optionally copy the FMInfo static structure to a structure designated by the caller. Diagnostics, for example, could display the pDevName string to confirm the type of devices installed.
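A table lookup of this kind might be sketched as follows. The DEVINFO type and find_device function are hypothetical stand-ins for FMINFO and the search inside fm_status, and only the SectorSize and pDevName field names come from the article. The ID bytes shown (manufacturer 0x01, devices 0x20 and 0xA4) are the AutoSelect codes AMD documents for these parts, and the sector sizes follow from eight equal sectors per device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the per-device information. */
typedef struct {
    uint8_t       MfgID;       /* manufacturer ID byte            */
    uint8_t       DevID;       /* device ID byte                  */
    unsigned long SectorSize;  /* bytes per sector, per device    */
    unsigned      NumSectors;  /* sectors per device              */
    const char   *pDevName;    /* printable name for diagnostics  */
} DEVINFO;

static const DEVINFO DevTable[] = {
    { 0x01, 0x20, 16UL * 1024, 8, "Am29F010" },
    { 0x01, 0xA4, 64UL * 1024, 8, "Am29F040" },
};

/* Return the entry matching the ID bytes, or NULL if unknown. */
static const DEVINFO *find_device(uint8_t mfg, uint8_t dev)
{
    size_t i;

    for (i = 0; i < sizeof DevTable / sizeof DevTable[0]; i++)
        if (DevTable[i].MfgID == mfg && DevTable[i].DevID == dev)
            return &DevTable[i];
    return NULL;
}
```

A NULL result corresponds to fm_status returning zero: either the devices are of an unsupported type or the flash is not responding.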

Erasing Sectors
fm_sector_erase (the second routine in Listing 5) uses the Sector Erase feature of the devices to prepare them for being written with new data. Sector Erase erases all locations within a sector at once, which is where the term "flash" comes from. Conventional EEPROM technology erases each location separately. Any number of flash sectors can be erased concurrently.
One curious thing about flash is that this "electrically erased" condition results in all bits being set to 1. A related curiosity is that 1 bits erase faster than 0 bits. (See [7] for a good explanation of how flash memory operates at the cell level.) The time required for erasure also varies with device temperature and the number of lifetime erase/write cycles performed. Figure 2 shows how erase and write times increase with use for an Am29F010. See also [8] for more information on erasure time.
In anticipation of success — where fm_error is not called — fm_sector_erase and fm_write both clear the fm_err structure to all zeros. This action prevents uninitialized values from being left for the caller in fm_err. Note that fm_err cannot be initialized in its declaration, since embedded system compilers tend to place initialized data into ROM rather than RAM.
The construction and use of the SectorAddrMask variable deserves mention. The code uses this variable to convert a pointer into a flash sector to a byte offset from the base of that sector. This conversion assumes that the base address of the flash (FBASE) in the system address space is aligned to an even multiple of the flash address range (FSIZE); that is, FBASE mod FSIZE must be zero. This assumption is significant when supporting devices of various sizes.
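The masking arithmetic can be shown with plain integers. In this sketch, sector_offset, sector_base, and base_is_aligned are hypothetical helpers, addresses are treated as 32-bit values, and the sector size is assumed to be a power of two (with two devices in parallel, the combined sector is twice the per-device size).

```c
#include <stdint.h>

/* Byte offset of addr from the start of its sector.
 * sector_size must be a power of two. */
static uint32_t sector_offset(uint32_t addr, uint32_t sector_size)
{
    uint32_t SectorAddrMask = sector_size - 1u;
    return addr & SectorAddrMask;
}

/* System address of the first byte of the sector containing addr. */
static uint32_t sector_base(uint32_t addr, uint32_t sector_size)
{
    return addr & ~(sector_size - 1u);
}

/* The masking only works if FBASE mod FSIZE is zero. */
static int base_is_aligned(uint32_t fbase, uint32_t fsize)
{
    return (fbase % fsize) == 0;
}
```

For example, with a 32 KB combined sector, address 0x64008010 lies 0x10 bytes into the sector that starts at 0x64008000.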
Prior to the first loop in fm_sector_erase, the devices are unlocked for erasure. The first loop completes the sector erasure command sequence by writing a secondary command code to an address within each sector to be erased. The devices consider the erase command sequence complete if they receive no secondary command within a certain time period (e.g. 80 msec), so it may be necessary to disable interrupt service routines during this time.
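The command sequence described here might look like the following sketch, which is not the code from Listing 5. It assumes the standard AMD sector erase sequence (two unlock cycles, the erase setup command 0x80, two more unlock cycles, then 0x30 written into each sector), duplicated across both byte lanes. The function name and the array of word offsets are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Start erasing n sectors. sect_word_off[i] is the word offset of any
 * location inside sector i. On real hardware the writes in the loop
 * must all arrive within the devices' time-out window, so interrupts
 * may need to be disabled around this call.
 */
static void start_sector_erase(volatile uint16_t *base,
                               const size_t *sect_word_off, unsigned n)
{
    unsigned i;

    base[0x5555] = 0xAAAA;   /* unlock, cycle 1      */
    base[0x2AAA] = 0x5555;   /* unlock, cycle 2      */
    base[0x5555] = 0x8080;   /* erase setup command  */
    base[0x5555] = 0xAAAA;   /* second unlock pair   */
    base[0x2AAA] = 0x5555;

    for (i = 0; i < n; i++)
        base[sect_word_off[i]] = 0x3030;  /* sector erase, per sector */
}
```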
The second loop in fm_sector_erase waits for the erase operation to complete. The erase and write processes are controlled by an algorithm internal to the devices. When these internal algorithms are running, reading from the devices returns status information instead of ROM data. In particular, the high bit of the byte is inverted from the final, expected value. This operation guarantees that when the expected value occurs, the erase or write operation has completed successfully, which leads to a very simple polling loop for success. The erase is complete when the readback of any erased location returns FM_ERASE_VAL.
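A minimal sketch of such a polling loop, with a fail-safe iteration limit, is shown below. The names FM_ERASE_VAL and E_FS_ITER appear in the article, but the values given here are assumptions (0xFFFF for two erased byte lanes, and an arbitrary iteration count that would need tuning for a real system). poll_done is a hypothetical helper.

```c
#include <stdint.h>

#define FM_ERASE_VAL 0xFFFFu   /* assumed: both devices read all ones */
#define E_FS_ITER    100000UL  /* assumed fail-safe limit; tune per system */

/*
 * Poll a flash location until it reads back the expected final value.
 * Returns 1 on success, 0 if the fail-safe count runs out first.
 */
static int poll_done(const volatile uint16_t *p, uint16_t expected,
                     unsigned long max_iter)
{
    while (max_iter-- > 0)
        if (*p == expected)
            return 1;
    return 0;
}
```

After an erase, the call would be poll_done(pSector, FM_ERASE_VAL, E_FS_ITER); a zero return means the caller must go on to decide whether a device timed out or something else went wrong.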

Device Time-outs
The fail-safe loop in fm_sector_erase ensures that malfunctioning hardware does not cause the firmware to hang. An earlier version of this routine assumed that completion or time-out were the only two possible outcomes. However, the specific tests used for completion and time-out ignore many other possible bit combinations which, though rare, should be tolerated by robust code.
A "feature" of that earlier code version was the lack of an implementation-dependent timing constant like E_FS_ITER. Using such a constant may not be desirable, but I don't know of a way to handle all three cases (success, time-out, and other) without assuming another implementation-dependent resource, such as a watchdog timer.
fm_sector_erase calls the fm_tofs_error routine (Listing 5) if the fail-safe loop expires. Its job is to determine whether or not a timeout occurred. This process sounds simple enough, but two things complicate the code here: timeout detection itself and parallel devices.
Figure 3 illustrates the byte data returned by polling an erased location during an erase cycle. While the erase is in progress, bit 7 (DQ7) shows the inverse of the final expected data. If a location cannot be successfully erased, bit 5 (DQ5) will go high when the device gives up trying and times out. The data book warns that DQ5 and DQ7 may change at the same time. Thus, a single read showing DQ5 high cannot be trusted as indicating a timeout, since DQ7 may be transitioning to final valid data, which might coincidentally have DQ5 set.
A further complication is that we are using two devices in parallel. The case where only one device times out must be properly handled. I will leave as an exercise for the reader to figure out the nuances of Boolean logic in the first three lines of code.
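For readers who would rather not work the exercise from scratch, here is one way the check could be written. It is a sketch based on the read-twice rule described above and is not the Boolean expression used in fm_tofs_error. All names are hypothetical, and the two arguments first and second are assumed to be consecutive reads of the same location.

```c
#include <stdint.h>

#define DQ7 0x80u  /* data polling bit: inverted while busy */
#define DQ5 0x20u  /* exceeded-timing-limit bit             */

/* Decide whether one byte-wide device has timed out. */
static int lane_timed_out(uint8_t first, uint8_t second, uint8_t expected)
{
    /* Busy (DQ7 still inverted) with DQ5 raised on the first read ... */
    if (((first ^ expected) & DQ7) && (first & DQ5)) {
        /* ... is a timeout only if the second read is still busy. */
        return ((second ^ expected) & DQ7) != 0;
    }
    return 0;
}

/*
 * Check both devices of a 16-bit pair.
 * Returns a mask: bit 0 = low-byte device, bit 1 = high-byte device.
 */
static unsigned timed_out_devices(uint16_t first, uint16_t second,
                                  uint16_t expected)
{
    unsigned mask = 0;

    if (lane_timed_out((uint8_t)(first & 0xFF), (uint8_t)(second & 0xFF),
                       (uint8_t)(expected & 0xFF)))
        mask |= 1u;
    if (lane_timed_out((uint8_t)(first >> 8), (uint8_t)(second >> 8),
                       (uint8_t)(expected >> 8)))
        mask |= 2u;
    return mask;
}
```

Handling each byte lane separately keeps the case where only one device times out distinct from the case where both do.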

Programming New Data
Listing 6 shows the fm_cpy routine. fm_cpy writes new data into sectors which are assumed to have been erased. The data books refer to this operation as "programming" the data. (The parameter order purposely mimics memcpy.)
Each byte to be written requires a separate unlock sequence. We applied heavy optimization techniques to this routine to minimize the effect of this overhead. Our hand-optimization (mentioned in the function header comment) is aware of the kind of assembly code that will ultimately be generated here. For example, using a short instead of an int for the fail-safe counter allowed the compiler to generate a faster looping construct. An 11 percent improvement in speed resulted from this one change. This routine is then both optimal for our specific implementation and portable.
The bulk of this routine will look familiar from previous routines. For efficiency, we replicate a call to fm_cmd using register pointers and data. We then write the bytes to be programmed to the desired location. The completion polling loop and error handling mimic the code in fm_sector_erase.
As indicated by a comment statement, two program lines occurring before the inner loop perform a kind of parallel processing. After we write new data to the current location we must wait for the write to complete. We can execute more program statements in the meantime. Analysis shows that these two lines execute well before the write completes. This results in fewer polling iterations, which speeds up the innermost loop.

The High-Level Routine
The fm_write routine (Listing 7) makes use of all the other routines in fmutl.c to accomplish its function: writing an arbitrary sized buffer of data to an arbitrary area of the flash.
We chose to abstract to this level for a number of reasons. Writing new data to all of flash, our most common operation, requires a simple call:
fm_write(FLASH_BASE, FLASH_BASE, pNewData, DataSize, 0);
The complexities of sector sizes, erasing before writing, device identification, and so on are encapsulated.
Because sectored flash devices can only erase entire sectors, writes that are not aligned to sector boundaries require special handling. Rather than implement this complexity in various separate locations, we chose to encapsulate it as well.

Optional Scratch Buffer
The encapsulation approaches presented thus far cannot hide one aspect of sectored flash memory which differs from normal RAM: It's impossible to change just one byte within a sector.
Consider the case where a program calls fm_write to write a single word to a sector already containing healthy data. Erasing the sector and writing the new word leaves the remaining words in that sector changed to 0xFFFF — not likely what the caller expected.
To get around this problem, the pScrBuf parameter lets the caller pass a pointer to a buffer. If pScrBuf is not NULL, fm_write uses it to save and restore any existing flash sector data not being reprogrammed.
If pScrBuf is NULL, fm_write transfers the caller's new data directly to the flash sector. Non-buffered transfers accommodate the case in which the caller wishes to transfer a large block of data to flash memory and doesn't care what happens to the remaining (unwritten) portion of the sector. Giving the caller this control is a performance feature, since not using the buffer saves time. Allowing the caller to provide the buffer also avoids forcing fm_write to acquire an implementation-dependent resource (memory) at run time. The caller can use fm_status and the SectorSize field of the FMINFO structure to size this buffer.
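The save-and-restore idea can be demonstrated with ordinary RAM standing in for a flash sector. In this sketch, update_sector is a hypothetical function; memset plays the role of the sector erase and memcpy plays the role of programming. It assumes off + len does not exceed sector_size.

```c
#include <stddef.h>
#include <string.h>

#define ERASED 0xFF  /* erased flash reads as all ones */

/*
 * Replace len bytes at offset off within one sector. If scratch is
 * non-NULL it must hold sector_size bytes, and the untouched bytes
 * are preserved; otherwise they are left in the erased state.
 */
static void update_sector(unsigned char *sector, size_t sector_size,
                          size_t off, const unsigned char *src,
                          size_t len, unsigned char *scratch)
{
    if (scratch != NULL)
        memcpy(scratch, sector, sector_size);    /* save old contents  */

    memset(sector, ERASED, sector_size);         /* "erase" the sector */
    memcpy(sector + off, src, len);              /* write the new data */

    if (scratch != NULL) {
        memcpy(sector, scratch, off);            /* restore the front  */
        memcpy(sector + off + len, scratch + off + len,
               sector_size - off - len);         /* restore the tail   */
    }
}
```

Passing NULL for scratch skips both copies, which is the faster path the caller chooses when the rest of the sector does not matter.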

The fm_write algorithm
The function header comments describe the external interface. The following pseudo-code provides an overview of how the routine works:

fm_status();
if (bad parameters)
    fm_error();
if (not starting at a sector boundary)
    handle partial write to first sector;
if (data doesn't end on a sector boundary)
    setup for partial last sector write;
erase the remaining sectors;
write the remaining user data;
if (last sector was partial && pScrBuf)
    write pScrBuf data to last sector;
The bulk of this routine handles the partial sector cases. If the caller only writes complete sectors, the extra complexity isn't used.
The first major conditional, based on keep_before, handles the case where the beginning of the write doesn't start on an even sector boundary. In this case, the algorithm computes the following key length values:

keep_before = number of bytes before user data
wri_len     = number of bytes of user data
keep_after  = number of bytes after user data
These values reflect the start and end boundaries of the user data within the initial sector. If a scratch buffer was provided, the buffer is used to save the original contents of the sector. Next, the routine erases the sector and writes the new user data. Then, the original sector data before and/or after the new user data is restored, if a scratch buffer was provided.
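The three lengths for the first sector can be computed as in the following sketch. The variable names match the article, but struct split and first_sector_split are hypothetical, and the calculation assumes a power-of-two sector size and a byte offset measured from the flash base.

```c
#include <stddef.h>

struct split {
    size_t keep_before;  /* bytes in the sector before the user data */
    size_t wri_len;      /* bytes of user data landing in the sector */
    size_t keep_after;   /* bytes in the sector after the user data  */
};

/*
 * offset: byte offset of the write from the base of flash.
 * len:    total number of bytes the caller wants written.
 */
static struct split first_sector_split(size_t offset, size_t len,
                                       size_t sector_size)
{
    struct split s;
    size_t room;

    s.keep_before = offset & (sector_size - 1);
    room = sector_size - s.keep_before;       /* space left in sector */
    s.wri_len = (len < room) ? len : room;
    s.keep_after = room - s.wri_len;
    return s;
}
```

The three values always sum to the sector size. A write of 0x20 bytes at offset 0x4010 with 16 KB sectors gives keep_before = 0x10, wri_len = 0x20, and keep_after = 0x3FD0.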
The next major conditional, based on wri_len, handles the case where the write does not end on an even sector boundary. The wri_len and keep_after variables are recomputed. If a buffer was provided it is loaded with the original contents of the last sector.
Next, the routine erases the remaining sectors and writes them with the new user data. Finally, the routine writes the remainder of the final sector from the scratch buffer, if provided.

Testing
The code disk contains a third source module, flashtst.c, that we have used to unit test these routines. One routine performs a few basic memory tests (zeros, ones, unique data) suitable for a manufacturing check-out process. Another routine performs an exhaustive test of all combinations of writes on and off sector boundaries, which helps test the code integrity.
This module can be compiled as a stand-alone utility or as an independently linkable module. While modularized, it does rely on external routines available in our embedded system for user input and output.

The Tip of the Iceberg
The code presented here works well for our purposes, but there are many extra features supported by the devices that you may want to use:

Many flash devices support an erase suspend feature. Recall that during erase, data reads return status information rather than real data. With some hardware assistance an application could implement "background erasing." By using interrupt handlers to suspend and resume erasing, data could be read from sectors not being erased, which would allow the application to continue while the erase occurred during idle bus cycles.

Creative algorithms could also take advantage of the fact that flash writes can change 1-bits to 0-bits (but not vice-versa) without an erase cycle.

"Wear leveling" is a technique commonly applied to heavy use flash implementations. You can maximize device lifetime with low-level remapping of the sectors as necessary to keep them all at roughly the same number of erase and write cycles. You can also automatically "retire" sectors as they become unusable.

File systems are a typical next step beyond using flash as a better EPROM. There are various hardware and software solutions available today implementing this logical next step.
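As a small illustration of the 1-bits-to-0-bits property mentioned in the list above, the check below tells whether a new value can be programmed over an existing one without erasing first. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Programming can only clear bits. A new value can therefore be
 * written without an erase if every 1 bit it needs is still 1 in
 * the value currently stored.
 */
static int can_program_without_erase(uint16_t old_val, uint16_t new_val)
{
    return (old_val & new_val) == new_val;
}
```

An erased word (0xFFFF) accepts any value, while a word holding 0xFF0F cannot become 0x00F0 without an erase.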

Source Sources
Source code for interfacing with flash devices is available. Intel provides assembly and C code in their flash data books (see [6]). AMD offers Embedded Development Kit Driver Software free for the asking or via their California BBS (contact your local rep).

Conclusion
Flash memory allows embedded systems to easily adapt to changing requirements. After working with flash memory devices for a while now in a fast-paced development environment, I can't imagine going back to burning EPROMs. And the features flash provides to our customers make our products even more competitive.

References
[1] Arvind Rana. "Designing and Debugging with Flash ROMs," Proceedings of the Sixth Annual Embedded Systems Conference, Volume 2, September 1994, pp. 145-153.
[2] Brian Dipert and Markus Levy, Designing With Flash Memory (Annabooks, 1994).
[3] Advanced Micro Devices, Inc. 1994/1995 Flash Memory Products Data Book/Handbook.
[4] Atmel Corporation 1994 Single Voltage Flash Memory Design & Application Book.
[5] Intel 1994 Flash Memory data books Volumes I & II.
[6] ibid., Vol. I, Chapters 3 and 4, Application Notes.
[7] ibid., Vol. II, pp. 9-1 to 9-5, Engineering Report ER-20.
[8] ibid., Vol. II, p. 9-11, Figure 5.