A C++ Class for Editing Large Files

How well does your editor handle 900-MB files?

by Vincent Van Den Berghe

Vincent is currently employed at Bureau van Dijk Electronic Publishing in Belgium, and his primary fields of interest are algorithms, optimization techniques, and compiler design. Vincent can be reached at vvdb@bvd.com.

The company I work for specializes in CD-ROMs, and we generate files with sizes that range from 100 MB to over 1 GB. These files have to be viewed and sometimes even edited. The edits are usually minor, but none of the editors we tried could handle such large files well. Many load the entire file into virtual memory. As a result, simply viewing a 900-MB file can be quite tedious. Since the editor's temporary files are at least as big as the original, disk space becomes a problem as well.

As a result, I began to look for a C++ class that could be used to build my own, more-efficient editing tools. I found classes that encapsulate standard file operations, but I also needed support for basic editing operations. I finally decided to design my own class, specifically tailored to meet the following "CD-ROM" requirements:

No temporary file or extra memory if the file is not changed.
Rapid file positioning and reading.
Resource consumption is a function of the number and size of edit operations, not the size of the file.
Portable.
Modifying a file does not greatly reduce the speed at which the modified file can be viewed.
Original file is never modified unless a "flush" operation is performed, or the file object is destroyed.
Can determine whether data at a certain position has been modified (for implementing "highlight the changes" features).

The complete source code for the class is available electronically.

The Interface

The main class is the TEditableFile class (see Listing One). The functions are straightforward and require little explanation:

The read, write, tell, and seek functions are analogous to the standard ANSI C functions.
The insert and remove functions provide basic editing operations.
The flush method is overloaded to either save the edits to the original file (with no arguments), or to a separate file (with a file-handle argument). In the latter case, the original file contents are left untouched. You can provide the original file handle as an argument to flush, but don't pass a duplicate handle to the same file.
The size function returns the current size of the file in bytes.
The eof function returns nonzero if current position is at end-of-file.
The failed function is described later.
The cancelChanges function cancels all edits and returns the file to its state when the program started or when the last flush() was executed.
The isUnmodified is used to test if certain data has been edited. It can optionally find the longest contiguous block of unedited data. The loop in Example 1 simply lists the modified and unmodified blocks in a file.

Example 1: Identifying modified sections of a file.

while(pos<file.size()) {
    long endPos;
    if(file.isUnmodified(pos,&endPos))
        cout << "Positions " << pos << " through " << endPos
             << " unmodified" << endl;
    else
        cout << "Positions " << pos << " through " << endPos
             << " modified" << endl;
    pos=endPos+1;
}

Data Structures

The underlying idea is what I call the "version control" principle: If you can't keep track of the entire file, keep track of the changes. More specifically, I view the file as a sequence of one or more "deltas." Each delta identifies the source of a block of data in the edited file. Each delta is either a file delta (specifying a block of data from the original file) or an edit delta (with data from an edit operation). The former will be written as F[x,n] (n bytes of data from position x of the original file). The latter will be written as E[n] (n bytes of data from an edit operation). The file position of the edit delta is deduced from its position in the delta list.

To see how this mechanism works, consider editing the 11-byte file containing "Hello World." Originally, there is a single delta F[0,11]. If you remove 3 bytes from position 2, the original delta is split into F[0,2], F[5,6]. If you then insert the 2 bytes "ol" at position 1, the first file delta will be split (analogous with the remove operation), and an edit delta will be inserted between the two cuts. The final file contains:

   F[0,1]="H"
   E[2]="ol"
   F[1,1]="e"
   F[5,6]=" World"

Every edit operation (write, insert, remove) causes the delta list for the file to be updated with the new information. Writing n bytes is equivalent to removing n bytes and inserting the new bytes at the same position.

Note that the file deltas only require storage space for the pointers, not the actual data. Only edit deltas actually store data.

Implementation

Two observations guided the development of the class. First, for any position p in the file, there is one and only one delta containing the data at that position. Second, the effects of an operation can propagate to deltas located to the left or right of this affected delta.

The first observation hinted at a very simple implementation of the TEditable file class: Find the correct delta and "delegate" the operation to that delta. Listing One shows how this works. The getDelta function locates the (unique) delta containing a file position. The read, write, insert, and remove operations call that function and pass the operation to the delta object. Note that the offset will be the offset within the delta, not the offset within the file.

The second observation led me to organize the deltas in a doubly linked circular list with a sentinel. There are several advantages to this structure:

Each element knows its left and right neighbor (needed to propagate an operation).
The sentinel eliminates special cases.
The sentinel delta can be designed to handle operations. For example, a read operation immediately returns end-of-file.

With all this in mind, Figure 1 illustrates the class hierarchy for the deltas.

Figure 1: Class hierarchy (classes in parentheses not implemented).

Chart

TDelta is an abstract class, defining a common protocol for all derived delta classes. This protocol is enforced by using pure virtual functions. The functions bytesLeft(), which returns the number of bytes that can be inserted, and unmodified(), which returns true if the delta contains unmodified data, identify the kind of delta. These functions are sufficient to identify file deltas from edit deltas, and are generic enough to be applicable to any other kind of delta you may dream up.

The TFileDelta implements file deltas, and is shown in Listing Two. The TSentinelDelta implements the sentinel for the doubly linked list. There will typically be only one instance of this class per deltaList.

The other classes implement edit deltas. Because inserted data might be stored in a variety of places—disk, extended memory, conventional memory—I chose to have a different class for each delta variety.

Having different kinds of edit deltas introduces the problem of determining which type of edit delta to create in each different situation. I have therefore defined a "factory" class TEditDelta. This class contains only one static function, called createNew. It will return a new edit delta of the appropriate type. The current implementation always creates a TDiskEditDelta (which is the only type I've implemented so far), but could be modified to create different kinds of deltas depending on such factors as memory availability.

The most interesting operation is flush. Consider a 100-byte file in which 10 bytes have been inserted at the beginning. The deltaList will be E[10], F[0,100]. If you write the deltas to the file sequentially, E[10] will overwrite the first 10 bytes of the file, but F[0,100] needs these first 10 bytes.

Another approach is to first flush the file deltas that have moved, and then flush the rest. But that doesn't work with the deltaList E[10], F[0,50], F[55,45]. This is the previous file, but with 5 bytes removed from position 50. If you write the moved file deltas in order, the last 5 bytes of the second file delta will overwrite the first 5 bytes of the third.

The solution is to use three passes:

Write all file deltas that were moved up in reverse order.
Write all file deltas that were moved down in forward order.
Write all other deltas.

The entire implementation has been split into six .cpp files, each with a corresponding header file, as shown in Table 1.

Table 1: Flush implementation reflecting the class hierarchy.

File Name	Implements	 	        Corresponding 
                                                Header File

EDITABLE.CPP	TEditableFile class 	        EDITBALE.H
DELTA.CPP	TDelta abstract base class      DELTA.H
FILDELTA.CPP	TFileDelta class	        FILEDELTA.H
SENDELTA.CPP	TSentinelDelta class	        SENDELTA.H
EDTDELTA.CPP	TEditDelta class	        EDTDELTA.H
DSKDELTA.CPP	TDiskEditDelta class	        DSKDELTA.H

These files directly reflect the class hierarchy. Your program only needs to include EDITABLE.H (which, itself, depends on no other header files). The rest of the files can be compiled into a library that can be linked with your program.

Error Handling

To detect and handle errors, you will have to check the member-function return values. Most TEditableFile member functions return the number of bytes processed. If this is a short count, something went wrong. To determine the cause of the error, check the (standard) errno variable, which holds the code of the last disk error. There is also a special member function, failed(), that returns nonzero if the constructor failed (remember that constructors do not return error codes).

Portability

The code has been written by making minimal assumptions about the current ANSI C++ standard. No advanced features, operating-system calls, or memory-model assumptions were used. Porting this code to other environments, including those with minimal C++ support, should require little effort.

One potential problem is the handling of operator new. The draft C++ standard states that if operator new fails, an exception (xalloc) is thrown. My code assumes the previous behavior, that operator new returns NULL if it fails.

The other potential problem is the file operations. The classes have been designed using (UNIX-like) handle-based file I/O, which is not covered by the ANSI standard. It's trivial to make the classes iostream or FILE oriented. Buffer size problems have been circumvented by using the ANSI-defined size_t type instead of the more usual int or long integral type. As a result, the maximum amount of data you will be able to transfer is whatever the maximum value of size_t is.

Unfortunately, there is no constant describing the maximum value of size_t or the maximum number of bytes that can be read in one read() operation. On a 16-bit PC compiler, size_t is an unsigned short (if the memory model is not huge), but the maximum read size is slightly less than 64 KB, so that constant had to be hard coded. In a Windows environment, you may be able to use hread() and hwrite() to handle I/O requests of more than 64 KB at a time.

Performance Bottlenecks

My class assumes that there are few edits. More edits lead to a larger deltaList and slow the search for the correct delta. The deltaList length will, therefore, be the main performance bottleneck.

One way to address this is to design a more-efficient search structure for the deltaList. For example, each delta could keep the actual file position, and you could keep the list in "Least Recently Used" order. This would allow the deltaList to self organize, such that the most frequently used deltas are found first.

Another approach is to minimize the length of the list by delaying the creation of new deltas (as much as possible) and by collapsing deltas. I have already implemented delayed creation to some extent, but the algorithm currently does not collapse deltas. For example, the deltaList of a 100-byte file in which 10 bytes were inserted at position 50, then immediately removed, would be F[0,50], F[50,50], which can be collapsed to F[0,100].

Since you can only collapse deltas of the same type, you need to identify the type of the delta. You could use run-time type identification (RTTI) supported by some compilers, or you could add a pure virtual function in TDelta such as virtual int typeId() =0; redefined to return a unique ID (or a string constant) for each delta type.

Although removing bottlenecks is not trivial, it does not fundamentally change the algorithm described here, and is, therefore, left as an exercise for you. Simpler improvements, like adding a cache to speed I/O, should not be a problem.

Extensions

One interesting addition would allow editing files directly on CD-ROM. The current class saves all modifications to the file in the deltaList. You could add the ability to save just the deltaList to the hard disk, and return to your "edited" CD-ROM file at a later date. This could be the basis of a file-annotation mechanism for CD-ROM files. You could also extend the flush operation to save overwritten data to a separate file, allowing for "reverse deltas" that would take you back to any previous version of the file.

Conclusion

The TEditableFile class presented here is ideally suited for small edit operations on large files. When used correctly, it can act as a drop-in replacement for a standard file class and simplify file modifications, including insertions and deletions.

Listings