April 1993/Mapping Functions for Repetitive Structures

Features

Mapping Functions for Repetitive Structures

Steven K. Graham

Steven K. Graham has worked for Hewlett-Packard, served on the Faculty at UMKC, and is currently a Senior Engineer with CSI.
Higher-order functions take other functions as arguments. They are usually used to apply a function to some repetitive data object, such as a list or a tree. Such use of higher-order functions is referred to as mapping, and the higher-order functions are called mapping functions. The key benefit of higher-order functions is the extra layer of abstraction they introduce. In essence, mapping functions allow you to create custom control structures that can be applied to data structures routinely used in a particular environment. Mapping functions do not simplify code intended for a single use or a single purpose. Only when a data object is processed repeatedly in similar ways —where the control flow is the same, but the functions performed vary — is a mapping function valuable. As an example, consider the general case of processing records in a particular kind of data file. This can be regarded as mapping the processing function onto the records of the file. A mapping function could be built that takes any given record processing function and applies it to all the records of a file.

Uses for Mapping Functions
Despite sparse use in traditional programming languages, mapping functions are used in many familiar contexts. In mathematics, a higher-order function is used to express iterated sums:

The summation applies the function F to each integer from 1 to n, and accumulates the sum. If n were 5 and F(x) = x x, then sum = 11 + 22 + 33 + 44 + 55. It can be argued that a large part of the success of mathematics is due to developing effective language and notations to express complex ideas. In UNIX, the notion of higher-order commands has appeared in commands that take other commands as arguments and control their application. sed is a good example of this.
Some languages provide convenient methods for expressing higher-order functions. Lisp and its dialects such as Scheme, provide functions as first class objects — functions can be passed as arguments, returned as values, assigned as array elements and so on. So Lisp programmers commonly use mapping functions and are accustomed to exploiting their power. Other languages restrict the use of functions to varying degrees, but higher-order functions are possible in languages readily available, such as C. Because of C's (relatively) strong typing, mapping functions force the processing functions they apply to return a particular type, and it is necessary to coerce types to pass a function argument to a mapping function.

Example Programs
Listing 1 illustrates a C version of summation. It defines two mapping functions, sum and dsum, and three processing functions which are mapped onto a range of integers, pi, id, and sq. Note that sum and dsum are basically for loops that process the integers from a to b, at each step accumulating the result of applying the function parameter term to the current value of i. Because of C's typing, you need two versions to accommodate different return types despite the identical control structures, so dsum is used to accumulate a floating-point sum. One of the processing functions, id, is particularly noteworthy, because it is only in the context of higher-order functions that a programmer ever needs an identity function that does nothing but return its argument value. However, id is needed for sum to simply add the integers from a to b.
The call to dsum, approximates p using a partial sum of an infinite series. This is drawn from the identity that

which can be written as

with the function pi (in Listing 1) providing the value for each term, given the value for i.
Higher-order functions become mapping functions when applied to repeating structures, such as lists or trees. A list is comprised of an item and a pointer to the rest of the list (which is also a list). A tree is comprised of a node, with subtrees as children. This mapping approach can be applied to the directory tree of a UNIX file system (see Listing 2) . The function mapdir is a recursive function that maps its argument filefn onto all the files in the hierarchy below a commandline directory argument.
This example applies the function suidfile that prints out the name of files that have the setuid bit set. The setuid bit indicates that when the file is executed, it will change its effective user ID to the owner of the file when it is run. Programs that set the user ID can lead to gaping holes in system security. (Sun's Programmer's Guide to Security Features comments on its use, "1. Don't do it unless absolutely necessary.") Using mapdir, it is simple to search an entire directory structure for programs that set the user ID. This example could be part of a suite of programs designed to explore the file system for potential security problems. A more elaborate system could be devised to check other file attributes, such as comparing file permissions to the permissions of the directory that contains them. The real power of this approach is that the mapdir function is independent of any particular use. Simply by calling mapdir with the function printfile (in Listing 3) as an argument, you have a program to display directories recursively. By creating routines that check for unusually large files, for unusually small files, or for large numbers of files within a single directory, you can build tools to help manage disk usage.
The functions and structures used in mapdir are drawn from the standard include files, dirent.h, sys/dirent.h and sys/stat.h. These examples were developed on a Sun SPARCstation using SunOS 4.1. dirent.h provides the DIR struct, along with the functions opendir, closedir, and readdir to manipulate it. sys/dirent.h provides the struct dirent, for directory entries:
struct dirent {
    off_t       d_off;
/* offset of next disk dir entry */
    unsigned long  d_fileno;
/* file number of entry */
    unsigned short d_reclen;
/* length of this record */
    unsigned short d_namlen;
/* length of string in d_name */
    char       d_name[255+1];
/* name (up to MAXNAMLEN + 1) */
};
For the example programs, the field d_name, a string that specifies the file name for a particular directory entry, was the only field used. stat.h provides the following structure to store information about files:
struct stat {
    dev_t  st_dev;
/* device file resides on */
    ino_t  st_ino;
/* file serial number */
    mode_t  st_mode;
/* file mode */
    short  st_nlink;
/* # of hard links to file */
    uid_t  st_uid;
/* owner's user ID */
    gid_t  st_gid;
/* owner's group ID */
    dev_t  st_rdev;
/* dev identifier for *
/* dev identifier for *
 * special files */
    off_t  st_size;
/* file size in bytes *
 * - (off_t is a long) */
    time_t st_atime;
/* last access time */
    int   st_spare1;
    time_t st_mtime;
/* last modify time */
    int    st_spare2;
    time_t  st_ctime;
/* last status change time */
    int   st_spare3;
    long  st_blksize;
/* preferred block size */
    long  st_blocks;
/* # of blocks allocated */
    long  st_spare4[2];
};
The example programs use the st_mode field, as well as some of the macros and constants defined in stat.h. In particular, the macros S_ISDIR, S_ISLNK, and the bit mask S_ISUID are used to examine the st_mode field, which records file types and permissions. stat.h also declares the stat function that reads the file information. stat has the declaration
int stat(path, buf)
char *path;
struct stat *buf;
The mapdir program can be made more general. In Listing 3, mapif.c, the mapdir function has been modified to accept three functions as arguments: a predicate function, a then function and an else function. The purpose is to not only map these functions across the file system, but also provide alternative actions depending on the results of the predicate function. The example uses a function that tests for the setuid bit, a printfile function, and a "do nothing" function, to reproduce the behavior of the earlier example from Listing 2.

Variations to Mapping Functions
In these example, the functions that are mapped rely on a single filename argument. This can be varied. A file descriptor could be passed. The complete path could be passed in addition to the filename. Other parameters might specify the level in the file hierarchy or information about the directory containing the file. Global variables or additional parameters could be set using command-line arguments and used to provide thresholds for searching for unusually large or small files, or perhaps a search string for building a grep-like function out of mapdir and a routine that searches a single file. Besides operating on files within the file system, a mapping function could be written that processes the directories rather than (or in addition to) processing the files, for example, counting the number of files in each directory.
The file system isn't the only regularly structured data in a UNIX system. A simple procedure based on the mapping notion could be written that applies a function to each line within a file. Using such a routine, functions could operate on the passwd file, checking for users with no password, or checking directory protections on the home directories of each user. A function could be written to process every user's .cshrc file, perhaps to incorporate changes uniformly.
UNIX, perhaps more than any other system, is characterized by an abundance of tools. As I mentioned at the outset, many of these tools incorporate the notion of function or command arguments. The find command, with an - exec parameter, processes the file system in much the way that mapdir does. So why bother with mapdir? First, the key point is not the particular example, but the general idea: whenever possible, abstract general behaviors over regular structures into mapping functions. Second, tools built in C rather than as shell scripts or command invocations are more flexible. Their input and output can be varied; alternatives can be handled; and programs can be tailored to particular uses. Finally, the performance of different approaches will vary. In some cases, a program may be more efficient than a command with respect to some essential resources.
(C) 1993 by Steven K. Graham