Five Ways to Find Files

The Perl Journal July 2003

By Andy Lester

Andy manages programmers for Follett Library Resources in McHenry, IL. In his spare time, he works on his CPAN modules and does technical writing and editing. He can be contacted at andy@petdance.com.

For most command-line utilities you'll write in Perl, chances are you'll need to operate on multiple files in the filesystem. As you'd expect, there's more than one way to do it, and you should choose your method based on the specifics of the task at hand. Do you want a list of files, or do you want to iterate over them? Do you need to operate on directories as well as files? Do you need to find files in subdirectories, too? In this article, I'll show you five different ways your Perl program can find files.

No matter which method you choose, remember that searches will be relative to your current directory. Make sure that any solution you choose gives you filenames with path information. It doesn't help to get a filename of "foo.pl" if the file is three directory levels down.

UNIX find

UNIX is filled with tools designed to do one thing well, and the find utility fits this description. If you're comfortable with find's options, then it may make sense to use find from inside your Perl program and parse its output. This approach also has the advantage of making it easy for you to debug the rules you're using in the shell, and then wrap them up in your Perl code.

find can be a bit cumbersome to use from the command line, but for simple searches, it does just fine. The following works from the shell for our example:

$ find . \( -name '*.pl' -o -name '*.pm' -o -name '*.t' \) -print
./lib/WWW/Mechanize.pm
./t/00.load.t
./t/99.pod.t
... etc ...

find takes a starting directory, and then a series of options that tell which files you're interested in. The -name '*.pl' tells find to match files against the pattern *.pl. The -o is the "or" operator. The escaped parentheses group the three tests; without them, -print would bind only to the last test, and only the *.t files would be printed. Chained together, find finds anything that matches *.pl, *.pm, or *.t, and then -print tells find to print the found file to standard output. If you're using GNU find, you can leave off the -print, as it's assumed to be the action to take for the whole expression, and the parentheses then become unnecessary.

Note that you must single quote the pattern, or else the shell will expand the wildcard. For example, if you use:

find . -name *.pl -print 

and the current directory contains exactly one matching file, named "foo.pl," the shell will expand the command into

find . -name foo.pl -print 

which tells find to only find files named "foo.pl," which is certainly not what you wanted.

Now that I have a working find command, I copy it into my Perl program and surround it with backticks. The backtick operator runs the command and, in list context, returns its output as a list of lines, which I assign to an array.

my @files = `find . -name '*.pl' -o -name '*.pm' -o -name '*.t'`;

Each line of the output from the find will be put into the @files array. They're not ready to be used yet, because each line has a "\n" at the end, so we just have to chomp it off.

chomp @files; 
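If the shell quoting gets in the way, one alternative is to skip the shell entirely. What follows is only a minimal sketch, not a replacement for the backtick recipe: it assumes a UNIX-like system and a Perl new enough to support the list form of a piped open, so each argument reaches find exactly as written and the parentheses need no backslashes. The subroutine name find_with_pipe is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: run find(1) under $dir without going through the shell.
sub find_with_pipe {
    my $dir = shift;

    # List-form piped open: no shell is involved, so nothing needs quoting.
    open( my $fh, '-|', 'find', $dir,
        '(', '-name', '*.pl', '-o', '-name', '*.pm', '-o', '-name', '*.t', ')',
        '-print' )
        or die "Can't run find: $!";

    # Read every line of output and strip the trailing newlines in one step.
    chomp( my @files = <$fh> );
    close $fh or warn "find reported a problem\n";
    return @files;
}
```

Calling find_with_pipe( "." ) should give the same list as the backtick version, already chomped. Like the backtick version, it will be confused by filenames that contain newlines.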

opendir/readdir/closedir

For a purely Perl solution, you have three options: readdir, globbing, or modules. I'll start with readdir.

Perl has the concept of dirhandles that act like filehandles, but let you iterate through entries in a directory in much the same way as iterating through lines of a text file. Dirhandles are opened with opendir, read from with readdir, and closed with closedir.

In scalar context, readdir returns the next directory entry, or undef if there are no more. In list context, it returns all the directory entries. What makes readdir a challenge is that "directory entry" could be anything: a plain file, a directory, a symlink, special directories such as "." and "..", and so on.

For single-directory searching, readdir is pretty handy. Since it can return a list, it's easy to filter your files through grep, as in:

opendir( DIR, $dir ) or die "Can't open $dir: $!";
my @files = grep -f "$dir/$_" && /\.(pl|pm|t)$/, readdir DIR;
closedir DIR;

Each entry from readdir is checked to see if it's a plain file and ends in .pl, .pm or .t. Note that we have to prepend $dir before applying the -f operator. readdir returns only the basename, not the full path, so if $dir is anything other than ".", a bare -f $_ would test a file of the same name in the current directory and give the wrong answer. This bit me during testing, so keep it in mind.
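The same reasoning applies to the results: the list holds bare filenames, which are of little use from any other directory. Here is a small sketch of one way to hand back full paths instead. The subroutine name perl_files_in is my own invention, and the lexical dirhandle assumes Perl 5.6 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: list the Perl-ish plain files directly inside $dir,
# returning each one with $dir prepended.
sub perl_files_in {
    my $dir = shift;

    # A lexical dirhandle keeps the handle private to this subroutine.
    opendir( my $dh, $dir ) or die "Can't open $dir: $!";
    my @files =
        map  { "$dir/$_" }                        # put the path back on
        grep { /\.(pl|pm|t)$/ && -f "$dir/$_" }   # test with the path, too
        readdir $dh;
    closedir $dh;
    return @files;
}
```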

This approach with readdir works nicely for single directories, but for entire trees you need to create a recursive function.

my @files = get_files( "." );
sub get_files {
    my $dir = shift;
    opendir( DIR, $dir ) or die "Can't open $dir";
    my @entries = readdir(DIR);
    closedir DIR;
    my @files;
    for my $entry ( @entries ) {
        # Skip current & parent directories, lest we loop
        next if $entry eq "." || $entry eq "..";
        my $fullpath = "$dir/$entry";
        if ( -f $fullpath ) {
            if ( $entry =~ /\.(pm|pl|t)$/ ) {
                push( @files, $fullpath );
            }
        } elsif ( -d $fullpath ) {
            push( @files, get_files( $fullpath ) );
        }
    } # for
    return @files;
}

The get_files subroutine is basically the same as grepping through the list of readdir entries, but it recursively calls itself to get the contents of subdirectories. The full path name for each file must be built in each call to get_files, since readdir only returns filenames, not paths. Also, I need to readdir the entire directory at once, or else reading the global DIR in the recursed call to get_files will mess up the parent.
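The clash over the global DIR can also be sidestepped with a lexical dirhandle. The following is a sketch under that assumption (Perl 5.6 or later); get_files_lexical is a hypothetical name. Because every call opens its own handle, it can read one entry at a time in scalar context and recurse in the middle of the loop.

```perl
use strict;
use warnings;

# Hypothetical variant of get_files that uses a lexical dirhandle.
sub get_files_lexical {
    my $dir = shift;
    my @files;

    # Each call gets its own $dh, so recursing can't disturb the parent's handle.
    opendir( my $dh, $dir ) or die "Can't open $dir: $!";

    # Scalar-context readdir: one entry per call, undef when exhausted.
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry eq "." || $entry eq "..";
        my $fullpath = "$dir/$entry";
        if ( -f $fullpath ) {
            push @files, $fullpath if $entry =~ /\.(pm|pl|t)$/;
        }
        elsif ( -d $fullpath ) {
            push @files, get_files_lexical( $fullpath );
        }
    }
    closedir $dh;
    return @files;
}
```

The trade-off is that a dirhandle stays open at every level of the tree while its subdirectories are being searched.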

Globbing

The glob operator <> and its spelled-out cousin glob can be an improvement over readdir, both for brevity and for providing shell filename-matching semantics. For example, glob does not return the "." and ".." directory entries, or any other file that starts with ".". It understands character classes like *.p[lm], alternation such as *.{pl,pm,t}, and will expand the tilde to a home directory.

The first readdir example would be written with glob as:

my @files = grep -f, <*.{pl,pm,t}>; 

I prefer the glob keyword in most cases. It looks less like line noise, and can't get confused with the diamond operator in file-reading context.
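For comparison, here is a sketch of the same search written with the glob function and a directory prefix, so the results carry their path. The name glob_perl_files is hypothetical, and the sketch assumes $dir contains no whitespace or wildcard characters, since glob would treat those specially.

```perl
use strict;
use warnings;

# Hypothetical helper: plain files matching *.pl, *.pm or *.t inside $dir.
# Assumes $dir has no whitespace or wildcard characters in it.
sub glob_perl_files {
    my $dir = shift;

    # glob returns "$dir/name" for each match, so -f needs no extra prefix.
    return grep { -f $_ } glob( "$dir/*.{pl,pm,t}" );
}
```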

Updating the sample get_files from above to use glob is left as an exercise to the reader, but only if the reader wants to do more work than necessary. glob is meant only for single directories, so to handle directory structures, use readdir. Or, you can use our next file-finding method, File::Find.

File::Find

The File::Find module takes care of all the mess with directory descending that I've shown you in the previous examples, plus it adds a number of handy features. File::Find has been included with Perl since Perl 5.00307, and is available on CPAN if you have an older version.

File::Find has only two functions, find and finddepth. They're identical except that finddepth calls wanted for a directory's contents before calling it for the directory itself. find takes a reference to a callback subroutine, often called "wanted," and a list of directories in which to start searching. For each file or directory find finds, wanted is called, allowing you to perform actions and check information about the file.
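The difference between the two functions is easiest to see by recording the order of the calls. This is just a sketch; visit_order is a hypothetical helper, and it relies on $File::Find::name, which holds the full path of the entry being visited.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: record the order in which a finder visits entries.
# $finder is a reference to either find or finddepth.
sub visit_order {
    my ( $finder, $dir ) = @_;
    my @seen;
    $finder->( sub { push @seen, $File::Find::name }, $dir );
    return @seen;
}

# Usage:
#   my @pre  = visit_order( \&find,      "." );  # a directory, then its contents
#   my @post = visit_order( \&finddepth, "." );  # the contents, then the directory
```

The finddepth order is the one you want when the callback removes or renames directories, since a directory's contents are handled before the directory itself.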

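When wanted is called, the following global variables will be set:

    $_                 the name of the current file, without its path
    $File::Find::dir   the directory that contains the current file
    $File::Find::name  the complete path to the current file, "$File::Find::dir/$_"

By default, find also changes into $File::Find::dir before calling wanted, which is why file tests on a bare $_ work.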

One of the simplest ways to use File::Find is to accumulate a list of wanted files, like I showed in previous examples, but without having to deal with the recursion into subdirectories. The following prints out all of the files from the current directory downward:

use File::Find;
sub wanted {
    print $File::Find::name, "\n" if -f;
}
find( \&wanted, "." );

Note that I need to print $File::Find::name and not just $_ , or I would only get the filename without the path.
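To accumulate the list instead of printing it, push each name onto an array. The sketch below does that for the *.pl, *.pm, and *.t files from the earlier examples; find_perl_files is a hypothetical name, not part of File::Find.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: full paths of *.pl, *.pm and *.t files under $dir.
sub find_perl_files {
    my $dir = shift;
    my @files;
    find(
        sub {
            # $_ is the bare filename; the pattern and -f test use it,
            # but the path we keep comes from $File::Find::name.
            push @files, $File::Find::name if /\.(pl|pm|t)$/ && -f $_;
        },
        $dir
    );
    return @files;
}
```

The anonymous subroutine closes over @files, so there is no need for a global array shared with a named wanted function.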

Since wanted can do anything, and you can specify multiple starting directories, you have a lot of flexibility. This program figures out how many bytes are used in the files in three different module distributions.

use File::Find;
my @modules = qw(
    www-mechanize
    html-lint
    marc-record
);
my @dirs = map { "/home/andy/$_" } @modules;
my ($size, $n) = (0, 0);   # start at zero so the report is sane if nothing is found
find( \&wanted, @dirs );
print "$size bytes in $n files\n";
sub wanted {
    if ( -d ) {
        $File::Find::prune = 1 if $_ eq "CVS";
    } elsif ( -f ) {
        $size += -s;
        ++$n;
    }
}

Note the use of $File::Find::prune. Setting it to a true value tells find not to descend into the directory. In this case, we want to omit files in the CVS directory from our totals, since they're not part of the actual distribution.

The subroutine reference need not refer to a named subroutine. An anonymous subroutine will work just as well, and often makes for easier-to-follow logic for simple wanted functions. For example, to count the number of files in a given directory:

use File::Find;
find( sub { $n++ if -f }, "." );
print "$n files found\n";

or even from the command line, as I discussed in the May issue:

perl -MFile::Find -le'find( sub {$n++ if -f}, "." );' \
    -e'END {print "$n files found"}'

File::Find::Rule

If you frequently use File::Find to return lists of files, and you spend a lot of time rewriting the same wanted functions over and over again, you may want to turn to Richard Clamp's File::Find::Rule. Instead of creating wanted functions, you'll create sets of rules, not unlike the find command, in order to retrieve lists of files.

In true Perl style, File::Find::Rule makes it easy to do the common tasks, such as finding all the *.pl files below the current directory:

use File::Find::Rule;
my @files = File::Find::Rule->file->name('*.pl')->in( "." );

Each function in the chain (file and name) acts as both a constructor and a link in a method chain, and then the in method evaluates all the rules for a given directory. You can also use individual method calls on a File::Find::Rule object, like so:

my $rule = File::Find::Rule->new;
$rule->file;
$rule->name( '*.pl' );
my @files = $rule->in( "." );

The various methods are extremely flexible. For example, to get *.pl, *.pm, and *.t files, call name with an anonymous array of specifications:

$rule->name( ['*.pl','*.pm','*.t'] ); 

This ability to add rules one at a time makes it easy to build rulesets on the fly. For example, you might have command-line options in your program that tell which files you want, and the "--big" option shows that you only want files larger than 20 MB:

$rule->size( '>20M' ) if $opt_big; 

If the object style of specification doesn't suit you, File::Find::Rule also supports a functional interface. To find all the files in my home directory beginning or ending with a tilde, which are good candidates for deletion:

my @files = find( name => ['~*','*~'], in => '/home/andy' ); 

The rules you use can be combined using Boolean logic with the and and or functions. If I want to find files beginning or ending with tildes, or files with 0 bytes, I can use:

my $rule = File::Find::Rule->new;
$rule->file;
$rule->or(
    find( name => ['*~','~*'] ),
    find( size => 0 ),
);

Sometimes, you want to iterate over the list of files, like File::Find, rather than gathering a whole list of them. File::Find::Rule supports this with the start and match methods:

$rule->start( "/home/andy" );
while ( my $file = $rule->match ) {
    # do something with $file
}

For more on File::Find::Rule's options and help on how to create your own extensions, see http://search.cpan.org/author/RCLAMP/File-Find-Rule/.

Conclusion

I've taken you on a tour of five different ways to gather lists of, and operate on, files in the filesystem based on your specific criteria. Which method is best? As with any tool, it depends on your context.

For simple lists of files with rudimentary filename matching, globbing is the way to go. If you need to include directory names in your results as well as files, use readdir. As soon as you want to start finding files in subdirectories, one of the two modules will make life easier. Both File::Find and File::Find::Rule handle directories nicely, although File::Find::Rule may be better for complex logic that doesn't need to be wrapped up in a wanted function.

TPJ