Data Manipulation & Perl Command-Line Options

The Perl Journal May 2003

By Andy Lester

Andy manages programmers for Follett Library Resources. He can be contacted at andy@petdance.com.

In his article "Something For Nothing" (TPJ, March 2003), Simon Cozens talked about using the tools on CPAN to avoid reinventing the wheel. Even without CPAN, Perl itself provides a number of command-line options that do the heavy lifting for many data- and file-manipulation tasks. In this article, I'll provide an overview of Perl's most useful and commonly used data-manipulation options.

-e Command

The most useful way to use the command-line options is by writing Perl one-liners right in the shell. The -e option is the basis for most command-line programs. It accepts the value of the parameter as the source text for a program:

$ perl -e'print "Hello, World!\n"'
Hello, World!

Since this is a single statement in a block, you can omit the semicolon. Also, when the -e option is used, Perl no longer looks for a program name on the command line. This means you can't mix code with -e and a program file.

The -e option is repeatable, which lets you create entire scripts on the command line:

$ perl -e'print "Hello, ";' -e'print "World!\n"'
Hello, World!

When chaining together multiple -e options, make sure you keep your semicolons in the right place. I reflexively put semicolons in my -e lines just for safety's sake, even if it's not strictly necessary because there's only one -e option.

With the -e option, any shell window becomes a Perl IDE. Use it as your calculator to figure out how many 80-byte records are in a megabyte:

$ perl -e'print 1024*1024/80, "\n"'  
13107.2

Escaping Shell Characters

When you're creating command-line programs, it's important to pay attention to quoting issues. In all my examples, I've quoted with single quotes—not double quotes—for two reasons. First, I want to be able to use double quotes inside my programs for literals, and double quotes don't nest in the shell. Second, I have to prevent shell interpolation, and single quotes make it easy. For example, if I use double quotes, then

$ perl -MCGI -e"print $CGI::VERSION"

gets the $CGI interpolated as a shell variable. Consequently, unless you have a shell variable called $CGI, Perl sees

print ::VERSION

You can escape the shell variables with a backslash:

$ perl -MCGI -e"print \$CGI::VERSION"

but that gets to be tough to maintain. That's why I stick with single quotes:

$ perl -MCGI -e'print $CGI::VERSION'

Windows has slightly different quoting issues. Windows doesn't have shell variable interpolation, so there's no need for escaping variables with dollar signs in them. On the other hand, you can use only double quotes under Windows, which can be a challenge if you want to use double quotes in your program. Under Windows, your "Hello, World" would look like this:

C:\> perl -e"print \"Hello, World!\n\""

The inner double quotes are escaped with backslashes.

The Diamond Operator

Perl's diamond operator, <>, has a great deal of magic built into it, making operations on multiple files easy.

Have you ever written something like this:

for my $file ( @ARGV ) {
    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    while ( my $line = <$fh> ) {
        # do something with $line
    }
    close $fh;
}

$ perl myprog.pl file1.txt file2.txt file3.txt

so that your program can operate on three files at once? Use the diamond operator instead. Perl keeps track of which file you're on, and opens and closes the filehandle as appropriate. With the diamond operator, it's as simple as:

while ( my $line = <> ) {
    # do something
}

Perl keeps the name of the currently open file in $ARGV. The $. line counter does not reset at the beginning of each file.

The diamond operator figures prominently in much Perl command-line magic, so it behooves you to get comfortable with it.

-n and -p: Automatic Looping Powerhouses

The -n and -p options are the real workhorse options. They derive from the Awk metaphor of "Do something to every line in the file," and work closely with the diamond operator.

The following program prepends each line with its line number:

while (<>) {
    $_ = sprintf( "%05d: %s", $., $_ );
    print;  # implicitly print $_
}

The construct of "Walk through a file, and print $_ after you do some magic to it" is so common that Perl gives us the -p option to implement it for us. The previous example can be written as:

#!/usr/bin/perl -p

$_ = sprintf( "%05d: %s", $., $_ );

or even shorter as:

$ perl -p -e'$_ = sprintf( "%05d: %s", $., $_ )'

The -n option is just like -p, except that there's no print at the bottom of the implicit loop. This is useful for grep-like programs when you're only interested in selected information. You might use it to print only commented-out lines from your input, defined as beginning with optional whitespace and a pound sign:

$ perl -n -e'print if /^\s*#/'

The next program prints every numeric value that looks like it's part of a dollar value, as in "$43.50."

#!/usr/bin/perl -n

while ( /\$(\d+\.\d\d)/g ) {
    print $1, "\n";
}

This while loop is inside the implicit while(<>) loop. If you want to do something before or after your main while() loop, use BEGIN and END blocks. For example, to total up all those dollar values:

#!/usr/bin/perl -n

BEGIN { $total=0 }
END { printf( "%.2f\n", $total ) }

while ( /\$(\d+\.\d\d)/g ) {
    $total += $1;
}

BEGIN and END blocks can appear anywhere in the program: Perl runs BEGIN blocks before the loop starts and END blocks after it finishes, regardless of where they sit in the source. You can specify them with the -e option, too. Here's a quick one that I used while writing this article to strip verbatim paragraphs from POD:

$ perl -n -e'BEGIN {$/=""}; print unless /^\s+/;' article.pod

If you want an empty loop body with -n or -p, you must still supply the -e option. Otherwise, Perl takes the first argument as the name of a program file or, if there are no arguments, waits for the program to be entered on standard input. Using -e1 gives Perl a dummy loop body that does nothing.

-l: Line-Ending Handling

When you're working with lines in a file, you'll find you're doing lots of chomping and print $something, "\n". Perl has the -l (dash el) option to take care of this for you.

In the simplest sense, adding -l when you're using -n or -p automatically does a chomp on the input record, and adds a "\n" after everything you print. It makes command-line one-liners much easier, as in:

perl -l -n -e'print substr($_,0,40)'

This example only shows the first 40 characters of each line in the input, whether or not the line is longer than 40 characters, not counting the line-ending "\n". For command-line programmers, -l is a godsend because it means you can use print $foo instead of print $foo,"\n" to get the results you want.

The mechanics of how the print ... "\n" happens are a little more complex. The -l actually sets $\, the output record separator, to the current value of $/, the input record separator. You can override this by specifying the octal value for $\ on -l. For instance, if you wanted all output lines to have a Ctrl-M as the record terminator, specify -l015 (that's "dash el zero one five").

-i: Edit in Place

All the options you've learned so far are great for writing filters, where a number of files or standard input get fed out to standard output. Unfortunately, you're left to do the foo-move dance, as in:

perl -p -e"slick code" input.txt > foo
mv foo input.txt

Perl comes to the rescue again with the -i option. Adding -i tells Perl to edit your files in place, so you can replace the previous example with

perl -p -i -e"slick code" input.txt

You can tell Perl to keep a copy of the original file(s) by specifying a string to tack on to the end of each filename. Common examples are -i~ or -i.bak. Perl doesn't treat ".bak" as an "extension" in the sense of replacing one extension with another:

perl -p -i.bak -e"slick code" input.txt

leaves the original file called "input.txt.bak." Of course, the -i option does the right thing if you specify multiple files by creating a backup file for each of the files processed.

-0[octal]: Specify Input Record Separator

Often when working on the command line, you'll want to specify your input record separator. Although this is possible with -e'BEGIN {$/=...}', it's easier with the -0 option. (That's dash-zero, not dash-oh.) To specify an input record separator of chr(13), use -015. Two special values for the -0 option are -00 for paragraph mode, equivalent to $/="", and -0777 to slurp entire files, equivalent to $/=undef.

The earlier example for filtering POD paragraphs:

$ perl -n -e'BEGIN {$/=""}; print unless /^\s+/;' article.pod

can now be shortened to:

$ perl -n -00 -e'print unless /^\s+/;' article.pod

-a and -F: Autosplit Input Records

The -a and -F options only work with -n and -p. Specifying -a tells Perl to run @F=split on your input line. Without the -F option, this means breaking up the input line on whitespace, which is most handy for log files. If you don't specify the -l option, your final element in @F has a "\n" at the end of it, which is probably not what you want.

Here's a quick way to count the bytes that your Apache server has sent out:

$ perl -l -a -n -e'$n+=$F[9];END{print $n}' access_log

Each line of the Apache log file is broken up on whitespace, and the number of bytes is the 10th field in the line.

If you don't want to split on whitespace, specify the regex to use with the -F option. This example walks through the /etc/passwd file, printing all usernames that have a login shell. The fields in /etc/passwd are separated by colons, with the user's name as the first field and login shell as the last.

perl -l -n -a -F: \
    -e'print $F[0] unless $F[-1] eq "/bin/false"' /etc/passwd

Even though there are no slashes here, -F: still means that the regex is /:/.

Option Stacking

If you want to make the most of your keystrokes on the command line, you may want to stack your options. Single-character options may be combined with the following option. For example, our /etc/passwd-processing example that starts:

perl -l -n -a -F: -e'....'

can be written as:

perl -lnaF: -e'....'

I don't recommend combining options because it adds a layer of complexity for the small benefit of saving a few keystrokes. There are also pitfalls when combining options, especially with the -i option. For example, say you have a program where you're editing a file in place to truncate each line to 40 characters:

$ perl -p -i -l -e'$_=substr($_,0,40)' myfile.txt

This works just fine. Now, combine those options overoptimistically into:

$ perl -pil -e'$_=substr($_,0,40)' myfile.txt

The -p option is just fine, but now you've told the -i option to append the letter "l" at the end of the backup file's name, and lost the -l functionality of handling line endings. The results aren't pretty.

-[mM][-] Module

-m and -M are the module-loading options. They obviate the need to have a -e'use ModuleName;'.

The -mModuleName performs a use ModuleName (); before your program executes. -MModuleName is the same, but without the parentheses. The difference can be subtle, depending on the import semantics of the module you're importing, as we'll see below. You can also do a "no ModuleName" with -M-ModuleName.

The -M option is also a handy way to find out if a module is installed, and what version. Want to see which version of the CGI module you have installed?

$ perl -MCGI -le'print $CGI::VERSION'
2.89

Of course, if you don't have the module installed, Perl will give an error.

Many modules have specific tricks built in for use on the command line. Probably the most common example is the CPAN module, where you can do:

$ perl -MCPAN -e'install "Module::Name"'

Text::Autoformat exports the autoformat function by default, making it easy to write a one-liner to format a block of text from standard input:

$ perl -MText::Autoformat -e'autoformat'

Applying What You've Learned

With all these marvelous file-mangling command-line options at your disposal, you have a great deal of power at hand. For instance, a program to convert standard line endings to the Mac's Ctrl-M takes only 28 characters:

$ perl -i.bak -l015 -pe1 *.txt

A global search and replace of all occurrences of "FOO" to "BAR" in .html files in a directory is as easy as:

$ perl -i -pe's/FOO/BAR/g' *.html

Wrapping Up

A few final notes about command-line options: Perl respects command-line options on the #!perl line of your script, so a script that you write as:

$ perl -i -pe's/FOO/BAR/g'

could also be written as:

#!/usr/bin/perl -i -p

s/FOO/BAR/g;

Even in operating systems that don't use the #!perl line (like Windows), Perl still checks for it and will respect the options.

For future reference, if you don't have this article handy, you can run perl -h to get an alphabetical list of options, or run perldoc perlrun for the manpage.

Now go forth with your newfound power and keep getting something for nothing.

TPJ