February 23, 2006

Turbocharged awk

Author: Keith Winston

In a previous article, I covered the basics of awk and presented a small application to reformat address book data. Now, I'll show you how to turbocharge awk. You can improve the performance of your awk programs by uncovering bottlenecks in your code with the help of a profiler, hunting for bugs with XREF, and using Awka to increase speed.

If your awk program isn't running at an acceptable speed, the first thing to do is look for a logic error. You may have created a loop that is executing more often than it should or processing a set of records more than once. The GNU version of awk includes a profiler -- usually found in /bin/pgawk -- to help you find these kinds of programming bugs.

To use the profiler, call pgawk instead of gawk when running your program, or use the --profile option with gawk. By default, the profiler creates a file in the current directory called awkprof.out, which contains a copy of the program with execution counts for each statement in the left margin. For example, here's the profiling output for the program from my previous article (extract-emails.awk):

        # gawk profile, created Tue Jan 17 13:38:25 2006

        # BEGIN block(s)

        BEGIN {
     1          FS = ":"
        }

        # Rule(s)

  1120  /^FN/   { # 2
    84          fullname = tolower($2)
    84          split(fullname, names, ",")
    84          name = (names[2] names[1])
        }

  1120  /^EMAIL/        { # 2
    82          mail = tolower($2)
    82          print ((mail ",") name)
        }

The numbers in the left margin indicate how many times each statement executed. The statement in the BEGIN block executed once, as expected. Each regular expression test executed 1,120 times -- once for each record in the input file. Each block executed only when a record matched the corresponding regular expression test.
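To try the profiler on a small case, something like the following sketch may help. It assumes a gawk in which --profile itself records execution counts (on older installations that ship a separate pgawk, substitute pgawk for gawk). The file names count.awk, cards.txt, and count.prof are made up for illustration.

```shell
# Sketch only: count.awk, cards.txt, and count.prof are made-up names.
tmp=$(mktemp -d)
cat > "$tmp/count.awk" <<'EOF'
/^FN/ { n++ }
END   { print n + 0 }
EOF
printf 'FN:GREEN\nN:GREEN\n' > "$tmp/cards.txt"

if command -v gawk >/dev/null 2>&1; then
    # --profile=FILE names the profile; plain --profile writes awkprof.out.
    gawk --profile="$tmp/count.prof" -f "$tmp/count.awk" "$tmp/cards.txt"
    cat "$tmp/count.prof"
else
    echo "gawk not found; the profiler needs GNU awk" >&2
fi
```

The profile is written when the program exits, so the counts only appear once the run has finished.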

In the extract-emails.awk profile, notice that the /^FN/ block executed two more times than the /^EMAIL/ block (84 versus 82). I examined the data and found that two entries in the input contain a full name (FN) line but no corresponding email address:

BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
X-DL:GREEN(82 items)
END:VCARD

Output is printed only inside the /^EMAIL/ block, so these entries did not affect the results. However, this discovery did bring to light the existence of these two "group" records.
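Once the profile hints at such a mismatch, a few lines of awk can locate the offending entries. The sketch below is one possible approach, not the method used for this article: it remembers the FN value of each card and prints it at END:VCARD if no EMAIL line was seen. The two sample cards and the variable names fn and has_mail are invented for illustration.

```shell
# Sketch: list the FN of every vCard that has no EMAIL line.
# The sample cards below are invented for illustration.
orphans=$(awk -F: '
/^BEGIN:VCARD/ { fn = ""; has_mail = 0 }   # reset state at the start of each card
/^FN/          { fn = $2 }
/^EMAIL/       { has_mail = 1 }
/^END:VCARD/   { if (fn != "" && !has_mail) print fn }
' <<'EOF'
BEGIN:VCARD
VERSION:2.1
FN:Doe,Jane
EMAIL:jane.doe@example.com
END:VCARD
BEGIN:VCARD
VERSION:2.1
X-GWTYPE:GROUP
FN:GREEN
N:GREEN
END:VCARD
EOF
)
echo "$orphans"   # prints GREEN
```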

Complete cross-reference

For larger programs where you may pull in several awk script libraries, you can hunt deeper bugs by creating a cross-reference of all functions and variables. Phil Bewig created XREF, a public domain awk program, to take valid awk programs as input and write a cross-reference to standard out. To run it, use a command like:

awk -f xref.awk [ file ... ]

For ordinary variables and array variables, XREF prints a line of the form

count var(func) lines ...

for each function in which the variable is used, where count is the number of times the variable is used, var is the name of the variable, func is the name of the function to which the variable is local (an empty func indicates that the variable is global), and lines lists the line numbers on which the variable appears.

Here is the output of XREF when run against the sample program:

    1 FS() 1
    2 fullname() 8 9
    2 mail() 15 16
    2 name() 10 16
    3 names() 9 10
    2 tolower(0) 8 15

It shows the number of times each variable or function was used and the line numbers where they appear.

XREF is invaluable for spotting variable scoping problems or a function name that is defined in two places.
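The kind of scoping problem a cross-reference can expose is illustrated by the sketch below. In awk, a variable used inside a function is global unless it is listed as an extra parameter, so a helper can silently overwrite its caller's variables. The function names sum_to_bad and sum_to_ok are hypothetical; in a cross-reference, the variable i would be expected to appear as global for the first function and as local to the second.

```shell
scoping=$(awk '
# Buggy: i and total are global, so the helper clobbers the i of its caller.
function sum_to_bad(n) {
    total = 0
    for (i = 1; i <= n; i++) total += i
    return total
}
# Fixed: the extra parameters (total, i) act as local variables.
function sum_to_ok(n,    total, i) {
    total = 0
    for (i = 1; i <= n; i++) total += i
    return total
}
BEGIN {
    i = 10
    sum_to_bad(3); after_bad = i     # the helper overwrote the global i
    i = 10
    sum_to_ok(3);  after_ok = i      # the global i is untouched
    print after_bad, after_ok
}')
echo "$scoping"   # prints 4 10
```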

Bolting on a turbo

Most awk applications run quickly with no performance tuning. However, things might slow down if you're crunching enormous text files or doing lengthy calculations. You can squeeze the maximum amount of performance out of awk by using Awka to translate your program into ANSI C and compile it.

Awka is a General Public License (GPL) program distributed as source code, so you need to install it with the ./configure, make, make install routine. Compiling Awka creates the Awka binary and the libawka.so shared library that is used with each generated program. Note: You may need to add /usr/local/lib to the /etc/ld.so.conf file and run ldconfig before the system can find libawka.so.

A compiled program will show typical performance gains of 50% or more over native gawk, though it depends on the mix of operations and data. Arrays, nested loops, and user-defined functions show great speed increases. The Awka Web site features a performance comparison chart of various operations.

I used this Awka command to create a binary version of the sample program above:

awka -X -o extract-emails -f extract-emails.awk

Then, I ran my own comparison test. The compiled Awka version ran about three times as fast as native gawk, though the wall-clock difference was negligible with such a small program and small data set. To see whether the results scaled, I increased the data set first by a factor of 100, then by a factor of 1,000. With 100 times the data, Awka was still three times as fast. With 1,000 times the data, Awka was four times as fast (an average of 3.811 seconds for gawk vs. 0.932 seconds for Awka).
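The article does not say how the data set was enlarged. One simple way to build a scaled test file is to repeat the original input a given number of times, as in this sketch. The helper name scale_data is hypothetical and is not part of gawk or Awka.

```shell
# Sketch: repeat an input file N times to build a larger test file.
# scale_data is a hypothetical helper; the subshell body keeps i local.
scale_data() (
    # $1 = source file, $2 = repeat count, $3 = output file
    : > "$3"
    i=0
    while [ "$i" -lt "$2" ]; do
        cat "$1" >> "$3"
        i=$((i + 1))
    done
)
```

For example, with placeholder file names, scale_data addressbook.vcf 100 big.vcf would produce the 100-times data set, and a command such as time gawk -f extract-emails.awk big.vcf > /dev/null would time the interpreted version; the compiled binary can be timed against the same file for comparison.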

The Awka project has not been updated in several years, but it still works on modern distributions. It's a good tool to have available if you want to use the simple awk language and need the raw speed of a compiled program.
