January 16, 2006

CLI Magic: Learn to talk awk

Author: Keith Winston

User level: Advanced

When it comes to slicing and dicing text, few tools are as powerful, or as underutilized, as awk. The name "awk" was coined from the initials of its authors, Aho, Weinberger, and Kernighan -- yes, the same Kernighan of the famous Kernighan and Ritchie "C Programming Language" book. In the Linux world, most distributions ship the GNU version, gawk, and /bin/awk is often a symbolic link to /bin/gawk, although some default to another implementation such as mawk. The GNU version has a few more features than the original. Let's play with some of the core features common among POSIX-compliant awks.

In this article, when I reference awk, I am really using gawk.

The awk utility is a small program that executes awk language scripts, which are often one-liners, but just as easily may be larger programs saved in a text file. For example, to execute an awk script saved in the file prg1.awk and have it process the file data1, you could use a command such as:

awk -f prg1.awk data1

The result is written to standard output, or it may be redirected to a file.

The -F option overrides the default field separator, which is whitespace (any run of spaces or tabs). The field separator can also be changed within an awk program. To tell awk how to split data into fields from a comma-separated value (CSV) file, you would use:

awk -F"," -f prg1.awk data1
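Setting the separator from inside the program is usually done in a BEGIN block. Here is a minimal sketch comparing the two approaches; the sample data is made up and is fed through a pipe instead of a real data1 file:

```shell
#!/bin/sh
# Hypothetical sample data: name,department,amount
data='alice,sales,100
bob,support,250'

# -F on the command line sets the separator before any input is read.
with_flag=$(printf '%s\n' "$data" | awk -F"," '{ print $2 }')

# Assigning FS in a BEGIN block does the same job from inside the program.
with_begin=$(printf '%s\n' "$data" | awk 'BEGIN { FS = "," } { print $2 }')

printf '%s\n' "$with_flag"
printf '%s\n' "$with_begin"
```

Both commands print the second column. Keep in mind that splitting on a bare comma does not understand quoted fields that themselves contain commas.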

You may also include more than one data file to process, and awk will keep running until it runs out of data:

awk -F"," -f prg1.awk data1 data2 data3 data4 data5

If you want to assign a value to a variable before execution of the program, use the -v option:

awk -v AMOUNT=100 -f prg1.awk data1
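To see what a program can do with such a variable, here is a small sketch with the program written inline. The input lines and the threshold test are invented for illustration; only the -v option itself comes from the command above:

```shell
#!/bin/sh
# AMOUNT is assigned by -v before the program starts running.
# The three input lines are made up: an item name and a number.
result=$(printf '%s\n' 'widget 50' 'gadget 150' 'gizmo 100' |
    awk -v AMOUNT=100 '$2 >= AMOUNT { print $1 }')

printf '%s\n' "$result"
```

Only the items whose second field is at least AMOUNT are printed, so changing the value on the command line changes the report without touching the program.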

Behold the power

The power of awk comes from how much it does automatically for you when crunching text files, and from the simple elegance of the language. When you feed awk a text file, it does the following:

  • Opens and reads all input files listed on the command line
  • Handles memory management for all variables
  • Parses each line and splits it into fields using the field separator
  • Presents each line of text to your program as variable $0
  • Presents each field from each line in predefined variables, starting with
    $1, $2, ... $N
  • Maintains many internal variables for your use, such as (but not limited to):
    • RS = record separator
    • FS = field separator
    • NF = number of fields in the current record
    • NR = number of records processed so far
  • Automatically converts values between strings and numbers as
    needed
  • Executes the BEGIN block before processing any records (a good place to
    initialize variables)
  • Executes the END block after processing all records (a good place to
    calculate report totals)
  • Closes all input files listed on the command line
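A short sketch can make several of these points concrete at once. The three input lines below are made up; the program relies only on the BEGIN and END blocks and the built-in variables NR, NF, and $1 described in the list:

```shell
#!/bin/sh
# Three made-up input lines with a varying number of fields.
report=$(printf '%s\n' 'a b c' 'd e' 'f' | awk '
BEGIN { total = 0 }                  # runs once, before any input
{
    total += NF                      # NF: number of fields in this record
    print NR ": " NF " fields, first = " $1
}
END { print "records: " NR ", fields: " total }
')

printf '%s\n' "$report"
```

Notice that the program never opens a file, declares a variable, or splits a line by hand.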

The awk language uses only three internal data types: strings, floating point numbers, and arrays. Variables do not have to be defined before they are used. Awk handles converting data from one type to another as necessary. If you add two strings together using the addition operator (+) and they contain numeric values, you get a numeric result. If a string is used in an arithmetic operation but can't be converted to a number, it is converted to zero. Usually, awk does what you want when handling data conversion.
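The conversion rules are easy to try out in a BEGIN block, which needs no input at all. A minimal sketch, with variable names chosen only for this illustration:

```shell
#!/bin/sh
conv=$(awk 'BEGIN {
    a = "3"; b = "4"
    print a + b        # numeric strings are added as numbers
    print a b          # placing values side by side concatenates them
    print "abc" + 1    # a non-numeric string counts as zero
}')

printf '%s\n' "$conv"
```

The three lines printed are 7, 34, and 1.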

Awk can open, read, and write to more files than those listed on the command line by using the getline function or redirecting output from within a program. It has access to a set of internal functions that include math, string manipulation, formatted printing (similar to the C language printf), and miscellaneous functions like pseudo-random numbers. You can also create your own functions or function libraries that can be used in several programs. All of this is packed into an executable usually about 500k in size.
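As a taste of two of those features, the sketch below combines a user-defined function with printf. The celsius() helper and the two input values are hypothetical, invented for this example:

```shell
#!/bin/sh
# celsius() is a hypothetical helper defined only for this illustration.
table=$(printf '%s\n' 32 212 | awk '
function celsius(f) { return (f - 32) * 5 / 9 }
{ printf "%d F = %.1f C\n", $1, celsius($1) }
')

printf '%s\n' "$table"
```

A function like this could be kept in its own file and loaded alongside the main program with an additional -f option.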

Programmers can typically become proficient in awk within a day. Complete references are available in a single book. You don't need a "bookshelf" of dead trees and CDs to master awk.

Implementations or ports of awk are available on nearly every platform, making your scripts reasonably portable.

Awk in the real world

Here is a short example of a recent awk application I created to import a list of email addresses and names from Novell Groupwise to PHPList, a mailing list manager. The list was exported from Groupwise in vCard file format (VCF), a text-based format. Here is an example entry from the VCF file:

BEGIN:VCARD
VERSION:2.1
X-GWTYPE:USER
FN:Bar, Foo
ORG:;GREEN
EMAIL;WORK;PREF:foobar@yahoo.com
N:Bar;Foo
X-GWUSERID:foobar
X-GWADDRFMT:0
X-GWIDOMAIN:yahoo.com
X-GWTARGET:TO
END:VCARD

The target format was a CSV file that PHPList could import into an existing mailing list. I needed to extract the name from the record that starts with "FN" and the email address from the record that starts with "EMAIL."

I started construction of the script by setting up a custom field separator and a block of code to handle each record type. I saved the script in a text file called extract-emails.awk. Note that the .awk file extension is just convention; the file containing awk commands can be named anything. This was the beginning of the script:


BEGIN { FS = ":" }

/^FN/ {
# handle name here
}

/^EMAIL/ {
# handle email address here
}

The BEGIN block is run once before any records are read. It sets the field separator to a colon so awk will split the fields of the file when it encounters a colon.

The regular expressions /^FN/ and /^EMAIL/ tell awk to look for the characters "FN" or "EMAIL" at the start of a record, and if a match is found, run the associated block of code between the curly braces. This kind of regular expression match is common in awk but not required. A block of code with no match expression is run for every record processed by awk. I added a couple of comments (lines starting with "#") to document what each part of the script does.
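To watch pattern matching at work, here is a sketch that runs a skeleton of this kind against a trimmed copy of the sample entry. The counter named seen and the printed labels are my own additions for illustration:

```shell
#!/bin/sh
# A trimmed copy of the sample vCard entry, held in a shell variable.
vcard='BEGIN:VCARD
FN:Bar, Foo
EMAIL;WORK;PREF:foobar@yahoo.com
END:VCARD'

matches=$(printf '%s\n' "$vcard" | awk '
BEGIN    { FS = ":" }
/^FN/    { print "name: " $2 }
/^EMAIL/ { print "email: " $2 }
         { seen++ }                  # no pattern: runs for every record
END      { print "records seen: " seen }
')

printf '%s\n' "$matches"
```

The two pattern blocks each fire once, while the block with no pattern counts all four lines.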

Looking at the VCF data, I noticed that the "FN" record always precedes the "EMAIL" record, so I ordered the code blocks to process the records that way. Awk reads and executes a script in the order it appears. Many times, the order of the code will not matter, but in this case it does. The name is related to the email and I need to retain that relationship as the file is read, so I saved the name in an internal variable, then wrote both the email address and name to standard out while processing the email record.
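The key idea is that a variable assigned while one record is processed is still available when a later record arrives. A minimal sketch of that carry-over, using two made-up entries reduced to their FN and EMAIL lines and a hypothetical variable name:

```shell
#!/bin/sh
# Two made-up entries; only the FN and EMAIL lines are kept.
cards='FN:Bar, Foo
EMAIL;WORK;PREF:foobar@yahoo.com
FN:Baz, Bar
EMAIL;WORK;PREF:barbaz@yahoo.com'

pairs=$(printf '%s\n' "$cards" | awk -F":" '
/^FN/    { last_name_seen = $2 }             # remember the most recent name
/^EMAIL/ { print $2 " <- " last_name_seen }  # pair it with the next email
')

printf '%s\n' "$pairs"
```

If an EMAIL record ever came before its FN record, it would be paired with the previous entry's name, which is why the record order in the data matters here.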

Getting back to the task, let's complete the name section. The goal is to reformat the name from "lastname, firstname" into "firstname lastname," removing the comma. Here was my code:


/^FN/ {
# handle name here
fullname = tolower($2)
split(fullname, names, ",")
name = names[2] " " names[1]
}

Knowing that awk has split up the incoming records into fields using a colon as the field separator, the field variables for the example "FN" record contain the following:


$1 = "FN"
$2 = "Bar, Foo"

Working with the $2 variable, I used a built-in awk function, tolower(), to convert the names to lowercase and stored the result in a variable called "fullname." Next, I used the split function to break the name into first and last name parts, with the result stored in an array called "names." Finally, I glued the name back together in the desired order, without the comma, and stored that result in a variable called "name."
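It can help to look at exactly what split() leaves in the array. The sketch below feeds the sample "FN" record to awk and prints each piece inside square brackets; the brackets and the variable n are additions for illustration only:

```shell
#!/bin/sh
steps=$(echo 'FN:Bar, Foo' | awk -F":" '{
    fullname = tolower($2)             # "bar, foo"
    n = split(fullname, names, ",")    # split returns the number of pieces
    print n
    print "[" names[1] "]"
    print "[" names[2] "]"             # keeps the space that followed the comma
}')

printf '%s\n' "$steps"
```

The second piece still begins with the space that followed the comma in the original name.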

There is very little to do inside the email code block. Awk provides the email address to us in the $2 variable (note that $2 in the "EMAIL" record is different from $2 in the "FN" record). For consistency, I converted it to lowercase, then used the print statement to write both the email address and name to standard out, with a comma separating the values. Here is the complete script:


BEGIN { FS = ":" }

/^FN/ {
# handle name here
fullname = tolower($2)
split(fullname, names, ",")
name = names[2] " " names[1]
}

/^EMAIL/ {
# handle email address here
mail = tolower($2)
print mail "," name
}
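For a quick test without any files on disk, the same logic can be run inline against the sample entry. This is a condensed sketch, not the saved script itself: the program is passed on the command line, and the space between the two name parts is written out explicitly.

```shell
#!/bin/sh
# Inline copy of the sample entry so the run needs no external files.
entry='BEGIN:VCARD
VERSION:2.1
X-GWTYPE:USER
FN:Bar, Foo
ORG:;GREEN
EMAIL;WORK;PREF:foobar@yahoo.com
N:Bar;Foo
X-GWUSERID:foobar
X-GWADDRFMT:0
X-GWIDOMAIN:yahoo.com
X-GWTARGET:TO
END:VCARD'

line=$(printf '%s\n' "$entry" | awk '
BEGIN { FS = ":" }
/^FN/ {
    split(tolower($2), names, ",")
    name = names[2] " " names[1]     # names[2] still carries a leading space
}
/^EMAIL/ { print tolower($2) "," name }
')

printf '%s\n' "$line"
```

All the other record types in the entry are skipped, because no pattern matches them.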

A sprinkle of shell glue

To pull it all together, we need a little shell glue. A small shell script allows us to call awk with the parameters we want and to easily redirect the output to a file. It is also handy to run a shell script when you are testing.


#!/bin/sh
# Extract e-mail addresses from VCF file for PHPList.
awk -f extract-emails.awk groupwise.vcf > phplist-emails.txt

Awk can be used as an intermediate step in a larger shell script where the output is fed into another utility, such as sort, grep, or another awk script.

Finally, here is a sample of the output:

foobar@yahoo.com, foo bar
barbaz@yahoo.com, bar baz
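As an illustration of awk as one stage in a pipeline, the sketch below takes two lines in that output format, keeps only the address column, and hands the result to sort. The pipeline is my own example and was not part of the import job:

```shell
#!/bin/sh
# Two lines in the "email, name" format shown above stand in for the
# output of the extraction step.
# awk keeps only the address column; sort then orders the addresses.
sorted=$(printf '%s\n' 'foobar@yahoo.com, foo bar' 'barbaz@yahoo.com, bar baz' |
    awk -F"," '{ print $1 }' |
    sort)

printf '%s\n' "$sorted"
```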

Where awk falls short

There are certain tasks that are beyond the capabilities of awk. For instance, if you need to do anything that communicates using network sockets, awk is not your best bet. The same is true if you need to process binary files. The latest version of GNU awk does have some rudimentary network capabilities, but Perl, PHP, and Ruby are much better equipped for those tasks.

Awk is an expert tool for text processing, and the roots of Perl are clear in its design. It is powerful enough to handle almost any kind of text crunching or reporting, while being easy to learn and use.

There are many choices when it comes to scripting languages, but I find awk the best choice for many problems. Although awk is employed most often for smaller problems, it can be used for large applications. I have worked on a 12,000-line awk application used to adjudicate dental claims. This application was the core system for a successful million-dollar business. If you take the time to learn awk, the rewards will last a lifetime.
