June 16, 2005

Pipes and filters

Author: Mark Alexander Bain

I still remember the day, many years ago, when a wise old programmer looked over my shoulder and said, "Ah, Grasshopper, you need a pipe!" and so set me on the path to true enlightenment.

A pipe is a means by which the output from one process becomes the input to a second. In technical terms, the standard output (stdout) of one command is sent to the standard input (stdin) of a second command. If you are not sure of the advantages this creates, then let's look at a simple example.

In this example, we'll send a directory listing to an email account.

ls -l ~ > ls.tmp
mail -s "Home directory listing" info@markbain-writer.tk < ls.tmp

This works well, but it's rather cumbersome and requires the creation of an interim file. The use of a pipe allows a simpler command structure and needs no extra files:

ls -l ~ | mail -s "Home directory listing" info@markbain-writer.tk

You will notice that a pipe is defined by the | symbol -- not an uppercase i or the number one, but a vertical bar.
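
If you want to try the comparison without a working mail setup, here is a minimal sketch. It assumes wc -l as a stand-in for mail (so nothing is actually sent) and a scratch file from mktemp; both are choices made for illustration only.

```shell
#!/bin/sh
# Sketch: the temp-file approach and the pipe deliver the same data to the reader.
# wc -l stands in for mail here (an assumption, so nothing is sent anywhere).

tmp=$(mktemp)                          # interim file, as in the first version
printf 'one\ntwo\nthree\n' > "$tmp"    # producer writes to the file
via_file=$(wc -l < "$tmp")             # consumer reads from the file
rm -f "$tmp"

# Same producer and consumer joined by a pipe: no interim file to create or remove.
via_pipe=$(printf 'one\ntwo\nthree\n' | wc -l)

# Unquoted on purpose: drops the padding some versions of wc add.
echo "via file:" $via_file "via pipe:" $via_pipe
```

Both counts come out the same; the pipe version simply skips the create-and-delete step.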

Introducing the filter

A pipe passes the standard output of one operation to the standard input of another unchanged; a filter modifies the stream. A filter reads the standard input, does something useful with it, and writes the result to the standard output. Linux has a large number of filters. Some useful ones are the commands awk, grep, sed, spell, and wc.
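
To make the idea concrete, here is a small sketch using two of the filters named above. The input words are invented sample data: grep keeps only the matching lines, and wc reduces whatever it receives to a count.

```shell
#!/bin/sh
# Sketch: each filter reads stdin, transforms it, and writes stdout.
# The input lines are invented sample data.

sample() {
    printf 'apple\nbanana\navocado\ncherry\n'
}

# grep passes through only the lines that start with "a".
a_words=$(sample | grep '^a')

# wc -l reduces the stream to one number: the count of lines it received.
a_count=$(sample | grep '^a' | wc -l)

echo "$a_words"
echo "count:" $a_count    # unquoted on purpose: drops padding some wc versions add
```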

If we look back at our pipe example from above, we see that it gives an output something like:

drwxr-xr-x   4 bainm users    4096 2005-06-05 16:31 Desktop/
drwxr-xr-x   5 bainm users    4096 2004-11-15 00:00 GNUstep/
drwx------  11 bainm users    4096 2005-06-04 18:02 Mail/
-rw-r--r--   1 bainm users   10240 2005-01-06 20:36 New_database.kexi
drwxr-xr-x   5 bainm users    4096 2005-05-27 12:53 OpenOffice.org1.1.2/
-rwxr-xr-x   1 bainm users  548788 2004-10-20 19:45 Project1*
drwxr-xr-x   3 bainm users    4096 2004-10-18 10:52 Projects/
-rw-r--r--   1 bainm users    4242 2004-10-20 19:45 Unit1.dcu
drwxr-xr-x   3 bainm users    4096 2005-05-24 11:59 XamXpm/
drwxr-xr-x  11 bainm users    4096 2005-06-03 10:26 articles/
drwxr-xr-x   2 bainm users    4096 2005-05-30 15:09 backup/

Let's say that in our email we require only files (not directories) sorted by the largest first and showing only the file name, owner, date last modified, and file size (in that order). To do this, we can use three of the Linux filters: awk (to format), grep (to remove the unwanted lines) and sort (to get the lines in the correct order). In between each filter, we can use a pipe to pass on the result from the individual operations.

The first filter (grep) removes any directories from the list by excluding (-v) any lines that start with "d":

grep -v "^d"

The next filter (awk) extracts the required fields (file name, owner, date and time last modified, and file size). It also places the file size at the start of the line so that the data is ready for sorting:

awk '{print $5, $8, $3, $6, $7}'

The next filter (sort) orders the lines numerically (-n) and in reverse (-r), so that the largest file comes first:

sort -nr
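
As a quick sketch of why the -n matters, the following compares a plain reverse sort with a numeric one. The three sizes are taken from the listing above; the variable names are just for illustration.

```shell
#!/bin/sh
# Sketch: plain reverse sort versus numeric reverse sort on file sizes.

sizes() {
    printf '4242\n548788\n10240\n'
}

text_order=$(sizes | sort -r)        # compares character by character
numeric_order=$(sizes | sort -nr)    # compares by numeric value, largest first

echo "sort -r:"
echo "$text_order"
echo "sort -nr:"
echo "$numeric_order"
```

Without -n, 4242 lands above 10240 because "4" sorts after "1" as a character; with -n the values are compared as numbers.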

And the final filter (another awk) formats the data ready to be emailed:

awk '{print $2 "\t" $3 "\t" $4, $5 "\t" $1}'

Finally, all we have to do is join the filters together with pipes:

ls -l ~ |
grep -v "^d" |
awk '{print $5, $8, $3, $6, $7}' |
sort -nr |
awk '{print $2 "\t" $3 "\t" $4, $5 "\t" $1}' |
mail -s "File List" info@markbain-writer.tk

The result is something like:

backup.zip      bainm   2005-05-30 13:03        1139563
Project1*       bainm   2004-10-20 19:45        548788
Delphi_job_spec.rtf     bainm   2004-10-14 13:37        217524
output.ps       bainm   2004-12-01 21:22        166465
print.pdf       bainm   2005-03-06 20:50        47266
kstars.png      bainm   2005-03-05 17:35        20586
driving.htm     bainm   2004-11-04 21:46        14977
comp.htm*       root    2004-08-05 18:29        11101
New_database.kexi       bainm   2005-01-06 20:36        10240
projections.sxc bainm   2004-12-21 13:33        7597
testhtml.sxw    bainm   2005-01-06 11:33        5529

The pipes and filters allow us to create an elegant piece of scripting. Now, instead of six individual commands, we have a single, flowing process.
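
If you would like to check the chain without touching your home directory or sending mail, the sketch below runs the same four filters on canned input. It assumes the 8-column layout shown in the listing earlier (date as a single YYYY-MM-DD field); where ls -l prints the date as three fields, such as "Jun 5 16:31", the field numbers in the awk stages would need to shift. The function names listing and report are made up for this example.

```shell
#!/bin/sh
# Sketch: the filter chain from the article, fed with canned lines instead of
# live ls output. The mail stage is left off so the result is printed instead.

listing() {
    cat <<'EOF'
drwxr-xr-x   4 bainm users    4096 2005-06-05 16:31 Desktop/
-rw-r--r--   1 bainm users   10240 2005-01-06 20:36 New_database.kexi
-rwxr-xr-x   1 bainm users  548788 2004-10-20 19:45 Project1*
EOF
}

report() {
    # 1. drop directories  2. size first, ready to sort
    # 3. largest first     4. name, owner, date and time, size
    listing |
    grep -v "^d" |
    awk '{print $5, $8, $3, $6, $7}' |
    sort -nr |
    awk '{print $2 "\t" $3 "\t" $4, $5 "\t" $1}'
}

report
```

With this input, the Desktop/ line is dropped and Project1* is listed ahead of New_database.kexi.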

Some useful filters

There are many Linux commands that are filters, in addition to awk, grep, and sort. Two filters to consider are tr (translate) and sed (stream editor). Both commands allow you to modify the stream -- tr for simple changes and sed for the more complex. For example, you can use tr 'a-z' 'A-Z' to convert everything to uppercase, or sed 's/\*//g' to remove the stars from the names of executable files. Quoting the arguments stops the shell from trying to expand them as file name patterns.
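
Here is a short sketch of both filters at work on a few file names borrowed from the listing above. The sed expression is a slightly narrower variant that strips only a trailing star, which is an assumption about where the executable marker appears.

```shell
#!/bin/sh
# Sketch: tr for character-for-character changes, sed for pattern-based edits.
# The file names are sample data borrowed from the earlier listing.

names() {
    printf 'Project1*\ncomp.htm*\nprint.pdf\n'
}

# tr maps each lowercase letter to its uppercase counterpart.
upper=$(names | tr 'a-z' 'A-Z')

# sed removes a "*" at the end of a line; other lines pass through untouched.
no_stars=$(names | sed 's/\*$//')

printf '%s\n' "$upper" "$no_stars"
```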

Another filter to consider is tee, which enables you to split a stream between stdout and a file. For example:

ls -l | tee file.lst | wc -l

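This will create a file (file.lst) containing the result from ls -l and will display the number of lines in that result on the screen (or pass it on to another filter, if you require). Note that ls -l normally starts with a "total" line, so the count is one more than the number of files.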
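
A minimal sketch of the split, using three sample lines in place of ls -l and a scratch directory from mktemp -d so that no real files are touched:

```shell
#!/bin/sh
# Sketch: tee writes a copy of the stream to a file and passes it on unchanged.

workdir=$(mktemp -d)    # scratch directory for the example

# The three sample lines go both into file.lst and on to wc -l.
count=$(printf 'alpha\nbeta\ngamma\n' | tee "$workdir/file.lst" | wc -l)

# The file holds exactly what flowed through the pipe.
saved=$(cat "$workdir/file.lst")

echo "lines counted:" $count    # unquoted: drops padding some wc versions add
echo "lines saved:"
echo "$saved"

rm -rf "$workdir"
```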

Creating your own filters

So far, we have learned how to use pipes and simple filters together. The next step is to learn how to build a filter for a specific job. The above example will send a list of all the files in the home directory. However, let's assume that we're interested only in files that are greater than 10,000 bytes in size. We need to add in a new filter:

ls -l  ~ |
grep -v "^d" |
awk '{print $5, $8, $3, $6, $7}'  |
only_big_files |
sort -nr  |
awk '{print $2 "\t" $3 "\t" $4, $5 "\t" $1}' |
mail -s "File List" info@markbain-writer.tk

The filter must first read the standard input. To do this, enclose any functionality within a "while read" loop. Any fields passed to the filter must be placed into variables:

while read SIZE FILE NAME DATE TIME
do
	...
done
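
To see how read hands out the fields, here is a small sketch. read splits each line on whitespace and assigns the words to the variables in order, with any leftover words going to the last variable. The input line is sample data in the size-first layout produced by the first awk stage, and show_fields is a made-up name for this example.

```shell
#!/bin/sh
# Sketch: how "while read" maps whitespace-separated fields onto variables.

show_fields() {
    # -r stops read from treating backslashes in the input specially.
    while read -r SIZE FILE NAME DATE TIME
    do
        # Quoted so that a name such as "Project1*" is not expanded by the shell.
        echo "size=$SIZE file=$FILE owner=$NAME date=$DATE time=$TIME"
    done
}

parsed=$(echo "548788 Project1* bainm 2004-10-20 19:45" | show_fields)
echo "$parsed"
```

Note that in this layout FILE holds the file name and NAME holds the owner.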

Having read the standard input, we can now create the body of the filter. Here we simply check to see if the file size is greater than 10,000 bytes. If it is, we send the data to the standard output. If not, we move on to the next line:

if [ "$SIZE" -gt 10000 ]
then
	# Quoted so that names such as Project1* are not expanded by the shell
	echo "$SIZE $FILE $NAME $DATE $TIME"
fi

The completed filter is:

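function only_big_files {
# Read one line at a time: size, file name, owner, date, time
while read -r SIZE FILE NAME DATE TIME
do
	# Pass the line on only if the size is over 10,000 bytes
	if [ "$SIZE" -gt 10000 ]
	then
		# Quoted so that names such as Project1* are not expanded by the shell
		echo "$SIZE $FILE $NAME $DATE $TIME"
	fi
done
}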

You could, of course, use the awk filter to do the same:

awk '$1 > 10000'
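
As a rough check that the two approaches agree, the sketch below feeds the same sample lines to a shell filter and to awk. The sample data is invented in the size-first layout used above, the shell function is written in the portable name() form, and min is a hypothetical variable name passed to awk with -v.

```shell
#!/bin/sh
# Sketch: the shell filter and the awk one-liner select the same lines.

only_big_files() {
    while read -r SIZE FILE NAME DATE TIME
    do
        if [ "$SIZE" -gt 10000 ]
        then
            echo "$SIZE $FILE $NAME $DATE $TIME"
        fi
    done
}

sample() {
    printf '%s\n' \
        "548788 Project1* bainm 2004-10-20 19:45" \
        "10240 New_database.kexi bainm 2005-01-06 20:36" \
        "4242 Unit1.dcu bainm 2004-10-20 19:45"
}

from_shell=$(sample | only_big_files)
# "min" is a hypothetical name; -v sets an awk variable from the command line.
from_awk=$(sample | awk -v min=10000 '$1 > min')

printf '%s\n' "$from_shell"
```

Both versions keep the first two lines and drop Unit1.dcu. Passing the threshold in with -v makes the limit easy to change without editing the awk program itself.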

Final thoughts

I find pipes and filters invaluable. Their uses range from simple processes (such as ls -l | more) through to the highly complex. Like so many things in Linux, you'll wonder how you ever managed to live without them.
