March 29, 2010

Learning GNU Text Utilities

A few weeks ago we looked at some of the GNU utilities that you can use to work with files, check MD5/SHA1 sums and check your disk usage. This time around I want to cover some of the utilities that you'll use for working with text files.

Why text files, specifically? Well, if you're doing much work at the shell on Linux, you'll start encountering a lot of text or files that can behave like text. Log files, configuration files and output from many commands can all be manipulated with the GNU textutils.

At one time, the GNU textutils were broken out into their own package, but a few years ago they were merged into the GNU coreutils. For convenience's sake, I'm going to keep the old moniker because that's a handy way of thinking about those tools.

Assuming you're running any of the major Linux distros and have a default installation, you should have the GNU coreutils package installed already. Some minimalist Linux distros ship BusyBox instead. If that's the case, you may have some versions of the tools we're discussing here installed, but they may not have all of the same options as the GNU textutils. Both BusyBox and the GNU textutils are actually implementations of tools that were initially developed for proprietary UNIX.

Some of the utilities are more useful than others today, so I'm going to focus on the utilities that are most likely to be useful to you. For instance, the base64 utility for converting in/out of base64 isn't something I've had any call to use in the last 10 years. But fmt, nl and uniq still prove useful on a regular basis.

A few have been covered in the previous GNU tutorial as well, specifically md5sum, head and tail, so please go back and check the Getting to Gno GNU Utilities tutorial if you need to brush up on those.

Understanding cat and tac

The cat utility is short for "concatenate." I think we can all agree it was wise to trim that one down a bit. OK, but what does it mean? Basically, cat will take a file or standard input and send it to standard output. If you don't redirect the output from cat it will print to the terminal, or you can use one of the redirectors and send the output to a file or another utility.

You can also use cat to join files to work on them together. For instance, if you want to process a couple of logfiles at once, you could use something like:


cat filename1 filename2 | sort

The tac command is basically just cat backwards: it prints the lines of its input in reverse order, last line first. This might be useful if you want to process a logfile from the newest entries to the oldest.
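As a small sketch of the difference, assuming two throwaway files (a.log and b.log are made-up names for illustration), cat joins the files in the order given and piping that through tac flips the combined lines:

```shell
# Work in a throwaway directory so nothing real is touched.
tmpdir=$(mktemp -d)
printf 'first\nsecond\n' > "$tmpdir/a.log"
printf 'third\n' > "$tmpdir/b.log"

# cat joins the files in the order they are named.
joined=$(cat "$tmpdir/a.log" "$tmpdir/b.log")

# Piping the joined text through tac prints it last line first.
reversed=$(cat "$tmpdir/a.log" "$tmpdir/b.log" | tac)

printf '%s\n' "$joined"
printf '%s\n' "$reversed"
rm -rf "$tmpdir"
```

Note that tac given several filenames reverses each file separately, which is why the sketch pipes the joined text into tac instead.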

Some purists will note that cat is overused when simple file redirection would do. For example, you'll often see something like this:


cat filename | sort

Here cat reads the file and pipes (|) its contents to sort. Actually, you can do the same thing by running sort < filename and save yourself a bit of typing. However, do whatever works best for you.
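To see that the forms really are interchangeable, here is a quick sketch using a temporary file with made-up contents. It also includes a third form, passing the filename straight to sort:

```shell
# A scratch file with three unsorted lines.
tmpfile=$(mktemp)
printf 'pear\napple\nfig\n' > "$tmpfile"

via_cat=$(cat "$tmpfile" | sort)      # cat feeding sort through a pipe
via_redirect=$(sort < "$tmpfile")     # the shell hands the file to sort on stdin
via_argument=$(sort "$tmpfile")       # sort opens the file itself

printf '%s\n' "$via_redirect"
rm -f "$tmpfile"
```

All three variables end up holding the same sorted text; the redirect and argument forms just skip the extra cat process.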

Formatting Files

Many of the textutils are dedicated to formatting text for printing. Few folks are dealing with line printers and using CLI tools to send to printers these days (though not an obsolete art, to be sure), but you may still find these tools useful in many situations.

The fmt command will reformat text for writing to standard out or another file. This is primarily used to reflow text to a given column width. Say you want to reflow a text file so each line is at most 72 characters. You can do this easily with fmt like so:


fmt -w72 filename.txt

This will only break lines on whitespace characters: It won't break a whole string (word or other non-broken set of characters), even one that's longer than 72 characters. A string that long is simply left intact on a line of its own.
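Here is a rough sketch of that behavior with a narrower width of 20 so the effect is easy to see. The sample text and the long path are invented for illustration:

```shell
# Sample text: ordinary words plus one long string with no spaces in it.
tmpfile=$(mktemp)
printf 'the quick brown fox jumps over the lazy dog and keeps running\n' > "$tmpfile"
printf 'see /var/www/a/very/long/path/that/has/no/spaces here\n' >> "$tmpfile"

# Reflow to at most 20 columns; fmt only breaks at whitespace.
wrapped=$(fmt -w 20 "$tmpfile")

printf '%s\n' "$wrapped"
rm -f "$tmpfile"
```

The ordinary words are packed into short lines, while the long path overflows the limit rather than being cut in two.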

The cut utility will remove sections from a file. So if you want to deal with only a section of a logfile, for instance, you can use the cut utility to chop out only the bits that you want to work with. The cut utility works on each line. So, given input like this:


[Sun Mar 21 16:36:21 2010] [error] [client] File does not exist: /var/www/components
[Sun Mar 21 17:24:42 2010] [error] [client] File does not exist: /var/www/joomla
[Sun Mar 21 17:37:15 2010] [error] [client] script '/var/www/index.php' not found or unable to stat
[Sun Mar 21 18:06:59 2010] [error] [client] File does not exist: /var/www/robots.txt

You can use cut to separate individual fields. So if you only want to work with, say, the IP address or the error message, cut can pull out just that part of each line.
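As a minimal sketch, assume a log line where the client field carries an address. The 192.0.2.10 below is a hypothetical value from the reserved documentation range, not something from a real log. Splitting on spaces, the address is field 8 and the message starts at field 9:

```shell
# One sample line; the client address is hypothetical.
line='[Sun Mar 21 17:24:42 2010] [error] [client 192.0.2.10] File does not exist: /var/www/joomla'

# -d ' ' splits on single spaces; -f 8 keeps only the eighth field.
ip_field=$(printf '%s\n' "$line" | cut -d ' ' -f 8)

# -f 9- keeps the ninth field and everything after it.
message=$(printf '%s\n' "$line" | cut -d ' ' -f 9-)

printf '%s\n' "$ip_field"
printf '%s\n' "$message"
```

The address comes out with its closing bracket still attached, because cut only knows about the delimiter you give it.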

Need to trim a file by lines instead? The split utility will take a logfile or other text file and output it to smaller files. The default is to take the input and spit out files of 1,000 lines each. Unless specified, the files will be named "xaa," "xab" and so on until split finishes the file or runs out of suffixes.

Why would you want to do this? One reason is to make it easier to work with logfiles, break them down into smaller chunks for archiving or processing.
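For instance, here is a sketch that builds a 2,500-line file (big.log is a made-up name) and lets split cut it up with the default settings:

```shell
# Build a 2,500-line file in a scratch directory.
tmpdir=$(mktemp -d)
seq 1 2500 > "$tmpdir/big.log"

# With no options, split writes 1,000-line pieces named xaa, xab, ...
( cd "$tmpdir" && split big.log )

# List the pieces and count the lines in the last one.
pieces=$(cd "$tmpdir" && echo x*)
last_count=$(wc -l < "$tmpdir/xac")

printf '%s\n' "$pieces"
rm -rf "$tmpdir"
```

Two full pieces and one 500-line remainder come out. The -l option changes the number of lines per piece if 1,000 isn't what you want.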

Finding Unique Entries with uniq

When you're wading through logfiles or processing other text files, you'll often have files with a lot of similar entries. If you want to winnow those down to unique entries, uniq is the tool to reach for. Let's start with a simple example, like a text file of 1,000 email addresses. You know you have duplicates but don't feel like sorting through the file by hand and probably wouldn't spot all the duplicates anyway. No problem; just filter the list using uniq:


uniq < emails.txt

The uniq utility will omit duplicates. If you want to see how many times you have duplicate lines, use the -c option to count the number of times a line appears.
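A quick sketch of -c, using made-up example.com addresses where the duplicates sit next to each other:

```shell
# Three lines; the first two are identical and adjacent.
counts=$(printf 'ann@example.com\nann@example.com\nbob@example.com\n' | uniq -c)

# Each output line is a count followed by the line it counted.
printf '%s\n' "$counts"
```

Each address appears once in the output, prefixed by how many times it occurred in a row.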

But wait. You notice that some lines may be duplicates after all! That's because uniq looks for two (or more) lines together. If you have duplicate lines that are separately listed, they'll be missed. Unless you combine uniq with another GNU textutil: sort.

Sort It Out With sort

The sort utility does just what you'd expect: It takes input and sorts it according to the criteria you give it. The default is to sort by "dictionary" order, but it can also sort by numeric value, in reverse order, etc. See the man page for the full range of options, but rest assured if there's a way you want to sort a file the option probably exists.
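To make the difference between the default and numeric sorts concrete, here is a small sketch with three made-up numbers:

```shell
values=$(printf '10\n9\n100\n')

# Default: compares character by character, so "100" lands before "9".
default_order=$(printf '%s\n' "$values" | sort | paste -sd ' ' -)

# -n: compares by numeric value.
numeric_order=$(printf '%s\n' "$values" | sort -n | paste -sd ' ' -)

# -r reverses whatever ordering is in effect.
reverse_numeric=$(printf '%s\n' "$values" | sort -nr | paste -sd ' ' -)

printf '%s\n' "$default_order" "$numeric_order" "$reverse_numeric"
```

The paste call is only there to put each result on one line for easy comparison.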

Here's how we'd combine sort and uniq to get rid of those pesky duplicates:


sort emails.txt | uniq > sorted_emails.txt

Pretty easy, right? You simply pipe the output from sort to uniq. By chaining the commands, we can start doing some really useful work.
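Here is a sketch that shows why the sort step matters, using a made-up list where the duplicate lines are not next to each other:

```shell
# bob appears twice, but never on adjacent lines.
tmpfile=$(mktemp)
printf 'bob@example.com\nann@example.com\nbob@example.com\ncat@example.com\n' > "$tmpfile"

# uniq alone only collapses adjacent duplicates, so nothing is removed.
unsorted_count=$(uniq < "$tmpfile" | wc -l)

# Sorting first puts the duplicates side by side for uniq to collapse.
deduped=$(sort "$tmpfile" | uniq)

# sort -u does the sort and the de-duplication in one step.
deduped_u=$(sort -u "$tmpfile")

printf '%s\n' "$deduped"
rm -f "$tmpfile"
```

The sort -u shortcut is handy when you only need the unique lines; the separate uniq is still worth knowing for options like -c.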

Putting Them Together

Individually, the textutils are useful but you might be wondering what the big deal is. You can't do a lot with cat or uniq individually. But when you start chaining the commands, you can do some pretty powerful stuff.

Let's say you want a report of all the unique IP addresses that have appeared in a log file in the last day or so. We're going to use cut, sort and uniq to get all the unique IP addresses. As an added bonus, we'll throw in nl, a utility that will number lines for you:


cut -d ' ' -f 8 error.log | sort -n | uniq | nl

This takes the file error.log, and runs it through cut first. The options to cut tell the utility to use the space character as a delimiter (-d ' ') and to only spit out the 8th field (-f 8) from the file. So if you look at the file, the IP address is the 8th field if you count the fields separated (delimited) by spaces.

Then it runs that through sort to sort the IP addresses numerically (-n). Then, uniq removes the duplicate IP addresses. Finally, it runs the result through nl to number the lines, so the last number doubles as a count of the unique addresses.

That will produce output like this:
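As a rough sketch, assuming hypothetical client addresses from the reserved documentation ranges (192.0.2.0/24 and 198.51.100.0/24, not taken from any real log), the pipeline can be run against a small sample file to see the numbered list, trailing brackets included:

```shell
# Sample error.log in a scratch directory; the addresses are hypothetical.
tmpdir=$(mktemp -d)
cat > "$tmpdir/error.log" <<'EOF'
[Sun Mar 21 16:36:21 2010] [error] [client 192.0.2.44] File does not exist: /var/www/components
[Sun Mar 21 17:24:42 2010] [error] [client 192.0.2.7] File does not exist: /var/www/joomla
[Sun Mar 21 17:37:15 2010] [error] [client 192.0.2.44] script '/var/www/index.php' not found or unable to stat
[Sun Mar 21 18:06:59 2010] [error] [client 198.51.100.3] File does not exist: /var/www/robots.txt
EOF

# Field 8 is the address; sort, drop duplicates, then number the lines.
report=$(cut -d ' ' -f 8 "$tmpdir/error.log" | sort -n | uniq | nl)

# Prints one numbered line per unique address, each still ending in "]".
printf '%s\n' "$report"
rm -rf "$tmpdir"
```

Four log lines go in and three numbered addresses come out, since 192.0.2.44 appears twice.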



Good, but not great just yet. I don't like the trailing bracket. So let's throw in a bonus utility, sed, which is a stream editor:


cut -d ' ' -f 8 error.log | sort -n | uniq | sed s/']'// | nl 

Now you'll get the same result, but without the annoying end bracket. The sed command simply uses a search and replace. If you followed the Vim 101 beginner's guide a few months ago, that should look familiar. The s/ starts the search, and then ']' tells sed to look for a closing bracket. Why did I use the single quotes? Because quoting tells the shell to treat the bracket literally (i.e., not to interpret it). Bash only treats ] as special when it closes a [ pattern, so a lone ] is usually harmless, but quoting is a safe habit, and you'll more often see the whole expression quoted: sed 's/]//'. Finally, the closing // tells sed to replace the bracket with nothing.
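A small sketch of the substitution on its own, using a hypothetical address, shows that quoting just the bracket and quoting the whole expression hand sed the same thing:

```shell
# A single field as cut would emit it; the address is hypothetical.
address='192.0.2.44]'

# Only the bracket is quoted, as in the pipeline above.
quoted_bracket=$(printf '%s\n' "$address" | sed s/']'//)

# The whole expression is quoted, which is the more common habit.
quoted_expression=$(printf '%s\n' "$address" | sed 's/]//')

printf '%s\n' "$quoted_bracket" "$quoted_expression"
```

Without a trailing g flag, sed replaces only the first match on each line, which is all that's needed here.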

Once again, this really only scratches the surface of what the GNU utils can do. But I hope it's given you a rough guide to how you can use the GNU textutils and how they can be useful to you. In future tutorials, we'll take a look in more depth at sed if there's interest. Let me know in the comments! In the meantime, take a little while to familiarize yourself with the GNU utilities. You'll find that it's well worth it!
