GNU grep’s new features

Author: Michael Stutz

If you haven’t been paying attention to GNU grep recently, you should be pleasantly surprised by some of the new features and options that have come about with the 2.5 series. They bring it functionality you can’t get anywhere else, including the ability to output only matched patterns (not lines), color output, and new file and directory options.

Granted, the addition of this feature set caused a number of bugs that made it necessary to rewrite part of the code, but the latest 2.5.1a bugfix release is eminently usable.

One highlight of the new version is its ability to output only matched patterns. This is one of the most exciting features, because it adds completely new functionality to the tool. Remember, “grep” is an acronym — it got its name from a function in the old Unix ed utility, global / regular expression / print — and its purpose was to output lines from its input that match a given regular expression.

It remains such, but the new -o option (or --only-matching) specifies that only the matched patterns themselves are to be output, and not the entire lines they come on. If more than one match is found on a single line, those matches are output on lines of their own.
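A minimal sketch of that behavior, using made-up input rather than a real file: the first line contains three matches, and each one comes back on a line of its own.

```shell
# Three matches on the first input line; the second line has none.
matches=$(printf 'cat hat bat\ndog\n' | grep -o '[a-z]at')
printf '%s\n' "$matches"
# prints:
# cat
# hat
# bat
```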

With this new option, suddenly GNU grep is transformed from a utility that outputs lines into a tool for harvesting patterns. You can use it to harvest data from input files, such as pulling out referrers from your server logs, or URLs from a file:

egrep -o '(((http(s)?|ftp|telnet|news|gopher)://|mailto:)[^()[:space:]]+)' logfile

Or grab email addresses from a file:

egrep -o '\<[^@/:[:space:]]+\>@[a-zA-Z_.]+\.[a-zA-Z]{2,3}' somefile

Use it to pull out all the senders from an email archive and sort into a file of unique addresses:

grep '^From: ' huge-mail-archive | egrep -o '\<[^@/:[:space:]]+\>@[a-zA-Z_.]+\.[a-zA-Z]{2,3}' | sort | uniq > email.addresses
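To see what such a pipeline produces without a real archive on hand, here is a small sketch that runs it against a throwaway file. The sample messages and the archive variable are invented for illustration, the address expression is one reading of the pattern above, and grep -E stands in for egrep.

```shell
# Hypothetical sample archive; a real mail archive would take its place.
archive=$(mktemp)
cat > "$archive" <<'EOF'
From: Alice <alice@example.com>
Subject: hello
From: bob.smith@example.org
From: Alice <alice@example.com>
EOF

# Keep only From: lines, harvest the addresses, then sort and de-duplicate.
# sort -u does the work of "sort | uniq" in one step.
addresses=$(grep '^From: ' "$archive" |
  grep -E -o '\<[^@/:[:space:]]+\>@[a-zA-Z_.]+\.[a-zA-Z]{2,3}' |
  sort -u)
printf '%s\n' "$addresses"
# prints:
# alice@example.com
# bob.smith@example.org

rm -f "$archive"
```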

New uses for this feature keep popping up. You can use it, for instance, as a tool for testing regular expressions. Say you’ve whipped up a complicated regexp to do some task. You think it’s the world’s greatest regexp, it’s going to do everything short of solving all the world’s problems — but at runtime, it doesn’t seem to go as planned.

Next time this happens, use the -o option when you’re in the design stage, and have grep read from the standard input, where you can feed it test data — you’ll see right away whether or not it matches exactly what you think it does. Since grep will be tossing back to you not the matched lines but the actual matches to the expression, it’ll give you a pretty good clue how to fix it.
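As a rough illustration of that workflow, the sketch below feeds a few invented test lines to a candidate URL expression on standard input. The url_re variable and the sample text are assumptions made for the example.

```shell
# Candidate pattern under test (a cleaned-up reading of the URL expression).
url_re='(((http(s)?|ftp|telnet|news|gopher)://|mailto:)[^()[:space:]]+)'

# Feed made-up test data on standard input and print only what matched.
found=$(printf '%s\n' \
  'see http://example.com/a for details' \
  'mirror (ftp://host.example/file) is down' \
  'no links on this line' |
  grep -E -o "$url_re")
printf '%s\n' "$found"
# prints:
# http://example.com/a
# ftp://host.example/file
```

If a match comes back longer or shorter than expected, such as a URL that drags a closing parenthesis along with it, the output shows exactly where the expression needs tightening.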

Output matches in color

Use the --color option to display matches in the input in color (red, by default). Color is added via ANSI escape sequences, which don’t work in all displays, but grep is smart enough to detect this and won’t use color (even if specified) if you’re sending the output down a pipeline. Otherwise, if you piped the output to (say) less, the ANSI escape sequences would send garbage to the screen. If, on the other hand, that’s really what you want to do, there’s a workaround: use --color=always to force it, and call less with the -R flag (which passes ANSI color escape sequences through in raw form). That way, the color codes will be interpreted correctly and you’ll page through screens of text with your matched patterns in full color:

grep --color=always "regexp" myfile | less -R
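One way to check this behavior for yourself is to capture grep’s output and look for the ESC byte that begins every ANSI sequence. This is only a sketch: it uses --color=auto to spell out the detect-the-terminal mode explicitly, and the variable names are invented for the example.

```shell
esc=$(printf '\033')   # the ESC byte that starts every ANSI sequence

# Output goes to a pipe here (command substitution), so auto mode stays plain.
plain=$(printf 'hello world\n' | grep --color=auto world)

# --color=always keeps the escape sequences even though output is not a terminal.
forced=$(printf 'hello world\n' | grep --color=always world)

case $plain in *"$esc"*) echo "auto: colored" ;; *) echo "auto: plain" ;; esac
case $forced in *"$esc"*) echo "always: colored" ;; *) echo "always: plain" ;; esac
```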

The GREP_COLOR environment variable controls which color is used. To change the color from red to something else, set GREP_COLOR to a numeric value according to this chart:

30	black
31	red
32	green
33	yellow
34	blue
35	purple
36	cyan
37	white

For example, to have matches highlighted in a shade of green:

GREP_COLOR=32; export GREP_COLOR; grep pattern myfile

Use Perl regexps

One of the biggest developments in regular expressions to occur in the last few decades has been the Perl programming language, with its own regular expression dialect. GNU grep now takes Perl-style regexps with the -P option. (It’s not always compiled in by default, so if you get an error message of “grep: The -P option is not supported” when you try to use it, you’ll have to get the sources and recompile.)

To search for a bell character (Ctrl-g), you can now use:

grep -P '\cG' myfile

This is considered a “major variant” of grep, as with the -E and -F options (which are the egrep and fgrep tools, respectively), but it doesn’t yet come with an associated program name — perhaps new versions will have a prep binary (it sounds much better than pgrep) that will mean the same thing as using -P.
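Since -P may be missing from a given build, a script can probe for it before relying on it. The following sketch does that and then counts the lines containing a bell character; the test input and the bell_lines variable are made up for illustration.

```shell
# Probe for -P support first, since not every build includes it.
if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
  # Two input lines; only the first contains a bell character (octal 007).
  bell_lines=$(printf 'ding\007dong\nsilent\n' | grep -c -P '\cG')
else
  bell_lines="unsupported"
fi
echo "lines containing a bell: $bell_lines"
```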

Dealing with input

A number of new features have to do with files and input. The new --label option lets you specify a text “label” for standard input. It’s most useful when you’re grepping a lot of files at once, plus standard input, and you’re making use of the labels that grep prefixes its matches with. Normally, standard input would be the only one with a label you couldn’t control: it’s always prefixed with “(standard input)”. Now, it can be prefixed with whatever argument you give the --label option.
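A short sketch of the option in use, with an invented label name: -H forces the file name prefix even for a single input, and “-” names standard input explicitly.

```shell
# Without --label, the prefix here would read "(standard input)".
labeled=$(printf 'linux on stdin\n' | grep -H --label=from-pipe linux -)
printf '%s\n' "$labeled"
# prints: from-pipe:linux on stdin
```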

grep changes quick reference

-Cx prints context lines before and after matches and must have argument x.

--color outputs matches in color (default red).

-Daction specifies an action to take on device files (the default is “read”).

--exclude=filespec excludes files matching filespec.

--include=filespec only searches through files matching filespec.

--label=name makes name the new label for stdin.

--line-buffered turns on line buffering.

-mX stops searching input after finding X matched lines.

-o outputs only matched patterns, not entire lines.

-P uses Perl-style regular expressions.

When searching through multiple files, you can control which files are searched with the --include and --exclude options. For example, to search for “linux” only in files with .txt extensions in the /usr/local/src directory tree, use:

grep -r --include='*.txt' linux /usr/local/src
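The two options can be compared side by side on a throwaway directory tree. In this sketch the file names and contents are invented; note that the file pattern is quoted so the shell passes it to grep untouched instead of expanding it first.

```shell
# Throwaway tree with made-up file names.
tree=$(mktemp -d)
mkdir "$tree/docs"
echo 'linux notes' > "$tree/docs/readme.txt"
echo 'linux build' > "$tree/build.log"

# Only *.txt files are searched; -l lists the names of matching files.
only_txt=$(grep -r -l --include='*.txt' linux "$tree")

# The reverse: search everything except *.txt files.
not_txt=$(grep -r -l --exclude='*.txt' linux "$tree")

printf 'include: %s\nexclude: %s\n' "$only_txt" "$not_txt"
rm -rf "$tree"
```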

When you’re recursively searching directories of files, you’ll get errors when grep comes across a device file. With the new --devices option, you can specify what you want it to do on these files, by giving it an optional action. The default action is “read,” which means to just read the file as any other file. But you can also specify “skip,” which will skip the file entirely. Those are currently the only two methods for handling devices.

To search for “linux” in all files on the system, excluding special device files, use:

grep -r --devices=skip linux /

Finally, the --line-buffered option turns on line buffering, and -m (or --max-count) gives the maximum number of matched lines to show, after which grep will stop searching the given input. For example, this command searches a huge file with line buffering, exiting after at most 10 matched lines occur:

grep --line-buffered -m 10 pattern huge.file
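On a small scale, the effect of -m looks like this. The input is a made-up list of numbers, all of which match, yet grep stops after the third matched line.

```shell
# Five matching lines go in; -m 3 ends the search after three of them.
first_three=$(printf '%s\n' 5 15 25 35 45 | grep --line-buffered -m 3 5)
printf '%s\n' "$first_three"
# prints:
# 5
# 15
# 25
```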

POSIX updates

Some of the other updates were made so that GNU grep conforms to POSIX.2, including subtle changes in exit status.

One of these changes is that the interpretation of character classes is now locale-dependent. That means that ranges specified in bracketed expressions like [A-Z] don’t mean the same thing everywhere. If the system’s current locale environment calls for its own characters or sorting, these settings will override any default character range.
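If a script depends on the traditional meaning of a range, one common approach is to pin the locale for that one command, or to use a POSIX character class that states the intent directly. The sketch below shows both on invented input; it is an illustration of the idea, not a recommendation taken from the grep documentation.

```shell
# In the C locale, [A-Z] covers exactly the 26 ASCII capital letters.
range_hits=$(printf 'abc\nABC\n' | LC_ALL=C grep -c '[A-Z]')

# [[:upper:]] asks for uppercase letters however the locale defines them.
class_hits=$(printf 'abc\nABC\n' | LC_ALL=C grep -c '[[:upper:]]')

echo "range: $range_hits, class: $class_hits"
# prints: range: 1, class: 1
```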

Another related update is a change to the old -C option, which outputs a specified number of lines of context before and after matched lines. In the past, when you used -C without an argument, grep would output two lines of before-and-after context, but now you have to give an argument; if you don’t, grep will report an error and exit. That’s something to look out for if you’ve got any old shell scripts or routines sitting around that call grep.
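With the argument supplied, the option behaves as before. A minimal sketch on made-up input, asking for one line of context on each side of the match:

```shell
# The match is the line "3"; -C 1 adds one line before and one after it.
context=$(printf '%s\n' 1 2 3 4 5 | grep -C 1 3)
printf '%s\n' "$context"
# prints:
# 2
# 3
# 4
```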

Conclusion

GNU grep is a great tool that keeps getting better, as the latest major enhancements show. The bad news? There are still a few bugs, due to the addition of the features in 2.5, but GNU grep is still very workable; according to its makers and maintainers, it remains “the fastest grep in the west.”