August 1, 2005

CLI Magic: Regular expressions and metacharacters

Most of us probably use regular expressions -- patterns that describe sets of strings -- every day without realizing it. Chances are, however, that you aren't using them to their full potential.

Consider how you search a file for a word. You probably type something like:

grep expression *.html

This command searches all of the HTML files in the current directory for lines containing the string "expression." This is the simplest form of regular expression: a search for literal characters, which are the letters, numbers, and spaces that make up the search string. "a", "cat", and "sat on the mat" are all patterns of literal characters.

It's not just grep that uses regular expressions. Most Linux filters (such as sed) use them, as do many programming languages, including Perl, JavaScript, and (dare I say it) VBScript.

While using regular expressions with literal characters is useful, metacharacters give regular expressions more power. A metacharacter is simply a character with a special meaning that adds an extra element of control.

Something to bear in mind is that there are two types of regular expressions -- basic and extended. Extended does what basic does, but with some extra metacharacters -- |, ?, and +. (You can read up on all the differences in The Open Group Base Specifications Issue 6.) To take full advantage of metacharacters, you should use extended regular expressions. To do so with grep, use egrep or grep -E. With GNU sed, use the -r option (-E also works there and with BSD sed). For other commands, check the man pages.
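To see the difference between the two flavors, here is a minimal sketch. The three sample lines are made up for illustration, and the behavior shown assumes a POSIX-style grep such as GNU grep, where "?" is an ordinary character in basic mode.

```shell
# Illustrative sample data: two spellings plus a line with a literal "?".
sample=$(printf '%s\n' 'color' 'colour' 'colou?r')

# Basic syntax: "?" is just a character, so only the literal text "colou?r" matches.
basic=$(printf '%s\n' "$sample" | grep 'colou?r')

# Extended syntax (-E): "?" makes the preceding "u" optional.
extended=$(printf '%s\n' "$sample" | grep -E 'colou?r')

printf 'basic:\n%s\n\nextended:\n%s\n' "$basic" "$extended"
```

The same pattern selects different lines depending on the mode, which is why it pays to decide up front which flavor a command is using.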

Using metacharacters in regular expressions

Suppose you need to search a text file for two similar words -- let's say "Linux" and "Linus." You might run two separate greps, or use the grep -e option (grep -e Linux -e Linus). But a better way might be to use the square brackets metacharacters:

grep "Linu[sx]" myfile.txt

The square brackets allow you to match a single character against a choice of literals, so this regular expression means "match the pattern Linu followed by either s or x." As well as matching a character against a choice of literals you can match against a range of literals:

  • [A-Z] matches a single character against all uppercase letters
  • [a-z] matches a single character against all lowercase letters
  • [0-9] matches a single character against all (single-digit) numbers
  • [a-zA-Z0-9] matches a single character against all upper and lowercase letters, and all single-digit numbers

Patterns don't have to be complete ranges: [Ab-gH-Z] would match the character against the pattern "A or (b to g) or (H to Z)."
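The bracket forms above can be tried out with a short sketch like the following. The word list is invented for the example, and LC_ALL=C is set on the assumption that you want ranges to follow plain ASCII order; in other locales a range such as [a-z] may behave differently.

```shell
# Illustrative word list, one word per line.
words=$(printf '%s\n' 'Linus' 'Linux' 'Linuz' 'linux' 'Linu9')

# A choice of literals: "Linu" followed by either "s" or "x".
either=$(printf '%s\n' "$words" | LC_ALL=C grep 'Linu[sx]')

# A range: "Linu" followed by any single digit.
digit=$(printf '%s\n' "$words" | LC_ALL=C grep 'Linu[0-9]')

# A mixed class: "A", or "b" to "g", or "H" to "Z", followed by "inu".
partial=$(printf '%s\n' "$words" | LC_ALL=C grep '[Ab-gH-Z]inu')

printf 'either:\n%s\n\ndigit:\n%s\n\npartial:\n%s\n' "$either" "$digit" "$partial"
```

Quoting the pattern keeps the shell from treating the square brackets as a filename wildcard before grep ever sees them.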

What if you want to search for a square bracket? Because square brackets are metacharacters, you will get an error along the lines of "Invalid regular expression" or "Unterminated command" if you try something like:

grep "[" myfile.txt

or

sed 's/[//g'

In this case you must use another metacharacter -- the backslash (\) -- which converts a metacharacter into a literal:

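grep "\[" myfile.txt

or

sed 's/\[//g'

The single quotes around the sed expression matter: without them the shell removes the backslash before sed sees it.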

If you want to match against a backslash, just use a double backslash (\\) inside single quotes, as in grep '\\' myfile.txt.
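A minimal sketch of escaping in practice follows. The sample lines are made up, and single quotes are used throughout on the assumption that you want the backslashes to reach grep and sed untouched by the shell.

```shell
# Illustrative input: one line with brackets, one with a backslash, one with neither.
lines=$(printf '%s\n' 'array[0]' 'C:\temp' 'plain text')

# An escaped bracket matches a literal "[".
bracket=$(printf '%s\n' "$lines" | grep '\[')

# A double backslash matches a single literal backslash.
backslash=$(printf '%s\n' "$lines" | grep '\\')

# The same escape works in sed: delete every "[" from the input.
stripped=$(printf '%s\n' "$lines" | sed 's/\[//g')

printf 'bracket: %s\nbackslash: %s\nstripped:\n%s\n' "$bracket" "$backslash" "$stripped"
```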

Flavor or flavour/color or colour

Consider this grep statement:

grep "colour is red" myfile.txt

If myfile.txt were written in the U.S., this statement probably would fail to find the string, because American English spells the word "color." You can create a regular expression that works regardless of spelling by using the question mark (?) metacharacter:

grep -E "colo(u)?r is red" myfile.txt

The pattern to be matched in this case is colo followed by u (if it's there), and then an r.

Parentheses can also contain a group of letters. To understand this better think about searching for a month. January, for instance, could be written as Jan, Jan., or January. The catchall regular expression for this would be:

Jan(\.)?(uary)?

For example:

sed -r 's/Jan(\.)?(uary)?/01/g'
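Here is a small sketch of the month pattern at work. The dates are invented sample data, and -E is used as the extended-syntax switch on the assumption that your sed accepts it (recent GNU sed and BSD sed do; -r is the older GNU spelling).

```shell
# Illustrative input: three ways of writing January, plus a month that should be left alone.
dates=$(printf '%s\n' '3 Jan 2005' '3 Jan. 2005' '3 January 2005' '3 June 2005')

# "Jan", then an optional literal period, then an optional "uary", replaced by "01".
normalized=$(printf '%s\n' "$dates" | sed -E 's/Jan(\.)?(uary)?/01/g')

printf '%s\n' "$normalized"
```

Each of the first three lines comes out as "3 01 2005", while the June line passes through unchanged.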

Notice the backslash in front of the period (\.). This is because the period is a metacharacter that matches any single character -- in other words, a wildcard. This is also a good point to mention word boundaries. Do a search for "Jan" and you will get matches for both Jan and January. Use \b (a notation GNU grep supports) to denote a word boundary:

grep -E "\bJan\b" myfile.txt

The match will now be only with the complete word "Jan".
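The effect of the word boundary can be checked with a quick sketch. The three lines of input are made up, and \b is assumed to be available, as it is in GNU grep.

```shell
# Illustrative input: "Jan" as a whole word, and as the start of two longer words.
text=$(printf '%s\n' 'Jan 3' 'January 3' 'Janet called')

# Without boundaries, every line containing "Jan" is counted.
loose=$(printf '%s\n' "$text" | grep -c 'Jan')

# With \b on both sides, only the standalone word matches.
strict=$(printf '%s\n' "$text" | grep -E '\bJan\b')

printf 'lines matching Jan: %s\nlines matching \\bJan\\b: %s\n' "$loose" "$strict"
```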

Any single character can be matched using a period; for example, the pattern for three-letter words starting with "c" and ending with "t" would be "\bc.t\b". What about longer words -- perhaps the pattern must match all words beginning with "c" and ending with "t" regardless of length? To do that, consider three more metacharacters -- the asterisk (*), plus (+), and caret (^).

The asterisk and plus do similar jobs. Both are repetition metacharacters; the asterisk repeats the preceding character zero or more times, while the plus repeats it 1 or more times. Now you may say, "Oh, I've got it -- the regular expression is going to be '\bc.+t\b' or '\bc.*t\b'". Sadly you'd be wrong. Those would match with both of the following:

cat
can you see why there is more than just cat

The pattern matches a "c" at the start of a word, followed by any number of characters (including spaces), finishing with a "t" at the end of a word. Somehow we must make the pattern more exact to eliminate embedded spaces -- and this is where the caret comes in. Placed first inside square brackets, the caret is a negation metacharacter: [^a] means any character that is not "a". The regular expression should be "\bc[^ ]*t\b" -- match a pattern starting with "c" at the start of a word, followed by any number of characters that are not spaces, and finishing with "t" as the final character in the word.
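To compare the greedy pattern with the negated-class version, here is a sketch that prints only the matched text. It assumes GNU grep, whose -o option shows just the part of each line that matched; the second sample sentence is invented for the example.

```shell
line='can you see why there is more than just cat'

# Greedy version: ".*" happily swallows spaces, so the match runs to the last word-final "t".
greedy=$(printf '%s\n' "$line" | grep -oE '\bc.*t\b')

# Negated class: "[^ ]*" cannot cross a space, so each match stays inside one word.
tight=$(printf '%s\n' "$line" | grep -oE '\bc[^ ]*t\b')

# A second, made-up sentence with several qualifying words.
several=$(printf '%s\n' 'the cat caught a cricket in the court' | grep -oE '\bc[^ ]*t\b')

printf 'greedy: %s\ntight: %s\nseveral:\n%s\n' "$greedy" "$tight" "$several"
```

Note that [^ ] still accepts punctuation, so a stricter class such as [a-z] may suit text with hyphens or apostrophes better.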

In conclusion

Just to recap -- the metacharacters that we've looked at are:

  • Square brackets: []
  • The backslash: \
  • The caret: ^
  • The dot (full stop or period): .
  • The pipe (vertical bar): |
  • The question mark: ?
  • The asterisk: *
  • The plus sign: +
  • Parentheses: ()

This is by no means the full list of metacharacters; we have really just dipped our toes into the subject. I'll leave you with one very useful application -- a sed statement that will strip all of the HTML tags out of a file, leaving you with just plain text:

sed 's/<[^>]*>//g'
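As a quick check of the tag-stripping idea, the sketch below runs a quoted variant of the expression, one that matches "<", then anything that is not ">", then ">", over a made-up line of HTML. It works line by line, so it assumes no tag is split across lines.

```shell
# Illustrative one-line HTML fragment.
html='<p>Regular <b>expressions</b> are <a href="re.html">useful</a>.</p>'

# Delete every "<...>" run; [^>]* keeps each match inside a single tag.
plain=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

printf '%s\n' "$plain"
```

The single quotes are deliberate: in GNU sed a quoted \< or \> means a word boundary rather than a literal angle bracket, so the brackets are left unescaped here.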

Doesn't that show you just how simple and yet how powerful a regular expression can be?
