August 5, 2011

Weekend Project: Intro to Using sed Regular Expressions

One of the keys to using GNU sed successfully is knowing how to use its regular expressions. If you look over sed scripts without knowing regular expressions, the effect can be pretty disconcerting. Don't worry — it's not as confusing as it looks. This weekend, spend some time with GNU sed's regular expressions and put some real power into your text processing.

In the first tutorial on GNU sed we looked at some of the basic syntax, most-used commands, and options. We didn't look at regular expressions in any detail, though, because I wanted to give the topic the time it deserves.

Almost any extensive use of sed is going to require the use of regular expressions to match patterns of text. For instance, you might be looking through a file trying to match and replace (or remove) HTML elements, IP addresses, phone numbers, or variables. Maybe you're trying to use sed to "scrape" an RSS feed for useful information. Whatever you're trying to do, you're going to bump up against regular expressions sooner or later.

One note of caution before we begin: the regular expressions I'm showing here apply to GNU sed in particular. They may or may not carry over exactly to other implementations of sed, so if you're trying this on a BSD, Mac OS X, or something like BusyBox, your mileage may vary. Most of the expressions should work, of course, but it's not 100% guaranteed.

Actually, let's make that two notes of caution. When running a regular expression you're not entirely sure of, do a test run before you make permanent changes to a file. If you're working on files on disk (as opposed to a stream coming from the output of a process), run the expression first with the -n option (quiet) and use the p command to print the lines that match the expression rather than editing the files. If you get the results you expect, then make changes. To get you started, I'll use -n and p to illustrate my examples.
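Here's a rough sketch of that test-first workflow. The scratch file, its contents, and the TODO pattern are all made up for illustration, and the d (delete) command is just one possible "real" change:

```shell
# Build a throwaway sample file (hypothetical content, for illustration only).
sample=$(mktemp)
printf 'TODO: fix this\nkeep me\nTODO: and this\n' > "$sample"

# Dry run: -n suppresses normal output, p prints only the matching lines.
preview=$(sed -n '/TODO/ p' "$sample")
printf '%s\n' "$preview"
# prints:
# TODO: fix this
# TODO: and this

# The preview looks right, so run the real command (d deletes matching lines),
# writing to a new file rather than overwriting the original.
sed '/TODO/ d' "$sample" > "$sample.new"
cleaned=$(cat "$sample.new")
printf '%s\n' "$cleaned"
# prints: keep me

rm -f "$sample" "$sample.new"
```

Writing to a second file is a cautious choice on my part; once you trust the expression you can replace the original however you prefer.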

Start at the Beginning, End at the End

One thing that comes in handy is being able to tell sed that you want to match a string at the beginning or end of a line. If you're a Vim fan, you probably already have a good idea what we're going to be using here.

To match the beginning of a line, use ^. To match the end of a line, use $. Here's an example:

sed -n '/^[a-z]/ p' filename

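This tells sed to print any line that begins with a lowercase letter. Say you have a file that looks like this:

Sed
sed
1sed

Only the second line (sed) would be printed, since it's the only one that begins with a lowercase letter.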

If you're new to regular expressions and/or sed, this probably looks more confusing than it is, because the expression follows '/ and neither of those characters is part of the expression at all. The ' is there to begin the statement that will be passed to sed without being interpreted by the shell. The / tells sed to search for a pattern.

The expression we're using (^[a-z]) says "at the beginning of the line, match any single character that is a lower case letter."
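To try this without creating a file, you can pipe the three sample lines straight into sed. This is only a sketch; the LC_ALL=C prefix is my own addition, there to keep the a-z range strictly ASCII in case your locale sorts letters differently:

```shell
# Feed the three sample lines to sed on standard input.
# LC_ALL=C (an assumption, not part of the original command) pins the
# meaning of the a-z range to plain ASCII lowercase letters.
begins_lower=$(printf 'Sed\nsed\n1sed\n' | LC_ALL=C sed -n '/^[a-z]/ p')
printf '%s\n' "$begins_lower"
# prints: sed
```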

If you want to match something at the end of the line, you'll want to use $. Note that it has to be at the end of your expression to have that meaning. GNU sed will also let you use it at the end of a "subexpression" (that is, just before \) or \|), but that's not necessarily going to work on other implementations of sed.
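A quick sketch of $ in action, using a few made-up sample lines:

```shell
# Print only the lines that end in "ed".
ends_with_ed=$(printf 'sed\nsedan\nused\n' | sed -n '/ed$/ p')
printf '%s\n' "$ends_with_ed"
# prints:
# sed
# used
```

The line "sedan" contains "ed" but doesn't end with it, so it's skipped.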

One, None, Many

Let's look at matching one or more characters, or none. If you want to match a single instance of a character or string, you can use it literally. For example, "example" will match (you guessed it) example, but it won't match Example. Using sed -n '/example/ p' would print any line in a file or in output that has that string in it. It would also match examples or any longer string containing that literal set of characters.

What if you want to match zero or more instances of a character? Use the * expression, like so:

sed -n '/I*sed/ p' filename

This would match "sed" or "Ised" but not "Ibed". Because * allows zero instances, the "I" is optional: any line containing "sed" matches, with or without one or more "I" characters in front of it.
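Here's the same expression run against a handful of made-up test lines, as a sketch you can paste into a shell:

```shell
# "I*" means zero or more capital I characters, followed by "sed".
star_matches=$(printf 'sed\nIsed\nIIsed\nIbed\n' | sed -n '/I*sed/ p')
printf '%s\n' "$star_matches"
# prints:
# sed
# Ised
# IIsed
```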

If you want to match any character, use the . — the humble dot, like so:

sed -n '/s.d/ p' filename

This would match "sad" or "sed" but not "Sad" or "Sed".
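To see the dot at work, here's a small sketch with sample lines of my own choosing:

```shell
# The dot stands for exactly one character, whatever it is.
dot_matches=$(printf 'sad\nsed\nSad\nsd\n' | sed -n '/s.d/ p')
printf '%s\n' "$dot_matches"
# prints:
# sad
# sed
```

Note that "sd" isn't printed: the dot needs one character to stand in for, and there's nothing between the "s" and the "d".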

Want to match multiple instances of any character? You combine the two as .* to do it. For example, .*ed would match zero or more characters followed by "ed."
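On its own, .* is most useful when it sits between two things you care about. As a sketch (the anchors and sample lines are my additions, not part of the text above), this prints lines that start with "s" and end with "ed", with anything or nothing in between:

```shell
# ^s    : line starts with "s"
# .*    : zero or more of any character
# ed$   : line ends with "ed"
dotstar_matches=$(printf 'sed\nstopped\nused\nseda\n' | sed -n '/^s.*ed$/ p')
printf '%s\n' "$dotstar_matches"
# prints:
# sed
# stopped
```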

You can also match specific numbers of instances using braces (escaped with backslashes). You have to give at least one number, and you can provide a range or an exact number. Here's how that works:

  • If you write /a\{4\}/ you'll match exactly four instances of the character "a" — no more, no less.
  • If you use /a\{2,4\}/ you'll match at least two, no more than four instances.
  • If you use /a\{2,\}/ you're saying "match at least two and keep going."
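The three forms are easier to tell apart when you run them side by side. In this sketch I've wrapped each one in ^ and $ (my addition) so the count applies to the whole line; without the anchors, a line of five "a" characters would still be printed by the first expression, because it contains a run of four:

```shell
# Sample input: lines of one to five "a" characters.
runs=$(printf 'a\naa\naaa\naaaa\naaaaa')

# Exactly four.
exactly_four=$(printf '%s\n' "$runs" | sed -n '/^a\{4\}$/ p')
printf '%s\n' "$exactly_four"
# prints: aaaa

# At least two, no more than four.
two_to_four=$(printf '%s\n' "$runs" | sed -n '/^a\{2,4\}$/ p')
printf '%s\n' "$two_to_four"
# prints:
# aa
# aaa
# aaaa

# At least two, no upper limit.
two_or_more=$(printf '%s\n' "$runs" | sed -n '/^a\{2,\}$/ p')
printf '%s\n' "$two_or_more"
# prints:
# aa
# aaa
# aaaa
# aaaaa
```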

Simple, right? Yeah, it looks really messy, in large part because of all the backslashes and delimiters that are needed.

Lists

Let's talk about matching a list. That's when you want to match any character in a list. Let's say you want to match any character from "A" to "M" but nothing after that. (Maybe you want to find all names in a file and sort them by first or last name.) Here's how you'd do that:

sed -n '/^[A-M]/ p' filename

That's sed-speak for "match A through M at the beginning of the line." (Remember the ^ tells sed to look for the beginning of the line.)

Here you'd match any names (or other words) at the beginning of the line that begin with those uppercase letters, but nothing that begins with a non-alpha character or lower-cased letter.

For good measure, throw in sort at the end of the command with a pipe to get the names in alphabetical order.
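Put together, the pipeline might look like the sketch below. The names are invented, and LC_ALL=C is my own addition so that the A-M range and the sort order both follow plain ASCII rather than whatever your locale dictates:

```shell
# Keep lines starting with A through M, then sort them alphabetically.
sorted_names=$(printf 'Zoe\nalice\nMaria\nBob\n9lives\n' \
  | LC_ALL=C sed -n '/^[A-M]/ p' \
  | LC_ALL=C sort)
printf '%s\n' "$sorted_names"
# prints:
# Bob
# Maria
```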

What if you just want to match a few characters, but not a range? Then you could use something like this:

sed -n '/^[AEIOU]/ p' filename

That way you could find all the names that start with a vowel.
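A short sketch with some made-up names shows what gets through:

```shell
# Only lines whose first character is one of A, E, I, O, or U are printed.
vowel_names=$(printf 'Alice\nBob\nEve\nOscar\numa\n' | sed -n '/^[AEIOU]/ p')
printf '%s\n' "$vowel_names"
# prints:
# Alice
# Eve
# Oscar
```

The lowercase "uma" is left out, because the list only contains uppercase vowels.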

If you want to match all the names that don't start with a vowel, you can use the ^ character again, this time as the first character inside the brackets, to tell sed to match anything that's not in the list, like so:

sed -n '/^[^AEIOU]/ p' filename

Note that this expression will not only match lines that start with uppercase letters that aren't in the list, but also lines that start with lowercase letters, numbers, or any other character.
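Here's a sketch of the negated list against a few made-up lines, including an empty one:

```shell
# The first ^ anchors to the start of the line; the ^ inside the brackets
# negates the list, so the first character can be anything but A, E, I, O, U.
non_vowel=$(printf 'Alice\nBob\n\neve\n9lives\n' | sed -n '/^[^AEIOU]/ p')
printf '%s\n' "$non_vowel"
# prints:
# Bob
# eve
# 9lives
```

The empty line isn't printed: a negated list still has to match one character, and an empty line has none to offer.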

Matching an HTML Tag

Let's look at one last example. What if I want to match an HTML tag that starts at the beginning of a line, but isn't a <p> tag? I'd use this:

sed -n '/^<[^p][^>]*>/ p' file.html

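That will match tags at the beginning of the line, unless they are a "p" tag. Be aware that it's a rough filter: it also skips any other tag whose name starts with a lowercase "p" (such as <pre>), and it still matches closing tags like </p>.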
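To see what the expression does and doesn't catch, here's a sketch that runs it over a few invented lines of HTML instead of a real file:

```shell
# <        : a literal opening angle bracket at the start of the line
# [^p]     : one character that is not a lowercase p
# [^>]*>   : anything up to and including the closing angle bracket
tag_matches=$(printf '<p>para</p>\n<div>x</div>\n<pre>code</pre>\n<h1>T</h1>\ntext <b>b</b>\n' \
  | sed -n '/^<[^p][^>]*>/ p')
printf '%s\n' "$tag_matches"
# prints:
# <div>x</div>
# <h1>T</h1>
```

The last line has a tag in it, but not at the beginning, so the ^ anchor rules it out.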

Practice, Practice, Practice

There's really only one way to become adept at using regular expressions, and that's practice. Take some time this weekend to work with sed's regular expressions and see if you can't build some muscle memory, so the next time you need to match just about any collection of characters in a file or stream of text, you'll know how.

Regular expressions can get pretty hairy, but as unpleasant as they may seem at the beginning, they're well worth learning. GNU sed masters will notice I haven't covered every possible regular expression here. That's because I want to provide a gentle introduction, rather than trying to boil the ocean in one tutorial. If you're already familiar with using regular expressions and/or sed, this might seem pretty basic. But I remember my first introduction to regular expressions, and wish that other tutorials (that is, the ones I learned from) had been a bit less of the kitchen sink variety.

Next week, we'll wrap up the sed series with a look at some more advanced operations with sed that combine what we've already learned about sed's basic commands and regular expressions, and some additional features we haven't covered yet.
