Linux.com

CLI Magic: Regular expressions and metacharacters

By on August 01, 2005 (8:00:00 AM)

Most of us probably use regular expressions -- patterns that describe a set of character strings -- every day without realizing it. Chances are, however, that you aren't really using them to their full potential.

Consider how you search a file for a word. You probably type something like:

grep expression *.html

This command searches every file in the current directory whose name ends in .html for the word "expression." The *.html part is a shell filename pattern, not a regular expression; the regular expression here is just the word expression. This is the simplest form of regular expression: a search for literal characters, which are the letters, numbers, and spaces that make up the search strings. "a", "cat", and "sat on the mat" are all patterns of literal characters.

It's not just grep that uses regular expressions. Most Linux filters (such as sed) use regular expressions, as do many programming languages, including Perl, JavaScript, and (dare I say it) VBScript.

While using regular expressions with literal characters is useful, metacharacters give regular expressions more power. A metacharacter is simply a character with a special meaning that adds an extra element of control.

Something to bear in mind is that there are two types of regular expressions -- basic and extended. Extended does what basic does, but with some extra metacharacters -- |, ?, and +. (You can read up on all the differences in The Open Group Base Specifications Issue 6.) To take full advantage of metacharacters, you should use extended regular expressions. To do so with grep, use egrep or grep -E. With sed, use the -r option. For other commands, check the man pages.
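To see the difference between the two types, here is a minimal sketch that runs the same patterns through grep with and without -E. The sample words are made up for illustration, and the counts assume a grep that follows the usual basic/extended split (such as GNU grep).

```shell
# Hypothetical sample text, one word per line.
sample='color
colour
colouur
cat
dog'

# In a basic regular expression, + and | are ordinary characters,
# so these patterns look for a literal "+" and a literal "|".
# (grep exits non-zero when nothing matches, hence "|| true".)
basic_plus=$(printf '%s\n' "$sample" | grep -c 'colou+r') || true
basic_pipe=$(printf '%s\n' "$sample" | grep -c 'cat|dog') || true

# With -E, + means "one or more" and | means "or".
ext_plus=$(printf '%s\n' "$sample" | grep -cE 'colou+r')
ext_pipe=$(printf '%s\n' "$sample" | grep -cE 'cat|dog')

echo "basic:    + matched $basic_plus lines, | matched $basic_pipe lines"
echo "extended: + matched $ext_plus lines, | matched $ext_pipe lines"
```

The same patterns that find nothing as basic expressions find two lines each once -E is added.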

Using metacharacters in regular expressions

Suppose you need to search a text file for two similar words -- let's say "Linux" and "Linus." You might run two separate greps, or use the grep -e option (grep -e Linux -e Linus). But a better way might be to use the square brackets metacharacters:

grep Linu[sx] myfile.txt

The square brackets allow you to match a single character against a choice of literals, so this regular expression means "match the pattern Linu followed by either s or x." As well as matching a character against a choice of literals you can match against a range of literals:

  • [A-Z] matches a single character against all uppercase letters
  • [a-z] matches a single character against all lowercase letters
  • [0-9] matches a single character against all (single-digit) numbers
  • [a-zA-Z0-9] matches a single character against all upper and lowercase letters, and all single-digit numbers

Patterns don't have to be complete ranges: [Ab-gH-Z] would match the character against the pattern "A or (b to g) or (H to Z)."
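The bracket forms above can be tried out on a few lines of text. This is only a sketch: the word list is invented, and LC_ALL=C is set as an assumption so that ranges such as [s-x] follow plain ASCII order rather than the rules of whatever locale happens to be active.

```shell
# Ranges can behave differently in some locales; C keeps them ASCII.
export LC_ALL=C

# Hypothetical input, one entry per line.
words='Linux
Linus
Linuz
linux
Unix 7'

# Quote the pattern so the shell does not try to expand the
# brackets as a filename glob before grep sees them.
either=$(printf '%s\n' "$words" | grep 'Linu[sx]')

# A range: any line containing a single digit.
digits=$(printf '%s\n' "$words" | grep '[0-9]')

# A choice followed by a partial range: L or l, "inu", then s to x.
mixed=$(printf '%s\n' "$words" | grep -c '[Ll]inu[s-x]')

echo "$either"
echo "$digits"
echo "$mixed"
```

Here "Linuz" is never matched, because z falls outside both the choice [sx] and the range [s-x].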

What if you want to search for a square bracket? Because square brackets are metacharacters, you will get an "Invalid regular expression" or "Unterminated command" error if you try something like:

grep "[" myfile.txt
or
sed s/[//g

In this case you must use another metacharacter -- the backslash (\) -- which converts a metacharacter into a literal:

grep "\[" myfile.txt
or
sed 's/\[//g'

If you want to match against a backslash, just use a double backslash (\\).
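As a small sketch of escaping in practice, the commands below work on a made-up line that contains both a square bracket and a backslash. The patterns are wrapped in single quotes on the assumption that you are typing them into a shell, which would otherwise remove the backslash before grep or sed ever saw it.

```shell
# Hypothetical input line; single quotes keep the backslash intact.
line='array[0] lives in C:\temp'

# \[ matches a literal opening bracket.
has_bracket=$(printf '%s\n' "$line" | grep -c '\[')

# The same escape in sed: delete every opening bracket.
no_bracket=$(printf '%s\n' "$line" | sed 's/\[//g')

# A doubled backslash matches one literal backslash.
has_backslash=$(printf '%s\n' "$line" | grep -c '\\')

echo "$has_bracket"
echo "$no_bracket"
echo "$has_backslash"
```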

Flavor or flavour/color or colour

Consider this grep statement:

grep "colour is red" myfile.txt

If myfile.txt were written in the U.S., this statement probably would fail to find the string, because American English spells the word "color." You can create a regular expression that works regardless of spelling by using the question mark (?) metacharacter:

grep -E "colo(u)?r is red" myfile.txt

The pattern to be matched in this case is colo followed by u (if it's there), and then an r.
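A quick way to check this is to feed grep both spellings plus a misspelling. The three lines below are invented for the purpose. Note that when the optional part is a single character, the parentheses can be left out, so colou?r behaves the same as colo(u)?r.

```shell
# Hypothetical input: British spelling, American spelling, and a typo.
text='the colour is red
the color is red
the colr is red'

# ? makes the preceding item optional (zero or one occurrence).
matches=$(printf '%s\n' "$text" | grep -E 'colou?r is red')

# The grouped form from the article selects the same lines.
grouped=$(printf '%s\n' "$text" | grep -cE 'colo(u)?r is red')

echo "$matches"
echo "$grouped"
```

Both spellings are found, while "colr" is rejected because the o before the optional u is still required.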

Parentheses can also contain a group of letters. To understand this better think about searching for a month. January, for instance, could be written as Jan, Jan., or January. The catchall regular expression for this would be:

Jan(\.)?(uary)?

For example:

sed -r 's/Jan(\.)?(uary)?/01/g'

Notice the backslash in front of the period (\.). This is because the period is a metacharacter, and matches any single character -- in other words, a wildcard. This is also a good point to mention word boundaries. Do a search for "Jan" and you will get matches for both Jan and January. Use \b to denote a word boundary:

grep -E "\bJan\b" myfile.txt

The match will now be only with the complete word "Jan".
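The sketch below tries both ideas on a few made-up dates. It assumes GNU grep and GNU sed: \b is a GNU extension rather than part of the POSIX syntax, and -E is used for sed because GNU sed accepts it as a synonym for -r.

```shell
# Hypothetical input; "Janet" is there to show an unwanted match.
dates='Jan 5
Jan. 5
January 5
Janet 5'

# The catchall month pattern also rewrites the start of "Janet".
normalised=$(printf '%s\n' "$dates" | sed -E 's/Jan(\.)?(uary)?/01/g')

# With word boundaries, only lines where "Jan" stands alone are kept.
whole=$(printf '%s\n' "$dates" | grep -E '\bJan\b')

echo "$normalised"
echo "$whole"
```

The first command turns "Janet 5" into "01et 5", which is exactly the kind of accident that word boundaries are meant to prevent.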

Any single character can be matched using a period; for example, the pattern for three-letter words starting with "c" and ending with "t" would be "\bc.t\b". What about longer words -- perhaps the pattern must match all words beginning with "c" and ending with "t" regardless of length? To do so, consider three more metacharacters -- the asterisk (*), plus (+), and caret (^).

The asterisk and plus do similar jobs. Both are repetition metacharacters; the asterisk repeats the preceding character zero or more times, while the plus repeats it 1 or more times. Now you may say, "Oh, I've got it -- the regular expression is going to be '\bc.+t\b' or '\bc.*t\b'". Sadly you'd be wrong. Those would match with both of the following:

cat
can you see why there is more than just cat

That pattern matches a "c" at the start of a word, followed by any number of characters (including spaces), finishing with a "t" at the end of a word. Somehow we must make the pattern more exact to eliminate embedded spaces -- and this is where the caret comes in. Inside square brackets, the caret is a negation metacharacter: [^a] means any character that is not "a". The regular expression should be "\bc[^ ]*t\b" -- match a pattern starting with "c" at the start of a word, followed by any number of characters that are not spaces, and finishing with "t" as the final character in the word.
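To watch the two patterns behave differently, the sketch below uses grep's -o option, which prints only the text that matched instead of the whole line. Both -o and \b are assumed to be available, as they are in GNU grep.

```shell
# The second sample line from the article.
line='can you see why there is more than just cat'

# .* is greedy and happily crosses spaces, so the match runs
# from the first "c" to the last word-final "t".
greedy=$(printf '%s\n' "$line" | grep -oE '\bc.*t\b')

# [^ ]* cannot cross a space, so the match stays inside one word.
tight=$(printf '%s\n' "$line" | grep -oE '\bc[^ ]*t\b')

echo "greedy: $greedy"
echo "tight:  $tight"
```

The greedy pattern reports the entire line as one match, while the negated bracket pattern reports only "cat".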

In conclusion

Just to recap -- the metacharacters that we've looked at are:

  • Square brackets: []
  • The backslash: \
  • The caret: ^
  • The dot (full stop or period): .
  • The pipe (vertical bar): |
  • The question mark: ?
  • The asterisk: *
  • The plus sign: +
  • Parentheses: ()

This is by no means the full list of metacharacters; we have really just dipped our toes into the subject. I'll leave you with one very useful application -- a sed statement that will strip all of the HTML tags out of a file, leaving you with just plain text:

sed s/\<[^\<]*\>//g

Doesn't that show you just how simple and yet how powerful a regular expression can be?
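Here is a minimal sketch of a tag stripper applied to a single made-up line of HTML. It is a variant of the command above rather than a copy: the expression is wrapped in single quotes so the shell leaves < and > alone, and the bracket expression excludes > instead of <. It assumes every tag starts and ends on the same line, because sed processes its input one line at a time.

```shell
# Hypothetical one-line HTML fragment.
html='<p>Regular <b>expressions</b> are <i>powerful</i>.</p>'

# Match "<", then anything that is not ">", then ">", and delete it.
plain=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

echo "$plain"
```

Inside single quotes the angle brackets should not be preceded by backslashes: GNU sed reads \< and \> as start-of-word and end-of-word anchors, which would change the meaning of the pattern.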

Comments

on CLI Magic: Regular expressions and metacharacters

Note: Comments are owned by the poster. We are not responsible for their content.

Little Mistake.

Posted by: Anonymous Coward on August 01, 2005 06:50 PM
grep expression *.html

This command searches all files in the current directory for the word "expression."

No, it searches all files in the current directory ending with ".html".

#

Re:Little Mistake.

Posted by: Anonymous Coward on August 01, 2005 09:02 PM
And another -- the HTML tag stripper should be:

sed s/\<[^\>]*\>//g

(the middle < should be a >)

#

Re:Little Mistake.

Posted by: Anonymous Coward on August 02, 2005 02:00 AM
The stripper in the article does work -- I've just tried it.

#

Re:Little Mistake.

Posted by: Anonymous Coward on August 04, 2005 05:43 AM
Both versions are "right" but both versions are "wrong". They work fine for normal HTML, but aren't robust when you get malformed HTML such as:

<a <b <c d>
or
<a >b >c d<

Although I haven't tested either, I think that both versions will leave some of the < or > characters lying around. A sufficiently skilled attacker might actually be able to slip some HTML through the HTML stripper code, by crafting sufficiently malformed HTML.

If you want to be absolutely sure that you've removed all the tags, after you run the regexp above, strip out all of the remaining > and < characters just to be safe. There shouldn't be any raw < or > characters that aren't part of a tag in HTML anyway -- they should have been converted to &gt; or &lt;.

-drane

#

Re:Little Mistake.

Posted by: Administrator on August 04, 2005 08:48 PM
What are the back slashes for in this example? It looks like they're supposed to escape the less-than character but the less-than character is not a special character, is it? (I tried asking this question using the actual characters in another post but it got so mangled by the posting system that it was unintelligible.)

#

Regex Tutorial

Posted by: Anonymous Coward on August 01, 2005 07:44 PM
For a great regex tutorial, check out http://www.regular-expressions.info/

#

Regular Expressions editor

Posted by: Anonymous Coward on August 01, 2005 08:54 PM
If you are a java freak like I am you will love this Open Source Editor tailored for Java

http://jregexptester.sourceforge.net/

#

first example is misleading

Posted by: Anonymous Coward on August 04, 2005 05:49 AM
This example is misleading:

> grep expression *.html ...
> This is the simplest form of regular expression

When I first read this, I thought the author was calling '*.html' a regular expression. Instead, he meant that "expression" is a regular expression -- one with no metacharacters.

If I remember correctly, '*.html' is called a glob. (Read the bash man page.)

If ' *.html' were treated as a regular expression, you'd be searching for files with zero or more spaces ( *) followed by any character (.) followed by html (html). Obviously this is absurd for this example.

The author might be more careful about his examples. Using:

    grep expression myfile.txt
would have been much clearer.

-drane

#

Re:first example is misleading

Posted by: Administrator on August 06, 2005 11:41 PM
The author was showing an example of searching for the word "expression" in all files ending in '.html'. I agree that searching in a single file would be better as a first example.

#

Re:Are the backslashes necessary?

Posted by: Anonymous Coward on August 04, 2005 05:56 AM
Unless your post got damaged by text conversion by the posting system, you misquoted the author, who wrote:

sed s/\[///g

Actually, what the author gave won't work properly from the command line either. The shell (bash?) will eat the backslash before sed sees it.

The author probably meant:

sed 's/\[//g'

(The single quotes tell bash not to mess with the contents of the single quotes.)

In this case, the backslash is telling sed not to treat [ as a special character. As the author wrote, if you don't put that in, sed will think that [ means the start of a character range, and will die when it can't find the matching ].

-drane

#

Strip HTML fails

Posted by: Anonymous Coward on August 09, 2005 01:37 AM
The "strip HTML" command is unnecessarily cluttered with backslashes. He suggests:


        sed s/\<[^\<]*\>//g

He put the backslashes (\) in there to prevent the shell from interpreting the redirection operators, < and >. It would have been better to wrap the sed command in 'single quotes' to prevent the shell from using them, like this:


      sed 's/<[^<]*>//g'

Much easier to read. However, the [character-set] is also a problem, because when it meets a pattern like this:


      <b>here</b>

the first two elements of the search pattern ("<[^<]*") will match the string "<b>here", and then the third element (">") will kick in. Not finding a terminating >, the engine will have to backtrack to reach "<b>". To optimize the script, it is better to use the other angle bracket:


      sed 's/<[^>]*>//g'

Finally, this script doesn't work at all in cases where the HTML tags cross line boundaries:


      <a class="first"

      href="http://gnu.org">

      like this one here

      </a>

or in those perverse cases where an HTML tag legally contains embedded angle brackets:


      <img src="x.jpg" alt="<<like this>>" >

For more help on using sed, visit http://sed.sourceforge.net/

—Eric Pement

#

Good Choice of Subject

Posted by: Administrator on August 01, 2005 08:09 PM
Very nice article. Though I've been using regular expressions for more than 20 years, I still learned a few things.

(You're probably going to get a bunch of "Editors" expressing their disbelief that you could possibly include some typos or erroneous references in your article. Please ignore them. Just glad you had the courage to write an article on such a vast subject -- and for such a tough crowd. ;-)

#

Are the backslashes necessary?

Posted by: Administrator on August 04, 2005 12:07 AM
In the example

sed s/\//g

What are the '\' characters for?
Presumably they're to escape the '' characters.
But '' are not a special characters. Are they?
xyz

#
