Patterns and string processing in shell scripts


Author: Peter Seebach

Shell programming is heavily dependent on string processing. The term string is used generically to refer to any sequence of characters; typical examples of strings might be a line of input or a single argument to a command. Users enter responses to prompts, file names are generated, and commands produce output. Recurring throughout this is the need to determine whether a given string conforms to a given pattern; this process is called pattern matching. The shell has a fair amount of built-in pattern matching functionality.

This article is excerpted from the newly published book Beginning Portable Shell Scripting.

Furthermore, many common Unix utilities, such as grep and sed, provide features for pattern matching. These programs usually use a more powerful kind of pattern matching, called regular expressions. Regular expressions, while different from shell patterns, are crucial to most effective shell scripting. While there is no portable regular expression support built into the shell itself, shell programs rely heavily on external utilities, many of which use regular expressions.

Shell patterns

Shell patterns are used in a number of contexts. The most common usage is in the case statement. Given two shell variables string and pattern, the following code determines whether string matches pattern:

case $string in
  $pattern) echo "Match" ;;
  *) echo "No match";;
esac

If $string matches $pattern, the shell echoes “Match” and leaves the case statement. Otherwise, it checks to see whether $string matches *. Since * matches anything in a shell pattern, the shell prints “No match” when there was not a match against $pattern. (The case statement executes only one branch, even if more than one pattern matches.)

For exploring pattern matching, you might find it useful to create a shell script based on this. The following self-contained script performs matching tests of a number of words against a pattern:

#!/bin/sh
pattern="$1"
shift
echo "Matching against '$pattern':"
for string
do
  case $string in
  $pattern) echo "$string: Match." ;;
  *) echo "$string: No match." ;;
  esac
done

Save this script to a file named pattern, make it executable (chmod a+x pattern), and you can use it to perform your own tests:

$ ./pattern '*' 'hello'
Matching against '*':
hello: Match.
$ ./pattern 'hello*' 'hello' 'hello, there' 'well, hello'
Matching against 'hello*':
hello: Match.
hello, there: Match.
well, hello: No match.

Remember to use single quotes around the arguments. An unquoted word containing pattern characters such as the asterisk (*) is subject to globbing (sometimes called file name expansion), where the shell replaces such words with any files with names matching the pattern. This can produce misleading results for tests like this.

Pattern-matching basics

In a pattern, most characters match themselves, and only themselves. The word hello is a perfectly valid pattern; it matches the word hello, and nothing else. A pattern that matches only part of a string is not considered to have matched that string. The word hello does not match the text hello, world. For a pattern to match a string, two things must be true:

    Every character in the pattern must match the string.
    Every character in the string must match the pattern.
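These two rules can be checked directly with a case statement. The following is a minimal sketch; the function name matches is a hypothetical helper chosen for this illustration, not something the shell provides.

```shell
#!/bin/sh
# matches STRING PATTERN: succeed only if the whole STRING matches PATTERN.
# (matches is a hypothetical helper name used for illustration.)
matches() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# A pattern with no wildcards behaves like a plain string comparison.
for string in 'hello' 'hello, world' 'hell'
do
    if matches "$string" 'hello'
    then echo "$string: Match."
    else echo "$string: No match."
    fi
done
```

Only the first string matches: hello, world leaves part of the string unmatched, and hell leaves part of the pattern unmatched.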

Now, if this were all there were to patterns, a pattern would be another way of describing string comparison, and the rest of this chapter would consist of filler text like “a … consists of sequences of nonblank characters separated by blanks,” or possibly some wonderful cookie recipes. Sadly, this is not so. Instead, there are some characters in a pattern that have special meaning and can match something other than themselves. Characters that have special meaning in a pattern are called wildcards or metacharacters. Some users prefer to restrict the term wildcard to refer only to the special characters that can match anything. In talking about patterns, I prefer to call them all wildcards to avoid confusion with characters that have special meaning to the shell. Wildcards make those two simple rules much more complicated; a single character in a pattern could match a very long string, or a group of characters in the pattern might match only one character or even none at all. What matters is that there are no mismatches and nothing left over of the string after the match.

The most common wildcards are the question mark (?), which matches any single character, and the asterisk (*), which matches anything at all, even an empty string.

The ? is easy to use in patterns; you use it when you know there will be exactly one character, but you are not sure exactly what it will be. For instance, if you are not sure what accent the user will greet you in, you might use the pattern h?llo, in case your user prefers to write hallo or hullo. This leaves you with two problems. The first is that users are typically verbose, and write things like hello, there, or hello little computer, or possibly even hello how do i send email. If you just want to verify that you are getting something that sounds a bit like a greeting, you need a way to say “this, or this plus any other stuff on the end.”

That is what * is for. Because * matches anything, the pattern hello* matches anything starting with hello, or even just hello with nothing after it. However, that pattern doesn’t match the string well, hello because there is nothing in the pattern that can match characters before the word hello. A common idiom when you want to match a word if it is present at all is to use asterisks on both sides of a pattern: *hello* matches a broad range of greetings.
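The three greeting patterns discussed so far can be compared side by side. This sketch assumes a hypothetical function named which_pattern that reports the first pattern a string matches; because a case statement runs only one branch, the order of the patterns matters.

```shell
#!/bin/sh
# which_pattern STRING: print the first greeting pattern that STRING matches.
# (which_pattern is a hypothetical helper name used for illustration.)
which_pattern() {
    case $1 in
        h?llo)   echo 'h?llo' ;;    # exactly one character between h and llo
        hello*)  echo 'hello*' ;;   # hello, then anything (or nothing)
        *hello*) echo '*hello*' ;;  # hello anywhere in the string
        *)       echo 'none' ;;
    esac
}

which_pattern 'hullo'          # prints h?llo
which_pattern 'hello, there'   # prints hello*
which_pattern 'well, hello'    # prints *hello*
which_pattern 'goodbye'        # prints none
```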

If you want to match something, but you are not sure what it is or how long it will be, you can combine these. The pattern hello ?* matches hello world but does not match hello alone. However, this pattern introduces a new problem. The space character is not special in a pattern, but it is special in the shell. This leads to a bit of a dilemma. If you do not quote the pattern, the shell splits it into multiple words, and it does not match what you expected. If you do quote it, the shell ignores the wildcards. There are two solutions available; the first is to quote spaces, the second is to unquote wildcards. So, you could write hello" "?*, or you could write "hello "?*.
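The effect of each quoting choice can be seen by writing the pattern literally in a case statement. This is a sketch only; the three function names are hypothetical and exist just to label the variants.

```shell
#!/bin/sh
# Each function tests its argument against "hello ?*" with different quoting.
# (All three function names are hypothetical, chosen for this illustration.)

quoted_space() {          # only the space is quoted
    case $1 in
        hello" "?*) echo "match" ;;
        *)          echo "no match" ;;
    esac
}

quoted_prefix() {         # everything except the wildcards is quoted
    case $1 in
        "hello "?*) echo "match" ;;
        *)          echo "no match" ;;
    esac
}

fully_quoted() {          # wildcards are quoted too, so they are literal
    case $1 in
        "hello ?*") echo "match" ;;
        *)          echo "no match" ;;
    esac
}

quoted_space  'hello world'   # match
quoted_prefix 'hello world'   # match
quoted_space  'hello'         # no match: nothing follows hello
fully_quoted  'hello world'   # no match: ? and * are taken literally
fully_quoted  'hello ?*'      # match
```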

In the contexts where the shell performs pattern matching (such as case statements), you do not need to worry about spaces resulting from variable substitution; the shell doesn't perform splitting on variable substitutions in those contexts. (A disclaimer is in order: zsh's behavior differs here, unless it is running in sh emulation mode.)
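As a brief illustration, the pattern below is stored in a variable and contains a space, yet no extra quoting is needed inside the case statement. The variable name result is an assumption of this sketch, and the behavior shown is that of sh-compatible shells; as noted above, zsh in its native mode may differ.

```shell
#!/bin/sh
# The space inside $pattern does not split the pattern into two words,
# and the wildcards in it are still treated as wildcards.
pattern='hello ?*'
string='hello world'

case $string in
    $pattern) result="Match" ;;
    *)        result="No match" ;;
esac
echo "$result"
```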

Character classes

The h?llo pattern has another flaw, which is that it is too permissive. While your friends who type with a thick accent will doubtless appreciate your consideration, you might reasonably draw the line at hzllo, h!llo, or hXllo. The shell provides a mechanism for more restrictive matches, called a character class. A character class matches any one of a set of characters, but nothing else; it is like ?, only more restrictive. A character class is surrounded in square brackets ([]), and looks like [characters]. The greeting described previously could be written using a character class as h[aeu]llo. A character class matches exactly one of the characters in it; it never matches more than one character.
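A short sketch of the stricter greeting test follows. The function name check_greeting is hypothetical; the word list simply reuses the examples from the paragraph above plus one word with two vowels.

```shell
#!/bin/sh
# check_greeting WORD: report whether WORD matches h[aeu]llo.
# (check_greeting is a hypothetical helper name used for illustration.)
check_greeting() {
    case $1 in
        h[aeu]llo) echo "$1: Match." ;;
        *)         echo "$1: No match." ;;
    esac
}

# haello fails because the class matches exactly one character.
for word in hello hallo hullo hzllo 'h!llo' hXllo haello
do
    check_greeting "$word"
done
```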

Character classes may specify ranges of characters. A typical usage would be to match any digit, with [0-9]. In a range, two characters separated by a hyphen are treated as every character between them in the character set; mostly, this is used for letters and numbers. Patterns are case sensitive; if you want to match all standard ASCII letters, use [a-zA-Z]. The behavior of a range where the second character comes before the first in the character set is not predictable; do not do that.
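Ranges lend themselves to simple character classification. The sketch below assumes an ASCII-compatible environment and uses a hypothetical function named classify.

```shell
#!/bin/sh
# classify ARG: describe a single character using range-based classes.
# (classify is a hypothetical helper; ASCII-like ranges are assumed.)
classify() {
    case $1 in
        [0-9])    echo "digit" ;;
        [a-zA-Z]) echo "letter" ;;
        ?)        echo "other" ;;
        *)        echo "not a single character" ;;
    esac
}

classify 7     # digit
classify q     # letter
classify Q     # letter
classify _     # other
classify 42    # not a single character
```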

Sometimes, rather than knowing what you do want, you know what you don't want; you can invert a character class by using an exclamation mark (!) as its first character. The character class [!0-9] matches any character that is not a digit. When a character class is inverted, it matches any character not in the class, not just any reasonable or common character; if you write [!aeiou] hoping to get consonants, you will also match punctuation or control characters.
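The over-broad reach of an inverted class is easy to demonstrate. In this sketch, not_vowel is a hypothetical function name.

```shell
#!/bin/sh
# not_vowel CHAR: report whether CHAR is matched by the inverted class [!aeiou].
# (not_vowel is a hypothetical helper name used for illustration.)
not_vowel() {
    case $1 in
        [!aeiou]) echo "matched" ;;
        *)        echo "not matched" ;;
    esac
}

not_vowel b      # matched: a consonant, as hoped
not_vowel e      # not matched: a lowercase vowel
not_vowel ';'    # matched: punctuation is not a vowel either
not_vowel E      # matched: patterns are case sensitive
```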

Wildcards do not have special meaning in a character class; [?*] matches a question mark or an asterisk, but not anything else.
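For instance, a test for a literal wildcard character might look like the following sketch, where is_wildcard_char is a hypothetical function name.

```shell
#!/bin/sh
# is_wildcard_char ARG: succeed if ARG is exactly one ? or one *.
# (is_wildcard_char is a hypothetical helper name used for illustration.)
is_wildcard_char() {
    case $1 in
        [?*]) return 0 ;;   # inside the class, ? and * are ordinary characters
        *)    return 1 ;;
    esac
}

for arg in '?' '*' 'a' 'ab'
do
    if is_wildcard_char "$arg"
    then echo "$arg: Match."
    else echo "$arg: No match."
    fi
done
```

If ? or * kept their wildcard meaning inside the class, a and ab would match as well; they do not.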

Character classes are one of the most complicated aspects of shell pattern matching. Left and right square brackets ([]), hyphens (-), and exclamation marks (!) are all special to them. A hyphen can easily be included in a class by specifying it as the last character of the class, with no following character. An exclamation mark can be included by specifying it as any character but the first. (What if there are no other characters? Then you are specifying only one character and probably don’t need a character class.) The left bracket is actually easy; include it anywhere, it won’t matter. The right bracket (]) is special; if you want a right bracket, put it either at the very beginning of the list or immediately after the ! for a negated class. Otherwise, the shell might think that the right bracket was intended to close the character class. Even apart from the intended feature set, be aware that some shells have plain and simple bugs having to do with right brackets in character classes; avoid them if you can.

If you want to match any left or right bracket, exclamation mark, or hyphen, but no other characters, here is a way to do it:

[][!-]

The first left bracket begins the definition of the class. The first right bracket does not close the class because there is nothing in it yet; it is taken as a plain literal right bracket. The second left bracket and the exclamation mark have no special meaning; neither is in a position where it would have any. Finally, the hyphen is not between two other characters in the class because the right square bracket ends the definition of the character class, so the hyphen must be a plain character.
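The class can be tried out with a short script. This is a sketch under the assumption that your shell handles right brackets in classes correctly, which, as noted earlier, not every shell does; is_class_special is a hypothetical function name.

```shell
#!/bin/sh
# is_class_special CHAR: succeed if CHAR is one of ] [ ! -
# (is_class_special is a hypothetical helper name used for illustration.)
is_class_special() {
    case $1 in
        [][!-]) return 0 ;;
        *)      return 1 ;;
    esac
}

for char in '[' ']' '!' '-' 'a' '^'
do
    if is_class_special "$char"
    then echo "$char: Match."
    else echo "$char: No match."
    fi
done
```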

Many users have the habit of using a caret (^) instead of ! in shell character classes. This is not portable, but it is a common extension some shells offer because habitual users of regular expressions may be more used to it. This can create an occasional surprise if you have never seen it used, and want to match a caret in a class.

Table 2-1 explains the behavior of a number of characters that may have special meaning within a character class, as well as how to include them literally in a class when you want to.

Table 2-1. Special Characters in Character Classes

Character  Meaning             Portability  How to Include It
]          End of class        Universal    Put at the beginning of the class (or first after the negation character)
[          Beginning of class  Universal    Put it anywhere in the class
^          Inversion           Common       Put after some other character
!          Inversion           Universal    Put after some other character
-          Range               Universal    Put at the beginning or end of the class

Ranges have an additional portability problem that is often overlooked, especially by English speakers. There is no guarantee that the range [a-z] matches every lowercase letter, and strictly speaking there is not even a guarantee that it matches only lowercase letters. The problem is that most people assume the ASCII character set, which defines only unaccented characters. In ASCII, the uppercase letters are contiguous, and the lowercase letters are also contiguous (but there are characters between them; [A-z] matches a few punctuation characters). However, there are Unix-like systems on which either or both of these assumptions may be wrong. In practice, it is very nearly portable to assume that [a-z] matches 26 lowercase letters. However, accented variants of lowercase letters do not match this pattern. There is no generally portable way to match additional characters, or even to find out what they are. Scripts may be run in different environments with different character sets.
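When it matters that a class matches exactly the 26 unaccented lowercase letters and nothing else, one option is to skip the range and list the characters explicitly. This is a sketch of that idea rather than a recommendation from the text above; is_plain_lower is a hypothetical function name, and the approach trades brevity for predictability.

```shell
#!/bin/sh
# is_plain_lower CHAR: succeed if CHAR is one of the 26 unaccented lowercase letters.
# The letters are listed one by one, so no range ordering is involved.
# (is_plain_lower is a hypothetical helper name used for illustration.)
is_plain_lower() {
    case $1 in
        [abcdefghijklmnopqrstuvwxyz]) return 0 ;;
        *)                            return 1 ;;
    esac
}

for char in q Q _
do
    if is_plain_lower "$char"
    then echo "$char: Match."
    else echo "$char: No match."
    fi
done
```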

Some shells also support additional character class notations; these were introduced by POSIX but so far are rare outside of ksh (not pdksh) and bash. The notation is [[:class:]], where class is a word like digit, alpha, or punct. This matches any character for which the corresponding C isclass() function (for example, isdigit() for digit) would return true. For example, [[:digit:]] is equivalent to [0-9]. These classes may be combined with other characters; [[:digit:][:alpha:]_] matches any letter, digit, or underscore (_). Additional similar rules use [.name.] to match a special collating symbol (for instance, some languages might have a special rule for matching and sorting certain combinations of letters, so a ch might sort differently from a c followed by an h) and [=name=] to match equivalence classes, such as a lowercase letter and any accented variant of it. These rules are particularly useful for internationalized scripts but not sufficiently widely available to be used in portable scripts yet. To avoid any possible misunderstandings, avoid using a left bracket followed immediately by a period (.), equals sign (=), or colon (:) in a character class. Note that this applies only to a left bracket within the character class, not the initial bracket that opens the class; [.] matches a period. (This is more significant in regular expressions, where a period would otherwise have special meaning.)
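On a shell that supports the POSIX class notation, the combined class from the paragraph above could be used as in the following sketch. It assumes such support is present, so it is not suitable for scripts that must run everywhere; is_word_char is a hypothetical function name.

```shell
#!/bin/sh
# is_word_char CHAR: succeed if CHAR is a letter, a digit, or an underscore.
# Assumes the shell supports POSIX named classes; not all shells do.
# (is_word_char is a hypothetical helper name used for illustration.)
is_word_char() {
    case $1 in
        [[:digit:][:alpha:]_]) return 0 ;;
        *)                     return 1 ;;
    esac
}

for char in a 7 _ -
do
    if is_word_char "$char"
    then echo "$char: Match."
    else echo "$char: No match."
    fi
done
```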

Character classes are, as you can see, substantially more complicated than the rest of the shell pattern matching rules.

Shell patterns are quite powerful, but they have a number of limitations. There is no way to specify repetition of a character class; no shell pattern matches an arbitrary number of digits. You can’t make part of a pattern optional; the closest you get to optional components is the asterisk.
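Although no single pattern means "one or more digits," the same question can often be asked the other way around: reject the empty string and anything containing a non-digit, and whatever remains must consist only of digits. The sketch below shows this common idiom; is_all_digits is a hypothetical function name.

```shell
#!/bin/sh
# is_all_digits STRING: succeed if STRING is non-empty and contains only digits.
# (is_all_digits is a hypothetical helper name used for illustration.)
is_all_digits() {
    case $1 in
        '')        return 1 ;;   # empty string: no digits at all
        *[!0-9]*)  return 1 ;;   # at least one non-digit somewhere
        *)         return 0 ;;   # nothing but digits is left
    esac
}

for string in 12345 12a45 ''
do
    if is_all_digits "$string"
    then echo "'$string': all digits"
    else echo "'$string': not all digits"
    fi
done
```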

Patterns as a whole generally match as much as they can; this is called being greedy. However, if matching too many things with an asterisk prevents a match, the asterisk gives up the extra characters and lets other pattern components match them. If you match the pattern b* to the string banana, the * matches the text anana. However, if you use the pattern b*na, the * matches only the text ana. The rule is that the * grabs the largest number of characters it can without preventing a match. Other pattern components, such as character classes, literal characters, or question marks, get first priority on consuming characters, and the asterisk gets what’s left.
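A case statement cannot show which characters the asterisk consumed, but it can show that the asterisk gives characters back when needed. In this sketch, ends_with_na is a hypothetical function name; in each matching case the trailing na must be left for the literal part of the pattern.

```shell
#!/bin/sh
# ends_with_na WORD: report whether WORD matches b*na.
# (ends_with_na is a hypothetical helper name used for illustration.)
ends_with_na() {
    case $1 in
        b*na) echo "$1: Match." ;;
        *)    echo "$1: No match." ;;
    esac
}

ends_with_na banana   # Match: the * takes ana
ends_with_na bana     # Match: the * takes a
ends_with_na bna      # Match: the * takes nothing
ends_with_na ban      # No match: nothing is left to match the final na
```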

Some of the limitations of shell patterns can be overcome by creative usage. One way to store lists of items in the shell is to have multiple items joined with a delimiter; for instance, you might store the value a,b,c to represent a list of three items. The following example code illustrates how such a list might be used. (The case statement, used here, executes code when a pattern matches a given string.)

list=orange,apple,banana
case $list in
  *apple*) echo "How do you like them apples?";;
esac
How do you like them apples?

This script has a subtle bug, however. It does not check for exact matches. If you try to check against a slightly different list, the problem becomes obvious:

list=orange,crabapple,banana
case $list in
  *apple*) echo "How do you like them apples?";;
esac
How do you like them apples?

The problem is that the asterisks can match anything, even the commas used as delimiters. However, if you add the delimiters to the pattern, you can no longer match the ends of the list:

list=orange,apple,banana
case $list in
  *,orange,*) echo "The only fruit for which there is no Cockney slang.";;
esac
[no output]

To resolve this, wrap the list in an extra set of delimiters when expanding it:

list=orange,apple,banana
case ,$list, in
  *,orange,*) echo "The only fruit for which there is no Cockney slang.";;
esac
The only fruit for which there is no Cockney slang.

The expansion of $list now has a comma appended to each end, ensuring that every member of the list has a comma on both sides of it.
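The same technique can be wrapped in a reusable function. This is a minimal sketch that assumes items never contain commas; list_contains is a hypothetical name, and the item is quoted inside the pattern so that any wildcard characters in it are taken literally.

```shell
#!/bin/sh
# list_contains LIST ITEM: succeed if ITEM is an exact member of the
# comma-separated LIST. Assumes items themselves contain no commas.
# (list_contains is a hypothetical helper name used for illustration.)
list_contains() {
    case ,$1, in
        *,"$2",*) return 0 ;;
        *)        return 1 ;;
    esac
}

for list in orange,apple,banana orange,crabapple,banana
do
    if list_contains "$list" apple
    then echo "$list: contains apple"
    else echo "$list: does not contain apple"
    fi
done
```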
