proctut3 - Simple Regular Expressions, Part II
Regular expressions are integral to effective procmail recipes. In Simple Regular Expressions, Part I, we covered the essential and basic parts of nearly all regular expressions. We continue to learn more of the core regular expression syntax in this second part of the regular expression tutorials.
We recall from our previous discussion that procmail uses regular expressions to "find things" in message headers or message bodies. To save us a lot of work, regular expressions let us use special characters[1] that stand in for many other characters.
For example, the dot character (".") matches any character (except newlines), and allows us to write patterns like this:
My dog has .....
which will match any of the following:
My dog has fleas
My dog has foots
My dog has phhht
With quantifiers, we can extend the power of a pattern and make it more concise. Consider this pattern:
My dog has .+
This saves a few characters, is easier to read, and matches many more phrases than our previous pattern. Patterns (the combination of regular characters, regular expression tokens, and quantifiers) are the expressive power of regular expressions.
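The difference between the two patterns can be checked with a small sketch. It uses Python's re module as a stand-in for procmail's own matcher (an assumption made purely for illustration; procmail has its own engine and, for instance, ignores case by default). The helper name full_matches and the third sample phrase are hypothetical. re.fullmatch is used so that the pattern has to account for the whole phrase, which mirrors how this tutorial talks about phrases.

```python
import re

phrases = ["My dog has fleas", "My dog has foots", "My dog has a bone to pick"]

# Five dots: exactly five characters must follow "My dog has ".
five_dots = re.compile(r"My dog has .....")
# Dot-plus: one or more characters of any kind may follow.
dot_plus = re.compile(r"My dog has .+")


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if pattern.fullmatch(c)]


print(full_matches(five_dots, phrases))
print(full_matches(dot_plus, phrases))
```

The five-dot pattern accepts only the phrases with a five-letter ending, while the dot-plus pattern accepts all three.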
It certainly is nice to be able to substitute a single character or multiple characters with a single dot and a quantifier, but what if we don't want to be so cavalier in our matching? That is to say, what if we only want to match a certain set of things?
For example, we want to match the following two phrases and only the following two phrases:
My dog has fleas
My dog has feet
Here are our current options:
My dog has f.+
This suffers from inaccuracy. It will match much more than our two phrases:
My dog has feet
My dog has friends
My dog has foul halitosis
Alternatively, we could write a separate pattern for each phrase:
My dog has fleas
My dog has feet
This doesn't agree with our sense of decency: regular expressions ought to be able to help us make this more efficient.
The answer, of course, is we need a way to lump things together. In the regular expression world, we call these "groups" or "sets". The following three sections discuss how we can lump things together in procmail regular expressions.
We spoke a half-truth[2] earlier when we mentioned that quantifiers apply to the immediately preceding character. While that statement is true, quantifiers may also apply to groups of characters as a whole.
Consider the problem of matching a sentence where an entire phrase is optional:
Help stamp out and abolish redundancy[3]
really ought to be:
Help abolish redundancy
We'd like to find either phrase without writing two patterns. Possible? With groups it is:
Help (stamp out and )?abolish redundancy
We introduce now the parentheses; they "lump things together" and when something is lumped together, we can then apply a quantifier to the entire group.
By tweaking the pattern just a little:
Help (stamp out and )*abolish redundancy
We can now match highly redundant variations on a theme:
Help abolish redundancy
Help stamp out and abolish redundancy
Help stamp out and stamp out and abolish redundancy
Help stamp out and stamp out and stamp out and abolish redundancy
We can apply any quantifier to a group and it will work just as if the entire group were a single character (as we described in the previous tutorial).
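A short sketch makes the effect of quantifying a group concrete. Python's re module stands in for procmail's matcher here (an assumption for illustration only), the helper build is hypothetical, and re.fullmatch is used so each pattern must cover the entire sentence.

```python
import re

# "?" makes the whole parenthesized phrase optional (zero or one time).
optional = re.compile(r"Help (stamp out and )?abolish redundancy")
# "*" lets the whole parenthesized phrase repeat (zero or more times).
repeated = re.compile(r"Help (stamp out and )*abolish redundancy")


def build(n):
    """Build the sentence with the phrase 'stamp out and ' repeated n times."""
    return "Help " + "stamp out and " * n + "abolish redundancy"


for n in range(4):
    sentence = build(n)
    print(n, bool(optional.fullmatch(sentence)), bool(repeated.fullmatch(sentence)))
```

The question-mark version accepts the sentence with zero or one copy of the phrase; the star version accepts any number of copies.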
Armed with groups, we can almost solve our original problem of matching the following two phrases:
My dog has fleas
My dog has feet
We might try:
My dog has (fleas)?(feet)?
which would get us closer. This certainly will match the two above sentences, but, unfortunately, it also matches these two sentences:
My dog has
My dog has fleasfeet
What this regular expression needs is a good five-cent alternation. Alternation is like having two or more groups lumped together and letting the regular expression engine pick one among the groups. Here is our answer:
My dog has (fleas|feet)
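The three attempts so far can be compared side by side. This is a sketch only: Python's re module stands in for procmail's matcher, the dictionary labels and the helper accepted are hypothetical, and re.fullmatch is used so each pattern must account for the whole phrase. Note that the third candidate ends with a trailing space, because the pattern's literal text ends with one.

```python
import re

candidates = [
    "My dog has fleas",
    "My dog has feet",
    "My dog has ",          # trailing space: nothing after the literal text
    "My dog has fleasfeet",
    "My dog has friends",
]

patterns = {
    "dot-plus": r"My dog has f.+",
    "two optional groups": r"My dog has (fleas)?(feet)?",
    "alternation": r"My dog has (fleas|feet)",
}


def accepted(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


for label, pattern in patterns.items():
    print(label, accepted(pattern))
```

Only the alternation accepts the two wanted phrases and nothing else.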
The pipe character ("|") separates two or more things in a group. We can now easily extend our pattern to match more things without matching too much more:
My dog has (fleas|feet|worms)
Poor doggy! Ah, well, despite our dog's troubles, we can match all of them in a single, tidy regular expression. Can we apply quantifiers to an alternation? You betcha: it behaves exactly like a group (alternations are actually just a special kind of group). Consider this pattern:
Subject:.*(saw|about)? your (web ?)?site
Look familiar? This will match a formerly common spam subject line. Let's study it carefully[4]:
The phrase "Subject:" followed by anything (dot-star matches
anything) OPTIONALLY followed by the word "saw" OR "about"
followed by " your " (with spaces) OPTIONALLY followed by the word
"web" with an optional space, followed by the word "site".
So this will match exactly these phrases and only these phrases (we're ignoring the dot-star pattern for now):
Subject: your site
Subject: your website
Subject: your web site
Subject: saw your site
Subject: saw your website
Subject: saw your web site
Subject: about your site
Subject: about your website
Subject: about your web site
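The nine subject lines above can be generated and checked mechanically. In this sketch Python's re module stands in for procmail's matcher (an assumption for illustration), and the names verbs, sites, and subjects are hypothetical. The space before "saw" or "about" is absorbed by the dot-star.

```python
import itertools
import re

pattern = re.compile(r"Subject:.*(saw|about)? your (web ?)?site")

verbs = ["", " saw", " about"]           # the optional word, or nothing at all
sites = ["site", "website", "web site"]  # "web" is optional, as is the space after it

subjects = [
    f"Subject:{verb} your {site}"
    for verb, site in itertools.product(verbs, sites)
]

for subject in subjects:
    print(subject, bool(pattern.fullmatch(subject)))
```

Every generated line is accepted, while a near miss such as "Subject: your webpage" is not.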
How's that for a nice solution! With groups and alternation, we can do so much more than we could with just quantifiers and characters. To round off our lesson, we'll talk about a special way of grouping characters called character classes.
The term "character classes" sounds technical, but it's a simple concept to understand. We already understand groups. Say we have another problem with spam. Subject lines like this keep getting through:
Subject: 0nline casino
Notice the leading "0" in "0nline" is a zero, not the letter "o"? Tricksy, nasty spammers! So we counter with our groups:
Subject: (online|0nline) casino
But now the spammer sends this:
Subject: 0nline casin0
A trailing zero! Alright, how about this pattern:
Subject: (online|0nline) (casino|casin0)
That will work. For now:
Subject: 0n1ine casin0
The ell in "online" is the digit "one" ("1"). Whew! We can make really long alternations. Or, we could make smaller groups:
Subject: (o|0)n(1|l)ine casin(o|0)
That's not too bad, but it's ugly. And it's inefficient for procmail to do it this way. There's a better way: character classes. A character class works like this:
Subject: [o0]n[1l]ine casin[o0]
An open square bracket, followed by all the possible characters that we should look for, followed by a close square bracket: this is a character class. Our pattern above uses two distinct character classes:
[o0] (used twice)
[1l]
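As a quick check that the character-class pattern accepts the same subject lines as the single-character alternations, here is a sketch using Python's re module as a stand-in for procmail's matcher (an assumption for illustration). The last sample subject is a hypothetical non-spam line added for contrast.

```python
import re

alternation = re.compile(r"Subject: (o|0)n(1|l)ine casin(o|0)")
char_class = re.compile(r"Subject: [o0]n[1l]ine casin[o0]")

subjects = [
    "Subject: online casino",
    "Subject: 0nline casino",
    "Subject: 0nline casin0",
    "Subject: 0n1ine casin0",
    "Subject: offline casino",  # should be rejected by both patterns
]

for subject in subjects:
    print(subject, bool(alternation.fullmatch(subject)), bool(char_class.fullmatch(subject)))
```

Both patterns agree on every line; the character-class version is simply shorter and easier to read.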
But character classes are not limited to listing all possible characters in the class. We also have the range operator that helps us define a range of characters to include. You'll often see character classes like this:
[a-z]
which means match any character "a" through "z" (inclusive). You'll see:
[0-9]
which of course matches all digits. Quantifiers also work with character classes:
[0-9]+
will match any series of digits (not just a single repeating digit, like "8+"; the character class means "any of these" and when a plus quantifier is applied it means "one or more of 'any of these'").
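The contrast between a quantified range and a quantified single character can be sketched as follows. Python's re module stands in for procmail's matcher (an assumption for illustration), and the sample strings are hypothetical.

```python
import re

any_digits = re.compile(r"[0-9]+")   # one or more of "any digit"
only_eights = re.compile(r"8+")      # one or more of the digit 8 only

samples = ["8888", "2003", "7", "proctut3"]

for sample in samples:
    print(sample, bool(any_digits.fullmatch(sample)), bool(only_eights.fullmatch(sample)))
```

The range accepts every all-digit string, whereas "8+" accepts only the run of eights; neither accepts a string that contains letters.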
Another common thing you'll see in many recipes:
[ ]
This is a character class containing a space character followed by a tab character. It's a little hard to read, unfortunately, but it's commonly used. It's useful in a lot of places, especially email headers:
Subject:[ ]*some subject
From:[ ]*.*joe@schmoe.org
The first of these matches this way:
The phrase "Subject:" followed by zero or more spaces or tabs
(remember, they can be mixed), followed by the phrase "some
subject".
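The space-or-tab class can be tried out in a sketch like the one below. Python's re module stands in for procmail's matcher (an assumption for illustration), and the Python escape "\t" is used only to make the tab visible; the recipe patterns in this tutorial contain the tab character itself. The sample headers are hypothetical.

```python
import re

# [ \t]* means: zero or more characters, each of which is a space or a tab.
pattern = re.compile(r"Subject:[ \t]*some subject")

headers = [
    "Subject:some subject",        # no whitespace at all
    "Subject: some subject",       # a single space
    "Subject: \t some subject",    # spaces and a tab, mixed
    "Subject:\t\tsome subject",    # tabs only
    "Subject:-some subject",       # a dash is neither a space nor a tab
]

for header in headers:
    print(repr(header), bool(pattern.fullmatch(header)))
```

All of the whitespace variations are accepted; the header with a dash after the colon is not.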
What about matching any character not in a set? For example, we want to match subject lines that do not begin with numbers:
Subject:[ ]*[a-z]+
Well, that's a good start, but letters and numbers aren't the only things on the keyboard. There are lots of punctuation marks, and there are the extended ASCII characters (characters with umlauts, grave and acute accents, etc.). Listing all of these would be a big pain, and error-prone.
The solution is to negate your character class by putting a caret ("^") as the first character in your character class:
Subject:[ ]*[^0-9]+
This means literally:
The phrase "Subject:" followed by zero or more spaces or tabs,
followed by one or more characters that are not digits.
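To see what a negated class accepts on its own, the sketch below tests it against a handful of single characters and then pulls the non-digit runs out of a mixed string. Python's re module stands in for procmail's matcher (an assumption for illustration), and the sample characters and string are hypothetical.

```python
import re

non_digit = re.compile(r"[^0-9]")

# "\u00fc" is a lowercase u with an umlaut, written as an escape for portability.
characters = ["a", "Z", "!", "\u00fc", " ", "7"]

for ch in characters:
    print(repr(ch), bool(non_digit.fullmatch(ch)))

# With a plus quantifier, the class matches whole runs of non-digits.
runs = re.findall(r"[^0-9]+", "abc123def!")
print(runs)
```

Letters, punctuation, accented characters, and even the space all belong to the negated class; only the digit is excluded. Because whitespace belongs to it too, a negated class placed right after an optional whitespace class can match the whitespace itself.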
Negated character classes complete our understanding of "things lumped together".
Regular expressions are made more powerful and concise with groups, alternation, and character classes. When combined with quantifiers, these allow us to match entire words or phrases (or simply sets of characters) in a clean, readable manner.
Simple Regular Expressions, Part I
Simple Regular Expressions, Part III
procmail(1), procmailrc(5), procmailex(5), regex(3)
Scott Wiersdorf <scott@perlcode.org>
Copyright (c) 2003 Scott Wiersdorf. All rights reserved.
$Id: proctut3.pod,v 1.6 2003/10/23 19:24:23 deep Exp $