NAME

proctut3 - Simple Regular Expressions, Part II

SYNOPSIS

Regular expressions are integral to effective procmail recipes. In Simple Regular Expressions, Part I, we covered the essential and basic parts of nearly all regular expressions. We continue to learn more of the core regular expression syntax in this second part of the regular expression tutorials.

DESCRIPTION

We recall from our previous discussion that procmail uses regular expressions to "find things" in message headers or message bodies. To save us a lot of work, regular expressions allow us to substitute special characters[1] to match many other characters.

For example, the dot character (".") matches any character (except newlines), and allows us to write patterns like this:

    My dog has .....

which will match any of the following:

    My dog has fleas
    My dog has foots
    My dog has phhht

With quantifiers, we can extend the power of a pattern and make it more concise. Consider this pattern:

    My dog has .+

This saves a few characters, is easier to read, and matches many more phrases than our previous pattern. Patterns (the combination of regular characters, regular expression tokens, and quantifiers) are the expressive power of regular expressions.
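The difference between the two patterns can be checked with a short sketch. Python's "re" module is used here only as a stand-in for procmail's own matcher (an assumption; the two engines differ in details), and re.IGNORECASE mimics procmail's default case-insensitive matching. The last two sample phrases are made up for illustration.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
FIVE_DOTS = re.compile(r"My dog has .....", re.IGNORECASE)
DOT_PLUS = re.compile(r"My dog has .+", re.IGNORECASE)

phrases = [
    "My dog has fleas",
    "My dog has foots",
    "My dog has phhht",
    "My dog has a bone",  # hypothetical extra phrase
    "My dog has fun",     # hypothetical extra phrase, only three letters
]

for phrase in phrases:
    # search() looks for the pattern anywhere in the line.
    five = bool(FIVE_DOTS.search(phrase))
    plus = bool(DOT_PLUS.search(phrase))
    print(f"{phrase!r}: five dots={five}, dot-plus={plus}")
```

Only the dot-plus pattern accepts the three-letter ending, because five dots insist on at least five characters after "My dog has ".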

A New Problem

It certainly is nice to be able to substitute a single character or multiple characters with a single dot and a quantifier, but what if we don't want to be so cavalier in our matching? That is to say, what if we only want to match a certain set of things?

For example, we want to match the following two phrases and only the following two phrases:

    My dog has fleas
    My dog has feet

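With the tools we have so far, our options are limited: we can write a loose pattern such as "My dog has f.+", which matches far more than these two phrases, or we can write two separate patterns, one for each phrase.

This doesn't agree with our sense of decency: regular expressions ought to be able to help us make this more efficient.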

Things Lumped Together

The answer, of course, is we need a way to lump things together. In the regular expression world, we call these "groups" or "sets". The following three sections discuss how we can lump things together in procmail regular expressions.

Groups

We spoke a half-truth[2] earlier when we mentioned that quantifiers apply to the immediately preceding character. While that statement is true, quantifiers may also apply to groups of characters as a whole.

Consider the problem of matching a sentence where an entire phrase is optional:

    Help stamp out and abolish redundancy[3]

really ought to be:

    Help abolish redundancy

We'd like to find either phrase without writing two patterns. Possible? With groups it is:

    Help (stamp out and )?abolish redundancy

We introduce now the parentheses; they "lump things together" and when something is lumped together, we can then apply a quantifier to the entire group.

By tweaking the pattern just a little:

    Help (stamp out and )*abolish redundancy

We can now match highly redundant variations on a theme:

    Help abolish redundancy
    Help stamp out and abolish redundancy
    Help stamp out and stamp out and abolish redundancy
    Help stamp out and stamp out and stamp out and abolish redundancy

We can apply any quantifier to a group and it will work just as if the entire group were a single character (as we described in the previous tutorial).
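A minimal sketch of the two grouped patterns follows, again using Python's "re" module as a stand-in for procmail's matcher (an assumption). The helper name count_matches is hypothetical and exists only for this illustration.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
OPTIONAL = re.compile(r"Help (stamp out and )?abolish redundancy")
REPEATED = re.compile(r"Help (stamp out and )*abolish redundancy")


def count_matches(pattern, lines):
    """Return how many of the lines are matched in full by the pattern."""
    return sum(1 for line in lines if pattern.fullmatch(line))


# Build the sentence with zero, one, two, and three copies of the phrase.
lines = ["Help " + "stamp out and " * n + "abolish redundancy" for n in range(4)]

print(count_matches(OPTIONAL, lines))  # the "?" group allows zero or one copy
print(count_matches(REPEATED, lines))  # the "*" group allows any number of copies
```

The question-mark version accepts only the first two sentences, while the star version accepts all four.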

Alternation

Armed with groups, we can almost solve our original problem of matching the following two phrases:

    My dog has fleas
    My dog has feet

We might try:

    My dog has (fleas)?(feet)?

which would get us closer. This certainly will match the two above sentences, but, unfortunately, it also matches these two sentences:

    My dog has
    My dog has fleasfeet

What this regular expression needs is a good five-cent alternation. Alternation is like having two or more groups lumped together and letting the regular expression engine pick one among the groups. Here is our answer:

    My dog has (fleas|feet)

The pipe character ("|") separates two or more things in a group. We can now easily extend our pattern to match more things without matching too much more:

    My dog has (fleas|feet|worms)
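To see why alternation succeeds where two optional groups fail, here is a small sketch. Python's "re" module stands in for procmail's matcher (an assumption), and fullmatch() is used so that each pattern must account for the entire phrase.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
TWO_OPTIONALS = re.compile(r"My dog has (fleas)?(feet)?")
ALTERNATION = re.compile(r"My dog has (fleas|feet)")

phrases = [
    "My dog has fleas",
    "My dog has feet",
    "My dog has ",           # nothing after the trailing space
    "My dog has fleasfeet",  # both optional groups present
]

for phrase in phrases:
    loose = bool(TWO_OPTIONALS.fullmatch(phrase))
    strict = bool(ALTERNATION.fullmatch(phrase))
    print(f"{phrase!r}: two optionals={loose}, alternation={strict}")
```

The two-optionals pattern accepts all four phrases; the alternation accepts only the first two, because it must pick exactly one of its branches.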

Poor doggy! Ah, well, despite our dog's troubles, we can match all of them in a single, tidy regular expression. Can we apply quantifiers to an alternation? You betcha: it behaves exactly like a group (alternations are actually just a special kind of group). Consider this pattern:

    Subject:.*(saw|about)? your (web ?)?site

Look familiar? This will match a formerly common spam subject line. Let's study it carefully[4]:

    The phrase "Subject:" followed by anything (dot-star matches
    anything) OPTIONALLY followed by the word "saw" OR "about"
    followed by " your " (with spaces) OPTIONALLY followed by the word
    "web" with an optional space, followed by the word "site".

So this will match exactly these phrases and only these phrases (we're ignoring the dot-star pattern for now):

    Subject: your site
    Subject: your website
    Subject: your web site
    Subject: saw your site
    Subject: saw your website
    Subject: saw your web site
    Subject: about your site
    Subject: about your website
    Subject: about your web site
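The nine subject lines above can be generated and checked mechanically. This is only a sketch: Python's "re" module stands in for procmail's matcher (an assumption), re.IGNORECASE mimics procmail's default case-insensitive matching, and the two non-matching subjects are made up for illustration.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
SITE_SPAM = re.compile(r"Subject:.*(saw|about)? your (web ?)?site", re.IGNORECASE)

# Build the nine subject lines from their interchangeable parts.
openers = ["", "saw ", "about "]
sites = ["site", "website", "web site"]
subjects = [f"Subject: {opener}your {site}" for opener in openers for site in sites]

matched = [s for s in subjects if SITE_SPAM.search(s)]
print(len(subjects), len(matched))

# Hypothetical subjects that should be left alone.
for other in ["Subject: your page", "Subject: lunch on Friday?"]:
    print(f"{other!r}: {bool(SITE_SPAM.search(other))}")
```

Because of the dot-star, the real pattern also matches longer subjects that merely contain one of these phrases, which is why the list above sets it aside.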

How's that for a nice solution! With groups and alternation, we can do so much more than we could with just quantifiers and characters. To round off our lesson, we'll talk about a special way of grouping characters called character classes.

Character Classes

The term "character classes" sounds technical, but the concept is simple, and we already understand groups. Say we have another problem with spam. Subject lines like this keep getting through:

    Subject: 0nline casino

Notice the leading "0" in "0nline" is a zero, not the letter "o"? Tricksy, nasty spammers! So we counter with our groups:

    Subject: (online|0nline) casino

But now the spammer sends this:

    Subject: 0nline casin0

A trailing zero! Alright, how about this pattern:

    Subject: (online|0nline) (casino|casin0)

That will work. For now:

    Subject: 0n1ine casin0

The ell in "online" is the digit "one" ("1"). Whew! We can make really long alternations. Or, we could make smaller groups:

    Subject: (o|0)n(1|l)ine casin(o|0)

That's not too bad, but it's ugly. And it's inefficient for procmail to do it this way. There's a better way: character classes. A character class works like this:

    Subject: [o0]n[1l]ine casin[o0]

A character class is an opening square bracket, followed by all the possible characters that we should look for, followed by a closing square bracket. Our pattern above uses two distinct character classes:

    [o0] (used twice)
    [1l]
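As a quick check that the character-class pattern catches every variant the spammer has tried so far, and agrees with the uglier alternation form, here is a sketch. Python's "re" module stands in for procmail's matcher (an assumption), and the "offline" subject is a made-up counterexample.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
WITH_CLASSES = re.compile(r"Subject: [o0]n[1l]ine casin[o0]", re.IGNORECASE)
WITH_GROUPS = re.compile(r"Subject: (o|0)n(1|l)ine casin(o|0)", re.IGNORECASE)

subjects = [
    "Subject: online casino",
    "Subject: 0nline casino",
    "Subject: 0nline casin0",
    "Subject: 0n1ine casin0",
    "Subject: offline casino",  # hypothetical subject that should not match
]

for subject in subjects:
    by_class = bool(WITH_CLASSES.search(subject))
    by_group = bool(WITH_GROUPS.search(subject))
    print(f"{subject!r}: classes={by_class}, groups={by_group}")
```

Both patterns give the same answer for every line; the character-class version is simply shorter and easier to read.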

But character classes are not limited to listing all possible characters in the class. We also have the range operator that helps us define a range of characters to include. You'll often see character classes like this:

    [a-z]

which means match any character "a" through "z" (inclusive). You'll see:

    [0-9]

which of course matches all digits. Quantifiers also work with character classes:

    [0-9]+

will match any series of digits (not just a single repeating digit, like "8+"; the character class means "any of these" and when a plus quantifier is applied it means "one or more of 'any of these'").
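The contrast between "[0-9]+" and "8+" is easy to see in a sketch. Python's "re" module stands in for procmail's matcher (an assumption); findall() is a Python convenience for listing every match and has no counterpart in a procmail recipe. The sample sentence is made up.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
ANY_DIGITS = re.compile(r"[0-9]+")  # one or more of "any digit"
ONLY_EIGHTS = re.compile(r"8+")     # one or more of the digit 8

text = "Order 8812 shipped in 3 boxes"  # hypothetical sample text

digit_runs = ANY_DIGITS.findall(text)
eight_runs = ONLY_EIGHTS.findall(text)

print(digit_runs)  # ['8812', '3']
print(eight_runs)  # ['88']
```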

Another common thing you'll see in many recipes:

    [   ]

This is a character class containing a space character and a tab character. It's a little hard to read, unfortunately, but it's commonly used. It's useful in a lot of places, especially email headers:

    Subject:[   ]*some subject
    From:[      ]*.*joe@schmoe.org

The first of these matches this way:

    The phrase "Subject:" followed by zero or more spaces or tabs
    (remember, they can be mixed), followed by the phrase "some
    subject".
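Here is a sketch of the whitespace class at work. Python's "re" module stands in for procmail's matcher (an assumption). In the Python pattern the tab is written as "\t" so that it is visible, whereas the patterns in this tutorial contain a literal tab character. The sample headers are made up.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
# "\t" inside the class is Python's regex notation for a tab character.
SUBJECT = re.compile(r"Subject:[ \t]*some subject", re.IGNORECASE)

headers = [
    "Subject:some subject",          # zero spaces or tabs
    "Subject: some subject",         # one space
    "Subject: \t \tsome subject",    # spaces and tabs mixed
    "Subject:-some subject",         # a dash is neither space nor tab
]

for header in headers:
    print(f"{header!r}: {bool(SUBJECT.search(header))}")
```

The first three headers match; the last one does not, because the class allows only spaces and tabs between the colon and the text.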

Negative Character Classes

What about matching any character not in a set? For example, we want to match subject lines that do not begin with numbers:

    Subject:[   ]*[a-z]+

Well, that's a good start, but letters and numbers aren't the only things on the keyboard. There are lots of punctuation marks, and there are the extended ASCII characters (umlaut, grave and acute accented characters, etc.). Listing all of these would be a big pain, and error-prone.

The solution is to negate your character class by putting a caret ("^") as the first character in your character class:

    Subject:[   ]*[^0-9]+

This means literally:

    The phrase "Subject:" followed by zero or more spaces or tabs,
    followed by one or more characters that are not digits.
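The sketch below compares a letters-only class with a negated digit class, one character at a time. Python's "re" module stands in for procmail's matcher (an assumption), and the names LOOSE and TIGHT are hypothetical labels used only here.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
LETTERS = re.compile(r"[a-z]", re.IGNORECASE)
NOT_DIGIT = re.compile(r"[^0-9]")

for ch in ["h", "!", "ü", " ", "7"]:
    print(repr(ch), bool(LETTERS.fullmatch(ch)), bool(NOT_DIGIT.fullmatch(ch)))

# A space is also "not a digit", so the whitespace class can hand its
# space over to the negated class and let a numeric subject through.
LOOSE = re.compile(r"Subject:[ \t]*[^0-9]+")
# Hypothetical tightening: exclude whitespace from the negated class too.
TIGHT = re.compile(r"Subject:[ \t]*[^ \t0-9]")

print(bool(LOOSE.search("Subject: 123 reasons")))
print(bool(TIGHT.search("Subject: 123 reasons")))
print(bool(TIGHT.search("Subject: hello")))
```

The negated class accepts punctuation, accented letters, and even the space, none of which the letters-only class covers. That last point deserves care: since the space counts as a non-digit, the full pattern can still match a subject that starts with a number, and leaving whitespace out of the negated class is one possible way to tighten it.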

Negated character classes complete our understanding of "things lumped together".

SUMMARY

Regular expressions are made more powerful and concise with groups, alternation, and character classes. When combined with quantifiers, these allow us to match entire words or phrases (or simply sets of characters) in a clean, readable manner.

NOTES

Note 1

When these special characters are joined together we call them "patterns".

Note 2

A little inaccuracy sometimes saves tons of explanation -- H. H. Munro

Note 3

This first quote is apparently anonymous and is taken from the FreeBSD fortune file. I think my most recent favorite I stumbled across was a quote from Ralph Waldo Emerson: "I hate quotations."

Note 4

As the author, I can assert that you really ought to study these "English" regular expressions, but it's up to you, of course. I'm not making you read this tutorial (but it's good for you!).

PREVIOUS

Simple Regular Expressions, Part I

NEXT

Simple Regular Expressions, Part III

SEE ALSO

procmail(1), procmailrc(5), procmailex(5), regex(3)

AUTHOR

Scott Wiersdorf <scott@perlcode.org>

COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

REVISION

$Id: proctut3.pod,v 1.6 2003/10/23 19:24:23 deep Exp $