NAME

proctut4 - Simple Regular Expressions, Part III

SYNOPSIS

We finish our "Simple Regular Expressions" series by learning about anchors that procmail uses in matching. Anchors "tie down" a regular expression so it doesn't float left or right on the line. We also cover some real-world examples that will bring together everything we've covered about regular expressions to date.

DESCRIPTION

We've learned about the dot operator, quantifiers, and groups, which include alternation and character classes (as well as negative character classes).

What more could we possibly need? Funny you should ask. Let's pretend that as part of our spam fighting arsenal, we trash messages that contain certain words in the subject line. The regular expression we use to do that looks like this:

    Subject:.*your (web ?)?site

This is an incomplete recipe, of course, missing the flags and action lines. As in previous tutorials where we want to look at just the regular expression, we also omit the asterisk at the beginning of every condition.

So far, so good. Our condition is successfully matching spam messages and trashing them. Our unwitting friend, however, bumblingly sends us an email message with the following subject:

    Subject: Dude, I think your site needs work!

Will it match our anti-spam recipe? Of course it will. Our regular expression reads in English:

    Match the phrase "Subject:", followed by anything (" Dude, I
    think "), followed by the phrase "your ", followed by the word
    "site" (remember that "web" and the space after it are optional).

That matches our subject line alright. "But there's more stuff after 'site'," you say. Yes, there is, but our pattern doesn't care about that: we reached the end of our regular expression successfully and that's good enough for procmail.
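We can check this outside procmail with a small sketch. It uses Python's "re" module as a stand-in, which is an assumption: Python's engine is not procmail's, but for this simple pattern the two should agree on whether a line matches. The name SPAM_PATTERN is made up for illustration, and re.IGNORECASE imitates procmail's default case-insensitive matching.

```python
import re

# Stand-in for the procmail condition (procmail ignores case by default).
SPAM_PATTERN = re.compile(r"Subject:.*your (web ?)?site", re.IGNORECASE)

spam = "Subject: Re: your web site"
friend = "Subject: Dude, I think your site needs work!"

# search() succeeds as soon as the whole pattern has been satisfied;
# whatever follows "site" on the line is simply ignored.
spam_hit = SPAM_PATTERN.search(spam) is not None
friend_hit = SPAM_PATTERN.search(friend) is not None

print(spam_hit, friend_hit)  # True True
```

Both lines match, which is exactly the problem we are about to solve.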

So how can we allow our friend's email message through but still block spam?

Two Solutions

Our naive solution is to have our friend resend his message with a different subject line; we call him up and ask if he could send it again. He obliges promptly and sends the message again. We wait a few moments, but the message never arrives. We check our spam log again, and, dang it, there's his message again! What's going on?

His message had this subject line:

    Subject: second try: check out yo' space

Hmmm... that doesn't seem like it would match our regex. What else could possibly trigger our recipe? As we examine his message further, we notice the following line in his message headers:

    Old-Subject: Dude, I think your site needs work!

Ha! His email client preserved the old subject line and prepended "Old-" to it. But why was it caught in our spam trap again? "Old-Subject:" doesn't match "Subject:", does it? Funny you should ask.

Leftmost, Shortest Match

The procmail regular expression engine[1] matches (roughly) like this: it compares the first character in the pattern (the letter "S" in "Subject") with the first character in the header line ("O" in "Old-Subject"). If they're the same (they're not in this case), the next letter of the pattern ("u") is compared with the next letter of the header, and so forth until the end of the pattern is reached.

If the first character of the pattern does not match the first character of the line we're matching (as in our case), the next character of the header ("l"--ell in "Old-Subject") is compared with the first character of the pattern. We keep looking at our pattern's first character but slide along to the right in the string we're examining. The regular expression engine wants to match.

As we move along in our line, we finally get to the "S" in "Old-Subject" and we begin our comparison from there. This is where our match happens and why the pattern "Subject:" matches the string "Old-Subject:". What we need is a way to tell the procmail regular expression engine to force the match to occur only at the beginning of the line; the regular expression token that accomplishes this is called an anchor.
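The sliding comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea for a literal pattern, not procmail's actual implementation; the function name find_literal is hypothetical, and the comparison here is case-sensitive for simplicity.

```python
def find_literal(pattern: str, line: str) -> int:
    """Return the first offset where pattern occurs in line, or -1."""
    # Try every starting position in the line, left to right.
    for start in range(len(line) - len(pattern) + 1):
        # Compare the pattern character by character from this offset.
        if all(line[start + i] == ch for i, ch in enumerate(pattern)):
            return start
    return -1


offset = find_literal("Subject:", "Old-Subject: Dude, I think your site needs work!")
print(offset)  # 4: the match begins right after "Old-"
```

An anchor amounts to allowing only offset 0 as a starting position.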

Wildcards (here we count both the dot, which represents any character except newline, and character classes as "wildcards") accompanied by either an asterisk ("zero or more") or a plus sign ("one or more") will match as few characters as possible. This is different from most other regular expression engines.
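To get a feel for the difference between a longest and a shortest match, here is an analogy in Python's "re" module. This is an assumption-laden comparison, not a description of procmail's internals: Python quantifiers are greedy by default, and its non-greedy form ("*?") is the closest analogue to the shortest-match behavior described above. The sample line is invented.

```python
import re

line = "Subject: your site, your site"

# Greedy: ".*" grabs as much as it can while still letting "site" match.
greedy = re.search(r"Subject:.*site", line).group(0)

# Non-greedy: ".*?" stops at the first place where "site" can match.
shortest = re.search(r"Subject:.*?site", line).group(0)

print(greedy)    # Subject: your site, your site
print(shortest)  # Subject: your site
```

Either way the line matches; the two styles differ only in how much text the match covers.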

Anchors

Like an anchor for a boat, regular expression anchors prevent the pattern from "drifting". They force the pattern to either the beginning (left end) of the string or the end (right end) of the string.

Procmail has several anchors; in this tutorial we will cover two of them: the beginning-of-line anchor or caret ("^") and the end-of-line anchor or dollar-sign ("$").

The Caret

You might remember that the caret already does something special for us. It is our character class negation operator. When the caret is found as the first character in a character class (using square brackets: "[" and "]"), it inverts or negates the character class.

However, when the caret is used as the first character in a regular expression pattern, it becomes a beginning-of-line anchor. This means that the pattern following the anchor must match ("be anchored") at the beginning of the line.

Recall our pattern:

    Subject:.*your (web ?)?site

and how it matched "Old-Subject:". We don't want this, so we anchor the pattern to the beginning of the line with a caret:

    ^Subject:.*your (web ?)?site

Now our pattern reads (in English):

    The beginning of the line, immediately followed by "Subject:",
    followed by zero or more characters (anything but a newline),
    followed by "your ", optionally followed by "web" and an optional
    space, followed by "site".

That's precisely what we want. We make this minor adjustment to our recipe and sleep well knowing that we won't be catching any "Old-Subject:" lines again in our spam traps.
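The effect of the caret can be tried out with Python's "re" module as a stand-in for procmail (an assumption; the engines differ, but the anchor behaves the same way for this case). The re.MULTILINE flag makes "^" match at the start of every line, which approximates matching against a block of headers. The From address is a made-up placeholder.

```python
import re

# A small block of headers; the From address is a hypothetical placeholder.
headers = "\n".join([
    "From: friend@example.com",
    "Old-Subject: Dude, I think your site needs work!",
    "Subject: second try: check out yo' space",
])

unanchored = re.compile(r"Subject:.*your (web ?)?site", re.IGNORECASE)
anchored = re.compile(r"^Subject:.*your (web ?)?site",
                      re.IGNORECASE | re.MULTILINE)

# Without the caret the pattern drifts into the "Old-Subject:" line.
unanchored_hit = unanchored.search(headers) is not None
# With the caret only a line that starts with "Subject:" can match.
anchored_hit = anchored.search(headers) is not None

print(unanchored_hit, anchored_hit)  # True False
```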

The Dollar Sign

Unfortunately, our friend has a short memory. He has already forgotten that he shouldn't send any email messages with "your site" or "your web site" in the subject. He again sent a message with the following subject:

    Subject: Dude, I think your site needs work!

which of course still matched our new pattern. What can we do to allow these email messages through but still block unsavory messages? Well, we notice, for one thing, that all of the spam subjects end with the word "site"; that is, we never get any spam subjects that go beyond that:

    Subject: your web site
    Subject: Re: your web site
    Subject: Fw: Re: your website

So, we want to match phrases that end with the word "site". This is possible after we introduce our second anchor, the dollar sign ("$"). The dollar sign matches the "end-of-line"[2].

So for us to make our pattern tight, we turn this:

    ^Subject:.*your (web ?)?site

into this:

    ^Subject:.*your (web ?)?site$

which reads in English:

    Match the beginning of the line (the caret), immediately followed
    by "Subject:", followed by zero or more characters (anything but
    a newline), followed by "your ", optionally followed by "web" and
    an optional space, followed by "site", immediately followed by a
    newline character.

Now our pattern matches only the spams we've been getting and messages from our friend no longer match, since the character after "site" is not a newline. Compare this paragraph with the equivalent for the beginning-of-line anchor above.
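Here is a quick check of the fully anchored pattern against the subjects from this section, again using Python's "re" module as an approximate stand-in for procmail (an assumption, as before). re.MULTILINE lets "$" match at the end of each line.

```python
import re

pattern = re.compile(r"^Subject:.*your (web ?)?site$",
                     re.IGNORECASE | re.MULTILINE)

subjects = [
    "Subject: your web site",
    "Subject: Re: your web site",
    "Subject: Fw: Re: your website",
    "Subject: Dude, I think your site needs work!",
]

# True for the three spam subjects, False for our friend's message,
# because " needs work!" sits between "site" and the end of the line.
results = [pattern.search(s) is not None for s in subjects]

print(results)  # [True, True, True, False]
```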

Correct use of the caret and dollar sign will also make your patterns match more quickly, because the regular expression engine can often tell after a single comparison whether the pattern can match the current line or not.

EXAMPLES

As promised, we'll now cover a few real-world examples from my own procmail recipe file (slightly simplified) to illustrate dots, quantifiers, character classes, and anchors.

Examples will be given first as a procmail recipe, followed by an English translation for you to read along with. I have found English translations of regular expressions invaluable in learning how to "speak" regex.

Example 1

Hoping that current spam legislation will pass and all email advertisers will comply, I have created a recipe that will put any subject line with 'ADV' at the beginning of the subject in a special file that I will read later.

    :0:
    * ^Subject:[        ]*.?ADV.?:?[    ]+
    for_when_im_bored

The English translation:

    The beginning of the line, immediately followed by "Subject:",
    followed by zero or more spaces or tabs, optionally followed by
    any single character, followed by "ADV" (case-insensitive),
    optionally followed by any single character, optionally followed
    by a colon, followed by one or more spaces or tabs.

For a case-sensitive version, we add the 'D' flag:

    :0 D:
    * ^Subject:[        ]*.?ADV.?:?[    ]+
    for_when_im_bored

This will match only when 'ADV' is really 'ADV' (not, for example, 'Adv' or 'adv').
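The difference the 'D' flag makes can be imitated in Python, where case-insensitivity is opt-in rather than the default. This is a sketch under two assumptions: Python's "re" module stands in for procmail, and the bracketed whitespace in the recipe is a space and a tab (as the English translation says). The sample subjects and variable names are invented.

```python
import re

# Assumption: the bracketed whitespace in the recipe is a space and a tab.
ADV = r"^Subject:[ \t]*.?ADV.?:?[ \t]+"

default_rule = re.compile(ADV, re.IGNORECASE)  # imitates procmail's default
d_flag_rule = re.compile(ADV)                  # imitates the 'D' flag

subjects = [
    "Subject: ADV: cheap stuff",
    "Subject: [adv] cheap stuff",
    "Subject: Advice wanted",
]

default_hits = [default_rule.search(s) is not None for s in subjects]
d_flag_hits = [d_flag_rule.search(s) is not None for s in subjects]

print(default_hits)  # [True, True, False]
print(d_flag_hits)   # [True, False, False]
```

"Advice wanted" is rejected in both cases because the pattern requires whitespace shortly after "ADV".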

Example 2

Being naturally suspicious of mail not destined for me, I decide to shunt email messages destined to "undisclosed-recipients" somewhere else.

    :0:
    * ^To:[     ]*undisclosed[\.\-]recipients
    not_for_me

The English translation:

    The beginning of the line, immediately followed by "To:",
    followed by zero or more spaces or tabs, followed by
    "undisclosed", followed by a literal dot or a hyphen, followed
    by "recipients".
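A few sample header lines show what this condition accepts and rejects. As before, this is a sketch using Python's "re" module in place of procmail, with the bracketed whitespace assumed to be a space and a tab; the sample lines are invented.

```python
import re

rule = re.compile(r"^To:[ \t]*undisclosed[\.\-]recipients", re.IGNORECASE)

lines = [
    "To: undisclosed-recipients:;",   # hyphen: matches
    "To: Undisclosed.Recipients",     # dot, mixed case: matches
    "To: undisclosed recipients",     # space is not in the class: no match
    "Cc: undisclosed-recipients:;",   # wrong header: the caret rejects it
]

hits = [rule.search(line) is not None for line in lines]

print(hits)  # [True, True, False, False]
```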

Example 3

I don't ever confuse a web browser for an email client, so I don't often get around to reading HTML email messages--email is not for sending web pages[3]. I'd like to put email messages written exclusively in HTML or Word documents in a folder I can convert to plain text later.

    :0 B:
    * ^Content-type:[   ]*(text/html|application/ms-?word)
    html_email

The English translation reads:

    The body of the email message must have a line that begins with
    "Content-type:", followed by zero or more spaces or tabs,
    followed by "text/html" or "application/msword" (the hyphen
    after "ms" is optional).

SUMMARY

Anchors "tie down" our patterns to the beginning or the end of the line, making our patterns match (or fail to match) more quickly (since the engine doesn't have to walk all the way down the string to see if the next character matches) and more accurately. The caret anchor ties our pattern to the beginning of the line, while the dollar sign anchor ties our pattern to the end of the line.

NOTES

Note 1

Procmail's regular expression engine is unlike most others, including Perl's, egrep's, and those of many other popular programming languages, which implement a leftmost, longest match. Procmail uses a leftmost, shortest match, which makes it a much faster matcher than many other engines.

The only time procmail will match leftmost-longest is when a pattern appears to the right of the \/ (match) operator. The next tutorial covers this completely.

Note 2

The dollar sign has many uses in procmail. It indicates a variable, when attached to letters immediately after it. It turns on shell interpolation when it is all by itself at the start of a condition. These uses and others will be covered in other tutorials.

Note 3

Yet another of your author's strong opinions. There's a lot of debate about this, of course. The crux of my position comes from an old security adage that says "keep code and data separate." Nearly all of our security problems today, including the recent spread of Windows worms/viruses, stem from the fact that Microsoft forgot this simple principle when they designed their operating system, web browser, and email clients.

PREVIOUS

Simple Regular Expressions, Part II

NEXT

Thinking Regex

SEE ALSO

procmail(1), procmailrc(5), procmailex(5), regex(3)

AUTHOR

Scott Wiersdorf <scott@perlcode.org>

COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

REVISION

$Id: proctut4.pod,v 1.6 2003/10/18 02:52:03 deep Exp $