=head1 NAME

proctut4 - Simple Regular Expressions, Part III

=head1 SYNOPSIS

We finish our "Simple Regular Expressions" series by learning about
anchors that procmail uses in matching. Anchors "tie down" a regular
expression so it doesn't float left or right on the line. We also
cover some real-world examples that will bring together everything
we've covered about regular expressions to date.

=head1 DESCRIPTION

We've learned about the L<dot operator|proctut2.pod/The_Dot>, 
L<quantifiers|proctut2.pod/Quantifiers>, and
L<groups|proctut3.pod/Groups>, which include
L<alternation|proctut3.pod/Alternation> and
L<character classes|proctut3.pod/Character_Classes> (as well as
L<negative character classes|proctut3.pod/Negative_Character_Classes>).

What more could we possibly need? Funny you should ask. Let's pretend
that as part of our spam fighting arsenal, we trash messages that
contain certain words in the subject line. The regular expression we
use to do that looks like this:

    Subject:.*your (web ?)?site

This is an incomplete recipe, of course, missing the
L<flags|proctut1.pod/Flags> and L<action|proctut1.pod/Action> lines.
As we've done in previous tutorials where we want to look just at the
regular expression, we also remove the asterisk at the beginning of
every L<condition|proctut1.pod/Conditions>.
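
For reference, a complete recipe built around this condition might
look like the following sketch. The flags line and the mailbox name
"spam" are assumptions made for illustration; they are not taken from
the author's actual recipe file.

    # Hypothetical complete recipe: the trailing colon on the flags
    # line requests a local lockfile; "spam" is a placeholder mailbox.
    :0:
    * Subject:.*your (web ?)?site
    spam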

So far, so good. Our condition is successfully matching spam messages
and trashing them. Our unwitting friend, however, bumblingly sends us
an email message with the following subject:

    Subject: Dude, I think your site needs work!

Will it match our anti-spam recipe? Of course it will. Our regular
expression reads in English:

    Match the phrase "Subject:" followed by anything ("Dude, I think
    "), followed by the word "your ", followed by the word "site"
    (remember that "web" and space are optional).

That matches our subject line all right. "But there's more stuff after
'site'," you say. Yes, there is, but our pattern doesn't care about
that: we reached the end of our regular expression successfully and
that's good enough for procmail.

So how can we allow our friend's email message through but still block
spam?

=head2 Two Solutions

Our naive solution is to have our friend resend his message, but
change his subject line; we call him up and ask if he could send that
again. He obliges promptly and sends the message again. We wait a few
moments, but the message never arrives. We check our spam log again,
and, dang it, there's his message again! What's going on?

His message had this subject line:

    Subject: second try: check out yo' space

Hmmm... that doesn't seem like it would match our regex. What else
could possibly trigger our recipe? As we examine his message further,
we notice the following line in his message headers:

    Old-Subject: Dude, I think your site needs work!

Ha! His email client preserved the old subject line and prepended
"Old-" to it. But why did our spam recipe catch it again?
"Old-Subject:" doesn't match "Subject:", does it? Funny you should ask.

=head2 Leftmost, Shortest Match

The procmail regular expression engineL<[1]|proctut4.pod/Note_1>
matches (roughly) like this: It compares the first character in the
pattern (our letter "S" in "Subject") with the first character in the
header line ("O" in "Old-Subject"). If they're the same (they're not
in this case), the next letter of the pattern ("u") is compared
with the next letter of the header and so forth until the end of the
pattern is reached.

If the first character of the pattern does not match the first
character of the line we're matching (as is the case here), the
I<next> character of the header ("l", the letter ell in "Old-Subject")
is compared with the I<first> character of the pattern. We continue
looking at our pattern's first character but slide along to the right
in the string we're examining. The regular expression engine I<wants>
to match.

As we move along in our line, we finally get to the "S" in
"Old-Subject" and we begin our comparison from there. This is where
our match happens and why the pattern "Subject:" matches the string
"Old-Subject:". What we need is a way to tell the procmail regular
expression engine to I<force> the match to occur only at the beginning
of the line; the regular expression token that accomplishes this is
called an I<anchor>.
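
One way to watch the unanchored match happen is to turn on verbose
logging in a test rcfile, as in the sketch below. The log file
location and the mailbox name "spam" are placeholders chosen for
illustration, not part of the original recipe.

    # Assumed test settings: log each condition as it is tried.
    VERBOSE=on
    LOGFILE=$HOME/procmail.log

    # The unanchored condition from above; "spam" is a placeholder.
    :0:
    * Subject:.*your (web ?)?site
    spam

With these settings, procmail should note in the log whether the
condition matched, which makes a surprise like the "Old-Subject:" hit
easier to track down.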

Wildcards (the dot character, which represents I<any> character except
newline, and character classes) accompanied by either an asterisk
("zero or more") or a plus sign ("one or more") will match as I<few>
characters as possible. This is different from most other regular
expression engines.

=head2 Anchors

Like an anchor for a boat, regular expression anchors prevent the
pattern from "drifting". They force the pattern to either the
beginning (left end) of the string or the end (right end) of the
string.

Procmail has several anchors; in this tutorial we will cover two of
them: the I<beginning-of-line anchor> or caret ("^") and the
I<end-of-line anchor> or dollar-sign ("$").

=head2 The Caret

You might remember that the caret already does something special for
us. It is our L<character class negation
operator|proctut3.pod/Negative_Character_Classes>. When the caret is
found as the first character in a L<character
class|proctut3.pod/Character_Classes> (using square braces: "[" and
"]"), it I<inverts> or I<negates> the character class.

However, when the caret is used as the first character in a regular
expression pattern, it becomes a beginning-of-line anchor. This means
that the pattern following the anchor I<must> match ("be anchored")
at the beginning of the line.

Recall our pattern:

    Subject:.*your (web ?)?site

and how it matched "Old-Subject:". We don't want this, so we I<anchor>
the pattern to the beginning of the line with a caret:

    ^Subject:.*your (web ?)?site

Now our pattern reads (in English):

    The beginning of the line, immediately followed by "Subject:",
    followed by zero or more characters (anything but a newline),
    followed by "your ", optionally followed by "web" and an optional
    space, followed by "site".
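
Put back into a complete recipe, the anchored condition might look
like the sketch below. As before, the flags line and the mailbox name
"spam" are assumptions for illustration only.

    # Hypothetical complete recipe; "spam" is a placeholder mailbox.
    # The caret keeps "Old-Subject:" lines from matching.
    :0:
    * ^Subject:.*your (web ?)?site
    spam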

That's precisely what we want. We make this minor adjustment to our
recipe and sleep well knowing that we won't be catching any
"Old-Subject:" lines again in our spam traps.

=head2 The Dollar Sign

Unfortunately, our friend has a short memory. He has already forgotten
that he shouldn't send any email messages with "your site" or "your
web site" in the subject. He again sent a message with this subject:

    Subject: Dude, I think your site needs work!

which, of course, matches even our new pattern. What can we do to allow
these email messages through but still block unsavory messages?  Well,
we notice, for one thing, that all of the spam subjects I<end> with
the word "site", that is, we never get any subjects that go beyond
that:

    Subject: your web site
    Subject: Re: your web site
    Subject: Fw: Re: your website

So, we want to match phrases that I<end> with the word "site". This
is possible after we introduce our second anchor, the I<dollar sign>
("$"). The dollar sign matches the
"end-of-line"L<[2]|proctut4.pod/Note_2>.

So for us to make our pattern tight, we turn this:

    ^Subject:.*your (web ?)?site

into this:

    ^Subject:.*your (web ?)?site$

which reads in English:

    Match the beginning of the line (the caret), immediately followed
    by "Subject:", followed by zero or more characters (anything but
    a newline), followed by "your ", optionally followed by "web" and an
    optional space, followed by "site", immediately followed by a
    newline character.

Now our pattern matches only the spams we've been getting and messages
from our friend no longer match, since the character after "site" is
not a newline. Compare this paragraph with the equivalent for the
L<beginning-of-line|proctut4.pod/The_Caret> anchor above.

Correct use of the caret and dollar sign can also make your patterns
match more quickly. With a leading caret, for example, the regular
expression engine can tell after one comparison whether the pattern
can match at the start of the current line, instead of retrying at
every position along it.
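
As a complete recipe, the fully anchored condition might look like the
first sketch below. The second sketch is an assumed variant that also
tolerates trailing spaces or tabs after "site", in case a spammer's
subject line carries invisible whitespace. The mailbox name "spam" is
a placeholder in both.

    # Hypothetical complete recipe; "spam" is a placeholder mailbox.
    :0:
    * ^Subject:.*your (web ?)?site$
    spam

    # Assumed variant: allow trailing spaces or tabs before the end
    # of the line (the brackets hold a space and a tab).
    :0:
    * ^Subject:.*your (web ?)?site[ 	]*$
    spam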

=head1 EXAMPLES

As promised, we'll now cover a few real-world examples from my own
procmail recipe file (slightly simplified) to illustrate dots,
quantifiers, character classes, and anchors.

Examples will be given first as a procmail recipe, followed by an
English translation for you to read along with. I have found English
translations of regular expressions invaluable in learning how to
"speak" regex.

=over 4

=item Example 1

Hoping that current spam legislation will pass and all email
advertisers will comply, I have created a recipe that will put any
subject line with 'ADV' at the beginning of the subject in a special
file that I will read later.

    :0:
    * ^Subject:[ 	]*.?ADV.?:?[ 	]+
    for_when_im_bored

The English translation:

    The beginning of the line, immediately followed by "Subject:",
    followed by zero or more spaces or tabs, optionally followed by
    any single character, followed by "ADV" (case-insensitive),
    optionally followed by any single character, optionally followed
    by a colon, followed by one or more spaces or tabs.

For a case-sensitive version, we add the 'D' flag:

    :0 D:
    * ^Subject:[ 	]*.?ADV.?:?[ 	]+
    for_when_im_bored

This will match only when 'ADV' is really 'ADV' (not, for example,
'Adv' or 'adv').

=item Example 2

Being naturally suspicious of mail not destined for me, I decide to
shunt email messages destined to "undisclosed-recipients" somewhere
else.

    :0:
    * ^To:[ 	]*undisclosed[\.\-]recipients
    not_for_me

The English translation:

    The beginning of the line, immediately followed by "To:", followed
    by zero or more spaces or tabs, followed by "undisclosed", followed
    by a literal dot or a hyphen, followed by "recipients".

=item Example 3

I don't ever confuse a web browser for an email client, so I don't
often get around to reading HTML email messages--email is not for
sending web pagesL<[3]|proctut4.pod/Note_3>. I'd like to put email
messages written exclusively in HTML or Word documents in a folder I
can convert to plain text later.

    :0 B:
    * ^Content-type:[ 	]*(text/html|application/ms-word)
    html_email

The English translation reads:

    The body of the email message must have a line that begins with
    "Content-type:", followed by zero or more spaces or tabs, followed
    by "text/html" or "application/ms-word".

=back
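
Example 3 looks only at the body. A message that consists solely of
HTML usually declares its type in the header instead, so a variant
that searches both header and body may catch more. The sketch below is
an assumption rather than the author's recipe: it combines the 'H' and
'B' flags, and it makes the hyphen optional because the registered
MIME type for Word documents is "application/msword".

    # Assumed variant of Example 3: search header and body together.
    :0 HB:
    * ^Content-type:[ 	]*(text/html|application/ms-?word)
    html_email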

=head1 SUMMARY

Anchors "tie down" our patterns to the beginning or the end of the
line, making our patterns more accurate and quicker to match (or not
match), since the engine doesn't have to walk all the way down the
string to see if the next character matches. The caret anchor ties our
pattern to the beginning of the line, while the dollar sign anchor
ties our pattern to the end of the line.

=head1 NOTES

=over 4

=item Note 1

Procmail is unlike most other regular expression engines, including
those of Perl, egrep, and many popular programming languages, which
implement a I<leftmost>, I<longest> match. Procmail uses a
I<leftmost>, I<shortest> match, which makes it a much faster matcher
than many other engines.

The only time procmail will match I<leftmost-longest> is when a
pattern appears to the right of the \/ (match) operator. The L<next
tutorial|proctut5.pod> covers this completely.

=item Note 2

The dollar sign has many uses in procmail. It indicates a variable,
when attached to letters immediately after it. It turns on shell
interpolation when it is all by itself at the start of a condition.
These uses and others will be covered in other tutorials.

=item Note 3

Yet another of your author's strong opinions. There's a lot of debate
about this, of course. The crux of my position comes from an old
security adage that says "keep code and data separate." Nearly all of
our security problems today, including the recent spread of Windows
worms/viruses, stem from the fact that Microsoft forgot this simple
principle when they designed their operating system, web browser, and
email clients.

=back

=head1 PREVIOUS

L<Simple Regular Expressions, Part II|proctut3.pod>

=head1 NEXT

I<Thinking Regex>

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), regex(3)

=head1 AUTHOR

Scott Wiersdorf E<lt>scott@perlcode.orgE<gt>

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut4.pod,v 1.6 2003/10/18 02:52:03 deep Exp $