=head1 NAME

proctut3 - Simple Regular Expressions, Part II

=head1 SYNOPSIS

Regular expressions are integral to effective procmail recipes. In
L<Simple Regular Expressions, Part I|proctut2.pod>, we covered the
essential and basic parts of nearly all regular expressions. We
continue to learn more of the core regular expression syntax in this
second part of the regular expression tutorials.

=head1 DESCRIPTION

We recall from our L<previous discussion|proctut2.pod> that procmail
uses regular expressions to "find things" in message headers or
message bodies. To save us a lot of work, regular expressions let us
use special charactersL<[1]|proctut3.pod/Note_1> that each stand in
for many other characters.

For example, the dot character (".") matches I<any> character (except
newlines), and allows us to write patterns like this:

    My dog has .....

which will match any of the following:

    My dog has fleas
    My dog has foots
    My dog has phhht

With I<quantifiers>, we can extend the power of a pattern and make it
more concise. Consider this pattern:

    My dog has .+

This saves a few characters, is easier to read, and matches many more
phrases than our previous pattern. Patterns (the combination of
regular characters, regular expression tokens, and quantifiers) are
the expressive power of regular expressions.

=head2 A New Problem

It certainly is nice to be able to replace a single character or
multiple characters with a single dot and a quantifier, but what if
we don't want to be so cavalier in our matching? That is to say, what
if we only want to match a certain I<set> of things?

For example, we want to match the following two phrases and I<only>
the following two phrases:

    My dog has fleas
    My dog has feet

Here are our current options:

=over 4

=item *

Write one pattern:

    My dog has f.+

This suffers from inaccuracy. It will match much more than our two
phrases:

    My dog has feet
    My dog has friends
    My dog has foul halitosis

=item *

Write two patterns:

    My dog has fleas
    My dog has feet

=back

This doesn't agree with our sense of decency: regular expressions
I<ought> to be able to help us make this more efficient.

=head1 Things Lumped Together

The answer, of course, is we need a way to lump things together. In
the regular expression world, we call these "groups" or "sets". The
following sections discuss how we can lump things together in
procmail regular expressions.

=head2 Groups

We spoke a half-truthL<[2]|proctut3.pod/Note_2>
L<earlier|proctut2.pod/Quantifiers> when we mentioned that quantifiers
apply to the immediately preceding character. While that statement
is true, quantifiers may also apply to I<groups> of characters as a
whole.

Consider the problem of matching a sentence where an entire phrase is
optional. This sloganL<[3]|proctut3.pod/Note_3>:

    Help stamp out and abolish redundancy

really ought to be:

    Help abolish redundancy

We'd like to find either phrase without writing two patterns.
Possible?  With I<groups> it is:

    Help (stamp out and )?abolish redundancy

We now introduce parentheses; they "lump things together", and once
something is lumped together, we can apply a I<quantifier> to the
entire group.

By tweaking the pattern just a little:

    Help (stamp out and )*abolish redundancy

We can now match highly redundant variations on a theme:

    Help abolish redundancy
    Help stamp out and abolish redundancy
    Help stamp out and stamp out and abolish redundancy
    Help stamp out and stamp out and stamp out and abolish redundancy

We can apply I<any> quantifier to a group and it will work just as if
the entire group were a single character (as we described in L<the
previous tutorial|proctut2.pod/The_Dot>).
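
As a rough sketch of how a group might look inside an actual recipe
(the folder name C<redundancy> is made up for illustration, and we
assume the phrase shows up in the subject line):

    :0
    * ^Subject:.*Help (stamp out and )?abolish redundancy
    redundancy

The caret anchors the match to the start of a header line, and the
dot-star skips anything between the colon and our phrase. Because
procmail conditions ignore case unless the C<D> flag is given, this
sketch should also catch "HELP ABOLISH REDUNDANCY".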

=head2 Alternation

Armed with groups, we can almost solve our original problem of
matching the following two phrases:

    My dog has fleas
    My dog has feet

We might try:

    My dog has (fleas)?(feet)?

which would get us closer. This certainly will match the two above
sentences, but, unfortunately, it also matches these two sentences:

    My dog has
    My dog has fleasfeet

What this regular expression needs is a good five-cent I<alternation>.
Alternation is like having two or more groups lumped together and
letting the regular expression engine pick one among the groups. Here
is our answer:

    My dog has (fleas|feet)

The pipe character ("|") separates two or more things in a group. We
can now easily extend our pattern to match more things without
matching too much more:

    My dog has (fleas|feet|worms)

Poor doggy! Ah, well, despite our dog's troubles, we can match all of
them in a single, tidy regular expression. Can we apply quantifiers to
an alternation? You betcha: it behaves exactly like a group
(alternations are actually just a special kind of group). Consider
this pattern:

    Subject:.*(saw|about)? your (web ?)?site

Look familiar? This will match a formerly common spam subject line.
Let's study it carefullyL<[4]|proctut3.pod/Note_4>:

    The phrase "Subject:" followed by anything (dot-star matches
    anything) OPTIONALLY followed by the word "saw" OR "about"
    followed by " your " (with spaces) OPTIONALLY followed by the word
    "web" with an optional space, followed by the word "site".

So, if we let the dot-star absorb nothing more than the space after
the colon, this will match exactly these phrases and I<only> these
phrases:

    Subject: your site
    Subject: your website
    Subject: your web site
    Subject: saw your site
    Subject: saw your website
    Subject: saw your web site
    Subject: about your site
    Subject: about your website
    Subject: about your web site
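
Here is roughly how that pattern could sit in a recipe. Treat it as a
sketch: the folder name C<possible-spam> is hypothetical, and the
leading caret (start of a header line) is our addition:

    :0
    * ^Subject:.*(saw|about)? your (web ?)?site
    possible-spam

Mail whose subject matches is filed in the folder; everything else
falls through to later recipes.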

How's that for a nice solution! With groups and alternation, we can
do so much more than we could with just quantifiers and characters.
To round off our lesson, we'll talk about a special way of grouping
characters called I<character classes>.

=head2 Character Classes

"Character class" sounds technical, but it's a simple concept to
understand. We already understand groups. Say we have another problem
with spam. Subject lines like this keep getting through:

    Subject: 0nline casino

Notice the leading "0" in "0nline" is a zero, not the letter "o"?
Tricksy, nasty spammers! So we counter with our groups:

    Subject: (online|0nline) casino

But now the spammer sends this:

    Subject: 0nline casin0

A trailing zero! Alright, how about this pattern:

    Subject: (online|0nline) (casino|casin0)

That will work. For now:

    Subject: 0n1ine casin0

This time the "l" in "online" is the digit one ("1"). Whew! We could
keep making really long alternations. Or, we could make smaller groups:

    Subject: (o|0)n(1|l)ine casin(o|0)

That's not too bad, but it's ugly. And it's inefficient for procmail
to do it this way. There's a better way: character classes. A
character class works like this:

    Subject: [o0]n[1l]ine casin[o0]

A character class is an opening square bracket, followed by all the
characters we are willing to accept at that position, followed by a
closing square bracket. The whole class matches exactly one character
from the list. Our pattern above uses two different character classes:

    [o0] (used twice)
    [1l]

But character classes are not limited to listing all possible
characters in the class. We also have the I<range> operator that helps
us define a range of characters to include. You'll often see character
classes like this:

    [a-z]

which means match any character "a" through "z" (inclusive). You'll
see:

    [0-9]

which of course matches any single digit. Quantifiers also work with
character classes:

    [0-9]+

will match any series of digits (not just a single repeating
digit, like "8+"; the character class means "any of these" and when a
plus quantifier is applied it means "one or more of 'any of these'").

Another common thing you'll see in many recipes:

    [ 	]

This is a character class containing a space character and a tab
character, so it matches either one. It's a little hard to read,
unfortunately, but it's commonly used. It's useful in a lot of places,
especially email headers:

    Subject:[ 	]*some subject
    From:[ 	]*.*joe@schmoe.org

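These match this way:

    The phrase "Subject:" followed by zero or more spaces or tabs
    (remember, they can be mixed), followed by the phrase "some
    subject".

    The phrase "From:" followed by zero or more spaces or tabs,
    followed by anything (dot-star), followed by the phrase
    "joe@schmoe.org".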
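
As a sketch, the first pattern might be used in a recipe like this
(the folder name C<some-subject> is made up). The brackets hold one
literal space and one literal tab, so beware of editors that silently
convert tabs to spaces:

    :0
    * ^Subject:[ 	]*some subject
    some-subject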

=head2 Negative Character Classes

What about matching any character I<not> in a set? For example, we
want to match subject lines that do not begin with numbers:

    Subject:[ 	]*[a-z]+

Well, that's a good start, but letters and numbers aren't the only
things on the keyboard. There are lots of punctuation marks, and there
are the extended ASCII characters (umlaut, grave and acute accented
characters, etc.). Listing all of these would be a big pain, and
error-prone.

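The solution is to I<negate> your character class by putting a caret
("^") as the first character in your character class. Because a space
or tab is also "not a digit", we list those in the negated class too;
otherwise the whitespace after the colon could satisfy it:

    Subject:[ 	]*[^ 	0-9]+

This means literally:

    The phrase "Subject:" followed by zero or more spaces or tabs,
    followed by one or more characters that are neither spaces, tabs,
    nor digits.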
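
Dropped into a recipe, a negated class might be used like this. This
is only a sketch: the folder name C<not-numbered> is invented for the
example, and the class excludes space and tab as well as digits so
that leading whitespace cannot satisfy it:

    :0
    * ^Subject:[ 	]*[^ 	0-9]
    not-numbered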

Negated character classes complete our understanding of "things
lumped together".

=head1 SUMMARY

Regular expressions are made more powerful and concise with groups,
alternation, and character classes. When combined with quantifiers,
these allow us to match entire words or phrases (or simply sets of
characters) in a clean, readable manner.

=head1 NOTES

=over 4

=item Note 1

When these special characters are joined together we call them
"patterns".

=item Note 2

A little inaccuracy sometimes saves tons of explanation -- H. H.
Munro

=item Note 3

This first quote is apparently anonymous and is taken from the FreeBSD
fortune file. My most recent favorite is a quote I stumbled across
from Ralph Waldo Emerson: "I hate quotations."

=item Note 4

As the author, I can assert that you really ought to study these
"English" regular expressions, but it's up to you, of course. I'm not
making you read this tutorial (but it's good for you!).

=back

=head1 PREVIOUS

L<Simple Regular Expressions, Part I|proctut2.pod>

=head1 NEXT

L<Simple Regular Expressions, Part III|proctut4.pod>

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), regex(3)

=head1 AUTHOR

Scott Wiersdorf <scott@perlcode.org>

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut3.pod,v 1.6 2003/10/23 19:24:23 deep Exp $