Regular Expressions - Procmail Mail Filtering Syntax
This document has been supplanted by the 'proctut' procmail tutorial series found at <http://www.perlcode.org/tutorials/procmail/proctut/>.
Procmail accepts egrep extended regular expressions. A "regular expression" could be thought of as a miniature programming language for matching textual patterns where certain characters in the regular expression have special meaning.
In this document we will use the term "regex" to mean a regular expression designed to match a particular pattern of ASCII characters.
For example, an asterisk (*) is part of regular expression syntax called a "quantifier"; this means that the asterisk defines how many times something may be matched in a pattern. Specifically, the asterisk means that the character immediately preceding it may be matched any number of times (including zero times).
Thus, if you want a regular expression to match all different kinds of screaming, use a regex like this:
aiee*
This pattern matches "aie" (the asterisk matches zero 'e's), "aiee" (the asterisk matches exactly one 'e'), and "aieeeeee" (the asterisk matches any number of 'e's).
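If you would like to try the screaming pattern outside of procmail, here is a minimal sketch in Python. It assumes Python's `re` module is a fair stand-in for procmail's egrep-style engine for a pattern this simple; the helper name `is_scream` is ours, not part of procmail.

```python
import re

# Assumption: Python's re module stands in for procmail's regex engine.
SCREAM = re.compile(r"aiee*")

def is_scream(text):
    """Return True if the whole string is matched by aiee*."""
    # fullmatch tests the entire string; procmail itself searches for the
    # pattern anywhere in the text it is examining.
    return SCREAM.fullmatch(text) is not None

for candidate in ["aie", "aiee", "aieeeeee", "ai"]:
    print(candidate, is_scream(candidate))
```

The last candidate, "ai", fails because the asterisk applies only to the final 'e'; the first 'e' in the pattern is still required.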
The remainder of this document introduces procmail's complete regular expression syntax. See also "EXAMPLES".
Here is procmail's regular expression list, from the procmailrc(5) manpage:
^          Start of a line.
$          End of a line.
.          Any character except a newline.
a*         Any sequence of zero or more a's.
a+         Any sequence of one or more a's.
a?         Either zero or one a.
[^-a-d]    Any character which is not either a dash, a, b, c, d or newline.
de|abc     Either the sequence `de' or `abc'.
(abc)*     Zero or more times the sequence `abc'.
\.         Matches a single dot; use \ to quote any of the magic characters to get rid of their special meaning. See also $\ variable substitution.
These were only samples, of course, any more complex combination is valid as well.
The following token meanings are special procmail extensions:
^ or $     Match a newline (for multiline matches).
^^         Anchor the expression at the very start of the search area, or if encountered at the end of the expression, anchor it at the very end of the search area.
\< or \>   Match the character before or after a word. They are merely a shorthand for `[^a-zA-Z0-9_]', but can also match newlines. Since they match actual characters, they are only suitable to delimit words, not to delimit inter-word space.
\/         Splits the expression in two parts. Everything matching the right part will be assigned to the MATCH environment variable.
To summarize, all of the above are "magic" characters or "operators" in procmail's regular expression language which have special meaning. If you want your pattern to match one of the above operators literally (e.g., you want to match a dollar sign), you must "escape" it by preceding it with a backslash.
This section is far from complete, but should give you a good, general understanding of how to match patterns without matching things you don't want to match.
For all of these examples, we are taking the position that the email we are receiving is unwanted and that it should be put in the Trash folder. This may not be the case for you, of course, so you'll want to adapt these rules according to your own needs (i.e., to allow them or deny them).
Suppose we receive an exciting email daily from an email marketing company (yourfriend@marketing.com) whose mailing list we apparently subscribed to several years ago, but who now refuses to unsubscribe our email address.
yourfriend@marketing\.com
Why the backslash ('\') before the period?
The backslash (generally) tells procmail that the next character should be treated literally. This is called "escaping". That is, because the period has special meaning (the period matches any character except newlines), it should be "escaped" with a backslash so that the period matches only a literal period. Period.
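To see why the escape matters, here is a small sketch comparing the pattern with and without the backslash. It assumes Python's `re` module as a stand-in for procmail's engine, uses `re.IGNORECASE` to mimic procmail's default case-insensitive matching, and the lookalike address is made up for illustration.

```python
import re

# Assumption: Python's re stands in for procmail; IGNORECASE mimics
# procmail's default case-insensitive matching.
UNESCAPED = re.compile(r"yourfriend@marketing.com", re.IGNORECASE)
ESCAPED = re.compile(r"yourfriend@marketing\.com", re.IGNORECASE)

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

real_address = "yourfriend@marketing.com"
lookalike = "yourfriend@marketingXcom.example"  # hypothetical address

for address in (real_address, lookalike):
    print(address, matches(UNESCAPED, address), matches(ESCAPED, address))
```

Both patterns find the real address, but only the unescaped one is fooled by the lookalike, because its bare period happily matches the 'X'.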
Suppose we receive an exciting email daily from an email marketing company that changes its "From" address each time it sends out a new email (someone4221@marketing.com, someone4222@marketing.com, etc.).
.+@marketing\.com
or
someone[0-9]+@marketing\.com
As with all good problems, there are a variety of alternative solutions, each solution tackling a different aspect of the problem.
Solution 1
The first solution is what we call a "catch-all" solution. It matches "From" addresses from every conceivable user at that domain. The '.+' before the at symbol (@) matches one or more characters.
The period matches any character, as stated before, and the plus (+) is another quantifier like the asterisk. Unlike the asterisk, which matches zero or more occurrences of the preceding character, the plus requires at least one occurrence. The difference is subtle but may have important ramifications in certain circumstances.
So, .+@marketing\.com will match "anything@marketing.com", "joe@marketing.com", and even "humble_apologies@marketing.com". In this case we might be matching more than we really want, since receiving a humble apology from anyone is always welcome in this day and age.
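The catch-all behavior can be checked with a short sketch. As before, Python's `re` module is assumed as a stand-in for procmail's engine, the pattern is applied to a bare address rather than a full header line, and the function name `from_marketing` is our own invention.

```python
import re

# Assumption: Python's re stands in for procmail's engine; the pattern is
# applied to a bare address string, not to a complete "From:" header.
CATCH_ALL = re.compile(r".+@marketing\.com", re.IGNORECASE)

def from_marketing(address):
    """Return True if the catch-all pattern is found in the address."""
    return CATCH_ALL.search(address) is not None

samples = [
    "anything@marketing.com",
    "joe@marketing.com",
    "humble_apologies@marketing.com",
    "joe@example.org",   # different domain
    "@marketing.com",    # nothing before the at symbol
]
for address in samples:
    print(address, from_marketing(address))
```

The last sample shows the plus at work: with nothing in front of the at symbol, '.+' has no character to consume and the match fails.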
Solution 2
The second solution is a more focused approach. It takes advantage of the fact that we have noticed the pattern in the email addresses we receive email from.
This pattern matches the word 'someone', followed by one or more digits, followed by '@marketing.com'. The square brackets ([]) in this example create what's called a "character class", which matches any single character listed inside the brackets. Any quantifier after the class (in this case, a '+') applies to the entire character class (not just the preceding character). So the pattern:
[0-9]+
means "match one or more digits". If we ever received an email from "someone@marketing.com", we'd want to change the plus to an asterisk:
[0-9]*
which means "match zero or more digits". It does not mean "match digits followed by zero or more close square brackets".
We also introduce the range operator here. The hyphen between two characters creates a range of characters (in this case, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9) succinctly. It also works for the alphabet (e.g., [a-zA-Z] matches all ASCII letters).
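Here is a sketch of the difference between the plus and asterisk versions of the focused pattern. Python's `re` module is assumed as a stand-in for procmail's engine, and the address containing a stray bracket is a made-up test case.

```python
import re

# Assumption: Python's re stands in for procmail's engine.
WITH_PLUS = re.compile(r"someone[0-9]+@marketing\.com", re.IGNORECASE)
WITH_STAR = re.compile(r"someone[0-9]*@marketing\.com", re.IGNORECASE)

def found(pattern, address):
    """Return True if the pattern is found anywhere in the address."""
    return pattern.search(address) is not None

numbered = "someone4221@marketing.com"
plain = "someone@marketing.com"
bracketed = "someone42]@marketing.com"  # hypothetical, to test the quantifier

for address in (numbered, plain, bracketed):
    print(address, found(WITH_PLUS, address), found(WITH_STAR, address))
```

Only the asterisk version accepts the address with no digits, and neither version accepts the stray bracket, since the quantifier applies to the digit class as a whole.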
Suppose we receive mail that has the string '[ADV]' and '$$$' in the subject line. We'd like this kind of mail to be filed in a special location.
We have to pull out our handy backslash to escape the special characters:
\[ADV\]
and
\$\$\$
These regexes (or "regexen" if you're feeling sassy) match literally '[ADV]' (without the quotes, of course) and '$$$'. The backslashes take away the magic meaning of the square brackets as character class boundaries and the dollar signs as newline matching operators, and make them instead match literal square brackets and dollar signs.
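A quick sketch shows what goes wrong if the brackets are left unescaped. Python's `re` module is assumed as a stand-in for procmail's engine, `re.IGNORECASE` mimics procmail's default case-insensitive matching, and the subject lines are invented.

```python
import re

# Assumption: Python's re stands in for procmail's engine.
ADV_ESCAPED = re.compile(r"\[ADV\]", re.IGNORECASE)
ADV_UNESCAPED = re.compile(r"[ADV]", re.IGNORECASE)  # a character class!
DOLLARS = re.compile(r"\$\$\$")

def found(pattern, subject):
    """Return True if the pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None

spam_subject = "[ADV] Make $$$ fast"       # hypothetical subject line
innocent_subject = "Great news from home"  # hypothetical subject line

print(found(ADV_ESCAPED, spam_subject), found(DOLLARS, spam_subject))
print(found(ADV_ESCAPED, innocent_subject), found(ADV_UNESCAPED, innocent_subject))
```

Without the backslashes, '[ADV]' is a character class that matches any single 'a', 'd', or 'v', so it fires on the innocent subject as well.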
Our advertisers have gotten more clever and have disguised their subject lines like this: '[A D V]' or '[A*D*V]'. How can we file these kinds of messages in a special location?
\[A.?D.?V\]
or
\[A( |\*)?D( |\*)?V\]
This problem has other solutions besides these two, but these two illustrate some important points. The first example ('\[A.?D.?V\]') introduces a new quantifier: the question mark ('?'). The question mark means "zero or one" of the preceding character will match. In this example, the preceding character is a period, so we match zero or one of any character (recall that a period matches any character). So we match all kinds of strings with this pattern:
[ADV]
[A*D*V]
[A D V]
[ARDOV]
[A DAV]
...
The period matches more things than we might want to match. The second solution ('\[A( |\*)?D( |\*)?V\]') is a bit longer but is a more focused pattern.
This example introduces parentheses, which form the concept of "grouping" in most regular expression languages. A "group" is a sequence of characters together in order. In this simple example, we only have one character in a sequence: a space (' ') or a literal asterisk ('\*').
The pipe operator ('|') means "or" in a group; that is, it means that the group consists of either a space or an asterisk. The question mark operator after the group means, as it did in the first solution, that the group may appear zero or one times.
All quantifiers (*, +, ?, etc.) apply to the preceding "element". Up to this point, we've generally been using "character" as the preceding element, but the element could also be a character class (as we saw with '[0-9]+' in the example labeled "Matching Messages From an Entire Domain") or a group, as we see in this example.
This second solution could be read like this (following the pattern as you read this may be helpful):
Match a literal open square bracket, followed by an A, optionally followed by either a space or an asterisk, followed by a D, optionally followed by either a space or an asterisk, followed by a V, followed by a literal close square bracket.
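The two solutions can be compared side by side with a sketch like the following. Python's `re` module is assumed as a stand-in for procmail's engine, and the names `LOOSE` and `STRICT` are ours.

```python
import re

# Assumption: Python's re stands in for procmail's engine.
LOOSE = re.compile(r"\[A.?D.?V\]", re.IGNORECASE)
STRICT = re.compile(r"\[A( |\*)?D( |\*)?V\]", re.IGNORECASE)

def found(pattern, subject):
    """Return True if the pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None

disguised = ["[ADV]", "[A*D*V]", "[A D V]"]
unintended = ["[ARDOV]", "[A DAV]"]

for subject in disguised + unintended:
    print(subject, found(LOOSE, subject), found(STRICT, subject))
```

Both patterns catch the three disguised tags, but only the loose one also matches the unintended strings.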
We receive a lot of mail from people who are not our friends, but who call us "Dear friend," or "Dear friend:" (sometimes with extra spaces between the two words). How dear can someone be if you don't know their name? We want to be careful, however, that we don't accidentally match our old college buddy Joe who regularly sends out email with "Dear friends," in it.
^dear( )+friend[,:]
This solution introduces yet another kind of regular expression operator: the positional anchor. This sounds like a scary term, but it's not too bad. All it really means is that the pattern you're matching must match at a certain point on the line, namely at the beginning or at the end.
In this case, the caret (^) appearing at the beginning of the pattern means "match the beginning of the line". It may seem strange that the '^' doesn't actually match a real character, but once you've used it a few times, you'll wonder how you ever got along without it. Besides the precision that the caret operator adds to your regular expressions, it makes matching efficient. If the first character of the string does not match the first character after the caret, the regular expression engine immediately knows that this pattern will never match this string. Without the caret, the regular expression engine must try each character in the string in turn from left to right to check for a match.
The full regex reads:
Match the beginning of the line, followed by the word "dear", followed by one or more spaces (I could have used a character class here, too), followed by the word "friend" followed by either a comma or a colon.
This pattern will not match "dear friends," because the word "friend" in this case is not followed by a comma or a colon like our pattern requires, but rather it is followed by an 's'.
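The anchored pattern can be tried out with the sketch below. Python's `re` module is assumed as a stand-in for procmail's engine: `re.MULTILINE` makes the caret match at the start of every line, and `re.IGNORECASE` mimics procmail's default case-insensitive matching. The message bodies are invented.

```python
import re

# Assumption: Python's re stands in for procmail's engine.
GREETING = re.compile(r"^dear( )+friend[,:]", re.IGNORECASE | re.MULTILINE)

def is_impersonal(body):
    """Return True if some line of the body starts with the greeting."""
    return GREETING.search(body) is not None

stranger = "Hello!\nDear   friend: I have an offer for you."
joe = "Dear friends,\nthe reunion is next month."
mid_line = "You are a dear friend, thank you."

for body in (stranger, joe, mid_line):
    print(is_impersonal(body))
```

Joe's message is spared because of the 's' after "friend", and the third message is spared because "dear" does not start a line.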
We want to put "one time emails" in a special location.
one[- ]?time (only )?(e-?)?mail
This matches (say it as you follow along!):
The word "one" OPTIONALLY followed by a hyphen OR a space, followed by the word "time", followed by a space, OPTIONALLY followed by the word "only " (with a space after it), OPTIONALLY followed by "e" OR "e-", followed by the word "mail".
This tidy little pattern will match all of these strings:
onetime mail
onetime email
onetime e-mail
onetime only mail
onetime only email
onetime only e-mail
one-time mail
one-time email
one-time e-mail
one time mail
one time email
one time e-mail
one-time only mail
one-time only email
one-time only e-mail
one time only mail
one time only email
one time only e-mail
Can you feel the power of regular expressions now!?
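If you'd rather not take the list on faith, the following sketch builds every combination of the optional parts and checks each one against the pattern. Python's `re` module is assumed as a stand-in for procmail's engine, and the helper `expand_variants` is our own.

```python
import re
from itertools import product

# Assumption: Python's re stands in for procmail's engine.
ONE_TIME = re.compile(r"one[- ]?time (only )?(e-?)?mail", re.IGNORECASE)

def expand_variants():
    """Build every string obtained by choosing each optional part."""
    separators = ["", "-", " "]      # one[- ]?time
    only_parts = ["", "only "]       # (only )?
    mail_prefixes = ["", "e", "e-"]  # (e-?)?
    return [
        f"one{sep}time {only}{prefix}mail"
        for sep, only, prefix in product(separators, only_parts, mail_prefixes)
    ]

variants = expand_variants()
all_match = all(ONE_TIME.fullmatch(v) is not None for v in variants)
print(len(variants), all_match)
```

Three choices for the separator, two for "only ", and three for the "e" prefix give the eighteen strings listed above.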
Regular expressions allow you to match many strings with just a few patterns. Knowing procmail's special characters will help you to avoid matching patterns you didn't intend to.
This tutorial does not cover the following topics: character class negation, negative match assertion, match assignment, search area anchors, shell substitution, exitcodes, message size checks, variable comparisons, and a host of other important procmail topics.
If you found this tutorial useful, you're ready to dive into the real man pages. See procmailrc, procmailex, and procmailsc for more complex procmail examples.
Scott Wiersdorf, <scott@perlcode.org>