Regular Expressions - Procmail Mail Filtering Syntax
This document has been supplanted by the 'proctut' procmail tutorial series found at <http://www.perlcode.org/tutorials/procmail/proctut/>.
Procmail accepts egrep extended regular expressions. A "regular expression" could be thought of as a miniature programming language for matching textual patterns where certain characters in the regular expression have special meaning.
In this document we will use the term "regex" to mean a regular expression designed to match a particular pattern of ASCII characters.
For example, an asterisk (*) is part of regular expression syntax called a "quantifier"; this means that the asterisk defines how many times something may be matched in a pattern. Specifically, the asterisk means that the character immediately preceding it may be matched any number of times (including zero times).
Thus, if you want a regular expression to match all different kinds of screaming, use a regex like this:
aiee*
This pattern matches "aie" (the asterisk matches zero 'e's), "aiee" (the asterisk matches exactly one 'e'), and "aieeeeee" (the asterisk matches any number of 'e's).
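The behavior of the asterisk can be checked with a short script. This is only a sketch: it uses Python's re module as a stand-in for procmail's own matcher (an assumption, since the two engines are not identical), and the helper name is_scream is made up for illustration. It uses a whole-string match so that each word is tested on its own; procmail itself searches for the pattern anywhere in the search area.

```python
import re

# Python's re module stands in for procmail's matcher (an assumption).
SCREAM = re.compile(r"aiee*")


def is_scream(text):
    """Return True when the entire string is matched by aiee*."""
    return SCREAM.fullmatch(text) is not None


for word in ["aie", "aiee", "aieeeeee", "ai"]:
    # "ai" fails: the first 'e' is required, only the second is optional.
    print(word, is_scream(word))
```

Only the last 'e' is governed by the asterisk, which is why "ai" alone is not a scream.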
The remainder of this document introduces procmail's complete regular expression syntax. See also "EXAMPLES".
Here is procmail's regular expression list, from the procmailrc(5) manpage:
    ^          Start of a line.
    $          End of a line.
    .          Any character except a newline.
    a*         Any sequence of zero or more a's.
    a+         Any sequence of one or more a's.
    a?         Either zero or one a.
    [^-a-d]    Any character which is not either a dash, a, b, c, d
               or newline.
    de|abc     Either the sequence `de' or `abc'.
    (abc)*     Zero or more times the sequence `abc'.
    \.         Matches a single dot; use \ to quote any of the magic
               characters to get rid of their special meaning. See
               also $\ variable substitution.
These were only samples, of course, any more complex combination is valid as well.
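The standard operators listed above can be exercised one at a time. The sketch below again assumes Python's re module as an approximation of procmail's egrep-style syntax; the sample strings and the helper full are invented for illustration. Each row pairs a pattern with one string it should match in full and one it should not.

```python
import re

# (pattern, string matched in full, string not matched in full)
SAMPLES = [
    (r"a*", "aaa", "b"),
    (r"a+", "a", ""),
    (r"a?", "", "aa"),
    (r"[^-a-d]", "x", "-"),
    (r"de|abc", "de", "dabc"),
    (r"(abc)*", "abcabc", "abca"),
    (r"\.", ".", "x"),
]


def full(pattern, text):
    """Return True when pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in SAMPLES:
    # Each row prints the pattern, then True, then False.
    print(pattern, full(pattern, good), full(pattern, bad))
```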
The following token meanings are special procmail extensions:
    ^ or $     Match a newline (for multiline matches).
    ^^         Anchor the expression at the very start of the search
               area, or if encountered at the end of the expression,
               anchor it at the very end of the search area.
    \< or \>   Match the character before or after a word. They are
               merely a shorthand for `[^a-zA-Z0-9_]', but can also
               match newlines. Since they match actual characters,
               they are only suitable to delimit words, not to
               delimit inter-word space.
    \/         Splits the expression in two parts. Everything
               matching the right part will be assigned to the MATCH
               environment variable.
To summarize, all of the above are "magic" characters or "operators" in procmail's regular expression language which have special meaning. If you want your pattern to match one of the above operators literally (e.g., you want to match a dollar sign), you must "escape" it by preceding it with a backslash.
Note: this tutorial does not describe the point-and-click steps (which buttons to click and which menus to pull down) for entering these patterns in a mail filter interface, nor does it treat the subject of selecting "Body" versus "Subject" versus "From".
This section is far from complete, but should give you a good, general understanding of how to match patterns without matching things you don't want to match.
For all of these examples, we are taking the position that the email we are receiving is unwanted and that it should be put in the Trash folder. This may not be the case for you, of course, so you'll want to adapt these rules according to your own needs (i.e., to allow them or deny them).
Suppose we receive an exciting email daily from an email marketing company (yourfriend@marketing.com) whose mailing list we apparently subscribed to several years ago, but who now refuses to unsubscribe our email address.
yourfriend@marketing\.com
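The backslash before the dot matters. As a rough illustration (Python's re module is assumed as a stand-in for procmail, with re.IGNORECASE approximating procmail's case-insensitive matching; the header strings and the helper hits are hypothetical), compare the escaped pattern with an unescaped one:

```python
import re

ESCAPED = re.compile(r"yourfriend@marketing\.com", re.IGNORECASE)
UNESCAPED = re.compile(r"yourfriend@marketing.com", re.IGNORECASE)


def hits(regex, header):
    """Return True when the regex is found anywhere in the header."""
    return regex.search(header) is not None


print(hits(ESCAPED, "From: yourfriend@marketing.com"))    # True
print(hits(ESCAPED, "From: yourfriend@marketingXcom"))    # False
print(hits(UNESCAPED, "From: yourfriend@marketingXcom"))  # True: "." matched "X"
```

Without the backslash the dot is an operator and matches any character, so the unescaped pattern is slightly broader than intended.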
Suppose we receive an exciting email daily from an email marketing company who changes its "From" address each time it sends out a new email (someone4221@marketing.com, someone4222@marketing.com, etc.).
.+@marketing\.com
someone[0-9]+@marketing\.com
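.+@marketing\.com will match "anything@marketing.com", "joe@marketing.com", and even "humble_apologies@marketing.com". In this case we might be matching more than we really want, since receiving a humble apology from anyone is always welcome in this day and age. The second pattern is narrower because it only allows digits after "someone":
[0-9]+ matches one or more digits, so "someone4221" matches but plain "someone" does not.
[0-9]* matches zero or more digits, so plain "someone" would match as well.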
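The difference between the broad and narrow patterns, and between + and *, can be seen side by side. This is a sketch under the same assumption as before (Python's re module approximating procmail, case-insensitive); the third pattern, using [0-9]*, and the address "someone@marketing.com" are added here purely for comparison.

```python
import re

BROAD = re.compile(r".+@marketing\.com", re.IGNORECASE)
NARROW = re.compile(r"someone[0-9]+@marketing\.com", re.IGNORECASE)
# Hypothetical variant: zero or more digits instead of one or more.
LOOSE = re.compile(r"someone[0-9]*@marketing\.com", re.IGNORECASE)

ADDRESSES = [
    "someone4221@marketing.com",
    "someone@marketing.com",
    "humble_apologies@marketing.com",
]


def matched_by(address):
    """Return which of the three patterns find a match in the address."""
    names = [("broad", BROAD), ("narrow", NARROW), ("loose", LOOSE)]
    return [name for name, regex in names if regex.search(address)]


for address in ADDRESSES:
    print(address, matched_by(address))
```

Only the broad pattern catches the humble apology, and only the [0-9]+ form insists on at least one digit.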
Suppose we receive mail that has the string '[ADV]' and '$$$' in the subject line. We'd like this kind of mail to be filed in a special location.
\[ADV\]
\$\$\$
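Both patterns escape characters that would otherwise be operators: the square brackets would start a character class, and the dollar sign would mean end of line. A minimal sketch, assuming Python's re module in place of procmail and treating either marker as enough to flag a message (that choice, the helper is_advert, and the sample subjects are all assumptions):

```python
import re

ADV = re.compile(r"\[ADV\]", re.IGNORECASE)
DOLLARS = re.compile(r"\$\$\$")


def is_advert(subject):
    """Return True when the subject contains '[ADV]' or '$$$' literally."""
    return bool(ADV.search(subject) or DOLLARS.search(subject))


print(is_advert("[ADV] Cheap watches"))       # True
print(is_advert("Make $$$ fast"))             # True
print(is_advert("Adventures in $2 lunches"))  # False
```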
Our advertisers have gotten more clever and have disguised their subject lines like this: '[A D V]' or '[A*D*V]'. How can we file these kinds of messages in a special location?
\[A.?D.?V\]
\[A( |\*)?D( |\*)?V\]
[ADV]
[A*D*V]
[A D V]
[ARDOV]
[A DAV]
...
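Running both patterns against the subjects listed above shows how they differ. The sketch assumes Python's re module as an approximation of procmail's matcher; the names LOOSE and STRICT are labels chosen here, not procmail terms.

```python
import re

LOOSE = re.compile(r"\[A.?D.?V\]", re.IGNORECASE)
STRICT = re.compile(r"\[A( |\*)?D( |\*)?V\]", re.IGNORECASE)

SUBJECTS = ["[ADV]", "[A*D*V]", "[A D V]", "[ARDOV]", "[A DAV]"]


def matches(regex, subject):
    """Return True when the regex is found anywhere in the subject."""
    return regex.search(subject) is not None


for subject in SUBJECTS:
    # The loose pattern accepts every subject; the strict one rejects
    # "[ARDOV]" and "[A DAV]" because only a space or "*" may separate
    # the letters.
    print(subject, matches(LOOSE, subject), matches(STRICT, subject))
```

Because ".?" accepts any single character between the letters, the first pattern also catches strings that merely resemble the disguise; spelling out the allowed separators avoids that.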
We receive a lot of mail from people who are not our friends, but who call us "Dear friend" (sometimes with extra spaces between the two words). How dear can someone be if you don't know their name? We want to be careful, however, that we don't accidentally match our old college buddy Joe who regularly sends out email with "Dear friends," in it.
^dear( )+friend[,:]
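This pattern combines a line anchor, a repeated group, and a character class. As a sketch (Python's re module is assumed as a stand-in; re.MULTILINE makes ^ match at the start of every line and re.IGNORECASE approximates procmail's case-insensitive matching; the helper is_form_letter and the message bodies are hypothetical):

```python
import re

DEAR_FRIEND = re.compile(r"^dear( )+friend[,:]", re.IGNORECASE | re.MULTILINE)


def is_form_letter(body):
    """Return True when some line of the body starts with the greeting."""
    return DEAR_FRIEND.search(body) is not None


print(is_form_letter("Hello\nDear friend,\nBuy now"))    # True
print(is_form_letter("DEAR    FRIEND: act now"))         # True
print(is_form_letter("Dear friends,\nSee you Friday"))   # False
print(is_form_letter("My dear friend, hello"))           # False
```

Joe's "Dear friends," is spared because the pattern requires a comma or colon immediately after "friend", and the last example fails because the greeting does not begin the line.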
We want to put "one time emails" in a special location.
one[- ]?time (only )?(e-?)?mail
onetime mail
onetime email
onetime e-mail
onetime only mail
onetime only email
onetime only e-mail
one-time mail
one-time email
one-time e-mail
one time mail
one time email
one time e-mail
one-time only mail
one-time only email
one-time only e-mail
one time only mail
one time only email
one time only e-mail
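The list above can be generated rather than typed, which makes it easy to confirm that the single pattern covers every variant. A sketch, assuming Python's re module as an approximation of procmail (the variable names and the two non-matching phrases are invented here):

```python
import itertools
import re

ONE_TIME = re.compile(r"one[- ]?time (only )?(e-?)?mail", re.IGNORECASE)

# Build every combination: 3 separators x 2 "only" choices x 3 spellings.
VARIANTS = [
    f"one{sep}time {only}{mail}"
    for sep, only, mail in itertools.product(
        ["", "-", " "], ["", "only "], ["mail", "email", "e-mail"]
    )
]

print(len(VARIANTS))                                # 18
print(all(ONE_TIME.search(v) for v in VARIANTS))    # True
print(ONE_TIME.search("one time offer") is None)    # True
print(ONE_TIME.search("one  time mail") is None)    # True: two spaces
```

Each optional piece of the pattern multiplies the number of strings it accepts, which is how one line stands in for eighteen.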
Regular expressions allow you to match many strings with just a few patterns. Knowing procmail's special characters will help you to avoid matching patterns you didn't intend to.
This tutorial does not cover the following topics: character class negation, negative match assertion, match assignment, search area anchors, shell substitution, exitcodes, message size checks, variable comparisons, and a host of other important procmail topics.
If you found this tutorial useful, you're ready to dive into the real man pages. See procmailrc, procmailex, and procmailsc for more complex procmail examples.
Scott Wiersdorf, <scott@perlcode.org>