proctut2 - Simple Regular Expressions, Part I
Procmail regular expressions are used chiefly in condition lines and are integral to writing successful, succinct recipes. A knowledge of regular expressions carries over into other domains, including shell scripting, general programming, and other text processing utilities.
Familiarity with regular expressions will allow you to read and understand other people's regular expressions as well as craft your own to make your mail processing precise and efficient.
Procmail uses regular expressions to "find things" in email messages.
For example, here is a simple procmail recipe that uses a regular expression in the condition for finding the phrase "table tennis":
:0 HB:
* table tennis
/var/mail/pingpong
Any email message with the phrase "table tennis" appearing in either the header or the body will match this condition and the recipe's action line will trigger (delivering the mail to /var/mail/pingpong).
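For readers who think better in code, here is a rough Python sketch of what that condition does. It is not procmail: Python's re module stands in for procmail's matcher, re.IGNORECASE imitates procmail's default disregard for case, and the function name and sample message are invented for illustration.

```python
import re

def matches_table_tennis(header: str, body: str) -> bool:
    """Rough stand-in for the condition '* table tennis' under the HB flags."""
    # H and B together mean "look in the header and in the body".
    text = header + "\n\n" + body
    # procmail ignores case by default; re.IGNORECASE imitates that here.
    return re.search(r"table tennis", text, re.IGNORECASE) is not None

# Hypothetical sample message, invented for this example.
sample_header = "From: pat@example.com\nSubject: Saturday plans"
sample_body = "Anyone up for Table Tennis after lunch?"
print(matches_table_tennis(sample_header, sample_body))
```

The search is unanchored: the phrase may sit anywhere in the header or the body.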
"But wait," you say, "that's just a simple string match."
Yes; simple string matching is but a taste of what regular expressions can accomplish. For the remainder of this document we will freely interchange the phrase "regular expression" with the more economical "regex"[1] and "regexen" as the mildly comedic plural. We will also use the term "pattern" to mean a part or a whole regular expression.
You may have noticed funny characters in procmail recipes. For example, does this sort of recipe give you happy thoughts or bewilderment?
:0:
* ^Content-Type: multipart/[^;]+;[ ]*boundary="?\/[^"]+
mime
This condition contains a variety of regular expression tokens[2], including: ^, [, ], +, *, and ?. These characters have special meaning to procmail; they don't just match their literal ASCII equivalents[3].
In the following sections, we'll cover the basics of some of these regular expression tokens. By the end, you should be familiar with them, if not yet comfortable using them in your recipes.
To avoid distraction, we will not include the leading '*' that denotes a procmail recipe condition. This means instead of this:
* nice regex!
we will simply type:
nice regex!
And now, the funny characters.
The dot (".") or period (for you grammarian purists) is the most commonly used regular expression character. It represents any[4] character. Consider the following regular expression:
My ...'s name is Larry
This expression will match the following sentences:
My dog's name is Larry
My cat's name is Larry
My ear's name is Larry
My @ B's name is Larry
The dot matches any character[4]. Three dots match any three characters, as our example illustrates. The last example in our list contains an 'at' symbol and a space; these are also matched by the dot.
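The three-dot example can be tried outside of procmail with a few lines of Python. This is only a sketch: Python's re module is assumed to behave like procmail's matcher for a pattern this simple, and the helper name matches is our own.

```python
import re

PATTERN = r"My ...'s name is Larry"

def matches(line: str) -> bool:
    # re.search looks for the pattern anywhere in the line.
    return re.search(PATTERN, line, re.IGNORECASE) is not None

for line in ["My dog's name is Larry",
             "My @ B's name is Larry",
             "My pony's name is Larry"]:
    print(line, "->", matches(line))
```

The third line is there to show the limit: "pony" has four characters, and three dots stand for exactly three.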
A dot's power is extended greatly when combined with quantifiers.
Having the dot is useful because we can match a variety of things:
What the .... is going on here
will match a variety of four letter words:
What the heck is going on here
What the fish is going on here
What the toot is going on here
But now we want to match 3 letter words too:
What the ... is going on here
Wouldn't it be nice if we could write a regex that would match both 3 and 4 letter words without having to write two expressions? With quantifiers we can:
What the ....? is going on here
A quantifier means "how many" of whatever character the quantifier follows. In this example, the question mark serves as a quantifier for the dot immediately preceding it. More on quantifiers below.
- The Question Mark: ?
- This introduces our first quantifier: the question mark ("?"). As with all quantifiers, the question mark by itself does not match anything; it is a quantifier. It only has meaning when placed after another character[5].
- The question mark indicates that the preceding character may match zero or one times. In our example, the question mark is applied to the last dot in our expression: ....?
- If we were to read this regex in English, it would say: "Match any character, followed by any character, followed by any character, optionally followed by any character" or more briefly, "Match any three characters followed by zero or one additional characters."
- Now our expression matches:
What the sam is going on here
What the hill is going on here
- We can use multiple question marks to get exactly what we want:
What the ..?.?.?.?.?.?.? is going on here
- can be read: "Any character followed by zero to seven additional characters," that is to say one, two, three, four, five, six, seven, or eight letter words:
What the samhill is going on here
What the gol durn is going on here
What the c is going on here
- The question mark quantifier is sometimes forgotten because of its limited application compared to the other quantifiers, but this should not be. There are entire classes of regular expression problems that can only be solved with a question mark.
- For example, if we want to match a sentence where a plural may or may not occur, we need a question mark:
Please let the dogs? out
- This will match:
Please let the dogs out
- and:
Please let the dog out
- It will not match more than one 's':
Please let the dogsss out
- The question mark means "zero or one of the preceding character" (sometimes it is read "with an optional x"). The power of the question mark is extended if the preceding character is a special regular expression character (e.g., the dot).
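- The question mark examples can be checked with a short Python sketch. As before, Python's re module is only a stand-in for procmail's matcher, and the helper and variable names are made up for this illustration.

```python
import re

def matches(pattern: str, line: str) -> bool:
    # Unanchored, case-insensitive search, roughly as procmail does it.
    return re.search(pattern, line, re.IGNORECASE) is not None

three_or_four = r"What the ....? is going on here"
one_to_eight = r"What the ..?.?.?.?.?.?.? is going on here"
optional_s = r"Please let the dogs? out"

print(matches(three_or_four, "What the sam is going on here"))
print(matches(one_to_eight, "What the gol durn is going on here"))
print(matches(optional_s, "Please let the dogsss out"))
```

- The last call is the interesting one: the question mark allows at most one 's', so the line with three of them is turned away.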
- The Asterisk: *
- The asterisk ("*") is the next quantifier we will discuss. The asterisk means "zero or more" of the preceding character. This way, we can match "whatever". In fact, most regular expressions we will encounter will have .* in them, meaning "and anything else (except newlines)".
- Here is a pattern that will match Subject lines that have 'ADV' and 'mortgage' in them:
^Subject: ADV.* mortgage
- Let's read it in English: At the beginning of a line[6], the word "Subject:", followed by a space, followed by the three letters ADV, followed by anything else (including nothing else), followed by a space and the word "mortgage".
- This is useful because it will match subjects like this[7]:
Subject: ADV mortgage rates have dropped
Subject: ADV      mortgages have never been lower!
Subject: Advertisement: lower your mortgage payment
Subject: advanced mortgage seminars
- In the first example, .* matches nothing; the asterisk exercises its ability to match zero of the preceding character (a dot, in this case). In the second example, .* matches all the spaces after 'ADV' (except the space preceding "mortgages" because we put that literal space in our pattern). In the third example, .* matches "ertisement: lower your". In the fourth example, .* matches "anced" as part of "advanced".
- Another example
- We can see how powerful the dot and asterisk can be together; but the asterisk needn't apply only to dots. It can apply to literal characters as well. Consider the following regular expression:
Subject: .*!!!*
- This regex makes use of the familiar "dot-star" pattern we've seen before. We match the word "Subject:" followed by a space, followed by "anything" (there's our dot-star combo), followed by two exclamation points, followed by zero or more exclamation points.
- So, our pattern will match the following lines:
Subject: dude!!
Subject: Hey!!!!!
- In the first case, the dot-star (.*) in our pattern matches "dude" and the next two exclamation points in our pattern match the two exclamation points following "dude". The final !* of the regex matches nothing, because there are no more exclamation points that haven't been matched already.
- In the second case, the dot-star (.*) in our pattern matches "Hey!!!". The next two exclamation points in our pattern match the final two exclamation points that the dot-star didn't catch; the final !* in our pattern matches nothing again.
- Was that a surprise? Ah! Likely, we forgot that a dot matches anything, exclamation points included, and that when combined with a quantifier it matches as much as it can while still letting the rest of the expression be "true".
- Ok, so these weren't very good examples of how to apply the asterisk to a literal character (but we learned something useful anyway). We'll have a better example ready as we discuss the next quantifier.
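- To see which part of a pattern swallowed which part of a line, here is a small Python sketch. The parentheses are our addition (they only ask Python to report what the enclosed piece matched), split_match is a made-up helper name, and Python's re module is assumed to divide the text the way the walk-throughs above describe.

```python
import re

def split_match(pattern: str, line: str):
    """Return what each parenthesized piece matched, or None if no match."""
    m = re.search(pattern, line, re.IGNORECASE)
    return m.groups() if m else None

adv = r"^Subject: ADV(.*) mortgage"   # (.*) is the dot-star
bangs = r"Subject: (.*)!!(!*)"        # (.*) and the trailing (!*)

print(split_match(adv, "Subject: Advertisement: lower your mortgage payment"))
print(split_match(adv, "Subject: advanced mortgage seminars"))
print(split_match(bangs, "Subject: dude!!"))
print(split_match(bangs, "Subject: Hey!!!!!"))
```

- For the "Hey!!!!!" line, the first piece comes back as "Hey!!!" and the last piece comes back empty, which is the surprise discussed above.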
- The Plus Sign: +
- The plus ("+") is the last quantifier we'll discuss in this tutorial. The plus is closely related to the asterisk, but it means "one or more" rather than "zero or more". This means that there must be at least one instance of the preceding character. Consider this example:
Subject: .*!!!+
- This is similar to our previous example with asterisk, except that we've exchanged the final asterisk for a plus. If we were to reconsider our previous example lines:
Subject: dude!!
Subject: Hey!!!!!
- We will get different results than with asterisk. For starters, the first line will not match our pattern. The dot-star (.*) matches "dude" and then we have two literal exclamation points. The final !+ of our regex has nothing left to match, and since the plus must match at least one exclamation point, our regex fails for this line.
- The second example does match, but it does so in a different way than the asterisk example did. The leading dot-star matches "Hey!!", instead of "Hey!!!" (one fewer exclamation point). The next two literal exclamation points match, and the final bang-plus (!+) of our regex matches the trailing exclamation point.
- Why does the first dot-star in our regex only match "Hey!!" instead of "Hey!!!" like it did earlier with the asterisk? Quantifiers match as much as they can while still allowing subsequent parts of the pattern to match: the regex engine wants to find matches. In order to allow the bang-plus to match, the dot-star had to match less than it did before.
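- The give-and-take between the dot-star and the bang-plus can be watched in a Python sketch. The parentheses are added only so Python reports what each piece matched, split_match is a hypothetical helper, and Python's re module stands in for procmail's matcher.

```python
import re

PLUS = r"Subject: (.*)!!(!+)"   # (.*) is the dot-star, (!+) the bang-plus

def split_match(line: str):
    """Return (dot-star text, bang-plus text), or None if the line fails."""
    m = re.search(PLUS, line, re.IGNORECASE)
    return m.groups() if m else None

print(split_match("Subject: dude!!"))
print(split_match("Subject: Hey!!!!!"))
```

- The first line fails outright because nothing is left over for the plus; in the second, the dot-star settles for "Hey!!" so that the plus can claim the final exclamation point.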
As a final example, we'll cover the three quantifiers one last time.
- Question Mark
- Let's take our pattern and make it more explicit:
Subject: Hey!?
- This will now match only the following:
Subject: Hey
Subject: Hey!
- The question mark when applied to the exclamation point means "match zero or one exclamation points".
- Asterisk
- Now we'll use the asterisk:
Subject: Hey!*
- This will match:
Subject: Hey
Subject: Hey!
Subject: Hey!!
Subject: Hey!!!!!
etc.
- The asterisk applied to the exclamation point means "match zero or more exclamation points".
- Plus
- Finally, the plus:
Subject: Hey!+
- This matches:
Subject: Hey!
Subject: Hey!!
Subject: Hey!!!!!
etc.
- It will not match:
Subject: Hey
- because the plus sign indicates that there must be at least one of the preceding character (in this case, when applied to the exclamation point, it means "match one or more exclamation points").
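The three lists above can be reproduced side by side in a Python sketch. One assumption deserves a loud label: the sketch uses re.fullmatch, which insists that the pattern account for the whole line, because that is how the lists above are meant to be read. An unanchored search would also find "Subject: Hey!" lurking inside "Subject: Hey!!". The names below are invented for the example.

```python
import re

PATTERNS = {
    "?": r"Subject: Hey!?",
    "*": r"Subject: Hey!*",
    "+": r"Subject: Hey!+",
}
LINES = ["Subject: Hey", "Subject: Hey!", "Subject: Hey!!", "Subject: Hey!!!!!"]

def whole_line_matches(pattern: str, line: str) -> bool:
    # fullmatch: the pattern must cover the entire line, start to finish.
    return re.fullmatch(pattern, line, re.IGNORECASE) is not None

def table() -> dict:
    """Map each quantifier to the sample lines its pattern accepts."""
    return {q: [line for line in LINES if whole_line_matches(p, line)]
            for q, p in PATTERNS.items()}

for quantifier, accepted in table().items():
    print(quantifier, accepted)
```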
Procmail regular expressions are much like other common egrep-like regular expression languages. Regular expressions are simply string matches in which certain characters have special meanings; for example, the dot (".") matches any character[4]. Combined with quantifiers, the dot makes a potent regular expression character.
Quantifiers do not match anything by themselves; they only determine "how many" of the preceding character to match. The most common quantifiers are the question mark ("?"), which matches zero or one of the preceding character; the asterisk ("*"), which matches zero or more; and the plus ("+"), which matches one or more.
- Note 1
- "Regex" is pronounced "reg-ex" as in "REGular EXpression" with a hard 'G'. You may come across "regexp" in your personal studies. This contraction is notoriously difficult (and highly discouraged) to say aloud without emitting spittle and is likewise frowned upon in writing. The opinions in this article reflect exactly those views of the author and, for reasons of world peace, should be considered authoritative.
- Note 2
- For a complete list, see "Extended Regular Expressions" in procmailrc(5).
- Note 3
- There is a way to match the literal tokens by escaping them with a backslash. Here, for example, is how to match a literal period character: \.
- Note 4
- Er, almost any character. It doesn't match a newline character. See "Extended Regular Expressions" in procmailrc(5).
- Note 5
- Yet another half-truth. Quantifiers have meaning when placed after characters, groups, and character classes. We can call these things collectively "entities", "objects", "units", or my personal preference (betraying my Perl background), "thingies". In this tutorial, we simply say "character", but we really mean "thingies".
- Note 6
- The caret means "at the beginning of the line." We'll cover that more thoroughly in another tutorial.
- Note 7
- Procmail, it should be mentioned, matches case-insensitively, that is, without regard to upper or lowercase letters. For case-sensitive matching, you'll need to enable the D flag in your recipe.
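Notes 3, 4, and 7 can each be poked at with a line or two of Python. This is a sketch under assumptions: Python's re module is not procmail, its default is case-sensitive (closer to a recipe with the D flag), and re.IGNORECASE is used to imitate procmail's default. The variable names are ours.

```python
import re

# Note 3: a backslash turns the dot into a literal period.
escaped_dot_rejects_x = re.search(r"example\.com", "examplexcom") is None
bare_dot_accepts_x = re.search(r"example.com", "examplexcom") is not None

# Note 4: the dot does not match a newline.
dot_skips_newline = re.search(r"a.b", "a\nb") is None

# Note 7: ignoring case (procmail's default) versus respecting it.
ignoring_case = re.search(r"adv", "Subject: ADV", re.IGNORECASE) is not None
respecting_case = re.search(r"adv", "Subject: ADV") is not None

print(escaped_dot_rejects_x, bare_dot_accepts_x, dot_skips_newline)
print(ignoring_case, respecting_case)
```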
Anatomy of a Procmail Recipe, Part I
Simple Regular Expressions, Part II
procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1), Jeffrey Friedl's Mastering Regular Expressions (O'Reilly)
Jeffrey Friedl's Mastering Regular Expressions (O'Reilly) is a masterful and thorough work on regular expressions. Anyone serious about becoming a competent regular expression writer should read this book. It covers a lot of technical ground, but the examples are excellent and most of it applies directly to procmail regular expressions (one important exception being procmail's lack of a numerical range quantifier like Perl's {n,m} syntax).
Scott Wiersdorf <scott@perlcode.org>
Copyright (c) 2003 Scott Wiersdorf. All rights reserved.
$Id: proctut2.pod,v 1.12 2003/10/15 04:35:46 deep Exp $