=head1 NAME

proctut4 - Simple Regular Expressions, Part III

=head1 SYNOPSIS

We finish our "Simple Regular Expressions" series by learning about the anchors that procmail uses in matching. Anchors "tie down" a regular expression so it doesn't float left or right on the line. We also cover some real-world examples that bring together everything we've covered about regular expressions to date.

=head1 DESCRIPTION

We've learned about dots, quantifiers, and character classes in the earlier parts of this series. What more could we possibly need?

Funny you should ask. Let's pretend that as part of our spam fighting arsenal, we trash messages that contain certain words in the subject line. The regular expression we use to do that looks like this:

    Subject:.*your (web ?)?site

This is an incomplete recipe, of course, missing its opening ":0" line and its action line. As we've done in previous tutorials where we want to look just at the regular expression, we also remove the asterisk at the beginning of every condition.

So far, so good. Our condition is successfully matching spam messages and trashing them. Our unwitting friend, however, bumblingly sends us an email message with the following subject:

    Subject: Dude, I think your site needs work!

Will it match our anti-spam recipe? Of course it will. Our regular expression reads in English:

Match the phrase "Subject:" followed by anything ("Dude, I think "), followed by the word "your ", followed by the word "site" (remember that "web" and the space are optional).

That matches our subject line all right. "But there's more stuff after 'site'," you say. Yes, there is, but our pattern doesn't care about that: we reached the end of our regular expression successfully and that's good enough for procmail.

So how can we allow our friend's email message through but still block spam?

=head2 Two Solutions

Our naive solution is to have our friend resend his message with a different subject line; we call him up and ask if he could send it again. He obliges promptly and sends the message again.
We wait a few moments, but the message never arrives. We check our spam log again, and, dang it, there's his message again! What's going on? His message had this subject line:

    Subject: second try: check out yo' space

Hmmm... that doesn't seem like it would match our regex. What else could possibly trigger our recipe? As we examine his message further, we notice the following line in his message headers:

    Old-Subject: Dude, I think your site needs work!

Ha! His email client preserved the old subject line and prepended "Old-" to it. But why did our spam recipe catch it again? "Old-Subject:" doesn't match "Subject:", does it? Funny you should ask.

=head2 Leftmost, Shortest Match

The procmail regular expression engineL<[1]|proctut4.pod/Note_1> matches (roughly) like this: It compares the first character in the pattern (the letter "S" in "Subject") with the first character in the header line ("O" in "Old-Subject"). If they're the same (they're not in this case), the next letter of the pattern ("u") is compared with the next letter of the header, and so forth until the end of the pattern is reached.

If the first character of the pattern does not match the first character of the line we're matching (as is the case here), the I<next> character of the header ("l"--ell in "Old-Subject") is compared with the I<first> character of the pattern. We continue looking at our pattern's first character but slide along to the right in the string we're examining. The regular expression engine I<wants> to match.

As we move along in our line, we finally get to the "S" in "Old-Subject" and we begin our comparison from there. This is where our match happens and why the pattern "Subject:" matches the string "Old-Subject:".

What we need is a way to tell the procmail regular expression engine to I<force> the match to occur only at the beginning of the line; the regular expression token that accomplishes this is called an I<anchor>.
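
The sliding comparison described above can be sketched in a few lines of Python. This is only an illustration for a literal pattern: the function name C<find_literal> is made up for this example, and Python is a stand-in here, not procmail's actual engine (procmail also ignores case by default, which this sketch skips).

```python
def find_literal(pattern, line):
    """Return the offset where `pattern` first occurs in `line`, or -1.

    Hypothetical helper: a naive left-to-right scan for a literal
    string, mimicking the "slide one character to the right and retry"
    behaviour described in the text. It is not procmail's real engine.
    """
    for start in range(len(line) - len(pattern) + 1):
        # Compare the whole pattern against the line, beginning at `start`.
        if line[start:start + len(pattern)] == pattern:
            return start
    return -1


header = "Old-Subject: Dude, I think your site needs work!"
print(find_literal("Subject:", header))                 # 4: just past "Old-"
print(find_literal("Subject:", "Subject: second try"))  # 0: start of line
```

An unanchored pattern accepts a match at any offset; an anchored one would accept only offset 0, which is why "Old-Subject:" slips through without an anchor.
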

Wildcards (the dot character, which represents I<any> character except newline, and character classes both count as "wildcards") accompanied by either an asterisk ("zero or more") or a plus sign ("one or more") will match as I<few> characters as possible. This is different from most other regular expression engines.

=head2 Anchors

Like an anchor for a boat, regular expression anchors prevent the pattern from "drifting". They force the pattern to either the beginning (left end) of the string or the end (right end) of the string.

Procmail has several anchors; in this tutorial we will cover two of them: the I<beginning-of-line anchor> or caret ("^") and the I<end-of-line anchor> or dollar sign ("$").

=head2 The Caret

You might remember that the caret already does something special for us: when the caret is found as the first character in a character class (using square brackets: "[" and "]"), it I<negates> or I<inverts> the character class.

However, when the caret is used as the first character in a regular expression pattern, it becomes a beginning-of-line anchor. This means that the pattern following the anchor I<must> match ("be anchored") at the beginning of the line.

Recall our pattern:

    Subject:.*your (web ?)?site

and how it matched "Old-Subject:". We don't want this, so we I<anchor> the pattern to the beginning of the line with a caret:

    ^Subject:.*your (web ?)?site

Now our pattern reads (in English):

The beginning of the line, immediately followed by "Subject:", followed by zero or more characters (anything but a newline), followed by "your ", optionally followed by "web" and an optional space, followed by "site".

That's precisely what we want. We make this minor adjustment to our recipe and sleep well knowing that we won't be catching any "Old-Subject:" lines in our spam traps again.

=head2 The Dollar Sign

Unfortunately, our friend has a short memory. He has already forgotten that he shouldn't send any email messages with "your site" or "your web site" in the subject.
He again sent a message with the following subject:

    Subject: Dude, I think your site needs work!

which of course matched even our new pattern. What can we do to allow these email messages through but still block unsavory messages?

Well, we notice, for one thing, that all of the spam subjects I<end> with the word "site"; that is, we never get any spam subjects that go beyond that:

    Subject: your web site
    Subject: Re: your web site
    Subject: Fw: Re: your website

So, we want to match phrases that I<end> with the word "site". This is possible after we introduce our second anchor, the I<dollar sign> ("$").

The dollar sign matches the "end-of-line"L<[2]|proctut4.pod/Note_2>. So to make our pattern tight, we turn this:

    ^Subject:.*your (web ?)?site

into this:

    ^Subject:.*your (web ?)?site$

which reads in English:

Match the beginning of the line (the caret), immediately followed by "Subject:", followed by zero or more characters (anything but a newline), followed by "your ", optionally followed by "web" and an optional space, followed by "site", immediately followed by a newline character.

Now our pattern matches only the spams we've been getting, and messages from our friend no longer match, since the character after "site" is not a newline. Compare this paragraph with the equivalent for the caret anchor above.

Correct use of the caret and dollar sign can also make your patterns match more quickly: with a caret, the regular expression engine can often tell after a single comparison whether the pattern can match the current line.

=head1 EXAMPLES

As promised, we'll now cover a few real-world examples from my own procmail recipe file (slightly simplified) to illustrate dots, quantifiers, character classes, and anchors.

Each example is given first as a procmail recipe, followed by an English translation for you to read along with. I have found English translations of regular expressions invaluable in learning how to "speak" regex.
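
Before turning to the recipes, the fully anchored subject pattern from the dollar-sign section can be checked with a short Python script. Treat this as a rough sketch: Python's C<re> module is only a stand-in for procmail's engine, C<re.IGNORECASE> approximates procmail's default case-insensitive matching, and the helper name C<is_spam_subject> and the last test header are invented for this example.

```python
import re

# The tutorial's final pattern. Python's `re` is an approximation of
# procmail's matching, used here only to exercise the two anchors.
SPAM_SUBJECT = re.compile(r"^Subject:.*your (web ?)?site$", re.IGNORECASE)


def is_spam_subject(header_line):
    """Hypothetical helper: True if one header line matches the pattern."""
    return SPAM_SUBJECT.search(header_line) is not None


headers = [
    "Subject: your web site",
    "Subject: Re: your web site",
    "Subject: Fw: Re: your website",
    "Subject: Dude, I think your site needs work!",  # text after "site"
    "Old-Subject: Fw: your site",                    # not at line start
]
for line in headers:
    print(is_spam_subject(line), line)
```

The first three headers match; the fourth is rejected by the dollar sign and the fifth by the caret.
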

=over 4

=item Example 1

Hoping that current spam legislation will pass and all email advertisers will comply, I have created a recipe that files any message whose subject begins with 'ADV' in a special folder that I will read later.

    :0:
    * ^Subject:[ ]*.?ADV.?:?[ ]+
    for_when_im_bored

In this and the following recipes, each "[ ]" character class contains a space and a tab.

The English translation:

The beginning of the line, immediately followed by "Subject:", followed by zero or more spaces or tabs, optionally followed by any single character, followed by "ADV" (case-insensitive), optionally followed by any single character, optionally followed by a colon, followed by one or more spaces or tabs.

For a case-sensitive version, we add the 'D' flag:

    :0 D:
    * ^Subject:[ ]*.?ADV.?:?[ ]+
    for_when_im_bored

This will match only when 'ADV' is really 'ADV' (not, for example, 'Adv' or 'adv').

=item Example 2

Being naturally suspicious of mail not destined for me, I decide to shunt email messages addressed to "undisclosed-recipients" somewhere else.

    :0:
    * ^To:[ ]*undisclosed[\.\-]recipients
    not_for_me

The English translation:

The beginning of the line, immediately followed by "To:", followed by zero or more spaces or tabs, followed by "undisclosed", followed by a literal dot or a hyphen, followed by "recipients".

=item Example 3

I don't ever confuse a web browser for an email client, so I don't often get around to reading HTML email messages--email is not for sending web pagesL<[3]|proctut4.pod/Note_3>. I'd like to put email messages written exclusively in HTML or as Word documents in a folder I can convert to plain text later.

    :0 B:
    * ^Content-type:[ ]*(text/html|application/ms-word)
    html_email

The English translation reads:

The body of the email message must have a line that begins with "Content-type:", followed by zero or more spaces or tabs, followed by "text/html" or "application/ms-word".

=back

=head1 SUMMARY

Anchors "tie down" our patterns to the beginning or the end of the line, making our patterns match (or fail to match) more quickly (since the engine doesn't have to walk all the way down the string to see if the next character matches) and more accurately. The caret anchor ties our pattern to the beginning of the line, while the dollar sign anchor ties our pattern to the end of the line.

=head1 NOTES

=over 4

=item Note 1

Procmail is unlike most other regular expression engines, including Perl's, egrep's, and those of many other popular programming languages, which implement a I<leftmost>, I<longest> match. Procmail uses a I<leftmost>, I<shortest> match, which can make it faster than many other engines. The only time procmail will match I<longest> is when a pattern appears to the right of the \/ (match) operator. The tutorial on the match operator covers this completely.

=item Note 2

The dollar sign has many uses in procmail. It indicates a variable when letters are attached immediately after it. It turns on shell interpolation when it stands by itself at the start of a condition. These uses and others will be covered in other tutorials.

=item Note 3

Yet another of your author's strong opinions. There's a lot of debate about this, of course. The crux of my position comes from an old security adage that says "keep code and data separate." Nearly all of our security problems today, including the recent spread of Windows worms and viruses, stem from the fact that Microsoft forgot this simple principle when they designed their operating system, web browser, and email clients.

=back

=head1 PREVIOUS

L

=head1 NEXT

I

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), regex(3)

=head1 AUTHOR

Scott Wiersdorf E<lt>scott@perlcode.orgE<gt>

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut4.pod,v 1.6 2003/10/18 02:52:03 deep Exp $