'Regular expressions' is the term used for a codified method of searching, 'invented' or 'defined' by the American mathematician Stephen Kleene.
The following section gives an overview of the format and syntax of regular expressions, specifically as they are used in Apache.
Humble Pie Time: In our examples we blew the expression ^([M-Z]in). We incorrectly stated that the ^ would negate the test [M-Z]; the ^ only performs this function inside square brackets. Here it is outside the square brackets and is an anchor meaning 'start from the first character'. The corrected section is here (many thanks to Mirko Stojanovic for pointing it out and apologies to one and all).
The syntax (language format) described is compliant with 'extended' regular expressions (EREs) defined in IEEE POSIX 1003.2 (Section 2.8). EREs are now commonly supported by Apache, PHP4, Javascript 1.3, Microsoft's Visual Studio and the GNU family of tools (including grep and sed), as well as many others. EREs will support 'basic' regular expressions (BREs, essentially a subset of EREs). 'grep', by the way, stands for global regular expression print - well, well! And egrep - guess.
We are going to be using the terms 'literal', 'metacharacter', 'escape sequence', 'target string' and 'search expression' in this overview. Here is a definition of our terms:
literal | A 'literal' is any actual character we use in a search or matching expression, e.g. to find ind in windows, the ind is a 'literal' string: each character plays a part in the search, and it is literally the string we want to find. |
metacharacter | A 'metacharacter' is one or more special characters that have a unique meaning and are NOT used as 'literals' in the search expression, e.g. the character ^ (circumflex) is a 'metacharacter'. |
escape sequence | An 'escape sequence' is a way of indicating that we want to use one of our 'metacharacters' as a 'literal'. In a regular expression an 'escape sequence' involves placing the 'metacharacter' \ (backslash) in front of the 'metacharacter', e.g. if we want to find ^ind in w^indow then we use the search expression \^ind, and if we want to find \\file in the string c:\\file then we would need to use the search expression \\\\file (each \ we want to search for is preceded by an escape \). |
target string | We have chosen to use this term to describe the string that we will be searching, i.e. the string in which we want to find our match or search pattern. |
search expression | We have chosen to use this term to describe the expression that we will be using to search our target string, i.e. the pattern we want to find. |
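The escape sequence examples above can be tried out with a short sketch. It assumes Python's re module as a stand-in for the POSIX ERE engine described in this overview (the two agree on these simple patterns, but that is an assumption of the sketch), and the variable names are ours.

```python
import re

# A metacharacter preceded by a backslash is treated as a literal.
# Raw strings (r"...") stop Python itself from interpreting the backslashes.
caret_match = re.search(r"\^ind", "w^indow")

# The target holds two literal backslashes, so the search expression needs
# four: each backslash we search for is preceded by an escaping backslash.
target = r"c:\\file"              # the 8 characters  c : \ \ f i l e
backslash_match = re.search(r"\\\\file", target)

print(caret_match.group())        # ^ind
print(backslash_match.group())    # \\file
```

Without the escape, ^ind would be read as 'ind at the start of the target string' and would not be found in w^indow.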
Throughout this piece we will use the following as our target strings:
STRING1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)
These are browser ID strings and appear as the Apache environment variable HTTP_USER_AGENT.
We are going to try some simple matching against our example target strings:
Search for | Target | Result | Comment |
m | STRING1 | match | As expected. |
m | STRING2 | no match | There is no lower case m in this string. Searches are case sensitive unless you take special action. |
a/4 | STRING1 | match | Any combination of characters can be used for the match. |
a/4 | STRING2 | match | Found in the same place as in STRING1. |
in | STRING1 | match | Found in Windows. |
in | STRING2 | match | Found in Linux. |
le | STRING1 | match | Found in compatible. |
le | STRING2 | no match | There is an l and an e in this string but they are not adjacent (or contiguous). |
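The simple searches in the table above can be reproduced with a minimal sketch. It assumes Python's re module in place of Apache's regular expression engine; the helper name found and the dictionary simple_results are ours.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def found(pattern, target):
    """Return True if the search expression occurs anywhere in the target."""
    return re.search(pattern, target) is not None

# Map each search expression to (result for STRING1, result for STRING2).
simple_results = {
    pattern: (found(pattern, STRING1), found(pattern, STRING2))
    for pattern in ["m", "a/4", "in", "le"]
}

for pattern, (in_1, in_2) in simple_results.items():
    print(f"{pattern:4} STRING1={in_1} STRING2={in_2}")
```

re.search looks for the expression anywhere in the target string, which is the behaviour the table assumes.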
Bracket expressions introduce our first 'metacharacters', the square brackets, which allow us to define a list of things to test for rather than the single characters we have been checking up until now.
Metacharacter | Meaning |
[ ] | Match anything inside the square brackets for one character position once and only once e.g. [12] says match the target to either 1 or 2 while [0123456789] says match to any character in the range 0 to 9. |
- | The - (dash) inside square brackets is the 'range separator' and allows us to define a range so in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list e.g. [0-9A-C] says check for 0 to 9 and A to C. NOTE: To test for - inside brackets (as a literal) it must come first or last i.e. [-0-9] will test for - and 0 to 9. |
^ | The ^ (circumflex) inside square brackets negates the expression (we will see an alternate use for the circumflex outside square brackets later) e.g. [^Ff] says anything except upper or lower case F. NOTE: Spaces, or in this case the lack of them, between ranges are very important: a space inside the brackets is itself treated as a character to test for. |
NOTE: There are some special range values here that are built-in to most regular expression software (and have to be if it claims POSIX 1003.2 compliance for either BRE or ERE).
So let's try this out with our example target strings.
Search for | Target | Result | Comment |
in[du] | STRING1 | match | Finds ind in Windows. |
in[du] | STRING2 | match | Finds inu in Linux. |
x[0-9A-Z] | STRING1 | no match | Again the tests are case sensitive; to find the xt in DigExt we would need to use x[0-9a-z] or x[0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F. |
x[0-9A-Z] | STRING2 | match | Finds x2 in Linux2. |
[^A-M]in | STRING1 | match | Finds Win in Windows. |
[^A-M]in | STRING2 | no match | We have excluded the range A to M in our search so Linux is not found. |
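The bracket expression results above can be checked with a short sketch, again assuming Python's re module as a stand-in for the engine described here. The helper first_match is ours and returns the text that was actually matched.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the text of the leftmost match, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None

print(first_match(r"in[du]", STRING1))      # ind
print(first_match(r"in[du]", STRING2))      # inu
print(first_match(r"x[0-9A-Z]", STRING1))   # None
print(first_match(r"x[0-9a-z]", STRING1))   # xt
print(first_match(r"x[0-9A-Z]", STRING2))   # x2
print(first_match(r"[^A-M]in", STRING1))    # Win
print(first_match(r"[^A-M]in", STRING2))    # None
```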
We can control where in our target strings the matches are valid. The following is a list of 'metacharacters' that affect the position of the search:
Metacharacter | Meaning |
^ | The ^ (circumflex) outside square brackets says look only at the beginning of the target string e.g. ^Win will not find Windows in STRING1 but ^Moz will find Mozilla. |
$ | The $ (dollar) says look only at the end of the target string e.g. fox$ will find a match in 'silver fox' but not in 'the fox jumped over the moon'. |
. | The . (period) says any character in this position e.g. ton. will find tons and tonneau but not wanton because it has no following character. |
NOTE: You will find many systems, but not all, support special macros e.g. \< match at the beginning of a word, \> match at the end of a word, \b match at the beginning OR end of a word, \B match anywhere except at the beginning or end of a word.
So let's try this lot out with our example target strings.
Search for | Target | Result | Comment |
[a-z]\)$ | STRING1 | match | Finds t) at the end of DigExt). The \ is an escape sequence so that the ) is treated as a literal. |
[a-z]\)$ | STRING2 | no match | We have a numeric value at the end of this string; we need [0-9a-z]\)$ to find it. |
.in | STRING1 | match | Finds Win in Windows. |
.in | STRING2 | match | Finds Lin in Linux. |
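A minimal sketch of the positioning metacharacters, assuming Python's re module behaves like the engine described here for these patterns (it does for ^, $ and . on single-line strings). The helper first_match is ours.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the text of the leftmost match, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None

# ^ anchors the search to the start of the target string.
print(first_match(r"^Moz", STRING1))          # Moz
print(first_match(r"^Win", STRING1))          # None

# $ anchors the search to the end; the literal ) must be escaped.
print(first_match(r"[a-z]\)$", STRING1))      # t)
print(first_match(r"[a-z]\)$", STRING2))      # None
print(first_match(r"[0-9a-z]\)$", STRING2))   # 6)

# . stands for any single character in that position.
print(first_match(r".in", STRING1))           # Win
print(first_match(r".in", STRING2))           # Lin
```

Note that the $ comes last in the search expression, after the characters that must appear at the end of the target string.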
The following is a set of 'metacharacters' that can control the number of times a character or string is found in our searches:
Metacharacter | Meaning |
? | The ? (question mark) matches the preceding character 0 or 1 times only e.g. colou?r will find both color and colour. |
* | The * (asterisk or star) matches the preceding character 0 or more times e.g. tre* will find tree and tread and trough (in trough the e is matched zero times). |
+ | The + (plus) matches the previous character 1 or more times e.g. tre+ will find tree and tread but not trough. |
{n} | Matches the preceding character n times exactly e.g. to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a 'literal'. |
{n,m} | Matches the preceding character at least n times but not more than m times e.g. ba{2,3}b will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. |
So let's try them out with our example target strings.
Search for | Target | Result | Comment |
l? | STRING1 | match | Always matches: because l may occur zero times, the search succeeds (with an empty match) before it even reaches the l in compatible. |
l? | STRING2 | match | Matches for the same reason: l? accepts zero occurrences, so it succeeds on any target string. To insist on at least one l use l or l+. |
W*in | STRING1 | match | Finds the Win in Windows. |
W*in | STRING2 | match | Finds in in Linux (and finds W zero times). |
[xX][0-9a-z]{2} | STRING1 | no match | Finds x in DigExt but only one t. |
[xX][0-9a-z]{2} | STRING2 | match | Finds X and 11 in X11. |
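The iteration metacharacters can be explored with the following sketch. It assumes Python's re module as a stand-in for the engine described here; the helper first_match and the sample phone string are ours. An empty string in the output means the expression matched zero characters, which still counts as a match.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def first_match(pattern, target):
    """Return the text of the leftmost match, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None

print(first_match(r"colou?r", "colour"))                 # colour
print(first_match(r"tre*", "trough"))                    # tr
print(first_match(r"tre+", "trough"))                    # None
print(first_match(r"[0-9]{3}-[0-9]{4}", "call 123-4567"))  # 123-4567

# {n,m} only limits the run if something bounds it on both sides.
print(first_match(r"ba{2,3}b", "baaaab"))                # None
print(first_match(r"a{2,3}", "baaaab"))                  # aaa

# l? may match zero times, so it succeeds with an empty match.
print(repr(first_match(r"l?", STRING2)))                 # ''
print(first_match(r"W*in", STRING1))                     # Win
print(first_match(r"W*in", STRING2))                     # in
print(first_match(r"[xX][0-9a-z]{2}", STRING1))          # None
print(first_match(r"[xX][0-9a-z]{2}", STRING2))          # X11
```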
The following is a set of further 'metacharacters' that provide additional power to our searches:
Metacharacter | Meaning |
() | The ( (open parenthesis) and ) (close parenthesis) may be used to group parts of our search expression together e.g. the (a|e) in gr(a|e)y. |
| | The | (vertical bar or pipe) says find the left hand OR right hand values e.g. gr(a|e)y will find 'gray' or 'grey'. |
So let's try these out with our example strings.
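Grouping and alternation can be tried with a minimal sketch, assuming Python's re module in place of the engine described here. The search expression (Windows|Linux) is our own illustration and is not one of the expressions used on this page.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

# The parentheses limit the OR to the single character between gr and y.
for word in ["gray", "grey", "groy"]:
    m = re.search(r"gr(a|e)y", word)
    print(word, "match" if m else "no match")

# The text matched by a group can be read back; group 1 is the first (...).
grey_letter = re.search(r"gr(a|e)y", "grey").group(1)
print(grey_letter)                                        # e

# Alternation between whole words, tried on our target strings.
platform_1 = re.search(r"(Windows|Linux)", STRING1).group(1)
platform_2 = re.search(r"(Windows|Linux)", STRING2).group(1)
print(platform_1, platform_2)                             # Windows Linux
```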
All we ever wanted to do was find enough about our browsers to decide what code to supply or not for our pop-out menus.
We want to know:
Here in their glory are what we used (maybe you can understand them now)
BrowserMatchNoCase Mozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase Gecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9])) isW3C
BrowserMatchNoCase W3C_ isW3C
Notes:
Line 4 checks for MSIE 5.5 (or greater) OR MSIE 6+.
NOTE about binding: This string does not work:
BrowserMatchNoCase MSIE.(5\.[5-9])|([6-9]) isW3C
It incorrectly sets the variable isW3C if any of the digits 6 to 9 appears anywhere in the string. The | binds less tightly than everything around it, so the expression is read as MSIE.(5\.[5-9]) OR ([6-9]), and the second alternative is a separate expression with no connection to MSIE. Adding the outer parentheses fixed the problem.
Line 5 checks for W3C_ in any part of the line. This allows us to identify the W3C validation services (either CSS or HTML/XHTML page validation).
Some of the above checks may be a bit excessive, e.g. is Mozilla ever spelled mozilla? But it would be pretty silly to have code fail just because of this easy-to-prevent condition. There is apparently no final consensus that all Gecko browsers will have to use Gecko in their 'user-agent' string, but it would be extremely foolish of them not to, since this would force guys like us to make huge numbers of tests for branded products, and the more likely outcome would be that we would not.
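The binding problem described in the notes above can be reproduced with a short sketch. It assumes Python's re module with re.IGNORECASE as a rough stand-in for what BrowserMatchNoCase does; the MSIE 5.5 user-agent string is a hypothetical example of ours, not one taken from this page.

```python
import re

STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"
# Hypothetical MSIE 5.5 browser ID string, for illustration only.
MSIE_55 = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

# Without outer parentheses this reads as: MSIE.(5\.[5-9])  OR  ([6-9])
broken = re.compile(r"MSIE.(5\.[5-9])|([6-9])", re.IGNORECASE)
# With outer parentheses the OR only applies to the version number.
fixed = re.compile(r"MSIE.((5\.[5-9])|([6-9]))", re.IGNORECASE)

broken_hit = broken.search(STRING2)   # matches the lone 7 in 4.75
fixed_hit = fixed.search(STRING2)     # no MSIE here, so no match

print(broken_hit.group())             # 7
print(fixed_hit)                      # None
print(fixed.search(MSIE_55).group())  # MSIE 5.5
```

STRING2 is not an MSIE browser at all, yet the broken expression matches it because of the 7 in 4.75.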
The following is a set of special values that denote certain common ranges. They tend to look very ugly but have the advantage that they also take into account the 'locale', i.e. any variant of the local language/coding system (guess what, the whole world cannot use ASCII 'cos the A stands for American).
Value | Meaning |
[:digit:] | Only the digits 0 to 9. |
[:alnum:] | Any alphanumeric character 0 to 9 OR A to Z or a to z. |
[:alpha:] | Any alpha character A to Z or a to z. |
[:blank:] | Space and TAB characters only. |
[:xdigit:] | Hexadecimal digits: 0 to 9, A to F or a to f. |
[:punct:] | Punctuation symbols, e.g. . , " ' ? ! ; : |
[:print:] | Any printable character (including space). |
[:space:] | Any whitespace character (space, TAB, newline, carriage return, form feed, vertical tab). |
[:graph:] | Any printable character except space. |
[:upper:] | Any alpha character A to Z. |
[:lower:] | Any alpha character a to z. |
[:cntrl:] | Control characters. |
These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]
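Not every regular expression engine understands these values. Python's re module, for one, does not support the [:name:] form, so the sketch below expands a few of them into plain ASCII ranges before searching. The helper expand_posix and the table POSIX_ASCII are hypothetical names of ours, and the expansion ignores the locale, which is exactly the advantage the real values have.

```python
import re

# ASCII-only expansions; a POSIX engine would consult the locale instead.
POSIX_ASCII = {
    "[:digit:]": "0-9",
    "[:alpha:]": "A-Za-z",
    "[:alnum:]": "0-9A-Za-z",
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:xdigit:]": "0-9A-Fa-f",
    "[:blank:]": r" \t",
}

def expand_posix(pattern):
    """Replace POSIX class names with ASCII ranges (hypothetical helper)."""
    for name, ascii_range in POSIX_ASCII.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"

digits_or_a_to_d = expand_posix("[[:digit:]a-d]+")
print(digits_or_a_to_d)                                   # [0-9a-d]+
print(re.search(digits_or_a_to_d, "x9b2z").group())       # 9b2
print(re.search(expand_posix("[[:alnum:]]+"), STRING1).group())  # Mozilla
```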
Copyright © 1994 - 2003 ZyTrax, Inc. All rights reserved. Legal and Privacy |
Questions to webmaster@zytrax.com Page modified: April 13 2003. |