
Regular Expressions

Regular expressions is the term used for a codified method of searching 'invented' or 'defined' by the American mathematician Stephen Kleene.

The following section gives an overview of the format and syntax of 'Regular Expressions', specifically as they are used in Apache.

Humble Pie Time: In our examples we blew this expression ^([M-Z]in). We incorrectly stated that this would negate the tests [M-Z]. The '^' only performs this function inside square brackets; here it is outside the square brackets and is an anchor indicating 'start from first character'. The corrected example appears in the Additional 'metacharacters' section (many thanks to Mirko Stojanovic for pointing it out and apologies to one and all).

The syntax (language format) described is compliant with 'extended' regular expressions (EREs) defined in IEEE POSIX 1003.2 (Section 2.8). EREs are now commonly supported by Apache, PHP4, Javascript 1.3, Microsoft's Visual Studio, the GNU family of tools (including grep and sed) as well as many others. EREs will support 'Basic' Regular Expressions (BREs are essentially a subset of EREs). 'grep', by the way, stands for global regular expression print (well, well!) and egrep, guess.

Some Definitions before we start

We are going to be using the terms 'literal', 'metacharacter', 'escape sequence', 'target string' and 'search expression' in this overview. Here is a definition of our terms:

literal A 'literal' is any actual character we use in a search or matching expression e.g. to find ind in windows, the ind is a 'literal' string: each character plays a part in the search, it is literally the string we want to find.
metacharacter A 'metacharacter' is one or more special characters that have a unique meaning and are NOT used as 'literals' in the search expression e.g. the character ^ (circumflex) is a 'metacharacter'.
escape sequence An 'escape sequence' is a way of indicating that we want to use one of our 'metacharacters' as a 'literal'. In a regular expression an 'escape sequence' involves placing the 'metacharacter' \ (backslash) in front of the 'metacharacter' e.g. if we want to find ^ind in w^indow then we use the search expression \^ind, and if we want to find \\file in the string c:\\file then we would need to use the search expression \\\\file (each \ we want to search for is preceded by an escape \).
target string We have chosen to use this term to describe the string that we will be searching i.e. the string in which we want to find our match or search pattern.
search expression We have chosen to use this term to describe the expression that we will be using to search our target string i.e. the pattern we want to find.
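The escape sequence definition above can be tried out with a short sketch. We are assuming Python and its re module here purely for illustration; it is not a POSIX ERE engine, but it treats these escapes the same way. The variable names are our own.

```python
import re

# \^ matches a literal circumflex instead of acting as a metacharacter.
caret = re.search(r"\^ind", "w^indow")
print(caret.group())

# Each backslash we want to find needs its own escaping backslash.
target = "c:\\\\file"  # Python source for the text c:\\file
slashes = re.search(r"\\\\file", target)
print(slashes.group())

# re.escape() builds the escaped form of a literal string for us.
escaped = re.escape("^ind")
print(escaped)
```

The raw string prefix (r"...") stops Python itself from interpreting the backslashes, so the regular expression engine sees exactly what we typed.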
Our Example Target Strings

Throughout this piece we will use the following as our target strings:

STRING1   Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2   Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)

These are Browser ID Strings and appear as the Apache environment variable HTTP_USER_AGENT.

Simple Matching

We are going to try some simple matching against our example target strings:

Search for   Target    Result     Comment

m            STRING1   match      As expected.
             STRING2   no match   There is no lower case m in this string. Searches are case sensitive unless you take special action.
a/4          STRING1   match      Any combination of characters can be used for the match.
             STRING2   match      Found in the same place as in STRING1.
in           STRING1   match      Found in Windows.
             STRING2   match      Found in Linux.
le           STRING1   match      Found in compatible.
             STRING2   no match   There is an l and an e in this string but they are not adjacent (or contiguous).
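The simple matches above can be reproduced with a few lines of code. This is a minimal sketch assuming Python's re module (not a POSIX ERE engine, though it behaves identically for plain literals); the helper name found is our own invention.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"


def found(pattern, target, flags=0):
    """Return True when pattern matches anywhere in target."""
    return re.search(pattern, target, flags) is not None


# One row per search expression: (pattern, STRING1 result, STRING2 result).
results = [(p, found(p, STRING1), found(p, STRING2))
           for p in ("m", "a/4", "in", "le")]
for row in results:
    print(row)

# The 'special action' for case: ask the engine to ignore it.
print(found("m", STRING2, re.IGNORECASE))
```

With re.IGNORECASE the search for m succeeds in STRING2 because it now accepts the M in Mozilla.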
Brackets, Ranges and Negation

Bracket expressions introduce our first 'metacharacters', the square brackets, which allow us to define a list of things to test for rather than the single characters we have been checking up until now.

Metacharacter   Meaning

[ ]   Match anything inside the square brackets for one character position once and only once e.g. [12] says match the target to either 1 or 2 while [0123456789] says match to any character in the range 0 to 9.

-     The - (dash) inside square brackets is the 'range separator' and allows us to define a range so in our example above of [0123456789] we could rewrite it as [0-9].

      You can define more than one range inside a list e.g. [0-9A-C] says check for 0 to 9 and A to C.

      NOTE: To test for - inside brackets (as a literal) it must come first or last i.e. [-0-9] will test for - and 0 to 9.

^     The ^ (circumflex) inside square brackets negates the expression (we will see an alternate use for the circumflex outside square brackets later) e.g. [^Ff] says anything except upper or lower case F.

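NOTE: Spaces, or in this case the lack of them, between ranges are very important. A space inside the brackets is itself a character to test for, so [0-9 A-C] would also match a space.

NOTE: There are some special range values (see Special Range 'metacharacters' at the end of this page) that are built-in to most regular expression software (and have to be if it claims POSIX 1003.2 compliance for either BRE or ERE).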

So let's try this out with our example target strings.

Search for   Target    Result     Comment

in[du]       STRING1   match      Finds ind in Windows.
             STRING2   match      Finds inu in Linux.
x[0-9A-Z]    STRING1   no match   Again the tests are case sensitive: to find the xt in DigExt we would need to use x[0-9a-z] or x[0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
             STRING2   match      Finds x2 in Linux2.
[^A-M]in     STRING1   match      Finds Win in Windows.
             STRING2   no match   We have excluded the range A to M in our search so Linux is not found.
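Here is a hedged sketch of the bracket tests above, again assuming Python's re module, which handles lists, ranges and negation inside square brackets the same way for these cases. The helper first_match is a name we made up; it returns the text that matched, or None.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"


def first_match(pattern, target):
    """Return the first matching text in target, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None


for pattern in ("in[du]", "x[0-9A-Z]", "[^A-M]in"):
    print(pattern, first_match(pattern, STRING1), first_match(pattern, STRING2))

# Adding the lower case range lets us find the xt in DigExt.
print(first_match("x[0-9a-z]", STRING1))

# A dash placed first inside the brackets is a literal dash.
print(first_match("[-0-9]+", "Linux2.2.16-22"))
```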
Positioning (or Anchors)

We can control where in our target strings the matches are valid. The following is a list of 'metacharacters' that affect the position of the search:

Metacharacter   Meaning

^     The ^ (circumflex) outside square brackets says look only at the beginning of the target string e.g. ^Win will not find Windows in STRING1 but ^Moz will find Mozilla.
$     The $ (dollar) says look only at the end of the target string e.g. fox$ will find a match in 'silver fox' but not in 'the fox jumped over the moon'. Note that the $ is written after the characters it anchors.
.     The . (period) says any character in this position e.g. ton. will find tons and tonneau but not wanton because it has no following character.

NOTE: You will find many systems, but not all, support special macros e.g. \< match at the beginning of a word, \> match at the end of a word, \b match at the beginning OR end of a word, \B match anywhere except at the beginning or end of a word.

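So let's try this lot out with our example target strings.

Search for   Target    Result     Comment

[a-z]\)$     STRING1   match      Finds t) in DigExt). The $ goes at the end of the expression, and the ) is escaped because an unescaped parenthesis is a 'metacharacter'.
             STRING2   no match   We have a numeric value at the end of this string; we need [0-9a-z]\)$ to find it.
.in          STRING1   match      Finds Win in Windows.
             STRING2   match      Finds Lin in Linux.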
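The anchors and the period can be checked with the sketch below. It assumes Python's re module rather than a POSIX ERE engine; for these expressions the two agree. As before, first_match is a helper name of our own.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"


def first_match(pattern, target):
    """Return the first matching text in target, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None


# ^ anchors the search to the start of the target string.
print(first_match("^Moz", STRING1), first_match("^Win", STRING1))

# $ anchors the search to the end, and is written after the literal text.
print(first_match("fox$", "silver fox"))
print(first_match("fox$", "the fox jumped over the moon"))

# A literal ) must be escaped; STRING2 ends in 6) so it needs the digits too.
print(first_match(r"[a-z]\)$", STRING1), first_match(r"[a-z]\)$", STRING2))
print(first_match(r"[0-9a-z]\)$", STRING2))

# . stands for any single character, but there must be one.
print(first_match("ton.", "tonneau"), first_match("ton.", "wanton"))
print(first_match(".in", STRING1), first_match(".in", STRING2))
```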
Iteration 'metacharacters'

The following is a set of 'metacharacters' that can control the number of times a character or string is found in our searches:

Metacharacter   Meaning

?       The ? (question mark) matches the preceding character 0 or 1 times only e.g. colou?r will find both color and colour.

*       The * (asterisk or star) matches the preceding character 0 or more times e.g. tre* will find tree and tread and trough (in trough the e is found zero times).

+       The + (plus) matches the previous character 1 or more times e.g. tre+ will find tree and tread but not trough.

{n}     Matches the preceding character n times exactly e.g. to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567.

        Note: The - (dash) in this case, because it is outside the square brackets, is a 'literal'.

{n,m}   Matches the preceding character at least n times but not more than m times e.g. ba{2,3}b will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. The surrounding b's matter: a{2,3} on its own would still find the first three a's inside 'baaaab'.

So let's try them out with our example target strings.

Search for        Target    Result     Comment

l?                STRING1   match      Always matches. Because ? allows the l to be found zero times, the search succeeds (with an empty match) at the very first position.
                  STRING2   match      Matches for the same reason. An expression made up of a single optional character will match any target string.
W*in              STRING1   match      Finds the Win in Windows.
                  STRING2   match      Finds in in Linux (and finds W zero times).
[xX][0-9a-z]{2}   STRING1   no match   Finds x in DigExt but only one t.
                  STRING2   match      Finds X and 11 in X11.
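The iteration 'metacharacters' are easy to get subtly wrong, so here is a small sketch to experiment with. It assumes Python's re module (not a POSIX ERE engine, but ?, *, +, {n} and {n,m} behave the same for these cases), and first_match is again our own helper name.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"


def first_match(pattern, target):
    """Return the first matching text in target, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None


# ? : zero or one of the preceding character.
print([first_match("colou?r", w) for w in ("color", "colour")])

# * allows zero e's, + insists on at least one.
print([first_match("tre*", w) for w in ("tree", "tread", "trough")])
print([first_match("tre+", w) for w in ("tree", "tread", "trough")])

# {n} and {n,m} count repetitions of the preceding character.
print(first_match("[0-9]{3}-[0-9]{4}", "call 123-4567"))
print([first_match("ba{2,3}b", w) for w in ("bab", "baab", "baaab", "baaaab")])

# An optional character on its own matches the empty string straight away.
print(repr(first_match("l?", STRING2)))
print(first_match("W*in", STRING2))
print(first_match("[xX][0-9a-z]{2}", STRING1),
      first_match("[xX][0-9a-z]{2}", STRING2))
```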
Additional 'metacharacters'

The following is a set of additional 'metacharacters' that provide additional power to our searches:

Metacharacter   Meaning

()    The ( (open parenthesis) and ) (close parenthesis) may be used to group parts of our search expression together e.g. the (a|e) in gr(a|e)y.
|     The | (vertical bar or pipe) says find the left hand OR right hand values e.g. gr(a|e)y will find 'gray' or 'grey'.

So let's try these out with our example strings.

Search for    Target    Result     Comment

^([M-Z]in)    STRING1   no match   The '^' is an anchor indicating first position. Win does not start the string so no match.
              STRING2   no match   The '^' is an anchor indicating first position. Linux does not start the string so no match.
(Win)|(Lin)   STRING1   match      Finds Win in Windows.
              STRING2   match      Finds Lin in Linux.
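Grouping and alternation can be tried with the sketch below, which assumes Python's re module (it is not a POSIX ERE engine but treats ( ) and | the same way here). The word gruy is just a made-up non-matching example and first_match is our own helper.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"


def first_match(pattern, target):
    """Return the first matching text in target, or None if there is none."""
    m = re.search(pattern, target)
    return m.group() if m else None


# (a|e) accepts either letter in that one position.
spellings = [w for w in ("gray", "grey", "gruy") if re.search("gr(a|e)y", w)]
print(spellings)

# The parentheses also remember which alternative was found.
print(re.search("gr(a|e)y", "grey").group(1))

# Outside square brackets ^ is an anchor, not a negation.
print(first_match("^([M-Z]in)", STRING1), first_match("^([M-Z]in)", STRING2))

# Either whole group may match.
print(first_match("(Win)|(Lin)", STRING1), first_match("(Win)|(Lin)", STRING2))
```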
Browser Identification

All we ever wanted to do was find enough about our browsers to decide what code to supply or not for our pop-out menus.

We want to know:

Here in their glory are what we used (maybe you can understand them now)

BrowserMatchNoCase Mozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase Gecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9])) isW3C
BrowserMatchNoCase W3C_ isW3C

Notes:

Some of the above checks may be a bit excessive, e.g. is Mozilla ever spelled mozilla? But it is also pretty silly to have code fail just because of this easy-to-prevent condition. There is apparently no final consensus that all Gecko browsers will have to use Gecko in their 'user-agent' string, but it would be extremely foolish not to, since this would force guys like us to make huge numbers of tests for branded products, and the more likely outcome would be that we would not.
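To see which flags the BrowserMatchNoCase lines above would set for a given browser, the following sketch mimics them outside Apache. It is only an emulation under our own assumptions: Python's re module stands in for Apache's regular expression engine, RULES and browser_flags are hypothetical names, and the MSIE 6.0 string is a sample we wrote for illustration.

```python
import re

STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

# (search expression, flag) pairs copied from the BrowserMatchNoCase lines.
RULES = [
    (r"Mozilla/[4-6]", "isJS"),
    (r"MSIE", "isIE"),
    (r"Gecko", "isW3C"),
    (r"MSIE.((5\.[5-9])|([6-9]))", "isW3C"),
    (r"W3C_", "isW3C"),
]


def browser_flags(user_agent):
    """Return the set of flags whose expression matches, ignoring case."""
    return {flag for pattern, flag in RULES
            if re.search(pattern, user_agent, re.IGNORECASE)}


print(sorted(browser_flags(STRING1)))
print(sorted(browser_flags(STRING2)))

# Illustrative sample only: a user-agent string claiming MSIE 6.0.
SAMPLE_IE6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
print(sorted(browser_flags(SAMPLE_IE6)))
```

STRING1 claims MSIE 5.0, which fails both halves of (5\.[5-9])|([6-9]), so it is flagged isIE but not isW3C.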

Special Range 'metacharacters'

The following is a set of special values that denote certain common ranges. They tend to look very ugly but have the advantage that they also take into account the 'locale' i.e. any variant of the local language/coding system (guess what, the whole world cannot use ASCII 'cos the A stands for American).

Value        Meaning

[:digit:]    Only the digits 0 to 9.
[:alnum:]    Any alphanumeric character: 0 to 9, A to Z or a to z.
[:alpha:]    Any alpha character: A to Z or a to z.
[:blank:]    Space and TAB characters only.
[:xdigit:]   Hexadecimal digits: 0 to 9, A to F or a to f.
[:punct:]    Punctuation symbols e.g. . , " ' ? ! ; :
[:print:]    Any printable character (including space).
[:space:]    Any whitespace character (space, TAB, newline and similar).
[:graph:]    Any printable character except space.
[:upper:]    Any alpha character A to Z.
[:lower:]    Any alpha character a to z.
[:cntrl:]    Control characters.

These are always used inside square brackets in the form [[:alnum:]] or combined as [[:digit:]a-d]
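Not every regular expression engine understands these special values. Python's re module, for one, does not, so the sketch below rewrites a few of them into plain ranges first. It assumes ASCII-only equivalents (which throws away the 'locale' advantage), and POSIX_CLASSES and expand_posix are names of our own. The replacement is deliberately naive and assumes the values only ever appear inside square brackets.

```python
import re

# ASCII-only stand-ins; the real values follow the locale.
POSIX_CLASSES = {
    "[:digit:]": "0-9",
    "[:alpha:]": "A-Za-z",
    "[:alnum:]": "0-9A-Za-z",
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:xdigit:]": "0-9A-Fa-f",
    "[:blank:]": " \\t",
}


def expand_posix(pattern):
    """Replace each special range value in pattern with its ASCII range."""
    for name, ascii_range in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern


print(expand_posix("[[:alnum:]]"))
print(expand_posix("[[:digit:]a-d]"))
print(re.findall(expand_posix("[[:digit:]a-d]+"), "abc123xyz"))
```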




Copyright © 1994 - 2003 ZyTrax, Inc.
Page modified: April 13 2003.