Knowing how a regex engine interprets the rules of a regex pattern is useful, as it helps you to write more efficient expressions and avoid certain mistakes.
The core regex engine used by cfRegex is the java.util.regex library, which uses an algorithm known as "backtracking NFA".
Whilst you don't need to know the full details of the underlying theory of how this algorithm works and how it compares to other possible implementations, it is useful to understand a number of practical aspects to it, which are described in the sections below.
This is something common to all existing regex engines - matching is done from left-to-right, in all situations. Whilst a regex matcher might skip back to a previous position, it never actually scans from right-to-left
If you need to deal only with something at the end of a string (such as a file
extension) it is always best to isolate the end of the string using simpler
methods before trying to match a regex against it. (e.g. use Right(string,num)
or ListLast(string,delimiter)
as these both work by starting from the right
and moving towards the left.)
To demonstrate this behaviour, consider the text
"the quick fox jumps over the lazy brown dog", and the expression "\b.+?dog
".
You might think the ".+?
" will match at least once but as few characters as
possible, and so the closest the "\b
" can match to "dog" is after the "n" of
"brown", before the space - however this is falling into the trap of applying
your own direction to the interpretation of the instructions.
The regex engine starts with the "\b
", which matches immediately, before the
first "t", then the ".+?
" matches the "t", tries to pass control to the next
instruction (because it is a lazy quantifier), but this keeps failing and passes
back to the lazy quantifier, which keeps expanding it's match one character
at a time, until it has eventually found the "dog" at the end and ended up
matching the entire text.
So remember: Regex always starts with the first instruction and at the start (left-most position) of a string of characters.
One of the key benefits of regex is its ability to try multiple alternatives, however it is not always obvious what the cost of this may be.
When matching, a regex engine applies the regex instructions one at a time. If the instruction succeeds, it moves forward and tries the next instruction. When an instruction fails, it looks to see if there are alternatives, if there are none it skips back to the last point it had alternatives and tries a different option, and it repeats this whole process until either the regex has matched, or the regex engine has ran out of alternatives to try.
A regex matcher will not skip forward to future parts of the regex and try those first. It will not even check what the next instruction is and keep track of when that might match - it deals with the current instruction completely before even considering what the following instruction might be.
For example, the pattern ^.+(?=://)
might be used in order to match the
protocol from a URL - basically saying, "start at the beginning, match any
character as many times as possible, then look for (but don't match) ://".
However, what happens here is that the +
quantifer matches the entire input
string until the end, then it tries (and fails) to match :// and it backtracks
one character at a time, attempting to match, failing, and backtracking again,
until eventually it reaches almost the start of the string, and eventually
finds the :// at position four (assuming a string starting "http://...").
The most common solution people use for this is to use the "lazy" +?
quantifier instead, however whilst this is significantly faster it will still
involve unnecessary backtracking. Lazy quantifiers will match one unit at a
time, then immediately say "ok, what's next", and (when the next instruction
fails) control is passed back, an additional unit is matched, and then it
passes control forward again. For this example that's not going to matter much,
but for longer strings this can be just as much as issue as the .+
originally
used.
The improved solution is not to use .
but to be more specific with what is
expected to be found. Using either [^:]+
or \w+
will both prevent the
regex going all the way along the string, whilst also avoiding the continuous
backtracking that a lazy quantifier can cause.
See the following diagram for a more visible example of what is going on:
^.+(?=://)
Backtracks len(input)-len(match) times.
^.+?(?=://)
Backtracks len(match)-1 times.
^[^:]+(?=://)
Matches without backtracking.
Sometimes backtracking is unavoidable - without it regex would not be as useful as it is - but the key is to try to keep it to a minimum, and if it can be avoided altogether that's even better.
When using pipe "|
" to allow a number of alternatives, it is important to keep
in mind that regex does not even consider later options if an earlier one has
matched successfully.
The expression (car|cart|carpet)(\w*)
matched against the text "carpets" will
place "car" in group one and "pets" in group two.
In other regex implemenations (ones without backtracking), all three alternatives are matched in parallel and the longest match is used (which would result in "carpet" and "s").
If we changed the expression to (car|cart|carpet)([^p]\w*)
and still matched
against the text "carpets", the first group would still match "car", however
(when attempting to match the second group) the [^p]
would prevent a succesful
match, so the regex engine would backtrack, attempt "cart" which fails, then
attempt "carpet" which succeeds, then move forward to the second group again,
which would match the "s".
For this reason, you should order alternatives so the most likely to match is first and the least likely of the options is last.