When working with regex it is useful to remember that matching can deal with positions as well as (or instead of) matching actual characters. A position is the 'gap' between characters, and is sometimes referred to as a zero-length match.
There are two main ways to match positions - by using pre-defined "boundary" metacharacters, or by using ad-hoc "lookaround" expressions.
To match the start of the input text - the position before the first character -
you can use either "\A" or "^". The former will only ever match this position,
but when multiline mode is enabled, "^" will additionally match the start of
each line.
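As a minimal sketch of the difference, assuming the underlying engine behaves like Java's java.util.regex (the class name StartAnchors and the helper positions are illustrative only, not part of cfRegex), the following lists every position at which each pattern matches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartAnchors {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree";
        // \A is unaffected by multiline mode: only the start of input matches.
        System.out.println(positions("\\A", Pattern.MULTILINE, input)); // [0]
        // ^ without multiline mode: only the start of input matches.
        System.out.println(positions("^", 0, input));                   // [0]
        // ^ with multiline mode: the start of every line matches.
        System.out.println(positions("^", Pattern.MULTILINE, input));   // [0, 4, 8]
    }
}
```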
To match the end of the input text - the position after the last character - you
can use either "\z" or "$". The former will only ever match the end of input,
whilst the latter can also match the end of lines, if multiline mode is enabled.
There is also "\Z", which matches the end of input but will also match at the
position before a trailing newline, if there is one.
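The three end-of-input options can be compared side by side. This is a sketch only, assuming java.util.regex semantics; EndAnchors and positions are hypothetical names used for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchors {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\n"; // 8 characters, ending in a newline
        // \z matches only at the very end of the input.
        System.out.println(positions("\\z", 0, input));                // [8]
        // \Z also matches just before the trailing newline.
        System.out.println(positions("\\Z", 0, input));                // [7, 8]
        // $ in multiline mode matches at the end of every line as well.
        System.out.println(positions("$", Pattern.MULTILINE, input));  // [3, 7, 8]
    }
}
```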
When in multiline mode, you can use caret "^" to match the start of the line,
which is defined as the position after a newline.
What is considered a newline can be altered with Unix Lines mode, which allows you to include or exclude carriage returns. However, this will only affect the start of line position if there are individual carriage returns, since a carriage return paired with a line feed comes at the end of the line, not the start.
With multiline mode disabled, there is no explicit start of line metacharacter,
though a positive lookbehind can be used, i.e. "(?<=\n)".
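The effect of Unix Lines mode on start of line positions can be sketched as follows, assuming java.util.regex semantics (StartOfLine and positions are hypothetical names). The input contains one lone carriage return and one carriage return paired with a line feed. Note that the lookbehind alternative only matches after a line feed, so it does not match at the start of the input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartOfLine {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a\rb\r\nc"; // lone \r after "a", paired \r\n after "b"
        // Multiline only: a lone \r and a \r\n pair both start a new line.
        System.out.println(positions("^", Pattern.MULTILINE, input));                      // [0, 2, 5]
        // Multiline plus Unix Lines: only \n starts a new line.
        System.out.println(positions("^", Pattern.MULTILINE | Pattern.UNIX_LINES, input)); // [0, 5]
        // Lookbehind alternative with multiline mode disabled.
        System.out.println(positions("(?<=\\n)", 0, input));                               // [5]
    }
}
```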
When in multiline mode, you can use dollar "$" to match the end of the line,
which is defined as the position before any newline.
Whether carriage returns are considered part of a newline can be controlled with Unix Lines mode. When it is enabled, only the line feed character is considered a newline, so the position matched will be after any carriage return that might otherwise be paired with it.
With multiline mode disabled, there is no explicit end of line metacharacter,
though a positive lookahead can be used, i.e. "(?=\n)".
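A short sketch of how Unix Lines mode moves the end of line position, again assuming java.util.regex semantics (EndOfLine and positions are hypothetical names used only for this illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndOfLine {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a\r\nb"; // "a", carriage return, line feed, "b"
        // Multiline only: end of line is before the \r\n pair.
        System.out.println(positions("$", Pattern.MULTILINE, input));                      // [1, 4]
        // Multiline plus Unix Lines: end of line is after the \r, before the \n.
        System.out.println(positions("$", Pattern.MULTILINE | Pattern.UNIX_LINES, input)); // [2, 4]
        // Lookahead alternative with multiline mode disabled.
        System.out.println(positions("(?=\\n)", 0, input));                                // [2]
    }
}
```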
A word boundary is slightly different to what you might expect. It does not
match whitespace between words (remember, whitespace is characters, and we're
dealing in positions). Instead, the "\b" word boundary metacharacter matches a
change between a word character and a non-word character.
There is a word boundary position between the two characters "a-" and also between "-b", but there is not a word boundary between "ab" nor is there one between "--".
Whilst some regex implementations have distinct "start word" and "end word"
boundaries, the engine used by cfRegex does not differentiate them. You can
work around this by using lookarounds to imitate start of word "(?<!\w)(?=\w)"
and end of word "(?<=\w)(?!\w)".
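The two workarounds can be tried out with a sketch like the one below, which assumes java.util.regex semantics (WordEdges and positions are hypothetical names). It shows that each lookaround pair picks out only half of the positions that a plain word boundary would match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEdges {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "ab--cd";
        // Start of word: no word character behind, a word character ahead.
        System.out.println(positions("(?<!\\w)(?=\\w)", input)); // [0, 4]
        // End of word: a word character behind, no word character ahead.
        System.out.println(positions("(?<=\\w)(?!\\w)", input)); // [2, 6]
    }
}
```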
You can match the opposite of a word boundary using "\B", which will match
between "ab" and between "--" but not between "a-" nor "-b".
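To see both metacharacters on the same input, here is a minimal sketch assuming java.util.regex semantics (WordBoundaries and positions are hypothetical names). Every position in the input is matched by exactly one of the two:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaries {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "ab--cd";
        // \b: positions where a word character meets a non-word character
        // (the start and end of the input count as non-word here).
        System.out.println(positions("\\b", input)); // [0, 2, 4, 6]
        // \B: every other position, e.g. inside "ab" and inside "--".
        System.out.println(positions("\\B", input)); // [1, 3, 5]
    }
}
```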
When you need to match an ad-hoc position, you can use lookarounds. A lookaround
lets you use a sub-expression to indicate the position that can match. For
lookaheads you have the full regex syntax available to you. For lookbehinds you
can only use limited-width quantifiers (that is, the standard "*", "+", and
variants are unavailable, since they do not have a maximum width).
As you might guess from the name, lookarounds do not actually match anything - they simply look at what is ahead or behind and determine if their sub-expression will match or not, and either succeed (and let matching continue) or fail (and the match fails).
Lookarounds can be either positive (their sub-expression must match to succeed), or negative (their sub-expression must not match to succeed), which - combined with lookahead and lookbehind - gives four different lookarounds in total:
(?=...) positive lookahead
(?!...) negative lookahead
(?<=...) positive lookbehind
(?<!...) negative lookbehind
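Each of the four forms can be illustrated with a small sketch, assuming java.util.regex semantics (Lookarounds and positions are hypothetical names, and the sample input is made up for the illustration). In every case the lookaround restricts where the surrounding text may match without becoming part of the match itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Lookarounds {
    // Hypothetical helper: returns the start index of every match of the regex.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "foobar bazbar";
        // Positive lookahead: "ba" only when followed by "r".
        System.out.println(positions("ba(?=r)", input));     // [3, 10]
        // Negative lookahead: "ba" only when not followed by "r".
        System.out.println(positions("ba(?!r)", input));     // [7]
        // Positive lookbehind: "bar" only when preceded by "foo".
        System.out.println(positions("(?<=foo)bar", input)); // [3]
        // Negative lookbehind: "bar" only when not preceded by "foo".
        System.out.println(positions("(?<!foo)bar", input)); // [10]
    }
}
```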
It is useful to remember that - since lookarounds do not consume characters -
they can be "stacked" to allow for a combination of conditions (which might be
less maintainable if expressed all together). For example, you can use
"(?=\w)(?!x)" to match the position before any word character except the
letter "x". As a single lookahead this would need to be "(?=[A-Za-wyz0-9_])",
which is obviously more long-winded.
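The equivalence of the stacked and single forms can be checked with a sketch like this one, assuming java.util.regex semantics (StackedLookaheads and positions are hypothetical names, and the sample input is made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StackedLookaheads {
    // Hypothetical helper: returns the index of every position where the regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "ax-Xy_";
        // Stacked: both lookaheads are tested at the same position.
        System.out.println(positions("(?=\\w)(?!x)", input));        // [0, 3, 4, 5]
        // Single lookahead with the exclusion written into the character class.
        System.out.println(positions("(?=[A-Za-wyz0-9_])", input));  // [0, 3, 4, 5]
    }
}
```

Both patterns skip the position before the lowercase "x" and the position before the hyphen, but still match before the uppercase "X", since only the lowercase letter is excluded.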