Groups are useful when you have a sub-expression that should be treated as a single unit, so that it can either be captured or repeated.
There are three types of groups: capturing groups, non-capturing groups, and atomic groups.
A capturing group stores the text it matches, which can then be used as a backreference within the expression, or returned to be acted upon outside of the regex (such as in a replacement string or function).
To create a capturing group, simply enclose the sub-expression with parentheses:
(captured group)
Capturing groups can be nested. Their capture number is determined by the
position of their opening parenthesis, and captured content includes that of any
enclosed groups. That is, "(a(b)(c))((d)e)" matched against "abcde" results in
the five captured values "abc", "b", "c", "de", "d".
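The numbering rule above can be checked with a short sketch using java.util.regex. The class name NestedGroupsDemo and the helper captures are hypothetical names chosen for this illustration, not part of cfRegex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Hypothetical helper: returns the text captured by each numbered group,
    // in group-number order, or an empty list if the input does not match.
    static List<String> captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> values = new ArrayList<>();
        if (m.matches()) {
            // Group 0 is the whole match, so numbered groups start at 1.
            for (int i = 1; i <= m.groupCount(); i++) {
                values.add(m.group(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Groups are numbered by the position of their opening parenthesis.
        System.out.println(captures("(a(b)(c))((d)e)", "abcde"));
        // prints [abc, b, c, de, d]
    }
}
```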
When you want to refer to the value of a captured group, you use what is known
as a backreference, which is the group number preceded by a backslash. So for
group 1 you use "\1", for group 2 you use "\2", and so on. It is possible to have
over a hundred groups, but it is not recommended to actually use this many.
If you have a regex with more than a dozen captured groups then you should
consider whether there is a better way to do whatever you are doing.
It is important to remember that a backreference is equivalent to the literal
text which was captured by the group, not the instructions within it. (For
example, "([abc])\1" will match "aa" or "bb" or "cc", but not "ab" or
anything else.)
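As a minimal sketch of that difference, the following compares the backreference pattern with one that simply repeats the character class. The class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // "\\1" in Java source is the regex \1: the literal text captured by group 1.
    static final Pattern DOUBLED = Pattern.compile("([abc])\\1");

    // Hypothetical helper: true when the whole input is one letter repeated.
    static boolean isDoubled(String input) {
        return DOUBLED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDoubled("aa")); // true
        System.out.println(isDoubled("cc")); // true
        System.out.println(isDoubled("ab")); // false: \1 must repeat the captured "a"

        // Repeating the class instead re-evaluates the instructions, so "ab" matches.
        System.out.println(Pattern.matches("[abc][abc]", "ab")); // true
    }
}
```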
Some regex implementations support named capture groups, which make it easier to keep track of what's what. These named groups are also numbered according to their position.
cfRegex uses java.util.regex which (since Java 7) supports the capture syntax
"(?<name>...)" and the backreference syntax "\k<name>", where name is
alphanumeric, but must start with a letter.
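A short sketch of the Java syntax, calling java.util.regex directly rather than going through cfRegex. The group name word and the helper methods are hypothetical. In a Java replacement string, a named group is referenced as "${name}".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // The named group "word" is also group number 1.
    static final Pattern REPEATED =
        Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    // Hypothetical helper: returns the first immediately repeated word, or null.
    static String repeatedWord(String input) {
        Matcher m = REPEATED.matcher(input);
        return m.find() ? m.group("word") : null;
    }

    // Hypothetical helper: collapses each repeated pair into a single word.
    static String dedupe(String input) {
        return REPEATED.matcher(input).replaceAll("${word}");
    }

    public static void main(String[] args) {
        System.out.println(repeatedWord("it is is here")); // is
        System.out.println(dedupe("it is is here"));       // it is here
    }
}
```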
For comparison, Python's regex implementation uses a similar but different syntax: it captures with
"(?P<name>...)", and uses "(?P=name)" for backreferences within the expression and "\g<name>"
in replacement strings.
When you do not need the value of a group, but simply want to act upon it as a single item, you should use a non-capturing group.
(?:non-capturing group)
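To illustrate, this sketch compares the number of groups reported for a capturing and a non-capturing version of the same expression. Both match the same text, but only the first stores a value. The names used are assumptions for the example.

```java
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Hypothetical helper: number of capturing groups declared in a regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+c"));   // 1
        System.out.println(groupCount("(?:ab)+c")); // 0

        // The non-capturing group is still repeated as a single unit.
        System.out.println(Pattern.matches("(?:ab)+c", "ababc")); // true
    }
}
```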
You can also combine a non-capturing group with a mode flag, to apply a particular regex mode only to the expression within the group.
For example, if there is a place you need dot to match newline, but not for the
whole expression, then "(?s:.)" could be used.
Alternatively, you might have an expression which is case-insensitive except
for one small part; "(?-i:CASE IMPORTANT)" is a way to do that.
A non-capturing group with flags can still be repeated like any other group.
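The scoped flags can be sketched as follows. The patterns and method names here are made up for the illustration and assume the default java.util.regex behaviour, where dot does not match line terminators.

```java
import java.util.regex.Pattern;

class ModeFlagDemo {
    // Only the dot inside (?s:...) may match a newline; the final dot may not.
    static boolean dotAllInGroup(String input) {
        return Pattern.matches("a(?s:.)b.", input);
    }

    // Case-insensitive overall, but "WORLD" must appear in upper case.
    static final Pattern MIXED =
        Pattern.compile("hello (?-i:WORLD)", Pattern.CASE_INSENSITIVE);

    static boolean caseSensitivePart(String input) {
        return MIXED.matcher(input).matches();
    }

    // A group carrying a flag can be repeated like any other group.
    static boolean repeatedFlagGroup(String input) {
        return Pattern.matches("(?i:ab)+", input);
    }

    public static void main(String[] args) {
        System.out.println(dotAllInGroup("a\nbc"));   // true
        System.out.println(dotAllInGroup("a\nb\n"));  // false
        System.out.println(caseSensitivePart("HeLLo WORLD")); // true
        System.out.println(caseSensitivePart("hello world")); // false
        System.out.println(repeatedFlagGroup("abAB"));        // true
    }
}
```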
Atomic groups are also non-capturing but they go a step further than simply treating a sub-expression as a single item - they prevent the regex engine from backtracking inside the group (whilst a normal non-atomic group allows backtracking to re-evaluate its contents).
This is an advanced feature that can help improve performance, but you should fully understand what backtracking is - when you want it and when you don't - before attempting to use atomic groups.
(?>atomic group)
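A small sketch of the difference in behaviour, using a deliberately contrived pattern (not one from cfRegex). Once the atomic group has matched "bc", the engine cannot go back inside it to try the shorter alternative "b".

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // Normal group: the engine may backtrack from "bc" to the alternative "b".
    static boolean normal(String input) {
        return Pattern.matches("a(?:bc|b)c", input);
    }

    // Atomic group: whatever the group matched first is kept, with no retry.
    static boolean atomic(String input) {
        return Pattern.matches("a(?>bc|b)c", input);
    }

    public static void main(String[] args) {
        System.out.println(normal("abc"));  // true: backtracks to "b", then "c"
        System.out.println(atomic("abc"));  // false: "bc" is kept, no "c" is left
        System.out.println(atomic("abcc")); // true: "bc" then "c"
    }
}
```

The failed match on "abc" shows why atomic groups need care: they can change what an expression matches, not only how quickly it runs.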