Character classes allow you to specify a set of characters from which a single character will be matched (unless a quantifier is used), and to do this they provide a notation that is significantly shorter than having to alternate between all the possible characters you want to allow.
Character class notation uses an almost completely different set of metacharacters from normal regex syntax, and only five characters need to be escaped within a class to prevent them being treated as metacharacters.
The only standard regex constructs that apply inside a character class, other than nested classes, are encoded characters. These work in exactly the same way as a literal character inside a class, and so can be used in ranges.
A character class is formed with a pair of square brackets "[...]", within which
a set of characters is provided, and the following rules apply.
A class of "[abcdef0123456789]" will match a single character if it is a
letter from "a" to "f" or a digit from "0" to "9", but to avoid listing every
character, ranges can be used.
The above example can be condensed to "[a-f0-9]" and will still match the same
thing.
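As a rough check of the claim above, the following sketch compares the two classes over every ASCII character. It assumes the flavor described here behaves like Java's java.util.regex; the class name "RangeDemo" and its helpers are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // Illustrative helper: true when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when both classes accept exactly the same single ASCII characters.
    static boolean sameOverAscii(String first, String second) {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (matches(first, s) != matches(second, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The listed form and the ranged form accept the same characters.
        System.out.println(sameOverAscii("[abcdef0123456789]", "[a-f0-9]"));
        // "g" lies outside both ranges.
        System.out.println(matches("[a-f0-9]", "g"));
    }
}
```

Under that assumption, the first line printed is true and the second is false.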
This works for all characters (not just letters and numbers), based on their
assigned numerical value in a character chart, so it is important to be aware of
this order. (You can use charmap to see character order.)
You can also use encoded characters in ranges, so the ASCII hex encoded
"[\x61-\x66\x30-\x39]" is equivalent to the "[a-f0-9]" version.
It is especially important to remember that "[A-z]" will include the six
characters which exist between "Z" and "a", which is probably not desired and
can easily be overlooked, so using it is not recommended even if you do
actually want those six characters.
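The two preceding points, hex-encoded ranges and the extra characters inside "[A-z]", can be checked with a small sketch. It assumes Java's java.util.regex as the engine; "EncodedRangeDemo", "accepted" and "extras" are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class EncodedRangeDemo {
    // Returns every ASCII character accepted by a single-character class,
    // in ascending code order.
    static String accepted(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Characters accepted by [A-z] once the actual letters are removed.
    static String extras() {
        return accepted("[A-z]").replaceAll("[A-Za-z]", "");
    }

    public static void main(String[] args) {
        // \x61 is "a", \x66 is "f", \x30 is "0" and \x39 is "9"
        // (\x7A would be "z").
        System.out.println(accepted("[\\x61-\\x66\\x30-\\x39]"));
        System.out.println(accepted("[a-f0-9]"));
        // The six non-letter characters that sit between "Z" and "a".
        System.out.println(extras());
    }
}
```

With this engine the first two lines printed are identical, and the third lists the six characters: left bracket, backslash, right bracket, caret, underscore and backtick.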
To match a literal hyphen "-" you should prefix it with a backslash "\-",
though also be aware that if the first or last character in the class is a hyphen
it is treated as a literal too (since it cannot form a range).
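A minimal sketch of the three hyphen positions, again assuming Java's java.util.regex; "HyphenDemo" and "matches" are illustrative names.

```java
import java.util.regex.Pattern;

class HyphenDemo {
    // Illustrative helper: true when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped hyphen between two characters forms a range.
        System.out.println(matches("[a-z]", "m"));    // inside the range
        System.out.println(matches("[a-z]", "-"));    // hyphen is not literal here

        // Escaped hyphen is a literal: the class is "a", "-" or "z".
        System.out.println(matches("[a\\-z]", "-"));
        System.out.println(matches("[a\\-z]", "m"));

        // Hyphen as first or last character is also a literal.
        System.out.println(matches("[-az]", "-"));
        System.out.println(matches("[az-]", "-"));
    }
}
```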
If the first character in a class is a caret "^" it transforms the class into
a negative character class.
A negative class works in exactly the same way as a normal non-negated class, except the class represents all the characters that should not be matched.
So "[^0-9]" will match any character that is not a digit (any character at all,
including whitespace, control characters, and so on).
To match a literal caret, either escape it with "\^" or do not place it as
the first character.
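The behaviour of the caret in each position can be sketched as follows, assuming Java's java.util.regex; "NegationDemo" and "matches" are illustrative names.

```java
import java.util.regex.Pattern;

class NegationDemo {
    // Illustrative helper: true when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Leading caret negates: anything except a digit matches,
        // including a space and a newline.
        System.out.println(matches("[^0-9]", "x"));
        System.out.println(matches("[^0-9]", " "));
        System.out.println(matches("[^0-9]", "\n"));
        System.out.println(matches("[^0-9]", "5"));   // a digit, so no match

        // Escaped leading caret is a literal: the class is "^" or a digit.
        System.out.println(matches("[\\^0-9]", "^"));
        System.out.println(matches("[\\^0-9]", "x"));

        // Caret that is not first is a literal without escaping.
        System.out.println(matches("[0-9^]", "^"));
    }
}
```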
Character classes can be nested. That is, you can do "[[a-f][0-9]]" and it
will work (although this example doesn't have any benefit over "[a-f0-9]").
By default, classes are combined by union (adding the results together), thus
"[[a-f][^a-c]]" is not equivalent to "[def]" but actually means "abcdef OR
anything not abc", which results in any character being matched.
To combine nested classes with intersection (only the characters common to both
classes are used), you can use the special metacharacter "&&" between the
classes, so "[[a-f]&&[^a-c]]" is equivalent to "[def]".
Note that you do not need to escape a single ampersand "&" because the
metacharacter only exists as a double-ampersand (which should not otherwise
appear in a class; however, you can use "\&&" if you do somehow have a
double-ampersand to be escaped).
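Union, intersection and the single ampersand can be compared side by side. This is a sketch assuming Java's java.util.regex, which supports nested classes and "&&"; "CombineDemo" and "accepted" are hypothetical names.

```java
import java.util.regex.Pattern;

class CombineDemo {
    // Returns every ASCII character accepted by a single-character class,
    // in ascending code order.
    static String accepted(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Union: "a" to "f" OR anything that is not "a" to "c",
        // so all 128 ASCII characters are accepted.
        System.out.println(accepted("[[a-f][^a-c]]").length());

        // Intersection: only characters present in both classes remain.
        System.out.println(accepted("[[a-f]&&[^a-c]]"));

        // A single ampersand is an ordinary literal.
        System.out.println(accepted("[a&b]"));
    }
}
```

Only ASCII is sampled here to keep the output short; the union case is not limited to ASCII.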
To include a "\", "[" or "]" inside a character class, they always need
to be escaped as "\\", "\[" and "\]" respectively.
When a hyphen "-" is not first or last in a class, it must be escaped as "\-",
and it is recommended to always escape it manually for greater maintainability.
If a caret "^" is the first character in a class, it creates a negative class,
unless it is escaped with "\^". A caret that is not at the start of a class
does not need to be escaped.
In certain situations, "&&" is a metacharacter, but a single ampersand does
not need escaping. Similarly, "{" and "}" can occur as part of a
metacharacter but do not themselves need escaping.
No error is returned from over-escaping inside a class; it simply reduces readability and may confuse people new to regex.
In summary, only the following five characters must be escaped to match their
literal values inside a class: "[", "^", "-", "\" and "]".
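To see the five escapes together, the sketch below builds one class containing all of them and lists what it accepts. It assumes Java's java.util.regex, where each regex backslash must itself be doubled inside a string literal; "EscapeDemo", "FIVE" and "accepted" are illustrative names.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // One class holding all five escaped characters.
    // Each "\\" in the Java source is a single backslash in the regex.
    static final String FIVE =
            "["
            + "\\["    // literal [
            + "\\^"    // literal ^
            + "\\-"    // literal -
            + "\\\\"   // literal backslash
            + "\\]"    // literal ]
            + "]";

    // Returns every ASCII character accepted by a single-character class,
    // in ascending code order.
    static String accepted(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the five characters in code order: - [ \ ] ^
        System.out.println(accepted(FIVE));
    }
}
```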
As you might imagine, there are a number of classes which would be used more frequently than others, and so these classes have shorthand notation to simplify patterns that use them.
Instead of "[0-9]" you can use "\d" for a digit.
Instead of "[A-Za-z0-9_]" you can use "\w" for a word character.
Instead of "[\r\n\t ]" you can use "\s" for a whitespace character.
These three shortcuts all have negated character class variants too:
Instead of "[^0-9]" you can use "\D" for a non-digit.
Instead of "[^A-Za-z0-9_]" you can use "\W" for a non-word character.
Instead of "[^\r\n\t ]" you can use "\S" for a non-whitespace character.
Since character classes can be nested, you can also nest these shorthand classes,
so to match uppercase hexadecimal digits you can do "[\dA-F]", which is
equivalent to doing "[[0-9][A-F]]".
(The "\s" class technically includes two other characters, ASCII 11 (vertical tab)
and ASCII 12 (form feed), which are also considered whitespace but are generally
no longer used, so they are not listed above to avoid unnecessary complexity.)
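The shorthand equivalences, and the two extra whitespace characters, can be checked over the ASCII range. This is a sketch assuming Java's java.util.regex with default flags (where the shorthands are ASCII-only); "ShorthandDemo", "sameOverAscii" and "codes" are hypothetical names.

```java
import java.util.regex.Pattern;

class ShorthandDemo {
    // True when both classes accept exactly the same single ASCII characters.
    static boolean sameOverAscii(String first, String second) {
        Pattern a = Pattern.compile(first);
        Pattern b = Pattern.compile(second);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    // Space-separated ASCII codes of the characters accepted by the class.
    static String codes(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf((char) c)).matches()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sameOverAscii("\\d", "[0-9]"));
        System.out.println(sameOverAscii("\\w", "[A-Za-z0-9_]"));
        System.out.println(sameOverAscii("\\W", "[^A-Za-z0-9_]"));
        // A shorthand nested in a class, compared with the fully nested form.
        System.out.println(sameOverAscii("[\\dA-F]", "[[0-9][A-F]]"));
        // The simplified whitespace list versus the real shorthand:
        // \s also accepts codes 11 (vertical tab) and 12 (form feed).
        System.out.println(codes("[\\r\\n\\t ]"));
        System.out.println(codes("\\s"));
    }
}
```

Under these assumptions the four comparisons print true, and the last two lines differ only by codes 11 and 12.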
In addition to the basic shorthand classes listed above, two other sets of convenience classes are defined: POSIX-compatible classes and Unicode category classes.
Both of these can be referenced using "\p{Code}" for the normal class, and
"\P{Code}" for the negated variant.
For full details of what codes are available and the characters they represent, see the individual POSIX-compatible classes and Unicode category classes pages.
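As a brief illustration of the "\p{Code}" notation, the sketch below uses two codes that Java's java.util.regex accepts: "Lu" (the Unicode category for uppercase letters) and "Alpha" (a POSIX-style class that, with default flags, covers ASCII letters only). That these match the codes on the pages referred to above is an assumption; "PropertyDemo" and "matches" are illustrative names.

```java
import java.util.regex.Pattern;

class PropertyDemo {
    // Illustrative helper: true when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unicode category: uppercase letters, not limited to ASCII.
        System.out.println(matches("\\p{Lu}", "A"));
        System.out.println(matches("\\p{Lu}", "\u00C9"));   // E with acute accent
        System.out.println(matches("\\p{Lu}", "a"));

        // Negated variant: anything that is not an uppercase letter.
        System.out.println(matches("\\P{Lu}", "a"));

        // POSIX-style class: ASCII letters only with default flags.
        System.out.println(matches("\\p{Alpha}", "a"));
        System.out.println(matches("\\p{Alpha}", "\u00C9"));
    }
}
```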