
Character Classes

Character classes let you specify a set of characters from which a single character will be matched (unless a quantifier is used). They provide a notation that is significantly shorter than alternating between every character you want to allow.

Character class notation uses an almost completely different set of metacharacters from normal regex syntax, with only five characters that need to be escaped within a class to prevent them being treated as metacharacters.

The only standard regex constructs that apply inside a character class, other than nested classes, are encoded characters. These work in exactly the same way as a literal character inside a class, and so can be used in ranges.

A character class is formed with a pair of square brackets "[...]" within which a set of characters is provided, and the following rules apply.

Ranges

A class of "[abcdef0123456789]" will match a single character if it is a letter from "a" to "f" or a digit from "0" to "9", but to avoid listing all the letters, ranges can be used.

The above example can be condensed to "[a-f0-9]" and will still match the same thing.

This works for all characters (not just letters and numbers), based on their assigned numerical value in a character chart, so it is important to be aware of this order. (You can use charmap to see character order.)

You can also use encoded characters in ranges, so the ASCII hex encoded "[\x61-\x66\x30-\x39]" is equivalent to the "[a-f0-9]" version.
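As a rough check of the equivalence described above, the following sketch compares the listed, ranged and hex-encoded forms. It calls java.util.regex directly, on the assumption that patterns here behave as they do in Java; the class and method names are made up for illustration. Note that "\x61" to "\x66" are the codes for "a" to "f".

```java
import java.util.regex.Pattern;

final class RangeDemo {
    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        String[] classes = {
            "[abcdef0123456789]",        // every character listed
            "[a-f0-9]",                  // the same set written as ranges
            "[\\x61-\\x66\\x30-\\x39]"   // the same ranges, hex encoded
        };
        for (String c : classes) {
            System.out.println(c + "  f=" + accepts(c, "f")
                + "  7=" + accepts(c, "7") + "  g=" + accepts(c, "g"));
        }
    }
}
```

Each of the three classes should accept "f" and "7" and reject "g".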

It is especially important to remember that "[A-z]" will include the six characters which exist between "Z" and "a", which is probably not desired and can easily be overlooked. Using it is therefore not recommended, even if you do actually want those six characters.
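To see which characters are involved, this sketch walks the ASCII range and collects everything that "[A-z]" accepts but "[A-Za-z]" does not. It assumes java.util.regex behaviour, and the helper name is hypothetical.

```java
import java.util.regex.Pattern;

final class AzRangeDemo {
    // Collects the ASCII characters matched by [A-z] that are not letters.
    static String extraCharacters() {
        Pattern wide = Pattern.compile("[A-z]");
        Pattern letters = Pattern.compile("[A-Za-z]");
        StringBuilder extras = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (wide.matcher(s).matches() && !letters.matcher(s).matches()) {
                extras.append(c);
            }
        }
        return extras.toString();
    }

    public static void main(String[] args) {
        // The extras are the characters with codes 91 to 96.
        System.out.println(extraCharacters());
    }
}
```

The six extras are the opening bracket, backslash, closing bracket, caret, underscore and backtick.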

To match a literal hyphen "-" you should prefix it with a backslash "\-". Be aware, though, that a hyphen which is the first or last character in the class is also treated as a literal (since it cannot form a range).
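The hyphen rules can be sketched as follows, again assuming java.util.regex behaviour and using illustrative names. The first three classes treat the hyphen as a literal; the last one is an ordinary range, shown for contrast.

```java
import java.util.regex.Pattern;

final class HyphenDemo {
    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        String[] classes = {
            "[a\\-z]",  // escaped hyphen: a, hyphen or z
            "[-az]",    // leading hyphen is a literal
            "[az-]",    // trailing hyphen is a literal
            "[a-z]"     // unescaped hyphen in the middle forms a range
        };
        for (String c : classes) {
            System.out.println(c + "  hyphen=" + accepts(c, "-")
                + "  m=" + accepts(c, "m"));
        }
    }
}
```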

Negative Classes

If the first character in a class is a caret "^" it transforms the class into a negative character class.

A negative class works in exactly the same way as a normal non-negated class, except the class represents all the characters that should not be matched.

So "[^0-9]" will match any character that is not a digit (any character, including whitespace, control characters, and so on).

To match a literal caret, either escape with "\^" or do not place it as the first character.
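A small sketch of negation and the literal caret, assuming java.util.regex behaviour (the names are illustrative). Note that the negated class accepts a newline as well, unlike the usual default for ".".

```java
import java.util.regex.Pattern;

final class NegativeClassDemo {
    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        // Negated: anything except a digit, including whitespace.
        System.out.println(accepts("[^0-9]", "x"));   // letter
        System.out.println(accepts("[^0-9]", "\n"));  // newline
        System.out.println(accepts("[^0-9]", "5"));   // digit is rejected

        // Literal caret: escaped at the start, or placed anywhere else.
        System.out.println(accepts("[\\^0-9]", "^"));
        System.out.println(accepts("[0-9^]", "^"));
    }
}
```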

Nesting

Character classes can be nested. That is, you can do "[[a-f][0-9]]" and it will work (although this example doesn't have any benefit over "[a-f0-9]").

By default, classes are combined by union (adding the results together), thus "[[a-f][^a-c]]" is not equivalent to "[def]" but actually means "abcdef OR anything not abc", which results in any character being matched.

To combine nested classes with intersection (only the characters common to both classes are used), you can use the special metacharacter "&&" between the classes, so "[[a-f]&&[^a-c]]" is equivalent to "[def]".
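The difference between union and intersection can be checked with a sketch like this one, which assumes java.util.regex behaviour and uses made-up names. The union form accepts every input tried, while the intersection form accepts only "d", "e" and "f".

```java
import java.util.regex.Pattern;

final class NestingDemo {
    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        String union = "[[a-f][^a-c]]";          // a-f OR anything not a-c
        String intersection = "[[a-f]&&[^a-c]]"; // a-f AND not a-c
        for (String s : new String[] {"a", "d", "g", "!"}) {
            System.out.println(s + "  union=" + accepts(union, s)
                + "  intersection=" + accepts(intersection, s));
        }
    }
}
```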

Note that you do not need to escape a single ampersand "&", because the metacharacter only exists as a double ampersand. A double ampersand should not otherwise appear in a class, but if you do somehow have one that needs escaping, you can use "\&&".

Escaping

To include a "\", "[" or "]" inside a character class, they always need to be escaped as "\\", "\[" and "\]" respectively.

When a hyphen "-" is not first or last in a class, it must be escaped as "\-", and it is recommended to always escape it manually for greater maintainability.

If a caret "^" is the first character in a class, it creates a negative class, unless it is escaped with "\^". A caret that is not at the start of a class does not need to be escaped.

In certain situations, "&&" is a metacharacter, but a single ampersand does not need escaping. Similarly, "{" and "}" can occur as part of a metacharacter but do not themselves need escaping.

No error is returned from over-escaping inside a class; it simply reduces readability and may confuse people new to regex.

In summary, only the following five characters must be escaped to match their literal values inside a class: [ ^ - \ ]
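The summary can be illustrated with a class that holds all five characters in escaped form. This is a sketch assuming java.util.regex behaviour, with hypothetical names; remember that each regex backslash is doubled again inside a Java string literal. The second pair of classes shows over-escaping of punctuation, which is harmless.

```java
import java.util.regex.Pattern;

final class EscapingDemo {
    // The regex [\[\^\-\\\]] written as a Java string literal.
    static final String FIVE_LITERALS = "[\\[\\^\\-\\\\\\]]";

    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"[", "^", "-", "\\", "]", "a"}) {
            System.out.println(s + "  " + accepts(FIVE_LITERALS, s));
        }
        // Over-escaped and plain forms accept the same characters.
        System.out.println(accepts("[\\.\\+]", "+") + " " + accepts("[.+]", "+"));
    }
}
```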

Shorthand Classes

As you might imagine, there are a number of classes which would be used more frequently than others, and so these classes have shorthand notation to simplify patterns that use them.

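The three shorthand classes are "\d" for digits ("[0-9]"), "\w" for word characters ("[a-zA-Z0-9_]") and "\s" for whitespace (space, tab, newline and carriage return).

These three shortcuts all have negated character class variants too: "\D", "\W" and "\S".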

Since character classes can be nested, you can also nest these shorthand classes, so to match hexadecimal digits you can use "[\dA-F]", which is equivalent to "[[0-9][A-F]]".
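A quick sketch comparing the two forms, assuming java.util.regex behaviour (names are illustrative). As written, both classes accept only uppercase hexadecimal letters.

```java
import java.util.regex.Pattern;

final class ShorthandDemo {
    // True when the input is exactly one character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        String shorthand = "[\\dA-F]";   // shorthand nested in a class
        String nested = "[[0-9][A-F]]";  // the equivalent nested classes
        for (String s : new String[] {"7", "C", "G", "c"}) {
            System.out.println(s + "  shorthand=" + accepts(shorthand, s)
                + "  nested=" + accepts(nested, s));
        }
    }
}
```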

(The \s class technically includes two other characters, ASCII 11 (vertical tab) and ASCII 12 (form feed), which are also considered whitespace but are generally not used any more, so they are not listed above to avoid unnecessary complexity.)

Other Predefined Classes

In addition to the basic shorthand classes listed above, two other sets of convenience classes are defined: POSIX-compatible classes and Unicode category classes.

Both of these can be referenced using "\p{Code}" for the normal class, and "\P{Code}" for the negated variant.
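As an illustration of the "\p{Code}" and "\P{Code}" notation, this sketch uses two codes that exist in java.util.regex: the POSIX-style "Alpha" and the Unicode category "Lu" (uppercase letter). Whether exactly these codes appear on the pages referenced here is an assumption, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PredefinedClassDemo {
    // True when the input is exactly one character accepted by the pattern.
    static boolean accepts(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(accepts("\\p{Alpha}", "a"));      // alphabetic
        System.out.println(accepts("\\P{Alpha}", "1"));      // negated variant
        System.out.println(accepts("\\p{Lu}", "\u00C9"));    // uppercase E acute
        System.out.println(accepts("[\\p{Lu}\\d]", "3"));    // used inside a class
    }
}
```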

For full details of what codes are available and the characters they represent, see the individual POSIX-compatible classes and Unicode category classes pages.