BLOG

Regular Expression (regex) Sheet

01 Oct 2014, Posted by BoB in All Posts, Computer Talk

POSIX Character Class Definitions…

Value

Meaning

[:digit:] Only the digits 0 to 9
[:alnum:] Any alphanumeric character 0 to 9 OR A to Z or a to z.
[:alpha:] Any alpha character A to Z or a to z.
[:blank:] Space and TAB characters only.
[:xdigit:] Hexadecimal notation 0-9, A-F, a-f.
[:punct:] Punctuation symbols . , ” ‘ ? ! ; : # $ % & ( ) * + – / < > = @ [ ] \ ^ _ { } | ~
[:print:] Any printable character.
[:space:] Any whitespace characters (space, tab, NL, FF, VT, CR). Many system abbreviate as \s.
[:graph:] Exclude whitespace (SPACE, TAB). Many system abbreviate as \W.
[:upper:] Any alpha character A to Z.
[:lower:] Any alpha character a to z.
[:cntrl:] Control Characters NL CR LF TAB VT FF NUL SOH STX EXT EOT ENQ ACK SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2 IS3 IS4 DEL.

Common Extensions and Abbreviations…

Character Class Abbreviations

\d Match any character in the range 0 – 9 (equivalent of POSIX [:digit:])
\D Match any character NOT in the range 0 – 9 (equivalent of POSIX [^[:digit:]])
\s Match any whitespace characters (space, tab etc.). (equivalent of POSIX [:space:] EXCEPT VT is not recognized)
\S Match any character NOT whitespace (space, tab). (equivalent of POSIX [^[:space:]])
\w Match any character in the range 0 – 9, A – Z and a – z (equivalent of POSIX [:alnum:])
\W Match any character NOT the range 0 – 9, A – Z and a – z (equivalent of POSIX [^[:alnum:]])

Positional Abbreviations

\b Word boundary. Match any character(s) at the beginning (\bxx) and/or end (xx\b) of a word, thus \bton\b will find ton but not tons, but \bton will find tons.
\B Not word boundary. Match any character(s) NOT at the beginning(\Bxx) and/or end (xx\B) of a word, thus \Bton\B will find wantons but not tons, but ton\B will find both wantons and tons.

Iteration ‘metacharacters’…

Metacharacter

Meaning

? The ? (question mark) matches when the preceding character occurs 0 or 1 times only, for example, colou?r will find both color (u is found 0 times) and colour (u is found 1 time).
* The * (asterisk or star) matches when the preceding character occurs 0 or more times, for example, tre* will find tree (e is found 2 times) and tread (e is found 1 time) and trough (e is found 0 times).
+ The + (plus) matches when the preceding character occurs 1 or more times, for example, tre+ will find tree (e is found 2 times) and tread (e is found 1 time) but NOT trough (0 times).
{n} Matches when the preceding character, or character range, occurs n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Value is enclosed in braces (curly brackets).

Note: The – (dash) in this case, because it is outside the square brackets, is a literal. Louise Rains writes to say that it is invalid to commence a NXX code (the 123) with a zero (which would be permitted in the expression above). In this case the expression [1-9][0-9]{2}-[0-9]{4} would be necessary to find a valid local phone number.

{n,m} Matches when the preceding character occurs at least n times but not more than m times, for example, ba{2,3}b will find baab and baaab but NOT bab or baaaab. Values are enclosed in braces (curly brackets).
{n,} Matches when the preceding character occurs at least n times, for example, ba{2,}b will find ‘baab’, ‘baaab’ or ‘baaaab’ but NOT ‘bab’. Values are enclosed in braces (curly brackets).

Post a comment