In formal language theory, a regular expression is a construct that denotes a language, that is, a set of strings. It is built from the following symbols:
- literal characters that can appear in the string
- the Kleene star to express "0 or more times the preceding"
- the set union symbol to express alternatives
- brackets to group
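As a small illustration of the four building blocks, here is a minimal sketch that uses Python's re module as a stand-in for the formal notation. The pattern (ab|c)* and the helper name in_language are made up for this example; Python writes the union symbol as | and uses parentheses for grouping.

```python
import re

# The language denoted by (ab|c)*: every concatenation of the blocks
# "ab" and "c", including the empty string.
PATTERN = re.compile(r"(ab|c)*")

def in_language(s: str) -> bool:
    """Return True if the whole string s belongs to the language."""
    # fullmatch requires the pattern to cover the entire string,
    # which corresponds to set membership in the formal sense.
    return PATTERN.fullmatch(s) is not None

for s in ["", "ab", "c", "abcab", "ccab", "a", "ba", "abb"]:
    print(repr(s), in_language(s))
```

The first five strings are in the language; "a", "ba" and "abb" are not, because they cannot be split into "ab" and "c" blocks.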
By "regex" I mean Unix regular expressions, and in particular the C library implementing them that was written by Henry Spencer.
In Unix, the ed text editor features a regular expression language for search and replace. This was carried over to its successor ex/vi, the standalone search utility grep, and the standalone search-and-replace utility sed; from there, regular expressions found their way into later text processing utilities such as expr, awk, emacs and perl.
Basic Unix regexps omit the choice operator and add
- . (the wildcard character, matching any character)
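- character classes, with negation: [a-z0-9] (all lowercase letters plus all digits), [^4-6] (any character except the digits 4, 5, and 6); the ^ only negates when it is the first character inside the brackets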
- ^ (start of line) and $ (end of line)
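The three additions above can be tried out with the following sketch. It uses Python's re module, which is not one of the Unix tools but treats these particular constructs the same way for the inputs shown; the grep helper and the sample lines are hypothetical.

```python
import re

def grep(pattern, lines):
    """Return the lines in which pattern matches somewhere (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

lines = ["cat", "cut", "c4t", "cot!", "scat"]

# Wildcard, unanchored: "c", any one character, "t" anywhere in the line.
print(grep(r"c.t", lines))

# Anchored with ^ and $: the whole line must be exactly three characters.
print(grep(r"^c.t$", lines))

# Character class: the middle character must be a lowercase letter.
print(grep(r"^c[a-z]t$", lines))

# Negated class: the middle character must NOT be a lowercase letter.
print(grep(r"^c[^a-z]t$", lines))
```

Without anchors every sample line matches; with anchors only "cat", "cut" and "c4t" remain, and the two character classes split those into the letters ("cat", "cut") and the rest ("c4t").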
The egrep utility extends this with
- '+' (1 or more times the preceding)
- '?' (0 or 1 times the preceding)
- '|' (choice: match one or the other)
and other utilities add their own conventions. Perl has a particularly rich set of extensions that allow the craziest things, even including bracket matching. This makes Perl regexps much more expressive than basic regular expressions.
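The egrep operators '+' and '?' listed above are conveniences rather than new expressive power: each can be rewritten with the star and the choice operator. The sketch below checks this on a handful of sample strings, again using Python's re module as a stand-in for egrep. The helper same_language is hypothetical, and a finite sample is only a sanity check, not a proof.

```python
import re

def same_language(p, q, samples):
    """Check that patterns p and q accept exactly the same strings among samples."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab", "ba"]

# a+ is "one a, followed by zero or more a".
print(same_language(r"a+", r"aa*", samples))

# a? is "either a or the empty string".
print(same_language(r"a?", r"(a|)", samples))

# Sanity check of the checker itself: a+ and a* differ on the empty string.
print(same_language(r"a+", r"a*", samples))
```

The first two comparisons agree on every sample, while the last one does not, since a* accepts the empty string and a+ rejects it.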
The problem with regexps: this is Unix, so each utility has its own variant of them, and each variant of each utility has its own variant of that variant; sometimes, you can even configure the variant you want to use. For example, Solaris comes with two different sets of these utilities (in /bin and in /usr/xpg4/bin), and self-respecting sysadmins will add a third set (in /usr/gnu/bin). So in order to figure out the exact syntax of your regexps, you have to be very mindful of which variant of which tool you happen to be using, on which variant of Unix.
POSIX standardization tried to remedy this: it offers regex matching that is configurable with many binary flags to enable/disable features or change the syntax (on my Linux box, 'man regex' has the details). But many features added since then, for instance in Perl, are not covered by the standard.
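To give a feel for what "configurable with binary flags" means in practice, here is a loose analogy in Python. This is not the POSIX interface: the flag names re.IGNORECASE and re.MULTILINE are Python's own, and the helper find and the sample text are made up. The idea is the same, though: each flag is one bit, and flags are combined with bitwise OR to switch matching behaviour on or off.

```python
import re

TEXT = "Regex\nregexp\n"

def find(flags=0):
    """Find every occurrence of ^regex in TEXT under the given flags (hypothetical helper)."""
    return re.findall(r"^regex", TEXT, flags)

# No flags: matching is case-sensitive and ^ only matches at the very start.
print(find())

# One flag at a time.
print(find(re.IGNORECASE))   # ignore case, ^ still only at the start
print(find(re.MULTILINE))    # ^ also matches after each newline

# Flags are bits, so they combine with bitwise OR.
print(find(re.IGNORECASE | re.MULTILINE))
```

With no flags nothing matches; each single flag yields one match, and only the combination finds both lines.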
See also the expressive power of regular expressions.