A regular expression is a string of characters that defines a set of one or more other strings.
Any string that is defined by a regular expression is said to match that expression.

Regular expressions are implemented by a number of different languages and tools, and unfortunately each implementation tends to be slightly different. This writeup attempts to be a general overview of REs.

Delimiters
A delimiter is a special character that is used to mark the beginning and end of a regular expression. A common delimiter is /.

The most basic regular expression contains no special characters other than the delimiter and matches only itself
For example
/ring/ matches ring as in spring, ringing, stringing

To get a regular expression to match more than one string, you use special characters that have special meaning when part of a regular expression.
The following list gives the special characters and their meanings, together with examples. In each example only part of the sample text is matched. Oh, and the examples build on each other.

. ( period )
Matches any single character
For Example /.alk/ matches all strings with any character preceding 'alk'
as in balk or talking

* ( asterisk)
An asterisk will match zero or more occurrences of the character directly before it. Note that the character directly before it can be defined by a regular expression.
For example /ab.*c/ matches ab followed by zero or more occurrences of any character followed by c
as in abc or abjhgt gfafdg 43543 fgd c

^ ( caret )
Causes the regular expression to only match strings at the beginning of a line
For example /^T/ matches a T at the start of a line
as in the first character of
This line
but not a T that appears later in a line

$ ( dollar sign )
Causes the regular expression to only match strings at the end of a line
For example /:$/ matches any colon that ends a line
as in
this line:
but not : this one

[] ( square brackets )
define a character class that matches any single character within the brackets.
For Example /t[aeiou].k/ matches t followed by a lower case vowel, any character and a k
as in talk or stink or teak
Within square brackets, *, / and $ lose their special meanings.
If the first character following [ is a ^, it has a new meaning: the character class now matches any single character not within the brackets.
You can also use a - to denote a range of characters.
For Example /[^a-zA-Z]/ matches any single character that is not a letter
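
The special characters so far can be tried out in most regex-capable languages. What follows is a minimal sketch in Python, assuming its standard re module; the sample string "abc9def" is made up for illustration, and the / delimiters are dropped because Python takes the pattern as a plain string.

```python
import re

# (pattern, sample text) pairs taken from the examples above.
examples = [
    (r".alk", "talking"),
    (r"ab.*c", "abjhgt gfafdg 43543 fgd c"),
    (r"t[aeiou].k", "stink"),
    (r"[^a-zA-Z]", "abc9def"),  # made-up sample string
]

# re.search returns the leftmost match, or None if there is none.
matched = {pattern: re.search(pattern, text).group(0) for pattern, text in examples}

for pattern, text in matched.items():
    print(f"/{pattern}/ matched {text!r}")
```

Each line of output shows the part of the sample that the pattern actually covered, which is usually shorter than the whole word.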

Turning Special Characters Off
You can turn a special character off by preceding it with a \ ( backslash ). This is known as quoting
For Example /\*/ matches a single asterisk
and /\\/ matches a single backslash
and /and\/or/ matches and/or

Longest Match Possible
A regular expression will always match the longest match possible
For example, given the following string:
This (Dman) is a quite ( opinionated young fellow ), isn't he?

/Th.*is/ matches This (Dman) is a quite ( opinionated young fellow ), is
and /(.*)/ matches (Dman) is a quite ( opinionated young fellow )
while /([^)]*)/ matches (Dman)


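NOTE YMMV: For example, in Perl the Longest Match Possible rule doesn't always hold: quantifiers such as * are still greedy, but when a pattern offers alternatives Perl takes the first one that matches rather than the longest. Also, ( is a special character in Perl's implementation, so it would need to be escaped using \.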
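
The three matches above can be reproduced with a short sketch. It assumes Python's re module, which follows the Perl style, so the literal parentheses have to be escaped with a backslash; the variable names are only illustrative.

```python
import re

text = "This (Dman) is a quite ( opinionated young fellow ), isn't he?"

# .* is greedy: it grabs as much as it can and then backs off to the last "is".
greedy_this = re.search(r"Th.*is", text).group(0)

# Greedy again: from the first "(" to the last ")".
greedy_parens = re.search(r"\(.*\)", text).group(0)

# [^)]* cannot run past a ")", so the match stops at the first one.
bounded_parens = re.search(r"\([^)]*\)", text).group(0)

print(greedy_this)
print(greedy_parens)
print(bounded_parens)
```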

Regular old search/replace not doing it for you? Starting out with Perl? Bored? Here, try this...

Quick and Dirty Regular Expression Guide

Basic

The most basic Regular Expression contains only the text you are looking for.

Example:

Bag

will match all occurrences of "Bag" in your document. Regular Expressions are by nature case sensitive, so this example will not match "bag", "bAg" or "baG".


^ -- Beginning of Line 



The "^" character represents the beginning of a line (unless used in a Character Class, see below).

Example:


^Bag

will only match occurrences of "Bag" if they are located at the beginning of a line.



$ -- End of Line



The "$" character represents the end of a line.

Example:


Bag$

will only match occurrences of "Bag" if they are located at the end of a line.

You can combine Regular Expressions to make a larger Regular Expression.

Example:

^Bag$

will match everywhere "Bag" is the only thing on the line.

 

. -- Any Single Character



The "." character represents any single character.

Example:

B.g

will match things like "Bag", "Bog", "Bxg", "B:g", "B g", etc. It will not match "Baag" because there are two "a" characters between the "B" and the "g"; but:

B..g

Will match "Baag".



[ ] -- Character Class



A character class is used to define what the one character at that location can be by supplying a list of acceptable characters.

Example:

B[aiu]g

will only match "Bag", "Big", and "Bug". It will not match "Baug" because "au" takes up two character locations. There are three shorthand list notations that can be used inside a character class:

a-z All lower case letters
A-Z All upper case letters
0-9 All numerals

Example:


B[a-z0-9]g

will match "Big", "B5g", but not "BAg" because of the upper case "A".

Another feature of the character class is the "^" complement operator; if "^" is the first character in the list, the character class will match all characters NOT in the list.

Example:

B[^a-z]g

will not match any three letter words starting with "B", ending with "g", and having a lower case middle letter. It will however match "BAg", "B9g", "B g", and "B:g".
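
A quick way to check the character class examples is to run them over a small list of candidate words. This is a sketch in Python (an assumption; the guide itself is tool-neutral), using re.fullmatch so that the whole word has to fit the pattern.

```python
import re

words = ["Bag", "Big", "Bug", "Baug", "B5g", "BAg", "B g", "B:g"]

def matching(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

listed = matching(r"B[aiu]g")             # only a, i or u in the middle
lower_or_digit = matching(r"B[a-z0-9]g")  # any lower case letter or numeral
negated = matching(r"B[^a-z]g")           # anything except a lower case letter

print(listed)
print(lower_or_digit)
print(negated)
```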



\ -- The "Escape" Character



The "\" character has a couple of uses. The first is interpreted as "take the next character literally".

Example: 

To illustrate, let's say you're editing an *.ini file with a "Bug" section in it and need to match the "Bug" section header. If we just made a regular expression of:

[Bug]

the "[" and "]" would be interpreted as a character class causing the regular expression to look for any single character that is a "B", "u", or "g". Placing a "\" before the "[" and "]":

\[Bug\]

causes the regular expression to be interpreted as we would like; to look for a "[" followed by "B", "u", "g", and "]".

When followed by certain characters, the "\" and character pair have special interpretation:

\\ The "\" character
\n End of Line character
\t Tab character
\b Backspace character (Control-H)
\r Carriage return
\f Form feed

Also "\x" followed by a hexadecimal number can be used to represent any character.

Example:

\x0A

is a line feed character.



+, *, ? -- Iteration



These three operators ("+", "*", "?") are used to define the number of occurrences of the preceding expression. If an expression is followed by a "+", it will match one or more occurrences of that expression.

Example:

^.+$

will match any line containing at least one character. Likewise, an expression followed by a "*" will match zero or more occurrences of that expression. Therefore,

^.*$

will match any line whether it contains a character or not. Also, an expression followed by a "?" will match zero or one occurrences of that expression. So,

^.?$

will match any line that either contains one character or doesn't contain any characters.

These operators are most commonly used after Character Classes.

Example:

B[ai]?g

Which will only match "Bg", "Bag" and "Big".
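
The three iteration operators can be compared side by side. The sketch below assumes Python's re module and a made-up list of three "lines" (empty, one character, two characters).

```python
import re

lines = ["", "x", "xy"]  # an empty line, a one-character line, a two-character line

one_or_more = [line for line in lines if re.search(r"^.+$", line)]
zero_or_more = [line for line in lines if re.search(r"^.*$", line)]
zero_or_one = [line for line in lines if re.search(r"^.?$", line)]

words = ["Bg", "Bag", "Big", "Bug", "Baig"]
optional_vowel = [w for w in words if re.fullmatch(r"B[ai]?g", w)]

print(one_or_more, zero_or_more, zero_or_one, optional_vowel)
```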



( ) -- Grouping


Any portion of a regular expression surrounded by parentheses ( "(" and ")" ) will be considered a group. This allows you to use items like "*", "+", "?", and "|" (discussed later in this document) on more than a single expression.

Example:

^(B[ai]g)?$

will match any blank line or any line containing only the word "Bag" or "Big". Another use of grouping would be the ability to use the matched group later on in a Search & Replace setting (described in the next section).

\n -- Group Reuse

Occasionally you might find the need to use the matched text from the search in your actual replacement string; \n allows you to do just that. A "\" followed by a number will put the group represented by that number into the location.

Example:

A "Find what:" statement of:

([a-zA-Z]+):([a-zA-Z]+)([^a-zA-Z])

and a "Replace with:" statement of:

\2:\1\3

will swap the location of any two words separated only by a colon.
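
The same Find/Replace pair can be tried outside an editor. This is a sketch using Python's re.sub, which happens to accept the same \1, \2, \3 notation in the replacement; the sample strings are made up.

```python
import re

find_what = r"([a-zA-Z]+):([a-zA-Z]+)([^a-zA-Z])"
replace_with = r"\2:\1\3"

swapped = re.sub(find_what, replace_with, "key:value;")

# The third group needs one non-letter after the second word,
# so a pair at the very end of the text is left alone.
unchanged = re.sub(find_what, replace_with, "key:value")

print(swapped)
print(unchanged)
```

The second call shows a side effect of the third group: without a trailing character there is nothing for it to match, so no swap happens.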

| -- "Or" Operator

Any two expressions separated by a "|" are treated as alternatives: the text matches if either one of the two expressions matches.

Example:

This prison serves ((bread)|(water))\.

will match lines describing very cruel prisons that only serve bread or only serve water! It will match the following two lines:

This prison serves bread.
This prison serves water.

But it will not match this line:

This prison serves breadwater
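
A small check of the prison example, sketched in Python (an assumption; any tool with "|" support should behave the same way on these three lines):

```python
import re

pattern = r"This prison serves ((bread)|(water))\."

lines = [
    "This prison serves bread.",
    "This prison serves water.",
    "This prison serves breadwater",
]

# The last line fails because "bread" must be followed directly by a period.
matched_lines = [line for line in lines if re.search(pattern, line)]
print(matched_lines)
```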


Complete Example



As noted previously, Regular Expressions can be combined to make one big Regular Expression. Here is a complete example of a Regular Expression used to find all "#define" statements in *.c files:

^[ \t]*#[ \t]*define[ \t]*[a-zA-Z_][a-zA-Z0-9_]*[^a-zA-Z0-9_]

This expression looks for, at the beginning of the line:

zero or more tabs and/or spaces followed by a "#" character,
followed by zero or more tabs and/or spaces,
followed by the word "define",
followed by zero or more spaces and/or tabs,
followed by one character that can be any alphabetic character or an "_",
followed by zero or more characters that are alphanumeric or "_",
followed by a character that is not alphanumeric or "_". (whew)
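
To see the expression at work, here is a sketch that runs it over a few made-up lines of C. It assumes Python's re module; the sample lines and the name DEFINE are illustrative only.

```python
import re

DEFINE = re.compile(r"^[ \t]*#[ \t]*define[ \t]*[a-zA-Z_][a-zA-Z0-9_]*[^a-zA-Z0-9_]")

source_lines = [
    "#define MAX_LEN 128",
    "  #  define\tDEBUG 1",
    "#include <stdio.h>",
    "int defined = 0;",
    "#define FLAG",  # no character after the name in this string
]

hits = [line for line in source_lines if DEFINE.search(line)]
print(hits)
```

Note that the last sample is not reported: the final part of the expression needs one more character after the name. In a real file the newline at the end of the line would normally supply it.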

Regular expressions in computer science describe regular languages.
Regular expressions and languages can be defined inductively:
Σ is our alphabet, a ∈ Σ
∅ is a regular expression over Σ and describes the regular language ∅
ε is a regular expression over Σ and describes the regular language {ε} (ε is the empty word, but ∅ is the empty set)
a is a regular expression over Σ and describes the regular language {a}
If r1 is a regular expression, then (r1)* is a regular expression, too (the * is the Kleene star). It describes the regular language R1*.
If r1 and r2 are regular expressions, then (r1 | r2) (alternative or OR) and (r1r2) (concatenation) are regular expressions, too. They describe the regular languages R1 ∪ R2 and R1R2.
Nothing else is a regular expression (Important! If you cannot derive an expression from these rules, it is not a regular expression).

Example: Σ = {a,b,c}
Regular expressions are for example a,b,c, aa, ba*,...
The corresponding languages are {a} (for a) or {b}{a}* (for ba*, as the Kleene star binds strongest).
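
The binding strength of the Kleene star can be made concrete by listing the short words of each language. The sketch below assumes Python and simply enumerates all words over Σ = {a,b,c} of length at most 3; practical regex engines go beyond the formal definition, but for these two expressions they agree with it.

```python
import re
from itertools import product

alphabet = "abc"

# All words over the alphabet with length 0 to 3, shortest first.
words = ["".join(p) for n in range(0, 4) for p in product(alphabet, repeat=n)]

# ba* means b followed by a*, because the star binds strongest.
b_then_a_star = [w for w in words if re.fullmatch(r"ba*", w)]

# (ba)* repeats the whole concatenation instead.
ba_repeated = [w for w in words if re.fullmatch(r"(ba)*", w)]

print(b_then_a_star)
print(ba_repeated)
```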

JavaScript 1.2 (found in Netscape 4.0, Internet Explorer 4.0, and Opera 5.0) allows the use of regular expressions, with a few twists.

The most annoying twist is that you can only follow your regexp with i (case insensitive) and/or g (global search) operators. e, m, o, s, and x are all unavailable, in any browser. This is generally because of limitations imposed by the JavaScript language itself, typically because it lets you end any line with a newline instead of a semicolon.

The absence of variable interpolation is the most annoying, since this is what allows you to assemble your pattern dynamically. In Perl, if you had a variable $vari which a user could modify, the pattern /$vari/g would match every occurrence of whatever pattern $vari contained. (No e is needed for this; in Perl, e applies only to s/// and evaluates the replacement as code.) Fortunately, JavaScript has a workaround.

A regular expression is treated like an object in JavaScript, and one can be assembled dynamically by feeding a pair of strings to a RegExp constructor. If you don't need a dynamic expression, you can feed it directly to a function, such as:

  str = str.replace(/sue/g, "bob");
If you had a variable string in JavaScript named vari, then
  reg = new RegExp(vari, "g");
will produce a regular expression to match every occurrence of vari in a string, and
  str = str.replace(reg, "bob");
would replace every occurrence of it in the string str with "bob".
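
For comparison, here is a sketch of the same build-the-pattern-at-run-time idea in Python rather than JavaScript. The variable vari and the sample sentences are hypothetical. One design choice worth noting: re.escape is applied so that a user-supplied string is matched literally instead of being read as a pattern; leave it out if the user is meant to type a regular expression.

```python
import re

vari = "sue"  # hypothetical user-supplied string

reg = re.compile(re.escape(vari))
result = reg.sub("bob", "sue met sue's friend")
print(result)

# With escaping, the dot in "a.b" only matches a real dot.
literal = re.sub(re.escape("a.b"), "X", "a.b axb")
print(literal)
```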

This recently came in handy when I was writing some JavaScript pattern matches for an EDev document and discovered that my need for brackets inside the regular expression made E2 try to add hardlinks in the middle of my code. So instead of

  str = str.replace(/[aeiou]/g, "y");
I would do the following:
  ob = String.fromCharCode(91);    // opening bracket
  cb = String.fromCharCode(93);    // closing bracket
  reg = new RegExp(ob+"aeiou"+cb, "g");
  str = str.replace(reg, "y");

A bit of a kludge, but it gets the job done.

Having two kinds of REs is a botch.

- regex(7) manual page in GNU systems
(via Henry Spencer's regex package)

Regular expressions are a type of pattern used to find text that matches them.

The name "regular expression" is a bit misleading. While they're certainly Expressions, they cannot really be called all that Regular! There are many different RE implementations, and each has slightly different rules, while still adhering to the standards. The implementations mostly differ on extensions, though; most old regular expression rules work just as well on new regex parsers.

There are several driving factors that guarantee with some level of certainty that at least some part of the regex syntax is supported.

Regular expressions have been standardised in the POSIX 1003.2 standard, which covers both the UNIX C API (defined in sys/types.h and regex.h, functions regcomp(), regexec(), regerror() and regfree()) and the actual regex syntax.

The POSIX standard defines modern, "extended" regular expressions, and obsolete, "basic" regular expressions. The main difference is that the extended regexes support all sorts of froody stuff like the |, + and ? things, bounds and nested expressions use different syntax, and ^$ refer to the beginning or end of the expression all the time.

Or so the theory goes.

There is something to remember about the general rules of portability: If the program supports regular expressions, throw anything that would pass egrep into it and see if it salutes. Then, be prepared for a shock and throw something that Perl would parse, and don't be disappointed if it doesn't...

There are systems that implement the regular expressions as mentioned in the spec. One example is the familiar "grep" tool. GNU grep, and undoubtedly any modern grep, uses old regexes normally and modern regexes with the switch -E (or if invoked as egrep). However, be aware that on some archaic greps, egrep doesn't exactly do everything that modern egreps do (for example, the ranges may still need backslashes, like in Emacs).

I'm talking here of two really important regex-using programs, Emacs editor and Perl programming language, because those are two of the forms I'm really familiar with.

Emacs is probably one of the most important editors I've ever worked with; it may be bloated, but dammit, at least the bloat is justified. =) It serves as an example of a program that doesn't follow the progress, without totally annoying the user. Perl, on the other hand, is my favorite programming language, has very good regular expression support and is one of the things that actually fuel the development of regular expressions - to the point that many systems are marketed as having "Perl 5 compatible regular expressions"!

First of all, the groups. In Emacs, the syntax is more or less modern when it comes to |, + and ?, except that | is actually \|. Ranges and groups are done the Old Way: a\{1,10\} matches anything from a to aaaaaaaaaa, and \(Foo\|Bar\) matches either Foo or Bar. Perl follows the new POSIX style: ranges are in the form a{1,10}, groups (Foo|Bar). See? The new regexes are more readable!

The POSIX standard defines "character classes"; for example, [0-9]* could also be written as [[:digit:]]*. And here come the extensions: GNU grep, for instance, adds \w and \W as synonyms for "word characters" and "non-word characters" ([_[:alnum:]] and [^_[:alnum:]]). Perl has a lot of handy backslash-preceded symbols that do matching, for example, \d to match any digit.

There's a vast difference between the standard and the actual things implemented, and differences between applications and versions of applications.

And who knows what the future will bring? For example, Perl 6 isn't even calling these things "regular expressions" any more, they're just "rules" and can define whole new nested grammars! Will the amazing parsing power of regexes amaze users even more in the future? Will they, as the predictions went, become self-aware and obliterate the lesser parsers in a /dev/nuclear war of epic scale?

Sources:
GNU grep(1) man page
GNU regex(7) man page
"Syntax of Regular Expressions", XEmacs 21.4 documentation

So what are they?

Regular expressions are a powerful way to search for text that matches a certain criterion, and optionally replace it or parts of it with other text. They are supported (at least, distinct variations, commonly known as different "flavours," are supported) by many different languages and programs, so after only slight tweaking, the regular expression you use in your PHP code can be used in vi. In this beginner's guide, I'll only cover searching, but even this should be enough to give you a glimpse of how versatile regular expressions are.

The main part of the search

You can search for any standard phrase using regular expressions. This is the most basic way of using a regular-expression-based program such as egrep, but can still be useful. For example, searching a list of animals for lion will bring up the following results:

lion
lioness
stallion

Note that it doesn't just search for that word, but any line containing the characters you specified in the right order. When using regular expressions in this way, all you have to remember is to escape any metacharacters (any characters that the program should not take literally) with a backslash.

Metacharacters: \/.^$?*+{}[](|)

As you have to escape any other metacharacters with a backslash, the backslash itself is also a metacharacter. If you wanted to search for the phrase and\or, you would need to type in and\\or. This applies to all the other metacharacters as well, so if you wanted to search for $10, you would need to type in \$10. Now you can search for any literal string of characters.

It's generally a good idea to let computers do the boring, repetitive work for you, so let's see what the metacharacters can do to make your life even easier. First is the dot, which matches any character. Try searching for mo.se. Your computer will find these matches:

moose
mouse

You can use as many of these as you like. Typing in .om.at will give you the following:

tomcat
wombat

If you remember what I said earlier about regular expressions only matching part of a line, you might think that the first dot is unnecessary, as either way, anything can come before the letter o. You'd be right. There is a subtle difference, however: putting a dot there means that there must be at least one character before the o, even though it can be anything. If any line that would otherwise match began with the o itself, it wouldn't count.

Another useful metacharacter is the caret (^), which means "the beginning of the line." If you search for ^lion then your computer will include lion and lioness on the list, but not stallion. Similarly, the dollar sign means "the end of the line," so a search for pig$ would give you pig and guineapig, but not pigeon.

You can use these together in any combination (as long as the caret only appears at the beginning of a line, and the dollar sign only appears at the end of it). For example, if you search for ^.at$, your computer will give you the following words:

bat
cat
rat

The next five metacharacters are called quantifiers. They tell the program how many instances of the last character (or group of characters, but we'll get to that later) it should match. The question mark means "zero or one," the asterisk means "zero, one or more" and the plus symbol means "one or more."

A good use of the question mark is when you're searching text that could use either British or American spelling. If you wanted to search for any instance of the word flavour or flavor, then you could combine them into a single search by typing in flavou?r. This means that the character directly before the question mark is optional, so both words will match.

It's worth noting that .* and .+ will match any run of characters, not just one character repeating several times. For example, using egrep to search for ^b.*bird$ will make it look for the following: the beginning of the line, the letter b, any number of any characters (including none), the letters b, i, r, d, then the end of the line. It will give you the following matching words:

blackbird
bluebird

The braces ({ and }) let you specify exactly how many characters you want to match. For example, ^l.{3}bird$ will match the beginning of a line, the letter l, any three characters, the letters b, i, r, d, then the end of the line. The following words match:

ladybird
lovebird

This can be taken one step further by putting two numbers in the braces, separated by a comma. The first is the minimum number of times the character must be matched, and the last number is the maximum number of times. For instance, Br{2,4}! will match Brr!, Brrr! and Brrrr!.
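
If you don't have egrep to hand, the searches so far can be reproduced with a few lines of code. This is a sketch in Python (an assumption; the guide itself uses egrep), run over the example wordlist given at the end of this guide. The helper name grep is made up for the occasion.

```python
import re

wordlist = [
    "bat", "blackbird", "bluebird", "cat", "groundhog", "guineapig",
    "ladybird", "lion", "lioness", "lovebird", "moose", "mouse", "pig",
    "pigeon", "rabbit", "rat", "stallion", "tomcat", "wombat",
]

def grep(pattern):
    """Return every word in which the pattern matches somewhere."""
    return [word for word in wordlist if re.search(pattern, word)]

print(grep(r"lion"))          # plain text
print(grep(r"mo.se"))         # the dot
print(grep(r"^.at$"))         # caret and dollar sign
print(grep(r"^b.*bird$"))     # the asterisk
print(grep(r"^l.{3}bird$"))   # braces with one number

# Braces with two numbers: between two and four r characters.
shivers = [s for s in ["Br!", "Brr!", "Brrr!", "Brrrr!", "Brrrrr!"]
           if re.fullmatch(r"Br{2,4}!", s)]
print(shivers)
```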

Any one of these single characters

The square brackets are used to group single characters together. Say, for instance, that you want to look for bat and cat but not rat. You can use the regular expression [bc]at to specify that either b or c, but nothing else, can precede a and t.

This part of the regular expression is called a character class, and has two metacharacters of its own. This time, however, you don't need to use the backslash to escape them.

Metacharacters: ^-

The first metacharacter in the character class is a caret. Although it usually means "the beginning of the line," here it means something else entirely. When placed at the very beginning of the character class, after the opening square bracket, it means that the following characters are the ones that must not appear. Searching for [^bc]at, for example, would find rat but not bat or cat. This is called a negated character class.

When placed at the beginning of the character class (or directly after the caret if it's a negated character class), the dash is literal. Otherwise, it is taken to mean "anything between these two characters." [a-z] matches any lowercase letter, [A-Z] matches any uppercase letter, and [0-9] matches any digit. These can be combined. For example, [A-Za-z] matches any letter at all, while [^A-Za-z] matches any character that isn't a letter.

Character classes can also be combined with quantifiers, which is where the fun really begins. For instance, you could use the regular expression [aeiou]{5} on a word list to find out which words contain five vowels in a row (which will return the word queueing).
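
Here is a short sketch of the character class examples, with the square brackets written out in full. It assumes Python's re module; the variable names are illustrative.

```python
import re

animals = ["bat", "cat", "rat"]

b_or_c = [w for w in animals if re.fullmatch(r"[bc]at", w)]
not_b_or_c = [w for w in animals if re.fullmatch(r"[^bc]at", w)]

# A character class followed by a quantifier: five vowels in a row.
five_vowels = re.search(r"[aeiou]{5}", "queueing").group(0)

print(b_or_c, not_b_or_c, five_vowels)
```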

Any one of these groups of characters

You can use normal brackets (parentheses) to specify a list of several groups of characters, any of which can be regarded as a match. These groups of characters are separated by the pipe symbol (|).

Metacharacters: |

Say that you want to find all instances of blackbird and bluebird, but no other birds beginning with the letter b that might be in the text you're searching. You can use the brackets and pipe symbol to make a list of exactly which groups of characters are allowed. The regular expression (blackbird|bluebird) would match just these two words, but there's a much more concise way of saying the same thing: bl(ack|ue)bird. This essentially tells the program exactly the same thing, but in a more efficient way. You can specify as many possibilities as you like, as long as they are grouped together by brackets and separated by pipes.

You can also use brackets to combine the grouped characters with a quantifier. This even works with a list of just one group of characters. For example, the regular expression pig(eon)? matches both pig and pigeon.

Putting it all together

Now you can put all of these ideas together. For a more geeky example, let's say you're searching some old text files for any mention of the Commodore 64 computer. It's called several things, mainly the Commodore 64, Commodore-64, Commodore64, C 64, C-64 and C64. It's possible that people might even have spelt it with a lowercase letter c.

To start with, you search for the letter C. As it can be either upper or lowercase, you use [Cc]. Next is the optional rest of the word, so you enclose the next characters in brackets to tell the program that they're to be treated as one entity, then use a question mark to indicate that this particular entity is optional: (ommodore)?. Either a space, a dash, or nothing at all comes next, so [- ]? is the logical choice (remember to keep the dash to the left). Last of all, you are confident that people will use the digits 64. Putting it all together gives you [Cc](ommodore)?[- ]?64. It looks complex when it's assembled together, but hopefully it shouldn't be difficult to make in the first place. Just remember to comment your code so that when you come back to it later, or someone else inherits your code, it isn't too difficult to work out what's going on.
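
As one way of following the advice about comments, here is a sketch of the finished expression, with its square brackets written out, in Python. It assumes the re.VERBOSE flag, which lets a pattern be spread over several lines with a comment on each piece; the sample names are the spellings listed above plus one made-up non-match.

```python
import re

C64 = re.compile(r"""
    [Cc]           # the letter C, upper or lowercase
    (ommodore)?    # the optional rest of the word
    [- ]?          # an optional dash or space (dash kept to the left)
    64             # the digits 64
""", re.VERBOSE)

names = [
    "Commodore 64", "Commodore-64", "Commodore64",
    "C 64", "C-64", "C64", "c64",
    "Amiga 500",  # made-up non-match
]

found = [name for name in names if C64.search(name)]
print(found)
```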

The next step

This is just a beginner's guide. Hopefully you should now have an appreciation of how useful regular expressions can be, and an appetite to learn more. A good first step is to download a free version of grep and a comprehensive word list. Setting yourself tasks like "find every word that contains all five vowels in order" can be an excellent way to practice your knowledge of regular expressions. In the longer term, a good book such as Jeffrey E. F. Friedl's Mastering Regular Expressions (published by the ubiquitous O'Reilly Media) can provide more in depth knowledge, including the particulars of each different flavour of regular expressions.

Example regular expressions

Here are some example regular expressions to help you on your way:

Regular expression                   Matches
[-_a-zA-Z0-9]+@[-_a-zA-Z0-9.]+       Any e-mail address
alt(\.[a-z0-9]+)+                    Any alt. newsgroup

The example wordlist

This is the wordlist used in my guide:

bat
blackbird
bluebird
cat
groundhog
guineapig
ladybird
lion
lioness
lovebird
moose
mouse
pig
pigeon
rabbit
rat
stallion
tomcat
wombat

Resources

(This guide officially lives on my homepage, at http://bytenoise.co.uk/A_Hacker%27s_Guide_to_Regular_Expressions.)

Before we begin, let's start with a relatively simple set of values. Generally, with a set of values this short, you would just go through and do the calculations yourself or with a calculator. Of course, that changes when you're dealing with thousands of rows of data from a database or, god forbid, a spreadsheet. But I have traveled both roads. I have had to sample test the 10,000 records until I was sure I had every variation. So while this may look like a simple list, it is deceptively so. Look at the regex it takes to accomplish the task. Before we get too serious, how about a nice little web-comic.

Stand back! I know regular expressions!

OK, first let's start with a string value of a variation of numbers, one per line. Nothing fancy, right?

$v='
	1/3
	1.3
    .3
	2 1/2
	12.25
    3.25
    7.6
	1 2/3
	3 4/5
	7.5
    0.01
	0.02
	0
	100
	12 1/3
	1000
';

Since we are only dealing with numbers (a task uncommon with real data) we get a break. Let's use some pattern recognition to figure out the types of numbers we are up against. I was able to devise three types of numbers to develop patterns for. 1. We have our whole numbers. 2. We have the most varying numbers of the bunch, the fractions. 3. We have our decimal numbers.

Pattern 1

The Whole Numbers

The code for the whole numbers is simple. Remember to use parentheses to capture our values and save them for our calculations later.

(\d+) //match a run of one or more digits (the plus sign means one or more)

Pattern 2

The Fractional Numbers

Probably the most daunting looking pattern to match at first. You have a combination of any number of digits and the forward slash, which has to be escaped, but in reality it's not too bad. Remember that you may be getting simple fractions like 2/3 or the combination of whole and fractional numbers like 3 2/3, so your pattern must account for both. Again, parentheses to capture our values.

(\d* *\d+\/\d+) /*match zero or more digits (the asterisk means zero or more), zero or more spaces, one or more digits, 
a forward-slash (which is the delimiter of our regex, so if you want to treat it as a regular character, 
we escape it with a backslash. Backslash is used all over the place to escape characters so you better mind-meld with 
it or something), and then one or more digits.*/

Pattern 3

The Decimal Numbers

The period is a little tricky because it literally means "anything" in a regex pattern, so it has to be escaped if you are actually trying to match a period or decimal. Things get a little more complicated with these because we can have numbers on both sides of the decimal point or we can have no numbers on the left side if our recorder refuses to use a zero place-holder. Also, the numbers can be as big or as small as possible.

(\d{0,}\.\d+) // match all numbers of zero or more, then decimal, then digits of one or more

Pretty straightforward as long as you remember that "." matches "anything", and depending on whether you are using POSIX or PCRE or whatever, that can mean Unicode, and that is something way too big to get into in this writeup.

Putting our RegEx Together

Of course you know as a regex guru that the character | (not I) means or. So that's how we're going to link our patterns together. But I introduced the patterns I did in a certain order for a reason. If the whole number value was first it would match everything, every single little digit. So we go with our complicated patterns first and work our way down to the simple ones. Thus, our pattern becomes:

(\d{0,}\.\d+)|(\d* *\d+\/\d+)|(\d+)

But that's missing the all important /regex/ to really make a Regular Expression. More on that in a second.

We need a function to do something with these values we captured. This function just happens to be in PHP; don't worry, the hate will die down as we get into other languages.

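// Collect every match of $pattern found in $value.
function matchValues($value, $pattern){
	// PREG_PATTERN_ORDER: $matches[0] holds the full matches, $matches[1] to $matches[3] the capture groups
	preg_match_all("/$pattern/", $value, $matches, PREG_PATTERN_ORDER);
	return $matches;
}

// Single quotes keep the backslashes exactly as written.
$pattern = '(\d{0,}\.\d+)|(\d* *\d+\/\d+)|(\d+)';
$m = matchValues($v,$pattern);
print_r($m);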

Bear in mind that this isn't a real-world example, just some code to show you how to grab all the various values, and I came up with as many different kinds as I could think of. I also chose the flag PREG_PATTERN_ORDER, because it assembles all our full matches in the first array. You may want to read up on preg_match_all. And I just added print_r at the end in case you want to run it and see how it grabs the values. Once you have the values you can do what you want with them. Convert them to similar types, add, multiply, take your pick. I just wanted to show how even the simplest of values can have tricky regular expressions.
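
As a sketch of the "convert them to similar types" step, here is the same three-way pattern in Python rather than PHP, turning every captured value into an exact fraction. Several things here are assumptions made for illustration: the helper to_fraction, the shortened sample text, and the small rewrites of the pattern (\d* in place of \d{0,}, and an unescaped / because Python has no regex delimiter).

```python
import re
from fractions import Fraction

# Decimals first, then fractions, then whole numbers, as explained above.
NUMBER = re.compile(r"(\d*\.\d+)|(\d* *\d+/\d+)|(\d+)")

def to_fraction(token):
    """Convert '12.25', '.3', '1/3' or '2 1/2' to an exact Fraction."""
    token = token.strip()
    if "/" in token:
        # Split an optional whole part from the fractional part.
        whole, _, part = token.rpartition(" ")
        return (Fraction(whole) if whole.strip() else Fraction(0)) + Fraction(part)
    if token.startswith("."):
        token = "0" + token  # supply the missing zero place-holder
    return Fraction(token)

text = "1/3\n1.3\n.3\n2 1/2\n12.25\n0\n100\n12 1/3\n"

tokens = [m.group(0) for m in NUMBER.finditer(text)]
values = [to_fraction(t) for t in tokens]

print(tokens)
print(sum(values))
```

With everything in one type, adding or multiplying the values is straightforward.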

ADVANCED: proceed with caution

In a language like Perl or Javascript you don't treat a regular expression like a string that just happens to have forward-slashes at the beginning and the end. Unix tools, Perl, Javascript and a number of other languages use forward-slashes to delimit a regular expression. Javascript has a whole class for putting together a RegEx, which seems like overkill to me, but constructors and javascript are so easy they almost create themselves.

When searching for data you are almost always going to run into grep. Me, I prefer a Perl program called ack, which is way faster and more convenient, but that is not for this writeup. Often with grep, you just want to type a command, search all php files for include (for example), and you would type

grep include *.php

But let's say you have a file of arbitrary phone-number type data. There was no restriction on how it was entered so it's up to you to find all the phone numbers and format them into real numbers. I'll show you how to find the numbers. First, a short list of phone numbers in a file, phonenumber.

559-456-4214
526-699-9993
1-1234567893
(559)-456-4563

So then we use regex to find all our values. Of course our real data set is going to be much bigger, this is just for ease of example.

grep '[1-]*(*[0-9]\{3\})*-*[0-9]\{3\}-*[0-9]\{4\}' phonenumber
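
For comparison, here is a sketch of the same search in Python. Python's re module uses the extended-style syntax, so the braces are written without backslashes and a literal parenthesis is written \( or \). The last sample line is made up to show a non-match.

```python
import re

# Optional "1-" prefix, optional "(", three digits, optional ")",
# optional dash, three digits, optional dash, four digits.
PHONE = re.compile(r"[1-]*\(*[0-9]{3}\)*-*[0-9]{3}-*[0-9]{4}")

lines = [
    "559-456-4214",
    "526-699-9993",
    "1-1234567893",
    "(559)-456-4563",
    "no number here",  # made-up non-match
]

found = [m.group(0) for m in map(PHONE.search, lines) if m]
print(found)
```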

"Ohmigod!" You say, "What the hell is that?" Yes, regex can seem like cryptic voodoo, but it is incredibly powerful and I have built entire clean databases off of nothing but many, many regular expressions.

But let's get into Perl and do a little search and replace which is incredibly easy to do in Perl even if it looks a little cryptic at first.

s/foo/bar/gi;

Me, I have always thought of that s as a "search" abbreviation (it officially stands for "substitute"). What it will do is find all "foo" and replace it with "bar". Also, I want to introduce you to flags. What are those letters at the end of the sequence of symbols? What is the g and the i? Well, g stands for global, meaning that it will replace all instances it finds instead of just the first, which is the default behavior. i means: make our search case-insensitive. Very handy, that. In fact, we're going to use it to switch our original foo to BAR and then switch only the lowercase bar to foo. Since we have marked our special BAR with uppercase, we then just turn it lowercase again. There are a ton of ways to do that, but let's stick with regex for now. Here is the whole Perl file. (I almost forgot to mention the use of =~ to perform pattern matches, m//, and replaces, s///.) Once you get used to Perl's crazy syntax it becomes so much easier to type in s/search/replace/ than preg_match_all(pattern, subject, matches, FLAG).

$foo = "Bigfoot is the coolest monster and if you ever meet him at a bar buy him a foot-tall pint, and he is sure to thank you with a teeth-baring smile";
$foo =~ s/foo/BAR/gi;
$foo =~ s/bar/foo/g;
$foo =~ s/BAR/bar/g;
print $foo;

Now let's see what that print-out is on the last line.

Bigbart is the coolest monster and if you ever meet him at a foo buy him a bart-tall pint, and he is sure to thank you with a teeth-fooing smile
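
The same three-step swap can be sketched in Python, assuming its re.sub function, which replaces every occurrence by default (so no g is needed) and takes re.IGNORECASE in place of i:

```python
import re

text = ("Bigfoot is the coolest monster and if you ever meet him at a bar "
        "buy him a foot-tall pint, and he is sure to thank you with a "
        "teeth-baring smile")

text = re.sub(r"foo", "BAR", text, flags=re.IGNORECASE)  # every foo becomes BAR
text = re.sub(r"bar", "foo", text)                       # only lowercase bar becomes foo
text = re.sub(r"BAR", "bar", text)                       # the marked BAR becomes bar

print(text)
```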

That's all for today folks! I would read this after reading all the other writeups at the top and you're on your way to a good understanding and use of regular expressions.
