Regular Expression


What is a regular expression?

A regular expression (or regex or regexp) is a pattern used for searching text. A regex tells the computer what to look for: any piece of text that fits the pattern is a match.

Why do I care about those then?

Regular expressions are used all over the place in UNIX and Linux systems – usually but not always in command-line tools like grep and sed. There are regular expression search-and-replace engines available in most programming languages, including Perl, PHP, C and Java. Regex search and replace functions are available in emacs, sed and vi, amongst others.

The syntax of regular expressions is terse, obscure, daunting (at least at first), hard to debug, and fickle. It’s also possibly the most powerful tool you will ever add to your computing armoury.

Basic tools (what everyone should know)

The first principle of a regex is that “non-special” characters are just matched directly. So, to find all words in the system dictionary which contain the string “bed”, you would simply use:

$ grep bed /usr/share/dict/words

grep, like most regular expression handlers, is case sensitive, so the above command will find “bed”, but not “BED” or “Bed” or “bEd”, for example.
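As a minimal sketch of the case-sensitivity point, the snippet below pipes a small made-up word list into grep instead of reading the system dictionary. The variable names are purely illustrative, and the -i option (not mentioned above) is grep's switch for ignoring case.

```shell
# Hypothetical sample input standing in for /usr/share/dict/words.
words='bed
Bed
BED
embedded
bad'

# Plain characters match themselves, and case matters by default.
sensitive=$(printf '%s\n' "$words" | grep 'bed')
echo "$sensitive"      # prints bed and embedded, one per line

# -i asks grep to ignore case.
insensitive=$(printf '%s\n' "$words" | grep -i 'bed')
echo "$insensitive"    # prints bed, Bed, BED and embedded, one per line
```

Note that “embedded” is found too: grep reports any line that contains a match, not just lines that are equal to the pattern.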

Single characters

While in theory a plain string like “bed” is a regular expression, it’s not very powerful. For example, suppose you want to find all words in the dictionary which have a “b” and a “d” separated by one letter. In a regular expression, the “.” character is “special” (it’s a metacharacter in regex parlance). A “.” in a regular expression matches any one character. So, in our example, we’d use:

$ grep b.d /usr/share/dict/words

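A “.” in a regular expression matches any single character, including a whitespace character (like a space or a tab). It always needs exactly one character to match, though, so it won’t match nothing at all.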
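A quick way to see the dot in action is to feed grep a few hand-picked lines. This is only a sketch with made-up sample data; the third line has a space between the “b” and the “d”.

```shell
# Hypothetical sample lines; the third contains a space between b and d.
lines='bed
bxd
b d
bead
bd'

# "." must match exactly one character, so "bead" and "bd" are rejected.
dot_matches=$(printf '%s\n' "$lines" | grep 'b.d')
echo "$dot_matches"    # prints bed, bxd and "b d", one per line
```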

Character sets

Sometimes, you only want to match certain characters out of the total available. The [] metacharacters in regexes are designed for this. Putting a list of characters in the brackets will match any one of the characters in the list. For example, to find all the words which have a “b” and a “d” separated by a vowel, you would use:

$ grep 'b[aeiou]d' /usr/share/dict/words

You can specify ranges of contiguous ASCII values in []s, so to find all of the words with capital letters in them in the dictionary, you might use this:

$ grep '[A-Z]' /usr/share/dict/words

Or to find, say, all occurrences of “www” followed by a number in your Apache configuration, you’d use:

$ grep 'www[0-9]' /etc/apache/httpd.conf

(The quotes stop the shell from treating the square brackets as a filename wildcard before grep ever sees them.)
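The three character-set patterns can be tried on a made-up list like the one below. It is a sketch rather than a recipe: the sample lines are invented, and LC_ALL=C is set for the range test on the assumption that you want [A-Z] to mean exactly the 26 ASCII capitals (in some locales, ranges can behave differently).

```shell
# Hypothetical sample lines.
lines='bad
bud
byd
Paris
www2.example.com
www.example.com'

# A "b" and a "d" separated by a vowel.
vowel=$(printf '%s\n' "$lines" | grep 'b[aeiou]d')
echo "$vowel"       # prints bad and bud

# Any line containing an ASCII capital letter.
capitals=$(printf '%s\n' "$lines" | LC_ALL=C grep '[A-Z]')
echo "$capitals"    # prints Paris

# "www" followed by a digit.
numbered=$(printf '%s\n' "$lines" | grep 'www[0-9]')
echo "$numbered"    # prints www2.example.com
```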

Repetition

Now, the examples above are all very well, but they obviously don’t cover everything. For example, what if you wanted to find any words which contain a “b” and a “d”, separated by any number of letters? You might think of doing this:

$ grep bd /usr/share/dict/words
$ grep b.d /usr/share/dict/words
$ grep b..d /usr/share/dict/words
$ grep b...d /usr/share/dict/words
etc.

but that’s going to get very long and boring to do.

Instead, you can write this:

$ grep 'b.*d' /usr/share/dict/words

The * is another special character: it matches the previous item repeated zero or more times. This is important to realise: .* could match nothing at all, so “bd” would be picked up as matching in the example above.

* doesn’t just apply to the dot. It applies to whatever item comes immediately before it. So, if you want to find, say, anything like “bd”, “bad”, “baad”, “baaad”, etc., you would use something like this:

$ grep 'ba*d' myfile

Or, to match, say, any student ID of the form initials-year from 1999:

$ grep '[a-z]*-1999' studentlist

Note that in the last example, the * applies to the immediately preceding block (i.e. the [a-z]).
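Here is a small sketch of the three * patterns run against invented sample lines (the student IDs are made up). It also shows a consequence of “zero or more”: the last pattern accepts a line with no initials at all.

```shell
# Hypothetical sample lines.
lines='bd
bad
baaad
bread
dab
hrm-1999
-1999'

# "b", then anything (including nothing), then "d".
any_gap=$(printf '%s\n' "$lines" | grep 'b.*d')
echo "$any_gap"       # prints bd, bad, baaad and bread

# "b", then zero or more "a"s, then "d".
only_as=$(printf '%s\n' "$lines" | grep 'ba*d')
echo "$only_as"       # prints bd, bad and baaad

# The * applies to the whole [a-z] set, and zero letters is allowed.
student_ids=$(printf '%s\n' "$lines" | grep '[a-z]*-1999')
echo "$student_ids"   # prints hrm-1999 and -1999
```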

Some examples

 b.d
  • matches bed, bad and bxd

  • doesn’t match bead or bd

A few more strings to your bow

More repetition

Standard regular expressions offer two other forms of repetition. Where * means “none or more”, a plus + means “one or more”. Therefore, the following two expressions are equivalent:

 baa*d
 ba+d

The third form of item counting is the question mark ?, which means “none or one of”. So, the following expression:

 ba?d

matches bd and bad, but not baad.
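The following sketch checks those claims on three made-up lines. It assumes grep’s extended syntax (the -E option), where + and ? can be written without any escaping.

```shell
# Hypothetical sample lines.
lines='bd
bad
baad'

# One or more "a"s: two spellings of the same thing.
plus=$(printf '%s\n' "$lines" | grep -E 'ba+d')
star=$(printf '%s\n' "$lines" | grep -E 'baa*d')
echo "$plus"        # prints bad and baad
echo "$star"        # prints bad and baad

# Zero or one "a".
optional=$(printf '%s\n' "$lines" | grep -E 'ba?d')
echo "$optional"    # prints bd and bad
```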

Line ends and beginnings

You can find things at the beginning or the end of a line using the ^ and $ characters, respectively. Thus, to find all words in the dictionary beginning with dis, you might use:

 $ grep ^dis /usr/share/dict/words

and all the words ending in ing would need:

 $ grep ing$ /usr/share/dict/words
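To see what the anchors rule out as well as what they find, here is a sketch using an invented word list. “paradise” contains “dis” but doesn’t start with it, and “ingot” contains “ing” but doesn’t end with it.

```shell
# Hypothetical sample words.
words='dismiss
paradise
singing
ingot
disarming'

# ^ pins the match to the start of the line.
starts=$(printf '%s\n' "$words" | grep '^dis')
echo "$starts"    # prints dismiss and disarming

# $ pins the match to the end of the line.
ends=$(printf '%s\n' "$words" | grep 'ing$')
echo "$ends"      # prints singing and disarming
```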

Grouping

The * and + characters in regular expressions only operate on the immediately preceding item, which makes it rather difficult to find, say,

 bana
 banana
 bananana
 banananana
 ...

Fortunately, regular expressions give you the ability to group bits of regular expression together. To match all of the above bananas, you could use:

 ba(na)+

or possibly:

 b(an)+a
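Both groupings can be compared with a short sketch. It assumes grep -E so that ( ) and + work unescaped, and it adds ^ and $ around each pattern so that a line only counts if the whole line fits. The sample lines are made up.

```shell
# Hypothetical sample lines.
lines='ba
bana
banana
bananana
bananas'

# "ba" followed by one or more "na".
first=$(printf '%s\n' "$lines" | grep -E '^ba(na)+$')
echo "$first"     # prints bana, banana and bananana

# "b", one or more "an", then "a": a different grouping, same matches.
second=$(printf '%s\n' "$lines" | grep -E '^b(an)+a$')
echo "$second"    # prints bana, banana and bananana
```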

Alternatives

You can use the | character to indicate alternatives. So, to match the Kray twins, you might use:

 (Ron|Reg)

or, possibly, this:

 R(on|eg)

You can specify more than two alternatives in the same expression, for example:

 (John|Paul|Ringo|George)
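A sketch of the alternation patterns, again assuming grep -E (where | needs no escaping) and a made-up list of names:

```shell
# Hypothetical sample lines.
names='Ron
Reg
Rob
Ringo
George'

# Two ways of writing "Ron or Reg".
twins=$(printf '%s\n' "$names" | grep -E '(Ron|Reg)')
factored=$(printf '%s\n' "$names" | grep -E 'R(on|eg)')
echo "$twins"       # prints Ron and Reg
echo "$factored"    # prints Ron and Reg

# More than two alternatives.
beatles=$(printf '%s\n' "$names" | grep -E '(John|Paul|Ringo|George)')
echo "$beatles"     # prints Ringo and George
```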

Putting it all together

Advanced bits

Search-and-replace

() \1 $1
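This section is only an outline, so the following is a guess at what it points towards rather than the author’s own example: text matched by a group can be recalled later, as \1 (and \2, and so on) in sed and grep. The sketch below uses the basic syntax, in which the grouping parentheses are written \( and \); the sample text is made up.

```shell
# Swap two words: \( \) capture each word, \1 and \2 recall them.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
echo "$swapped"    # prints world hello

# A backreference inside the pattern itself: any character followed
# by the same character again, i.e. a doubled letter.
doubled=$(printf 'book\nbank\nletter\n' | grep '\(.\)\1')
echo "$doubled"    # prints book and letter
```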

Pre-defined character sets

:alpha: etc

Counted matches

If you want an expression to match a precise number of times, you can either write the expression that number of times:

 z[0-9][0-9][0-9]z

or you can use the {n} construction to match n times:

 z[0-9]{3}z

{n} behaves in the same way as the * and ? symbols, in that it modifies the preceding expression, controlling how many times that expression must match.

You can also use the {} construction to match a range of times. To match at least 3 times, and no more than 5 times:

 z[0-9]{3,5}z
  • matches z123z, z1234z and z12345z

  • doesn’t match z12z or z123456z

If you don’t want to specify an upper limit, it can be omitted (some tools let you omit the lower limit in the same way):

 z[0-9]{3,}z

is equivalent to any of:

 z[0-9][0-9][0-9][0-9]*z
 z[0-9][0-9][0-9]+z
 z[0-9]{3}[0-9]*z
 z[0-9]{2}[0-9]+z
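The counted forms can be checked against a few made-up lines. This sketch assumes grep -E, where the braces are written unescaped.

```shell
# Hypothetical sample lines with 2, 3, 5 and 6 digits between the z's.
lines='z12z
z123z
z12345z
z123456z'

# Exactly three digits.
exactly=$(printf '%s\n' "$lines" | grep -E 'z[0-9]{3}z')
echo "$exactly"     # prints z123z

# Between three and five digits.
ranged=$(printf '%s\n' "$lines" | grep -E 'z[0-9]{3,5}z')
echo "$ranged"      # prints z123z and z12345z

# Three or more digits, written two equivalent ways.
at_least=$(printf '%s\n' "$lines" | grep -E 'z[0-9]{3,}z')
longhand=$(printf '%s\n' "$lines" | grep -E 'z[0-9][0-9][0-9]+z')
echo "$at_least"    # prints z123z, z12345z and z123456z
```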

Application-specific extensions

PCRE, etc.

What you can use where

Not all applications are born equal. The usability of some regex features is different depending on which application you have (and even on how you call it).

Tool             . * [] ^ $ character classes   ? () + | {}   \1    $1
grep             Yes                            Escaped       Yes   No
sed              Yes                            Escaped       Yes   No
egrep, grep -E   Yes                            Yes           Yes   No
perl, grep -P    Yes                            Yes           Yes   Yes

Where it says “Escaped” in the above table, use a backslash (\) in front of the character to get the effect. So, to match the banana example above with grep, use:

$ grep 'b\(an\)*a' /usr/share/dict/words

but with egrep:

$ egrep 'b(an)*a' /usr/share/dict/words
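The difference is easiest to see by running the same pattern three ways. This is a sketch on made-up lines, using grep -E as the equivalent of egrep; the last sample line contains literal parentheses.

```shell
# Hypothetical sample lines; the last one contains real parentheses.
lines='ba
bana
banana
b(an)a'

# Plain grep with backslashes: \( \) group, so * repeats "an".
basic_escaped=$(printf '%s\n' "$lines" | grep 'b\(an\)*a')
echo "$basic_escaped"      # prints ba, bana and banana

# Extended syntax: the same grouping without the backslashes.
extended=$(printf '%s\n' "$lines" | grep -E 'b(an)*a')
echo "$extended"           # prints ba, bana and banana

# Plain grep without backslashes: ( and ) are ordinary characters,
# and the * only repeats the ")".
basic_unescaped=$(printf '%s\n' "$lines" | grep 'b(an)*a')
echo "$basic_unescaped"    # prints b(an)a
```

So leaving out the backslashes in plain grep doesn’t produce an error; it quietly searches for something else.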

Other Resources

Beware

You can do an awful lot with regular expressions. You can also do a lot of awful regular expressions. Some things are better left to more obviously algorithmic methods…

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

(And, no, I didn’t write it myself, and, yes, I do know what it does, and, no, I’m not telling you. You can work it out for yourself).
