What is a regular expression?
A regular expression (or regex or regexp) is used when searching for things. A regex tells the computer what to look for – it’s a pattern to match against.
Why do I care about those then?
Regular expressions are used all over the place in UNIX and Linux systems – usually but not always in command-line tools like grep and sed. There are regular expression search-and-replace engines available in most programming languages, including Perl, PHP, C and Java. Regex search and replace functions are available in emacs, sed and vi, amongst others.
The syntax of regular expressions is terse, obscure, daunting (at least at first), hard to debug, and fickle. It’s also possibly the most powerful tool you will ever add to your computing armoury.
Basic tools (what everyone should know)
The first principle of a regex is that “non-special” characters are just matched directly. So, to find all words in the system dictionary which contain the string “bed”, you would simply use:
$ grep bed /usr/share/dict/words
grep, like most regular expression handlers, is case sensitive, so the above command will find “bed”, but not “BED” or “Bed” or “bEd”, for example.
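To see the case sensitivity for yourself, here is a small sketch. The word list is a made-up stand-in for the dictionary file, and the -i option (not mentioned above) is grep's standard way of asking it to ignore case.

```shell
# Hypothetical sample input standing in for /usr/share/dict/words
words='bed
Bed
BED
bad'

# Default, case-sensitive search: only the lowercase spelling matches
sensitive=$(printf '%s\n' "$words" | grep 'bed')

# -i tells grep to ignore case, so all three spellings match
insensitive=$(printf '%s\n' "$words" | grep -i 'bed')

printf 'case-sensitive:\n%s\n' "$sensitive"
printf 'case-insensitive:\n%s\n' "$insensitive"
```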
Single characters
While in theory a plain string like “bed” is a regular expression, it’s not very powerful. For example, suppose you want to find all words in the dictionary which have a “b” and a “d” separated by one letter. In a regular expression, the “.” character is “special” (it’s a metacharacter in regex parlance). A “.” in a regular expression matches any one character. So, in our example, we’d use:
$ grep b.d /usr/share/dict/words
A “.” in a regular expression matches anything, even single whitespace characters (like a space or a tab).
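A quick way to check what the dot does is to feed grep a handful of lines by hand. The sample lines below are invented for illustration; note that the line with a space between the letters matches, while "bd" and "bead" do not.

```shell
# Hypothetical sample lines; "b d" has a space between the letters
lines='bed
bad
b d
bd
bead'

# "." must match exactly one character, whatever it is
matches=$(printf '%s\n' "$lines" | grep 'b.d')

printf '%s\n' "$matches"
```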
Character sets
Sometimes, you only want to match certain characters out of the total available. The [] metacharacters in regexes are designed for this. Putting a list of characters in the brackets will match any one of the characters in the list. For example, to find all the words which have a “b” and a “d” separated by a vowel, you would use:
$ grep 'b[aeiou]d' /usr/share/dict/words
You can specify ranges of contiguous ASCII values in []s, so to find all of the words in the dictionary with capital letters in them, you might use this:
$ grep '[A-Z]' /usr/share/dict/words
Or to find, say, all occurrences of “www” followed by a digit in your Apache configuration, you’d use:
$ grep 'www[0-9]' /etc/apache/httpd.conf
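The three character-set searches can be tried on a few invented lines, as in the sketch below. Two things here are assumptions rather than part of the text above: the patterns are quoted so the shell does not try to expand the brackets as a filename pattern, and LC_ALL=C is set for the [A-Z] search because the meaning of a range can depend on your locale.

```shell
# Hypothetical sample lines; the hostnames are placeholders
lines='bad
bid
bxd
Bob
www2.example.com
www.example.com'

# "b" and "d" separated by a vowel
vowel=$(printf '%s\n' "$lines" | grep 'b[aeiou]d')

# Any line containing a capital letter (C locale so A-Z means ASCII A to Z)
capital=$(printf '%s\n' "$lines" | LC_ALL=C grep '[A-Z]')

# "www" followed by a single digit
numbered=$(printf '%s\n' "$lines" | grep 'www[0-9]')

printf 'vowel: %s\n' $vowel
printf 'capital: %s\n' "$capital"
printf 'numbered: %s\n' "$numbered"
```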
Repetition
Now, the examples above are all very well, but they obviously don’t cover everything. For example, what if you wanted to find any words which contain a “b” and a “d”, separated by any number of letters? You might think of doing this:
$ grep bd /usr/share/dict/words
$ grep b.d /usr/share/dict/words
$ grep b..d /usr/share/dict/words
$ grep b...d /usr/share/dict/words
etc.
but that’s going to get very long and boring to do.
Instead, you can write this:
$ grep 'b.*d' /usr/share/dict/words
The * is another special character: it matches the preceding item zero or more times. This is important to realise: .* could match nothing at all, so “bd” would be picked up as matching in the example above.
* doesn’t just apply to the dot. It applies to everything. So, if you want to find, say, anything like “bd”, “bad”, “baad”, “baaad”, etc, you would use something like this:
$ grep 'ba*d' myfile
Or, to match, say, any student ID of the form initials-year from 1999:
$ grep '[a-z]*-1999' studentlist
Note that in the last example, the * applies to the immediately preceding block (i.e. the [a-z]).
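Because * allows zero repetitions, it sometimes matches more than you expect. The sketch below uses invented lines (not a real student list) to show that ba*d picks up "bd", and that [a-z]*-1999 is happy with a line that has no initials at all.

```shell
# Hypothetical sample lines
lines='bd
bad
baaad
bed
abc-1999
-1999
xyz-2000'

# "b", then zero or more "a"s, then "d"
star=$(printf '%s\n' "$lines" | grep 'ba*d')

# Zero or more lowercase letters, then "-1999"
ids=$(printf '%s\n' "$lines" | grep '[a-z]*-1999')

printf 'ba*d matches:\n%s\n' "$star"
printf '[a-z]*-1999 matches:\n%s\n' "$ids"
```

If you need at least one letter before the hyphen, writing [a-z][a-z]* is one way to insist on it.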
Some examples
b.d
* Matches `bed bad bxd`
* Doesn't match `bead bd`
A few more strings to your bow
More repetition
Standard regular expressions offer two other forms of repetition. Where * means “none or more”, a plus + means “one or more”. Therefore, the following two expressions are equivalent:
baa*d
ba+d
The third form of item counting is the question mark ?, which means “none or one of”. So, the following expression:
ba?d
matches bd and bad, but not baad.
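Here is a small sketch of + and ? in action on three invented lines. It assumes grep -E (extended syntax), because plain grep treats an unescaped + or ? as an ordinary character.

```shell
# Hypothetical sample lines
lines='bd
bad
baad'

# One or more "a"s between "b" and "d"
plus=$(printf '%s\n' "$lines" | grep -E 'ba+d')

# Zero or one "a" between "b" and "d"
optional=$(printf '%s\n' "$lines" | grep -E 'ba?d')

printf 'ba+d matches:\n%s\n' "$plus"
printf 'ba?d matches:\n%s\n' "$optional"
```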
Line ends and beginnings
You can find things at the beginning or the end of a line using the ^ and $ characters, respectively. Thus, to find all words in the dictionary beginning with dis, you might use:
$ grep ^dis /usr/share/dict/words
and all the words ending in ing would need:
$ grep ing$ /usr/share/dict/words
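The anchors are easiest to understand next to words that contain the same letters in the wrong place. In this sketch the four words are made up for the purpose; the patterns are in single quotes so the shell leaves the $ alone.

```shell
# Hypothetical sample words: "paradise" contains "dis" and "ingot"
# contains "ing", but not where the anchors require them
words='dismiss
paradise
singing
ingot'

# Only lines that start with "dis"
starts=$(printf '%s\n' "$words" | grep '^dis')

# Only lines that end with "ing"
ends=$(printf '%s\n' "$words" | grep 'ing$')

printf 'starts with dis: %s\n' "$starts"
printf 'ends with ing: %s\n' "$ends"
```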
Grouping
The * and + characters in regular expressions only operate on the immediately preceding item, which makes it rather difficult to find, say,
bana banana bananana banananana ...
Fortunately, regular expressions give you the ability to group bits of regular expression together. To match all of the above bananas, you could use:
ba(na)+
or possibly:
b(an)+a
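To check that the two groupings really do find the same bananas, you could try something like the following. It is only a sketch: it assumes grep -E so that the brackets and + need no escaping, the word list is invented, and the ^ and $ anchors are added so that each pattern has to account for the whole line.

```shell
# Hypothetical sample lines; "ba" and "bananas" should be rejected
lines='ba
bana
banana
bananana
bananas'

# "ba" followed by one or more "na"
first=$(printf '%s\n' "$lines" | grep -E '^ba(na)+$')

# "b", one or more "an", then "a"
second=$(printf '%s\n' "$lines" | grep -E '^b(an)+a$')

printf '%s\n' "$first"
```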
Alternatives
You can use the | character to indicate alternatives. So, to match the Kray twins, you might use:
(Ron|Reg)
or, possibly, this:
R(on|eg)
You can specify more than two alternatives in the same expression, so:
(John|Paul|Ringo|George)
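A short sketch of alternatives, again assuming grep -E and an invented list of names. Both spellings of the Kray pattern pick out the same two lines.

```shell
# Hypothetical sample lines
names='Ron
Reg
Rob
Ringo'

# Either whole name
whole=$(printf '%s\n' "$names" | grep -E '(Ron|Reg)')

# Shared "R", then either ending
factored=$(printf '%s\n' "$names" | grep -E 'R(on|eg)')

# More than two alternatives in one expression
beatles=$(printf '%s\n' "$names" | grep -E '(John|Paul|Ringo|George)')

printf 'Krays:\n%s\n' "$whole"
printf 'Beatles:\n%s\n' "$beatles"
```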
Putting it all together
Advanced bits
Search-and-replace
() \1 $1
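This section is only a placeholder so far, so the following is a guess at the kind of example it is heading towards rather than anything definitive. It assumes sed with its basic syntax, where groups are written with escaped brackets and \1, \2 in the replacement stand for whatever the first and second groups matched. The input line is invented.

```shell
# Hypothetical input in "surname, forename" form
input='Kray, Ron'

# Capture the text either side of ", " and swap the two pieces round
swapped=$(printf '%s\n' "$input" | sed 's/\(.*\), \(.*\)/\2 \1/')

printf '%s\n' "$swapped"
```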
Pre-defined character sets
[:alpha:] etc.
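As a rough illustration of what this placeholder refers to, POSIX defines named classes such as [:alpha:] and [:digit:], which are written inside a bracket expression (hence the doubled brackets below). The sample lines are invented, and this is a sketch rather than a full list of the classes.

```shell
# Hypothetical sample lines
lines='abc
123
a1'

# Lines made up entirely of letters
letters=$(printf '%s\n' "$lines" | grep '^[[:alpha:]]*$')

# Lines made up entirely of digits
digits=$(printf '%s\n' "$lines" | grep '^[[:digit:]]*$')

printf 'letters only: %s\n' "$letters"
printf 'digits only: %s\n' "$digits"
```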
Counted matches
If you want an expression to match a precise number of times, you can either write the expression that number of times:
z[0-9][0-9][0-9]z
or you can use the {n} construction to match n times:
z[0-9]{3}z
{n} works like the * and ? symbols in that it applies to the preceding expression and controls how many times that expression must match.
You can also use the {} construction to match a range of times. To match at least 3 times, and no more than 5 times:
z[0-9]{3,5}z
- matches z123z, z1234z and z12345z
- doesn’t match z12z or z123456z
If you don’t want to specify an upper (or lower) limit, you can omit it:
z[0-9]{3,}z
is equivalent to any of:
z[0-9][0-9][0-9][0-9]*z
z[0-9][0-9][0-9]+z
z[0-9]{3}[0-9]*z
z[0-9]{2}[0-9]+z
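The counted forms can be checked against a few invented strings, as sketched below. It assumes grep -E so that the braces need no escaping, and it compares the open-ended {3,} form with one of the longhand equivalents.

```shell
# Hypothetical sample lines with two to six digits between the z's
lines='z12z
z123z
z1234z
z12345z
z123456z'

# Between three and five digits
ranged=$(printf '%s\n' "$lines" | grep -E 'z[0-9]{3,5}z')

# Three or more digits, written two different ways
open_ended=$(printf '%s\n' "$lines" | grep -E 'z[0-9]{3,}z')
longhand=$(printf '%s\n' "$lines" | grep -E 'z[0-9][0-9][0-9]+z')

printf '{3,5} matches:\n%s\n' "$ranged"
printf '{3,} matches:\n%s\n' "$open_ended"
```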
Application-specific extensions
PCRE, etc.
What you can use where
Not all applications are born equal. The usability of some regex features is different depending on which application you have (and even on how you call it).
Tool             . * [] ^ $ character classes   ? () + | {}   \1    $1
grep             Yes                            Escaped       Yes   No
sed              Yes                            Escaped       Yes   No
egrep, grep -E   Yes                            Yes           Yes   No
perl, grep -P    Yes                            Yes           Yes   Yes
Where it says “Escaped” in the above table, put a backslash (\) in front of the character to get the special meaning. So, to match the banana example above with grep, use:
$ grep 'b\(an\)*a' /usr/share/dict/words
but with egrep:
$ egrep 'b(an)*a' /usr/share/dict/words
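The sketch below runs the escaped and unescaped spellings side by side on a few invented words to show that they select the same lines. It uses grep -E rather than egrep, which is an assumption on my part: the two are documented as equivalent, and grep -E is the spelling more likely to be available everywhere.

```shell
# Hypothetical sample lines
lines='ba
bana
banana
bed'

# Basic syntax: the grouping brackets need a backslash
basic=$(printf '%s\n' "$lines" | grep 'b\(an\)*a')

# Extended syntax: the same brackets are special without one
extended=$(printf '%s\n' "$lines" | grep -E 'b(an)*a')

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
```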
Other Resources
- Mastering Regular Expressions, 2nd Edition, by Jeffrey E. F. Friedl
Beware
You can do an awful lot with regular expressions. You can also do a lot of awful regular expressions. Some things are better left to more obviously algorithmic methods…
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
(And, no, I didn’t write it myself, and, yes, I do know what it does, and, no, I’m not telling you. You can work it out for yourself).