SPARE
Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.
— attributed to Jamie Zawinski
Goal
The Society for the Prevention of the Abuse of Regular
Expressions (SPARE) has a very simple goal: to educate programmers
about what is appropriate use of regular expressions, and what is
abuse of regular expressions.
Why SPARE?
Regular expressions are a very powerful tool. So much so that
more than one language has honored them by providing syntax
specifically for regular expressions. However, they do have their
limits. Lately, I've seen a lot of people using regular
expressions for things where they simply don't make sense. To help
prevent that, I decided to found SPARE.
The two forms of abuse
Regular expression abuse happens whenever a regular expression
is used to solve a problem which is outside the range of problems
for which regular expressions are appropriate. This happens when
the pattern the regular expression matches is too simple, or when
the pattern the regular expression matches is too complex –
or not expressible as a regular expression.
Things that are too simple for regular expressions
The straw that caused SPARE to be
formed was someone searching for a fixed string with a regular
expression. You don't need a regular expression for this, as most
modern languages have a string library or class that includes the
ability to find a fixed string inside another string.
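As a rough illustration of the point above, here is a minimal Python sketch (Python is my choice for the examples, and the sample text and variable names are made up). It finds a fixed string with plain string operations, then does the same thing with a regular expression for comparison.

```python
import re

text = "The quick brown fox jumps over the lazy dog"
needle = "brown fox"

# Plain string facilities: a membership test and a position lookup.
found = needle in text
position = text.find(needle)  # -1 when the needle is absent

# The regular expression equivalent. re.escape is needed so that
# characters such as '.' or '*' in the needle are taken literally.
match = re.search(re.escape(needle), text)
regex_position = match.start() if match else -1
```

Both approaches give the same answer, but the string version has no escaping to get wrong.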
Similarly, splitting a string at the occurrence of some
substring is a common operation. Regular expressions make this
easy. But a modern language's string facilities make it even
easier. Such facilities often include the ability to split a
string up on whitespace, which extends the ability of the string
facility to more than just fixed strings.
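A small sketch of the same idea in Python, assuming a made-up record format with "::" as the separator. The string method handles both the fixed separator and the whitespace case; the regular expression versions are shown only for comparison.

```python
import re

record = "alpha::beta::gamma"
line = "  one two\tthree\n"

# String facilities: split on a fixed substring, or on runs of whitespace.
fields = record.split("::")
words = line.split()  # no argument means "split on any whitespace"

# The regular expression equivalents.
regex_fields = re.split(r"::", record)
regex_words = re.split(r"\s+", line.strip())
```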
That said, it should be noted that regular expression matching
can use very sophisticated search techniques, made possible because
the regular expression goes through a compilation phase. This
means that it can be faster than a straightforward string
search, at least if you don't count the compilation time. If you
can compile a fixed string as a regular expression for repeated
use, it could represent a significant time saving.
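Here is a sketch of what compiling a fixed string for repeated use might look like in Python. The log lines and names are invented for illustration. Whether the compiled form actually beats the plain string search depends on the language and its implementation, so treat this as something to measure rather than a guaranteed win.

```python
import re

needle = "ERROR"

# Compile once, outside the loop, so the cost is paid a single time.
pattern = re.compile(re.escape(needle))

lines = ["INFO start", "ERROR disk full", "INFO retry", "ERROR timeout"]

# Reuse the compiled pattern for every line.
regex_hits = [line for line in lines if pattern.search(line)]

# The straightforward string search, for comparison.
plain_hits = [line for line in lines if needle in line]
```

Timing both versions on your own data (for instance with Python's timeit module) is the only reliable way to tell which one wins.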
Things that are too complicated for regular expressions
On the other end of the spectrum, you often see regular
expressions being used to parse (or more accurately, to try
to parse) languages that are too complex for regular
expressions. These attempts frequently work for subsets of the
problem, but never handle all the legal variants of the language
in question. In such cases, a tool for general purpose parsing
should be used. For popular languages like XML, your programming
language may come bundled with a parser for the language.
XML, particularly XHTML, makes a good example. At
first sight, a regular expression to extract some part of such a
document looks easy. For instance, removing all the tags and
leaving just the text of the documents – about the simplest
thing you can do – is easy. All you have to do is remove all
text that matches <[^>]*>, right?
No, because that will fail on this legal XHTML fragment:
<h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>
Yes, most people would write this with an &gt; in the
attribute, and that's what's recommended, but it's not
required. The above fragment is legal XHTML 1.0. If you're going
to parse XHTML, you need to handle it. So you change
your expression to deal with > in a quoted string. Then you
realize that XML attributes can be
quoted with ' as well as ", so you fix the regular expression
again. It now looks like
<([^>'"]|"[^"]*"|'[^']*')*>. Not quite so simple. Is
it right yet?
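To make the progression concrete, here is a small Python sketch that runs both expressions over the fragment above. The variable names are mine, and the comment example at the end is an extra case I've added: XML comments may contain a bare >, and neither expression accounts for that.

```python
import re

fragment = """<h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>"""

# The first attempt, and the version that allows for quoted attributes.
naive = re.compile(r"<[^>]*>")
quoted = re.compile(r"""<([^>'"]|"[^"]*"|'[^']*')*>""")

# The naive expression stops at the > inside the attribute value.
naive_result = naive.sub("", fragment)

# The quoted-string version gets this fragment right.
quoted_result = quoted.sub("", fragment)

# But it still trips over a legal comment containing >.
commented = "<!-- a > b --><p>Text</p>"
comment_result = quoted.sub("", commented)
```

Here naive_result keeps part of the attribute value, quoted_result is just the text Test, and comment_result still contains the tail of the comment. So the answer to the question above is no.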
Simply using an XML library would
have solved this problem with well-tested code the first
time. This case is still simple enough that using a library isn't
a clear win, but this is also about the simplest possible
thing you can do to an XHTML
document. More complicated problems, like finding the
target of all links in the document, require noticeably
more complicated regular expressions, but won't require noticeably
more complicated code using an XML
parsing library.
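As a sketch of how little the code changes between the two tasks, here is one way to do both with xml.etree.ElementTree from Python's standard library. The sample document is made up, and it deliberately declares no namespace. A real XHTML document declares the XHTML namespace, so the tag name passed to iter() would need to be namespace-qualified there.

```python
import xml.etree.ElementTree as ET

# A made-up, namespace-free sample that includes the troublesome fragment.
document = """<div>
<h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>
<p>See <a href="http://example.com/one">one</a> and
<a href='http://example.com/two'>two</a>.</p>
</div>"""

root = ET.fromstring(document)

# Task 1: remove all the tags and keep just the text.
text = "".join(root.itertext())

# Task 2: find the target of every link in the document.
targets = [a.get("href") for a in root.iter("a") if a.get("href") is not None]
```

The parser deals with the > in the attribute and with both quoting styles, and the second task takes one more line than the first.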