
Society for the Prevention of Abuse of Regular Expressions

spare

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
— attributed to Jamie Zawinski

Goal

The Society for the Prevention of the Abuse of Regular Expressions (spare) has a very simple goal: to educate programmers about what is appropriate use of regular expressions, and what is abuse of them.

Why spare?

Regular expressions are a very powerful tool. So much so that more than one language has honored them by providing syntax specifically for regular expressions. However, they do have their limits. Lately, I've seen a lot of people using regular expressions for things where they simply don't make sense. To help prevent that, I decided to found SPARE.

The two forms of abuse

Regular expression abuse happens whenever a regular expression is used to solve a problem which is outside the range of problems for which regular expressions are appropriate. This happens when the pattern the regular expression matches is too simple, or when the pattern the regular expression matches is too complex – or not expressible as a regular expression.

Things that are too simple for regular expressions

The straw that caused SPARE to be formed was someone searching for a fixed string with a regular expression. You don't need a regular expression for this, as most modern languages have a string library or class that includes the ability to find a fixed string inside another string.
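As a rough illustration in Python (the strings here are made up for the example), a fixed-string search needs nothing more than the built-in string operations. The regular expression version also has to escape the search string, or its metacharacters quietly change what is matched:

```python
import re

haystack = "price: 3.50 (was 4.00)"   # hypothetical sample data
needle = "3.50"

# Plain string operations: no pattern syntax to worry about.
found = needle in haystack            # membership test
position = haystack.find(needle)      # index of the first match, or -1

# Regex equivalent: unescaped, "." matches any character,
# so "3.50" also matches "3150".
unescaped_hit = re.search(needle, "price: 3150") is not None
escaped_hit = re.search(re.escape(needle), "price: 3150") is not None

print(found, position, unescaped_hit, escaped_hit)
```

The unescaped pattern reports a match in a string that does not contain the fixed string at all, which is exactly the kind of second problem the quote at the top warns about.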

Similarly, splitting a string at the occurrence of some substring is a common operation. Regular expressions make this easy. But a modern language's string facilities make it even easier. Such facilities often include the ability to split a string up on whitespace, which extends the ability of the string facility to more than just fixed strings.
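A small sketch of the same point, again in Python and with invented sample strings: the string method handles both the fixed-separator case and the whitespace case, and the whitespace form is arguably nicer than the obvious regular expression because it drops the empty leading and trailing fields.

```python
import re

line = "alpha, beta, gamma"               # hypothetical sample data
fields = line.split(", ")                 # split on a fixed substring

messy = "  one \t two\n three  "
words = messy.split()                     # no argument: split on runs of whitespace
re_words = re.split(r"\s+", messy)        # regex version keeps empty end fields

print(fields)
print(words)
print(re_words)
```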

That said, it should be noted that regular expression engines can use very sophisticated search techniques, made possible because the regular expression goes through a compilation phase. This means that a compiled expression can be faster than a naive string search, at least if you don't count the compilation time. If you can compile a fixed string as a regular expression for repeated use, it could represent a significant time saving. Many string libraries also use optimized search algorithms, though, so measure before relying on this.
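Whether the compiled expression actually wins depends on the language and library, so it is worth timing both. The following is only a sketch of how one might check in Python. The test data, the repeat count, and the function names are all assumptions made for the example, and no particular result is implied.

```python
import re
import timeit

needle = "needle"
haystack = ("hay " * 10_000) + needle        # hypothetical test data
pattern = re.compile(re.escape(needle))      # compiled once, reused below

def with_find():
    # Built-in fixed-string search.
    return haystack.find(needle)

def with_regex():
    # Same search through the precompiled regular expression.
    match = pattern.search(haystack)
    return match.start() if match else -1

# Both approaches must agree before their timings mean anything.
assert with_find() == with_regex()

find_seconds = timeit.timeit(with_find, number=200)
regex_seconds = timeit.timeit(with_regex, number=200)
print(f"str.find: {find_seconds:.4f}s  compiled regex: {regex_seconds:.4f}s")
```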

Things that are too complicated for regular expressions

On the other end of the spectrum, you often see regular expressions being used to parse (or more accurately, to try to parse) languages that are too complex for regular expressions. These attempts frequently work for subsets of the problem, but never handle all the legal variants of the language in question. In such cases, a tool for general purpose parsing should be used. For popular languages like xml, your programming language may come bundled with a parser for the language.

xml, particularly xhtml, makes a good example. At first sight, a regular expression to extract some part of such a document looks easy. For instance, removing all the tags and leaving just the text of the document, about the simplest thing you can do, looks trivial. All you have to do is remove all text that matches <[^>]*>, right?

No, because that will fail on this legal xhtml fragment:

<h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>

Yes, most people would write this with an &gt; in the attribute, and that's what's recommended, but it's not required. The above fragment is legal xhtml 1.0. If you're going to parse xhtml, you need to handle it. So you change your expression to deal with > in a quoted string. Then you realize that xml attributes can be quoted with ' as well as ", so you fix the regular expression again. It now looks like <([^>'"]|"[^"]*"|'[^']*')*>. Not quite so simple. Is it right yet?
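To make the comparison concrete, here is a sketch that runs both expressions from the text over the fragment above, using Python's re module. The last input is my own addition, not part of the example above: a comment containing an apostrophe, which is one way to see that the answer to the question is still no.

```python
import re

fragment = """<h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>"""

naive = re.compile(r"<[^>]*>")
quoted = re.compile(r"""<([^>'"]|"[^"]*"|'[^']*')*>""")

# The naive pattern stops at the ">" inside the attribute value.
naive_result = naive.sub("", fragment)

# The quote-aware pattern handles this fragment correctly.
quoted_result = quoted.sub("", fragment)

# Hypothetical extra input: the apostrophe in the comment looks like
# the start of a quoted string that never ends, so the comment survives.
with_comment = "<p>a<!-- it's here -->b</p>"
comment_result = quoted.sub("", with_comment)

print(repr(naive_result))
print(repr(quoted_result))
print(repr(comment_result))
```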

Simply using an xml library would have solved this problem with well-tested code the first time. This case is still simple enough that using a library isn't a clear win – but this is also about the simplest possible thing you can do to an xhtml document. More complicated problems – like finding the target of all links in the document – require noticeably more complicated regular expressions, but won't require noticeably more complicated code using an xml parsing library.
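For comparison, a minimal sketch of both tasks using the xml.etree.ElementTree module that ships with Python. The sample document is invented for the example and deliberately leaves out the xhtml namespace. With a real xhtml document, the tag names would carry the namespace and the lookup would need to account for that.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; no xhtml namespace, to keep the sketch short.
doc = """<div>
  <h1 onclick="if (x > 5) { alert('Test worked'); }">Test</h1>
  <p>See <a href="https://example.com/a">one</a> and <a href='b.html' title="x > y">two</a>.</p>
</div>"""

root = ET.fromstring(doc)

# Task 1: strip the tags, keep the text.
text = "".join(root.itertext())

# Task 2: collect the target of every link.
links = [a.get("href") for a in root.iter("a")]

print(" ".join(text.split()))
print(links)
```

Going from the first task to the second costs one extra line here, and the quoting rules never show up in the code at all.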
