Self-documenting regular expressions

When they get long and complicated, regular expressions are difficult to write, hard to read, and still harder to document. Even a regular expression used to test for a very well understood and well-defined pattern can be lengthy and, frankly, baffling.

The structure of an email address, for instance, can be exhaustively defined in a few short paragraphs, but one such regular expression begins like this and continues for dozens more lines:

(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\".[] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[["()<>@,;:\".[]]))|"(?:[^"\r\]|\.|(?:(?:\r\n)?[ \t]))"(?:(?: \r\n)?[ \t]))(?:.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\".[] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|"(?:[^"\r\]|\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\".[] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|[([^[]\r\]|\.)
](?:(?:\r\n)?[ \t])
)(?:.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\".[] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|[([^[]\r\]|\.)](?: (?:\r\n)?[ \t])))|(?:[^()<>@,;:\".[] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[["()<>@,;:\".[]]))|"(?:[^"\r\]|\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n) ?[ \t]))<(?:(?:\r\n)?[ \t])(?:@(?:[^()<>@,;:\".[] \000-\031]+(?:(?:(?:
r\n)?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|[([^[]\r\]|\.)](?:(?:\r\n)?[ \t]))(?:.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\".[] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|[([^[]\r\]|\.)](?:(?:\r\n)?[ \t] )))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\".[] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[["()<>@,;:\".[]]))|[([^[]\r\]|\.)](?:(?:\r\n)?[ \t])*

(Plus much, much, much more...)

It’s not immediately obvious how best to document something like this. Of course, we’d never use regular expressions for email addresses at all (use filter_var($email_address, FILTER_VALIDATE_EMAIL) instead!), and in the above case, we’d probably do better just to document the algorithm used to generate the regex.
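As a minimal sketch of the filter_var() approach mentioned above: the two sample addresses are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical sample inputs, chosen only for illustration.
$candidates = ['user@example.com', 'not-an-address'];

$results = [];
foreach ($candidates as $candidate) {
  // filter_var() returns the filtered string on success and FALSE on failure.
  $results[$candidate] = filter_var($candidate, FILTER_VALIDATE_EMAIL) !== FALSE;
}

var_export($results);
```

Comparing against FALSE with !== keeps the check explicit, since filter_var() returns the address itself rather than TRUE when validation succeeds.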

But there are times when we need to employ long and complex regular expressions, and for those cases there are three different ways we can comment regular expressions in PHP: standard comments, and two kinds of comments specific to regular expressions.

1. Use PHP comments

Obviously it's possible to add a comment just before the pattern definition, and in a relatively simple expression, this works well enough:

// This pattern case-insensitively matches a <div> element, including an
// opening tag, with any number of attributes, its contents, and the
// closing tag. It also captures the element's contents.
$pattern = '/<div[^>]*>(.*?)<\/div>/i';

But for any really complex pattern, the comments could run to dozens of lines, and it would be hard to write them in a way that actually helps a future developer who needs to understand or work with the code.

2. Use regex comments: option 1

PHP offers a specific way to comment parts of regular expression patterns. Any text in parentheses beginning with a question mark and a hash, (?# Like this ), is disregarded when pattern matching:

$pattern = '/<div[^>]*>(.*?)<\/div>(?# Matches div tag and captures its contents. )/i';

This is somewhat useful for very short patterns, but is probably even worse than ordinary comments for documenting lengthy or complex patterns.
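A quick way to convince ourselves that an inline (?# ... ) comment really is ignored is to run the same subject through the pattern with and without it. This is only a sketch; the sample markup is made up for the purpose.

```php
<?php
$plain = '/<div[^>]*>(.*?)<\/div>/i';
$commented = '/<div[^>]*>(.*?)<\/div>(?# Matches div tag and captures its contents. )/i';

// Hypothetical sample markup.
$html = '<div id="intro">Hello</div>';

preg_match($plain, $html, $without_comment);
preg_match($commented, $html, $with_comment);

// Both arrays hold the full match followed by the captured contents.
var_export($without_comment === $with_comment);
```

Note that the comment ends at the first closing parenthesis, so the comment text itself cannot contain one.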

3. Use regex comments: option 2 (PCRE_EXTENDED)

The main problem with both of the previous solutions is that a pattern cannot easily be written with multiple comments:

// PHP comments require concatenation.
$regex = '/<div[^>]*>' // Matches opening tag and zero or more attributes.
  . '(.*?)' // Captures element contents.
  . '<\/div>/i'; // Matches closing tag.

// Regex comments don't need concatenation, but we can't separate them from the
// pattern because the whitespace will be interpreted as part of the pattern.
$regex = '/<div[^>]*>(?# Matches opening tag and zero or more attributes. )
  (.*?)(?# Captures element contents.)
  <\/div>(?# Matches closing tag. )/i';
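To see the whitespace problem in action, here is a small sketch that puts a single space between the commented parts of the pattern. The comment texts and sample strings are illustrative only.

```php
<?php
// Without the x modifier, the spaces between the inline comments are
// literal characters that the subject must contain.
$spaced = '/<div[^>]*>(?# opening tag ) (.*?)(?# contents ) <\/div>(?# closing tag )/i';

// Ordinary markup has no spaces in those positions, so nothing matches.
$tight = preg_match($spaced, '<div>text</div>');

// Markup that happens to have spaces there does match.
$loose = preg_match($spaced, '<div> text </div>', $matches);

var_export([$tight, $loose, $matches[1] ?? NULL]);
```

The pattern now only matches markup that has a space after the opening tag and another before the closing tag, which is not what the comments claim it does.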

However, it turns out that PHP also offers a solution that addresses these problems, though it's not one I see used very often. Recently while browsing the source of the Phone module, I found a series of regular expressions like this:

// define regular expression
$regex = '/
  \D*             # ignore non-digits
  (\d*)           # an optional 1
  \D*             # optional separator
  ([2-9][0-8]\d)  # area code (Allowed range of [2-9] for the first digit, [0-8] for the second, and [0-9] for the third digit)
  \D*             # optional separator
  ([2-9]\d{2})    # 3-digit prefix (cannot start with 0 or 1)
  \D*             # optional separator
  (\d{4})         # 4-digit line number
  \D*             # optional separator
  (\d*)           # optional extension
  \D*             # ignore trailing non-digits
  /x';

Here again, the comments are actually in the expression itself.
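As a rough illustration of how a commented pattern like this behaves, the sketch below runs it against a made-up phone number. The number and the variable names are hypothetical, and the area-code comment is shortened; this is not how the Phone module itself calls the pattern.

```php
<?php
$regex = '/
  \D*             # ignore non-digits
  (\d*)           # an optional 1
  \D*             # optional separator
  ([2-9][0-8]\d)  # area code
  \D*             # optional separator
  ([2-9]\d{2})    # 3-digit prefix (cannot start with 0 or 1)
  \D*             # optional separator
  (\d{4})         # 4-digit line number
  \D*             # optional separator
  (\d*)           # optional extension
  \D*             # ignore trailing non-digits
  /x';

// Hypothetical input, chosen only for illustration.
$matched = preg_match($regex, '1 (415) 555-2671 x89', $matches);

// Drop the full match and keep the five captured groups:
// country prefix, area code, prefix, line number, extension.
$parts = array_slice($matches, 1);
var_export($parts);
```

The captured groups line up one-to-one with the commented lines, which makes it easy to see which part of the pattern produced which value.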

PHP's Perl-compatible regular expressions include Pattern Modifiers that are used to configure how the characters within the pattern are evaluated by the regular expression engine.

For example, the 'i' at the end of the sample regular expression above means that "…letters in the pattern match both upper and lower case letters," so the pattern /<div[^>]*>(.*?)<\/div>/i will match <DIV>REGULAR EXPRESSIONS</DIV> or <div>Regular Expressions</div> (as well as variations containing DiV etc).
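A short sketch makes the effect of the i modifier concrete. The class attribute in the second sample is an addition of mine to show that attributes are accepted too.

```php
<?php
$pattern = '/<div[^>]*>(.*?)<\/div>/i';

$samples = [
  '<DIV>REGULAR EXPRESSIONS</DIV>',
  '<div class="note">Regular Expressions</div>',
];

$contents = [];
foreach ($samples as $sample) {
  // preg_match() returns 1 when the pattern matches.
  if (preg_match($pattern, $sample, $matches) === 1) {
    $contents[] = $matches[1];
  }
}

// The same pattern without the i modifier rejects the upper-case tag.
$case_sensitive = preg_match('/<div[^>]*>(.*?)<\/div>/', '<DIV>REGULAR EXPRESSIONS</DIV>');

var_export([$contents, $case_sensitive]);
```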

The pattern from the Phone module, above, uses the x (or PCRE_EXTENDED) pattern modifier. According to the documentation, when this modifier is used

…whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored.

In other words, whitespace characters and characters between a hash and a newline in the pattern don't 'count'. This is what makes it possible to insert comments inside a pattern.
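The flip side is that any whitespace or hash we actually want to match has to be escaped or placed in a character class. The sketch below uses a made-up "item #42" string to show both cases; the patterns are illustrative only.

```php
<?php
$pattern = '/
  item [ ]  # the word item, then one literal space (inside a character class)
  \#        # a literal hash, escaped so that it does not start a comment
  (\d+)     # capture the item number
  /x';

$found = preg_match($pattern, 'item #42', $matches);

// With the x modifier, the unescaped space below is dropped, so this
// pattern is really item(\d+) and no longer matches "item 42".
$with_space = preg_match('/item (\d+)/x', 'item 42');
$without_space = preg_match('/item (\d+)/x', 'item42');

var_export([$found, $matches[1] ?? NULL, $with_space, $without_space]);
```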

Applying this to our HTML tag-matching pattern from above, we can exchange three lines of comments for three extra self-documenting lines within the pattern itself.

$pattern = '/
  <div[^>]*>  # Matches opening tag with 0 or more attributes.
  (.*?)       # Matches and captures tag contents.
  <\/div>     # Matches closing tag.
  /xi';

This has an additional advantage: because the comments are within the string, they don't run afoul of tools used to enforce Drupal's coding standards, such as Coder or PHP_CodeSniffer, the way something like this would:

$pattern = '/<div[^>]*>(.*?)<\/div>/i'; // PHP CS flags comments after expressions.

But don’t overdo the comments

Long, complex regular expressions may need commenting. In those cases, I prefer to use the PCRE_EXTENDED modifier and write the comments in the pattern itself. But there’s no need to overdo it: basic regular expressions (e.g. /colou?r/i) seldom need much in the way of comment. Make sure any comments actually illuminate the code for any future coder (you, your colleague, or your client) who might have to read or modify it.