< >

Regular Expression Matching can be Ugly and Slow

Posted 15 years ago

If you open the first few pages of O'Reilly's Beautiful Code, you will find a well written chapter by Brian Kernighan (Personal motto: "No, I didn't invent C. Who told you that?"). The non-C inventing professor describes how a limited form of regular expressions can be implemented elegantly in only a few lines of C code.

/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do { /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '');
   return 0;
}

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
   if (re[0] == '')
      return 1;
   if (re[1] == '*')
      return matchstar(re[0], re+2, text);
   if (re[0] == '$' && re[1] == '')
      return *text == '';
   if (*text!='' && (re[0]=='.' || re[0]==*text))
      return matchhere(re+1, text+1);
   return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
   do { /* a * matches zero or more instances */
      if (matchhere(re, text))
         return 1;
   } while (*text!='' && (*text++==c || c=='.'));
   return 0;
}

However, if you try to actually use it for something you will realize that it is really just a glorified filename globber, which is a much simpler problem than regular expressions. In particular, it doesn't support alternation (|), subexpressions, character ranges, or capturing.

I wondered what it would take to implement the whole shebang Ã¢ï¿½ï¿½ capturing regular expressions as concisely as possible, in C.

int _capslot = 0;

int match_char(char* re, char str)
{
    return *re == '' ? 
        re[1] == 'd' ? isdigit(str) : 
            re[1] == 's' ? isspace(str) :
            re[1] == 'w' ? isalnum(str) || str == '_' :
            re[1] == 'n' ? str == 'n' :
            re[1] == 'r' ? str == 'r' :
            re[1] == 't' ? str == 't' :
            re[1] == str 
        : (*re == '.' || *re == str);
}

// Performs a capturing match-here operation. Returns the number of characters
// matched.
//
// re: Indirect pointer to regular expression
// str: Indirect pointer to string to match.
// skip: must be 0. (Used internally for recursion)
// capture: Returned array of captured subexpressions. To determine the size
// needed, count the number of unescaped '(' in the expression. Subexpressions
// may be nested.
int parse_expr(char** re, char** str, int skip, char* capture[])
{
    int ok = 0;
    char* backup_str = *str;
    for( ;; ) {
        int ok2 = 1, skip2 = skip;
        while( **re != 0 && **re != '|' && **re != ')' ) {
            int ok3 = 1, cnt=0, backup_capslot = _capslot;
            char* backup_re = *re;
            for( ;; *re = backup_re, _capslot = backup_capslot ) {
                char* backup_str = *str;
                if ( **re == 0 ) {
                } else if ( **re == '(' ) {
                    char* start = *str;
                    int slot = _capslot++;
                    (*re)++;
                    ok3 = !!parse_expr(re,str,skip2, capture);
                    if ( ok3 && capture ) {
                        free( capture[slot] );
                        capture[slot] = (char*)malloc( *str-start+1 );
                        memcpy( capture[slot], start, *str-start );
                        capture[slot][*str-start] = 0;
                    }
                    if ( **re == ')' ) (*re)++;
                } else if ( **re == '[' ) {
                    int inv = 0;
                    (*re)++;
                    ok3 = 0;
                    while( **re && **re != ']' ) {
                        if ( **re == '^' ) {
                            inv = 1;
                        } else if ( **re != '' && (*re)[1] == '-' ) {
                            ok3 |= **str >= (*re)[0] && **str <= (*re)[2];
                            *re += 2;
                        } else if ( match_char( *re, **str ) ) {
                            ok3 = 1;
                        }
                        (*re)++;
                    }

                    ok3 ^= inv;
                    ok3 && (*str)++;

                    if ( **re == ']' ) (*re)++;
                } else if ( skip2 ) {
                    *re += (**re == '') + 1;
                } else if ( match_char(*re, **str ) ) {
                    *re += (**re == '') + 1;
                    (*str)++;
                } else {
                    *re += (**re == '') + 1;
                    ok3 = 0;
                }
                if ( !ok3 ) *str = backup_str;
                cnt += ok3;
                if ( **re != '+' && **re != '*' || skip2 || !ok3 ) break;
            }

            if ( **re == '?' ) {
                (*re)++;
                ok3 = 1;
            } else if ( **re == '+' ) {
                (*re)++;
                ok3 = cnt >= 1;
            } else if ( **re == '*' ) {
                (*re)++;
                ok3 = cnt >= 0;
            }

            ok2 &= ok3;
            skip2 |= !ok2;
        }
        ok |= ok2;
        if ( **re != '|' ) break;
        (*re)++;
        ok && (skip = 1) || (*str = backup_str);
    } 
    if ( !ok ) *str = backup_str;
    return *str - backup_str;
}

int main(void)
{
    char* text = "September 22, 1966";
    char* expr = "([A-Za-z]+) (d+), (d+)";

    char* fields[3];

    parse_expr( &expr, &text, 0, fields );

    printf("Month: %s, Day: %s, Year: %sn", fields[0], fields[1], fields[2]);

    return 0;
}

This is horrible on so many levels. First of all, it is impossible to understand. Secondly, it uses backtracking, which means if the expression did not match a certain way it keeps trying until all possibilities are exhausted. Thirdly, I forgot to put in "$" and "^".

There is a better way

A more efficient way to match regular expressions has been known for decades, and is described nicely in Regular Expression Matching can be Simple And Fast. I won't repeat it here (yet).

However, I will point out some code that is beautiful.

In 1993, Anthony Howe's concise implementation of grep won the Obfuscated C Coding competition. It lazily constructs a finite state machine for the expression as it goes, forming only the parts needed to match or reject the input text.

Howe kindly released an unobfuscated version of his entry. If you are interested in the innards of regular expressions, after you read Regular Expression Matching can be Simple And Fast, then take a look at the elegant representation of NFAs in Howe's comments. He describes how the state machine is stored using a flat array. No structures, nodes, or pointers are needed.

If you're used to making an object for everything, then spending 15 minutes to understand this is guaranteed to expand your mind.

The full source code is now on github. (Look for agag.c).

/*
 * Non-deterministic Finite-state Automaton
 *
 * An NFA is a directed graph whose nodes are called states.  Each
 * state is either a labeled state denoting a literal value, or a
 * NIL state that has no label.  Each state has at most two edges
 * leaving it.  There is only one start state and one final state,
 * which may be the same state.
 *
 * Various state-structures can be represented by an array.  These 
 * structures have only one start point being the left most index,
 * and one end point being the right most index.
 *
 *	a		...[a]...
 *
 *	ab		...[a][b]...
 *
 *	^a		[bol][a]...
 *
 *	a$		...[a][eol]
 *
 *	a.b		...[a][wild][b]...
 *	        		_
 *			       v 
 *	a*		...[][a][]...
 *			     ____^
 *
 *	a?		...[][a]...
 *			     ___^
 *			            _______________
 *			           /               v
 *	a|b		...[][a][/][stop][pass][b]...
 *			     _____________^
 *
 *	[a0-9]		...[class][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *	[^a0-9]		...[nclass][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *			            _____________  _______________
 *			           /             v/               v
 *	a|b|c		...[][a][/][stop][][b][/][stop][pass][c]...
 *			     _____________^_____________^
 *
 *			        _            _______________
 *			       v           /               v
 *	a*(b|c)		...[][a][][][b][/][stop][pass][c]...
 *			     ____^   _____________^
 */

#define BOL		('n')
#define EOL		('n')
#define STOP		('n')
#define WILD		(0)
#define CLASS		(-1)
#define NCLASS		(-2)
#define PASS		(-3)
#define JUMP(n)		(-4-(n))
#define CHAR(c)		(c)
#define ISJUMP(x)	((x) <= PASS)
#define ISCHAR(x)	(0 <= (x))

Steve Hanov makes a living working on Rhymebrain.com, rapt.ink, www.websequencediagrams.com, and Zwibbler.com. He lives in Waterloo, Canada.

Post comment

edit

Anthony Howe

13 years ago

Thank you for the very kind praise of my work. Nice to know you enjoyed it and found a use for it as a teaching aid.

edit

Henry

15 years ago

This is an excellent article, thank you!

Blame the extensions (comic)

Drawing Graphs with Physics

To my surprise, I found that there is a very simple way to arrange graphs that can be expressed in only a few lines of code, using force-directed placement...

Is 2009 the year of Linux malware?

Is 2009 the year of the linux desktop malware? How long until we see headlines like, "Researchers find massive botnet based on linux 2.30"?

Succinct Data Structures: Cramming 80,000 words into a Javascript file.

jQuery creator John Resig needs a little help storing lists of words in his side project. Let's go overkill and explore a little known branch of computer science called Succinct Data Structures.

[comic] Appreciation of xkcd comics vs. technical ability

Creating portable binaries on Linux

Distributing applications on Linux is hard. Sure, with modern package management, installing software is easy. But if you are distributing an application, you probably need one Windows version, plus umpteen different versions for Linux. In this article, we'll create a dummy application that targets the following operating systems, which are commonly used in business environments...

How wide should you make your web page?

Based on 22500 unique IP addresses over the past week.

Exploring sound with Wavelets

Here's a program to create scalograms of sound files.

How I run my business selling software to Americans

Here's what you can do to get the most out of your business in Canada if all of your revenue comes in US dollars.

How a programmer reads your resume (comic)

People thought it was a comic, so I never corrected them.