Hate UML?

Draw sequence diagrams in seconds.
http://www.websequencediagrams.com

Regular Expression Matching can be Ugly and Slow
Posted on: 2010-03-13 11:00:00
If you open the first few pages of O'Reilly's Beautiful Code, you will find a well written chapter by Brian Kernighan (Personal motto: "No, I didn't invent C. Who told you that?"). The non-C inventing professor describes how a limited form of regular expressions can be implemented elegantly in only a few lines of C code.
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do { /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
   if (re[0] == '\0')
      return 1;
   if (re[1] == '*')
      return matchstar(re[0], re+2, text);
   if (re[0] == '$' && re[1] == '\0')
      return *text == '\0';
   if (*text!='\0' && (re[0]=='.' || re[0]==*text))
      return matchhere(re+1, text+1);
   return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
   do { /* a * matches zero or more instances */
      if (matchhere(re, text))
         return 1;
   } while (*text!='\0' && (*text++==c || c=='.'));
   return 0;
}

However, if you try to actually use it for something you will realize that it is really just a glorified filename globber, which is a much simpler problem than regular expressions. In particular, it doesn't support alternation (|), subexpressions, character ranges, or capturing.

I wondered what it would take to implement the whole shebang – capturing regular expressions as concisely as possible, in C.

int _capslot = 0;

int match_char(char* re, char str)
{
    return *re == '\\' ? 
        re[1] == 'd' ? isdigit(str) : 
            re[1] == 's' ? isspace(str) :
            re[1] == 'w' ? isalnum(str) || str == '_' :
            re[1] == 'n' ? str == '\n' :
            re[1] == 'r' ? str == '\r' :
            re[1] == 't' ? str == '\t' :
            re[1] == str 
        : (*re == '.' || *re == str);
}

// Performs a capturing match-here operation. Returns the number of characters
// matched.
//
// re: Indirect pointer to regular expression
// str: Indirect pointer to string to match.
// skip: must be 0. (Used internally for recursion)
// capture: Returned array of captured subexpressions. To determine the size
// needed, count the number of unescaped '(' in the expression. Subexpressions
// may be nested.
int parse_expr(char** re, char** str, int skip, char* capture[])
{
    int ok = 0;
    char* backup_str = *str;
    for( ;; ) {
        int ok2 = 1, skip2 = skip;
        while( **re != 0 && **re != '|' && **re != ')' ) {
            int ok3 = 1, cnt=0, backup_capslot = _capslot;
            char* backup_re = *re;
            for( ;; *re = backup_re, _capslot = backup_capslot ) {
                char* backup_str = *str;
                if ( **re == 0 ) {
                } else if ( **re == '(' ) {
                    char* start = *str;
                    int slot = _capslot++;
                    (*re)++;
                    ok3 = !!parse_expr(re,str,skip2, capture);
                    if ( ok3 && capture ) {
                        free( capture[slot] );
                        capture[slot] = (char*)malloc( *str-start+1 );
                        memcpy( capture[slot], start, *str-start );
                        capture[slot][*str-start] = 0;
                    }
                    if ( **re == ')' ) (*re)++;
                } else if ( **re == '[' ) {
                    int inv = 0;
                    (*re)++;
                    ok3 = 0;
                    while( **re && **re != ']' ) {
                        if ( **re == '^' ) {
                            inv = 1;
                        } else if ( **re != '\\' && (*re)[1] == '-' ) {
                            ok3 |= **str >= (*re)[0] && **str <= (*re)[2];
                            *re += 2;
                        } else if ( match_char( *re, **str ) ) {
                            ok3 = 1;
                        }
                        (*re)++;
                    }

                    ok3 ^= inv;
                    ok3 && (*str)++;

                    if ( **re == ']' ) (*re)++;
                } else if ( skip2 ) {
                    *re += (**re == '\\') + 1;
                } else if ( match_char(*re, **str ) ) {
                    *re += (**re == '\\') + 1;
                    (*str)++;
                } else {
                    *re += (**re == '\\') + 1;
                    ok3 = 0;
                }
                if ( !ok3 ) *str = backup_str;
                cnt += ok3;
                if ( **re != '+' && **re != '*' || skip2 || !ok3 ) break;
            }

            if ( **re == '?' ) {
                (*re)++;
                ok3 = 1;
            } else if ( **re == '+' ) {
                (*re)++;
                ok3 = cnt >= 1;
            } else if ( **re == '*' ) {
                (*re)++;
                ok3 = cnt >= 0;
            }

            ok2 &= ok3;
            skip2 |= !ok2;
        }
        ok |= ok2;
        if ( **re != '|' ) break;
        (*re)++;
        ok && (skip = 1) || (*str = backup_str);
    } 
    if ( !ok ) *str = backup_str;
    return *str - backup_str;
}

int main(void)
{
    char* text = "September 22, 1966";
    char* expr = "([A-Za-z]+) (\\d+), (\\d+)";

    char* fields[3];

    parse_expr( &expr, &text, 0, fields );

    printf("Month: %s, Day: %s, Year: %s\n", fields[0], fields[1], fields[2]);

    return 0;
}

This is horrible on so many levels. First of all, it is impossible to understand. Secondly, it uses backtracking, which means if the expression did not match a certain way it keeps trying until all possibilities are exhausted. Thirdly, I forgot to put in "$" and "^".

There is a better way

A more efficient way to match regular expressions has been known for decades, and is described nicely in Regular Expression Matching can be Simple And Fast. I won't repeat it here (yet).

However, I will point out some code that is beautiful.

In 1993, Anthony Howe's concise implementation of grep won the Obfuscated C Coding competition. It lazily constructs a finite state machine for the expression as it goes, forming only the parts needed to match or reject the input text.

Howe kindly released an unobfuscated version of his entry. If you are interested in the innards of regular expressions, after you read Regular Expression Matching can be Simple And Fast, then take a look at the elegant representation of NFAs in Howe's comments. He describes how the state machine is stored using a flat array. No structures, nodes, or pointers are needed.

If you're used to making an object for everything, then spending 15 minutes to understand this is guaranteed to expand your mind.

The full source code is hidden in Anthony's web site (Look for agag.c).

/*
 * Non-deterministic Finite-state Automaton
 *
 * An NFA is a directed graph whose nodes are called states.  Each
 * state is either a labeled state denoting a literal value, or a
 * NIL state that has no label.  Each state has at most two edges
 * leaving it.  There is only one start state and one final state,
 * which may be the same state.
 *
 * Various state-structures can be represented by an array.  These 
 * structures have only one start point being the left most index,
 * and one end point being the right most index.
 *
 *	a		...[a]...
 *
 *	ab		...[a][b]...
 *
 *	^a		[bol][a]...
 *
 *	a$		...[a][eol]
 *
 *	a.b		...[a][wild][b]...
 *	        		_
 *			       v \
 *	a*		...[\][a][\]...
 *			     \____^
 *
 *	a?		...[\][a]...
 *			     \___^
 *			            _______________
 *			           /               v
 *	a|b		...[\][a][/][stop][pass][b]...
 *			     \_____________^
 *
 *	[a0-9]		...[class][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *	[^a0-9]		...[nclass][a][0][1][2][3][4][5][6][7][8][9][stop]...
 *
 *			            _____________  _______________
 *			           /             v/               v
 *	a|b|c		...[\][a][/][stop][\][b][/][stop][pass][c]...
 *			     \_____________^\_____________^
 *
 *			        _            _______________
 *			       v \          /               v
 *	a*(b|c)		...[\][a][\][\][b][/][stop][pass][c]...
 *			     \____^   \_____________^
 */

#define BOL		('\n')
#define EOL		('\n')
#define STOP		('\n')
#define WILD		(0)
#define CLASS		(-1)
#define NCLASS		(-2)
#define PASS		(-3)
#define JUMP(n)		(-4-(n))
#define CHAR(c)		(c)
#define ISJUMP(x)	((x) <= PASS)
#define ISCHAR(x)	(0 <= (x))

Want more programming tech talk?
Add to Circles on Google Plus
Subscribe to posts

Post comment

Real Name:
Your Email (Not displayed):

Text only. No HTML. If you write "http:" your message will be ignored.
Choose an edit password if you want to be able to edit or delete your comment later.
Editing Password (Optional):

Henry

2010-03-14 17:08:57
This is an excellent article, thank you!

Anthony Howe

2011-06-03 07:53:07
Thank you for the very kind praise of my work. Nice to know you enjoyed it and found a use for it as a teaching aid.
Email
steve.hanov@gmail.com

Other posts by Steve

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US How Asana Breaks the Rules About Per-Seat Pricing 5 Ways PowToon Made Me Want to Buy Their Software How I run my business selling software to Americans 0, 1, Many, a Zillion Give your Commodore 64 new life with an SD card reader 20 lines of code that will beat A/B testing every time [comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets UMA and free long distance UMA's dirty secrets Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets