Regular Expression Matching can be Ugly and Slow
/* match: search for re anywhere in text */ int match(char *re, char *text) { if (re[0] == '^') return matchhere(re+1, text); do { /* must look at empty string */ if (matchhere(re, text)) return 1; } while (*text++ != ''); return 0; } /* matchhere: search for re at beginning of text */ int matchhere(char *re, char *text) { if (re[0] == '') return 1; if (re[1] == '*') return matchstar(re[0], re+2, text); if (re[0] == '$' && re[1] == '') return *text == ''; if (*text!='' && (re[0]=='.' || re[0]==*text)) return matchhere(re+1, text+1); return 0; } /* matchstar: search for c*re at beginning of text */ int matchstar(int c, char *re, char *text) { do { /* a * matches zero or more instances */ if (matchhere(re, text)) return 1; } while (*text!='' && (*text++==c || c=='.')); return 0; }
However, if you try to actually use it for something you will realize that it is really just a glorified filename globber, which is a much simpler problem than regular expressions. In particular, it doesn't support alternation (|), subexpressions, character ranges, or capturing.
I wondered what it would take to implement the whole shebang � capturing regular expressions as concisely as possible, in C.
int _capslot = 0; int match_char(char* re, char str) { return *re == '' ? re[1] == 'd' ? isdigit(str) : re[1] == 's' ? isspace(str) : re[1] == 'w' ? isalnum(str) || str == '_' : re[1] == 'n' ? str == 'n' : re[1] == 'r' ? str == 'r' : re[1] == 't' ? str == 't' : re[1] == str : (*re == '.' || *re == str); } // Performs a capturing match-here operation. Returns the number of characters // matched. // // re: Indirect pointer to regular expression // str: Indirect pointer to string to match. // skip: must be 0. (Used internally for recursion) // capture: Returned array of captured subexpressions. To determine the size // needed, count the number of unescaped '(' in the expression. Subexpressions // may be nested. int parse_expr(char** re, char** str, int skip, char* capture[]) { int ok = 0; char* backup_str = *str; for( ;; ) { int ok2 = 1, skip2 = skip; while( **re != 0 && **re != '|' && **re != ')' ) { int ok3 = 1, cnt=0, backup_capslot = _capslot; char* backup_re = *re; for( ;; *re = backup_re, _capslot = backup_capslot ) { char* backup_str = *str; if ( **re == 0 ) { } else if ( **re == '(' ) { char* start = *str; int slot = _capslot++; (*re)++; ok3 = !!parse_expr(re,str,skip2, capture); if ( ok3 && capture ) { free( capture[slot] ); capture[slot] = (char*)malloc( *str-start+1 ); memcpy( capture[slot], start, *str-start ); capture[slot][*str-start] = 0; } if ( **re == ')' ) (*re)++; } else if ( **re == '[' ) { int inv = 0; (*re)++; ok3 = 0; while( **re && **re != ']' ) { if ( **re == '^' ) { inv = 1; } else if ( **re != '' && (*re)[1] == '-' ) { ok3 |= **str >= (*re)[0] && **str <= (*re)[2]; *re += 2; } else if ( match_char( *re, **str ) ) { ok3 = 1; } (*re)++; } ok3 ^= inv; ok3 && (*str)++; if ( **re == ']' ) (*re)++; } else if ( skip2 ) { *re += (**re == '') + 1; } else if ( match_char(*re, **str ) ) { *re += (**re == '') + 1; (*str)++; } else { *re += (**re == '') + 1; ok3 = 0; } if ( !ok3 ) *str = backup_str; cnt += ok3; if ( **re != '+' && **re != '*' || skip2 || !ok3 ) break; } if ( **re == '?' ) { (*re)++; ok3 = 1; } else if ( **re == '+' ) { (*re)++; ok3 = cnt >= 1; } else if ( **re == '*' ) { (*re)++; ok3 = cnt >= 0; } ok2 &= ok3; skip2 |= !ok2; } ok |= ok2; if ( **re != '|' ) break; (*re)++; ok && (skip = 1) || (*str = backup_str); } if ( !ok ) *str = backup_str; return *str - backup_str; } int main(void) { char* text = "September 22, 1966"; char* expr = "([A-Za-z]+) (d+), (d+)"; char* fields[3]; parse_expr( &expr, &text, 0, fields ); printf("Month: %s, Day: %s, Year: %sn", fields[0], fields[1], fields[2]); return 0; }
This is horrible on so many levels. First of all, it is impossible to understand. Secondly, it uses backtracking, which means if the expression did not match a certain way it keeps trying until all possibilities are exhausted. Thirdly, I forgot to put in "$" and "^".
There is a better way
A more efficient way to match regular expressions has been known for decades, and is described nicely in Regular Expression Matching can be Simple And Fast. I won't repeat it here (yet).However, I will point out some code that is beautiful.
In 1993, Anthony Howe's concise implementation of grep won the Obfuscated C Coding competition. It lazily constructs a finite state machine for the expression as it goes, forming only the parts needed to match or reject the input text.
Howe kindly released an unobfuscated version of his entry. If you are interested in the innards of regular expressions, after you read Regular Expression Matching can be Simple And Fast, then take a look at the elegant representation of NFAs in Howe's comments. He describes how the state machine is stored using a flat array. No structures, nodes, or pointers are needed.
If you're used to making an object for everything, then spending 15 minutes to understand this is guaranteed to expand your mind.
The full source code is now on github. (Look for agag.c).
/* * Non-deterministic Finite-state Automaton * * An NFA is a directed graph whose nodes are called states. Each * state is either a labeled state denoting a literal value, or a * NIL state that has no label. Each state has at most two edges * leaving it. There is only one start state and one final state, * which may be the same state. * * Various state-structures can be represented by an array. These * structures have only one start point being the left most index, * and one end point being the right most index. * * a ...[a]... * * ab ...[a][b]... * * ^a [bol][a]... * * a$ ...[a][eol] * * a.b ...[a][wild][b]... * _ * v * a* ...[][a][]... * ____^ * * a? ...[][a]... * ___^ * _______________ * / v * a|b ...[][a][/][stop][pass][b]... * _____________^ * * [a0-9] ...[class][a][0][1][2][3][4][5][6][7][8][9][stop]... * * [^a0-9] ...[nclass][a][0][1][2][3][4][5][6][7][8][9][stop]... * * _____________ _______________ * / v/ v * a|b|c ...[][a][/][stop][][b][/][stop][pass][c]... * _____________^_____________^ * * _ _______________ * v / v * a*(b|c) ...[][a][][][b][/][stop][pass][c]... * ____^ _____________^ */ #define BOL ('n') #define EOL ('n') #define STOP ('n') #define WILD (0) #define CLASS (-1) #define NCLASS (-2) #define PASS (-3) #define JUMP(n) (-4-(n)) #define CHAR(c) (c) #define ISJUMP(x) ((x) <= PASS) #define ISCHAR(x) (0 <= (x))
VP trees: A data structure for finding stuff fast
Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of person that the photo most resembles?In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at a points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.