Let's continue our short tour of data structures for storing words. Today, we will over-optimize John Resig's Word Game. Along the way, we shall learn about a little-known branch of computer science, called succinct data structures.
John wants to load a large dictionary of words into a web application, so his Javascript program can quickly check if a word is in the dictionary. He could transfer the words as long string, separated by spaces. This doesn't take much space once it is gzip-compressed by the web server. However, we also have to consider the amount of memory used in the browser itself. In a mobile application, memory is at a premium. If the user switches tabs, everything not being used is swapped out to flash memory. This results in long pauses when switching back.
One of the best data structures for searching a dictionary is a trie. The speed of search does not depend on the number of words in the dictionary. It depends only on the number of letters in the word. For example, here is a trie containing the words "hat", "it", "is", and "a". The trie seems to compress the data, since words sharing the same beginnings only show up once.
We need to solve two problems. If we transmit the word list to the web browser, it then has to build the trie structure. This takes up a lot of time and memory. To save time, we could pre-encode the trie on the server in JSON format, which is parsed very quickly by the web browser. However, JSON is not a compact format, so some bandwidth is wasted downloading the data to the browser. We could avoid the wasted bandwidth by compressing the trie using a more compact format. The data is then smaller, but the web browser still has to decompress it to use it. In any case, the browser needs to create the trie in memory.
This leads us to the the second major problem. Despite appearances, tries use a lot of memory to store all of those links between nodes.
Fortunately, there is a way to store these links in a tiny amount of space.
A succinct data structure encodes data very efficiently, so that it does not need to be decoded to be used. Everything is accessed in-place, by reading bits at various positions in the data. To achieve optimal encoding, we use bits instead of bytes. All of our structures are encoded as a series of 0's and 1's.
Two important functions for succinct structures are:
Corresponding functions exist to find the rank/select of 0's instead of 1's. The rank function can be implemented in O(1) time using a lookup table (called a "directory"), which summarizes the number of 1's in certain parts of the string. The select() function is implemented in O(logn) time by performing binary search on the rank() function. It is possible to implement select in constant time, but it is complicated and space-hungry.
| p | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Bit | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| rank(p) | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| select(p) | 0 | 1 | 7 |
We then process the nodes in level order -- that is, we go row by row and process the nodes left to right. We encode them to the bit string in that order.
In the picture below, I've labeled each node in level order for convenience. I've also placed the nodes encoding above it. The encoding is a "1" for each child, plus a 0. So a node with 5 children would be "111110" and a node with no children is "0".
Now, we encode the nodes one after another. In the example, the bits would be 10111010110010000. I've separated them out in this table so you can see what's going on, but only the middle row is actually stored.
| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bit | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Node | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||||||
We then encode the data for each node after that. To get the data for a given node, just read it directly from that node's index in the data array.
hiaatst
The main thing that we want to do with a trie is follow links from each node to its children. Using our encoding, we can follow a link using a simple formula. If a node is numbered i, then the number of its first child is select0(i + 1) - i. The second child is the one after that, and so forth. To obtain the number of children, look up the first child of the i+1th node and subtract, since they are stored consecutively.
For example: We want the first child of node 2. The 3rd 0 is at position 7. Seven minus two is five. Therefore the first child is numbered 5. Similarly the first child of node 3 is found to be 7 by this formula (no, it doesn't really exist, but it works for the calculation). So node 2 has 7 minus 5 equals 2 children.
Using this encoding method, a 611K dictionary containing 80000 words is compressed to 216K, or 132K gzipped. The browser does not need to decode it to use it. The whole trie takes as much space as a 216K string.
To avoid problems with UTF encoding formats and escaped characters, the bit strings are encoded in BASE-64. All of the bit decoding functions are configured to operated on BASE64 encoded units, so that the input string does not need to be decoded before being used.
We only handle the letters "a" to "z" in lower case. That way, we can encode each letter in 5 bits.
You can decrease space usage and performance by increasing the L2 constant, and setting L1 = L2*L2. This controls the number of bits summarized in each section of the rank directory. L2 is the maximum number of bits that have to be scanned to implement rank(). More bits means fewer directory entries, but the select() and rank() functions will take longer to scan the range of bits.
In part 1 of this series, I described how to find the closest match in a dictionary of words using a Trie. Such searches are useful because users often mistype queries. But tries can take a lot of memory -- so much that they may not even fit in the 2 to 4 GB limit imposed by 32-bit operating systems.

In part 2, I described how to build a MA-FSA (also known as a DAWG). The MA-FSA greatly reduces the number of nodes needed to store the same information as a trie. They are quick to build, and you can safely substitute an MA-FSA for a trie in the fuzzy search algorithm.

There is a problem. Since the last node in a word is shared with other words, it is not possible to store data in it. We can use the MA-FSA to check if a word (or a close match) is in the dictionary, but we cannot look up any other information about the word!
If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table. Here's an example of a hash table that uses separate chaining. To look up a word, we run it through a hash function, H() which returns a number. We then look at all the items in that "bucket" to find the data. Since there should be only a small number of words in each bucket, the search is very fast.

Notice that the table needs to store the keys (the words that we want to look up) as well as the data associated with them. It needs them to resolve collisions -- when two words hash to the same bucket. Sometimes these keys take up too much storage space. For example, you might be storing information about all of the URLs on the entire Internet, or parts of the human genome. In our case, we already store the words in the MA-FSA, and it is redundant to duplicate them in the hash table as well. If we could guarantee that there were no collisions, we could throw away the keys of the hash table.
Perfect hashing is a technique for building a hash table with no collisions. It is only possible to build one when we know all of the keys in advance. Minimal perfect hashing implies that the resulting table contains one entry for each key, and no empty slots.
We use two levels of hash functions. The first one, H(key), gets a position in an intermediate array, G. The second function, F(d, key), uses the extra information from G to find the unique position for the key. The scheme will always returns a value, so it works as long as we know for sure that what we are searching for is in the table. Otherwise, it will return bad information. Fortunately, our MA-FSA can tell us whether a value is in the table. If we did not have this information, then we could also store the keys with the values in the value table.
In the example below, the words "blue" and "cat" both hash to the same position using the H() function. However, the second level hash, F, combined with the d-value, puts them into different slots.

In step 1, we place the keys into buckets according to the first hash function, H.
In step 2, we process the buckets largest first and try to place all the keys it contains in an empty slot of the value table using F(d=1, key). If that is unsuccessful, we keep trying with successively larger values of d. It sounds like it would take a long time, but in reality it doesn't. Since we try to find the d value for the buckets with the most items early, they are likely to find empty spots. When we get to buckets with just one item, we can simply place them into the next unoccopied spot.
Here's some python code to demonstrate the technique. In it, we use H = F(0, key) to simplify things.
#!/usr/bin/python
# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
import sys
DICTIONARY = "/usr/share/dict/words"
TEST_WORDS = sys.argv[1:]
if len(TEST_WORDS) == 0:
TEST_WORDS = ['hello', 'goodbye', 'dog', 'cat']
# Calculates a distinct hash function for a given string. Each value of the
# integer d results in a different hash value.
def hash( d, str ):
if d == 0: d = 0x01000193
# Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
for c in str:
d = ( (d * 0x01000193) ^ ord(c) ) & 0xffffffff;
return d
# Computes a minimal perfect hash table using the given python dictionary. It
# returns a tuple (G, V). G and V are both arrays. G contains the intermediate
# table of values needed to compute the index of the value in V. V contains the
# values of the dictionary.
def CreateMinimalPerfectHash( dict ):
size = len(dict)
# Step 1: Place all of the keys into buckets
buckets = [ [] for i in range(size) ]
G = [0] * size
values = [None] * size
for key in dict.keys():
buckets[hash(0, key) % size].append( key )
# Step 2: Sort the buckets and process the ones with the most items first.
buckets.sort( key=len, reverse=True )
for b in xrange( size ):
bucket = buckets[b]
if len(bucket) <= 1: break
d = 1
item = 0
slots = []
# Repeatedly try different values of d until we find a hash function
# that places all items in the bucket into free slots
while item < len(bucket):
slot = hash( d, bucket[item] ) % size
if values[slot] != None or slot in slots:
d += 1
item = 0
slots = []
else:
slots.append( slot )
item += 1
G[hash(0, bucket[0]) % size] = d
for i in range(len(bucket)):
values[slots[i]] = dict[bucket[i]]
if ( b % 1000 ) == 0:
print "bucket %d r" % (b),
sys.stdout.flush()
# Only buckets with 1 item remain. Process them more quickly by directly
# placing them into a free slot. Use a negative value of d to indicate
# this.
freelist = []
for i in xrange(size):
if values[i] == None: freelist.append( i )
for b in xrange( b, size ):
bucket = buckets[b]
if len(bucket) == 0: break
slot = freelist.pop()
# We subtract one to ensure it's negative even if the zeroeth slot was
# used.
G[hash(0, bucket[0]) % size] = -slot-1
values[slot] = dict[bucket[0]]
if ( b % 1000 ) == 0:
print "bucket %d r" % (b),
sys.stdout.flush()
return (G, values)
# Look up a value in the hash table, defined by G and V.
def PerfectHashLookup( G, V, key ):
d = G[hash(0,key) % len(G)]
if d < 0: return V[-d-1]
return V[hash(d, key) % len(V)]
print "Reading words"
dict = {}
line = 1
for key in open(DICTIONARY, "rt").readlines():
dict[key.strip()] = line
line += 1
print "Creating perfect hash"
(G, V) = CreateMinimalPerfectHash( dict )
for word in TEST_WORDS:
line = PerfectHashLookup( G, V, word )
print "Word %s occurs on line %d" % (word, line)
| Number of items | Time (s) |
|---|---|
| 100000 | 2.24 |
| 200000 | 4.48 |
| 300000 | 6.68 |
| 400000 | 9.27 |
| 500000 | 11.71 |
| 600000 | 13.81 |
| 700000 | 16.72 |
| 800000 | 18.78 |
| 900000 | 21.12 |
| 1000000 | 24.816 |
Here's a pretty chart.
In the 80's, computers started instantly. They were READY to go when they first turned on.
Over the next few decades, people wanted to do more things and operating systems got slower to initialize. To solve this, OS and hardware manufacturers created hibernate and standby modes.
Now, many people have stopped using native applications and moved to the web. When I load facebook or gmail, it takes dozens of seconds to start up, and minutes over a slower connection. During this time,
It would be trivial to snapshot the DOM and application state in Javascript and provide access to these snapshots with a simple API. The API would also allow you to discard an application version that is too old, or convert the state to the newer one. Then, application startup would be instantaneous.
Or, without any co-operation from standards, browsers can do this RIGHT NOW and snapshot commonly used pages instead of discarding them when users close a tab. When the url is re-entered, from the application perspective it is just as if the machine went into standby and then resumed. The browser could take cookie expiration into account, or to be totally safe, web pages could opt in with a meta tag.
Just sayin'.
Are you looking for a nifty way to choose colours that stand out? Are you the type of person who is not satisfied until you have mathematically proven that your choice is optimal?
One way to do it is to treat red, green, and blue colour values as coordinates in a cube. Two colours are different if the distance between their coordinates is large. But the RGB colour space is not perceptually uniform. Because of the way the human eye works, lots of greens look the same, but we can easily see the difference between subtle shades of yellow. That's why George Takei is hawking TVs. It's also why perceptually uniform colour spaces, such as LAB or LUV warp the cube, as if it were made of play-dough, and left out in the sun for a while. The result is that the differences between the coordinates almost correspond to the perceived difference between two colours for most people.
The colour-space calculations are all on wikipedia, and they are dead simple to implement. For fun, I put them into a simple force system using Javascript (You'll need an HTML5 browser to view. If you're using an RSS reader, you'll have to go to my blog to see it.)
Below are all of the CSS colours which have names commonly recognized by most browsers. Every colour name from "AliceBlue" to "Gainsboro" to "YellowGreen" is there. The circles float freely, and are repelled by each other and the four sides of their container.
When you click on a colour, the background changes to that colour. All of the circles are then attracted to a vertical position based on how different they are from the background. Those near the top are close to the background colour. Those near the bottom are further away from the background. You can change the colour space in which the distance is calculated by clicking on band at the top of the container.
For HSV, the H parameter is divided by 360 before the distance is calculated, to make its influence fair, since the S and V values range from 0 to 1.
To my eye, both RGB and LAB perform well for finding differences, but HSV results in some odd choices. To choose a contrasting colour, use RGB or LAB and avoid picking anything less than a third of the way down.
The source code is released to the public domain.
Last time, I wrote about how to speed up spell checking using a trie (also known as a prefix tree). However, for large dictionaries, a trie can waste a lot of memory. If you're trying to squeeze an application into a mobile device, every kilobyte counts. Consider this trie, with two words in it.
It can be shortened in a way so that any program accessing it would not even notice.
As early as 1988, ScrabbleTM programs were using structures like the above to shrink the their dictionaries. Over the years, the structure has been called many things. Some web pages call it a DAWG (Direct Acyclic Word Graph). But computer scientists have adopted the name "Minimal Acyclic Finite State Automaton", because some papers were already using the name DAWG for something else.
The most obvious way to build a MA-FSA, as suggested in many other web pages, is to first build the trie, and look for duplicate branches. I tried this on a list of 7 million words that I had. I wrote the algorithm in C++, but no matter how hard I tried, I kept running out of memory. A trie (or prefix tree) uses a lot of memory compared to a DAWG. It would be much better if one could create the DAWG right away, without first creating a trie. Jan Duciuk describes such a method in his paper. The central idea is to check for duplicates after you insert each word, so that the structure never gets huge.
The paper that I am paraphrasing, by Jan Daciuk and others, also describes a way to insert words out of order. But it is more complicated. In most cases, you can arrange to add your words in alphabetical order.
Two nodes are considered the same if they are both the final part of a word, or they are both not the final part of a word. They also need to have exactly the same edges pointing to exactly the same other nodes.
We start eliminating duplicates starting from the bottom of the branch, so each elimination can reveal more duplicates. Eventually, the branch of the trie zips together with a prior branch.
Step 1:

Several steps later:
If you have a large word list, you could run it through gzip and get much better compression. The reason for storing a dictionary this way is to save space and remain easily searchable, without needing to decompress it first. Tries and MA-FSAs can support fuzzy search and prefix queries, so you can do spell checking and auto-completion. They can easily scale up to billions of entries. They have even been used to store large portions of the human genome. If you don't care about memory or speed, just store your words in an SQL database, or spin up 100 machines "in the cloud". I don't mind. More power to you!
MA-FSAs can be stored in as little as 4 bytes per edge-connector, as described by this web page.
#!/usr/bin/python
# By Steve Hanov, 2011. Released to the public domain.
import sys
import time
DICTIONARY = "/usr/share/dict/words"
QUERY = sys.argv[1:]
# This class represents a node in the directed acyclic word graph (DAWG). It
# has a list of edges to other nodes. It has functions for testing whether it
# is equivalent to another node. Nodes are equivalent if they have identical
# edges, and each identical edge leads to identical states. The __hash__ and
# __eq__ functions allow it to be used as a key in a python dictionary.
class DawgNode:
NextId = 0
def __init__(self):
self.id = DawgNode.NextId
DawgNode.NextId += 1
self.final = False
self.edges = {}
def __str__(self):
arr = []
if self.final:
arr.append("1")
else:
arr.append("0")
for (label, node) in self.edges.iteritems():
arr.append( label )
arr.append( str( node.id ) )
return "_".join(arr)
def __hash__(self):
return self.__str__().__hash__()
def __eq__(self, other):
return self.__str__() == other.__str__()
class Dawg:
def __init__(self):
self.previousWord = ""
self.root = DawgNode()
# Here is a list of nodes that have not been checked for duplication.
self.uncheckedNodes = []
# Here is a list of unique nodes that have been checked for
# duplication.
self.minimizedNodes = {}
def insert( self, word ):
if word < self.previousWord:
raise Exception("Error: Words must be inserted in alphabetical " +
"order.")
# find common prefix between word and previous word
commonPrefix = 0
for i in range( min( len( word ), len( self.previousWord ) ) ):
if word[i] != self.previousWord[i]: break
commonPrefix += 1
# Check the uncheckedNodes for redundant nodes, proceeding from last
# one down to the common prefix size. Then truncate the list at that
# point.
self._minimize( commonPrefix )
# add the suffix, starting from the correct node mid-way through the
# graph
if len(self.uncheckedNodes) == 0:
node = self.root
else:
node = self.uncheckedNodes[-1][2]
for letter in word[commonPrefix:]:
nextNode = DawgNode()
node.edges[letter] = nextNode
self.uncheckedNodes.append( (node, letter, nextNode) )
node = nextNode
node.final = True
self.previousWord = word
def finish( self ):
# minimize all uncheckedNodes
self._minimize( 0 );
def _minimize( self, downTo ):
# proceed from the leaf up to a certain point
for i in range( len(self.uncheckedNodes) - 1, downTo - 1, -1 ):
(parent, letter, child) = self.uncheckedNodes[i];
if child in self.minimizedNodes:
# replace the child with the previously encountered one
parent.edges[letter] = self.minimizedNodes[child]
else:
# add the state to the minimized nodes.
self.minimizedNodes[child] = child;
self.uncheckedNodes.pop()
def lookup( self, word ):
node = self.root
for letter in word:
if letter not in node.edges: return False
node = node.edges[letter]
return node.final
def nodeCount( self ):
return len(self.minimizedNodes)
def edgeCount( self ):
count = 0
for node in self.minimizedNodes:
count += len(node.edges)
return count
dawg = Dawg()
WordCount = 0
words = open(DICTIONARY, "rt").read().split()
words.sort()
start = time.time()
for word in words:
WordCount += 1
dawg.insert(word)
if ( WordCount % 100 ) == 0: print "%dr" % WordCount,
dawg.finish()
print "Dawg creation took %g s" % (time.time()-start)
EdgeCount = dawg.edgeCount()
print "Read %d words into %d nodes and %d edges" % ( WordCount,
dawg.nodeCount(), EdgeCount )
print "This could be stored in as little as %d bytes" % (EdgeCount * 4)
for word in QUERY:
if not dawg.lookup( word ):
print "%s not in dictionary." % word
else:
print "%s is in the dictionary." % word
Using this code, a list of 7 million words, taking up 63 MB, was translated into 6 million edges. Although it took more than a gigabyte of memory in Python, such a list could be stored in as little as 24 MB. Of course, gzip could do better, but the result would not be quickly searchable.
A MA-FSA is great for testing whether words are in a dictionary. But in the form I gave, it's not possible to retrieve values associated with the words. It is possible to include associated values in the automaton. Such structures are called "Minimal Acyclic Finite State Transducers". In fact, the algorithm I above can be easily modified to include a value. However, it causes the number of nodes to blow up, and you are much better off using a minimal perfect hash function in addition to your MA-FSA to store your data. I discuss this in part 3.
If you have a web site with a search function, you will rapidly realize that most mortals are terrible typists. Many searches contain mispelled words, and users will expect these searches to magically work. This magic is often done using levenshtein distance. In this article, I'll compare two ways of finding the closest matching word in a large dictionary. I'll describe how I use it on rhymebrain.com not for corrections, but to search 2.6 million words for rhymes, for every request, with no caching, on my super-powerful sock-drawer datacenter:

The levenshtein function take two words and returns how far apart they are. It's an O(N*M) algorithm, where N is the length of one word, and M is the length of the other. If you want to know how it works, go to this wikipedia page.
But comparing two words at a time isn't useful. Usually you want to find the closest matching words in a whole dictionary, possibly with many thousands of words. Here's a quick python program to do that, using the straightforward, but slow way. It uses the file /usr/share/dict/words. The first argument is the misspelled word, and the second argument is the maximum distance. It will print out all the words with that distance, as well as the time spent actually searching. For example:
smhanov@ubuntu1004:~$ ./method1.py goober 1
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 4.5575 s
Here's the program:
#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys
DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])
# read dictionary file
words = open(DICTIONARY, "rt").read().split();
# for brevity, we omit transposing two characters. Only inserts,
# removals, and substitutions are considered here.
def levenshtein( word1, word2 ):
columns = len(word1) + 1
rows = len(word2) + 1
# build first row
currentRow = [0]
for column in xrange( 1, columns ):
currentRow.append( currentRow[column - 1] + 1 )
for row in xrange( 1, rows ):
previousRow = currentRow
currentRow = [ previousRow[0] + 1 ]
for column in xrange( 1, columns ):
insertCost = currentRow[column - 1] + 1
deleteCost = previousRow[column] + 1
if word1[column - 1] != word2[row - 1]:
replaceCost = previousRow[ column - 1 ] + 1
else:
replaceCost = previousRow[ column - 1 ]
currentRow.append( min( insertCost, deleteCost, replaceCost ) )
return currentRow[-1]
def search( word, maxCost ):
results = []
for word in words:
cost = levenshtein( TARGET, word )
if cost <= maxCost:
results.append( (word, cost) )
return results
start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()
for result in results: print result
print "Search took %g s" % (end - start)
| k | a | t | e | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| c | 1 | 1 | 2 | 3 | 4 |
| a | 2 | 2 | 1 | 2 | 3 |
| t | 3 | 3 | 2 | 1 | 2 |
| k | a | t | e | ||
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | |
| c | 1 | 1 | 2 | 3 | 4 |
| a | 2 | 2 | 1 | 2 | 3 |
| t | 3 | 3 | 2 | 1 | 2 |
| s | 4 | 4 | 3 | 2 | 2 |
Only the last row changes. We can avoid a lot of work if we can process the words in order, so we never need to repeat a row for the same prefix of letters. The trie data structure is perfect for this. A trie is a giant tree, where each node represents a partial or complete word. Here's one with the words cat, cats, catacomb, and catacombs in it (courtesy of zwibbler.com). Nodes that represent a word are marked in black.

#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys
DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])
# Keep some interesting statistics
NodeCount = 0
WordCount = 0
# The Trie data structure keeps a set of words, organized with one node for
# each letter. Each node has a branch for each letter that may follow it in the
# set of words.
class TrieNode:
def __init__(self):
self.word = None
self.children = {}
global NodeCount
NodeCount += 1
def insert( self, word ):
node = self
for letter in word:
if letter not in node.children:
node.children[letter] = TrieNode()
node = node.children[letter]
node.word = word
# read dictionary file into a trie
trie = TrieNode()
for word in open(DICTIONARY, "rt").read().split():
WordCount += 1
trie.insert( word )
print "Read %d words into %d nodes" % (WordCount, NodeCount)
# The search function returns a list of all words that are less than the given
# maximum distance from the target word
def search( word, maxCost ):
# build first row
currentRow = range( len(word) + 1 )
results = []
# recursively search each branch of the trie
for letter in trie.children:
searchRecursive( trie.children[letter], letter, word, currentRow,
results, maxCost )
return results
# This recursive helper is used by the search function above. It assumes that
# the previousRow has been filled in already.
def searchRecursive( node, letter, word, previousRow, results, maxCost ):
columns = len( word ) + 1
currentRow = [ previousRow[0] + 1 ]
# Build one row for the letter, with a column for each letter in the target
# word, plus one for the empty string at column 0
for column in xrange( 1, columns ):
insertCost = currentRow[column - 1] + 1
deleteCost = previousRow[column] + 1
if word[column - 1] != letter:
replaceCost = previousRow[ column - 1 ] + 1
else:
replaceCost = previousRow[ column - 1 ]
currentRow.append( min( insertCost, deleteCost, replaceCost ) )
# if the last entry in the row indicates the optimal cost is less than the
# maximum cost, and there is a word in this trie node, then add it.
if currentRow[-1] <= maxCost and node.word != None:
results.append( (node.word, currentRow[-1] ) )
# if any entries in the row are less than the maximum cost, then
# recursively search each branch of the trie
if min( currentRow ) <= maxCost:
for letter in node.children:
searchRecursive( node.children[letter], letter, word, currentRow,
results, maxCost )
start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()
for result in results: print result
print "Search took %g s" % (end - start)
Here are the results:
smhanov@ubuntu1004:~$ ./method1.py goober 1
Read 98568 words into 225893 nodes
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 0.0141618 s
The second algorithm is over 300 times faster than the first. Why? Well, we create at most one row of the table for each node in the trie. The upper bound for the runtime is O(<max word length> * <number of nodes in the trie>). For most dictionaries, considerably less than O(<number of words> * <max word length>^2)
In December, I realized that Google had released their N-grams data, a list of all of the words in all of the books that they have scanned for their Books search feature. When I imported them all into RhymeBrain, my dictionary size at once increased from 260,000 to 2.6 million, and I was having performance problems.
I already stored the words in a trie, indexed by pronunciation instead of letters. However, to search it, I was first performing a quick and dirty scan to find words that might possibly rhyme. Then I took that large list and ran each one through the levenshtein function to calculate RhymeRankTM. The user is presented with only the top 50 entries of that list.
After a lot of deep thinking, I realized that the levenshtein function could be evaluated incrementally, as I described above. Of course, I might have realized this sooner if I had read one of the many scholarly papers on the subject, which describe this exact method. But who has time for that? :)
With the new algorithm, queries take between 19 and 50 ms even for really long words, but the best part is that I don't need to maintain two separate checks (quick and full), and the RhymeRankTM algorithm is performed uniformly for each of the 2.6 million words on my 1GHz Acer Aspire One datacenter.
(Previous articles on RhymeBrain)
In the blog article Fuzzy String Matching, the author presents a recursive solution using memoization (caching). This is equivalent to flood-filling a diagonal band across the table. It gives a runtime of O(k * <number of nodes in the trie>), where k is the maximum cost. You can modify my algorithm above to only fill in only some entries of the table. I tried it, but it made the examples too complex and actually slowed it down. I blame my python skills.
Update: I just realized the author has created a new solution for dictionary search, also based on tries. I quickly tried it on my machine and dictionary, and got a time of 0.009301, assuming the prefix tree is prebuilt. It's slightly faster for an edit distance of 1! But somethings going on, because it takes 1.5 s for an edit distance of 4, whereas my code takes only 0.44. Phew!
And of course, you could create a levenshtein automaton, essentially a big honking regular expression that matches all possible mispellings. But to do it efficiently you need to write big honking gobs of code. (The code in the linked article does not do it efficiently, but states that it is possible to do so.) Alternatively, you could enumerate all possible mispellings and insert them into a MA-FSA or DAWG to obtain the regular expression.
The imaginary Larmin Corp is designing the next killer product: A mood ring. Okay it's too big to wear around your finger and is more of a wrist device. But it works with 80% accuracy and it's got its own app store and it is expected to be a big hit at CES. There's a snag: unnamed sources are attributing the delay in the product launch to the "On/Off" problem. Larmin Corp denied all rumours and promptly launched lawsuits against the unnamed sources, their children, and pets, and the everyone at the bar that night.
Here's how the device works:
How do you turn on and off this system? One way is to ignore the delays and pretend it takes no time to turn on and off. The UI simply freezes up until the task is done.
The problem is sometimes mood detector takes a little longer to turn on, and user's think it's crashed and exhibit extreme anger. Some even start banging the entire device on the desk.
So we don't freeze the UI during the turn-on procedure. But this leads to the following behaviour:
As users get impatient waiting for it to turn on, they keep restarting the procedure. But if you try to turn off the detector while it is turning on, it crashes. The UI team first decides to handle this by adding another layer above the mood detector. If you send a command to it, and the mood detector is busy, it stores it in a queue for later. As soon as the mood detector completes, the layer replays the next queued action.
The problem is the user gets impatient and starts repeatedly hitting the button, and the device eventually gets so many commands queued up that it just sits there, repeatedly turning on and off until the user slams it against the wall and the battery falls out. Also, if mood detector ever turns on while the user is angry, it screws up the detector's calibration. (In version 1, users are instructed to be in a neutral mood when activating the ring).
So the design architects bring out the big guns and propose a "OnOffManager". Instead of using a queue of commands, the OnOffManager remembers the last requested state and uses it.

This works pretty well, except that during the design phase the graphics designer gets fed up with the whole debate and and simply grays out the button with an ajax spinny thing, so that any further clicks are ignored during turn on. The OnOffManager code is left in, because it took six months to design, but it is never exercised. Everyone lives happily ever after.
Wait, scratch that. Shortly before release, someone writes a location aware app which periodically turns on the mood detector and sends its status to Facebook. Another group is working on the highly secretive "mood gestures" app, which turns off the mood detector if the user thinks a certain sequence of moods. It's not long before somebody complains about their mood ring randomly turning on and off all the time. After analysis, we see the following sequence:
It's an easy fix. The On/Off manager is modified to keep a count of every app that wants it on. The mood detector is only turned on when the counter goes from 0 to 1, and off when the counter goes from 1 to 0. All other states are ignored.
All development is halted and design committees are formed. Soon, no meeting room at Larmin is available because they are all full of developers talking about the problem. Curious discussions like this one are overheard: "If I turn you on, but George turns you off, are you on or are you off?" This is followed by snickers.
And then someone proposes a solution: The mood ring will have a "soft off" function. You can turn it off, but it will still be allowed to turn on again by third party apps, unless you turn it really off. Provisional patents are quickly filed for the software and the design of the power button itself. The software folks toss around the ideas of what off, and really off mean, and whether it makes sense for the ring to be really turned on *snicker*.
Eventually, they come up with the generalized solution. The On/Off manager shall have two counters. One is for a special class of apps that are designated as "System Apps". There's only one for now -- the on off button. But there could be a plurality in the future. The other counter works as before for third-party apps. If the system counter is 0, the mood detector is off and any other commands are ignored. However, if the system counter is non-zero, then the value of the app counter is used to determine if the mood detector should be on. This is illustrated in the following sequence diagram.
And that, folks, is how something like turning the system on and off can grow in complexity very quickly. Soon, Larmin Corp will add low power modes and the special "BlueMood" peripheral, which transmits the moods to other users, but due to brain wave interference patterns, it only works with the mood detector is off even though the user has buttons for both independently.
Come back next time, to read about how the moods are sent from the detector to the display, in "States of confusion".
Making a web application mashable -- useable in another web page -- has some challenges in the area of cross-domain communications. Here is how I solved those problems for Zwibbler.com. (See the API demo here)
Zwibbler consists of a large javascript program and a little HTML. The javascript part uses Ajax methods to send POST requests back to the zwibbler.com server, to render some items without the limitations of the CANVAS tag. In particular, this allows it to support PDF output as well as SVG and PNG.
If you want to include Zwibbler.com on another web site, the zwibbler application still needs to communicate with zwibbler.com in order to perform these tasks. However, browsers will not allow this due to security restrictions. Javascript code can only communicate with the server that the main web page came from.
HTML allows you to embed one web page inside another, in the <iframe> element. They remain essentially separated. The container web site is only allowed to talk to its web server, and the iframe is only allowed to talk to its originating server. Furthermore, because they have different origins, the browser disallows any contact between the two frames. That includes function calls, and variable accesses.

For example, to pass a message from the container web page to the iframe, we use:
iframe.contentWindow.postMessage("hello there", "http://zwibbler.com");
The receiver of the message must have previously registered for an HTML event named "message". This event arrives via the same mechanism as mouse clicks.
window.addEventListener("message", function( event ) {
if ( event.data === "hello there" ) {
// event.origin contains the host of the sending window.
alert("Why, hello to you too, " + event.origin);
}
}, false );
The Messenger object must also keep track of how to direct the replies it receives. It assigns each request a unique ticket, and stores them in a table along with the reply function. When a reply with a matching ticket is recieved, the corresponding function is called:
Messenger.prototype = {
init: function( targetFrame, targetDomain) {
// The DOM node of the target iframe.
this.targetFrame = targetFrame;
// The domain, including http:// of the target iframe.
this.targetDomain = targetDomain;
// A map from ticket number strings to functions awaiting replies.
this.replies = {};
this.nextTicket = 0;
var self = this;
window.addEventListener("message", function(e) {
self.receive(e);
}, false );
},
send: function( functionName, args, replyFn ) {
var ticket = "ticket_" + (this.nextTicket++);
var text = JSON.stringify( {
"function": functionName,
"args": args,
"ticket": ticket
});
if ( replyFn ) {
this.replies[ticket] = replyFn;
}
this.targetFrame.postMessage( text, this.targetDomain );
},
The receive function first checks the origin of the message. If it is not the one that we expected, then we ignore the message. Maybe it's from another iframe, such as an ad or a game that happens to be on the same page. It then checks to see if it has a ticket number. If so, it decodes the arguments and calls the associated reply function.
receive: function( e ) {
if ( e.origin !== this.targetDomain ) {
// not for us: ignore.
return;
}
var json;
try {
json = JSON.parse( e.data );
} catch(except) {
alert( "Syntax error in response from " + e.origin + ": " + e.data );
return;
}
if ( !(json["ticket"] in this.replies ) ) {
// no reply ticket.
return;
}
var replyFn = this.replies[json["ticket"]];
delete this.replies[json["ticket"]];
var args = [];
if ( "args" in json ) {
args = json["args"];
}
replyFn.apply( undefined, args );
},
So we have to add a bit of logic to the above code. When the iframe completes initializing, it sends a message consisting of the text "ready" to its parent window. If the Messenger is asked to send a message, and it has not yet received the "ready" message, then instead of sending it, it adds it to a queue for later. When it finally receives the ready message, it loops through the queue and finally sends all of the waiting messages to the iframe.
The complete code is contained in component.js
There are at least two kinds of programming interviews. One type is where you are asked for details about your prior work experience. The other one is where they put you in a room, give you a problem, and stare at you while you fumble around with markers on a whiteboard for 45 minutes. The first focuses on what you have done in the past. The second focuses on what you can do in the room right now without looking anything up. You should be prepared for either.
The problem is that they require deep thought and introspection to answer, so it’s important to do that thinking in advance. Take an hour and think about the answers to these questions (you can use the same answer for more than one). For questions where you need to tell a story, your answer should follow this format:
Before each interview, go through the entire list and practice your answers out loud. Doing this will give you an edge over the other candidates, because it will make you more comfortable during the interview. When asked a question, other candidates will be staring at the ceiling saying "ummm", trying to remember everything that happened to them in the the past five years. Meanwhile, you'll smile, look the interviewer in the eye, and launch into your story.
Some interviewers believe that being able to solve brain-teasers equates to good programming ability. In case you get one of these, you should develop a passing interest in puzzles and techniques for solving them. A visit to your local library will result in a dozen books, filled with puzzles to practice. Pick some interesting problems to tackle, and resist looking up the answers until you have spent at least a half hour on each one.
A powerful technique is to show real enthusiasm. As human beings, we can’t help responding in kind and becoming excited to work with you. On the other hand, we also have evolved the ability to see through fake smiles, so it’s vital that you be genuinely yourself.
The best interviewers will try to get you to talk about something that you are passionate about, even if it doesn't directly relate to the job. Most interviewers, however, will not. You will have to think about something that you've done that excites you, and look for opportunities to talk about it. Do this early in the interview. After the first 10 minutes it is probably too late, since the interviewers will have already ranked you.
Picture yourself coming in to work at this new job on the first day, turning on the new quad-core development workstation, meeting some interesting new friends, and learning about life at the company. There’s got to be something exciting about that. Otherwise, why are you applying?
If you install the default "ubuntu-desktop" you also get with it a gigabyte of crap that you will never use. But if you don't install the ubuntu desktop, you get a system with a text-only login: prompt, and it's not clear what to install to get it to a usable state.
I have an irrational need to optimize my Ubuntu installation. I did some investigating and came up with this method, which gives a minimal graphical 1.2 GB install, with gnome, networking, and no applications.
When you get to the "task selection" screen with the option to install ubuntu base server, kubuntu, etc, do not change anything at all.
sudo apt-get install gnome-panel gdm gnome-terminal
This will install a graphical environment and the login screen, so it will let you login by clicking on your username.
When it completes, reboot using "sudo reboot", or if you are super-geeky, type in the secret command to avoid rebooting.
sudo apt-get install network-manager network-manager-gnome gnome-power-manager hibernateThat will give you a battery monitor and the icon that lets you configure wireless networking. It also gives you the hibernate button when you shut down.
sudo apt-get install chromium-browser flashplugin-installer
sudo apt-get install ubuntu-artwork
sudo apt-get install gnome-panel gdm gnome-terminal network-manager-gnome network-manager gnome-power-manager hibernate chromium-browser ubuntu-artwork flashplugin-installer