Succinct Data Structures: Cramming 80,000 words into a Javascript file.
Posted on: 2011-03-20 22:47:59

Let's continue our short tour of data structures for storing words. Today, we will over-optimize John Resig's Word Game. Along the way, we shall learn about a little-known branch of computer science, called succinct data structures.

John wants to load a large dictionary of words into a web application, so his Javascript program can quickly check if a word is in the dictionary. He could transfer the words as long string, separated by spaces. This doesn't take much space once it is gzip-compressed by the web server. However, we also have to consider the amount of memory used in the browser itself. In a mobile application, memory is at a premium. If the user switches tabs, everything not being used is swapped out to flash memory. This results in long pauses when switching back.

One of the best data structures for searching a dictionary is a trie. The speed of search does not depend on the number of words in the dictionary. It depends only on the number of letters in the word. For example, here is a trie containing the words "hat", "it", "is", and "a". The trie seems to compress the data, since words sharing the same beginnings only show up once.

We need to solve two problems. If we transmit the word list to the web browser, it then has to build the trie structure. This takes up a lot of time and memory. To save time, we could pre-encode the trie on the server in JSON format, which is parsed very quickly by the web browser. However, JSON is not a compact format, so some bandwidth is wasted downloading the data to the browser. We could avoid the wasted bandwidth by compressing the trie using a more compact format. The data is then smaller, but the web browser still has to decompress it to use it. In any case, the browser needs to create the trie in memory.

This leads us to the the second major problem. Despite appearances, tries use a lot of memory to store all of those links between nodes.

Fortunately, there is a way to store these links in a tiny amount of space.

Succinct Data Structures

Succinct data structures were introduced in Guy Jacobson's 1989 thesis, which you cannot read because it is not available anywhere. Fortunately, this important work has been referenced by many other papers since then.

A succinct data structure encodes data very efficiently, so that it does not need to be decoded to be used. Everything is accessed in-place, by reading bits at various positions in the data. To achieve optimal encoding, we use bits instead of bytes. All of our structures are encoded as a series of 0's and 1's.

Two important functions for succinct structures are:

  • rank(x) - returns the number of bits set to 1, up to and including position x
  • select(y) - returns the position of the yth 1. This is the inverse of the rank function. For example, if select(8) = 10, then rank(10) = 8.

Corresponding functions exist to find the rank/select of 0's instead of 1's. The rank function can be implemented in O(1) time using a lookup table (called a "directory"), which summarizes the number of 1's in certain parts of the string. The select() function is implemented in O(logn) time by performing binary search on the rank() function. It is possible to implement select in constant time, but it is complicated and space-hungry.

p 0 1 2 3 4 5 6 7
Bit 1 1 0 0 0 0 0 1
rank(p) 1 2 2 2 2 2 2 3
select(p) 0 1 7

A Succinct Trie

Here's a trie containing the words "hat", "is", "it", and "a".

First, we add a "super root". This is just an additional node above the root. It's there to make the math work out later.

We then process the nodes in level order -- that is, we go row by row and process the nodes left to right. We encode them to the bit string in that order.

In the picture below, I've labeled each node in level order for convenience. I've also placed the nodes encoding above it. The encoding is a "1" for each child, plus a 0. So a node with 5 children would be "111110" and a node with no children is "0".

Now, we encode the nodes one after another. In the example, the bits would be 10111010110010000. I've separated them out in this table so you can see what's going on, but only the middle row is actually stored.

Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Bit 1 0 1 1 1 0 1 0 1 1 0 0 1 0 0 0 0
Node 0 1 2 3 4 5 6 7

We then encode the data for each node after that. To get the data for a given node, just read it directly from that node's index in the data array.

hiaatst

Getting the data

The main thing that we want to do with a trie is follow links from each node to its children. Using our encoding, we can follow a link using a simple formula. If a node is numbered i, then the number of its first child is select0(i + 1) - i. The second child is the one after that, and so forth. To obtain the number of children, look up the first child of the i+1th node and subtract, since they are stored consecutively.

For example: We want the first child of node 2. The 3rd 0 is at position 7. Seven minus two is five. Therefore the first child is numbered 5. Similarly the first child of node 3 is found to be 7 by this formula (no, it doesn't really exist, but it works for the calculation). So node 2 has 7 minus 5 equals 2 children.

Demo

Here is a demonstration, hosted on my faster server. (Source code: Bits.js) (It doesn't work in RSS readers -- go to my blog to see it. Paste a list of words in the top text area (or click Load dictionary to load one). Click "Encode" to create the trie and encode it. This step can be very slow, because I did not optimize the encoding process. Once encoding is complete, you can use the Lookup button to check if words are in the dictionary.

Using this encoding method, a 611K dictionary containing 80000 words is compressed to 216K, or 132K gzipped. The browser does not need to decode it to use it. The whole trie takes as much space as a 216K string.

Details

The directory contains the information needed to compute the rank and select functions quickly. The trie is the bitstring representing the trie and the connections between all of its nodes.

To avoid problems with UTF encoding formats and escaped characters, the bit strings are encoded in BASE-64. All of the bit decoding functions are configured to operated on BASE64 encoded units, so that the input string does not need to be decoded before being used.

We only handle the letters "a" to "z" in lower case. That way, we can encode each letter in 5 bits.

You can decrease space usage and performance by increasing the L2 constant, and setting L1 = L2*L2. This controls the number of bits summarized in each section of the rank directory. L2 is the maximum number of bits that have to be scanned to implement rank(). More bits means fewer directory entries, but the select() and rank() functions will take longer to scan the range of bits.

Caveats

I described how to create an MA-FSA in a previous article. There is no known way to succinctly encode one. You must store one pointer for each edge. However, as the number of words increases, an MA-FSA (also known as a DAWG) may eventually become more compact than the trie. This is because a trie does not compress common word endings together.

More...

Throw away the keys: Easy, Minimal Perfect Hashing
Posted on: 2011-03-09 18:00:00
CORRECTION: In this article, I incorrectly state that an acyclic finite state automata (aka a DAWG) cannot be used to retrieve values associated with its keys. I have since learned that it can. By storing in each internal node the number of leaf nodes that are reachable from it, we can, upon arriving at the destination key, know its unique index number. Then its value can be looked up in an array. I am now using this technique on rhymebrain.com

More...

Why don't web browsers do this?
Posted on: 2011-02-06 17:10:14

In the 80's, computers started instantly. They were READY to go when they first turned on.

Over the next few decades, people wanted to do more things and operating systems got slower to initialize. To solve this, OS and hardware manufacturers created hibernate and standby modes.

Now, many people have stopped using native applications and moved to the web. When I load facebook or gmail, it takes dozens of seconds to start up, and minutes over a slower connection. During this time,

  1. The source files for the application are loaded from the server,
  2. The source code is compiled and run.
  3. Requests are made to retrieve the application state from the server, and
  4. the DOM is manipulated to present the state to the user.

It would be trivial to snapshot the DOM and application state in Javascript and provide access to these snapshots with a simple API. The API would also allow you to discard an application version that is too old, or convert the state to the newer one. Then, application startup would be instantaneous.

Or, without any co-operation from standards, browsers can do this RIGHT NOW and snapshot commonly used pages instead of discarding them when users close a tab. When the url is re-entered, from the application perspective it is just as if the machine went into standby and then resumed. The browser could take cookie expiration into account, or to be totally safe, web pages could opt in with a meta tag.

Just sayin'.

More...

Fun with Colour Difference
Posted on: 2011-02-04 10:46:55

Are you looking for a nifty way to choose colours that stand out? Are you the type of person who is not satisfied until you have mathematically proven that your choice is optimal?

One way to do it is to treat red, green, and blue colour values as coordinates in a cube. Two colours are different if the distance between their coordinates is large. But the RGB colour space is not perceptually uniform. Because of the way the human eye works, lots of greens look the same, but we can easily see the difference between subtle shades of yellow. That's why George Takei is hawking TVs. It's also why perceptually uniform colour spaces, such as LAB or LUV warp the cube, as if it were made of play-dough, and left out in the sun for a while. The result is that the differences between the coordinates almost correspond to the perceived difference between two colours for most people.

The colour-space calculations are all on wikipedia, and they are dead simple to implement. For fun, I put them into a simple force system using Javascript (You'll need an HTML5 browser to view. If you're using an RSS reader, you'll have to go to my blog to see it.)

Explanation

Below are all of the CSS colours which have names commonly recognized by most browsers. Every colour name from "AliceBlue" to "Gainsboro" to "YellowGreen" is there. The circles float freely, and are repelled by each other and the four sides of their container.

When you click on a colour, the background changes to that colour. All of the circles are then attracted to a vertical position based on how different they are from the background. Those near the top are close to the background colour. Those near the bottom are further away from the background. You can change the colour space in which the distance is calculated by clicking on band at the top of the container.

For HSV, the H parameter is divided by 360 before the distance is calculated, to make its influence fair, since the S and V values range from 0 to 1.

Sorry, you have to use a browser that supports HTML5 and go to my blog to see it.

Observations

In RGB, black is the most different from white. But in LAB, black is in the middle somewhere, and other dark colours are more distant.

To my eye, both RGB and LAB perform well for finding differences, but HSV results in some odd choices. To choose a contrasting colour, use RGB or LAB and avoid picking anything less than a third of the way down.

Source code

The source code is released to the public domain.

You might be interested in a previous article about exploiting colour difference for edge detection.

Other articles from my blog

More...

Compressing dictionaries with a DAWG
Posted on: 2011-01-24 21:40:57

Last time, I wrote about how to speed up spell checking using a trie (also known as a prefix tree). However, for large dictionaries, a trie can waste a lot of memory. If you're trying to squeeze an application into a mobile device, every kilobyte counts. Consider this trie, with two words in it.

It can be shortened in a way so that any program accessing it would not even notice.

As early as 1988, ScrabbleTM programs were using structures like the above to shrink the their dictionaries. Over the years, the structure has been called many things. Some web pages call it a DAWG (Direct Acyclic Word Graph). But computer scientists have adopted the name "Minimal Acyclic Finite State Automaton", because some papers were already using the name DAWG for something else.

The most obvious way to build a MA-FSA, as suggested in many other web pages, is to first build the trie, and look for duplicate branches. I tried this on a list of 7 million words that I had. I wrote the algorithm in C++, but no matter how hard I tried, I kept running out of memory. A trie (or prefix tree) uses a lot of memory compared to a DAWG. It would be much better if one could create the DAWG right away, without first creating a trie. Jan Duciuk describes such a method in his paper. The central idea is to check for duplicates after you insert each word, so that the structure never gets huge.

  1. Ensure that words are inserted in alphabetical order. That way, when you insert a word, you will then know for sure whether the previous word ended an entire branch. For example, "cat" followed by "catnip" does not result in a branch, because the s just added to the end. But when you follow it with "cats" you know that the "nip" part of the previous word needs checking.
  2. Each time you complete a branch in the trie, check it for duplicate nodes. When a duplicate is found, redirect all incoming edges to the existing one and eliminate the duplicate.

The paper that I am paraphrasing, by Jan Daciuk and others, also describes a way to insert words out of order. But it is more complicated. In most cases, you can arrange to add your words in alphabetical order.

What's a duplicate node?

Two nodes are considered the same if they are both the final part of a word, or they are both not the final part of a word. They also need to have exactly the same edges pointing to exactly the same other nodes.

We start eliminating duplicates starting from the bottom of the branch, so each elimination can reveal more duplicates. Eventually, the branch of the trie zips together with a prior branch.

Step 1:

Several steps later:

Why go through so much trouble?

If you have a large word list, you could run it through gzip and get much better compression. The reason for storing a dictionary this way is to save space and remain easily searchable, without needing to decompress it first. Tries and MA-FSAs can support fuzzy search and prefix queries, so you can do spell checking and auto-completion. They can easily scale up to billions of entries. They have even been used to store large portions of the human genome. If you don't care about memory or speed, just store your words in an SQL database, or spin up 100 machines "in the cloud". I don't mind. More power to you!

MA-FSAs can be stored in as little as 4 bytes per edge-connector, as described by this web page.

Implementation

Here's a python implementation. I tried it and it could easily handle seven million words in a couple minutes.
#!/usr/bin/python
# By Steve Hanov, 2011. Released to the public domain.
import sys
import time

DICTIONARY = "/usr/share/dict/words"
QUERY = sys.argv[1:]

# This class represents a node in the directed acyclic word graph (DAWG). It
# has a list of edges to other nodes. It has functions for testing whether it
# is equivalent to another node. Nodes are equivalent if they have identical
# edges, and each identical edge leads to identical states. The __hash__ and
# __eq__ functions allow it to be used as a key in a python dictionary.
class DawgNode:
    NextId = 0
    
    def __init__(self):
        self.id = DawgNode.NextId
        DawgNode.NextId += 1
        self.final = False
        self.edges = {}

    def __str__(self):        
        arr = []
        if self.final: 
            arr.append("1")
        else:
            arr.append("0")

        for (label, node) in self.edges.iteritems():
            arr.append( label )
            arr.append( str( node.id ) )

        return "_".join(arr)

    def __hash__(self):
        return self.__str__().__hash__()

    def __eq__(self, other):
        return self.__str__() == other.__str__()

class Dawg:
    def __init__(self):
        self.previousWord = ""
        self.root = DawgNode()

        # Here is a list of nodes that have not been checked for duplication.
        self.uncheckedNodes = []

        # Here is a list of unique nodes that have been checked for
        # duplication.
        self.minimizedNodes = {}

    def insert( self, word ):
        if word < self.previousWord:
            raise Exception("Error: Words must be inserted in alphabetical " +
                "order.")

        # find common prefix between word and previous word
        commonPrefix = 0
        for i in range( min( len( word ), len( self.previousWord ) ) ):
            if word[i] != self.previousWord[i]: break
            commonPrefix += 1

        # Check the uncheckedNodes for redundant nodes, proceeding from last
        # one down to the common prefix size. Then truncate the list at that
        # point.
        self._minimize( commonPrefix )

        # add the suffix, starting from the correct node mid-way through the
        # graph
        if len(self.uncheckedNodes) == 0:
            node = self.root
        else:
            node = self.uncheckedNodes[-1][2]

        for letter in word[commonPrefix:]:
            nextNode = DawgNode()
            node.edges[letter] = nextNode
            self.uncheckedNodes.append( (node, letter, nextNode) )
            node = nextNode

        node.final = True
        self.previousWord = word

    def finish( self ):
        # minimize all uncheckedNodes
        self._minimize( 0 );

    def _minimize( self, downTo ):
        # proceed from the leaf up to a certain point
        for i in range( len(self.uncheckedNodes) - 1, downTo - 1, -1 ):
            (parent, letter, child) = self.uncheckedNodes[i];
            if child in self.minimizedNodes:
                # replace the child with the previously encountered one
                parent.edges[letter] = self.minimizedNodes[child]
            else:
                # add the state to the minimized nodes.
                self.minimizedNodes[child] = child;
            self.uncheckedNodes.pop()

    def lookup( self, word ):
        node = self.root
        for letter in word:
            if letter not in node.edges: return False
            node = node.edges[letter]

        return node.final

    def nodeCount( self ):
        return len(self.minimizedNodes)

    def edgeCount( self ):
        count = 0
        for node in self.minimizedNodes:
            count += len(node.edges)
        return count

        
dawg = Dawg()
WordCount = 0
words = open(DICTIONARY, "rt").read().split()
words.sort()
start = time.time()    
for word in words:
    WordCount += 1
    dawg.insert(word)
    if ( WordCount % 100 ) == 0: print "%dr" % WordCount,
dawg.finish()
print "Dawg creation took %g s" % (time.time()-start)    

EdgeCount = dawg.edgeCount()
print "Read %d words into %d nodes and %d edges" % ( WordCount,
        dawg.nodeCount(), EdgeCount )
print "This could be stored in as little as %d bytes" % (EdgeCount * 4)    

for word in QUERY:
    if not dawg.lookup( word ):
        print "%s not in dictionary." % word
    else:
        print "%s is in the dictionary." % word
Using this code, a list of 7 million words, taking up 63 MB, was translated into 6 million edges. Although it took more than a gigabyte of memory in Python, such a list could be stored in as little as 24 MB. Of course, gzip could do better, but the result would not be quickly searchable.

Extensions

A MA-FSA is great for testing whether words are in a dictionary. But in the form I gave, it's not possible to retrieve values associated with the words. It is possible to include associated values in the automaton. Such structures are called "Minimal Acyclic Finite State Transducers". In fact, the algorithm I above can be easily modified to include a value. However, it causes the number of nodes to blow up, and you are much better off using a minimal perfect hash function in addition to your MA-FSA to store your data. I discuss this in part 3.

More...

Fast and Easy Levenshtein distance using a Trie
Posted on: 2011-01-14 20:07:53

If you have a web site with a search function, you will rapidly realize that most mortals are terrible typists. Many searches contain mispelled words, and users will expect these searches to magically work. This magic is often done using levenshtein distance. In this article, I'll compare two ways of finding the closest matching word in a large dictionary. I'll describe how I use it on rhymebrain.com not for corrections, but to search 2.6 million words for rhymes, for every request, with no caching, on my super-powerful sock-drawer datacenter:

Algorithm #1

The levenshtein function take two words and returns how far apart they are. It's an O(N*M) algorithm, where N is the length of one word, and M is the length of the other. If you want to know how it works, go to this wikipedia page.

But comparing two words at a time isn't useful. Usually you want to find the closest matching words in a whole dictionary, possibly with many thousands of words. Here's a quick python program to do that, using the straightforward, but slow way. It uses the file /usr/share/dict/words. The first argument is the misspelled word, and the second argument is the maximum distance. It will print out all the words with that distance, as well as the time spent actually searching. For example:

smhanov@ubuntu1004:~$ ./method1.py goober 1
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 4.5575 s

Here's the program:

#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys

DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])

# read dictionary file
words = open(DICTIONARY, "rt").read().split();

# for brevity, we omit transposing two characters. Only inserts,
# removals, and substitutions are considered here.
def levenshtein( word1, word2 ):
    columns = len(word1) + 1
    rows = len(word2) + 1

    # build first row
    currentRow = [0]
    for column in xrange( 1, columns ):
        currentRow.append( currentRow[column - 1] + 1 )

    for row in xrange( 1, rows ):
        previousRow = currentRow
        currentRow = [ previousRow[0] + 1 ]

        for column in xrange( 1, columns ):

            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1

            if word1[column - 1] != word2[row - 1]:
                replaceCost = previousRow[ column - 1 ] + 1
            else:                
                replaceCost = previousRow[ column - 1 ]

            currentRow.append( min( insertCost, deleteCost, replaceCost ) )

    return currentRow[-1]

def search( word, maxCost ):
    results = []
    for word in words:
        cost = levenshtein( TARGET, word )

        if cost <= maxCost:
            results.append( (word, cost) )

    return results

start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()

for result in results: print result        

print "Search took %g s" % (end - start)

Runtime

For each word, we have to fill in an N x M table. An upper bound for the runtime is O( <number of words> * <max word length> ^2 )

Improving it

Sorry, now you need to know how the algorithm works and I'm not going to explain it. (You really need to read the wikipedia page.) The important things to know are that it fills in a N x M sized table, like this one, and the answer is in the bottom-right square.
kate
01234
c11234
a22123
t33212

But wait, what's it going to do when it moves on to the next word after cat? In my dictionary, that's "cats" so here it is:
kate
01234
c11234
a22123
t33212
s44322

Only the last row changes. We can avoid a lot of work if we can process the words in order, so we never need to repeat a row for the same prefix of letters. The trie data structure is perfect for this. A trie is a giant tree, where each node represents a partial or complete word. Here's one with the words cat, cats, catacomb, and catacombs in it (courtesy of zwibbler.com). Nodes that represent a word are marked in black.

With a trie, all shared prefixes in the dictionary are collaped into a single path, so we can process them in the best order for building up our levenshtein tables one row at a time. Here's a python program to do that:
#!/usr/bin/python
#By Steve Hanov, 2011. Released to the public domain
import time
import sys

DICTIONARY = "/usr/share/dict/words";
TARGET = sys.argv[1]
MAX_COST = int(sys.argv[2])

# Keep some interesting statistics
NodeCount = 0
WordCount = 0

# The Trie data structure keeps a set of words, organized with one node for
# each letter. Each node has a branch for each letter that may follow it in the
# set of words.
class TrieNode:
    def __init__(self):
        self.word = None
        self.children = {}

        global NodeCount
        NodeCount += 1

    def insert( self, word ):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = TrieNode()

            node = node.children[letter]

        node.word = word

# read dictionary file into a trie
trie = TrieNode()
for word in open(DICTIONARY, "rt").read().split():
    WordCount += 1
    trie.insert( word )

print "Read %d words into %d nodes" % (WordCount, NodeCount)

# The search function returns a list of all words that are less than the given
# maximum distance from the target word
def search( word, maxCost ):

    # build first row
    currentRow = range( len(word) + 1 )

    results = []

    # recursively search each branch of the trie
    for letter in trie.children:
        searchRecursive( trie.children[letter], letter, word, currentRow, 
            results, maxCost )

    return results

# This recursive helper is used by the search function above. It assumes that
# the previousRow has been filled in already.
def searchRecursive( node, letter, word, previousRow, results, maxCost ):

    columns = len( word ) + 1
    currentRow = [ previousRow[0] + 1 ]

    # Build one row for the letter, with a column for each letter in the target
    # word, plus one for the empty string at column 0
    for column in xrange( 1, columns ):

        insertCost = currentRow[column - 1] + 1
        deleteCost = previousRow[column] + 1

        if word[column - 1] != letter:
            replaceCost = previousRow[ column - 1 ] + 1
        else:                
            replaceCost = previousRow[ column - 1 ]

        currentRow.append( min( insertCost, deleteCost, replaceCost ) )

    # if the last entry in the row indicates the optimal cost is less than the
    # maximum cost, and there is a word in this trie node, then add it.
    if currentRow[-1] <= maxCost and node.word != None:
        results.append( (node.word, currentRow[-1] ) )

    # if any entries in the row are less than the maximum cost, then 
    # recursively search each branch of the trie
    if min( currentRow ) <= maxCost:
        for letter in node.children:
            searchRecursive( node.children[letter], letter, word, currentRow, 
                results, maxCost )

start = time.time()
results = search( TARGET, MAX_COST )
end = time.time()

for result in results: print result        

print "Search took %g s" % (end - start)
Here are the results:
smhanov@ubuntu1004:~$ ./method1.py goober 1
Read 98568 words into 225893 nodes
('goober', 0)
('goobers', 1)
('gooier', 1)
Search took 0.0141618 s
The second algorithm is over 300 times faster than the first. Why? Well, we create at most one row of the table for each node in the trie. The upper bound for the runtime is O(<max word length> * <number of nodes in the trie>). For most dictionaries, considerably less than O(<number of words> * <max word length>^2)

Saving memory

Building a trie can take a lot of memory. In Part 2, I discuss how to construct a MA-FSA (or DAWG) which contains the same information in a more compact form.

RhymeBrain

In December, I realized that Google had released their N-grams data, a list of all of the words in all of the books that they have scanned for their Books search feature. When I imported them all into RhymeBrain, my dictionary size at once increased from 260,000 to 2.6 million, and I was having performance problems.

I already stored the words in a trie, indexed by pronunciation instead of letters. However, to search it, I was first performing a quick and dirty scan to find words that might possibly rhyme. Then I took that large list and ran each one through the levenshtein function to calculate RhymeRankTM. The user is presented with only the top 50 entries of that list.

After a lot of deep thinking, I realized that the levenshtein function could be evaluated incrementally, as I described above. Of course, I might have realized this sooner if I had read one of the many scholarly papers on the subject, which describe this exact method. But who has time for that? :)

With the new algorithm, queries take between 19 and 50 ms even for really long words, but the best part is that I don't need to maintain two separate checks (quick and full), and the RhymeRankTM algorithm is performed uniformly for each of the 2.6 million words on my 1GHz Acer Aspire One datacenter.

(Previous articles on RhymeBrain)

Other references

In his article How to write a spelling corrector, Peter Norvig approaches the problem using a different way of thinking. He first stores his dictionary in a hash-table for fast lookup. Then he goes through hundreds, or even thousands of combinations of spelling mutations of the target word and checks if each one is in the dictionary. This system is clever, but breaks down quickly if you want to find words with an error greater than 1. Also, it would not work for me, since I needed to modify the cost functions for insert, delete, and substitution.

In the blog article Fuzzy String Matching, the author presents a recursive solution using memoization (caching). This is equivalent to flood-filling a diagonal band across the table. It gives a runtime of O(k * <number of nodes in the trie>), where k is the maximum cost. You can modify my algorithm above to only fill in only some entries of the table. I tried it, but it made the examples too complex and actually slowed it down. I blame my python skills.

Update: I just realized the author has created a new solution for dictionary search, also based on tries. I quickly tried it on my machine and dictionary, and got a time of 0.009301, assuming the prefix tree is prebuilt. It's slightly faster for an edit distance of 1! But somethings going on, because it takes 1.5 s for an edit distance of 4, whereas my code takes only 0.44. Phew!

And of course, you could create a levenshtein automaton, essentially a big honking regular expression that matches all possible mispellings. But to do it efficiently you need to write big honking gobs of code. (The code in the linked article does not do it efficiently, but states that it is possible to do so.) Alternatively, you could enumerate all possible mispellings and insert them into a MA-FSA or DAWG to obtain the regular expression.

More...

The Curious Complexity of Being Turned On
Posted on: 2010-11-29 10:26:01

The imaginary Larmin Corp is designing the next killer product: A mood ring. Okay it's too big to wear around your finger and is more of a wrist device. But it works with 80% accuracy and it's got its own app store and it is expected to be a big hit at CES. There's a snag: unnamed sources are attributing the delay in the product launch to the "On/Off" problem. Larmin Corp denied all rumours and promptly launched lawsuits against the unnamed sources, their children, and pets, and the everyone at the bar that night.

Here's how the device works:

  • It is comprised of two parts: The mood detector, and the User Interface (UI)
  • The user interface runs all the time (It uses your brain waves for energy)
  • The mood detector can be turned on and off. However, interpreting your brain waves is complex business, so it can take several seconds to switch on or off.

How do you turn on and off this system? One way is to ignore the delays and pretend it takes no time to turn on and off. The UI simply freezes up until the task is done.


Click to edit

The problem is sometimes mood detector takes a little longer to turn on, and user's think it's crashed and exhibit extreme anger. Some even start banging the entire device on the desk.

So we don't freeze the UI during the turn-on procedure. But this leads to the following behaviour:


Click to edit

As users get impatient waiting for it to turn on, they keep restarting the procedure. But if you try to turn off the detector while it is turning on, it crashes. The UI team first decides to handle this by adding another layer above the mood detector. If you send a command to it, and the mood detector is busy, it stores it in a queue for later. As soon as the mood detector completes, the layer replays the next queued action.


Click to edit

The problem is the user gets impatient and starts repeatedly hitting the button, and the device eventually gets so many commands queued up that it just sits there, repeatedly turning on and off until the user slams it against the wall and the battery falls out. Also, if mood detector ever turns on while the user is angry, it screws up the detector's calibration. (In version 1, users are instructed to be in a neutral mood when activating the ring).

So the design architects bring out the big guns and propose a "OnOffManager". Instead of using a queue of commands, the OnOffManager remembers the last requested state and uses it.

This works pretty well, except that during the design phase the graphics designer gets fed up with the whole debate and and simply grays out the button with an ajax spinny thing, so that any further clicks are ignored during turn on. The OnOffManager code is left in, because it took six months to design, but it is never exercised. Everyone lives happily ever after.

Wait, scratch that. Shortly before release, someone writes a location aware app which periodically turns on the mood detector and sends its status to Facebook. Another group is working on the highly secretive "mood gestures" app, which turns off the mood detector if the user thinks a certain sequence of moods. It's not long before somebody complains about their mood ring randomly turning on and off all the time. After analysis, we see the following sequence:

It's an easy fix. The On/Off manager is modified to keep a count of every app that wants it on. The mood detector is only turned on when the counter goes from 0 to 1, and off when the counter goes from 1 to 0. All other states are ignored.

Everything is great, until the charismatic C.E.O of Larmin Corp, George Jalopsky is giving a keynote speech. During the speech, his mood ring turns on without him realizing it. The video goes viral. Jalopsky flies back in a huff and holds a meeting of all of software development. "You must fix this problem," he cries, waving his wrists around, projecting bright crimson onto the walls and the faces of the engineers. "When I turn it off, I want it to stay off!"

All development is halted and design committees are formed. Soon, no meeting room at Larmin is available because they are all full of developers talking about the problem. Curious discussions like this one are overheard: "If I turn you on, but George turns you off, are you on or are you off?" This is followed by snickers.

And then someone proposes a solution: The mood ring will have a "soft off" function. You can turn it off, but it will still be allowed to turn on again by third party apps, unless you turn it really off. Provisional patents are quickly filed for the software and the design of the power button itself. The software folks toss around the ideas of what off, and really off mean, and whether it makes sense for the ring to be really turned on *snicker*.

Eventually, they come up with the generalized solution. The On/Off manager shall have two counters. One is for a special class of apps that are designated as "System Apps". There's only one for now -- the on off button. But there could be a plurality in the future. The other counter works as before for third-party apps. If the system counter is 0, the mood detector is off and any other commands are ignored. However, if the system counter is non-zero, then the value of the app counter is used to determine if the mood detector should be on. This is illustrated in the following sequence diagram.


Click to edit

And that, folks, is how something like turning the system on and off can grow in complexity very quickly. Soon, Larmin Corp will add low power modes and the special "BlueMood" peripheral, which transmits the moods to other users, but due to brain wave interference patterns, it only works with the mood detector is off even though the user has buttons for both independently.

Come back next time, to read about how the moods are sent from the detector to the display, in "States of confusion".

More...

Cross-domain communication the HTML5 way
Posted on: 2010-11-25 12:30:12

Making a web application mashable -- useable in another web page -- has some challenges in the area of cross-domain communications. Here is how I solved those problems for Zwibbler.com. (See the API demo here)

Zwibbler consists of a large javascript program and a little HTML. The javascript part uses Ajax methods to send POST requests back to the zwibbler.com server, to render some items without the limitations of the CANVAS tag. In particular, this allows it to support PDF output as well as SVG and PNG.

If you want to include Zwibbler.com on another web site, the zwibbler application still needs to communicate with zwibbler.com in order to perform these tasks. However, browsers will not allow this due to security restrictions. Javascript code can only communicate with the server that the main web page came from.

HTML allows you to embed one web page inside another, in the <iframe> element. They remain essentially separated. The container web site is only allowed to talk to its web server, and the iframe is only allowed to talk to its originating server. Furthermore, because they have different origins, the browser disallows any contact between the two frames. That includes function calls, and variable accesses.

But what if you want to get some data in between the two separate windows? For example, a zwibbler document might be a megabyte long when converted to a string. I want the containing web page to be able to get a copy of that string when it wants, so it can save it. Also, it should be able to access the saved PDF, PNG, or SVG image that the user produces. HTML5 provides a restricted way to communicate between different frames of the same window, called window.postMessage(). The postMessage function takes two parameters:
  • A string to pass
  • The target's origin, or "*" to allow any origin.

For example, to pass a message from the container web page to the iframe, we use:

iframe.contentWindow.postMessage("hello there", "http://zwibbler.com");
The receiver of the message must have previously registered for an HTML event named "message". This event arrives via the same mechanism as mouse clicks.
window.addEventListener("message", function( event ) {
    if ( event.data === "hello there" ) {
        // event.origin contains the host of the sending window.
        alert("Why, hello to you too, " + event.origin);
    }
}, false );

Problem 1: Two way communication

This method of communication is one way, but for a method call, we have to allow two way communication. We add a simple wrapper on top, called a Messenger, to allow two way communication. Each time you call a method in the iframe, you pass a reply function that is called with the results of that method call. We use JSON for the parameter marshalling.

The Messenger object must also keep track of how to direct the replies it receives. It assigns each request a unique ticket, and stores them in a table along with the reply function. When a reply with a matching ticket is recieved, the corresponding function is called:

Messenger.prototype = {
    init: function( targetFrame, targetDomain) {
        // The DOM node of the target iframe.
        this.targetFrame = targetFrame;

        // The domain, including http:// of the target iframe.
        this.targetDomain = targetDomain;
        
        // A map from ticket number strings to functions awaiting replies.
        this.replies = {};
        this.nextTicket = 0;

        var self = this;
        window.addEventListener("message", function(e) {
            self.receive(e);
        }, false );
    },

    send: function( functionName, args, replyFn ) {
        var ticket = "ticket_" + (this.nextTicket++);
        var text = JSON.stringify( {
            "function": functionName,
            "args": args,
            "ticket": ticket
        });

        if ( replyFn ) {
            this.replies[ticket] = replyFn;
        }

        this.targetFrame.postMessage( text, this.targetDomain );
    },
The receive function first checks the origin of the message. If it is not the one that we expected, then we ignore the message. Maybe it's from another iframe, such as an ad or a game that happens to be on the same page. It then checks to see if it has a ticket number. If so, it decodes the arguments and calls the associated reply function.
    receive: function( e ) {          
        if ( e.origin !== this.targetDomain ) {
            // not for us: ignore.
            return;
        }

        var json;

        try {
            json = JSON.parse( e.data );
        } catch(except) {
            alert( "Syntax error in response from " + e.origin + ": " + e.data );
            return;
        }

        if ( !(json["ticket"] in this.replies ) ) {
            // no reply ticket.
            return;
        }

        var replyFn = this.replies[json["ticket"]];
        delete this.replies[json["ticket"]];

        var args = [];
        if ( "args" in json ) {
            args = json["args"];
        }

        replyFn.apply( undefined, args );
    },

Problem 2: Delayed loading

There is one other complexity to handle. When we load the iframe, it takes some time to initialize before it is ready to receive events. If you send a message before it has registered to receive it, I'm not sure what happens, but it didn't work when I tried it.

So we have to add a bit of logic to the above code. When the iframe completes initializing, it sends a message consisting of the text "ready" to its parent window. If the Messenger is asked to send a message, and it has not yet received the "ready" message, then instead of sending it, it adds it to a queue for later. When it finally receives the ready message, it loops through the queue and finally sends all of the waiting messages to the iframe.

The complete code is contained in component.js

More...

Five essential steps to prepare for your next programming interview
Posted on: 2010-09-27 18:00:00

There are at least two kinds of programming interviews. One type is where you are asked for details about your prior work experience. The other one is where they put you in a room, give you a problem, and stare at you while you fumble around with markers on a whiteboard for 45 minutes. The first focuses on what you have done in the past. The second focuses on what you can do in the room right now without looking anything up. You should be prepared for either.

Step 1: Get your stories straight

You will spend a large chunk of time in a job interview talking about things that you have done in the past. If haven’t figured out a half dozen stories that best represent your skills, then you need to do that now. Here is a list of questions from a standard list. Many of them are stupid, but trust me -- they force you to think about yourself. Even if you aren't asked a question identical to one on this list, you will use your prepared answers during an interview. The point of this exercise is to build a repertoire of examples from your work life that you can use to answer questions.

  1. Tell me about yourself
  2. What are your short-term goals? What about in 2 and 5 years from now?
  3. What is your own vision/mission statement?
  4. What do you think you will be looking for in the job following this position?
  5. Why do you feel you will be successful in this work?
  6. What other types of work are you looking for in addition to this role?
  7. What supervisory or leadership roles have you had?
  8. What experience have you had working on a team?
  9. What have been your most satisfying/disappointing experiences?
  10. What are your strengths/weaknesses?
  11. What kinds of problems do you handle the best?
  12. How do you reduce stress and try to achieve balance in your life?
  13. How did you handle a request to do something contrary to your moral code or business ethics?
  14. What was the result the last time you tried to sell your idea to others?
  15. Why did you apply to our organization and what do you know about us?
  16. What do you think are advantages/disadvantages of joining our organization?
  17. What is the most important thing you are looking for in an employer?
  18. What were some of the common characteristics of your past supervisors?
  19. What characteristics do you think a person would need to have to work effectively in our company with its policies of staying ahead of the competition?
  20. What courses did you like best/least? Why?
  21. What did you learn or gain from your part-time/summer/co-op/internship experiences?
  22. What are your plans for further studies?
  23. Why are your grades low?
  24. How do you spend your spare time?
  25. If I asked your friends to describe you, what do you think they would say?
  26. What frustrates you the most?
  27. When were you last angry at work and what was the outcome?
  28. What things could you do to increase your overall effectiveness?
  29. What was the toughest decision you had to make in the last year? Why was it difficult?
  30. Why haven’t you found a job yet?
  31. You don’t seem to have any experience in ___ (e.g., sales, fundraising, bookkeeping), do you?
  32. Why should I hire you?
Source: The University of Waterloo Career Development Manual

The problem is that they require deep thought and introspection to answer, so it’s important to do that thinking in advance. Take an hour and think about the answers to these questions (you can use the same answer for more than one). For questions where you need to tell a story, your answer should follow this format:

  1. 20 seconds: Describe the situation. “The code was crashing and the whole team had to stop and figure out why.”
  2. 30 seconds: Describe what you did “I thought of doing a memory dump, and I noticed that the AbstractMemberCreationFactory had a lot of instances but it was supposed to be a singleton.”
  3. 20 seconds: Describe the results. “I fixed the memory leak with one line of code and we shipped the product on time. Later on, I added a test to make sure this wouldn’t happen again.”

Before each interview, go through the entire list and practice your answers out loud. Doing this will give you an edge over the other candidates, because it will make you more comfortable during the interview. When asked a question, other candidates will be staring at the ceiling saying "ummm", trying to remember everything that happened to them in the the past five years. Meanwhile, you'll smile, look the interviewer in the eye, and launch into your story.

Step 2: Build confidence by solving the most common programming exercises beforehand

Pianists have to learn a specific set of short pieces before they advance to the next level. These tunes will never be a hit at parties, but they exercise particular things, such as the right hand little finger, or syncopation. Likewise, certain problems keep coming up in programming interviews, although you will probably never, ever use them in your code. You will probably be asked one of the these time worn classics.
  • Reverse a singly linked list (in one pass through the list)
  • Reverse a string (in one pass). Reverse the order of words in a paragraph (in two passes)
  • Draw a circle of arbitrary size by printing out "*" characters. (hint: calculating whether to go "one down, two over" is the wrong approach)
  • Convert an integer to a string. Convert a string to an integer. (Manually, of course, by looping through each digit somehow.)
  • Write a function to return the number of 1's in the binary representation of an integer.
  • Write a function that will display all possible arrangements of letters in a string. Example: abc acb bac bca cab cba
Always start with the easiest solution that works, without considering the runtime. Then, try to make it faster. It's better to have something that works than spend all your time trying to optimize and end up with a page full of scribbles.

Don't cheat yourself by looking up the answers

The first time I tried to reverse a singly-linked list, it was between classes at school. I wasn't rushing, and it took me over half an hour to go from the slow and obvious solution to the fast one. But when I verified that my answer was correct, I was thrilled! I knew that I could tackle this question without looking up the answer. During interviews, when I was given a problem that I hadn't seen before, that experience gave me the confidence I needed to avoid blanking and keep trying.

Step 3: Practice your problem-solving

Some interviewers believe that being able to solve brain-teasers equates to good programming ability. In case you get one of these, you should develop a passing interest in puzzles and techniques for solving them. A visit to your local library will result in a dozen books, filled with puzzles to practice. Pick some interesting problems to tackle, and resist looking up the answers until you have spent at least a half hour on each one.

Step 4: Show genuine enthusiasm

A powerful technique is to show real enthusiasm. As human beings, we can’t help responding in kind and becoming excited to work with you. On the other hand, we also have evolved the ability to see through fake smiles, so it’s vital that you be genuinely yourself.

The best interviewers will try to get you to talk about something that you are passionate about, even if it doesn't directly relate to the job. Most interviewers, however, will not. You will have to think about something that you've done that excites you, and look for opportunities to talk about it. Do this early in the interview. After the first 10 minutes it is probably too late, since the interviewers will have already ranked you.

Picture yourself coming in to work at this new job on the first day, turning on the new quad-core development workstation, meeting some interesting new friends, and learning about life at the company. There’s got to be something exciting about that. Otherwise, why are you applying?

Step 5: Sleep

The "Tip of the tongue" phenomenon -- the inability to recall names, words, and facts -- increases dramatically if you have a sleep debt. Don’t be caught struggling to remember an important detail during an interview. Instead, get a good night’s sleep (7-9 hours).

Further Reading

More...

Minimal usable Ubuntu with one command
Posted on: 2010-09-21 18:00:00

If you install the default "ubuntu-desktop" you also get with it a gigabyte of crap that you will never use. But if you don't install the ubuntu desktop, you get a system with a text-only login: prompt, and it's not clear what to install to get it to a usable state.

I have an irrational need to optimize my Ubuntu installation. I did some investigating and came up with this method, which gives a minimal graphical 1.2 GB install, with gnome, networking, and no applications.

Install the base system

Use UNetBootin and create a usb key with the network install. (If you are still burning non-archival data-DVD's in the year 2010, you must also live in a cave.). Plug the computer directly into the network using an ethernet cable. Boot from the USB key and install ubuntu as usual, over the network. Using the network-install means you aren't even downloading the packages you aren't going to use. It also means you don't have to immediately update your system and re-download everything, because the network packages are already up to date.

When you get to the "task selection" screen with the option to install ubuntu base server, kubuntu, etc, do not change anything at all.

Install the minmal gnome

When installation completes and it boots up, you get a text-only system. Login and type the following command:
sudo apt-get install gnome-panel gdm gnome-terminal

This will install a graphical environment and the login screen, so it will let you login by clicking on your username.

When it completes, reboot using "sudo reboot", or if you are super-geeky, type in the secret command to avoid rebooting.

Wireless Networking for laptops

If you have a laptop, you will probably want:
sudo apt-get install network-manager network-manager-gnome gnome-power-manager hibernate
That will give you a battery monitor and the icon that lets you configure wireless networking. It also gives you the hibernate button when you shut down.

Where's my browser?

sudo apt-get install chromium-browser flashplugin-installer

Make it look good

Gnome's default theme looks like something from the year 2000. Get some of the Ubuntu 10 goodness by doing this:
sudo apt-get install ubuntu-artwork

All together now

As promised, here is the one command that combines all of the above.
sudo apt-get install gnome-panel gdm gnome-terminal network-manager-gnome network-manager gnome-power-manager hibernate chromium-browser ubuntu-artwork flashplugin-installer

More...

More Posts

Email
steve.hanov@gmail.com

Other posts by Steve

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US How Asana Breaks the Rules About Per-Seat Pricing 5 Ways PowToon Made Me Want to Buy Their Software How I run my business selling software to Americans 0, 1, Many, a Zillion Give your Commodore 64 new life with an SD card reader 20 lines of code that will beat A/B testing every time [comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets UMA and free long distance UMA's dirty secrets Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets