Let's continue our short tour of data structures for storing words. Today, we will over-optimize John Resig's Word Game. Along the way, we shall learn about a little-known branch of computer science, called succinct data structures.
One of the best data structures for searching a dictionary is a trie. The speed of search does not depend on the number of words in the dictionary. It depends only on the number of letters in the word. For example, here is a trie containing the words "hat", "it", "is", and "a". The trie seems to compress the data, since words sharing the same beginnings only show up once.
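To make the idea concrete, here is a minimal sketch of a pointer-based trie in Python. The nested-dictionary layout, the "$" end-of-word marker, and the function names are assumptions made for illustration, not the code used by the word game.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Reuse the existing child if this prefix was already inserted.
            node = node.setdefault(letter, {})
        node["$"] = True
    return root


def contains(trie, word):
    """Follow one link per letter, so the cost depends only on len(word)."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


trie = build_trie(["hat", "it", "is", "a"])
print(contains(trie, "hat"))  # True
print(contains(trie, "ha"))   # False: a prefix, but not a stored word
```

Every node here is a separate dictionary object, which hints at why the links end up dominating memory use.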
We need to solve two problems. If we transmit the word list to the web browser, it then has to build the trie structure. This takes up a lot of time and memory. To save time, we could pre-encode the trie on the server in JSON format, which is parsed very quickly by the web browser. However, JSON is not a compact format, so some bandwidth is wasted downloading the data to the browser. We could avoid the wasted bandwidth by compressing the trie using a more compact format. The data is then smaller, but the web browser still has to decompress it to use it. In any case, the browser needs to create the trie in memory.
This leads us to the second major problem. Despite appearances, tries use a lot of memory to store all of those links between nodes.
Fortunately, there is a way to store these links in a tiny amount of space.
A succinct data structure encodes data very efficiently, so that it does not need to be decoded to be used. Everything is accessed in-place, by reading bits at various positions in the data. To achieve optimal encoding, we use bits instead of bytes. All of our structures are encoded as a series of 0's and 1's.
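Two important functions for succinct structures are rank and select. rank(x) returns the number of bits set to 1 up to and including position x. select(y) returns the position of the y-th 1, so it is the inverse of rank: if select(8) = 10, then rank(10) = 8.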
Corresponding functions exist to find the rank/select of 0's instead of 1's. The rank() function can be implemented in O(1) time using a lookup table (called a "directory"), which summarizes the number of 1's in certain parts of the string. The select() function is implemented in O(log n) time by performing binary search on the rank() function. It is possible to implement select() in constant time, but it is complicated and space-hungry.
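The following Python sketch shows one way rank and select could fit together. The class name BitString, the block_size parameter, and the use of a text string of "0"/"1" characters are assumptions chosen for readability; a real implementation would work on packed bits. Positions are counted from 0, and select counts occurrences from 1.

```python
class BitString:
    """Rank/select over a string of "0"/"1" characters (illustrative only)."""

    def __init__(self, bits, block_size=8):
        self.bits = bits
        self.block_size = block_size
        # directory[k] = number of 1s in bits[0 : k * block_size]
        self.directory = [0]
        for start in range(0, len(bits), block_size):
            ones = bits[start:start + block_size].count("1")
            self.directory.append(self.directory[-1] + ones)

    def rank(self, which, x):
        """Count bits equal to `which` (0 or 1) at positions 0..x inclusive."""
        if x < 0:
            return 0
        block = (x + 1) // self.block_size
        # Directory lookup, then scan fewer than block_size leftover bits.
        ones = self.directory[block]
        ones += self.bits[block * self.block_size:x + 1].count("1")
        return ones if which == 1 else (x + 1) - ones

    def select(self, which, y):
        """Position of the y-th (y >= 1) bit equal to `which`, or -1 if absent."""
        lo, hi, found = 0, len(self.bits) - 1, -1
        while lo <= hi:
            mid = (lo + hi) // 2
            if self.rank(which, mid) >= y:
                found = mid      # candidate; keep looking further left
                hi = mid - 1
            else:
                lo = mid + 1
        return found


b = BitString("10111010110010000")
print(b.rank(1, 4))    # 4
print(b.select(0, 3))  # 7
```

Because rank never decreases as the position grows, the binary search in select needs only O(log n) calls to rank.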
First, we add a "super root". This is just an additional node above the root. It's there to make the math work out later.
We then process the nodes in level order -- that is, we go row by row and process the nodes left to right. We encode them to the bit string in that order.
In the picture below, I've labeled each node in level order for convenience. I've also placed each node's encoding above it. The encoding is a "1" for each child, plus a 0. So a node with 5 children would be "111110" and a node with no children is "0".
Now, we encode the nodes one after another. In the example, the bits would be 10111010110010000. I've separated them out in this table so you can see what's going on, but only the middle row is actually stored.
We then encode the data for each node after that. To get the data for a given node, just read it directly from that node's index in the data array.
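As a rough illustration of the encoding steps, the Python sketch below builds the example trie and walks it in level order. The function names, the nested-dictionary trie, and the None placeholder for the root's data are assumptions of this sketch. Children are kept in insertion order so that the result matches the example, and the end-of-word marker that a real dictionary needs is left out.

```python
from collections import deque


def build_trie(words):
    """Nested-dict trie; children keep the order in which they were added."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root


def encode(trie):
    """Return (bit string, per-node data) for a level-order traversal."""
    bits = ["10"]    # super root: exactly one child, the real root
    data = [None]    # node 0 is the root, which carries no letter
    queue = deque([trie])
    while queue:
        node = queue.popleft()
        bits.append("1" * len(node) + "0")   # a 1 per child, then a 0
        for letter, child in node.items():
            data.append(letter)              # children get consecutive numbers
            queue.append(child)
    return "".join(bits), data


bits, data = encode(build_trie(["hat", "it", "is", "a"]))
print(bits)  # 10111010110010000
print(data)  # [None, 'h', 'i', 'a', 'a', 't', 's', 't']
```

With the root numbered 0, the letter for node i sits at index i of the data list.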
The main thing that we want to do with a trie is follow links from each node to its children. Using our encoding, we can follow a link using a simple formula. If a node is numbered i, then the number of its first child is select0(i + 1) - i. The second child is the one after that, and so forth. To obtain the number of children, look up the first child of node i + 1 and subtract, since children are stored consecutively.
For example: We want the first child of node 2. The 3rd 0 is at position 7. Seven minus two is five. Therefore the first child is numbered 5. Similarly the first child of node 3 is found to be 7 by this formula (no, it doesn't really exist, but it works for the calculation). So node 2 has 7 minus 5 equals 2 children.
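The arithmetic can be checked with a short Python sketch. It hard-codes the example bit string, uses a slow linear-scan select0 for clarity, and assumes the root is node 0 with a "_" placeholder in the letters string. The names first_child, child_count and walk are made up for this illustration.

```python
BITS = "10111010110010000"
LETTERS = "_hiaatst"   # letter of each node; "_" stands in for the root


def select0(y):
    """0-based position of the y-th 0 bit (linear scan, for clarity only)."""
    seen = 0
    for pos, bit in enumerate(BITS):
        if bit == "0":
            seen += 1
            if seen == y:
                return pos
    raise ValueError("fewer than y zero bits")


def first_child(i):
    return select0(i + 1) - i


def child_count(i):
    # Children are numbered consecutively, so the next node's first child
    # marks the end of this node's children.
    return first_child(i + 1) - first_child(i)


def walk(prefix):
    """Follow `prefix` from the root; return the node reached, or None."""
    node = 0
    for letter in prefix:
        start = first_child(node)
        for child in range(start, start + child_count(node)):
            if LETTERS[child] == letter:
                node = child
                break
        else:
            return None   # no child carries this letter
    return node


print(first_child(2))  # 5
print(child_count(2))  # 2
print(walk("is"))      # 6
```

No pointers are stored anywhere: every link is recomputed from the bit string on demand.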
Using this encoding method, a 611K dictionary containing 80000 words is compressed to 216K, or 132K gzipped. The browser does not need to decode it to use it. The whole trie takes as much space as a 216K string.
To avoid problems with UTF encoding formats and escaped characters, the bit strings are encoded in Base64. All of the bit decoding functions are configured to operate on Base64-encoded units, so that the input string does not need to be decoded before being used.
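Here is a hedged Python sketch of what reading bits straight out of such a string might look like. It assumes each character carries six bits, most significant bit first, and it uses the standard Base64 alphabet; the actual alphabet and the helper names pack and read are assumptions, not details taken from the original code.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}
W = 6  # bits carried by each character


def pack(bits):
    """Pack a "0"/"1" string into characters, zero-padding the last unit."""
    bits += "0" * (-len(bits) % W)
    return "".join(ALPHABET[int(bits[i:i + W], 2)]
                   for i in range(0, len(bits), W))


def read(data, pos, n):
    """Return the n-bit unsigned integer that starts at bit offset pos."""
    value = 0
    for p in range(pos, pos + n):
        unit = VALUE[data[p // W]]            # the character holding bit p
        bit = (unit >> (W - 1 - p % W)) & 1   # pick that bit out of the unit
        value = (value << 1) | bit
    return value


encoded = pack("10111010110010000")
print(encoded)
print(read(encoded, 7, 1))  # 0, the bit at position 7
```

The read function only ever indexes into the encoded string, so nothing has to be expanded into a separate buffer first.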
We only handle the letters "a" to "z" in lower case. That way, we can encode each letter in 5 bits.
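Five bits give 32 possible values, which is enough for 26 letters. A small Python sketch of such a mapping follows; numbering the letters "a" = 0 through "z" = 25 is an assumption, as are the function names.

```python
def encode_letters(word):
    """Encode lower-case a-z as 5 bits per letter ("a" = 0 ... "z" = 25)."""
    out = []
    for c in word:
        if not "a" <= c <= "z":
            raise ValueError("only the letters a to z are supported")
        out.append(format(ord(c) - ord("a"), "05b"))
    return "".join(out)


def decode_letters(bits):
    """Inverse of encode_letters."""
    return "".join(chr(int(bits[i:i + 5], 2) + ord("a"))
                   for i in range(0, len(bits), 5))


print(encode_letters("hat"))              # 001110000010011
print(decode_letters("001110000010011"))  # hat
```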
You can reduce space usage, at the cost of slower lookups, by increasing the L2 constant and setting L1 = L2*L2. These constants control the number of bits summarized in each section of the rank directory. L2 is the maximum number of bits that have to be scanned to implement rank(). More bits per section means fewer directory entries, but the select() and rank() functions will take longer to scan the range of bits.
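To illustrate the trade-off, here is a sketch of a two-level rank directory in Python. The layout is an assumption (absolute counts per L1 section, plus counts relative to the enclosing L1 section for each L2 section), as are the function names; the original code may organize its directory differently.

```python
def build_directory(bits, l2):
    """Summarize 1-counts at two granularities, with L1 = L2 * L2."""
    l1 = l2 * l2
    l1_counts = []   # 1s before the start of each L1 section
    l2_counts = []   # 1s between the enclosing L1 start and each L2 section
    total = 0
    for start in range(0, len(bits), l2):
        if start % l1 == 0:
            l1_counts.append(total)
        l2_counts.append(total - l1_counts[-1])
        total += bits[start:start + l2].count("1")
    return l1_counts, l2_counts


def rank1(bits, directory, l2, x):
    """Count 1 bits at positions 0..x inclusive, scanning at most l2 bits."""
    l1_counts, l2_counts = directory
    section = x // l2
    count = l1_counts[x // (l2 * l2)] + l2_counts[section]
    return count + bits[section * l2:x + 1].count("1")


bits = "10111010110010000" * 10
for l2 in (2, 4, 8):
    l1_counts, l2_counts = build_directory(bits, l2)
    print(l2, len(l1_counts) + len(l2_counts))  # entries shrink as L2 grows
```

Doubling L2 roughly halves the number of L2 entries, while each rank call may have to scan up to twice as many bits.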