
Compressing dictionaries with a DAWG
Posted on: 2011-01-24 21:40:57

Last time, I wrote about how to speed up spell checking using a trie (also known as a prefix tree). However, for large dictionaries, a trie can waste a lot of memory. If you're trying to squeeze an application into a mobile device, every kilobyte counts. Consider this trie, with two words in it.

It can be shortened so that any program traversing it would not even notice.

As early as 1988, Scrabble™ programs were using structures like the one above to shrink their dictionaries. Over the years, the structure has been called many things. Some web pages call it a DAWG (Directed Acyclic Word Graph), but computer scientists have adopted the name "Minimal Acyclic Finite State Automaton" (MA-FSA), because some papers were already using the name DAWG for something else.

The most obvious way to build an MA-FSA, as suggested on many other web pages, is to first build the trie, then look for duplicate branches. I tried this on a list of 7 million words that I had. I wrote the algorithm in C++, but no matter how hard I tried, I kept running out of memory, because a trie uses far more memory than the finished DAWG. It would be much better if one could create the DAWG right away, without first creating a trie. Jan Daciuk describes such a method in his paper. The central idea is to check for duplicates after you insert each word, so that the structure never gets huge.

  1. Ensure that words are inserted in alphabetical order. That way, when you insert a word, you know for sure whether the previous word completed an entire branch. For example, "cat" followed by "catnip" does not complete a branch, because "nip" is just added to the end. But when you follow it with "cats", you know that the "nip" part of the previous word needs checking.
  2. Each time you complete a branch in the trie, check it for duplicate nodes. When a duplicate is found, redirect all incoming edges to the existing one and eliminate the duplicate.

The paper that I am paraphrasing, by Jan Daciuk and others, also describes a way to insert words out of order. But it is more complicated. In most cases, you can arrange to add your words in alphabetical order.
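The alphabetical-order trick can be seen in miniature with the "cat"/"catnip"/"cats" example. Here is a quick sketch (the function name is just illustrative) of the common-prefix comparison that decides how much of the last branch is finished:

```python
# Sketch of step 1: comparing each new word against the previous one
# tells you how much of the last branch is complete and ready to be
# checked for duplicates.
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# "catnip" after "cat" only extends the branch; nothing is finished yet.
assert common_prefix_len("cat", "catnip") == 3
# "cats" after "catnip" shares only "cat", so the "nip" branch is done.
assert common_prefix_len("catnip", "cats") == 3
```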

What's a duplicate node?

Two nodes are duplicates if they agree on whether they mark the end of a word, and they have exactly the same edges pointing to exactly the same other nodes.

We eliminate duplicates starting from the bottom of the branch, so each elimination can reveal more duplicates above it. Eventually, the branch of the trie zips together with a prior branch.
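To make "exactly the same" concrete, the implementation below identifies each node by a signature string. A minimal sketch of the idea (the names here are illustrative, with edges given as a label-to-node-id mapping):

```python
# Sketch: a node's identity, for duplicate detection, is its final flag
# plus its outgoing edges (label and target node), and nothing else.
def signature(final, edges):
    parts = ["1" if final else "0"]
    for label in sorted(edges):
        parts.append(label)
        parts.append(str(edges[label]))  # target node id
    return "_".join(parts)

# Two leaves that both end a word and have no edges are duplicates...
assert signature(True, {}) == signature(True, {})
# ...but a word-ending node and a non-ending node never are.
assert signature(True, {}) != signature(False, {})
```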

Step 1:

Several steps later:

Why go through so much trouble?

Sure, if you have a large word list, you could run it through gzip and get much better compression. But the point of storing a dictionary this way is to save space while remaining easily searchable, without needing to decompress it first. Tries and MA-FSAs can support fuzzy search and prefix queries, so you can do spell checking and auto-completion. They can easily scale up to billions of entries. They have even been used to store large portions of the human genome. If you don't care about memory or speed, just store your words in an SQL database, or spin up 100 machines "in the cloud". I don't mind. More power to you!

MA-FSAs can be stored in as little as 4 bytes per edge, as described on this web page.
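As an illustration of how an edge fits in 32 bits, here is one possible layout (the field widths below are an assumption for the sketch, not that page's exact format): 8 bits for the letter, one bit marking the end of a word, one bit marking the last edge of a node, and 22 bits for the index of the child's first edge, allowing about 4 million edges.

```python
import struct

def pack_edge(letter, is_final, is_last, child_index):
    # Pack one edge into 32 bits: letter in the low byte, two flag bits,
    # and a 22-bit index of the child node's first edge.
    assert child_index < (1 << 22)
    value = (ord(letter) | (is_final << 8) | (is_last << 9)
             | (child_index << 10))
    return struct.pack("<I", value)

def unpack_edge(data):
    (value,) = struct.unpack("<I", data)
    return (chr(value & 0xFF), (value >> 8) & 1, (value >> 9) & 1,
            value >> 10)

edge = pack_edge("c", 0, 1, 12345)
assert len(edge) == 4
assert unpack_edge(edge) == ("c", 0, 1, 12345)
```

BlueRaja's comment below describes a similar trade-off: shrinking the letter and flag fields buys a bigger child index at the cost of slower searches.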

Implementation

Here's a Python implementation. I tried it, and it could easily handle seven million words in a couple of minutes.
#!/usr/bin/env python3
# By Steve Hanov, 2011. Released to the public domain.
import sys
import time

DICTIONARY = "/usr/share/dict/words"
QUERY = sys.argv[1:]

# This class represents a node in the directed acyclic word graph (DAWG). It
# has a dictionary of edges to other nodes. It has functions for testing
# whether it is equivalent to another node. Nodes are equivalent if they have
# identical edges, and each identical edge leads to identical states. The
# __hash__ and __eq__ functions allow it to be used as a key in a python
# dictionary.
class DawgNode:
    NextId = 0

    def __init__(self):
        self.id = DawgNode.NextId
        DawgNode.NextId += 1
        self.final = False
        self.edges = {}

    def __str__(self):
        arr = []
        if self.final:
            arr.append("1")
        else:
            arr.append("0")

        # Sort the edges so that equivalent nodes always produce the same
        # string, regardless of insertion order.
        for (label, node) in sorted(self.edges.items()):
            arr.append(label)
            arr.append(str(node.id))

        return "_".join(arr)

    def __hash__(self):
        return self.__str__().__hash__()

    def __eq__(self, other):
        return self.__str__() == other.__str__()

class Dawg:
    def __init__(self):
        self.previousWord = ""
        self.root = DawgNode()

        # Here is a list of nodes that have not been checked for duplication.
        self.uncheckedNodes = []

        # Here is a list of unique nodes that have been checked for
        # duplication.
        self.minimizedNodes = {}

    def insert(self, word):
        if word < self.previousWord:
            raise Exception("Error: Words must be inserted in alphabetical " +
                "order.")

        # find common prefix between word and previous word
        commonPrefix = 0
        for i in range(min(len(word), len(self.previousWord))):
            if word[i] != self.previousWord[i]:
                break
            commonPrefix += 1

        # Check the uncheckedNodes for redundant nodes, proceeding from last
        # one down to the common prefix size. Then truncate the list at that
        # point.
        self._minimize(commonPrefix)

        # add the suffix, starting from the correct node mid-way through the
        # graph
        if len(self.uncheckedNodes) == 0:
            node = self.root
        else:
            node = self.uncheckedNodes[-1][2]

        for letter in word[commonPrefix:]:
            nextNode = DawgNode()
            node.edges[letter] = nextNode
            self.uncheckedNodes.append((node, letter, nextNode))
            node = nextNode

        node.final = True
        self.previousWord = word

    def finish(self):
        # minimize all uncheckedNodes
        self._minimize(0)

    def _minimize(self, downTo):
        # proceed from the leaf up to a certain point
        for i in range(len(self.uncheckedNodes) - 1, downTo - 1, -1):
            (parent, letter, child) = self.uncheckedNodes[i]
            if child in self.minimizedNodes:
                # replace the child with the previously encountered one
                parent.edges[letter] = self.minimizedNodes[child]
            else:
                # add the state to the minimized nodes.
                self.minimizedNodes[child] = child
            self.uncheckedNodes.pop()

    def lookup(self, word):
        node = self.root
        for letter in word:
            if letter not in node.edges:
                return False
            node = node.edges[letter]

        return node.final

    def nodeCount(self):
        return len(self.minimizedNodes)

    def edgeCount(self):
        count = 0
        for node in self.minimizedNodes:
            count += len(node.edges)
        return count


dawg = Dawg()
WordCount = 0
words = open(DICTIONARY, "rt").read().split()
words.sort()
start = time.time()
for word in words:
    WordCount += 1
    dawg.insert(word)
    if (WordCount % 100) == 0:
        print("%d\r" % WordCount, end="")
dawg.finish()
print("Dawg creation took %g s" % (time.time() - start))

EdgeCount = dawg.edgeCount()
print("Read %d words into %d nodes and %d edges" % (WordCount,
    dawg.nodeCount(), EdgeCount))
print("This could be stored in as little as %d bytes" % (EdgeCount * 4))

for word in QUERY:
    if not dawg.lookup(word):
        print("%s not in dictionary." % word)
    else:
        print("%s is in the dictionary." % word)

Using this code, a list of 7 million words, taking up 63 MB, was translated into 6 million edges. Although building it took more than a gigabyte of memory in Python, the result could be stored in as little as 24 MB. Of course, gzip could do better, but the gzipped file would not be quickly searchable.

Extensions

An MA-FSA is great for testing whether words are in a dictionary. But in the form I gave, it's not possible to retrieve values associated with the words. It is possible to include associated values in the automaton. Such structures are called "Minimal Acyclic Finite State Transducers". In fact, the algorithm above can be easily modified to include a value. However, it causes the number of nodes to blow up, and you are much better off using a minimal perfect hash function in addition to your MA-FSA to store your data. I discuss this in part 3.
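For the impatient, here is a sketch of the widely used word-counting technique for turning an automaton into a minimal perfect hash (this is an illustration, not the contents of part 3): store in each node the number of words reachable from it, then accumulate an index during lookup by skipping the counts of earlier branches. It is shown on a plain trie for brevity; the counts work the same way on the minimized graph.

```python
class Node:
    def __init__(self):
        self.final = False
        self.edges = {}
        self.count = 0

def build(words):  # words must be given in sorted order
    root = Node()
    for word in words:
        node = root
        for letter in word:
            node = node.edges.setdefault(letter, Node())
        node.final = True
    return root

def annotate(node):
    # Store, on every node, how many words are reachable from it.
    node.count = (1 if node.final else 0) + sum(
        annotate(child) for child in node.edges.values())
    return node.count

def word_index(root, word):
    # Returns the position of `word` in the sorted list, or -1 if absent.
    index, node = 0, root
    for letter in word:
        if letter not in node.edges:
            return -1
        if node.final:
            index += 1  # a word ending at this node sorts before ours
        for label, child in node.edges.items():
            if label < letter:
                index += child.count  # skip the whole earlier branch
        node = node.edges[letter]
    return index if node.final else -1

words = ["cat", "catnip", "cats"]
root = build(words)
annotate(root)
assert [word_index(root, w) for w in words] == [0, 1, 2]
assert word_index(root, "dog") == -1
```

The returned index can then be used as a slot into a plain array of values, so the automaton itself never stores any payload.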



Peter

2011-01-25 00:16:00
> The central idea is to check for duplicates after you insert each word, so that the structure never gets huge.

You can check for duplicates with a hash table of subtrees. I implemented this approach in C a few years ago:

www.strchr.com/ternary_dags

BlueRaja

2011-03-24 16:51:06
That 4-bytes-per-node only allows up to 65k nodes. You can increase that to 16 million, still using 4-bytes-per-node, by assigning 3 bytes to the child-pointer, and combining the "flags" and "character" into one byte (2 bits for the flags, 6 bits for the character). However, expect searching the trie to be much slower.

Dung Nguyen Le

2011-03-25 03:15:31
Hey, I found an error

"For example, "cat" followed by <b>"catsnip"</b> does not result in a branch"

Thanks for this document :D

Marc Lepage

2011-03-25 10:11:37
In Programming Pearls, John Bentley describes how 75000 words were fit into PDP-11's 64K address space. It's another fascinating read on the same topic.

Marc Lepage

2011-03-25 12:04:50
I designed some word list data structures for mobile games on Palm and Pocket PC, where memory space was important, and also the ability to quickly search for matches (by looking for the word in the word lists).

I did not use a trie, but rather a more compact data structure where I kept lists of words broken out by length (3 letters to 8 letters in my case) and then by the first two letters (e.g. "ta" vs. "to"). For each such group, I just stored the remaining letters squished together in an array. So for 3 letter words starting with "to" I might store "npt" representing "ton", "top", "tot"; for 4 letter words starting with "to" I might store "admert" representing "toad", "tome", "tort".

To look up a word, I just use its length and first two letters, then start skipping through the array (binary search) of legal suffixes to see if it's present or not.

Ralph Corderoy

2011-07-10 09:10:43
"because the s just added to the end" That bit doesn't seem to make sense; typo?

atkers

2012-06-13 11:12:49
Once the DAWG is built, how can I print it to see what it looks like?

atkers

2012-06-13 11:31:58
Can you explain the structure here?

Steve Hanov

2013-12-27 04:07:58
James,

I am using this code unchanged in several projects and it is working.

somashekar

2014-03-04 07:58:09
In that code, what data is in the DICTIONARY file, and what delimiter is used between the words?
Email
steve.hanov@gmail.com