## Hate UML?Draw sequence diagrams in seconds.http://www.websequencediagrams.com |

<<< Why don't web browsers do this?Succinct Data Structures: Cramming 80,000 words into a Javascript file. >>>

Throw away the keys: Easy, Minimal Perfect Hashing

Posted on: 2011-03-09 18:00:00

This technique is described in Kowaltowski, T.; CL. Lucchesi (1993), "Applications of finite automata representing large vocabularies", Software-Practice and Experience 1993

In part 1 of this series, I described how to find the closest match in a dictionary of words using a Trie. Such searches are useful because users often mistype queries. But tries can take a lot of memory -- so much that they may not even fit in the 2 to 4 GB limit imposed by 32-bit operating systems.

In part 2, I described how to build a MA-FSA (also known as a DAWG). The MA-FSA greatly reduces the number of nodes needed to store the same information as a trie. They are quick to build, and you can safely substitute an MA-FSA for a trie in the fuzzy search algorithm.

There is a problem. Since the last node in a word is shared with other words, it is not possible to store data in it. We can use the MA-FSA to check if a word (or a close match) is in the dictionary, but we cannot look up any other information about the word!

If we need extra information about the words, we can use an additional data structure along with the MA-FSA. We can store it in a hash table. Here's an example of a hash table that uses separate chaining. To look up a word, we run it through a hash function, H() which returns a number. We then look at all the items in that "bucket" to find the data. Since there should be only a small number of words in each bucket, the search is very fast.

Notice that the table needs to store the keys (the words that we want to look up) as well as the data associated with them. It needs them to resolve collisions -- when two words hash to the same bucket. Sometimes these keys take up too much storage space. For example, you might be storing information about all of the URLs on the entire Internet, or parts of the human genome. In our case, we already store the words in the MA-FSA, and it is redundant to duplicate them in the hash table as well. If we could guarantee that there were no collisions, we could throw away the keys of the hash table.

Perfect hashing is a technique for building a hash table with no collisions. It is only possible to build one when we know all of the keys in advance. *Minimal* perfect hashing implies that the resulting table contains one entry for each key, and no empty slots.

We use two levels of hash functions. The first one, H(key), gets a position in an intermediate array, G. The second function, F(d, key), uses the extra information from G to find the unique position for the key. The scheme will always returns a value, so it works as long as we know *for sure* that what we are searching for is in the table. Otherwise, it will return bad information. Fortunately, our MA-FSA can tell us whether a value is in the table. If we did not have this information, then we could also store the keys with the values in the value table.

In the example below, the words "blue" and "cat" both hash to the same position using the H() function. However, the second level hash, F, combined with the *d*-value, puts them into different slots.

**In step 1**, we place the keys into buckets according to the first hash function, H.

**In step 2**, we process the buckets *largest first* and try to place all the keys it contains in an empty slot of the value table using F(d=1, key). If that is unsuccessful, we keep trying with successively larger values of d. It *sounds like* it would take a long time, but in reality it doesn't. Since we try to find the *d* value for the buckets with the most items early, they are likely to find empty spots. When we get to buckets with just one item, we can simply place them into the next unoccopied spot.

Here's some python code to demonstrate the technique. In it, we use H = F(0, key) to simplify things.

#!/usr/bin/python # Easy Perfect Minimal Hashing # By Steve Hanov. Released to the public domain. # # Based on: # Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud, # "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121 # also a good reference: # Compress, Hash, and Displace algorithm by Djamal Belazzougui, # Fabiano C. Botelho, and Martin Dietzfelbinger import sys DICTIONARY = "/usr/share/dict/words" TEST_WORDS = sys.argv[1:] if len(TEST_WORDS) == 0: TEST_WORDS = ['hello', 'goodbye', 'dog', 'cat'] # Calculates a distinct hash function for a given string. Each value of the # integer d results in a different hash value. def hash( d, str ): if d == 0: d = 0x01000193 # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ for c in str: d = ( (d * 0x01000193) ^ ord(c) ) & 0xffffffff; return d # Computes a minimal perfect hash table using the given python dictionary. It # returns a tuple (G, V). G and V are both arrays. G contains the intermediate # table of values needed to compute the index of the value in V. V contains the # values of the dictionary. def CreateMinimalPerfectHash( dict ): size = len(dict) # Step 1: Place all of the keys into buckets buckets = [ [] for i in range(size) ] G = [0] * size values = [None] * size for key in dict.keys(): buckets[hash(0, key) % size].append( key ) # Step 2: Sort the buckets and process the ones with the most items first. buckets.sort( key=len, reverse=True ) for b in xrange( size ): bucket = buckets[b] if len(bucket) <= 1: break d = 1 item = 0 slots = [] # Repeatedly try different values of d until we find a hash function # that places all items in the bucket into free slots while item < len(bucket): slot = hash( d, bucket[item] ) % size if values[slot] != None or slot in slots: d += 1 item = 0 slots = [] else: slots.append( slot ) item += 1 G[hash(0, bucket[0]) % size] = d for i in range(len(bucket)): values[slots[i]] = dict[bucket[i]] if ( b % 1000 ) == 0: print "bucket %d r" % (b), sys.stdout.flush() # Only buckets with 1 item remain. Process them more quickly by directly # placing them into a free slot. Use a negative value of d to indicate # this. freelist = [] for i in xrange(size): if values[i] == None: freelist.append( i ) for b in xrange( b, size ): bucket = buckets[b] if len(bucket) == 0: break slot = freelist.pop() # We subtract one to ensure it's negative even if the zeroeth slot was # used. G[hash(0, bucket[0]) % size] = -slot-1 values[slot] = dict[bucket[0]] if ( b % 1000 ) == 0: print "bucket %d r" % (b), sys.stdout.flush() return (G, values) # Look up a value in the hash table, defined by G and V. def PerfectHashLookup( G, V, key ): d = G[hash(0,key) % len(G)] if d < 0: return V[-d-1] return V[hash(d, key) % len(V)] print "Reading words" dict = {} line = 1 for key in open(DICTIONARY, "rt").readlines(): dict[key.strip()] = line line += 1 print "Creating perfect hash" (G, V) = CreateMinimalPerfectHash( dict ) for word in TEST_WORDS: line = PerfectHashLookup( G, V, word ) print "Word %s occurs on line %d" % (word, line)

Number of items | Time (s) |
---|---|

100000 | 2.24 |

200000 | 4.48 |

300000 | 6.68 |

400000 | 9.27 |

500000 | 11.71 |

600000 | 13.81 |

700000 | 16.72 |

800000 | 18.78 |

900000 | 21.12 |

1000000 | 24.816 |

Here's a pretty chart.

- How Asana Breaks the Rules About Per-Seat Pricing
- Dissecting Adsense
- Free, Raw Stock Data
- Stock Picking using Python
- Installing the Latest Debian on an Ancient Laptop

Want more programming tech talk?

Add to Circles on Google Plus

Subscribe to posts

2011-07-25 22:54:58

Hi Steve! Your blog very nice! I like it, can you help me a problem? Alrorithm MA-FSA or DAWG. Have you set by java? you for me! oK?

my mail: dungtp_53@vnu.edu.vn or phidungit@yahoo.com

coming soon!

2011-10-19 04:39:57

Steve, Thx for this article. in Bothelo's thesis the information theoretic lower bound of number of bits to describe a PMH is derived as the reciprocal of the probability that a random function from S (of size n) to the set {0,1,2..,m} is a perfect hash. Any idea how this follows? (I could in theory see the references cited there but was wondering if it is something you know off hand.)

2011-12-03 01:45:59

You do have a pretty sweet array of posts about efficiency and C++. Great work!

2012-03-19 02:27:44

I enjoy your simplifications; sketched pictures just make everything easy to understand.

Your code would need O(n log n) bits to store the intermediate table, G (n entries, log base 2 of n bits to represent the index). There are other algorithms that store a constant number of bits per entry via compressing their intermediate table using a Huffman like encoding while still maintaining O(1) access time.

2012-09-07 14:49:02

Hello,

have some stats about the memory used by your data structures for your different sets of keys ?

2012-10-31 06:39:53

This is an algorithm i developed in my thesis ... described as algorithm II in the following paper:

"Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud, Practical minimal perfect hash functions for large databases, CACM, 35(1):105-121"

2014-01-07 20:36:16

Stumbled onto this post, and tested some code:

for fake in ["fgouijnbbtasd", "hlubfanbybjhfbdsa", "oun987qwve76cvwa"]:

line = PerfectHashLookup(G,V,fake)

print "Word %s occurs on line %d" % (fake, line)

and then get:

Word fgouijnbbtasd occurs on line 88236

Word hlubfanbybjhfbdsa occurs on line 60409

Word oun987qwve76cvwa occurs on line 75101

Not sure if this is the expected behavior:

$ cat /usr/share/dict/words | awk 'NR == 88236 || NR == 60409 || NR == 75101'

masseurs

reactivation

supplanted

2014-01-24 22:30:27

(Edited) I had something here about the code getting into an infinite loop, but it was because I was testing with randomized keys and not not thinking about non-unqiue keys like a dummy.

2014-01-25 18:42:55

Hi Ryan,

This data structure will always find an entry even if you use a key that is not in it. It assumes that you only look up things that you know are stored in it. Use it only when you have some other means of finding out whether an item is valid.

2014-03-03 21:47:00

Hi Steve

There are cases when the program does not terminate. Here is a sample case:

dict = {"a": 1, "c": 2} # i.e., keys are "a" & "c"

----

Brief explanation of why the program get stuck:

----

The program is stuck forever inside the loop starting on line 56. The reason is: For a fixed "d", and len(str) == 1, the function hash(d, str) always returns a value whose last bit (LSB) depends only and only on the LSB of ord(str[0]). And in the loop (on line 56) we are doing this:

slot = hash( d, bucket[item] ) % size

In this case size == 2, so we are effectively only looking at the last bit of hash()'s output, which unfortunately will be same for both "a" (97), and "c" (99). The loop is constructed such that if same slot is returned for both keys, it will continue forever, and as we demonstrated above for same "d" slot will always be same for these two keys ("a", & "c").

Of course the argument shows that there is a family of inputs which will make the program go on forever (e.g., of len(dict) == 2 cases: {"a", "c"}, {"a", "e"}, .... and so on). Similar examples can be constructed for even len(dict) > 1 cases.

Do you have any suggestions on how to fix the problem ?

Thanks

Anurag

PS: Just for my own reference, edit password for the post: SHA-256(level 2 trivial password).

Email

steve.hanov@gmail.com
## Post comment