Usability Nightmare: Xfce Settings Manager
Rant: Why can't anyone make a good settings screen?

In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.
--- header
4 bytes: number of words
--- index section. The words are listed in alphabetical order, so you can
--- look one up using binary search.
for each word:
4 byte ptr to word record
--- word section:
for each word:
null terminated text
4 bytes: number of related words
for each link,
ptr to linked word record
Here is a short python program for accessing the data file.
#!/usr/bin/python
# An on-disk data structure for finding related words
# By Steve Hanov. This code and data file are released to the public domain.
import sys, mmap, struct
class FrozenThesaurus:
def __init__(self, filename):
self.f = file(filename, "rb")
self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ )
def getDword( self, ptr ):
# return the 32 bit number beginning at the given byte offset in the
# file.
return struct.unpack("<I", self.mmap[ptr:ptr+4])[0]
def getString( self, ptr ):
# return the null terminated string beginning at the given byte offset.
result = []
while self.mmap[ptr] != "\x00":
result.append(self.mmap[ptr])
ptr += 1
return "".join(result)
def getWordCount(self):
# Retrive the number of words in the file.
return self.getDword(0)
def getWord(self, index):
# Retrive a word, given its index. The index must be less then the word
# count.
return self.getString( self.getDword(4 + index * 4) )
def getIndexOf( self, word ):
# perform a binary search through the index for the given word.
high = self.getWordCount()
low = -1
while (high - low > 1):
probe = (high + low) / 2
candidate = self.getWord(probe)
if candidate == word:
return probe
elif candidate < word:
low = probe
else:
high = probe
return None
def getRelatedWords( self, word ):
# Returns the list of related words to the given word.
results = []
index = self.getIndexOf( word )
if index == None: return results
ptr = self.getDword( 4 + index * 4 )
# skip past the word text
while self.mmap[ptr] != '\x00': ptr += 1
ptr += 1
numRelated = self.getDword( ptr )
for i in range(numRelated):
ptr += 4
results.append( self.getString( self.getDword( ptr ) ) )
return results;
data = FrozenThesaurus("thesaurus.dat")
print data.getRelatedWords(sys.argv[1])
Joshua
I translated your example to Factor, if you're curious. It's on my Re: Factor blog at re-factor.blogspot.com.
Thanks,
John.
Rant: Why can't anyone make a good settings screen?
Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of person that the photo most resembles?
In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at a points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.
A practical, memory efficient way to store and search large sets of words.
Back in 2007, I created a rhyming engine based on the public domain Moby pronouncing dictionary. It simply reads the dictionary and looks for rhyming words by comparing the suffix of the words' pronunciations. Since that time, I have made some improvements.