|
Hate UML?Draw sequence diagrams in seconds.http://www.websequencediagrams.com |
In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.
--- header
4 bytes: number of words
--- index section. The words are listed in alphabetical order, so you can
--- look one up using binary search.
for each word:
4 byte ptr to word record
--- word section:
for each word:
null terminated text
4 bytes: number of related words
for each link,
ptr to linked word record
Here is a short python program for accessing the data file.
#!/usr/bin/python
# An on-disk data structure for finding related words
# By Steve Hanov. This code and data file are released to the public domain.
import sys, mmap, struct
class FrozenThesaurus:
def __init__(self, filename):
self.f = file(filename, "rb")
self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ )
def getDword( self, ptr ):
# return the 32 bit number beginning at the given byte offset in the
# file.
return struct.unpack("<I", self.mmap[ptr:ptr+4])[0]
def getString( self, ptr ):
# return the null terminated string beginning at the given byte offset.
result = []
while self.mmap[ptr] != ' ':
result.append(self.mmap[ptr])
ptr += 1
return "".join(result)
def getWordCount(self):
# Retrive the number of words in the file.
return self.getDword(0)
def getWord(self, index):
# Retrive a word, given its index. The index must be less then the word
# count.
return self.getString( self.getDword(4 + index * 4) )
def getIndexOf( self, word ):
# perform a binary search through the index for the given word.
high = self.getWordCount()
low = -1
while (high - low > 1):
probe = (high + low) / 2
candidate = self.getWord(probe)
if candidate == word:
return probe
elif candidate < word:
low = probe
else:
high = probe
return None
def getRelatedWords( self, word ):
# Returns the list of related words to the given word.
results = []
index = self.getIndexOf( word )
if index == None: return results
ptr = self.getDword( 4 + index * 4 )
# skip past the word text
while self.mmap[ptr] != ' ': ptr += 1
ptr += 1
numRelated = self.getDword( ptr )
for i in range(numRelated):
ptr += 4
results.append( self.getString( self.getDword( ptr ) ) )
return results;
data = FrozenThesaurus("thesaurus.dat")
print data.getRelatedWords(sys.argv[1])
Want more programming tech talk?
Add to Circles on Google Plus
Subscribe to posts
I translated your example to Factor, if you're curious. It's on my Re: Factor blog at re-factor.blogspot.com.
Thanks,
John.
Joshua
Post comment