Zero load time file formats
In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.
No waitingUsing an on-disk data structure, there is no need to load the whole file into memory or parse it. Instead of opening a file and reading its contents, we will use a memory mapped file. We tell the operating system the file name, and it will lazily load the parts of the file only when we access them. These parts remain in the disk cache even after our program exits. So if you later start the program again, it will execute similar queries more quickly. We let the operating system do the caching for us. In python, this is done using the mmap module. Mmap makes the file appear as a very long string.
Toy exampleHere is an on-disk structure for looking up related words that I prepared. (Download 11 MB of it). It has three sections: A header, an index, and a word section. The header contains the number of words. The index contains a list of pointers to word records. The word section contains the word records. It is constructed so that we can instantly query for words related to a given word by jumping around to different parts of the file.
--- header 4 bytes: number of words --- index section. The words are listed in alphabetical order, so you can --- look one up using binary search. for each word: 4 byte ptr to word record --- word section: for each word: null terminated text 4 bytes: number of related words for each link, ptr to linked word recordHere is a short python program for accessing the data file.
#!/usr/bin/python # An on-disk data structure for finding related words # By Steve Hanov. This code and data file are released to the public domain. import sys, mmap, struct class FrozenThesaurus: def __init__(self, filename): self.f = file(filename, "rb") self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ ) def getDword( self, ptr ): # return the 32 bit number beginning at the given byte offset in the # file. return struct.unpack("<I", self.mmap[ptr:ptr+4]) def getString( self, ptr ): # return the null terminated string beginning at the given byte offset. result =  while self.mmap[ptr] != "\x00": result.append(self.mmap[ptr]) ptr += 1 return "".join(result) def getWordCount(self): # Retrive the number of words in the file. return self.getDword(0) def getWord(self, index): # Retrive a word, given its index. The index must be less then the word # count. return self.getString( self.getDword(4 + index * 4) ) def getIndexOf( self, word ): # perform a binary search through the index for the given word. high = self.getWordCount() low = -1 while (high - low > 1): probe = (high + low) / 2 candidate = self.getWord(probe) if candidate == word: return probe elif candidate < word: low = probe else: high = probe return None def getRelatedWords( self, word ): # Returns the list of related words to the given word. results =  index = self.getIndexOf( word ) if index == None: return results ptr = self.getDword( 4 + index * 4 ) # skip past the word text while self.mmap[ptr] != '\x00': ptr += 1 ptr += 1 numRelated = self.getDword( ptr ) for i in range(numRelated): ptr += 4 results.append( self.getString( self.getDword( ptr ) ) ) return results; data = FrozenThesaurus("thesaurus.dat") print data.getRelatedWords(sys.argv)
When shouldn't you use this?SQLITE uses memory mapped files internally. If you create the proper indicies, SQLITE will match the performance of any file format that you can come up with yourself, though it may be larger. If you can store your data in a relational database, you should not go through the trouble of creating your own on-disk data structure. In particular, a thesaurus could easily be stored in an SQLITE database.
When a reporter mangles your elevator pitchIf a reporter asks you about your new startup company, be careful what you say.
Give your Commodore 64 new life with an SD card readerDust off your old Commodore 64, and you could be the coolest kid on the block by plugging SD cards into it instead of floppies.
Copy a cairo surface to the windows clipboardI just spent several hours debugging clipboard copy of a DIB image. I could copy from my application, and paste into Paint. I could paste into Word. But if I pasted into WordPad, nothing showed up. If I pasted into GIMP, it crashed.
Is 2009 the year of Linux malware?Is 2009 the year of the linux desktop malware? How long until we see headlines like, "Researchers find massive botnet based on linux 2.30"?
My thoughts on various programming languagesSome ill-informed remarks on various programming languages.
Why you should go to the Business of Software Conference Next YearMost people, having already paid $2000.00 of their hard earned money, and then having flown, driven, or otherwise travelled to Boston to attend a conference, and then having paid an additional $250/night plus $33/night parking and "tourism taxes" to the Seaport Hotel -- most people, after all this, are unlikely to say that it was a waste of time and they should have stayed home watching the remaining salvaged episodes of Doctor Who on Netflix.
In fact, I found it quite useful.