Hate UML?

Draw sequence diagrams in seconds.

Zero load time file formats
Posted on: 2011-06-29 17:27:08
Sometimes you cannot afford to load data files from disk. Maybe you need results immediately, or the data is simply too large to fit into memory. A technique that I like to use is an on-disk data structure. Here is a toy example for instantly accessing lists of related words.

In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.

No waiting

Using an on-disk data structure, there is no need to load the whole file into memory or parse it. Instead of opening a file and reading its contents, we will use a memory mapped file. We tell the operating system the file name, and it will lazily load the parts of the file only when we access them. These parts remain in the disk cache even after our program exits. So if you later start the program again, it will execute similar queries more quickly. We let the operating system do the caching for us. In python, this is done using the mmap module. Mmap makes the file appear as a very long string.

Toy example

Here is an on-disk structure for looking up related words that I prepared. (Download 11 MB of it). It has three sections: A header, an index, and a word section. The header contains the number of words. The index contains a list of pointers to word records. The word section contains the word records. It is constructed so that we can instantly query for words related to a given word by jumping around to different parts of the file.
    --- header
    4 bytes: number of words

    --- index section. The words are listed in alphabetical order, so you can
    --- look one up using binary search.
    for each word:
        4 byte ptr to word record

    --- word section:
    for each word:
       null terminated text
       4 bytes: number of related words
       for each link,
           ptr to linked word record
Here is a short python program for accessing the data file.
# An on-disk data structure for finding related words
# By Steve Hanov. This code and data file are released to the public domain.

import sys, mmap, struct

class FrozenThesaurus:
    def __init__(self, filename):
        self.f = file(filename, "rb")
        self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ )

    def getDword( self, ptr ):
        # return the 32 bit number beginning at the given byte offset in the
        # file.
        return struct.unpack("<I", self.mmap[ptr:ptr+4])[0]

    def getString( self, ptr ):
        # return the null terminated string beginning at the given byte offset.
        result = []
        while self.mmap[ptr] != "":
            ptr += 1
        return "".join(result)

    def getWordCount(self):
        # Retrive the number of words in the file.
        return self.getDword(0)

    def getWord(self, index):
        # Retrive a word, given its index. The index must be less then the word
        # count.
        return self.getString( self.getDword(4 + index * 4) )

    def getIndexOf( self, word ):        
        # perform a binary search through the index for the given word.
        high = self.getWordCount()
        low = -1

        while (high - low > 1):
            probe = (high + low) / 2

            candidate = self.getWord(probe)

            if candidate == word:
                return probe
            elif candidate < word:    
                low = probe
                high = probe

        return None

    def getRelatedWords( self, word ):
        # Returns the list of related words to the given word.
        results = []

        index = self.getIndexOf( word )
        if index == None: return results

        ptr = self.getDword( 4 + index * 4 )

        # skip past the word text
        while self.mmap[ptr] != '': ptr += 1
        ptr += 1

        numRelated = self.getDword( ptr )
        for i in range(numRelated):
            ptr += 4
            results.append( self.getString( self.getDword( ptr ) ) )

        return results;            

data = FrozenThesaurus("thesaurus.dat")
print data.getRelatedWords(sys.argv[1])

When shouldn't you use this?

SQLITE uses memory mapped files internally. If you create the proper indicies, SQLITE will match the performance of any file format that you can come up with yourself, though it may be larger. If you can store your data in a relational database, you should not go through the trouble of creating your own on-disk data structure. In particular, a thesaurus could easily be stored in an SQLITE database.

Want more programming tech talk?
Add to Circles on Google Plus
Subscribe to posts

Post comment

Real Name:
Your Email (Not displayed):

Text only. No HTML. If you write "http:" your message will be ignored.
Choose an edit password if you want to be able to edit or delete your comment later.
Editing Password (Optional):


2011-07-07 10:58:37
I did something similar to store a huge hash of URLs for a URL filtering proxy. Written in Ruby but was an order of magnitude faster than the linear-search C version.


2011-08-28 22:24:03
Hi Steve,

I translated your example to Factor, if you're curious. It's on my Re: Factor blog at re-factor.blogspot.com.



Joshua Schachter

2012-08-02 23:26:23
This didn't actually work on python 2.7.1 on OS X. You have to change '' to 'x00' for it to match the nulls.



2012-09-27 15:14:28
Algorithm to Calculate the Price for Gasoline Customer Charge


2012-10-05 11:55:17
Yeah well... that totally doesn't address endianness and struct member alignment

Other posts by Steve

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US How Asana Breaks the Rules About Per-Seat Pricing 5 Ways PowToon Made Me Want to Buy Their Software How I run my business selling software to Americans 0, 1, Many, a Zillion Give your Commodore 64 new life with an SD card reader 20 lines of code that will beat A/B testing every time [comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets UMA and free long distance UMA's dirty secrets Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets