Zero load time file formats

Posted 13 years ago

Sometimes you cannot afford to load a whole data file from disk before answering a query. Maybe you need results immediately, or the data is simply too large to fit into memory. A technique that I like to use is an on-disk data structure. Here is a toy example for instantly accessing lists of related words.

In this article, I address the problem of the time needed to load data into memory from disk. However, I do not make any optimization for disk caches or blocks. I am not going to talk about B-Trees or cache-oblivious structures.

No waiting

Using an on-disk data structure, there is no need to load the whole file into memory or parse it. Instead of opening a file and reading its contents, we will use a memory-mapped file. We tell the operating system the file name, and it lazily loads parts of the file only when we access them. These parts remain in the disk cache even after our program exits, so if you later start the program again, it will execute similar queries more quickly. We let the operating system do the caching for us. In Python, this is done using the mmap module. A memory-mapped file behaves like one very long byte string that you can index and slice.
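
As a rough illustration of the idea, the sketch below maps a file and reads a few bytes from the middle of it without reading the rest. The helper name `read_slice` and the demo file are made up for this example and are not part of the thesaurus code.

```python
import mmap
import os
import tempfile

def read_slice(path, start, length):
    """Return `length` bytes at byte offset `start` using a memory-mapped file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only fetched when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[start:start + length]

# Hypothetical demo file, created only for this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"A" * 1000000 + b"needle" + b"B" * 1000000)

print(read_slice(demo_path, 1000000, 6))  # prints b'needle'
```

Slicing the map copies only the requested bytes into a Python object; the operating system decides which pages of the file actually have to be read.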

Toy example

Here is an on-disk structure for looking up related words that I prepared. (Download 11 MB of it). It has three sections: a header, an index, and a word section. The header contains the number of words. The index contains a list of pointers to word records. The word section contains the word records. Every number and pointer is stored as a 4-byte little-endian unsigned integer, and each pointer is a byte offset from the start of the file. The file is constructed so that we can instantly query for words related to a given word by jumping around to different parts of the file.

    --- header
    4 bytes: number of words

    --- index section. The words are listed in alphabetical order, so you can
    --- look one up using binary search.
    for each word:
        4 byte ptr to word record

    --- word section:
    for each word:
        null terminated text
        4 bytes: number of related words
        for each link:
            4 byte ptr to linked word record
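
The article does not show how the data file was generated. The following is a minimal sketch of one way to write a file in this layout, assuming UTF-8 text and 4-byte little-endian integers. The function name `build_thesaurus` and its dictionary input are hypothetical.

```python
import struct

def build_thesaurus(path, related):
    """Write `related` (dict: word -> list of related words) in the layout above."""
    # Every linked word needs its own record, so collect all words first.
    words = set(related)
    for links in related.values():
        words.update(links)
    # Sort by encoded bytes so a binary search over the index works.
    encoded = sorted(w.encode("utf-8") for w in words)

    # First pass: compute the byte offset of every word record.
    offset = 4 + 4 * len(encoded)  # header plus index
    offsets = {}
    for w in encoded:
        offsets[w] = offset
        n_links = len(related.get(w.decode("utf-8"), []))
        offset += len(w) + 1 + 4 + 4 * n_links  # text, null, count, links

    # Second pass: write the header, the index, then the word records.
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(encoded)))
        for w in encoded:
            f.write(struct.pack("<I", offsets[w]))
        for w in encoded:
            links = related.get(w.decode("utf-8"), [])
            f.write(w + b"\x00")
            f.write(struct.pack("<I", len(links)))
            for link in links:
                f.write(struct.pack("<I", offsets[link.encode("utf-8")]))
```

Two passes are needed because a record can point forward to a record that has not been written yet, so all offsets must be known before any pointer is emitted.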

Here is a short Python program for accessing the data file.

#!/usr/bin/env python3
# An on-disk data structure for finding related words
# By Steve Hanov. This code and data file are released to the public domain.

import sys, mmap, struct

class FrozenThesaurus:
    def __init__(self, filename):
        self.f = open(filename, "rb")
        self.mmap = mmap.mmap( self.f.fileno(), 0, access=mmap.ACCESS_READ )

    def getDword( self, ptr ):
        # Return the 32-bit little-endian unsigned number beginning at the
        # given byte offset in the file.
        return struct.unpack("<I", self.mmap[ptr:ptr+4])[0]

    def getString( self, ptr ):
        # Return the null-terminated string beginning at the given byte offset.
        # The text is assumed to be UTF-8 (plain ASCII decodes the same way).
        end = self.mmap.find(b"\x00", ptr)
        return self.mmap[ptr:end].decode("utf-8")

    def getWordCount(self):
        # Retrieve the number of words in the file.
        return self.getDword(0)

    def getWord(self, index):
        # Retrieve a word, given its index. The index must be less than the
        # word count.
        return self.getString( self.getDword(4 + index * 4) )

    def getIndexOf( self, word ):
        # Perform a binary search through the index for the given word.
        # low and high are exclusive bounds on the position of the word.
        high = self.getWordCount()
        low = -1

        while (high - low > 1):
            probe = (high + low) // 2

            candidate = self.getWord(probe)

            if candidate == word:
                return probe
            elif candidate < word:
                low = probe
            else:
                high = probe

        return None

    def getRelatedWords( self, word ):
        # Return the list of words related to the given word.
        results = []

        index = self.getIndexOf( word )
        if index is None: return results

        ptr = self.getDword( 4 + index * 4 )

        # Skip past the word text and its null terminator.
        ptr = self.mmap.find(b"\x00", ptr) + 1

        numRelated = self.getDword( ptr )
        for i in range(numRelated):
            ptr += 4
            results.append( self.getString( self.getDword( ptr ) ) )

        return results

data = FrozenThesaurus("thesaurus.dat")
print(data.getRelatedWords(sys.argv[1]))

When shouldn't you use this?

SQLite also reads its database file on demand, one page at a time, and can optionally use memory-mapped I/O. If you create the proper indexes, SQLite will generally match the performance of any file format that you can come up with yourself, though the file may be larger. If you can store your data in a relational database, you should not go through the trouble of creating your own on-disk data structure. In particular, a thesaurus could easily be stored in an SQLite database.
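
For comparison, here is a minimal sketch of the same lookup using Python's sqlite3 module. The table names, column names, and helper functions are assumptions made for this example and are not taken from any existing database.

```python
import sqlite3

# ":memory:" keeps the demo self-contained; pass a file name to store on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, text TEXT NOT NULL UNIQUE);
    CREATE TABLE links (
        word_id    INTEGER NOT NULL REFERENCES words(id),
        related_id INTEGER NOT NULL REFERENCES words(id),
        PRIMARY KEY (word_id, related_id)
    );
""")

def word_id(conn, text):
    # Insert the word if it is new, then return its row id.
    conn.execute("INSERT OR IGNORE INTO words (text) VALUES (?)", (text,))
    return conn.execute("SELECT id FROM words WHERE text = ?", (text,)).fetchone()[0]

def add_related(conn, word, related):
    wid = word_id(conn, word)
    for other in related:
        conn.execute("INSERT OR IGNORE INTO links VALUES (?, ?)",
                     (wid, word_id(conn, other)))

def get_related(conn, word):
    # The UNIQUE and PRIMARY KEY constraints give SQLite the indexes it needs.
    rows = conn.execute(
        "SELECT r.text FROM words w "
        "JOIN links l ON l.word_id = w.id "
        "JOIN words r ON r.id = l.related_id "
        "WHERE w.text = ? ORDER BY r.text", (word,))
    return [row[0] for row in rows]

add_related(conn, "big", ["large", "huge"])
print(get_related(conn, "big"))  # prints ['huge', 'large']
```

The two constraints play the role of the hand-built index section: the lookup by word text and the lookup of links by word id are both index searches rather than table scans.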

Steve Hanov makes a living working on Rhymebrain.com, rapt.ink, www.websequencediagrams.com, and Zwibbler.com. He lives in Waterloo, Canada.

Gregory

twelve years ago

Yeah well... that totally doesn't address endianness and struct member alignment

Joash

twelve years ago

Algorithm to Calculate the Price for Gasoline Customer Charge

Joshua Schachter

twelve years ago

This didn't actually work on python 2.7.1 on OS X. You have to change '' to 'x00' for it to match the nulls.

Joshua

John

13 years ago

Hi Steve,

I translated your example to Factor, if you're curious. It's on my Re: Factor blog at re-factor.blogspot.com.

Thanks,

John.

rjp

13 years ago

I did something similar to store a huge hash of URLs for a URL filtering proxy. Written in Ruby but was an order of magnitude faster than the linear-search C version.
