VP trees: A data structure for finding stuff fast
Posted on: 2011-12-02 08:00:00

Let's say you have millions of pictures of faces tagged with names. Given a new photo, how do you find the name of person that the photo most resembles?

Suppose you have scanned short sections of millions of songs, and for each five second period you have a rough list of the frequencies and beat patterns contained in them. Given a new audio snippet, can you find the song to which it belongs?

What if you have data from thousands of web site users, including usage frequency, when they signed up, what actions they took, etc. Given a new user's actions, can you find other users like them and predict whether they will upgrade or stop using your product?

In the cases I mentioned, each record has hundreds or thousands of elements: the pixels in a photo, or patterns in a sound snippet, or web usage data. These records can be regarded as points in high dimensional space. When you look at a points in space, they tend to form clusters, and you can infer a lot by looking at ones nearby.

In this blog entry, I will half-heartedly describe some data structures for spatial search. Then I will launch into a detailed explanation of VP-Trees (Vantage Point Trees), which are simple, fast, and can easily handle low or high dimensional data.

Data structures for spatial search

When a programmer wants to search for points in space, perhaps the the first data structure that springs to mind is the K-D tree. In this structure, we repeatedly subdivide all of the points along a particular dimension to form a tree structure.

With high dimensional data, the benefits of the K-D tree are soon lost. As the number of dimensions increase, the points tend to scatter and it becomes difficult to pick a good splitting dimension. Hundreds of students have gotten their masters degree by coding up K-D trees and comparing them with an alphabet soup of other trees. (In particular, I like this one.)

The authors of Data Mining: Practical machine Learning Tools and Techniques suggests using Ball Trees. Each node of a Ball tree describes a bounding sphere, using a centre and a radius. To make the search efficient, the nodes should use the minimal sphere that completely contains all of its children, and overlaps the least with other sibling spheres in the tree.

Ball trees work, but they are difficult to construct. It is hard to figure out the optimal placement of spheres to minimize the overlap. For high dimensional data, the structure can be huge. The nodes must store their centre, and if a point has thousands of coordinates, it occupies a lot of storage. Moreover, you need to be able to calculate these fake sphere centres from the other points. What, exactly, does it mean to calculate a point between two sets of users' web usage history?

Fortunately, there are methods of building tree structures which do not require manipulation of the individual coordinates. The things that you put in them do not need to resemble points. You only need a way to figure out how far apart they are.

Entering metric space

Image you are blindfolded and placed in a gymnasium filled with other blindfolded people. Even worse: you also lost all sense of direction. When others talk, you can sense how far away they are, but not where they are in the room. Eventually, some basic laws become clear.

  1. If there is no distance between you and the other person, you are standing in the same spot.
  2. When you talk to another person, they perceive you has being the same distance away as you perceive them.
  3. When you talk to person A and person B, the distance to A is always less than the distance to B plus the distance from A to B. In other words, the shortest distance between two people is a straight line. Distance is never negative.

This is a metric space. The great thing about metric spaces is that the things that you put in them do not need to do a lot. All you need is a way of calculating the distances between them. You do not need to be able to add them together or find bounding shapes or find points midway between them. The data structure that I want to talk about is the Vantage Point Tree (a generalization of the BK-tree that is eloquently reviewed in Damn cool algorithms.

Each node of the tree contains one of the input points, and a radius. Under the left child are all points which are closer to the node's point than the radius. The other child contains all of the points which are farther away. The tree requires no other knowledge about the items in it. All you need is a distance function that satisfies the properties of a metric space.

How searching a VP-Tree works

Let us examine one of these nodes in detail, and what happens during a recursive search for the nearest neighbours to a target.

Suppose we want to find the two nearest neighbours to the target, marked with the red X. Since we have no points yet, the node's center p is the closest candidate, and we add it to the list of results. (It might be bumped out later). At the same time, we update our variable tau which tracks the distance of the farthest point that we have in our results.

Then, we have to decide whether to search the left or right child first. We may end up having to search them both, but we would like to avoid that most of the time.

Since the target is closer to the node's center than its outer shell, we search the left child first, which contains all of the points closer than the radius. We find the blue point. Since it is farther away than tau we update the tau value.

Do we need to continue the search? We know that we have considered all the points that are within the distance radius of p. However, it is closer to get to the outer shell than the farthest point that we have found. Therefore there could be closer points just outside of the shell. We do need to descend into the right child to find the green point.

If, however, we had reached our goal of collecting the n nearest points, and the target point is farther from the the outer shell than the farthest point that we have collected, then we could have stopped looking. This results in significant savings.

Implementation

Here is an implementation of the VP Tree in C++. The recursive search() function decides whether to follow the left, right, or both children. To efficiently maintain the list of results, we use a priority queue. (See my article, Finding the top k items in a list efficiently for why).

I tried it out on a database of all the cities in the world, and the VP tree search was 3978 times faster than a linear search through all the points. You can download the C++ program that uses the VP tree for this purpose here.

It is worth repeating that you must use a distance metric that satisfies the triangle inequality. I spent a lot of time wondering why my VP tree was not working. It turns out that I had not bothered to find the square root in the distance calculation. This step is important to satisfy the requirements of a metric space, because if the straight line distance to a <= b+c, it does not necessarily follow that a2 <= b2 + c2.

Here is the output of the program when you search for cities by latitude and longitude.

Create took 15484122
Search took 36
ca,waterloo,Waterloo,08,43.4666667,-80.5333333
 0.0141501
ca,kitchener,Kitchener,08,43.45,-80.5
 0.025264
ca,bridgeport,Bridgeport,08,43.4833333,-80.4833333
 0.0396333
ca,elmira,Elmira,08,43.6,-80.55
 0.137071
ca,baden,Baden,08,43.4,-80.6666667
 0.161756
ca,floradale,Floradale,08,43.6166667,-80.5833333
 0.163351
ca,preston,Preston,08,43.4,-80.35
 0.181762
ca,ayr,Ayr,08,43.2833333,-80.45
 0.195739
---
Linear search took 143212
ca,waterloo,Waterloo,08,43.4666667,-80.5333333
 0.0141501
ca,kitchener,Kitchener,08,43.45,-80.5
 0.025264
ca,bridgeport,Bridgeport,08,43.4833333,-80.4833333
 0.0396333
ca,elmira,Elmira,08,43.6,-80.55
 0.137071
ca,baden,Baden,08,43.4,-80.6666667
 0.161756
ca,floradale,Floradale,08,43.6166667,-80.5833333
 0.163351
ca,preston,Preston,08,43.4,-80.35
 0.181762
ca,ayr,Ayr,08,43.2833333,-80.45
 0.195739

Construction

I'm too lazy to implement a delete or insert function. It is most efficient to simply build the tree by repeatedly partitioning the data. We build the tree from the top down from an array of items. For each node, we first choose a point at random, and then partition the list into two sets: The left children contain the points farther away than the median, and the right contains the points that are closer than the median. Then we recursively repeat this until we have run out of points.
// A VP-Tree implementation, by Steve Hanov. (steve.hanov@gmail.com)
// Released to the Public Domain
// Based on "Data Structures and Algorithms for Nearest Neighbor Search" by Peter N. Yianilos
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <stdio.h>
#include <queue>
#include <limits>

template<typename T, double (*distance)( const T&, const T& )>
class VpTree
{
public:
    VpTree() : _root(0) {}

    ~VpTree() {
        delete _root;
    }

    void create( const std::vector& items ) {
        delete _root;
        _items = items;
        _root = buildFromPoints(0, items.size());
    }

    void search( const T& target, int k, std::vector* results, 
        std::vector<double>* distances) 
    {
        std::priority_queue<HeapItem> heap;

        _tau = std::numeric_limits::max();
        search( _root, target, k, heap );

        results->clear(); distances->clear();

        while( !heap.empty() ) {
            results->push_back( _items[heap.top().index] );
            distances->push_back( heap.top().dist );
            heap.pop();
        }

        std::reverse( results->begin(), results->end() );
        std::reverse( distances->begin(), distances->end() );
    }

private:
    std::vector<T> _items;
    double _tau;

    struct Node 
    {
        int index;
        double threshold;
        Node* left;
        Node* right;

        Node() :
            index(0), threshold(0.), left(0), right(0) {}

        ~Node() {
            delete left;
            delete right;
        }
    }* _root;

    struct HeapItem {
        HeapItem( int index, double dist) :
            index(index), dist(dist) {}
        int index;
        double dist;
        bool operator<( const HeapItem& o ) const {
            return dist < o.dist;   
        }
    };

    struct DistanceComparator
    {
        const T& item;
        DistanceComparator( const T& item ) : item(item) {}
        bool operator()(const T& a, const T& b) {
            return distance( item, a ) < distance( item, b );
        }
    };

    Node* buildFromPoints( int lower, int upper )
    {
        if ( upper == lower ) {
            return NULL;
        }

        Node* node = new Node();
        node->index = lower;

        if ( upper - lower > 1 ) {

            // choose an arbitrary point and move it to the start
            int i = (int)((double)rand() / RAND_MAX * (upper - lower - 1) ) + lower;
            std::swap( _items[lower], _items[i] );

            int median = ( upper + lower ) / 2;

            // partitian around the median distance
            std::nth_element( 
                _items.begin() + lower + 1, 
                _items.begin() + median,
                _items.begin() + upper,
                DistanceComparator( _items[lower] ));

            // what was the median?
            node->threshold = distance( _items[lower], _items[median] );

            node->index = lower;
            node->left = buildFromPoints( lower + 1, median );
            node->right = buildFromPoints( median, upper );
        }

        return node;
    }

    void search( Node* node, const T& target, int k,
                 std::priority_queue& heap )
    {
        if ( node == NULL ) return;

        double dist = distance( _items[node->index], target );
        //printf("dist=%g tau=%gn", dist, _tau );

        if ( dist < _tau ) {
            if ( heap.size() == k ) heap.pop();
            heap.push( HeapItem(node->index, dist) );
            if ( heap.size() == k ) _tau = heap.top().dist;
        }

        if ( node->left == NULL && node->right == NULL ) {
            return;
        }

        if ( dist < node->threshold ) {
            if ( dist - _tau <= node->threshold ) {
                search( node->left, target, k, heap );
            }

            if ( dist + _tau >= node->threshold ) {
                search( node->right, target, k, heap );
            }

        } else {
            if ( dist + _tau >= node->threshold ) {
                search( node->right, target, k, heap );
            }

            if ( dist - _tau <= node->threshold ) {
                search( node->left, target, k, heap );
            }
        }
    }
};

More...

ADATA SH93 Teardown
Posted on: 2011-11-19 13:58:24

My wife gave one of our laptops, a Dell Inspiron 6400, a swift accidental kick, and it stopped working. I quickly sprung into action and put my Masters of Mathematics and years of education in algorithms and advanced compiler optimization to use. First I turned it on. “No hard drive found," said the BIOS. I deftly operated a screwdriver to diagnose the problem. Okay, actually I used a kitchen knife to unscrew the drawer from the laptop. When I finally got it out, I realized how much hard drives and incandescent light bulbs have in common. When you shake them and they rattle, you know it's time to replace them.

I was going to buy a new hard drive, but then I remembered that I had this old ADATA SH93 hard drive lying around. It could survive drops, stepping on it, freezing, or being shot with hundreds of pounds of water using a firehose, but if you slightly yank on the USB cable in any direction, it will bend the little prongs and it will never work again. I wondered if I could convert this overpriced paperweight to an overpriced internal hard drive.

The Internet does not contain any information about it. Nobody is going to pay extra for a waterproof, shockproof hard drive only to tear it open. I resolved to open it up and post my findings on the web.

First, I cut into the corner of the rubber shell with my knife/screwdriver. Note how I don't use any anti-static protection. Don't try this at home!

Cut the other side, and the military-grade shell slips right off. It reveals two screws that I removed..

The plastic case contains some tabs to hold it together. I had to pry it apart with a screwdriver. The shell got all bent out of shape.

This is why at work they never let me near the expensive prototypes anymore.

Inside you can see the hard drive, sheathed in static-dispersing foil.

The drive seemed hard to remove, but it can be gently pried out of the case using something pointy.

Success! My laptop now has an overpriced 500 GB SATA hard drive.

More...

Why you should go to the Business of Software Conference Next Year
Posted on: 2011-10-29 00:19:36

Most people, having already paid $2000.00 of their hard earned money, and then having flown, driven, or otherwise travelled to Boston to attend a conference, and then having paid an additional $250/night plus $33/night parking and "tourism taxes" to the Seaport Hotel -- most people, after all this, are unlikely to say that it was a waste of time and they should have stayed home watching the remaining salvaged episodes of Doctor Who on Netflix.

In fact, I found it quite useful.

The talks by Clayton Christenson (author of The Innovators Dilemma), Rory Sutherland (expert on Behavioural economics) and the dozens of entrepreneurs (both serial and parallel) were all very fascinating and useful, and they were all broadcast for free, and they will soon be up for streaming, for free.

So why go through all of this effort to physically go to the conference?


One of the conference rooms at Business of Software 2011.

What the the World Trade Center in Boston lacks in number of bathrooms, it more than makes up for in hallways. It has roughly 1000 miles of hallways in which you can bump into successful business people. And every one of them is trying to meet you and get your take on important, urgent business-related matters like, "Have you seen an empty bathroom?"

Seriously, when not at the conference, and people ask what I do, I have learned to say something like "I do computers". People here understand when I talk about NoSQL databases, SaaS models, and programmer development tools. The amount of time until their eyes glaze over is well over the 60 second mark.

You also get some inside info. People aren't shy talking about their pricing. How much does the super-mega-ultra corporate option cost? The one where instead of a price, it says "Call"? These people will tell you, because they don't get to talk about it much, and they are honestly trying to help.

I talked to C.E.O.s, and C.T.Os, of 3 to 30 person companies. I talked to VPs, Cloud Engineers, and Intrapreneurs of big companies. For many, this is the first opportunity to talk to an outsider about their businesses. It is like psychotherapy. Often they would come to a sudden realization. "Hey," a micro-ISV would say, "I just have a fear of releasing the next version because it's missing some difficult features. I should just do it anyway!". If you go to this conference, you probably already know what you should do to improve your business. But having Jason Cohen, or some seasoned CEO tell you in person moves it up onto the todo list.

General trends

  • Disruption - Disruption is big. If you're not disruptive, you might as well be selling mainframes and typewriters. Companies are disrupting each-other at an astounding rate. Sometimes, while one company is busy disrupting an industry, another one will sneak up behind it and try to disrupt it when it is not looking. That is why companies need to be agile and pivot frequently.
  • Metrics - The info-geeks have taken over. Founders are demanding dashboards for their business, updated in real time. But not only for themselves -- every click of the web site, and every cancellation is streamed to every employee to give an accurate picture of the health of the company. A special version containing only the "Customer Happiness Index" and a huge happy face is streamed to the investors.
  • Crowd-sourced employee recognition - At least three companies are working on this. It can be hard for bosses to identify their best contributors to allocate bonuses. The idea is to crowd-source this from their workforce. "So we'll give them a button -- so whenever anybody does something nice, other people will just push it and they get a -- a pony point --- yeah! And then I just have to add them all up to find the best contributors!" If you've worked at a large company for more than a year, you already know what an awesome idea this is. Just rename "pony" to "stab".
  • Skype - Ask anybody, in tiny or large companies. Odds are that they bypass their Enterprise Collabosoft GrouperWare system and secretly use Skype to communicate. Just a minute while I go privately Skype to people about why Microsoft should acquire my startup.
  • Dishonesty - Jason Cohen gave a talk about how honesty in business can differentiate you. If you are a small company, he says, you should not try to hide it. Companies will be refreshed by your truthfulness, and it sets the correct expectations at the outset. Most of the attendees believe honesty is a great idea. Companies should all be honest! Because Jason Cohen says it pays! But if you are in a uniquely special business, such as storing data securely in the cloud, or selling software as a service, or selling licensed software, or you offer a limited or very diverse product line, or you have competition -- in these very special situations, honesty definitely will never work. At least, that's the going opinion.

I hope you're convinced of the value that Business of Software has to offer, and I hope to see you there next year. I should be finished Doctor Who by then.

More...

Four ways of handling asynchronous operations in node.js
Posted on: 2011-09-30 21:30:00

Javascript was not designed to do asynchronous operations easily. If it were, then writing asynchronous code would be as easy as writing blocking code. Instead, developers in node.js need to manage many levels of callbacks.

Today, we will examine four different methods of performing the same task asynchronously, in node.js. We will read in all the files in the current folder,

  1. In parallel, using callback functions
  2. Sequentially, using callback functions
  3. In parallel, using promises
  4. Sequentially, using promises

This will help you decide which to use for your particular situation. It is simply a matter of taste. If you want to know which is the best method -- the absolute best way to go -- is probably to switch to Erlang.

Reading the files in parallel using callbacks

Our toy problem is to read all of the files in the current folder.

With our toy problem, we can't do everything in parallel. To read the files in the folder, we first need to know which files are in the folder. Thus we start with the readdir() function. We wait for the operation to complete. Then, for each file, we use the readFile() to get the contents of the file.

Here are the documentation for the functions, from node.js.

fs.readdir(path, [callback])

Asynchronous readdir(3). Reads the contents of a directory. The callback gets two arguments (err, files) where files is an array of the names of the files in the directory excluding '.' and '..'.

fs.readFile(filename, [encoding], [callback])

Asynchronously reads the entire contents of a file.

The callback is passed two arguments (err, data), where data is the contents of the file.

If no encoding is specified, then the raw buffer is returned.

All of the readFile() operations happen at once, and then we wait for the results to come in. We simply count how many times a readFile() operation has completed. When all of the files have been read, we know we are done.

    // Read all files in the folder in parallel.
    var fs = require("fs");

    fs.readdir( ".", function( err, files) {
        if ( err ) {
            console.log("Error reading files: ", err);
        } else {
            // keep track of how many we have to go.
            var remaining = files.length;
            var totalBytes = 0;

            if ( remaining == 0 ) {
                console.log("Done reading files. totalBytes: " +
                    totalBytes);
            }

            // for each file,
            for ( var i = 0; i < files.length; i++ ) {
                // read its contents.
                fs.readFile( files[i], function( error, data ) {
                    if ( error ) {
                        console.log("Error: ", error);
                    } else {
                        totalBytes += data.length
                        console.log("Successfully read a file.");
                    }
                    remaining -= 1;
                    if ( remaining == 0 ) {
                        console.log("Done reading files. totalBytes: " +
                            totalBytes);
                    }
                });
            }
        }
    });

Reading the files sequentially using callbacks

It is usually most efficient to to the above. Order the computer to do everything at once, and let the operating system sort it out. But that's not always what you want. Sometimes you need to impose an order and do things sequentially.

Here is an example of reading each file one at a time. The for loop is gone. It is replaced by a recursive function. The function checks to see if it has reached the last file. If so, it is done. Otherwise, it calls itself to process the next file in the list.

    // Read all the files in the folder in sequence, using callbacks
    var fs = require("fs");

    fs.readdir( ".", function( error, files ) {
        if ( error ) {
            console.log("Error listing file contents.");
        } else {
            var totalBytes = 0;

            // This function repeatedly calls itself until the files are all read.
            var readFiles = function(index) {
                if ( index == files.length ) {
                    // we are done.
                    console.log( "Done reading files. totalBytes = " + 
                        totalBytes );
                } else {

                    fs.readFile( files[index], function( error, data ) {
                        if ( error ) {
                            console.log( "Error reading file. ", error );
                        } else {
                            totalBytes += data.length;
                            readFiles(index + 1);
                        }
                    });
                }

            };

            readFiles(0);
        }
    });

Reading the files in parallel using Promises

A promise (also known a future, and sometimes a channel) is a concept from the 1970's that has recently become popular for Javascript programming. Promises are implemented in the node.js module promise. This module is not included unless you add it.

When you call a function and expect a return value, and the value is not yet available, the function instead returns a promise. The caller can then store the promise for later or schedule a subsequent operation when it completes. Promises can be seen as a more specialized form of node.js's EventEmitter, where the only two events are reject or resolve. Instead of using "on" to listen for the events, we use "then".

Here is a really simple example of promises being used.

var promise = doSomeAsynchronousOperation();
promise.then( function(result) {
    // yay! I got the result.
}, function(error) {
    // The promise was rejected with this error.
}

function doSomeAsynchronousOperation()
{
   var promise = new Promise.Promise();
   fs.readFile( "somefile.txt", function( error, data ) {
        if ( error ) {
            promise.reject( error );
        } else {
            promise.resolve( data );
        }
    });

    return promise;
}

Promises may be easier to deal with for some people, because functions that return promises are harder to misuse. You could forget whether a callback belongs in the 3rd or 4th parameter, but you can't make that mistake with a return value. Another advantage is that they encapulate the recursive function loop above. You can easily construct a super-promise from an array of promises, what will resolve only when each of its members resolve. That's we do in the code below, with Promise.all()

    var fs = require("fs");
    var Promise = require("promise");

    // Wrap the io functions with ones that return promises.
    var readdir_promise = Promise.convertNodeAsyncFunction(fs.readdir);
    var readFile_promise = Promise.convertNodeAsyncFunction( fs.readFile );

    p = readdir_promise( "." );
    p.then( function( files ) {

        // Create an array of promises
        var promises = [];

        for ( var i = 0; i < files.length; i++ ) {
            promises.push( readFile_promise( files[i] ) );
        }

        Promise.all( promises ).then( function(results) {
            var totalBytes = 0;
            for ( i = 0; i < results.length; i++ ) {
                totalBytes += results[i].length;
            }
            console.log("Done reading files. totalBytes = " + totalBytes);
        }, function( error ) {
            console.log("Error reading files");
        });

    }, function( error ) {
        console.log( "readdir failed.");

    });

Reading the files sequentially using Promises

By default, promises don't support sequential operations very well. But we can build on them using a PromiseSequence, which adds the ability to define a series of steps and loops, which are performed sequentially.

Here is the program again, reading a file. Instead of indenting the code by many levels, we are able to write it in more of a sequential style. Also, the above examples had two places where errors had to be handled. With a promise sequence, the errors for any operation in the sequence are handled in one place.

We first add the readdir() operation to the sequence, and then add a loop() to the sequence. The loop executes repeatedly until exitLoop() is called. Since there are no further step, the argument to exitLoop resolves the promise and the program ends.

    // Read all the files in the folder in a sequence, using Promises
    var fs = require("fs");
    var Promise = require("promise");
    var PromiseSequence = require("./PromiseSequence").PromiseSequence;

    // Wrap the io functions with ones that return promises.
    var readdir_promise = Promise.convertNodeAsyncFunction(fs.readdir);
    var readFile_promise = Promise.convertNodeAsyncFunction( fs.readFile );

    var seq = new PromiseSequence();
    var index = 0;
    var totalBytes = 0;
    var files = null;

    seq.add( function() {
        return readdir_promise( "." );
    });

    seq.loop( 
        // The "next" function of the loop takes the result of the readdir and
        // reads the file. It is executed when the loop is entered, and again after
        // each time the body is executed.
        function( files_arg ) {
            files = files_arg;
            if ( index == files.length ) {
                seq.exitLoop(totalBytes);
                return;
            } else {
                console.log("Reading file " + files[index]);
                return readFile_promise( files[index++] );
            }
        },

        // The "body" function of the loop is called with the result of the "next" function.
        // It simply sums the length of the file.
        function( contents ) {
            totalBytes += contents.length;
        }
    );

    seq.run().then( function(total) {
        console.log("Done reading file. Total bytes: " + total);
    }, function(error) {
        console.log("Error reading files: ", error);
    });

More...

Type-checked CoffeeScript with jzbuild
Posted on: 2011-09-08 14:42:02
Coffeescript is a faster way to write Javascript. You write a program in its terse syntax, and it gets expanded into javascript. Here's a coffeescript file, Animal.coffee
class Animal
    constructor: (@sound) ->

    makeSound: -> 
        alert @sound

class Cow extends Animal
    constructor: -> 
        super "moo"

    ###* @param {number} numHours ###
    graze: (numHours) ->


animal = new Cow()
animal.makeSound()

animal.graze "16"
animal.eatGrass()
There are some errors in this file that wouldn't normally be detected until you run it.

More...

Creating binary RPM/DEB packages the easy way
Posted on: 2011-08-13 10:09:41
Linux users love to have .deb or .rpm packages. They are easy to use, but they are not easy to create. If you want to create a DEB the traditional way, you would first have to read the Debian Policy Manual. If you don't, newer versions of Debian will warn users against installing your "low quality" software. Creating an RPM file, on the other hand, requires learning about a new macro language. All the documentation assumes you are going to create a "source" package, from which the binary package is built.

More...

Zero load time file formats
Posted on: 2011-06-29 17:27:08
Sometimes you cannot afford to load data files from disk. Maybe you need results immediately, or the data is simply too large to fit into memory. A technique that I like to use is an on-disk data structure. Here is a toy example for instantly accessing lists of related words.

More...

Finding the top K items in a list efficiently
Posted on: 2011-06-04 22:56:39

Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing twitter posts, or searching images, or solving any of the hundreds of problems in vogue would be wasting time without the fastest possible hardware. But they would sitting around forever if they weren't using the right tools.

That's why I get sad when I see code like this:

# find the top 10 results
results = sorted(results, reverse=True)[:10]

Anything involving a sort will usually take O(nlogn) time, which, when dealing with lots of items, will keep you waiting around for several seconds or even minutes. An O(nlogn) algorithm, for large N, simply cannot be run in realtime when users are waiting.

The Heap

Finding the top K items can be done in O(nlogk) time, which is much, much faster than O(nlogn), using a heap (wikipedia). Or, since I usually end up rewriting everything in C++ eventually, a priority queue.

The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort. It is like a lazy family member who always does the absolute minimum amount of work. It only does enough of the sort to find the smallest element, and that is why it is fast.

Here's some code to demonstrate the difference between a linear search, and a heap search to find the top K elements in a large array. The heap search is 4 times faster, despite the test being biased in favour of the linear search. The linear search ends up executing in compiled C inside python itself, while the heap search is completely in interpreted python. If they were both in C, the difference in performance would be more pronounced.

#!/usr/bin/python
import heapq
import random
import time

def createArray():
    array = range( 10 * 1000 * 1000 )
    random.shuffle( array )
    return array

def linearSearch( bigArray, k ):
    return sorted(bigArray, reverse=True)[:k]

def heapSearch( bigArray, k ):
    heap = []
    # Note: below is for illustration. It can be replaced by 
    # heapq.nlargest( bigArray, k )
    for item in bigArray:
        # If we have not yet found k items, or the current item is larger than
        # the smallest item on the heap,
        if len(heap) < k or item > heap[0]:
            # If the heap is full, remove the smallest element on the heap.
            if len(heap) == k: heapq.heappop( heap )
            # add the current element as the new smallest.
            heapq.heappush( heap, item )
    return heap

start = time.time()
bigArray = createArray()
print "Creating array took %g s" % (time.time() - start)

start = time.time()
print linearSearch( bigArray, 10 )    
print "Linear search took %g s" % (time.time() - start)

start = time.time()
print heapSearch( bigArray, 10 )    
print "Heap search took %g s" % (time.time() - start)
Creating array took 7.15145 s
[9999999, 9999998, 9999997, 9999996, 9999995, 9999994, 9999993, 9999992, 9999991, 9999990]
Linear search took 10.9981 s
[9999990, 9999992, 9999991, 9999994, 9999993, 9999998, 9999997, 9999996, 9999999, 9999995]
Heap search took 2.66371 s

Also, if you see stuff like this, you should go directly to the wikipedia page on the Selection Algorithm

# find the median
median = sorted(results)[len(results)/2]

More...

An instant rhyming dictionary for any web site
Posted on: 2011-06-04 21:30:00
Many good web applications, and many bad ones, have an API, and my hobby project, RhymeBrain.com, is no exception. The trouble is: the target users of both the web site and the API don't know the difference between Javascript and Java. They don't even know what "A.P.I." stands for. The most they can do is edit some HTML and paste in some code, so that's what my API has to be.

More...

More Posts

Supporter
AllStream Managed IT Services

Email
steve.hanov@gmail.com

Other posts by Steve

[comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast ADATA SH93 Teardown Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Creating binary RPM/DEB packages the easy way Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me Google Buzz gets less awful every day The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript (part 1) Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git A complete blogging system in 1900 lines of php Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets A Fast Calorie Calculator for Windows UMA and free long distance UMA's dirty secrets Creating a Todo list in Ajax Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Comment Spam Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets