Hate UML?

Draw sequence diagrams in seconds.
http://www.websequencediagrams.com

Finding the top K items in a list efficiently
Posted on: 2011-06-04 22:56:39

Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing twitter posts, or searching images, or solving any of the hundreds of problems in vogue would be wasting time without the fastest possible hardware. But they would sitting around forever if they weren't using the right tools.

That's why I get sad when I see code like this:

# find the top 10 results
results = sorted(results, reverse=True)[:10]

Anything involving a sort will usually take O(nlogn) time, which, when dealing with lots of items, will keep you waiting around for several seconds or even minutes. An O(nlogn) algorithm, for large N, simply cannot be run in realtime when users are waiting.

The Heap

Finding the top K items can be done in O(nlogk) time, which is much, much faster than O(nlogn), using a heap (wikipedia). Or, since I usually end up rewriting everything in C++ eventually, a priority queue.

The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort. It is like a lazy family member who always does the absolute minimum amount of work. It only does enough of the sort to find the smallest element, and that is why it is fast.

Here's some code to demonstrate the difference between a linear search, and a heap search to find the top K elements in a large array. The heap search is 4 times faster, despite the test being biased in favour of the linear search. The linear search ends up executing in compiled C inside python itself, while the heap search is completely in interpreted python. If they were both in C, the difference in performance would be more pronounced.

#!/usr/bin/python
import heapq
import random
import time

def createArray():
    array = range( 10 * 1000 * 1000 )
    random.shuffle( array )
    return array

def linearSearch( bigArray, k ):
    return sorted(bigArray, reverse=True)[:k]

def heapSearch( bigArray, k ):
    heap = []
    # Note: below is for illustration. It can be replaced by 
    # heapq.nlargest( bigArray, k )
    for item in bigArray:
        # If we have not yet found k items, or the current item is larger than
        # the smallest item on the heap,
        if len(heap) < k or item > heap[0]:
            # If the heap is full, remove the smallest element on the heap.
            if len(heap) == k: heapq.heappop( heap )
            # add the current element as the new smallest.
            heapq.heappush( heap, item )
    return heap

start = time.time()
bigArray = createArray()
print "Creating array took %g s" % (time.time() - start)

start = time.time()
print linearSearch( bigArray, 10 )    
print "Linear search took %g s" % (time.time() - start)

start = time.time()
print heapSearch( bigArray, 10 )    
print "Heap search took %g s" % (time.time() - start)
Creating array took 7.15145 s
[9999999, 9999998, 9999997, 9999996, 9999995, 9999994, 9999993, 9999992, 9999991, 9999990]
Linear search took 10.9981 s
[9999990, 9999992, 9999991, 9999994, 9999993, 9999998, 9999997, 9999996, 9999999, 9999995]
Heap search took 2.66371 s

Also, if you see stuff like this, you should go directly to the wikipedia page on the Selection Algorithm

# find the median
median = sorted(results)[len(results)/2]

Want more programming tech talk?
Add to Circles on Google Plus
Subscribe to posts

Post comment

Real Name:
Your Email (Not displayed):

Text only. No HTML. If you write "http:" your message will be ignored.
Choose an edit password if you want to be able to edit or delete your comment later.
Editing Password (Optional):

Dan

2011-06-05 00:51:03
The heap results aren't sorted like the linear results. Just to be complete, I think the function should return sorted(heap). After all, it might take a millisecond to sort 10 numbers. ;-)

As a casual pythoner, I appreciate a post like this, where a cool feature I didn't know about is explored.

I don't like if-statements

2011-06-05 06:36:18
if-statements are expensive!

try this:

def faster_heap_search(bigArray, k):

heap = bigArray[:k]

for item in bigArray:

if item > heap[0]:

heapq.heappop(heap)

heapq.heappush(heap, item)

return sorted(heap)

otherwise, nice post.

but always remember: premature optimization is the root of all evil!

Ok, next time, i'll set an editing password, sorry ;)

2011-06-05 06:38:37
pastebin.com/10eL2d4V

Satoru

2011-06-05 06:47:09
Thanks for introducing the heapq module :)

BTW, I just discovered that this module also includes two functions that are intended for the job as described here - finding the largest/smallest n items in a list.

Then I try the heapq.nlargest function on `bigArray`, and it turns out about 1 second slower than your code.

Mariano Chouza

2011-06-05 17:31:02
You can do better by using the k-th element selection algorithm, getting an O(n + k log k) algorithm if you care about the order of the top items and O(n) if you don't.

en.wikipedia.org/wiki/Selection_algorithm#Application_of_simple_selection_algorithms

Brendan Miller

2011-06-05 17:40:49
In C++, rather than std::priority_queue it's easier to use either:

1. std::nth_element: selection algorithm in O(n)

2. std::partial_sort: partial sort algorithm in O(n * log k)

priority queues are good if you have a stream of incoming values, say from disk or the network, and you don't want to store all n in memory at once.

BlueRaja

2011-06-06 17:25:39
Of course, the first code sample takes less time and thought to write.

Unless the list contains millions of elements or is called 100's of times a second, I would use the first method - we can always change it to use the second later when and if profiling determines this to be a bottleneck.

Iftakharul Islam

2011-06-10 11:03:10
Isn't it O(klogn). k times rebuild the heap (logn)

Krishna

2012-01-20 11:30:51
No because priority queues size is k.

Oleg

2012-07-18 17:45:19
I think it would be better to use heapq.heapreplace instead of heappop followed by heappush.

Joshua Grigonis

2012-08-28 20:40:26
Coincidentally, I just wrote some code that needed to do this, on the same day this popped on reddit/r/programming. When I wrote my algorithm, I didn't even have to step back and go "EWWWW! that smells", I knew I was doing it inefficiently.

Luckily, the longest list I'm dealing with is less than 500 items, and should generally be less than 100, but I may implement this algorithm instead.

Email
steve.hanov@gmail.com

Other posts by Steve

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US How Asana Breaks the Rules About Per-Seat Pricing 5 Ways PowToon Made Me Want to Buy Their Software How I run my business selling software to Americans 0, 1, Many, a Zillion Give your Commodore 64 new life with an SD card reader 20 lines of code that will beat A/B testing every time [comic] Appreciation of xkcd comics vs. technical ability VP trees: A data structure for finding stuff fast Why you should go to the Business of Software Conference Next Year Four ways of handling asynchronous operations in node.js Type-checked CoffeeScript with jzbuild Zero load time file formats Finding the top K items in a list efficiently An instant rhyming dictionary for any web site Succinct Data Structures: Cramming 80,000 words into a Javascript file. Throw away the keys: Easy, Minimal Perfect Hashing Why don't web browsers do this? Fun with Colour Difference Compressing dictionaries with a DAWG Fast and Easy Levenshtein distance using a Trie The Curious Complexity of Being Turned On Cross-domain communication the HTML5 way Five essential steps to prepare for your next programming interview Minimal usable Ubuntu with one command Finding awesome developers in programming interviews Compress your JSON with automatic type extraction JZBUILD - An Easy Javascript Build System Pssst! Want to stream your videos to your iPod? "This is stupid. Your program doesn't work," my wife told me The simple and obvious way to walk through a graph Asking users for steps to reproduce bugs, and other dumb ideas Creating portable binaries on Linux Bending over: How to sell your software to large companies Regular Expression Matching can be Ugly and Slow C++: A language for next generation web apps qb.js: An implementation of QBASIC in Javascript Zwibbler: A simple drawing program using Javascript and Canvas You don't need a project/solution to use the VC++ debugger Boring Date (comic) barcamp (comic) How IE <canvas> tag emulation works I didn't know you could mix and match (comic) Sign here (comic) It's a dirty job... (comic) The PenIsland Problem: Text-to-speech for domain names Pitching to VCs #2 (comic) Building a better rhyming dictionary Does Android team with eccentric geeks? (comic) Comment spam defeated at last Pitching to VCs (comic) How QBASIC almost got me killed Blame the extensions (comic) How to run a linux based home web server Microsoft's generosity knows no end for a year (comic) Using the Acer Aspire One as a web server When programmers design web sites (comic) Finding great ideas for your startup Game Theory, Salary Negotiation, and Programmers Coding tips they don't teach you in school When a reporter mangles your elevator pitch Test Driven Development without Tears Drawing Graphs with Physics Free up disk space in Ubuntu Keeping Abreast of Pornographic Research in Computer Science Exploiting perceptual colour difference for edge detection Experiment: Deleting a post from the Internet Is 2009 the year of Linux malware? Email Etiquette How a programmer reads your resume (comic) How wide should you make your web page? Usability Nightmare: Xfce Settings Manager cairo blur image surface Automatically remove wordiness from your writing Why Perforce is more scalable than Git Optimizing Ubuntu to run from a USB key or SD card UMA Questions Answered Make Windows XP look like Ubuntu, with Spinning Cube Effect See sound without drugs Standby Preventer Stock Picking using Python Spoke.com scam Stackoverflow.com Copy a cairo surface to the windows clipboard Simulating freehand drawing with Cairo Free, Raw Stock Data Installing Ubuntu on the Via Artigo Why are all my lines fuzzy in cairo? A simple command line calculator Tool for Creating UML Sequence Diagrams Exploring sound with Wavelets UMA and free long distance UMA's dirty secrets Installing the Latest Debian on an Ancient Laptop Dissecting Adsense HTML/ Javascript/ CSS Pretty Printer Web Comic Aggregator Experiments in making money online How much cash do celebrities make? Draw waveforms and hear them Cell Phones on Airplanes Detecting C++ memory leaks What does your phone number spell? A Rhyming Engine Rules for Effective C++ Cell Phone Secrets