Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing Twitter posts, or searching images, or solving any of the hundreds of problems in vogue would be wasting time without the fastest possible hardware. But they would be sitting around forever if they weren't using the right tools.
That's why I get sad when I see code like this:
# find the top 10 results
results = sorted(results, reverse=True)[:10]
Anything involving a sort will usually take O(n log n) time, which, when dealing with lots of items, will keep you waiting around for several seconds or even minutes. For large n, that is too slow to run in realtime while users are waiting. You don't need a full sort to find the top k items: a heap gets you there in O(n log k) time.
The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort. It is like a lazy family member who always does the absolute minimum amount of work. It only does enough of the sort to find the smallest element, and that is why it is fast.
Here's some code to demonstrate the difference between a sort-based search (linearSearch below) and a heap search to find the top K elements in a large array. The heap search is about 4 times faster, despite the test being biased in favour of the sort. The sort ends up executing in compiled C inside Python itself, while the loop of the heap search runs in interpreted Python. If they were both in C, the difference in performance would be more pronounced.
#!/usr/bin/python
import heapq
import random
import time

def createArray():
    # list() keeps this working on Python 3, where range() is not a list
    array = list(range( 10 * 1000 * 1000 ))
    random.shuffle( array )
    return array

def linearSearch( bigArray, k ):
    return sorted(bigArray, reverse=True)[:k]

def heapSearch( bigArray, k ):
    heap = []
    # Note: below is for illustration. It can be replaced by
    # heapq.nlargest( k, bigArray )
    for item in bigArray:
        # If we have not yet found k items, or the current item is larger than
        # the smallest item on the heap,
        if len(heap) < k or item > heap[0]:
            # If the heap is full, remove the smallest element on the heap.
            if len(heap) == k: heapq.heappop( heap )
            # add the current element to the heap.
            heapq.heappush( heap, item )
    return heap

start = time.time()
bigArray = createArray()
print("Creating array took %g s" % (time.time() - start))

start = time.time()
print(linearSearch( bigArray, 10 ))
print("Linear search took %g s" % (time.time() - start))

start = time.time()
print(heapSearch( bigArray, 10 ))
print("Heap search took %g s" % (time.time() - start))
Creating array took 7.15145 s
[9999999, 9999998, 9999997, 9999996, 9999995, 9999994, 9999993, 9999992, 9999991, 9999990]
Linear search took 10.9981 s
[9999990, 9999992, 9999991, 9999994, 9999993, 9999998, 9999997, 9999996, 9999999, 9999995]
Heap search took 2.66371 s
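In everyday code you would probably not write the heap loop by hand. Here is a minimal sketch of the standard library shortcut, assuming Python 3 and a made-up list called scores (both are my assumptions, not part of the benchmark above). Note that heapq.nlargest takes the count first and the iterable second, and returns its results sorted largest first, so it matches the sort-and-slice version.

```python
import heapq
import random

# Hypothetical sample data, for illustration only.
random.seed(0)
scores = [random.randrange(1000000) for _ in range(10000)]
k = 10

# Full sort, then slice: O(n log n).
by_sort = sorted(scores, reverse=True)[:k]

# Heap-based selection: count first, iterable second.
by_heap = heapq.nlargest(k, scores)

# nlargest accepts any iterable, so a generator works too and the
# input never has to sit in a list.
by_stream = heapq.nlargest(k, (s for s in scores))
```

All three lists hold the same ten values in the same order.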
Also, if you see stuff like this, you should go directly to the Wikipedia page on the Selection Algorithm:
# find the median
median = sorted(results)[len(results) // 2]
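For the curious, here is a rough sketch of one selection algorithm, quickselect, which finds the element at a given sorted position in expected O(n) time without sorting everything. The function names quickselect and median are my own, and this simple version trades memory for clarity by building new lists at each step. Treat it as an illustration rather than a tuned implementation.

```python
import random

def quickselect(items, index):
    """Return the element that would sit at position `index` if `items` were sorted."""
    if not 0 <= index < len(items):
        raise IndexError("index out of range")
    candidates = list(items)  # work on a copy, leave the input alone
    while True:
        pivot = random.choice(candidates)
        smaller = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        if index < len(smaller):
            # The answer is among the elements below the pivot.
            candidates = smaller
        elif index < len(smaller) + len(equal):
            # The answer is the pivot itself.
            return pivot
        else:
            # The answer is above the pivot; shift the target index.
            index -= len(smaller) + len(equal)
            candidates = [x for x in candidates if x > pivot]

def median(items):
    # Same element as sorted(items)[len(items) // 2], without the full sort.
    return quickselect(items, len(items) // 2)
```

Each pass throws away the part of the list that cannot contain the answer, which is where the savings over a full sort come from.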
As a casual pythoner, I appreciate a post like this, where a cool feature I didn't know about is explored.
try this:
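def faster_heap_search(bigArray, k):
    # start with the first k items, arranged as a heap
    heap = bigArray[:k]
    heapq.heapify(heap)
    # only the remaining items need to be compared
    for item in bigArray[k:]:
        if item > heap[0]:
            heapq.heappop(heap)
            heapq.heappush(heap, item)
    return sorted(heap)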
otherwise, nice post.
but always remember: premature optimization is the root of all evil!
BTW, I just discovered that this module also includes two functions that are intended for the job as described here - finding the largest/smallest n items in a list.
Then I tried the heapq.nlargest function on `bigArray`, and it turned out about 1 second slower than your code.
en.wikipedia.org/wiki/Selection_algorithm#Application_of_simple_selection_algorithms
1. std::nth_element: selection algorithm in O(n)
2. std::partial_sort: partial sort algorithm in O(n * log k)
Priority queues are good if you have a stream of incoming values, say from disk or the network, and you don't want to store all n in memory at once.
Unless the list contains millions of elements or the code is called hundreds of times a second, I would use the first method - we can always change it to use the second later when and if profiling determines this to be a bottleneck.
Luckily, the longest list I'm dealing with is less than 500 items, and should generally be less than 100, but I may implement this algorithm instead.