1"""Heap queue algorithm (a.k.a. priority queue). 2 3Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for 4all k, counting elements from 0. For the sake of comparison, 5non-existing elements are considered to be infinite. The interesting 6property of a heap is that a[0] is always its smallest element. 7 8Usage: 9 10heap = [] # creates an empty heap 11heappush(heap, item) # pushes a new item on the heap 12item = heappop(heap) # pops the smallest item from the heap 13item = heap[0] # smallest item on the heap without popping it 14heapify(x) # transforms list into a heap, in-place, in linear time 15item = heappushpop(heap, item) # pushes a new item and then returns 16 # the smallest item; the heap size is unchanged 17item = heapreplace(heap, item) # pops and returns smallest item, and adds 18 # new item; the heap size is unchanged 19 20Our API differs from textbook heap algorithms as follows: 21 22- We use 0-based indexing. This makes the relationship between the 23 index for a node and the indexes for its children slightly less 24 obvious, but is more suitable since Python uses 0-based indexing. 25 26- Our heappop() method returns the smallest item, not the largest. 27 28These two make it possible to view the heap as a regular Python list 29without surprises: heap[0] is the smallest item, and heap.sort() 30maintains the heap invariant! 31""" 32 33# Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger 34 35__about__ = """Heap queues 36 37[explanation by François Pinard] 38 39Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for 40all k, counting elements from 0. For the sake of comparison, 41non-existing elements are considered to be infinite. The interesting 42property of a heap is that a[0] is always its smallest element. 43 44The strange invariant above is meant to be an efficient memory 45representation for a tournament. The numbers below are `k', not a[k]: 46 47 0 48 49 1 2 50 51 3 4 5 6 52 53 7 8 9 10 11 12 13 14 54 55 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 56 57 58In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'. In 59a usual binary tournament we see in sports, each cell is the winner 60over the two cells it tops, and we can trace the winner down the tree 61to see all opponents s/he had. However, in many computer applications 62of such tournaments, we do not need to trace the history of a winner. 63To be more memory efficient, when a winner is promoted, we try to 64replace it by something else at a lower level, and the rule becomes 65that a cell and the two cells it tops contain three different items, 66but the top cell "wins" over the two topped cells. 67 68If this heap invariant is protected at all time, index 0 is clearly 69the overall winner. The simplest algorithmic way to remove it and 70find the "next" winner is to move some loser (let's say cell 30 in the 71diagram above) into the 0 position, and then percolate this new 0 down 72the tree, exchanging values, until the invariant is re-established. 73This is clearly logarithmic on the total number of items in the tree. 74By iterating over all items, you get an O(n ln n) sort. 75 76A nice feature of this sort is that you can efficiently insert new 77items while the sort is going on, provided that the inserted items are 78not "better" than the last 0'th element you extracted. This is 79especially useful in simulation contexts, where the tree holds all 80incoming events, and the "win" condition means the smallest scheduled 81time. When an event schedule other events for execution, they are 82scheduled into the future, so they can easily go into the heap. So, a 83heap is a good structure for implementing schedulers (this is what I 84used for my MIDI sequencer :-). 85 86Various structures for implementing schedulers have been extensively 87studied, and heaps are good for this, as they are reasonably speedy, 88the speed is almost constant, and the worst case is not much different 89than the average case. However, there are other representations which 90are more efficient overall, yet the worst cases might be terrible. 91 92Heaps are also very useful in big disk sorts. You most probably all 93know that a big sort implies producing "runs" (which are pre-sorted 94sequences, which size is usually related to the amount of CPU memory), 95followed by a merging passes for these runs, which merging is often 96very cleverly organised[1]. It is very important that the initial 97sort produces the longest runs possible. Tournaments are a good way 98to that. If, using all the memory available to hold a tournament, you 99replace and percolate items that happen to fit the current run, you'll 100produce runs which are twice the size of the memory for random input, 101and much better for input fuzzily ordered. 102 103Moreover, if you output the 0'th item on disk and get an input which 104may not fit in the current tournament (because the value "wins" over 105the last output value), it cannot fit in the heap, so the size of the 106heap decreases. The freed memory could be cleverly reused immediately 107for progressively building a second heap, which grows at exactly the 108same rate the first heap is melting. When the first heap completely 109vanishes, you switch heaps and start a new run. Clever and quite 110effective! 111 112In a word, heaps are useful memory structures to know. I use them in 113a few applications, and I think it is good to keep a `heap' module 114around. :-) 115 116-------------------- 117[1] The disk balancing algorithms which are current, nowadays, are 118more annoying than clever, and this is a consequence of the seeking 119capabilities of the disks. On devices which cannot seek, like big 120tape drives, the story was quite different, and one had to be very 121clever to ensure (far in advance) that each tape movement will be the 122most effective possible (that is, will best participate at 123"progressing" the merge). Some tapes were even able to read 124backwards, and this was also used to avoid the rewinding time. 125Believe me, real good tape sorts were quite spectacular to watch! 126From all times, sorting has always been a Great Art! :-) 127""" 128 129__all__ = ['heappush', 'heappop', 'heapify', 'heapreplace', 'merge', 130 'nlargest', 'nsmallest', 'heappushpop'] 131 132def heappush(heap, item): 133 """Push item onto heap, maintaining the heap invariant.""" 134 heap.append(item) 135 _siftdown(heap, 0, len(heap)-1) 136 137def heappop(heap): 138 """Pop the smallest item off the heap, maintaining the heap invariant.""" 139 lastelt = heap.pop() # raises appropriate IndexError if heap is empty 140 if heap: 141 returnitem = heap[0] 142 heap[0] = lastelt 143 _siftup(heap, 0) 144 return returnitem 145 return lastelt 146 147def heapreplace(heap, item): 148 """Pop and return the current smallest value, and add the new item. 149 150 This is more efficient than heappop() followed by heappush(), and can be 151 more appropriate when using a fixed-size heap. Note that the value 152 returned may be larger than item! That constrains reasonable uses of 153 this routine unless written as part of a conditional replacement: 154 155 if item > heap[0]: 156 item = heapreplace(heap, item) 157 """ 158 returnitem = heap[0] # raises appropriate IndexError if heap is empty 159 heap[0] = item 160 _siftup(heap, 0) 161 return returnitem 162 163def heappushpop(heap, item): 164 """Fast version of a heappush followed by a heappop.""" 165 if heap and heap[0] < item: 166 item, heap[0] = heap[0], item 167 _siftup(heap, 0) 168 return item 169 170def heapify(x): 171 """Transform list into a heap, in-place, in O(len(x)) time.""" 172 n = len(x) 173 # Transform bottom-up. The largest index there's any point to looking at 174 # is the largest with a child index in-range, so must have 2*i + 1 < n, 175 # or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so 176 # j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is 177 # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1. 178 for i in reversed(range(n//2)): 179 _siftup(x, i) 180 181def _heappop_max(heap): 182 """Maxheap version of a heappop.""" 183 lastelt = heap.pop() # raises appropriate IndexError if heap is empty 184 if heap: 185 returnitem = heap[0] 186 heap[0] = lastelt 187 _siftup_max(heap, 0) 188 return returnitem 189 return lastelt 190 191def _heapreplace_max(heap, item): 192 """Maxheap version of a heappop followed by a heappush.""" 193 returnitem = heap[0] # raises appropriate IndexError if heap is empty 194 heap[0] = item 195 _siftup_max(heap, 0) 196 return returnitem 197 198def _heapify_max(x): 199 """Transform list into a maxheap, in-place, in O(len(x)) time.""" 200 n = len(x) 201 for i in reversed(range(n//2)): 202 _siftup_max(x, i) 203 204# 'heap' is a heap at all indices >= startpos, except possibly for pos. pos 205# is the index of a leaf with a possibly out-of-order value. Restore the 206# heap invariant. 207def _siftdown(heap, startpos, pos): 208 newitem = heap[pos] 209 # Follow the path to the root, moving parents down until finding a place 210 # newitem fits. 211 while pos > startpos: 212 parentpos = (pos - 1) >> 1 213 parent = heap[parentpos] 214 if newitem < parent: 215 heap[pos] = parent 216 pos = parentpos 217 continue 218 break 219 heap[pos] = newitem 220 221# The child indices of heap index pos are already heaps, and we want to make 222# a heap at index pos too. We do this by bubbling the smaller child of 223# pos up (and so on with that child's children, etc) until hitting a leaf, 224# then using _siftdown to move the oddball originally at index pos into place. 225# 226# We *could* break out of the loop as soon as we find a pos where newitem <= 227# both its children, but turns out that's not a good idea, and despite that 228# many books write the algorithm that way. During a heap pop, the last array 229# element is sifted in, and that tends to be large, so that comparing it 230# against values starting from the root usually doesn't pay (= usually doesn't 231# get us out of the loop early). See Knuth, Volume 3, where this is 232# explained and quantified in an exercise. 233# 234# Cutting the # of comparisons is important, since these routines have no 235# way to extract "the priority" from an array element, so that intelligence 236# is likely to be hiding in custom comparison methods, or in array elements 237# storing (priority, record) tuples. Comparisons are thus potentially 238# expensive. 239# 240# On random arrays of length 1000, making this change cut the number of 241# comparisons made by heapify() a little, and those made by exhaustive 242# heappop() a lot, in accord with theory. Here are typical results from 3 243# runs (3 just to demonstrate how small the variance is): 244# 245# Compares needed by heapify Compares needed by 1000 heappops 246# -------------------------- -------------------------------- 247# 1837 cut to 1663 14996 cut to 8680 248# 1855 cut to 1659 14966 cut to 8678 249# 1847 cut to 1660 15024 cut to 8703 250# 251# Building the heap by using heappush() 1000 times instead required 252# 2198, 2148, and 2219 compares: heapify() is more efficient, when 253# you can use it. 254# 255# The total compares needed by list.sort() on the same lists were 8627, 256# 8627, and 8632 (this should be compared to the sum of heapify() and 257# heappop() compares): list.sort() is (unsurprisingly!) more efficient 258# for sorting. 259 260def _siftup(heap, pos): 261 endpos = len(heap) 262 startpos = pos 263 newitem = heap[pos] 264 # Bubble up the smaller child until hitting a leaf. 265 childpos = 2*pos + 1 # leftmost child position 266 while childpos < endpos: 267 # Set childpos to index of smaller child. 268 rightpos = childpos + 1 269 if rightpos < endpos and not heap[childpos] < heap[rightpos]: 270 childpos = rightpos 271 # Move the smaller child up. 272 heap[pos] = heap[childpos] 273 pos = childpos 274 childpos = 2*pos + 1 275 # The leaf at pos is empty now. Put newitem there, and bubble it up 276 # to its final resting place (by sifting its parents down). 277 heap[pos] = newitem 278 _siftdown(heap, startpos, pos) 279 280def _siftdown_max(heap, startpos, pos): 281 'Maxheap variant of _siftdown' 282 newitem = heap[pos] 283 # Follow the path to the root, moving parents down until finding a place 284 # newitem fits. 285 while pos > startpos: 286 parentpos = (pos - 1) >> 1 287 parent = heap[parentpos] 288 if parent < newitem: 289 heap[pos] = parent 290 pos = parentpos 291 continue 292 break 293 heap[pos] = newitem 294 295def _siftup_max(heap, pos): 296 'Maxheap variant of _siftup' 297 endpos = len(heap) 298 startpos = pos 299 newitem = heap[pos] 300 # Bubble up the larger child until hitting a leaf. 301 childpos = 2*pos + 1 # leftmost child position 302 while childpos < endpos: 303 # Set childpos to index of larger child. 304 rightpos = childpos + 1 305 if rightpos < endpos and not heap[rightpos] < heap[childpos]: 306 childpos = rightpos 307 # Move the larger child up. 308 heap[pos] = heap[childpos] 309 pos = childpos 310 childpos = 2*pos + 1 311 # The leaf at pos is empty now. Put newitem there, and bubble it up 312 # to its final resting place (by sifting its parents down). 313 heap[pos] = newitem 314 _siftdown_max(heap, startpos, pos) 315 316def merge(*iterables, key=None, reverse=False): 317 '''Merge multiple sorted inputs into a single sorted output. 318 319 Similar to sorted(itertools.chain(*iterables)) but returns a generator, 320 does not pull the data into memory all at once, and assumes that each of 321 the input streams is already sorted (smallest to largest). 322 323 >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25])) 324 [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25] 325 326 If *key* is not None, applies a key function to each element to determine 327 its sort order. 328 329 >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len)) 330 ['dog', 'cat', 'fish', 'horse', 'kangaroo'] 331 332 ''' 333 334 h = [] 335 h_append = h.append 336 337 if reverse: 338 _heapify = _heapify_max 339 _heappop = _heappop_max 340 _heapreplace = _heapreplace_max 341 direction = -1 342 else: 343 _heapify = heapify 344 _heappop = heappop 345 _heapreplace = heapreplace 346 direction = 1 347 348 if key is None: 349 for order, it in enumerate(map(iter, iterables)): 350 try: 351 next = it.__next__ 352 h_append([next(), order * direction, next]) 353 except StopIteration: 354 pass 355 _heapify(h) 356 while len(h) > 1: 357 try: 358 while True: 359 value, order, next = s = h[0] 360 yield value 361 s[0] = next() # raises StopIteration when exhausted 362 _heapreplace(h, s) # restore heap condition 363 except StopIteration: 364 _heappop(h) # remove empty iterator 365 if h: 366 # fast case when only a single iterator remains 367 value, order, next = h[0] 368 yield value 369 yield from next.__self__ 370 return 371 372 for order, it in enumerate(map(iter, iterables)): 373 try: 374 next = it.__next__ 375 value = next() 376 h_append([key(value), order * direction, value, next]) 377 except StopIteration: 378 pass 379 _heapify(h) 380 while len(h) > 1: 381 try: 382 while True: 383 key_value, order, value, next = s = h[0] 384 yield value 385 value = next() 386 s[0] = key(value) 387 s[2] = value 388 _heapreplace(h, s) 389 except StopIteration: 390 _heappop(h) 391 if h: 392 key_value, order, value, next = h[0] 393 yield value 394 yield from next.__self__ 395 396 397# Algorithm notes for nlargest() and nsmallest() 398# ============================================== 399# 400# Make a single pass over the data while keeping the k most extreme values 401# in a heap. Memory consumption is limited to keeping k values in a list. 402# 403# Measured performance for random inputs: 404# 405# number of comparisons 406# n inputs k-extreme values (average of 5 trials) % more than min() 407# ------------- ---------------- --------------------- ----------------- 408# 1,000 100 3,317 231.7% 409# 10,000 100 14,046 40.5% 410# 100,000 100 105,749 5.7% 411# 1,000,000 100 1,007,751 0.8% 412# 10,000,000 100 10,009,401 0.1% 413# 414# Theoretical number of comparisons for k smallest of n random inputs: 415# 416# Step Comparisons Action 417# ---- -------------------------- --------------------------- 418# 1 1.66 * k heapify the first k-inputs 419# 2 n - k compare remaining elements to top of heap 420# 3 k * (1 + lg2(k)) * ln(n/k) replace the topmost value on the heap 421# 4 k * lg2(k) - (k/2) final sort of the k most extreme values 422# 423# Combining and simplifying for a rough estimate gives: 424# 425# comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k)) 426# 427# Computing the number of comparisons for step 3: 428# ----------------------------------------------- 429# * For the i-th new value from the iterable, the probability of being in the 430# k most extreme values is k/i. For example, the probability of the 101st 431# value seen being in the 100 most extreme values is 100/101. 432# * If the value is a new extreme value, the cost of inserting it into the 433# heap is 1 + log(k, 2). 434# * The probability times the cost gives: 435# (k/i) * (1 + log(k, 2)) 436# * Summing across the remaining n-k elements gives: 437# sum((k/i) * (1 + log(k, 2)) for i in range(k+1, n+1)) 438# * This reduces to: 439# (H(n) - H(k)) * k * (1 + log(k, 2)) 440# * Where H(n) is the n-th harmonic number estimated by: 441# gamma = 0.5772156649 442# H(n) = log(n, e) + gamma + 1 / (2 * n) 443# http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Rate_of_divergence 444# * Substituting the H(n) formula: 445# comparisons = k * (1 + log(k, 2)) * (log(n/k, e) + (1/n - 1/k) / 2) 446# 447# Worst-case for step 3: 448# ---------------------- 449# In the worst case, the input data is reversed sorted so that every new element 450# must be inserted in the heap: 451# 452# comparisons = 1.66 * k + log(k, 2) * (n - k) 453# 454# Alternative Algorithms 455# ---------------------- 456# Other algorithms were not used because they: 457# 1) Took much more auxiliary memory, 458# 2) Made multiple passes over the data. 459# 3) Made more comparisons in common cases (small k, large n, semi-random input). 460# See the more detailed comparison of approach at: 461# http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest 462 463def nsmallest(n, iterable, key=None): 464 """Find the n smallest elements in a dataset. 465 466 Equivalent to: sorted(iterable, key=key)[:n] 467 """ 468 469 # Short-cut for n==1 is to use min() 470 if n == 1: 471 it = iter(iterable) 472 sentinel = object() 473 result = min(it, default=sentinel, key=key) 474 return [] if result is sentinel else [result] 475 476 # When n>=size, it's faster to use sorted() 477 try: 478 size = len(iterable) 479 except (TypeError, AttributeError): 480 pass 481 else: 482 if n >= size: 483 return sorted(iterable, key=key)[:n] 484 485 # When key is none, use simpler decoration 486 if key is None: 487 it = iter(iterable) 488 # put the range(n) first so that zip() doesn't 489 # consume one too many elements from the iterator 490 result = [(elem, i) for i, elem in zip(range(n), it)] 491 if not result: 492 return result 493 _heapify_max(result) 494 top = result[0][0] 495 order = n 496 _heapreplace = _heapreplace_max 497 for elem in it: 498 if elem < top: 499 _heapreplace(result, (elem, order)) 500 top, _order = result[0] 501 order += 1 502 result.sort() 503 return [elem for (elem, order) in result] 504 505 # General case, slowest method 506 it = iter(iterable) 507 result = [(key(elem), i, elem) for i, elem in zip(range(n), it)] 508 if not result: 509 return result 510 _heapify_max(result) 511 top = result[0][0] 512 order = n 513 _heapreplace = _heapreplace_max 514 for elem in it: 515 k = key(elem) 516 if k < top: 517 _heapreplace(result, (k, order, elem)) 518 top, _order, _elem = result[0] 519 order += 1 520 result.sort() 521 return [elem for (k, order, elem) in result] 522 523def nlargest(n, iterable, key=None): 524 """Find the n largest elements in a dataset. 525 526 Equivalent to: sorted(iterable, key=key, reverse=True)[:n] 527 """ 528 529 # Short-cut for n==1 is to use max() 530 if n == 1: 531 it = iter(iterable) 532 sentinel = object() 533 result = max(it, default=sentinel, key=key) 534 return [] if result is sentinel else [result] 535 536 # When n>=size, it's faster to use sorted() 537 try: 538 size = len(iterable) 539 except (TypeError, AttributeError): 540 pass 541 else: 542 if n >= size: 543 return sorted(iterable, key=key, reverse=True)[:n] 544 545 # When key is none, use simpler decoration 546 if key is None: 547 it = iter(iterable) 548 result = [(elem, i) for i, elem in zip(range(0, -n, -1), it)] 549 if not result: 550 return result 551 heapify(result) 552 top = result[0][0] 553 order = -n 554 _heapreplace = heapreplace 555 for elem in it: 556 if top < elem: 557 _heapreplace(result, (elem, order)) 558 top, _order = result[0] 559 order -= 1 560 result.sort(reverse=True) 561 return [elem for (elem, order) in result] 562 563 # General case, slowest method 564 it = iter(iterable) 565 result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)] 566 if not result: 567 return result 568 heapify(result) 569 top = result[0][0] 570 order = -n 571 _heapreplace = heapreplace 572 for elem in it: 573 k = key(elem) 574 if top < k: 575 _heapreplace(result, (k, order, elem)) 576 top, _order, _elem = result[0] 577 order -= 1 578 result.sort(reverse=True) 579 return [elem for (k, order, elem) in result] 580 581# If available, use C implementation 582try: 583 from _heapq import * 584except ImportError: 585 pass 586try: 587 from _heapq import _heapreplace_max 588except ImportError: 589 pass 590try: 591 from _heapq import _heapify_max 592except ImportError: 593 pass 594try: 595 from _heapq import _heappop_max 596except ImportError: 597 pass 598 599 600if __name__ == "__main__": 601 602 import doctest # pragma: no cover 603 print(doctest.testmod()) # pragma: no cover 604