For a number of years Python has provided developers with the special parameters 'cmp' and 'key' on list.sort and __builtin__.sorted. However Python does not provide a built-in mechanism for doing binary searches on such sorted lists. This recipe provides a simple function that allows you to perform binary searches on your sorted sequences.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | #! /usr/bin/env python
######################################################################
# Written by Kevin L. Sitze on 2011-02-04
# This code may be used pursuant to the MIT License.
######################################################################
import __builtin__
__all__ = (
'lower_bound',
'upper_bound'
)
def lower_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""lower_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
index of the first match if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val < 0:
lo = mid + 1
else:
hi = mid
if val is None: return -1
elif val == 0: return lo
elif lo >= len(haystack): return -1 - lo
elif cmp(key(haystack[lo]), needle) == 0: return lo
else: return -1 - lo
def upper_bound(haystack, needle, lo = 0, hi = None, cmp = None, key = None):
"""upper_bound(haystack, needle[, lo = 0[, hi = None[, cmp = None[, key = None]]]]) => n
Find \var{needle} via a binary search on \var{haystack}. Returns the
non-negative index \var{N} of the element that immediately follows the
last match of \var{needle} if \var{needle} is found, else a negative
value \var{N} is returned indicating the index where \var{needle}
belongs with the formula "-\var{N}-1".
\var{haystack} - the ordered, indexable sequence to search.
\var{needle} - the value to locate in \var{haystack}.
\var{lo} and \var{hi} - the range in \var{haystack} to search.
\var{cmp} - the cmp function used to order the \var{haystack} items.
\var{key} - the key function used to extract keys from the elements.
"""
if cmp is None: cmp = __builtin__.cmp
if key is None: key = lambda x: x
if lo < 0: raise ValueError( 'lo cannot be negative' )
if hi is None: hi = len(haystack)
val = None
while lo < hi:
mid = (lo + hi) >> 1
val = cmp(key(haystack[mid]), needle)
if val > 0:
hi = mid
else:
lo = mid + 1
if val is None: return -1
elif val == 0: return lo
elif lo > 0 and cmp(key(haystack[lo - 1]), needle) == 0: return lo
else: return -1 - lo
if __name__ == '__main__':
from assertions import *
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_left = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 0, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 0, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 0, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 0, [0, 1], 0), ( 1, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 0, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 0, a, 0), (-2, a, 1), ( 1, a, 2), (-3, a, 3), ( 2, a, 4),
(-4, a, 5), ( 3, a, 6), (-5, a, 7), ( 4, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 0, b, 0), (-2, b, 1), ( 1, b, 2), (-4, b, 3), ( 3, b, 4),
(-7, b, 5), ( 6, b, 6), (-11, b, 7), (10, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_left:
assertEquals(expect, lower_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
a = list(a)
for i in xrange(11):
n = lower_bound(a, i - 1)
if n < 0:
a[-n-1:-n-1] = [i-1]
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
b = list(b)
for i in xrange(11):
n = lower_bound(b, i - 1)
if n < 0:
b.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 8, 8, 8, 9], b)
a = [0, 2, 4, 6, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 8]
test_right = (
(-1, [] , -2), (-1, [] , -1), (-1, [] , 0), (-1, [] , 1), (-1, [] , 2),
(-1, [0], -2), (-1, [0], -1), ( 1, [0], 0), (-2, [0], 1), (-2, [0], 2),
(-1, [1], -2), (-1, [1], -1), (-1, [1], 0), ( 1, [1], 1), (-2, [1], 2),
(-1, [0, 0], -2), (-1, [0, 0], -1), ( 2, [0, 0], 0), (-3, [0, 0], 1), (-3, [0, 0], 2),
(-1, [0, 1], -2), (-1, [0, 1], -1), ( 1, [0, 1], 0), ( 2, [0, 1], 1), (-3, [0, 1], 2),
(-1, [1, 1], -2), (-1, [1, 1], -1), (-1, [1, 1], 0), ( 2, [1, 1], 1), (-3, [1, 1], 2),
(-1, a, -1),
( 1, a, 0), (-2, a, 1), ( 2, a, 2), (-3, a, 3), ( 3, a, 4),
(-4, a, 5), ( 4, a, 6), (-5, a, 7), ( 5, a, 8), (-6, a, 9),
(-6, a, 10),
(-1, b, -1),
( 1, b, 0), (-2, b, 1), ( 3, b, 2), (-4, b, 3), ( 6, b, 4),
(-7, b, 5), (10, b, 6), (-11, b, 7), (15, b, 8), (-16, b, 9),
(-16, b, 10)
)
for expect, haystack, needle in test_right:
assertEquals(expect, upper_bound(haystack, needle), 'haystack: %r - needle: %r' % (haystack, needle))
for i in xrange(11):
n = upper_bound(a, i - 1)
if n < 0:
a.insert(-n-1, i-1)
assertEquals([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], a)
import random
a = list(xrange(100))
b = list(a)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i)
assertTrue(n < 0)
c[-n-1:-n-1] = [i]
assertEquals(a, c)
def rcmp(x, y):
return cmp(y, x)
a.sort(cmp = rcmp)
b.sort(cmp = rcmp)
for j in xrange(10):
c = list()
random.shuffle(b)
for i in b:
n = lower_bound(c, i, cmp = rcmp)
assertTrue(n < 0)
c.insert(-n-1, i)
assertEquals(a, c)
|
The Why
Using the bisect package can be very trying at times.
This package works great so long as the sequence elements are also the key values.
What happens with object elements with keys that are object properties,
or elements that are subsequences with the key stored in a location other
than the first element of the subsequence?
Thankfully, sorting such a sequence is trivial thanks to the key argument on list.sort.
But bisect is useless for subsequences and often objects too.
Other bisect problems
bisectis useless on objects that have no natural sort order. Unless the object class provides a natural ordering applicable to the problem domainbisectis unable to determine how the objects should be compared.bisectis useless in cases even where objects have a natural sort order but you are searching using only the key value itself.bisectdoes not distinguish between "found" and "not-found" scenarios when searching. You still have to check the sequence after usingbisect_leftorbisect_right, because the return value does not tell you whether the desired item is actually in the sequence.bisecthas a similar problem if you use theinsort_leftorinsort_rightfunctions. The item you are insorting will always be inserted, even if a duplicate exists.
Thus, if you're using bisect to insert an element into a list, but only if it does not exist there already, then you have to write code that looks something like this:
def insort_left_unique(a, x, lo, hi):
pos = bisect.bisect_left(a, x, lo, hi)
# because __eq__ might not be available...
if pos == len(a) or (not a[pos] < x and not x < a[pos]):
a.insert(pos, x)
Doing some poking around on the net I determined that no one else out there has a decent Pythonic binary search function. So, of course the correct answer is to write my own.
The How
The functions in this package perform the equivalent function to bisect_left and bisect_right with the exception of differentiating between success and failure on finding the desired item.
It finds needle via a binary search on haystack.
The functions have guaranteed O(log(N)) performance on a sequence having N elements.
If needle is found in haystack the index of the first element (the element whose index is closest to zero) that matches needle is returned.
If the needle is NOT found in haystack then a negative value n shall be returned
such that -n-1 is the location in haystack that needle should have been found.
This means that needle can be sliced into the correct location in haystack using the idiom haystack[-n-1:-n-1] = [needle] or alternatively using haystack.insert(-n-1, needle).
Of course these assignment operations will work only if haystack is mutable.
haystackis a sequence with the following properties:
- The sequence must be indexable.
- The sequence must be ordered such that for any two sequence elements
AandBwhere the index ofAis less than the index ofBthen the boolean expressioncmp(key(A), key(B)) <= 0isTrue.
needleis the key of the element to locate.lois the minimum index inhaystackto search. This parameter is optional, if not specified then index0is used.hiis one past the maximum index inhaystackto search. This parameter is optional, if not specified then the length ofhaystackis used (i.e.,len(haystack)).cmpis a function of two arguments returning -1, 0 or 1 if the first argument is less than, equal to or greater than the second argument respectively. The first argument will always be a key taken from thehaystackand the second argument will always beneedle. IfcmpisNonethen the builtin comparison function will be used. This implies that the actual function called for each processed element iscmp(key(x), needle)wherexis an element inhaystack. The functioncmpwill be called a maximum of O(log(n)) times.key- a function that takes one argument. Its return value will be used as the key value for the comparison. If key isNonethen the identity function will be used. The functionkeywill be called a maximum of O(log(n)) times.
Download
Copy to clipboard
You can find the
assertionspackage used for the unit tests for this package in my recipe Poor Man unit tests.If you want a more Pythonic API to sorted lists, have a look at Daniel Stutzbach's amazing blist -- it provides an efficient implementation using B+trees of sorted lists, letting you modify and inspect the object with the usual interface of python containers. For example you can do l.add(x), l.remove(x), test x in l, etc.
A pure Python implementation using a regular list and a bisect implementation would be very easy with the abstract base classes in the collections module, but a much better approach -- with logarithmic insertion and deletion -- would be using the indexable skiplist that Raymond Hettinger used in these two recipes.
If you're looking for an API similar to that provided by a blist.sortedlist, check out the sortedcontainers module. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.