Intro to Algorithms: Binary Search

Paul Taylor

When an array of data is sorted, there is a much faster way of finding a value than by linear search. Think of the way in which you use a dictionary. In reality, we employ our knowledge of the frequency of initial letters - the London Residential Telephone Directory used to be published in volumes for A-D, E-K, L-Q and S-Z - but we shall not use such knowledge in this algorithm.

The idea is to look half way through, then, depending on whether that was too high or too low, either (1/4) or (3/4) of the way through. At each iteration we have top and bot pointers, and look at the value at the position mid-way between them. You should try programming this for yourself before looking at the code below, which gives the best possible form of this algorithm.

In fact we shall study this algorithm from a completely different point of view. Many textbooks give you complete JAVA code, and students regard this as a recipe. The purpose of our (novel) approach to is gain an appreciation that every symbol in the program means something. We shall see this by changing them, and first by looking at four equally correct versions of the same algorithm. You will learn that ±1 in a program is "a matter of life and death" - that the behaviour is different in crucial ways that you cannot see from a superficial look the code.

When you get a new job as a programmer, you won't be writing virgin code, as in most official and unofficial student exercises, but modifying someone else's code. Usually, you will find that other people's code is unspeakably awful, but you have to live with it. It is difficult to simulate this real-life situation in a teaching environment, but this exercise is a tiny example.

This chapter is a rather crude amalgamation of several kinds of teaching materials, and needs to be re-written before being given to students again. I introduce binary search in the lectures by handing out the eight versions of the code. I divide the class into groups, and ask each group to do a "dry run" of one of the algorithms on certain input data. In the labs, they are asked to verify correctness, in particular finding out whether the leftmost or rightmost occurrence is found, and making various alterations to the code. These things are then covered in a mini-exam. Formal proof of correctness is covered in the lectures. This has also been an exam question, for which the model answer is incorporated here.

1 Four versions of binary search

"There are nine and sixty ways of constructing tribal lays,
And every single one of them is right!"
In the Neolithic Age, Rudyard Kipling

In the algorithms textbooks, there are three different ways of writing the code for binary search. Although the version that we call CX below is the one that a naïve programmer would write, version AL is the best, and BL is a less tidy version of the same thing.

The +1s and -1s in the assignments to top and bot are what distinguish the versions, making one better than another, but you're allowed to change everything else about these versions except for these things. The fourth version completes the symmetry.

Each version of the algorithm solves the simple problem of locating some occurrence of any value that is actually present. It also reports that a value is absent by returning a pointer to a position that the caller can check doesn't contain the required value.

However, version A is more precise than this: it behaves in a specific, predictable and useful way when there are multiple or no occurrences of the value, as we shall see.

Binary.java in your own coursework directory contains a copy of the following JAVA code.

    int AL (int[] array, int seek) {
        int bot = -1;
        int top = size;
        while (top - bot > 1) {
            int mid = (top + bot)/2;
            if     (array[mid] <  seek) bot = mid;
            else /* array[mid] >= seek */ top = mid;
        }
        return top; 
    }
    int BL (int[] array, int seek) {
        int bot = 0;
        int top = size;
        while (top > bot) {
            int mid = (top + bot - 1)/2;
            if     (array[mid] <  seek)   bot = mid + 1;
            else /* array[mid] >= seek */ top = mid;
        }
        return top;
    }
      
    int CL (int[] array, int seek) {
        int bot = 0;
        int top = size - 1;
        while (top >= bot) {
            int mid = (top + bot)/2;
            if      (array[mid] <  seek)    bot = mid + 1;
            else /* (array[mid] >= seek) */ top = mid - 1;
        }
        return bot;
    }
    int DL (int[] array, int seek) {
        int bot = -1;
        int top = size - 1;
        while (top > bot) {
            int mid = (top + bot + 1)/2;
            if     (array[mid] <  seek)   bot = mid;
            else /* array[mid] >= seek */ top = mid - 1;
        }
        return top+1;
    }

Note: /*...*/ is another way of delimiting a comment; whereas // begins a comment that continues to the next newline, a comment that begins /* continues until the next (not necessarily matching) */. As a matter of style, when using this way of delimiting comments that span many lines of the source file, it is usual to start each intervening line with another * and align the *s vertically.

2 Dry runs

The following "dry runs" show what the search executes and returns when it is asked to search for 6 or 13, respectively in the array 1, 3, 4, 6, 6, 6, 8, 8, 8, 10, 12, 14, 14, 14, 15

`seek=6`	`bot`	`mid`	`top`
`bot=-1`	−1
`top=15`			15
`while (15 - -1 > 1)`
`mid=(15+-1)/2`		7
`array[7]=8>=6`
`top=7`			7
`while (7 - -1 > 1)`
`mid=(7+-1)/2`		3
`array[3]=6>=6`
`top=3`			3
`while (3 - -1 > 1)`
`mid=(3+-1)/2`		1
`array[1]=3<6`
`bot=1`	1
`while (3 - 1 > 1)`
`mid=(3 + 1)/2`		2
`array[2]=4<6`
`bot=2`	2
`! (2 - 1 > 1)`
`return top=3`

`seek=13`	`bot`	`mid`	`top`
`bot=-1`	−1
`top=15`			15
`while (15 - -1 > 1)`
`mid=(15+-1)/2`		7
`array[7]=8<13`
`bot=7`	7
`while (15 - 7 > 1)`
`mid=(15 + 7)/2`		11
`array[11]=14>=13`
`top=11`			11
`while (11 - 7 > 1)`
`mid=(11 + 7)/2`		9
`array[9]=10<13`
`bot=9`	9
`while (11 - 9 > 1)`
`mid=(11 + 9)/2`		10
`array[10]=12<13`
`bot=10`	10
`! (11 - 10 > 1)`
`return top=11`

3 Experiments

Use check to verify that AL-DL are correct by doing things like

    check binary-al
    check Binary.java choice=3

of which the first version uses the "model answer" built into the TESTER and the second uses the template file Binary.java in your filespace, in which AL is selected by setting choice=1, BL=2, CL=3 and DL=4. Paste in the correctness reports of the TESTER.

Verify that each version is able to find some occurrence of any value that is actually present, using tests like

    run Binary.java choice=1 array=1,3,4,6,6,6,8,8,8,10,12,14,14,14,15 seek=6

Be careful to check the extreme values, 1 and 15.

When a search is done for a value that occurs more than once, can you characterise which occurrence each of the eight versions finds?

Now consider a search for a value that does not occur in the array. What position does each method return? Don't forget the extreme cases, i.e. values that are smaller (or larger) than all of the values that do occur in the array.

What happens if the array is of zero length?

    run Binary.java choice=1 size=0

Do you get an ArrayIndexOutOfBoundsException? What position is returned? Is this consistent with the specification of the behaviour that you have just formulated, where an absent value was sought in a non-empty array?

Change the initialisation bot=-1; in AL to 0 or -2, and check the behaviour for small values. What goes wrong?

Similarly, change the initialisation top=size to size-1.

Explain how you would change the code to search the segment array[3] to array[13] inclusive, so that it never accesses array[2] or array[14].

Change the return value from bot to top or vice versa in all four versions.

There is a slight difference in the amount of time that they take. Find out the coefficient of nlog(n) and paste it in as a comment in the source file. Which is faster?

Does the timing graph depend on the range of random increments in the array values? (You will need to make range very small or density large to see any difference.)

4 The pre-condition

These algorithms only work when the array satisfies a certain precondition.

What is this condition? Write it in logical notation.

What does the program do if it is not satisfied?

Write the JAVA code to test whether an array satisfies this property.

How long would it take to execute as a function of the size of the array?

When you use a dictionary, do you (as a human) make such a check? Describe carefully would be the effect on the usefulness of the dictionary if a few of the words failed to satisfy this condition.

This verification takes time that depends linearly on the length of the array (and sorting it, i.e. enforcing rather than verifying it, would take time at least O(nlogn)), which is much greater than the time that the search itself takes. To verify or enforce the precondition would therefore entirely defeat the point of using binary search.

In the shift exercise, the TESTER insisted that you should check the precondition before proceeding with the shift, and throw an Exception if it fails.

Discuss the consequences of trying to proceed with shift, or with binary search, in a situation whether the precondition is not satisfied. How, on the other hand, do the costs of checking the precondition compare?

5 Time complexity

On paper, find out how many times you go round the loop if the array is of size 3, 7, 15, 31, 63, ..., or more generally of of size 2^k−1.

How many bits long is the number n, written in binary, if 2^k−1 ≤ n ≤ 2^k−1?

You may have learned about logarithms to base 10 or base e in Calculus at school. In Computer Science we always use logarithms with base 2, and usually round them up to the nearest whole number.

Now use the TESTER's check command to find out how long binary search takes on arrays of size 3, 7, 15, 31, 63, ... and so on up to the biggest that the JAVA run-time system will allow.

Shut down all unnecessary programs, especially Netscape.

It will be possible to run the program for arrays that are larger than the available RAM, because your workstation will start using virtual memory, on disk. This will completely ruin your timings. You can tell that this is happening by listening to the disk and watching the lights on the box.

The time taken to execute the code depends logarithmically on the length, and is therefore substantially faster than linear search.

Look inside the file Binary.plot to find out how to plot a graph of time against the logarithm of the size, so 2, 4, 8, 16, 32, 64, ... are equally spaced.

If you are very careful with the experiments, you can find out the size of the primary memory cache within your CPU, and how long it takes to fetch data from the secondary cache within the CPU.

6 Mirror-image algorithms

Within your copy of the file Binary.java, make a copy of the cases 1-4 (AL-DL), calling them cases 11-14, and change < to <=. So, for example, case 2 (BL) becomes

    case 12: {
        int bot = 0;
        int top = size;
        while (top > bot) {
            int mid = (top + bot - 1)/2;
            if     (array[mid] <= seek)   bot = mid + 1;
            else /* array[mid] >  seek */ top = mid;
        }
        return top;
    }

Does this still return the position of the leftmost occurrence when seek occurs several times?

How does the position returned by the algorithm now relate to the position where the value should be inserted if it is absent?

You may need to change the return values in cases A, C and D to get this right.

What are appropriate names (instead of AL-DL) for these four versions of the algorithm?

What behaviour (in particular for missing values) would be appropriate if this algorithm were to be used as the search subroutine for insertion sort?

7 Early exit

If you tried to program binary search for yourself without looking at (the four versions of) the code at the beginning of this chapter, you probably wrote something like

    int CX (int[] array, int seek) {
        int bot = 0;
        int top = size - 1;
        while (top >= bot) {
            int mid = (top + bot)/2;
            if      (array[mid] == seek)    return mid;
            if      (array[mid] <  seek)    bot = mid + 1;
            else /* (array[mid] >= seek) */ top = mid - 1;
        }
        return bot;
    }

which certainly appears in many of the textbooks. We shall say that this version of the algorithm makes an early exit.

Include CX in your copy of Binary.java, calling it case 23. The other three versions can be modified in a similar way, and we call them AX, BX and DX or cases 21, 22 and 24.

Now, when you use CX to search for a value that occurs more than once, does it return the leftmost or rightmost occurrence?

On the other hand, if the value is not present, does CX still returns the position before which to insert it?

Which line do you have to change to obtain the position after which to insert a missing value?

As you see, this "early exit" test means that the algorithm satisfies a weaker specification. But the point of adding this test was to make it run faster.

Does CX actually perform the search any quicker?

Assuming n=2^r, no value is repeated in the array, we search for each value with equal likelihood, and never search for absent values, in how many cases is the search successful after exactly k iterations? What is the average number of comparisons made in the search? Hint:

(1/2) + (2/2²) + (3/2³) + (4/2⁴) + (5/2⁵) + (6/2⁶) + … = 2

What saving does this represent, on average, compared to the r comparisons that are made in every case with no early escape?

In our code (comparing integers), a three-way test is actually twice as expensive as a two-way test, so is there any advantage?

Now consider that, in practice, unsuccessful searches are performed, as well as successful ones. How does this affect the conclusion?

What difference does this make to the time?

One might suppose that C would be faster than method S because it makes an early exit from the loop. On the other hand, the three-way test in C takes longer than the corresponding two-way test in S.

In the JAVA implementation, S turns out to be faster.

Suppose that we were instead searching a String[] array, using the JAVA library method

    int comparison = string1.compareTo (string2);

which returns −1, 0 or +1 according as string1 comes before, at the same place as or after string2 in the dictionary. This method, which we would only call once for either C or S, takes considerably longer than the rest of the code, so avoiding one or two calls to it could quite easily outweigh the cost of the three-way test amongst −1, 0 and +1.

The COMP instruction in SF4 machine code (from the CS1 course) also makes a 3-way test, in the sense that it sets N and Z flags inside the CPU in such a way as to distinguish the three cases. As for compareTo, we would only use it once in either C or S. Donald Knuth¹ claims that C is faster unless the array is very large, though I don't believe this.

Note: this is still a comparison between the two algorithms C and S, not between JAVA and machine code.

Assuming that the value occurs exactly once, the early exit saves just one iteration on average. (This follows from a study of how many nodes of a binary tree are leaves, at height 1, at height 2, etc.., as in the analysis of merge sort.)

8 Changing the loop test and the calculation of `mid`

99% of while loops that you will ever meet can be proved to terminate by observing that the loop variable is a whole number that increases or decreases by 1 or more on each iteration, between pre-determined bounds. As we shall now see, termination of binary search is not quite this obvious, and, on the other hand, some curious things happen if you exit the loop too early.

Change the loop condition (while (top > bot) etc.).

What happens if you make the condition stronger? For example, change AL to

    while (top - bot > 2)

What happens if you make it weaker?

    while (top > bot)

In this case it more readily continues going round the loop.

The one remaining thing to change is the calculation

    mid = (top + bot + 1)/2;

A easy mistake to make (you will do this often) is to miss out the brackets. Try this, and find out whether this makes the algorithm give wrong results.

On the other hand does it necessarily produce any result at all?

A more reasonable version of this "mid-point" calculation is (top+bot)/2. Do the same experiments.

Keeping your alternative calculation for mid for which the algorithm fails to terminate, change the loop condition to fix the problem. What is wrong now? Can you fix it with extra code after the end of the loop? Make sure that this still works with arrays of length zero.

9 Termination

The root of the problem is that integer division by 2 does not necessarily make a number smaller, or turn one non-zero number into another. Annoyingly, JAVA gives (-1)/2==0, and this is significant in one case, but for this section let us assume that division always rounds down, to less positive or more negative whole numbers.

Draw out a table whose rows and columns are each labelled from −1 to +6. Calling the number of the column top and the row bot, calculate (top+bot)/2 in each square of the table. Make six copies of this table, and two of another containing the values of (top+bot+1)/2. Label the tables AL ... DR according to the assignment to mid inside each loop.

From each square of each table, draw vertical red arrows indicating the effect of as assignment to bot if one branch of the conditional is taken. To ensure termination, the arrows should go at least one square downwards, but some of them go upwards or are loops.

Similarly, draw green horizontal arrows to indicate the effect of assignments to top.

The normal initial situation is in the top right triangle of the table. From here, the arrows point down and left towards the diagonal. There is a mirror-world (with top<bot) in the bottom left of the table, where the arrows generally point up and right, but still towards the diagonal.

However, near the diagonal are loops in single squares, and linking pairs of squares. This region is called the attractor set. By this we mean the smallest region with the property that:

wherever you start, you eventually end up inside the attractor, and
once you're inside the attractor, you can't get out!

Here we have a tiny, tiny, tiny example of chaos.

If you get bored or make a mistake drawing the two-dimensional tables, notice that they have a diagonal translation symmetry: the behaviour really only depends on the value of top-bot, not on top and bot separately. (This is where the stupid standard implementation of integer division makes a mess of things.)

Drawn out in this one-dimensional fashion, the attractor set for top-bot has just two or three elements, near to zero. Although there are some coincidences, it is by no means the case that all eight versions of binary search have the same attractor set - that's why we needed three different loop conditions, and, for the D versions, an untidy extra test after the loop.

Compare the top of the attractor set with the values that were chosen for the loop condition. Now it is obvious why the program didn't terminate if we made the loop condition weaker.

But notice something else: in some cases all of the arrows that enter the attractor set arrive at the same point, in others they don't. In the former case, we know, after the loop, exactly what the value of top-bot is. In the latter we don't, and, as a result, may have to do some further tests.

10 Proving termination

The basic idea behind a proof of termination is that the value of top-bot is strictly reduced on each iteration.

The modifications that you made to the programs, and the analysis of attractor sets both show that this argument needs to be considered more carefully.

This is much easier than the work involved in calculating the attractor sets. In cases A and D, the loop condition says that

top >= bot + 2

from which it is easy to deduce that

bot < mid < top

but the first < becomes a <= for BL, CL and CR, and the second does so in the cases BR, CL and CR.

Using these little bits of arithmetic, we can check that top-bot really is reduced on each iteration.

11 Proving correctness

If the loop terminates normally, with no early escape, which of top, mid or bot points to x?

When the loop exits, its condition having failed, we have some bound on the value of top-bot, although careful examination of the attractor sets shows that we may not necessarily know exactly what the final relationship is between top and bot.

The condition inside the loop does, however, tell us something very interesting.

After each assignment to bot, we know one of

A[bot]<=x A[bot]<x A[bot-1]<=x A[bot-1]<x

depending on which version of the code we're considering. Similarly, after each assignment to top, we know one of

x<A[top] x<=A[top] x<A[top+1] x<=A[top+1]

Moreover, the initialisations of bot and top we also chosen to ensure that these properties trivially hold before we enter the loop. Well, they actually involve positions in the array that are out of bounds, so they only hold if we adopt the convention that

A[-1]==−∞ A[n]==+∞

Now, putting these inequalities together, you should be able to prove that each of the eight versions of binary search must behave in the way that you found out by experiment in the first section.

Again, begin by drawing the pictures for the loop invariant for the four versions of the binary search. The comparisons that you have to do inside the loop are different for the four cases. In what way is (c) now slightly simpler than the other versions?

12 Logical analysis of version AL

Note: be careful to distinguish between the old and new values of variables to which assignments are made (top and bot).

The assignment bot=mid is executed after the test (top - bot > 1) has succeeded and the initialisation int mid = (top + bot)/2 has been made. For any integers b and t, we claim that

t − b > 1 ⇒(t + b) / 2 > b.

But, as these are integers, t ≥ b+2, so (t + b)/2 ≥ (b + 2 + b) / 2 = b + 1 as required. Hence the variable bot is increased by at least 1.
Beware that, since we are dealing with integer division, it is not admissible to "multiply throughout by 2" to eliminate the division. Mathematical notation - which, after all, you learned at school - is essential for such reasoning, as English words are imprecise. Maths is also much quicker to write!
Note that mid may often be closer to bot than to top, for example if bot=8, top=11, so mid==9.
This calculation would no longer be valid if the loop condition were changed to while (top - bot > 0). Eventually we (may) reach the state top==bot+1 and continue executing the loop, in which case the value of mid is (t+b)/2=(b+1+b)/2=b, leaving bot unaltered.
Suppose, for example, seek=7, bot==3, top==4 and array[3]=6. Then we re-assign the same value, bot=3, and go round the loop again in exactly the same state as before, and again, and again, ... On the other hand, with seek=5, we would instead assign top=3 and leave the loop. So this version sometimes fails to terminate.
Similarly, the assignment bot=mid makes the new value of bot at most top-1, because, as in (a), (t+b)/2 ≤ (t+t−2)/2 = t−1.
After the assignment bot=mid has been made, array[bot] < seek.
Similarly, after the assignment top=mid has been done, seek <= array[top].
The two logical statements in (e) and (f) are not, as they stand, valid just after the initialisations bot=-1 and top=array.length, since these are "out of bounds". We may, however, restore meaning to the statement either by adopting the convention that array[−1]=−∞ and array[array.length]=+∞, or by using the more complicated statement

(bot = −1 ∨array[bot] < seek) ∧ (top = array.length ∨ seek ≤ array[top])

(though even makes illegitimate use of array[-1] and so depends on a convention about ∨). Of course, this convention is only appropriate to the discussion of an array that's sorted in ascending order!
Suppose that the logical statement in (e) happened to be valid just inside the beginning of the body of the loop, but that the else branch (including assignment top=mid) is executed. Then (e) is still valid, because bot, seek and the contents of the array have not been altered. Such statements about what has not happened are just as important as saying what has happened, cf. that the rest of the array has stayed the same when part of it has been shifted.
Therefore, on exit from the loop we may conclude either that array[bot] < seek <= array[top], or the more complicated statement in part (g). This is because this statement
- has been verified on entry to the loop, and
- is preserved by each iteration,
- so is therefore still true on exit -
- assuming that the loop does eventually exit!
We exit the loop exactly when the loop condition fails, i.e. !(top - bot > 1).
Part (d) said that (top - bot >= 1) after the loop body has been executed, irrespectively of whether it was true at the top of the loop, though in fact it is also true after the initialisation, so this statement is part of the loop invariant. Putting this together with part (j), we have (top == bot+1) on exit from the loop.
Substituting (k) into the loop invariant, we have array[top-1] < seek <= array[top] on exit from the loop.
We have already said that the array must be sorted as a precondition for binary search to work. Then,

∀i. 0 ≤ i < top−1 ⇒ array[i] ≤ array[top−1] < seek.

so it is not possible for array[i]==seek where i<=top-2.
Hence, if the value seek occurs in the array, top is the position of its leftmost occurrence.
Suppose that array[top] != seek. Then, by the same reasoning as in (m), ∀i. top < i⇒ seek < array[i]. Hence the number seek does not occur anywhere in the array.
More precisely, the value seek lies strictly between those in the array in positions top-1 and top, so top is the position before which to insert.
If we replace the loop condition by while (top - bot > 2), the best that we can say is that top==bot+1 or top=bot+2, so array[top-2] < seek <= array[top], but we know nothing about array[top-1].

13 The intermediate value theorem

Binary search provides a constructive proof of a standard theorem in Analysis. ("Analysis" is what mathematicians call Calculus done rigorously.) Clearly this would not be part of a test in a computer science course.

"If you're on one side of a river in the morning and on the other side in the afternoon, you must have eaten your lunch on a boat!"

Let f:[0,1]→R be a real-valued continuous function on the real unit interval. Suppose that f(0) ≤ s ≤ f(1). Then ∃x.0 ≤ x ≤ 1∧f(x)=s.

The proof usually found in the textbooks considers the point

sup

{y ∈ [0,1] | f(y) ≤ s}

and uses continuity to prove that f(x)=s, without actually offering any way to calculate x.

Here is a constructive way to solve the problem (so long as the function f is such that you can decide, for each y ∈ [0,1], whether f(y) < 0 or f(y) ≥ 0). You can apply this method to solving equations like nlog₂(n)=10⁶ if you wish.

We define three sequences b_n, t_n and m_n of real numbers by a recursive procedure.

First, put b₀=0 and t₀=1.

Then let m_n=(b_n+t_n)/2, and

if f(m_n) < s, put b_n+1=m_n and t_n+1=t_n;
if f(m_n) ≥ s, put b_n+1=b_n and t_n+1=m_n.

So b_n is an increasing sequence, and t_n a decreasing sequence, and moreover t_n−b_n ≤ 2⁻ⁿ, so both sequences converge to a common limit. Call it x. This satisfies

∀n. x−2⁻ⁿ ≤ b_n ≤ x ≤ t_n ≤ x+2⁻ⁿ.

We haven't used continuity of f yet, just as in binary search we didn't rely on the array being sorted until the very end (question 7(m) and (o)).

By definition, we say that f is continuous at x if

∀ε > 0. ∃δ > 0. ∀y. |y−x| < δ⇒|f(y)−f(x)| < ε.

Suppose f(x) > s, so ε = f(x)−s > 0, and let δ > 2⁻ⁿ > 0 be as in the definition of continuity. But then y=b_n satisfies |y−x| < δ, so |f(y)−f(x)| < ε by continuity. However, by construction f(y)=f(b_n) < s < f(x), so f(x)−f(y) > ε.

The case f(x) < s also leads to contradiction by considering t_n, so we are left with f(x)=s as required.

Footnotes:

¹The Art of Computer Programming, volume 3, Sorting and Searching (Addison-Wesley, 1973, second edition, 1997), closing paragraphs of section 6.2.1 (page 422), and exercise 23 (page 425).

This is www.PaulTaylor.EU/algorithms/binary.html and it was derived from algorithms/binary.tex which was last modified on 1 June 2008.