Introduction to Algorithms: Hash Tables

Paul Taylor
Department of Computer Science,
Queen Mary and Westfield College, London E1 4NS
`pt@dcs.qmw.ac.uk`

7 April 1999

Hash tables were invented by the authors of compilers for programming languages in the 1960s, as a trick for maintaining the list of user-defined names (such as program variables) and associated properties. An interpreter for the formerly popular teaching language BASIC, for example, consists of little more than a calculator together with a hash table in which variable-names and their current values are recorded.

The function of a hash table is to enter and retrieve data as quickly as possible. It does not sort the data, which would take O(n logn) operations. Entry, deletion and retrieval of single items in a hash table can be done in constant time, irrespectively of the number of items in the table, so building the table takes just O(n) operations.

The basic idea is to put each item in a numbered "pigeonhole", the number being obtained by some quick and repeatable (but otherwise meaningless) piece of arithmetic, called a hash function (Section 2.2). As there are vastly more possible names than one can provide pigeonholes, there must also be a strategy for resolving conflicts, i.e. when the pigeonhole that we want to use for one name is already occupied by another.

The main factor on which the performance of the hash table depends is how crowded it is. Experiment shows that there should be approximately twice as many available pigeonholes as items to go in them. When the table gets too full, it needs to be rebuilt.

If you are planning to write an essay on hash tables in the exam, you should use my implementation to do further investigations of the performance when the load factor and other features are changed. You should not do more programming work for the exam.

[Explain why prime numbers are used.] [Section on performance.] [References.]

These notes were written after I had marked the April 1999 course-work, and are based on the points that arose from marking. They are complementary to the treatments in Richard Bornat's lecture notes and textbooks such as that by Weiss.

Other sources:

Richard Bornat's notes,
Chapter 19 of Weiss's book,
The documentation of the JAVA library.

1 Using hash tables

In the subdirectory Hash of the course directory, you will find

a test program called Main.java
the compiled version of this program, Main.class
a file dcs-login that contains the login and human names of everyone in the Department and is read by the test program
the compiled version Hash_Table.class of my implementation of a hash table
a template template.java that is a simplified version of my implementation, with its three most important methods removed.

Start by running Main.class. It will print out some statistics as it builds two hash tables (for the login-to-human and human-to-login translations) from the data, and prompt for queries. Try your own login name first: it will look this up in both tables, responding with your human name in one case and "not found" in the other. For the reverse translation you have to type your (human) name exactly as it appears in the password file.

Apart from answering queries, as an application would be expected to do, the program also prints some debugging output. This shows each of the entries in the tables that were considered during the search. In a good implementation, there should only be one, two or maybe three of these, but in order to show you how the algorithm works, these hash tables have been built with a very high load factor, requiring multiple attempts to locate the entries.

One table uses linear rehash, the other quadratic (Section 2.3), so you can see what these do since the printout gives the table index, the long hash code, the key and the data.

1.1 Object-oriented aspects

If you now look at the JAVA source code provided, you will see that Main.java has method calls like

    login_to_human.put (login_name, human_name);

whilst the corresponding method definition in template.java is

    public Object put (Object key, Object data) { ... }

with no dot (or anything to go before it).

JAVA is an object-oriented programming language, and this program is written in the object-oriented style. The algorithms for hash tables are enclosed in a class declaration, whilst a particular hash table (such as login_to_human) is an instance of the class (this is object-oriented programming jargon). Each instance is declared (in Main.java) using the constructor

    Hash_Table login_to_human = new Hash_Table ();

Objects, like arrays, may be passed around and assigned by reference: the constructor returns a reference to a hash table. Its details are meaningless to the application (Main.java): the reference simply says which hash table to use.

Imagine that you are in a foreign country and are given a visa or some other official document, written in a foreign language, that's meaningless to you, but which the local bureaucrats think is very important. All you have to do with it is take it off one bureaucrat to give to another.

You access different objects (such as the two hash tables login_to_human and human_to_login in Main.java) by prepending their names to the method name, e.g.

    login_to_human.remove("pt");

From the point of view of the implementation (Hash_Table.java), what this means is that you can write the code as if there were only one single hash table in the world. You (almost) never have to refer to any particular instance (hash table): there is, it seems, only one contents variable.

If ever it is necessary to say whose contents variable something is, it's called

    login_to_human.contents

as with the method calls. However, it is advisable to hide or encapsulate variables such as this, only allowing the external user to read or change them via methods. One reason for this is that, in the case of the hash table, this variable is part of a data invariant (Section 3.1), and must only be changed in harmony with the rest of the data.

There is another course about object-oriented programming: everything that you need to know for this exercise, or for this course, you can find out by looking at how the dot is used in Main.java and template.java.

The dot syntax is rather clumsy when it comes to using binary operations such as lessThan or

    if ( skey.equals (key[i]) ) { ... }

because the natural symmetry between the arguments is lost. However, there is a reason for this: the object skey is an instance of a particular class (say String), and the actual code that is executed is that which is the definition of the equals() method in that class. This means that

    if ( key[i].equals (skey) ) { ... }

may possibly do something quite different, should key[i] happen to be of a different class, containing a different equals() method (Section 1.3).

Usually, the data (e.g. the variable contents) belong to the object, i.e. the instance of the class, not the class itself. This is how JAVA makes it possible to use many instances of a class, even though only one version of the code, with an apparently global contents variable, has been written.

However, if you declare a variable as static it belongs to the class instead. For example, you might want to accumulate usage statistics for the search algorithms without distinguishing between instances.

1.2 Public methods for accessing data

We have already seen how to use the constructor,

    Hash_Table login_to_human = new Hash_Table ();

and one of the public access methods,

    String pt_name = login_to_human.get ("pt");

of which the others are

    boolean present = login_to_human.containsKey ("pt");
    boolean present = login_to_human.contains ("Paul Taylor");
    String pt_old_name = login_to_human.put ("pt", "Paul Taylor");
    String pt_old_name = login_to_human.remove ("pt");

Notice that, although the last two methods would have void return types in natural usage, we use the return value to tell the application whether this entry had previously been defined or not, and if so what its old value was. The application can then report an error if it so chooses. It may also want to maintain usage statistics about the old data value before it is discarded.

The contains() method is only included because it appears in the hash table implementation to be found in the JAVA library; it does a linear search through the entire table, which is plainly stupid.

1.3 What objects can you put in a hash table?

Almost all uses of hash tables are indexed by strings; the data are often strings too, but may be numbers.

However, hash tables may be used to look up other things, with very little change to the code.

One application is playing games such as Chess. The tree of possible moves needs to be searched to some depth using a minimax algorithm (with alpha-beta pruning). However, there are often pairs of "independent" moves, that can be done in either order, resulting in the same position. This means that the minimax algorithm does a lot of unnecessary work, by considering the same position twice.

The solution to this problem is to hash the positions (as the keys) and minimax's calculations (as the data). For this there must be a GamePosition class.

The hash table does not need to convert game positions into strings before storing or manipulating them. All it needs is to be able to call the following two methods in the GamePosition class:

    int long_hash_code = key.hashCode ();
    boolean same = key1.equals (key2);

Any class that has these methods (can be declared to say that it) implements the Hashable interface.

If you do not provide an equals method for the class, the default one from the Object class is used. This just compares the references, so, for example, Strings that contain the same letters in the same order but which were declared separately are considered unequal.

1.4 Methods to access the control parameters

With no arguments, the constructor has to make its own guess as to the size of the table required. As the table can be rebuilt automatically with a larger size, this is OK. However, the constructor may be given arguments to specify the original size of the table, what the acceptable load factor is, and how fast it should grow when this is exceeded.

The application may perhaps wish to change these parameters after the table has been created, or to find out how full it is or how efficiently it's working.

My implementation (Hash_Table.class) has numerous bells and whistles that you can ring/blow by making minor adjustments to Main.java. You are not expected to implement these yourself, or be put off by how many there are! They are there for you to do experiments and design your own implementation. In each case, if the method is called with no arguments, the current value is returned.

    int capacity (int new_size)
    int used () { return used; }
    int size () { return contents; }
    int contents () { return contents; }
    boolean isEmpty () { return (contents==0); }
    double inflation (double new_inflation)
    int increment (int new_increment)
    boolean report_rebuild (boolean new_report_rebuild)
    boolean use_quadratic_rehash (boolean new_use_quadratic_rehash)
    double conflict () { return ((double)nconflicts)/((double)nsearches);}
    double loading () { return ((double)used)/((double)capacity); }
    double real_loading () {return ((double)contents)/((double)capacity);}
    double load_factor (double new_load_factor)

The meaning of increment and inflation is apparent from

    private int rebuild_size ()
         { return next_prime ((capacity + increment) * inflation); }

Changing capacity or, if necessary, certain of the other parameters, triggers a rebuild of the table. The JAVA library's ensureCapacity (new_size) is the same as capacity (new_size).

Find out what happens to the behaviour when you change the load factor or switch to quadratic rehash. With load factor 0.9, quadratic rehash crashes; there is a little number-theoretic problem for the maths students to see why; how could cubic rehash solve this problem?

1.5 Other methods

The debugging output that Main.class produces is obtained from

    String output = login_to_human.show ("pt");

and the whole table is rendered in this way using toString() (or just the name of the table as part of a string concatenation) in the usual way.

I'll leave you to guess what these methods do:

    login_to_human.clear ();
    Hash_Table copy = login_to_human.clone ();

You can get all of the keys, or all of the data, using

    Object[] data =  login_to_human.element_array ();
    Object[] key  = login_to_human.key_array ();

There is another way, using the Enumeration interface (see the JAVA library documentation):

    Enumeration list = login_to_human.elements (boolean quick);

    Enumeration list = login_to_human.keys (boolean quick);
    while ( list.hasMoreElements () ) { 
          Object key  = list.nextElement ();
          Object data = login_to_human.get (key);
          ...
    }

The enumeration methods maintain a cursor, which has to be part of another object in order to allow you to do double enumerations (e.g. running through all pairs of vertices in a graph). The quick switch, if true, just has this pointing into the hash table itself, so it is invalidated whenever a change is made to the table (the corresponding methods in the Vector class in the JAVA library, for example, have this bug). With quick=false, the enumeration uses a private copy, which is made using one of the to-array methods.

2 Implementation

template.java is my implementation with the "guts" taken out. The bits that (I have taken out and) you have to write are, in this order:

the private locate (search_key) [or find] method, which is the core of the algorithm; it must return the index into the table where either the key has been found or there is an EMPTY or DELETED cell that can be used to insert it;
the public put (key, data) [or insert] method for inserting items in the table, and
the private rebuild (new_size) [or rehash] method that rebuilds the whole table when it has got too full.

Please note that the test program (Main.java/class) uses a more elaborate version of (my implementation of) Hash_Table.class than is given in the template. You are not expected to implement all these fiddly bits yourself!

2.1 The data

The data in the table consist of several arrays, the most important of which are

Object[] key = new Object [capacity] and
Object[] data = new Object [capacity]

The key is what you put in (e.g. a name) to unlock the data (e.g. a telephone number). In my implementation, the keys and data are Objects (the most general JAVA class: see Section 1.3), but you may safely substitute String for Object throughout.

Along with the data itself, various control parameters must be maintained, so that the program knows easily when to rebuild the table, without repeated counting:

capacity: the number of cells available altogether;
contents: the number of currently valid entries;
used: the number of cells currently occupied by valid or DELETED entries.

It is part of the data invariant (Section 3.1) that the numerical value of the contents variable should really be the number of valid entries in the table.

In my implementation, the long hash code is saved in a third array,

int[] code = new int [capacity]

for use when rebuilding the table; this avoids calling hashCode () again.

I use "special" values in code[] to indicate EMPTY and DELETED positions. This means that the return value of each call to the hashCode() method must be tested, and replaced with an OTHER value if it happens to be equal to one of these.

Another approach is to use two boolean[] arrays, or better two BitSets (see the JAVA library), to record whether an entry is EMPTY, EMPTY|DELETED or valid. (Examination of the tests that actually need to be done in the code shows that this is better than storing DELETED instead.)

2.2 The hash coding function

Despite extensive attempts to design better ones, experience has shown that it is not important how the hash function is calculated. In any case, in any single application (for example compiling a particular program written by a particular person) there are regularities in the distribution of names that would compromise any "rationally" designed hash function.

The rule is simply that all of the bits in the key must be used.

From the design of spelling checkers, it is known that the erroneous form of a word almost always differs from the intended one by a single change of one of four kinds: insertion, deletion or alteration of a single letter, or transposition of two letters.

Applying a similar idea to the design of a hash function, we must ensure that each letter contributes to the value. For example, item1 and item2 must get different values, This might be done, for example, by adding up their ASCII or Unicode values.

However, the sum of the values (or any other associative commutative operation, such as bitwise OR) fails the transposition requirement. For example, in Main.java there are variables called login_to_human and human_to_login, which contain the same letters in different orders. A common solution to this problem is to multiply each ASCII code by its position; this trick is used, for example, to calculate check digits in ISBNs, bank account numbers, etc..

In fact the JAVA String class (along with many others in the JAVA library, in particular Integer, Float, etc..) already has a hashCode() method. You should use this instead of writing your own, although when you have your hash table working you may like to experiment with alternative hash functions.

The hashCode() method in the library returns a 32-bit int value, which is of course too big to use as the index into an array. I call this the long hash code; it is printed (in hexadecimal) in the debugging output of Main.java.

The index into the table, which I call the short hash code, is obtained by finding the remainder modulo the size of the table. However, the implementation of the % (modulo or remainder) operator in JAVA and other programming language standardly has the behaviour that (-5) % 3 is not 1, as it ought to be for our purposes, but -2. I cannot think of any application in which this would be the desired behaviour, but that's how it is. Consequently, you have to check for negative results and either add capacity or change the sign.

In my implementation, the long hash code of each entry is stored in a third array, called code[ ], but special values of code[i] are used to indicate that position i is EMPTY or DELETED. If you do things this way, you must check each value returned by hashCode() to see if it accidentally happens to be equal to one of these special codes, and if so replace this value with an OTHER one.

Otherwise, one day (come the millennium) you will find that some particular key cannot be entered into the table: whenever you do so, it mysteriously disappears, being treated as an EMPTY or DELETED position. (Worse still, the inappropriate insertion of a spurious EMPTY position might make other entries disappear too!)

2.3 The private `locate` method

See Section 2.2 for the long and short hash codes. In fact, in Section 2.4 we shall see that it's slightly more convenient for the caller (put(), get(), remove()) to calculate the long hash code and pass it to locate() as a second argument.

There is no point in doing

    if (code[i] == EMPTY) { return i; }
    else
    while (code[i] != EMPTY) { ... }

because you're merely evaluating the same test for EMPTY twice.

In the loop, you have to test

if (code[i] == EMPTY) as numerical values and
if (skey.equals (key[i])) as equality of Strings (or Objects),

although as the second test may be expensive, it is a good idea to find out first whether the saved long hash code is equal to the one we're looking for.

The table is effectively circular: if we reach capacity we have to return to 0.

Some versions of this code used while (i<capacity) or the same thing with for, then reset to 0 and repeated the code with a second loop. This is no more efficient (the same test (i<capacity) is made), and is less safe because when you write code twice you have two opportunities to get it wrong.

There is some benefit in testing that we don't go round the loop again and again, but proper consideration of correctness would deal with that. Looping forever is as good a way of reporting a run-time error as is raising an exception; better, in fact, because infinite loops can't be silenced by code like

    try { code; }
    catch (Exception e) { }

It is the whole point of marking cells as DELETED that they must not be treated like EMPTY ones (Section 3.3).

Very little of the code needs to be changed to implement quadratic rehash, i.e. to visit positions i, i+1, i+3, i+6, i+10, i+15 instead. It is not necessary to calculate j² or j(j+1)/2: notice that the jumps are 1, 2, 3, 4, ... Beware, however, that quadratic rehash considers (capacity-1)/2 positions, and then repeats the same path (backwards). Problem for mathematics students: why? Could cubic rehash avoid this problem? (For cubic rehash, we don't increment jump by 1 each time, but by an amount that is itself incremented.)

2.4 The public `insert` method

This method is an application of the private locate() method above, which returns either the position of the key if it already exists in the table, or the position where it can be inserted. These cases are distinguished by the if statements that follow.

You could instead repeat the code for put(), get() and remove(), but this is unsafe (Section 3.3).

There is no loop in this code!

As (in my implementation) the long hash code is saved in code[i], it is more convenient for put() to calculate it and pass it as the second argument to locate().

It is important to maintain the contents and used counts correctly (Section 3.1). There are different things to do in the three cases.

The key may already be present in the table. In this case we replace the old data with the new, returning the old data as our return-value.

Using the used count, we test whether the table has got too full, and if so call rebuild().

2.5 The private `rebuild` method

Notice the way in which this is divided into two methods, one of which is then available for the constructor to make a new table when no valid table is already available.

The caller should be able to choose the (minimum) new size, and the actual size should be the next prime above this.

Beware that this is not a simple copy of one array to another: each old entry must be inserted in the new table in the same way as it would be using the put() method; this is why it's useful to save the 32-bit long hash code.

This method could be implemented by mimicking the code in locate() and put(), but it is much safer to call put() instead (Section 3.3).

The old EMPTY and DELETED entries must not be re-inserted.

The control parameters, in particular contents, overflow, contents and used must be set correctly (Section 3.1).

When using the put() method, the hash table must first be initialised correctly, as newhash() does. As this assigns to the key, data and code variables, these must be saved for later use in the loop.

It is not necessary to save the contents of these arrays, element-by-element. This is because the key and other variables are references to arrays, and the assignment inside newhash() changes these references to point to new arrays, leaving the contents of the old ones alone.

Nor is there any need to assign old_key = null at the end. As this is a local variable, its value is lost when rebuild() exits. As it is private to the class, this local variable is the only reference to the array. JAVA counts the references and, when it has exhausted the available memory, de-allocates the arrays and class instances that have zero reference counts. This process (garbage collection) is not straightforward: when an array or object is de-allocated, any references to other arrays and objects that it contains also disappear, so there are long chains of de-allocations that have to be made. Maintaining this is not so difficult, but there may also be loops of references, which can only be detected by scanning all of the objects that the program has allocated. JAVA worries about this for you: usually you don't have to do so, but sometimes this can go wrong!

3 Correctness

In the earlier parts of the course you have learnt how to prove correctness of algorithms such as linear search and array shifting by using loop invariants.

However, the only loops in Hash_Table.java are

locate(): a linear search along a "path" in the table (Section 2.3);
newhash() and clear(): make all the cells EMPTY;
next_prime()
rebuild(): a linear copy of one array to another, though not into the same positions;

plus a few inessential things to make the "bells and whistles" for my implementation.

Therefore, it is not in the use of loops where the correctness issues lie.

When using a class in object-oriented programming, there is another, bigger, loop that you can't see in the code:

    Hash_Table login_to_human = new Hash_Table ();
    while ( ! bored ) {
         login_to_human.call_some_method ( ... );
    }

So the invariant - which we now call a data invariant or object invariant or class invariant - is a statement of

what we may assume about the state of the data when we enter a method, and
what we must ensure about the state of the data when we leave it.

So we have to prove something about the body of the method, in exactly the same way as we did about the body of a loop.

3.1 Counting

One part - the simplest part - of the data invariant concerns the variables contents and used. These count the number of cells in each of the three states: EMPTY, DELETED and VALID.

So they must be updated every time one of the cells changes state, and also when the table is rebuilt.

Otherwise, we will not know when the table has got too full and needs to be rebuilt, with the result that locate() will search the entire table instead of two or three entries, or maybe loop forever. Alternatively, we might get into a loop rebuilding the table bigger and bigger until we run out of memory.

3.2 Partial functional relations

Neither of the previous two subsections actually addresses the issue of what the hash table is for, and therefore what it means for it to be correct or otherwise.

Remember that the entries in the table are not sorted. All we do is put them in, look them up and take them out.

There are three ways that we can look at this:

the sequence of put (key, data) and remove (key) method calls that were used to create the table,
the finite set of (key, data) pairs that the table contains, and
the key→data response of the get (key) method.

The last two directly define partial functional or single-valued relations: if you consider the same key twice, you get the same data back.

Considering the remove (key) method calls, and the put (key, data) calls that overwrite the data already associated with the key, the first way of looking at the hash table is like a sequence of assignment statements. The overall effect of this (when we consider just the current state, rather than the assignments that got us there) is that this view of the table is also a partial functional relation.

To show that the implementation of the hash table is correct, we must therefore show that these three views actually define the same partial functional relation.

Part of this is easy to check: every put or remove method call must actually enter or delete an entry in the table, unless it can see that this has already been done.

However, this is not the whole story. We must also ensure that, when one of the methods accesses a key that is contained in a particular cell of the table, this is the only cell where this key is to be found.

This concrete property is in fact just the same as the abstract mathematical property of being a partial functional relation, that you should have learned in the Discrete Structures course.

In fact this is not quite right. Mathematically, we say that R is a partial functional relation if

∀x. ∀y₁, y₂. x R y₁ ∧x R y₂ ⇒ y₁=y₂.

If we were actually doing this in the JAVA implementation, there would somewhere be code like

    if ( data1.equals (data2) ) { ... }

whereas in fact we test equality of keys. The logical property that our table satisfies is actually

∀i j. 0 ≤ i < j < capacity ⇒key [i] ≠ key [j],

but I shall leave it to you to prove that this implies that the three binary relations defined above are in fact functional in the mathematical sense.

3.3 Repeatability of `locate`

The considerations about partial functional relations in the previous subsection come down to proving that the locate() method correctly solves the search problem:

to determine whether or not the search key occurs in the table, and if so where.

The linear search algorithm solved this problem in a very crude way, by possibly considering every cell in the table. Binary search was able to do it much more quickly, only considering logn cells, by relying on the assumption that the table was sorted.

A hash table is not sorted, and we certainly don't want to consider every single cell. Correctness must therefore depend on guaranteeing the "efficient filing" property that,

if the key occurs anywhere, then it must be here,

so if it's not here, it's not present at all.

We achieve this for get() by knowing where put() would have left the entry. So these methods must consider the same sequence of positions, which is most safely achieved by making them call the same code, a private locate() method. This means that

locate() must be repeatable.

Recall that we call a hashCode() method to turn the key (usually a String) into a long hash code, and then the % operator to reduce this to the short hash code, which is the entry point to the table.

Really, it is better to consider hashCode() and locate() together as defining, not a single cell-index in the table, but a path (sequence of cells) in the table. In principle, this path consists of a large number of positions, to be considered in a particular order. The single loop within this method runs through this list of positions.

This path may be defined in various ways:

linear rehash visits positions i, i+1, i+2, i+3, ..., modulo capacity, where i is the short hash code;
quadratic rehash visits i, i+1, i+3, i+6, i+10, ...;
maybe these intervals, like the starting-point i, are also obtained from the long hash code, for example by taking its remainder modulo capacity-2. Richard Bornat says (9 April 1999):
I have heard of it. It's the sort of thing that was tried early on in history; my recollection is that it doesn't work better than (say) quadratic rehash, and that Knuth's analysis of linear rehash means that we don't have to look for better rehash schemes. But Sedgewick's book may have references.

In practice, of course, we only want to consider up to about three positions. However, the sequence of positions that are actually visited by a latter call to get() may be longer than the sequence visited by the call to put() that originally inserted the entry, because an intervening insertion may have changed the terminating EMPTY cell into a valid one. This is why, in proving correctness, we must consider the entire (potential) path.

The loop in locate() breaks when it encounters either

the required key, or
an EMPTY cell.

The second case implies that the key is absent, so long as every attempt to look up this key follows exactly the same path, and the insertion is made in the first available EMPTY cell. When the later get() finds an EMPTY cell without first finding the key, it knows that there has been no earlier put() - because that would have used this EMPTY cell, or an earlier one in the path that has since been used for another key.

What happens when we delete one of the other keys that is considered along this path? If this cell were marked EMPTY, the loop searching for our key would break prematurely. Therefore, this other key must be marked in some other way (DELETED) that does not cause the loop to break at that point.

We have not considered the rebuild() method in this discussion. However, this works by

taking all of the valid (key, data) pairs in the old table, and
re-inserting them in the new one.

Therefore the old (key, data)-relation (defined by the old table) coincides with the new one, as defined by its put() method, and hence (by the above argument) with that defined by its table and its get() method.

This is www.PaulTaylor.EU/algorithms/hash.html and it was derived from algorithms/hash.tex which was last modified on 31 December 2007.