I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32 bit hash number is used to get the ~O(1) lookup. If you have an array big enough to allocate all the possibilities in a 32bit number then you do get the ~O(1) but that would be waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java
Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins are quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (Find the bin for an item: Take modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode(); int binIndex = hash % binCount; // The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does this, a new array of bins are created, and the items in the table are redistributed to the new bins. This is also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1)).
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the complexity for every insertion is essentially constant: when the hash table gets "full" (i.e., load factor exceeds your trigger point) you double the size of the table. Then, each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion will involve exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to
log N/M, where M=number of buckets. For practical size ranges (e.g., up to several billion items) that's essentially constant (
log N grows very slowly) and and it's often a little faster for the largest table you can fit in memory, and a lost faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.