Forming An Index With Data
Hashing As A Concept
After working with some other pretty cool kinds of data structures, hashing and hash tables will bring us back to arrays. However, as you might expect, computer scientists are still looking for ways to make data access and management faster, and hashing is definitely a way to do that. The very core of hashing is that we take the data we have, such as names, addresses, part descriptions, city locations, or whatever data we are charged with managing, and we figure out a way to convert some part of that data into array indices.
For example, if we were charged with figuring out a data structure to store auto parts, we could generate a part number for every part, or maybe we could let the data create its own access number (e.g., an array index value). Now there are a lot of ways to make this translation, but say you took the first ten letters of the part name and applied some kind of mathematical processing to those letters, which, you will remember, are stored as numbers in computers. And say the part named "Muffler Bearings" was somehow translated to the number 285917450. By itself, this is progress, but we must also consider the capacity of the array, which is unfortunately called "size" in hashing operations (so we will go with that term).
So if the company is confident that it will have no more than 1,000,000 parts, then the hash table (i.e., the array) size could be set to 1,000,000 (more on this later). So how does one take the number 285917450 and translate it to an index in a table of size 1,000,000? The answer is actually pretty easy: divide the calculated number (285917450) by the array size (1,000,000) and keep the remainder. The index would be 917450, which is a valid index within an array of size 1,000,000.
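To make the arithmetic concrete, here is a minimal Java sketch of the idea. The hash function below, which combines the character codes of up to the first ten letters of the part name, is a simplified stand-in for whatever "mathematical processing" a real system might use; the class name, the multiplier 31, and the ten-letter limit are illustrative assumptions, not a prescribed algorithm.

    public class SimpleHashDemo {
        // Hypothetical table capacity; the text's example uses 1,000,000.
        static final int TABLE_SIZE = 1_000_000;

        // Combine the character codes of (up to) the first ten letters
        // of the part name into one number, then take the remainder
        // to turn that number into a valid array index.
        static int hash(String partName) {
            long value = 0;
            int length = Math.min(partName.length(), 10);
            for (int i = 0; i < length; i++) {
                value = value * 31 + partName.charAt(i);  // characters are just numbers
            }
            return (int) (Math.abs(value) % TABLE_SIZE);  // remainder gives a valid index
        }

        public static void main(String[] args) {
            System.out.println(hash("Muffler Bearings"));  // some index in 0..999,999
        }
    }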
Bottom Line: Hashing is converting something directly related to the data into an index that can be used in an array, called a hash table. In its simplest form, this has a complexity of one (constant time) since there is no iteration, or searching, across the array data; there is just a hash calculation. A complexity of one is really fast in an array of 1,000,000 or 1,000,000,000 elements. In fact, it is really fast no matter how large the data quantity is. That is the inherent value of hashing.
But . . .
It doesn't matter whether you are working with particle physics, playing Monopoly, or developing data management programs: there is no such thing as a free lunch. There is always a "But" or a "However" in any conversation such as this one.
The biggest problem with hashing is that, thanks to the various ways a hash might be calculated, the hashed result, and therefore the calculated index, can come out the same for two different data items. Now if you have created a really good hashing algorithm, you may keep this kind of problem to a minimum, but it isn't going to go away. There are going to be times when hashing two unique data items results in the same index. This unfortunate occurrence has a name: when two different data items hash to the same index, it is called a collision. And that is something we have to deal with when we are working with hashing operations.
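To see how easily this can happen, here is a tiny Java example. The value 285917450 is the text's earlier example; the second value, 42917450, is made up purely for illustration, to show two different hash results landing on the same index after the modulo step.

    public class CollisionDemo {
        static final int TABLE_SIZE = 1_000_000;

        public static void main(String[] args) {
            // Two different hash values...
            int indexA = 285_917_450 % TABLE_SIZE;  // 917450
            int indexB =  42_917_450 % TABLE_SIZE;  // 917450 as well -- a collision
            System.out.println(indexA + " vs " + indexB);
        }
    }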
Small Side Step: Hash Table Size (Capacity)
The hashing process itself is pretty challenging. Getting a mathematical operation to create differing indices for different data is tricky all by itself. But from there, you still need to conduct modulo math on the hashed value to turn it into an index, which means the table size must also be considered. For example, if the table size is an even number, any hash values that share a factor of two with it will pile onto the same handful of indices. Odd numbers are better, but the best number to use for setting the hash table array size (capacity) is a prime number.
Since a prime number shares no factors with the generated hash values (other than its own multiples), there is a better likelihood that the indices will be uniformly distributed. For small hash tables like those that might be developed in the classroom, this is negligible. However, once again, in the outside world they don't use tables of size 30 or 40; their tables will commonly be sized in the millions. Get in the habit of using prime numbers for your hash table sizes now. It will pay off later.
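Here is a small Java experiment that hints at why the prime matters. The stride of 10 and the table sizes 20 and 23 are chosen only for illustration: when patterned hash values share a factor with an even table size, they crowd into a couple of slots, while a prime size spreads the same values across the whole table.

    import java.util.HashSet;
    import java.util.Set;

    public class TableSizeDemo {
        // Count how many distinct slots a patterned set of hash values lands in.
        static int distinctSlots(int tableSize) {
            Set<Integer> slots = new HashSet<>();
            for (int k = 0; k < 1000; k++) {
                int hashValue = k * 10;              // hash values that share a common factor
                slots.add(hashValue % tableSize);
            }
            return slots.size();
        }

        public static void main(String[] args) {
            System.out.println("size 20 (even):  " + distinctSlots(20));  // only 2 slots used
            System.out.println("size 23 (prime): " + distinctSlots(23));  // all 23 slots used
        }
    }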
And From There . . .
Collision management can be handled in one of two fundamental ways. The first is simply to find another spot after the one where the collision took place. There are a handful of ways to conduct this process, called probing, and they will be discussed in the next topic. The alternative is, instead of using the array to store the data directly, to use the array to store other data structures (e.g., Binary Search Trees, linked lists, other arrays, etc.) that hold the given data. While the first strategy is ultimately limited by the size (capacity) of the hash table array, the strategy of placing another data structure in each array element expands the hash table's capacity to a potentially limitless amount without significantly compromising the speed of access. For example: think about an array of self-balancing trees . . .
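As a preview of the second strategy, here is a minimal Java sketch of a hash table whose every slot holds a linked list (often called separate chaining). The class and method names are illustrative assumptions, and a real implementation would also handle things like resizing and removal.

    import java.util.LinkedList;

    // A bare-bones hash table that keeps a linked list in every slot.
    public class ChainedHashTable {
        private final LinkedList<String>[] table;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int capacity) {
            table = new LinkedList[capacity];        // ideally a prime capacity
            for (int i = 0; i < capacity; i++) {
                table[i] = new LinkedList<>();
            }
        }

        private int indexFor(String key) {
            return (key.hashCode() & 0x7fffffff) % table.length;  // non-negative index
        }

        public void add(String key) {
            table[indexFor(key)].add(key);           // a collision just grows this one list
        }

        public boolean contains(String key) {
            return table[indexFor(key)].contains(key);  // only one slot's list is searched
        }
    }

With this layout, a collision simply means two items end up in the same list, and a lookup only has to search that one short list rather than the whole table.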
This is the good and bad news about hashing and hash tables. The potential to access data very quickly is within reach but collisions must be managed. In the next two topics, the two primary strategies for doing this are discussed. Enjoy.