Introduction to Hashing

Suppose we want to design a system for storing employee records keyed by phone number. We want the following operations to run quickly:

  • Insert a phone number and any necessary information.
  • Look up a phone number and get the information.
  • Remove a phone number and any associated information.

We can consider using the following data structures to store information about various phone numbers.

  • An array of phone numbers and records.
  • A linked list of phone numbers and records.
  • A balanced binary search tree with phone numbers as keys.
  • A direct access table.

With arrays and linked lists, we must search linearly, which can be costly in practice. If we use an array and keep the data sorted, we can use binary search to find a phone number in O(log n) time, but insert and delete operations become expensive because we must keep the data sorted.
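
As a rough illustration of the sorted-array trade-off, the sketch below uses Python's standard bisect module. The sample phone numbers and the helper names contains and insert are made up for this example.

```python
import bisect

# Hypothetical sample data: phone numbers kept in sorted order.
phones = [5550101, 5550145, 5550199]

def contains(sorted_phones, number):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(sorted_phones, number)
    return i < len(sorted_phones) and sorted_phones[i] == number

def insert(sorted_phones, number):
    """The position is found in O(log n), but the list must shift
    every later element to make room, so the insert is O(n) overall."""
    bisect.insort(sorted_phones, number)

insert(phones, 5550120)
found = contains(phones, 5550145)
```

The lookup is cheap, but every insert or delete in the middle of the array moves the elements that follow it.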

With a balanced binary search tree, we get moderate search, insert, and delete times: all of these operations complete in O(log n) time.

Another option is to use a direct access table, in which we create a large array and use phone numbers as indexes. If the phone number is not present, the array entry is NIL; otherwise, the array entry stores a pointer to the records corresponding to the phone number. In terms of time complexity, this solution is the best of the bunch; we can perform all operations in O(1) time. To insert a phone number, for example, we create a record with the phone number's details, use the phone number as an index, and store the pointer to the newly created record in the table.
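
A minimal sketch of a direct access table follows. It assumes 4-digit keys purely to keep the table small; the names DIGITS, table, insert, search, and delete are illustrative, and Python's None plays the role of NIL.

```python
# Assumption for illustration: keys are 4-digit numbers,
# so the table needs 10**4 slots.
DIGITS = 4
table = [None] * (10 ** DIGITS)   # None stands in for NIL

def insert(number, record):
    table[number] = record        # O(1): the key itself is the index

def search(number):
    return table[number]          # O(1); None means "not present"

def delete(number):
    table[number] = None          # O(1)

insert(1234, {"name": "Ada"})
```

Every operation is a single array access, which is where the O(1) bound comes from.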

This solution has a number of practical drawbacks. The first is the amount of extra space required. For example, if a phone number has n digits, we require O(m * 10^n) table space, where m is the size of a pointer to a record. Another issue is that a built-in integer type in many programming languages may not be able to hold n digits.
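
To get a feel for the space cost, the small calculation below assumes 8-byte pointers (an assumption; the actual pointer size depends on the platform). The function name direct_table_bytes is made up for this example.

```python
def direct_table_bytes(digits, pointer_size=8):
    """Space for one pointer per possible key: m * 10^n bytes."""
    return pointer_size * 10 ** digits

# For 10-digit phone numbers: 8 * 10**10 = 80_000_000_000 bytes (80 GB),
# no matter how few numbers are actually stored.
size_for_ten_digits = direct_table_bytes(10)
```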

Because of these limitations, a direct access table cannot always be used. In practice, hashing can be used in almost all such situations and typically performs better than the above data structures (array, linked list, and balanced BST). With hashing, we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case. Let's break down what hashing is.

What is hashing?

Hashing is a popular technique for storing and retrieving data quickly. A hash function maps each key to a position in a table, so a lookup can go almost directly to the right location instead of searching through the data.

Why should you use Hashing?

Searching, inserting, or deleting an element in a balanced binary search tree takes O(log n) time. Some applications need these operations to be faster, and this is where hashing comes into play. With hashing, all of the above operations can be completed in O(1), or constant, time on average. It is important to understand that hashing's worst-case time complexity remains O(n); only its average time complexity is O(1).

Let us now look at some fundamental hashing operations.

Fundamental Operations:

  • HashTable: Use this operation to create a new hash table.
  • Delete: This operation is used to remove a specific key-value pair from the hash table.
  • Get: This operation is used to find a key within the hash table and return the value associated with that key.
  • Put: This operation is used to add a new key-value pair to the hash table.
  • DeleteHashTable: This operation is used to remove the hash table.
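
For a concrete feel, the operations above can be tried with Python's built-in dict, which is itself a hash table. This is only an illustration of the operations, not a custom implementation; the key and value are made-up sample data.

```python
table = {}                        # HashTable: create a new, empty table
table["5550101"] = "Ada"          # Put: add a key-value pair
name = table.get("5550101")       # Get: look up the value for a key
del table["5550101"]              # Delete: remove one key-value pair
missing = table.get("5550101")    # Get on an absent key returns None
table.clear()                     # DeleteHashTable: drop all entries
```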

What is a hash function?

  • A hash function is a fixed procedure that converts a key into a hash value.
  • The result is a fixed-length value known as a hash value or simply a hash.
  • The hash value is typically much shorter than the original, yet it still stands in for the original string of characters.
  • Hash values are also used to verify messages: the sender transmits the message along with its hash value and a digital signature. The receiver computes the hash of the received message using the same hash algorithm and compares it to the hash value received along with the message.
  • If the hash values match, the message arrived unaltered.
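
The message check described above can be sketched with Python's standard hashlib module. SHA-256 is chosen here only as an example algorithm; the message contents and the helper name digest are made up, and the digital signature step is left out.

```python
import hashlib

def digest(message: bytes) -> str:
    """Return a fixed-length (64 hex character) SHA-256 hash of the message."""
    return hashlib.sha256(message).hexdigest()

# Sender side: compute the hash that travels with the message.
message = b"transfer 100 to account 42"
sent_hash = digest(message)

# Receiver side: recompute the hash and compare it to the one received.
received_ok = digest(message) == sent_hash
received_tampered = digest(b"transfer 900 to account 42") == sent_hash
```

Note that cryptographic hashes like this one are designed for integrity checks; hash tables usually rely on much cheaper functions.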

What is a hash table?

  • A hash table, or hash map, is a data structure that holds key-value pairs.
  • It is a collection of items organised so that they are easy to find later.
  • It uses a hash function to compute an index into an array of buckets or slots, from which the requested value can be located.
  • Each slot in the array is referred to as a bucket.
  • Values are stored and retrieved on the basis of their key.
  • In Java, the Hashtable class implements the Map interface and extends the Dictionary class.
  • Java's Hashtable is synchronised and holds only unique keys.

Components of Hashing:

  • Hash Table: An array that stores pointers to records that correspond to a specific phone number. If no existing phone number has a hash function value equal to the index for the entry, the entry in the hash table is NIL. In simple terms, a hash table is a generalisation of an array. A hash table provides the functionality of storing a collection of data in such a way that it is easy to find those items later if needed. This makes element searching very efficient.
  • Hash Function: A function that reduces a large phone number to a small, practical integer value. The mapped integer value serves as an index into the hash table. Put simply, a hash function converts a given key into a slot index. Ideally, it would map every possible key to a unique slot index; a hash function for which each key maps to a distinct slot index is called a perfect hash function. Because a perfect hash function is exceedingly difficult to construct, our responsibility as programmers is to choose a hash function that minimises the likelihood of collisions. Collisions are covered below.

A good hash function should have the following characteristics:

  • It should be efficiently computable.
  • It should distribute the keys uniformly across all table positions.
  • It should minimise collisions.
  • It should be paired with a low load factor (number of items in the table divided by the size of the table).

A poor hash function for phone numbers, for instance, would be to use the first three digits, since many numbers share the same leading digits. Using the last three digits is a better function. Please be aware that this hash function might not be the best; there could be better options.
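
The difference between the two choices can be seen on a few made-up numbers. The sample data below assumes 10-digit numbers that happen to share a leading prefix; the function names first_three and last_three are illustrative.

```python
from collections import Counter

# Hypothetical 10-digit numbers that all start with the same three digits.
phones = ["4155550101", "4155550145", "4155550199", "4155557342"]

def first_three(number):
    return int(number[:3])    # every sample number maps to slot 415

def last_three(number):
    return int(number[-3:])   # the sample numbers spread across slots

# Count how many keys land in each slot under each hash function.
first_counts = Counter(first_three(p) for p in phones)
last_counts = Counter(last_three(p) for p in phones)
```

With this sample, the first-three-digits function sends all four keys to a single slot, while the last-three-digits function gives each key its own slot.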

  • Collision Handling: Because a hash function returns only a small number for a large key, it is possible that two keys will yield the same result. A collision occurs when a newly added key maps to a hash table slot that is already taken, and it must be handled using a collision handling mechanism. The methods for handling collisions are as follows:
    • Chaining: Each hash table cell points to a linked list of records that have the same hash function value. Chaining is straightforward, but it requires additional memory outside of the table.
    • Open Addressing: In open addressing, all items are stored in the hash table itself. Every table entry contains either a record or NIL. When looking for an element, we probe the table slots one by one until the sought-after element is found or it becomes clear that the element is not in the table.
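
Chaining can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes non-negative integer keys, uses key % size as the hash function, and uses Python lists in place of linked lists. The class name ChainedHashTable is made up for this example.

```python
class ChainedHashTable:
    def __init__(self, size=8):
        # One (initially empty) chain per table cell.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Assumed hash function: non-negative integer key modulo table size.
        return key % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: extend the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                      # not found

    def delete(self, key):
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = ChainedHashTable()
t.put(1, "a")
t.put(9, "b")   # 9 % 8 == 1, so this collides with key 1 and joins its chain
```

Keys 1 and 9 end up in the same bucket, and both remain retrievable because the chain is searched by key.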

Linear Probing

Hashing may produce an array index that is already used to store a value. In this situation, we search linearly from that index for the next empty cell.

Linear probing is the simplest open addressing method for handling collisions in hash tables. A collision is resolved by searching sequentially for the next free slot.
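
A minimal linear probing sketch is shown below, under the same assumptions as before: non-negative integer keys and key % size as the hash function. The class name LinearProbingTable is illustrative. Deletion is deliberately left out, since removing an entry in open addressing needs extra care (for example, marker entries) so that later lookups do not stop too early.

```python
class LinearProbingTable:
    def __init__(self, size=8):
        self.slots = [None] * size       # None stands in for NIL

    def _probe(self, key):
        # Visit the home slot first, then each following slot, wrapping around.
        size = len(self.slots)
        start = key % size
        for i in range(size):
            yield (start + i) % size

    def put(self, key, value):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None or slot[0] == key:
                self.slots[idx] = (key, value)
                return idx               # index where the pair was stored
        raise RuntimeError("table is full")

    def get(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None:
                return None              # empty slot reached: key is absent
            if slot[0] == key:
                return slot[1]
        return None

t = LinearProbingTable()
i1 = t.put(1, "a")    # home slot 1
i2 = t.put(9, "b")    # 9 % 8 == 1 is taken, so it moves to slot 2
i3 = t.put(17, "c")   # slots 1 and 2 are taken, so it moves to slot 3
```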

Double Hashing

Two hash functions are used in the double hashing method. When the first hash function results in a collision, the second hash function is used to compute an offset (step size) that leads to another slot in which to store the value.

The double hashing method's formula is as follows:

(firstHash(key) + i * secondHash(key)) % sizeOfTable

Here i is the probe number. It starts at 0 and is incremented after each collision until an empty slot is found.
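
The formula can be sketched as below. The specific choices are assumptions made for illustration: a prime table size of 7, key % SIZE as the first hash, and 1 + key % (SIZE - 2) as the second hash. The second hash must never evaluate to zero, otherwise every probe would land on the same slot.

```python
SIZE = 7  # assumed prime table size, so every non-zero step visits all slots

def first_hash(key):
    return key % SIZE

def second_hash(key):
    return 1 + key % (SIZE - 2)   # always in 1..5, never zero

def probe_sequence(key):
    # (firstHash(key) + i * secondHash(key)) % sizeOfTable for i = 0, 1, 2, ...
    return [(first_hash(key) + i * second_hash(key)) % SIZE for i in range(SIZE)]

def insert(slots, key):
    for idx in probe_sequence(key):
        if slots[idx] is None:
            slots[idx] = key
            return idx
    raise RuntimeError("table is full")

slots = [None] * SIZE
a = insert(slots, 3)    # first_hash(3) == 3, slot 3 is free
b = insert(slots, 10)   # first_hash(10) == 3 collides; step 1 leads to slot 4
```

Unlike linear probing, two keys with the same home slot usually follow different probe sequences, because the step size depends on the key.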