Hashing in Data Structure

Introduction to Hashing in Data Structure:

Hashing is a popular technique in computer science that involves mapping data of arbitrary size to fixed-length values. It converts an input of variable size into a value of fixed size. The ability to perform efficient lookup operations makes hashing an essential concept in data structures.

What is Hashing?

A hashing algorithm is used to convert an input (such as a string or integer) into a fixed-size output (referred to as a hash code or hash value). The data is then stored and retrieved using this hash value as an index in an array or hash table. The hash function must be deterministic, which guarantees that it will always yield the same result for a given input.

Hashing is commonly used to derive a compact identifier for a piece of data, which can be used to quickly look up that data in a large dataset. Hashing is also used to store passwords securely. For example, a website can store a hash of each user's password instead of the password itself. When a user logs in, the site hashes the password that was entered and compares the result with the stored hash value to authenticate the user.

What is a Hash Key?

In the context of hashing, a hash key (also known as a hash value or hash code) is a fixed-size numerical or alphanumeric representation generated by a hashing algorithm. It is derived from the input data, such as a text string or a file, through a process known as hashing.

Hashing involves applying a specific mathematical function to the input data, which produces a hash key that is typically of fixed length, regardless of the size of the input. The resulting hash key acts as a digital fingerprint of the original data. It is not guaranteed to be unique, because different inputs can occasionally produce the same hash key.

The hash key serves several purposes. It is commonly used for data integrity checks, as even a small change in the input data will produce a significantly different hash key. Hash keys are also used for efficient data retrieval and storage in hash tables or data structures, as they allow quick look-up and comparison operations.

How Does Hashing Work?

The process of hashing can be broken down into three steps:

Input: The data to be hashed is input into the hashing algorithm.
Hash Function: The hashing algorithm takes the input data and applies a mathematical function to generate a fixed-size hash value. The hash function should be designed so that different input values produce different hash values as often as possible, and small changes in the input produce large changes in the output.
Output: The hash value is returned, which is used as an index to store or retrieve data in a data structure.
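
The three steps above can be sketched in C++ as follows. This is a minimal illustration, not a production-quality hash: the names simpleHash and indexFor and the multiplier 31 are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Step 2: a toy hash function (hypothetical, for illustration only).
// It folds every character of the input into a running value.
std::size_t simpleHash(const std::string& input) {
    std::size_t h = 0;
    for (unsigned char c : input) {
        h = h * 31 + c;  // unsigned arithmetic wraps around on overflow
    }
    return h;
}

// Step 3: reduce the hash value to a valid index for a table
// with tableSize slots (tableSize must be greater than zero).
std::size_t indexFor(const std::string& input, std::size_t tableSize) {
    return simpleHash(input) % tableSize;
}
```

Calling indexFor("hello", 8) always returns the same index between 0 and 7, which is the slot where the data for "hello" would be stored or looked up.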

Hashing Algorithms:

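There are numerous hashing algorithms, each with distinct advantages and disadvantages. The algorithms below are cryptographic hash functions, which are mainly used for integrity checks and security; hash tables usually rely on simpler and faster functions. Well-known algorithms include the following:

MD5: A widely used hashing algorithm that produces a 128-bit hash value. It is no longer considered secure against collisions.
SHA-1: A popular hashing algorithm that produces a 160-bit hash value. Practical collision attacks against it are known.
SHA-256: A more secure hashing algorithm that produces a 256-bit hash value.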

Hash Function:

A hash function is a mathematical operation that takes an input (or key) and outputs a fixed-size result known as a hash code or hash value. The hash function must be deterministic: it must always yield the same hash code for the same input. Additionally, a good hash function spreads keys evenly across the range of hash values so that collisions are rare. Collisions cannot be ruled out entirely, because there are usually more possible inputs than hash values.

There are different types of hash functions, including:

Division method:

This method involves dividing the key by the table size and taking the remainder as the hash value. For example, if the table size is 10 and the key is 23, the hash value would be 3 (23 % 10 = 3).
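
A minimal sketch of the division method in C++ follows. The function name divisionHash is chosen for this example only.

```cpp
#include <cstddef>

// Division method: h(k) = k mod m, where m is the table size.
// tableSize must be greater than zero.
std::size_t divisionHash(unsigned long key, std::size_t tableSize) {
    return key % tableSize;
}
```

With the values from the text, divisionHash(23, 10) returns 3. A prime table size is often recommended for this method, because it tends to spread keys that share a common pattern more evenly.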

Multiplication method:

This method involves multiplying the key by a constant between 0 and 1, taking the fractional part of the product, multiplying it by the table size, and rounding down to get the hash value. For example, if the key is 23, the constant is 0.618, and the table size is 10, the hash value would be 2 (floor(10 * (0.618 * 23 - floor(0.618 * 23))) = floor(10 * 0.214) = floor(2.14) = 2).
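
The multiplication method can be sketched as below. The function name multiplicationHash and the default constant of 0.618 (taken from the example in the text) are assumptions for the illustration.

```cpp
#include <cmath>
#include <cstddef>

// Multiplication method: h(k) = floor(m * frac(k * A)),
// where 0 < A < 1 and frac() keeps only the fractional part.
std::size_t multiplicationHash(unsigned long key, std::size_t tableSize,
                               double A = 0.618) {
    double product = key * A;
    double fractional = product - std::floor(product);  // value in [0, 1)
    return static_cast<std::size_t>(std::floor(tableSize * fractional));
}
```

For a key of 23 and a table size of 10, the product is 14.214, the fractional part is 0.214, and multiplicationHash(23, 10) returns 2.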

Universal hashing:

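This method involves picking a hash function at random from a family of hash functions, independently of the keys to be stored. Because the function is not known in advance, no fixed set of inputs is bad for every choice, which makes it hard for an attacker to force a large number of collisions.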
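
One well-known family for integer keys is h(k) = ((a * k + b) mod p) mod m, where p is a prime larger than any key and a and b are chosen at random. The sketch below assumes this family; the names UniversalHash and kPrime and the choice of p = 2^31 - 1 are illustrative assumptions, not something the text prescribes.

```cpp
#include <cstdint>
#include <random>

// Prime 2^31 - 1. Keys are assumed to be smaller than this value.
constexpr std::uint64_t kPrime = 2147483647ULL;

// One randomly chosen member of the family
// h(k) = ((a * k + b) mod p) mod m.
struct UniversalHash {
    std::uint64_t a;  // random multiplier in [1, p - 1]
    std::uint64_t b;  // random offset in [0, p - 1]
    std::uint64_t m;  // table size

    UniversalHash(std::uint64_t tableSize, std::mt19937_64& rng)
        : a(std::uniform_int_distribution<std::uint64_t>(1, kPrime - 1)(rng)),
          b(std::uniform_int_distribution<std::uint64_t>(0, kPrime - 1)(rng)),
          m(tableSize) {}

    // a < 2^31 and key < 2^32, so a * key + b fits in 64 bits.
    std::uint64_t operator()(std::uint32_t key) const {
        return ((a * key + b) % kPrime) % m;
    }
};
```

Each UniversalHash object behaves like an ordinary deterministic hash function once it has been constructed. The randomness lies only in which member of the family is picked.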

Collision Resolution

One of the main challenges in hashing is handling collisions, which occur when two or more input values produce the same hash value. There are various techniques used to resolve collisions, including:

Chaining: In this technique, each hash table slot contains a linked list of all the values that have the same hash value. This technique is simple and easy to implement, but it can lead to poor performance when the linked lists become too long.
Open addressing: In this technique, when a collision occurs, the algorithm searches for an empty slot in the hash table by probing successive slots until an empty slot is found. This technique can be more efficient than chaining when the load factor is low, but it can lead to clustering and poor performance when the load factor is high.
Double hashing: This is a variation of open addressing that uses a second hash function to determine the next slot to probe when a collision occurs. This technique can help to reduce clustering and improve performance.
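
To make double hashing concrete, the sketch below computes the sequence of slots probed for a key. The function names and the particular second hash function, 1 + (key mod (m - 1)), are assumptions chosen for the illustration; other second hash functions are possible as long as the step is never zero.

```cpp
#include <cstddef>

// Primary hash: the home slot of the key. m is the table size (m >= 2).
std::size_t primaryHash(unsigned key, std::size_t m) {
    return key % m;
}

// Secondary hash: the step between probes. It is never zero, and with a
// prime table size every step in [1, m - 1] eventually visits all slots.
std::size_t stepHash(unsigned key, std::size_t m) {
    return 1 + key % (m - 1);
}

// Slot examined on the i-th probe (i = 0 is the home slot).
std::size_t probe(unsigned key, std::size_t i, std::size_t m) {
    return (primaryHash(key, m) + i * stepHash(key, m)) % m;
}
```

With a table size of 7, the keys 10 and 17 share the home slot 3, but their second probes differ: probe(10, 1, 7) is 1 and probe(17, 1, 7) is 2. With linear probing, both keys would have moved on to slot 4.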

Example of Collision Resolution

Consider a hash table with a size of 5. We want to store the key-value pairs "John: 123456" and "Mary: 987654" in the hash table. Suppose both keys produce the same hash code of 4, so a collision occurs.

We can use chaining to resolve the collision. We create a linked list at index 4 and add the key-value pairs to the list. The hash table now looks like this:

4: John: 123456 -> Mary: 987654
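
The chained table above can be sketched in C++. The text does not say which hash function yields index 4; the sketch assumes the sum of the character codes modulo the table size, which maps both "John" and "Mary" to 4. The names asciiSumHash, Bucket, and buildExampleTable are made up for the illustration.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Assumed hash function: sum of character codes modulo the table size.
std::size_t asciiSumHash(const std::string& key, std::size_t tableSize) {
    std::size_t sum = 0;
    for (unsigned char c : key) {
        sum += c;
    }
    return sum % tableSize;
}

// Each slot holds a linked list of key-value pairs.
using Bucket = std::list<std::pair<std::string, int>>;

// Builds the 5-slot table from the example.
std::vector<Bucket> buildExampleTable() {
    std::vector<Bucket> table(5);
    table[asciiSumHash("John", 5)].emplace_back("John", 123456);
    table[asciiSumHash("Mary", 5)].emplace_back("Mary", 987654);
    return table;
}
```

After buildExampleTable() returns, slots 0 to 3 are empty and slot 4 holds the list John -> Mary.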

Hash Table:

A hash table is a data structure that stores data in an array. Typically, the array size is chosen to be larger than the number of elements expected to be stored, so that collisions stay rare. A key is mapped to an index in the array using the hash function.

To add a new element, the hash function is used to compute the index where the element should be inserted. If there is no collision, the element is stored at that index. If there is a collision, the collision resolution method determines where the element goes: it is appended to the slot's linked list (if chaining is used) or placed in the next available slot (if open addressing is used).

To retrieve an element, the hash function is used to compute the index where the element should be stored. If the element is not found directly at that index, the collision resolution method is used to continue the search in the linked list (if chaining is used) or in the following slots of the probe sequence (if open addressing is used).

Hash Table Operations

There are several operations that can be performed on a hash table, including:

Insertion: Inserting a new key-value pair into the hash table.
Deletion: Removing a key-value pair from the hash table.
Search: Searching for a key-value pair in the hash table.
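
The three operations can be sketched with a small chained table. This is an illustrative sketch, not a reference implementation: the class name ChainedTable, its method names, and the default of 16 buckets are assumptions, and std::hash stands in for whatever hash function a real table would use.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount = 16)
        : buckets_(bucketCount) {}

    // Insertion: add the pair, or overwrite the value if the key exists.
    void insert(const std::string& key, int value) {
        auto& bucket = buckets_[indexOf(key)];
        for (auto& entry : bucket) {
            if (entry.first == key) {
                entry.second = value;
                return;
            }
        }
        bucket.emplace_back(key, value);
    }

    // Search: pointer to the stored value, or nullptr if the key is absent.
    const int* search(const std::string& key) const {
        for (const auto& entry : buckets_[indexOf(key)]) {
            if (entry.first == key) {
                return &entry.second;
            }
        }
        return nullptr;
    }

    // Deletion: returns true if a pair was removed.
    bool remove(const std::string& key) {
        auto& bucket = buckets_[indexOf(key)];
        for (auto it = bucket.begin(); it != bucket.end(); ++it) {
            if (it->first == key) {
                bucket.erase(it);
                return true;
            }
        }
        return false;
    }

private:
    std::size_t indexOf(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

All three operations first compute the bucket index and then scan only that bucket, so their cost depends on the length of one chain rather than on the total number of stored pairs.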

Creating a Hash Table:

Hashing is frequently used to build hash tables, which are data structures that enable quick data insertion, deletion, and retrieval. A hash table consists of an array of buckets, and each bucket can store one or more key-value pairs.

To create a hash table, we first need to define a hash function that maps each key to an index in the array. A simple hash function might be to take the sum of the ASCII values of the characters in the key and use the remainder when divided by the size of the array. However, this hash function distributes keys poorly and can lead to many collisions (two keys that map to the same index). For example, any two keys made up of the same characters in a different order always collide.

To reduce collisions, we can use more advanced hash functions that produce a more even distribution of hash values across the array. One popular algorithm is the djb2 hash function, which uses a shift and additions to generate a hash value:
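
The weakness of the ASCII-sum approach is easy to demonstrate. The sketch below implements the simple hash function described above; the name sumHash is an assumption for the example.

```cpp
#include <cstddef>
#include <string>

// Sum of the character codes, reduced modulo the table size.
// The order of the characters has no influence on the result.
std::size_t sumHash(const std::string& key, std::size_t tableSize) {
    std::size_t sum = 0;
    for (unsigned char c : key) {
        sum += c;
    }
    return sum % tableSize;
}
```

Because addition ignores order, sumHash("listen", 100) and sumHash("silent", 100) return the same index, as do any other two keys that are anagrams of each other.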

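unsigned long hash(const char* str) {
    unsigned long hash = 5381;
    int c;

    /* Read each character as unsigned so values above 127 stay positive. */
    while ((c = (unsigned char) *str++) != 0) {
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }

    return hash;
}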

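This hash function takes a string as input and returns an unsigned long integer hash value. The function initializes a hash value of 5381 and then iterates over each character in the string. For every character, it multiplies the current hash value by 33 (computed as a left shift by 5 plus one addition) and adds the character code. The final hash value is returned. To use it as an array index, the result still has to be reduced modulo the table size.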

Hash Tables in C++

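In C++, the standard library provides a hash table container class called unordered_map. The unordered_map container is implemented using a hash table and provides fast access to key-value pairs. The unordered_map container uses a hash function (std::hash by default) to calculate the hash code of the keys and places each element in a bucket. Common standard library implementations resolve collisions with separate chaining, where each bucket holds a list of elements, rather than with open addressing.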

To use the unordered_map container in C++, you need to include the <unordered_map> header file. Here's an example of how to create an unordered_map container in C++:

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    // create an unordered_map container
    std::unordered_map<std::string, int> my_map;

    // insert some key-value pairs into the map
    my_map["apple"] = 10;
    my_map["banana"] = 20;
    my_map["orange"] = 30;

    // print the value associated with the "banana" key
    std::cout << my_map["banana"] << std::endl;

    return 0;
}

Explanation:

This program demonstrates the use of the unordered_map container in C++, which is implemented using a hash table and provides fast access to key-value pairs.
First, the program includes the necessary header files, including <iostream> and <unordered_map>.
Then, the program creates an empty unordered_map container called my_map, which has string keys and integer values. This is done using the syntax std::unordered_map<std::string, int> my_map;
Next, the program inserts three key-value pairs into the my_map container using the [] operator: "apple" with a value of 10, "banana" with a value of 20, and "orange" with a value of 30.
This is done using the syntax my_map["apple"] = 10;, my_map["banana"] = 20;, and my_map["orange"] = 30; respectively.
Finally, the program prints the value associated with the "banana" key using the [] operator and the std::cout object.

Program Output:

20
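
Beyond the [] operator, unordered_map offers member functions such as find, erase, and count. The helper functions below are a small sketch of how they might be used; the names lookupOr and removeIfPresent are made up for this example and are not part of the standard library.

```cpp
#include <string>
#include <unordered_map>

// Returns the value stored under key, or fallback if the key is absent.
// Unlike operator[], find() never inserts a new entry on a miss.
int lookupOr(const std::unordered_map<std::string, int>& map,
             const std::string& key, int fallback) {
    auto it = map.find(key);
    return it == map.end() ? fallback : it->second;
}

// Removes key if it is present and reports whether anything was removed.
// erase(key) returns the number of elements erased (0 or 1 here).
bool removeIfPresent(std::unordered_map<std::string, int>& map,
                     const std::string& key) {
    return map.erase(key) > 0;
}
```

Using find for lookups avoids a common surprise: reading a missing key with my_map["kiwi"] would silently insert "kiwi" with a value of 0.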

Inserting Data into a Hash Table

To insert a key-value pair into a hash table, we first need to compute the hash value of the key and use it as an index into the array to store the key-value pair. If another key maps to the same index, we have a collision and need to handle it appropriately. One common method is to use chaining, where each bucket in the array contains a linked list of key-value pairs that have the same hash value.

Here is an example of how to insert a key-value pair into a hash table using chaining:

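#include <stdlib.h>

/* hash() is the djb2 function shown earlier; it must be declared before this code. */

typedef struct node {
    char* key;          /* not copied: the caller must keep the string alive */
    int value;
    struct node* next;  /* next entry in the same bucket */
} node;

node* hash_table[100];  /* file-scope array, so every bucket starts as NULL */

void insert(char* key, int value) {
    unsigned long hash_value = hash(key) % 100;
    node* new_node = (node*) malloc(sizeof(node));
    if (new_node == NULL) {
        return;  /* allocation failed; the table is left unchanged */
    }
    new_node->key = key;
    new_node->value = value;
    new_node->next = NULL;

    if (hash_table[hash_value] == NULL) {
        /* empty bucket: the new node becomes the head of the list */
        hash_table[hash_value] = new_node;
    } else {
        /* collision: walk to the last node and append the new one */
        node* curr_node = hash_table[hash_value];
        while (curr_node->next != NULL) {
            curr_node = curr_node->next;
        }
        curr_node->next = new_node;
    }
}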

Explanation:

First, a struct called node is defined, which represents a single node in the hash table.
Each node has three members: a char* key to store the key, an int value to store the associated value, and a pointer to another node called next to handle collisions in the hash table using a linked list.
An array of node pointers called hash_table is declared with a size of 100. This array will be used to store the elements of the hash table.
The insert function takes a char* key and an int value as parameters.
It starts by computing the hash value for the given key using the hash function, which is assumed to be defined elsewhere in the program.
The hash value is then reduced to fit within the size of the hash_table array using the modulus operator % 100.
A new node is created using dynamic memory allocation (malloc(sizeof(node))), and its members (key, value, and next) are assigned with the provided key, value, and NULL, respectively.
If the corresponding slot in the hash_table array is empty (NULL), indicating no collision has occurred, the new node is assigned to that slot directly (hash_table[hash_value] = new_node).

However, if there is already a node present at that index in the hash_table array, the function needs to handle the collision. It traverses the linked list starting from the current node (hash_table[hash_value]) and moves to the next node for as long as one exists (curr_node->next != NULL). Once the end of the list is reached, the new node is appended as the next node (curr_node->next = new_node). Note that the function does not check whether the key is already present, so inserting the same key twice creates two nodes.

Implementation of Hashing in C++:

Let's see an implementation of hashing in C++ using open addressing and linear probing for collision resolution. We will implement a hash table that stores integers.

#include <iostream>
using namespace std;

const int SIZE = 10;
const int EMPTY = -1;    // slot has never held a key
const int DELETED = -2;  // slot held a key that was later removed

// Hash table with open addressing and linear probing.
// Keys must be non-negative because -1 and -2 are reserved as markers.
class HashTable {
private:
    int arr[SIZE];

public:
    HashTable() {
        for (int i = 0; i < SIZE; i++) {
            arr[i] = EMPTY;
        }
    }

    int hashFunction(int key) {
        return key % SIZE;
    }

    void insert(int key) {
        int index = hashFunction(key);
        int target = -1;  // first free slot seen while probing

        // Probe at most SIZE slots so the loop ends even if the table is full.
        for (int i = 0; i < SIZE; i++) {
            int pos = (index + i) % SIZE;
            if (arr[pos] == key) {
                cout << "Element already exists in the hash table" << endl;
                return;
            }
            if (target == -1 && (arr[pos] == EMPTY || arr[pos] == DELETED)) {
                target = pos;
            }
            if (arr[pos] == EMPTY) {
                break;  // the key cannot be stored beyond a never-used slot
            }
        }

        if (target == -1) {
            cout << "Hash table is full" << endl;
        } else {
            arr[target] = key;
        }
    }

    void remove(int key) {
        int index = hashFunction(key);

        for (int i = 0; i < SIZE; i++) {
            int pos = (index + i) % SIZE;
            if (arr[pos] == EMPTY) {
                break;  // reached a never-used slot: the key is not present
            }
            if (arr[pos] == key) {
                arr[pos] = DELETED;
                cout << "Element deleted from the hash table" << endl;
                return;
            }
        }

        cout << "Element not found in the hash table" << endl;
    }

    void display() {
        for (int i = 0; i < SIZE; i++) {
            if (arr[i] == EMPTY || arr[i] == DELETED) {
                continue;
            }
            cout << "Index " << i << ": " << arr[i] << endl;
        }
    }
};

int main() {
    HashTable ht;
    ht.insert(5);
    ht.insert(15);
    ht.insert(25);
    ht.insert(35);
    ht.insert(45);
    ht.display();
    ht.remove(15);
    ht.display();
    ht.remove(10);
    ht.display();
    ht.insert(55);
    ht.display();

    return 0;
}

Explanation:

This program implements a hash table data structure using linear probing to handle collisions.
A hash table is a data structure that stores data in key-value pairs, where the keys are hashed using a hash function to generate an index in an array. This allows for constant-time average-case complexity for inserting, searching, and deleting elements from the hash table.
The HashTable class has a private integer array arr of size SIZE, which is initialized to -1 in the constructor. The hash function method takes an integer key and returns the hash value of the key, which is simply the remainder of the key when divided by SIZE.
The insert method takes an integer key and uses the hash function to get the index where the key should be inserted.
If the index is already occupied by another key, linear probing checks the following indices, wrapping around at the end of the array, until it finds a free slot or the key itself.
If the key is already in the hash table, the method displays a message saying that the element already exists. Otherwise, it inserts the key into the free slot that was found.
The remove method takes an integer key and uses the hash function to get the index where the search for the key starts.
If the key is not at the calculated index, linear probing is used to search for the key in the following indices of the array. Once the key is found, it is deleted by setting its slot to -2. This marker for a deleted entry, unlike -1, does not stop later searches that probe past the slot.
If the key is not found in the hash table, the method displays a message saying that the element is not found.
The display method simply iterates through the array and prints every stored key together with its index.
In the main function, an instance of the HashTable class is created, and several integers are inserted into the hash table using the insert method.
Then, the display method is called to show the contents of the hash table. The remove method is called twice, first to remove an element that exists in the hash table and then to remove an element that does not exist.
The display method is called after each remove method call to show the updated contents of the hash table.
Finally, another integer is inserted into the hash table, and the display method is called again to show the final contents of the hash table.

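Program Output:

Index 5: 5
Index 6: 15
Index 7: 25
Index 8: 35
Index 9: 45
Element deleted from the hash table
Index 5: 5
Index 7: 25
Index 8: 35
Index 9: 45
Element not found in the hash table
Index 5: 5
Index 7: 25
Index 8: 35
Index 9: 45
Index 5: 5
Index 6: 55
Index 7: 25
Index 8: 35
Index 9: 45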
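
The program above has no search operation. A lookup for a linear-probing table could be sketched as follows. The function name findSlot and the constant names are assumptions for this example; the marker values -1 and -2 and the table size of 10 mirror the program above.

```cpp
#include <array>

const int TABLE_SIZE = 10;    // same size as SIZE in the program above
const int EMPTY_SLOT = -1;    // slot has never held a key
const int DELETED_SLOT = -2;  // slot held a key that was removed

// Returns the index at which key is stored, or -1 if it is absent.
// Keys are assumed to be non-negative.
int findSlot(const std::array<int, TABLE_SIZE>& slots, int key) {
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; i++) {
        int pos = (home + i) % TABLE_SIZE;
        if (slots[pos] == key) {
            return pos;
        }
        if (slots[pos] == EMPTY_SLOT) {
            return -1;  // a never-used slot ends the probe sequence
        }
        // A deleted slot does not end the search: keep probing.
    }
    return -1;  // every slot was examined without finding the key
}
```

The search must skip over deleted slots instead of stopping at them. Otherwise, removing 15 from slot 6 would make 25, which sits one slot further along the same probe sequence, unreachable.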

Applications of Hashing

Hashing has many applications in computer science, including:

Databases: Hashing is used to index and search large databases efficiently.
Cryptography: Hash functions are used to generate message digests, which are used to verify the integrity of data and protect against tampering.
Caching: Hash tables are used in caching systems to store frequently accessed data and improve performance.
Spell checking: Hashing is used in spell checkers to quickly search for words in a dictionary.
Network routing: Hashing is used in load balancing and routing algorithms to distribute network traffic across multiple servers.
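
As a small illustration of the load-balancing case, a request can be assigned to a server by hashing an identifier of the client and reducing the result modulo the number of servers. The sketch below is a simplified assumption of how this might look; the function name pickServer and the use of std::hash are choices made for the example, and real systems often use more elaborate schemes.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Maps a client identifier (for example an IP address) to one of
// serverCount servers. serverCount must be greater than zero.
std::size_t pickServer(const std::string& clientId, std::size_t serverCount) {
    return std::hash<std::string>{}(clientId) % serverCount;
}
```

Within one run of the program, the same client identifier always maps to the same server. The values produced by std::hash are implementation-defined, so a system that needs stable assignments across machines or restarts would need a hash function with a fixed specification.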

Advantages of Hashing:

Fast Access: Hashing provides constant time access to data by key on average, making key lookups faster than searching through unsorted linked lists and arrays. In the worst case, when many keys collide, access can degrade to linear time.
Efficient Search: Hashing allows for quick search operations, making it an ideal data structure for applications that require frequent search operations.
Space-Efficient: Hashing can be reasonably space-efficient, since the table size is proportional to the number of stored elements. However, hash tables usually keep some slots empty to limit collisions, so they trade a little memory for speed.

Limitations of Hashing:

Hash Collisions: Hashing can produce the same hash value for different keys, leading to hash collisions. To handle collisions, we need to use collision resolution techniques like chaining or open addressing.
Hash Function Quality: The quality of the hash function determines the efficiency of the hashing algorithm. A poor-quality hash function can lead to more collisions, reducing the performance of the hashing algorithm.

Conclusion:

In conclusion, hashing is a widely used technique in data structures that provides efficient access to data. It involves mapping a large amount of data to a smaller hash table using a hash function, which reduces the amount of time needed to search for specific data elements. The hash function determines where each item is stored within the hash table and allows for easy retrieval of data when needed.

Hashing has several advantages over other data structure techniques, such as faster average retrieval times and reasonable use of memory. However, it also has some limitations, including the possibility of hash collisions and the need for a good hash function that can distribute data evenly across the hash table.

Overall, hashing is a powerful technique that is used in many applications, including database indexing, spell-checking, and password storage. By using a good hash function and implementing appropriate collision resolution techniques, developers can optimize the performance of their applications and provide users with a seamless experience.
