Hashing Tricks

🤔What is Hashing?

Hashing is a process in computer science used to uniquely identify data and map it to a specific location in data structure, typically an array or hash table. The concept revolves around transforming data (like numbers, strings, etc.) into a fixed-size numerical value, called a hash value or hash code, through a special function known as a hash function. This hash value or hashcode then points to where the data is stored in a structure called a hash table.

It allows quick access mapping data to a specific index in a hash table using a hash function. The main goal of hashing is to make data retrieval fast, even when dealing amounts of information. Hashing is widely used in various application, such as searching databases, managing passwords, and speeding up data lookups in many types of software.

List  = [5, 14, 22, 33, 56]
H(x) = x % 10

            -------------------------------------------
Hash Table  |     |     |     |     |     |     |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6


First Element of List -> 5
            5 % 10 = 5, it is mapped to location 5
            -------------------------------------------
Hash Table  |     |     |     |     |     |  5  |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6
    
Second Element of List -> 14
            14 % 10 = 4, it is mapped to location 4
            -------------------------------------------
Hash Table  |     |     |     |     | 14  |  5  |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6

Third Element of List -> 22
            22 % 10 = 2, it is mapped to location 2
            -------------------------------------------
Hash Table  |     |     | 22 |     |  14  |  5  |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6

Fourth Element of List -> 33
            33 % 10 = 3, it is mapped to location 3
            -------------------------------------------
Hash Table  |     |     | 22  | 33  | 14  |  5  |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6

Fifth Element of List -> 56
            56 % 10 = 6, it is mapped to location 6
            -------------------------------------------
Hash Table  |     |     | 22  | 33  | 14  | 56  |     |
            -------------------------------------------
     index     0     1     2     3     4     5     6

Key Concepts of Hashing

  1. Hash Function:
    • A hash function takes an input (or "key") and returns a fixed-size string or number, which is the hash value.
    • The hash function is designed to minimize collisions (where two different inputs produce the same output) and distribute values evenly across the hash table.
    • Example: A simple hash function could be something like hash(x) = x % 10, which maps a number to one of the ten buckets (indices 0-9).
  2. Hash Code:
    • The hash function crunches the data and give a unique hash code. This hash code is typically integer value that can be used an index.
  3. Hash Table:
    • A hash table is a data structure that implements an associative array, which can map keys to values.
    • The hash table uses the hash function to compute an index into an array of buckets or slots, where the desired value can be found.
    • This allows for fast data retrieval since the data's location is directly calculated.
  4. Collisions:
    • A collision occurs when two different inputs to the hash function produce the same hash value. This can be handled by various techniques such as chaining (using linked lists to store multiple items at the same index) or open addressing (finding another open slot in the array).
  5. Applications of Hashing:
    • Data Retrieval: Hashing allows for quick data retrieval. For example, finding a record in a database can be done in O(1) time if hashing is used.
    • Checking Uniqueness: Hashing is often used in algorithms to quickly check for duplicates or uniqueness, such as in detecting anagrams or duplicates in a list.
    • Password Storage: In security, passwords are often hashed and stored rather than storing the actual passwords.

Example of a Simple Hash Function

Consider a scenario where we have a list of integers and we want to check if an integer is present in the list. Instead of searching through the entire list, which takes O(n) time, we can use a hash function to map each integer to a specific location in an array, making the search process O(1).

Example hash function:

int hashFunction(int key, int arraySize) {
    return key % arraySize; // Simple modulus-based hash function
}

Visualization of Hashing:

Imagine you have a set of keys (like numbers or words) that you want to store in a table. Instead of searching through the entire set each time, you use a hash function to determine where each key should go in the table.

For example:

  • Keys: {12, 44, 13, 88, 23, 94, 11, 39, 20, 16}
  • Hash Function: h(x) = x % 10
  • Hash Table of size 10

When you insert these keys using the hash function:

  • 12 % 10 = 2 → Store 12 at index 2
  • 44 % 10 = 4 → Store 44 at index 4
  • 13 % 10 = 3 → Store 13 at index 3
  • 88 % 10 = 8 → Store 88 at index 8
  • ... and so on.

If you need to find whether 13 is in the table, simply compute 13 % 10 = 3 and check index 3.

Which Data Structure to use:

We can use either simple array or the vector (dynamic array). Both serves the same purpose.

Key Points on Simple Array vs. Vector

  1. Simple Array:
    • Fixed Size: Simple arrays have a fixed size, which must be known at compile time or specified when creating the array.
    • Memory Efficiency: Because they have a fixed size, simple arrays can be slightly more memory-efficient, avoiding any overhead associated with dynamic resizing.
    • Static Allocation: Arrays are statically allocated in memory, which can be beneficial in low-level programming or when working with performance-critical code.
  2. Vector (Dynamic Array):
    • Dynamic Size: Vectors can dynamically resize themselves as elements are added or removed. This makes them more flexible when the size of the data is not known in advance.
    • Ease of Use: Vectors provide a number of convenient member functions, such as push_back(), size(), and automatic memory management, making them easier to use in many cases.
    • Bounds Checking: Vectors in C++ provide bounds checking with the at() method, which helps in preventing out-of-bounds errors.
    • Dynamic Allocation: Vectors are dynamically allocated on the heap, which can handle larger data sizes than stack-allocated arrays.

In STL C++

C++ STL library provides us two containers unordered_map and unordered_set, which are based on hashmap.

🥇 std::unordered_map:

  • Description:unordered_map is an associative container that stores elements formed by a combination of a key value and a mapped value. The elements are stored in no particular order, but it allows fast retrieval of individual elements based on their keys.
  • Key Characteristics:
    • Provides average O(1) time complexity for insertions, deletions, and lookups.
    • Keys must be unique; if you try to insert a key that already exists, it will either update the value associated with the key or the insertion will fail depending on the method used.
    • Allows you to use custom hash functions and equality predicates.
  • Complexity:
    • Offers an average time complexity of O(1) for insertions, deletions, and lookups. However, in the worst-case scenario, typically due to hash collisions, the time complexity can degrade to O(N).
  • Use Case: Ideal when you need fast access to elements using keys and don't care about the order of elements.
std::unordered_map<std::string, int> umap;
umap["apple"] = 2;
umap["banana"] = 3;
umap["orange"] = 5;

🥈std::unordered_set:

  • Description:unordered_set is an associative container that contains a set of unique elements. Like unordered_map, it stores the elements in no particular order, but it provides fast retrieval based on the element's value, which also serves as the key.
  • Key Characteristics:
    • Provides average O(1) time complexity for insertions, deletions, and lookups.
    • Only stores unique elements; duplicate values are not allowed.
    • Useful for situations where you need to check the existence of an element or perform operations on sets of unique elements.
  • Complexity:
    • Implements a balanced binary tree, offering a guaranteed time complexity of O(log N) for all operations, regardless of collisions.
  • Use Case: Ideal for situations where you need a collection of unique elements and need fast checks for existence, such as in hashing problems or checking for duplicates.
std::unordered_set<std::string> uset;
uset.insert("apple");
uset.insert("banana");
uset.insert("orange");

Array Hashing

Array hashing is a technique used to efficiently count the frequency of elements in an array or to perform other operations like finding duplicates, checking for specific patterns, and more. The core idea is to use an additional array (often called a hash table or frequency array) where each index corresponds to a specific value in the input array, allowing for constant-time access and update operations.

Hashing using an array, also known as direct addressing, is a simple yet effective way to store and retrieve data efficiently. In this approach, the array itself serves as a hash table, and each index of the array represents a unique key. This method is particularly efficient when the set of possible keys is small and the array can be directly indexed using these keys.

Concept

The idea behind using an array for hashing is straightforward:

  • You have a direct mapping between keys and indices in the array.
  • Each possible key corresponds to a unique index in the array.
  • The value associated with each key is stored at the corresponding index.

Example:

Imagine you want to store and quickly access information about people's ages. The possible ages range from 0 to 120. You can create an array of size 121 (from index 0 to 120), where each index represents an age, and the value at that index represents the count or some other information associated with that age.

How Array Hashing Works

  1. Hash Function: A hash function takes an input (array element) and returns an integer (hash code). The hash code determines the index in the hash table where the value will be stored.
  2. Hash Table: The hash table is an array of fixed size where each index corresponds to a hash code. Collisions (when two elements map to the same index) are handled using techniques like chaining or open addressing.

Complications:

The most difficult part of the simple array hashing is determining the range of the array.

Suppose we have the following list, and we need to find the most frequent element:

list: [1, 5, 5, 4, 3, 88]

Here, since the larger number is 88, then we have to use array of size large enough. So we would need an array of size 89, which can have range from 0 to 88.

#include <iostream>
#include <vector>
#include <climits> // for INT_MIN

int main() {
    int arr[] = {1, 5, 5, 4, 3, 88};
    int n = sizeof(arr) / sizeof(arr[0]);

    // Step 1: Determine the range and create the hash table
    const int MAX_ELEMENT = 88; // Max value in the array
    int hashTable[MAX_ELEMENT + 1] = {0}; // Initialize all frequencies to 0

    // Step 2: Populate the hash table with frequencies
    for (int i = 0; i < n; ++i) {
        hashTable[arr[i]]++;
    }

    // Step 3: Find the most frequent element
    int maxFrequency = INT_MIN;
    int mostFrequentElement = -1;

    for (int i = 0; i <= MAX_ELEMENT; ++i) {
        if (hashTable[i] > maxFrequency) {
            maxFrequency = hashTable[i];
            mostFrequentElement = i;
        }
    }

    // Step 4: Output the result
    std::cout << "The most frequent element is: " << mostFrequentElement
              << " with a frequency of " << maxFrequency << std::endl;

    return 0;
}

So the simple array hashing is mostly used where the range is smaller. like we have given elements of range from 1 to 10 or characters - since there are limited characters available in ASCII, which we will study in next section Character Hashing.

Hashing in Array Operations: Counting Occurrences

In the context of arrays, hashing provides a highly efficient method for solving problems like counting the frequency of elements. Consider the array arr[] = {5, 6, 5, 6, 9, 6}. To determine how many times the number 6 appears in this array, there are several approaches that can be taken.

Method 1: Basic Traversal and Counting

The simplest method involves traversing the entire array and counting the occurrences of the number 6. Although this method is straightforward, it is not optimal for large datasets as it requires a complete traversal of the array.

#include<bits/stdc++.h>
using namespace std;

int main() {
    int arr[] = {5, 6, 5, 6, 9, 6};
    int count = 0;
    for(int i = 0; i < 6; i++) {
        if(arr[i] == 6) {
            count++;
        }
    }
    cout << count << endl;  // Output: 3
    return 0;
}
Method 2: Hashing for Efficient Counting

A more efficient approach involves using hashing. Here, the array is hashed into another array, often called a hash table, where the index represents the element value and the content at that index represents the count of occurrences. This method allows the counting operation to be completed in a single iteration of the array, making it highly efficient.

#include<bits/stdc++.h>
using namespace std;

int main() {
    int arr[] = {5, 6, 5, 6, 9, 6};
    int hashTable[10] = {0};
    for(int i = 0; i < 6; i++) {
        hashTable[arr[i]]++;
    }
    cout << hashTable[6] << endl;  // Output: 3
    return 0;
}

Character Hashing

Character hashing is a powerful technique used in many programming problems to count the frequency of characters in a string. This technique allows us to efficiently solve problems involving character analysis, such as determining the uniqueness of characters, finding duplicates, checking for anagrams, or simply counting the occurrence of each character.

Character hashing is a specialized form of hashing used to efficiently store and retrieve data related to individual characters, particularly within a given character set such as ASCII. This technique is invaluable in various applications, from counting character frequencies to implementing efficient lookups in text processing tasks

In character hashing technique we use a hash table (often using a simple array) to store the frequencies of characters in a string.

The characters are stored in computer memory as the numeric values using a standard encoding system. This numeric representation allows the computer to efficiently process, store, and manipulate text data.

Character Encoding:
  • ASCII (American Standard Code for Information Interchange):
    • The ASCII is a character encoding standard that assigns numerical values to characters, with lowercase letter a through z being assigned values from 97 to 122.
    • One of the earliest and simplest encoding systems.
    • Each character (including letters, digits, and symbols) is represented by a unique 7-bit binary number, which is usually stored as 8 bits (1 byte) in modern systems.
    • For example:
      • 'A' is represented by the binary number 01000001, which is 65 in decimal.
      • 'a' is represented by 01100001, which is 97 in decimal.

        For instance, if there is a need to increment a value associated with the character 'a', the corresponding operation in terms of its ASCII value would be:
        
        hash['a']++ translates to hash[97]++
        
        
        This direct mapping between characters and their ASCII values allows for the creation of a hash table where each index corresponds to a specific character. When dealing with only lowercase alphabets, this hash table can be efficiently utilized by reducing the character’s ASCII value relative to 'a'.
  • Extended ASCII:
    • Extends the original 7-bit ASCII to 8 bits, allowing for 256 possible characters. This includes additional symbols, accented letters, and control characters.
  • Unicode:
    • A more comprehensive encoding system designed to represent characters from all languages and symbols.
    • UTF-8: A variable-width encoding that uses 1 to 4 bytes per character. It's backward compatible with ASCII for the first 128 characters.
    • UTF-16: Uses 2 or 4 bytes per character.
    • UTF-32: Uses 4 bytes for all characters, providing direct access to any Unicode character but requiring more memory.
Storing Characters in Memory:
  • When a character is stored in memory, it is saved as its corresponding binary (or hexadecimal) value according to the encoding scheme used by the system.
  • For example, storing the character 'A' in memory involves saving the byte 01000001 (or 0x41 in hexadecimal) at a specific memory address.
Examples:
  • Storing the word "Hello":
    • In ASCII:
      • 'H' = 01001000 (0x48)
      • 'e' = 01100101 (0x65)
      • 'l' = 01101100 (0x6C)
      • 'l' = 01101100 (0x6C)
      • 'o' = 01101111 (0x6F)
    • The word "Hello" would be stored as a sequence of these bytes in memory.

What is Character Hashing?

Character hashing involves using a data structure, typically an array or hash map, to store the frequency of each character in a given character set (like lowercase letters, uppercase letters, digits, or any set of characters). The term "hashing" here refers to the idea of mapping characters to an index in an array or a key in a map. In character hashing we map character to indices.

Key Idea: For every character in the input string, calculate its hash (or index) and update its frequency in the array or map.

Basic Concept of Character Hashing:

Each character is mapped to a unique index in an array. The size of the array is usually based on the character set you are dealing with. Character hashing is based on the concept of direct addressing using an array where:

  • Index Calculation: Each character is mapped to a unique index. For example, in an array freq of size 26, representing lowercase English letters ('a' to 'z'), the index for 'a' can be calculated as 0 (using the formula ch - 'a' where ch is 'a') and 'z' as 25.
  • Frequency Counting: The array stores the count (frequency) of each character at the computed index. If the input character is 'c', the index is 2 ('c' - 'a'), and freq[2] is incremented to reflect the occurrence of 'c'.

Understand the Character Set

  • Lowercase Letters Only (a-z): If the problem statement guarantees that the input string contains only lowercase English letters ('a' to 'z'), you can use an array of size 26. This optimization helps save memory. Each index directly corresponds to a letter.

    int freq[26] = {0};  // For 'a' to 'z'
    for (char ch : str) {
        freq[ch - 'a']++;
    }
    
    /*
    	Here,
    	'a' maps to index 0
    	'b' maps to index 1
    	'c' maps to index 2
    	...
    	...
    	'z' maps to index 25
    */
    
    /*
    	Hash Function:
    	A simple hash function for characters is to subtract the ASCII value of the character 'a' from the ASCII value of the chracter in question:
    	index = ASCII value of character - ASCII value of 'a'
    	This gives a unique index for each lowercase letter.
    */
  • Uppercase and Lowercase Letters (A-Z, a-z): If the string includes both uppercase ('A' to 'Z') and lowercase letters, use an array of size 52 or normalize the case (e.g., converting everything to lowercase or uppercase).

    int freq[52] = {0};  // For 'A' to 'Z' and 'a' to 'z'
    for (char ch : str) {
        if (ch >= 'a' && ch <= 'z') freq[ch - 'a']++;
        else if (ch >= 'A' && ch <= 'Z') freq[ch - 'A' + 26]++;
    }
    
  • Full ASCII Set: If you need to consider all ASCII characters (0 to 255), use an array of size 256.

    int freq[256] = {0};  // For all ASCII characters
    for (char ch : str) {
        freq[ch]++; // here we don't need to subtract anything, since
        			// it accommodates all the ascii characters.
    }
    
  • Unicode Characters: For problems involving Unicode characters, use a different approach, such as a map or unordered_map in C++, to handle a larger range of characters efficiently.

    std::unordered_map<char, int> freq;
    for (char ch : str) {
        freq[ch]++;
    }

Example:

Consider a simple example to demonstrate character hashing. We want to count the frequency of each lowercase letter in the string "example".

1 Initialize a Frequency Array:
  • We create an array freq of size 26 initialized to zero.
freq = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
         a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
2 Process Each Character:
  • For each character in the string "example", calculate its corresponding index in the frequency array and increment that index.
  • Character e:
    • Index calculation: e - ‘a’ = 4
    • Update Frequency Array: Increment freq[4] by 1.

      freq = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
               a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
      
  • Character x:
    • Index Calculation: x - ‘a’ = 23
    • Update Frequency Array: Increment freq[23] by 1.

      freq = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
               a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
      
  • Continue this process for all characters:
  • 'a' increments freq[0].
  • 'm' increments freq[12].
  • 'p' increments freq[15].
  • 'l' increments freq[11].
  • 'e' increments freq[4] again (now freq[4] becomes 2).

Final frequency array:

freq = [1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
         a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
Result:
  • The frequency array shows the count of each letter in the string "example":
    • a: 1, e: 2, l: 1, m: 1, p: 1, x: 1.

Tips & Tricks for the ASCII character

1️⃣ Initialize the Frequency Array Correctly:

Always initialize the frequency array to zero to avoid garbage values, which can lead to incorrect results.

int freq[26] = {0};  // Initializes all elements to 0

2️⃣ Efficient Index Calculation:

When using a character array for hashing, compute the index efficiently.

For example, for lowercase letters:

freq[ch - 'a']++;

This formula converts the character ch to an index in the range [0, 25].

3️⃣ Handle Case Sensitivity Properly:

If the problem is case-insensitive, consider converting all character to lowercase (or uppercase) before updating the frequency array:

ch = tolower(ch);  // Convert character to lowercase
freq[ch - 'a']++;

4️⃣ Edge Cases to Consider:

  • Empty Strings: Always handle empty input strings to avoid accessing invalid indices or performing unnecessary operations.
  • Non-Alphabetic Characters: If input strings can contain digits or special characters, ensure your code accounts for these or filters them out.

5️⃣ Use Hash Maps for Flexibility:

When the range of characters is unknown or if the input includes diverse characters (like symbols, digits, or extended ASCII), using a map or unordered_map is more flexible than a fixed-size array:

std::unordered_map<char, int> freq;
for (char ch : str) {
    freq[ch]++;
}