Updated on 06 Mar, 2026

When you try to sign up on platforms like Instagram, Facebook, Google, or LinkedIn, you must have noticed a common feature:

As soon as you type a username, within a fraction of a second the interface shows:

✓ Username available

or

✗ Username already taken

And it happens almost instantly.

Have you ever wondered how this works internally?

Because if you think about it, these platforms may have hundreds of millions or even billions of users. Checking username availability should technically require searching through a massive database.

Yet the response is almost immediate.

The First Intuition: Query the Database

The most straightforward implementation might look like this:

Whenever a user types a username, the frontend sends a request to the backend.

The backend runs a query like:

SELECT * FROM users WHERE username = 'thejat' LIMIT 1;

If a row is returned:

username exists → taken

If nothing is returned:

username does not exist → available
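To make this concrete, here is a minimal sketch of that check in Python, using an in-memory SQLite database as a stand-in for the real users table (the table and column names are assumptions for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for the real users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users (username) VALUES ('thejat')")

def is_username_taken(username: str) -> bool:
    # LIMIT 1 lets the database stop at the first match.
    row = conn.execute(
        "SELECT 1 FROM users WHERE username = ? LIMIT 1", (username,)
    ).fetchone()
    return row is not None

print(is_username_taken("thejat"))    # True  -> taken
print(is_username_taken("alex_dev"))  # False -> available
```

The parameterized query (`?`) also avoids SQL injection, which the raw string version above would be vulnerable to.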

At first glance, this seems perfectly reasonable.

However, this approach does not scale well.

Why This Approach Doesn't Scale

Imagine a platform with:

10 million users

Every time someone types a username:

client → backend → database query

Even if the query is optimized, this leads to several problems:

1. Database Load

Millions of users checking usernames simultaneously would generate massive query traffic.

The database would become a bottleneck.

2. Time Complexity

Without optimization, searching is roughly:

O(n)

Where n = number of users.

Even with indexing, the database still needs to perform:

O(log n)

operations.

For massive systems, even logarithmic queries become expensive when performed millions of times per second.

3. Cost of Scaling

To support this load, companies would need:

  • more database replicas
  • more computing power
  • more infrastructure

Which dramatically increases cost and complexity.

So companies avoid hitting the database whenever possible.

Second Idea: Use a Hash Map

A better idea would be to maintain a hash map in memory.

Whenever a new user signs up:

username → true

Example:

{
  "thejat": true,
  "alex_dev": true,
  "coder99": true
}

Now checking a username becomes extremely fast:

hashmap.contains("thejat")

Hash lookups are, on average:

O(1)

Which is very fast.
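As a quick sketch, Python's built-in set is hash-based and shows the idea (in a real service this structure would live in a shared in-memory store, not a single process):

```python
# Hash-based membership: average-case O(1) per lookup.
taken_usernames = {"thejat", "alex_dev", "coder99"}

def is_taken(username: str) -> bool:
    return username in taken_usernames

print(is_taken("thejat"))    # True
print(is_taken("new_user"))  # False
```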

The Problem with Hash Maps

While hash maps solve the speed issue, they introduce a major new problem: memory usage.

Let's estimate.

Assume:

10 million usernames

If the average username length is:

10 characters

Memory usage roughly becomes:

10M * 10 bytes = 100MB

But in reality, hash maps require:

  • extra metadata
  • hash buckets
  • pointers
  • load factors

So memory usage might grow to:

500 MB to 1 GB

Now imagine platforms with:

500 million users

Memory usage becomes enormous.

Keeping such a huge structure fully in RAM is expensive and inefficient.

So while hash maps are fast, they are not memory efficient enough for massive systems.

Enter the Game Changer: Bloom Filters

This is where Bloom filters come into the picture.

A Bloom filter is a space-efficient probabilistic data structure used to check whether an element is possibly in a set or definitely not in a set.

It allows systems to perform extremely fast membership checks using very little memory.

Instead of storing actual usernames, a Bloom filter stores the hash fingerprints of usernames inside a bit array.

This dramatically reduces memory usage.

Core Idea Behind Bloom Filters

A Bloom filter consists of two components:

  1. A bit array
  2. Multiple hash functions

Example bit array:

0 0 0 0 0 0 0 0 0 0 0 0

We also define k hash functions:

h1(x)
h2(x)
h3(x)
...
hk(x)

Each hash function maps the input to a position in the bit array.
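One common way to simulate k independent hash functions is to salt a single strong hash with the function index. A small sketch (the choice of SHA-256 and a 12-bit array matches the example above but is otherwise arbitrary):

```python
import hashlib

M = 12  # size of the bit array (matches the 12-bit example above)
K = 3   # number of hash functions

def bit_positions(value: str, m: int = M, k: int = K) -> list:
    # Salting one hash with the function index i simulates
    # k independent hash functions h1..hk.
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions

# Three indices in the range 0..11; the exact values depend on the hash.
print(bit_positions("TheJat"))
```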

How Insertion Works

Suppose a user registers with username:

TheJat

We run the username through multiple hash functions:

h1(TheJat) → 3
h2(TheJat) → 7
h3(TheJat) → 9

Now we set those positions in the bit array to 1.

Before insertion:

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

After insertion:

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  1  |  0  |  1  |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Notice something important:

The actual username is never stored.

Only bit positions are set.

How Lookup Works

Now suppose another user tries to check:

TheJat

We run the same hash functions again:

h1(TheJat) → 3
h2(TheJat) → 7
h3(TheJat) → 9

Then we check the bit array.

If any of the bits are 0:

username definitely does not exist

If all bits are 1:

username possibly exists
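Putting insertion and lookup together, here is a minimal self-contained sketch of a Bloom filter (a plain Python list stands in for the bit array, and salted SHA-256 for the k hash functions):

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m   # the bit array, all zeros initially

    def _positions(self, value: str):
        # Derive k positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        # Any zero bit -> definitely not present.
        # All ones     -> possibly present (could be a false positive).
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter(m=1000, k=3)
bf.add("TheJat")
print(bf.might_contain("TheJat"))   # True: possibly exists (never a false negative)
print(bf.might_contain("coder99"))  # almost certainly False: definitely not
```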

The Catch: False Positives

Bloom filters are probabilistic.

This means they can sometimes say:

username exists

even though it doesn't actually exist.

This is called a false positive.

But Bloom filters guarantee something extremely important:

False negatives never happen.

Meaning:

If the Bloom filter says the username does not exist, then it is 100% guaranteed to be available.
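The false positive rate is also tunable. For a filter with m bits, k hash functions, and n inserted items, the standard approximation is (1 - e^(-kn/m))^k, which a few lines of Python can evaluate (the m, k, and n values below are illustrative):

```python
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    # After n insertions, each bit is set with probability ~ 1 - e^(-k*n/m);
    # a false positive needs all k probed bits to be set.
    return (1 - math.exp(-k * n / m)) ** k

# Example: 10 million usernames in a filter of 80 million bits (~10 MB), k = 3.
rate = false_positive_rate(m=80_000_000, k=3, n=10_000_000)
print(f"{rate:.1%}")  # roughly 3%
```

Compare those ~10 MB of bits with the hundreds of megabytes the hash map needed.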

How Platforms Actually Use Bloom Filters

In large systems, the flow typically looks like this:

User types username
        │
        ▼
Check Bloom Filter (RAM)
        │
   ┌────┴────┐
   │         │
   ▼         ▼
Definitely   Possibly
Not Exists   Exists
   │         │
Show         Verify with
Available    Database

This means the database is queried only when necessary, drastically reducing load.
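A sketch of that flow (the Bloom filter here uses salted SHA-256 for its hash functions, and a small set stands in for the database):

```python
import hashlib

M, K = 1 << 20, 3    # illustrative filter size and hash count
bits = bytearray(M)  # one byte per "bit" keeps the sketch simple

def _positions(name: str):
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def bloom_add(name: str) -> None:
    for pos in _positions(name):
        bits[pos] = 1

def db_has_username(name: str) -> bool:
    # Stand-in for the real database query; only reached on "possibly exists".
    return name in {"thejat", "alex_dev"}

def check_username(name: str) -> str:
    if not all(bits[pos] for pos in _positions(name)):
        return "available"  # definitely not in the set: no database hit
    # Possibly exists: fall back to the database to rule out a false positive.
    return "taken" if db_has_username(name) else "available"

for name in ("thejat", "alex_dev"):
    bloom_add(name)

print(check_username("thejat"))          # taken (confirmed by the database)
print(check_username("brand_new_name"))  # almost certainly available, no DB hit
```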

 
