
As our system continues to grow, we have distributed it across multiple servers, multiple databases, and even multiple data centers.

But as soon as you distribute your data, one thing becomes clear:

You can't have everything – you have to choose what you want to sacrifice.

This trade-off is captured beautifully in one of the core principles of distributed systems:

The CAP Theorem

What Is the CAP Theorem?

The CAP Theorem (also known as Brewer's Theorem) states that a distributed data system can guarantee at most two of the following three properties at the same time:

| Letter | Property | Meaning |
| --- | --- | --- |
| C | Consistency | Every node sees the same data at the same time – every read reflects the most recent write. |
| A | Availability | Every request gets a response (even if some nodes fail), without a guarantee that it contains the most recent write. |
| P | Partition Tolerance | The system continues to work even if network communication breaks between nodes. |

A distributed system must handle partitions (network failures are inevitable).

So in practice, since Partition Tolerance is mandatory, you usually choose between Consistency and Availability when a partition occurs.

The Three Types of Systems (CA, CP, AP)

| Type | What It Prioritizes | What It Sacrifices | Example |
| --- | --- | --- | --- |
| CA (Consistency + Availability) | Consistent and available as long as there's no partition | Fails under network partition | Traditional single-node RDBMS (MySQL before sharding) |
| CP (Consistency + Partition Tolerance) | Always consistent, even during a partition | May become temporarily unavailable | HBase, MongoDB (with strict consistency), Google Spanner |
| AP (Availability + Partition Tolerance) | Always available, even if data might be temporarily inconsistent | Eventual consistency | Cassandra, DynamoDB, CouchDB |

Consistency

Every read receives the most recent write.

If you update data on one node, all other nodes must reflect that update before any other client reads it.

[Figure: consistency]

As shown in the figure above, a user sends a request for userId: 1 to each of the nodes. For the system to be consistent, every node must return the same response.

If any node receives an update, all the other nodes must be updated as well. In other words, a client gets the same data at the same time, no matter which node it queries.

Take the case of Instagram: if a user changes their profile picture and the write lands on db1, then anyone who requests that profile from db2 or db3 should still see the updated picture.

Example:

Banking systems – when you transfer money, you can't afford inconsistent balances; the system must be strongly consistent.
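To make this concrete, here is a minimal sketch (not any real database's API – Node and ReplicatedStore are made-up names) of strong consistency via synchronous replication: a write is acknowledged only after every node has applied it, so a read from any node returns the latest value.

```python
# A minimal sketch of strong consistency via synchronous replication.
# Node and ReplicatedStore are illustrative, not a real library.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class ReplicatedStore:
    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}

    def write(self, key, value):
        # Synchronous replication: block until ALL nodes have the new value.
        for node in self.nodes.values():
            node.apply(key, value)
        return "OK"  # acknowledge only after full replication

    def read(self, key, node_name):
        # Any node can serve the read and still return the latest write.
        return self.nodes[node_name].data.get(key)


store = ReplicatedStore([Node("db1"), Node("db2"), Node("db3")])
store.write("user:1:avatar", "new_picture.png")
assert store.read("user:1:avatar", "db3") == "new_picture.png"
```

The cost of this design is visible in the write path: if any node is unreachable, the loop cannot finish and the write cannot be acknowledged – which is exactly the availability sacrifice discussed below.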

Availability

Every request receives a (non-error) response, even if some nodes are down.

The system will never refuse to respond, even if some data might be slightly outdated.

[Figure: availability]

Example: Amazon product pages

If one database node fails, the others still serve cached data – better to show slightly old info than to be unavailable.
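A minimal sketch of this behavior, assuming a hypothetical fetch_from_primary call and an in-process cache: when the primary is unreachable, we return the last known (possibly stale) value instead of an error.

```python
# Choosing availability: if the primary replica is down, serve the last
# known (possibly stale) cached copy instead of an error.
# fetch_from_primary and the cache contents are made up for illustration.

class NodeDown(Exception):
    pass

cache = {"product:42": {"price": 19.99, "stock": 3}}  # last known good copy

def fetch_from_primary(key):
    raise NodeDown("primary replica unreachable")  # simulate a failed node

def get_product(key):
    try:
        value = fetch_from_primary(key)
        cache[key] = value          # refresh the cache on a successful read
        return value
    except NodeDown:
        return cache.get(key)       # stale but available

print(get_product("product:42"))    # possibly outdated data, never an error
```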

Partition Tolerance

The system continues to operate even when some nodes cannot communicate with each other.

In distributed systems spanning multiple data centers or regions, network partitions are inevitable – links can drop, timeouts happen, and packets get lost.

Partition tolerance means the system can recover gracefully from such failures.

[Figure: partition tolerance]

A partition occurs when:

  • Network links fail
  • Nodes crash temporarily
  • Data centers lose connectivity

In other words, part of your distributed system is isolated from the rest.

Example:

If your US and EU databases lose connection, both should keep running (with some compromise).
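As a toy illustration of how a node might even notice a partition (the peer names and timings here are made up), one common approach is to track heartbeats and treat silent peers as unreachable:

```python
# Toy partition detection: peers that have not sent a heartbeat within
# the timeout are treated as unreachable. Names and timings are made up.

import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a peer counts as cut off

last_heartbeat = {
    "db1": time.time(),        # just heard from db1
    "db2": time.time(),        # just heard from db2
    "db3": time.time() - 30,   # db3 has been silent for 30 seconds
}

def unreachable_peers(now=None):
    now = now if now is not None else time.time()
    return [peer for peer, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]

print(unreachable_peers())  # -> ['db3']: keep serving, with some compromise
```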

The Real Trade-off: You Must Choose

When a network partition happens (P), the system must decide:

  • Should it stop serving requests until everything syncs?
    (Consistency)
  • Or serve data from available nodes even if it's outdated?
    (Availability)

That's the CAP dilemma.

[Figure: distributed node architecture]

Consider the picture above as a distributed architecture with three database nodes: db1, db2, and db3.

Suppose the connections between db3 and db2, and between db3 and db1, break. db3 is still running but can no longer communicate with db1 and db2.

And our application supports Partition Tolerance.

[Figure: distributed node architecture with the connections to db3 broken]
Now we need to prove that we can't achieve both Consistency and Availability.

Data becomes inconsistent:

Suppose a user updates their profile picture on db1 while another user requests that profile from db3. The second user gets stale data: the new picture has been written to db1 and replicated to db2, but not to db3, because of the broken link.

[Figure: stale read from db3 during the partition]

To make it consistent:

We need to block write operations on the db1 and db2 nodes until the broken link to db3 is restored.

Now, if any user tries to update the profile picture on db1 or db2, we deny the action. Because of this, Availability is compromised.

Availability means every request should get a successful response, but here we refused the write instead of returning HTTP 200 (OK).

To make it available:

We have to allow all operations, but then the system will not be consistent: if a user updates their profile picture on db1, the change is replicated to db2 but not to db3, so anyone reading that profile from db3 gets stale data.

So we compromise consistency.
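The two choices above can be sketched side by side. In this illustrative ProfileStore (made up for the example, not any real database's behavior), "cp" mode rejects writes that cannot reach every replica, while "ap" mode accepts them and queues db3 to catch up after the partition heals:

```python
# One toy cluster, two opposite choices during a partition.
# "cp" rejects writes that can't reach every replica (consistency);
# "ap" accepts them and syncs db3 later (availability, eventual consistency).

class ProfileStore:
    def __init__(self, mode):
        self.mode = mode                       # "cp" or "ap"
        self.replicas = {"db1": {}, "db2": {}, "db3": {}}
        self.reachable = {"db1", "db2"}        # db3 is cut off by the partition
        self.pending = []                      # writes db3 still needs

    def write(self, key, value):
        if self.mode == "cp" and self.reachable != set(self.replicas):
            return "REJECTED: not all replicas reachable"  # availability lost
        for name in self.reachable:
            self.replicas[name][key] = value
        if self.mode == "ap":
            self.pending.append((key, value))  # db3 catches up after the partition
        return "OK"

    def read(self, key, node):
        return self.replicas[node].get(key)    # db3 may serve stale data in AP mode

cp = ProfileStore("cp")
print(cp.write("user:1:avatar", "new.png"))    # -> REJECTED (consistent, unavailable)

ap = ProfileStore("ap")
print(ap.write("user:1:avatar", "new.png"))    # -> OK (available, db3 is stale)
print(ap.read("user:1:avatar", "db3"))         # -> None: stale until links heal
```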

Consistency vs Availability

Consistency:

All users see the same data at the same time.

What it means:

  • Every read returns the most recent write.
  • The system behaves like a single, up-to-date database.
  • No stale data.

When it's preferred:

  • Financial systems (bank balances, trades)
  • Inventory systems that must avoid overselling
  • Anything where being wrong is worse than being slow

Trade-off:

  • May require waiting for nodes to synchronize
  • Can reject requests if all replicas aren't aligned
  • Strong consistency -> lower availability

Availability:

The system always responds, even if it might return older data.

What it means:

  • Every request gets some response, even during failures.
  • System prioritizes uptime and responsiveness.
  • Might return stale data if nodes cannot sync.

When it's preferred:

  • Social media feeds
  • Caches and content delivery
  • Services where slightly outdated data is acceptable

Trade-off:

  • More uptime -> weaker consistency guarantees
  • Clients might temporarily see different data

Availability Numbers

In system design, availability is the percentage of time a system is operational and serving requests over a given period (usually a year).

It is usually expressed in "nines", like "99.9% availability" or "five nines".

Calculating Availability

Formula:

Availability (%) = Uptime / (Uptime + Downtime) * 100
  • Uptime = Time system is operational
  • Downtime = Time system is unavailable
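A quick worked example of the formula (the uptime figure is made up): 8,751.24 hours of uptime in an 8,760-hour year gives 99.90% availability, and inverting the formula shows that 99.9% allows about 8.76 hours of downtime per year – matching the table below.

```python
# Check the availability formula, then invert it to get a downtime budget.

HOURS_PER_YEAR = 365 * 24  # 8760

uptime = 8751.24           # example: hours the system was up this year
downtime = HOURS_PER_YEAR - uptime
availability = uptime / (uptime + downtime) * 100
print(f"{availability:.2f}%")          # -> 99.90%

# Inverse: how much downtime per year does 99.9% allow?
allowed_downtime = (1 - 0.999) * HOURS_PER_YEAR
print(f"{allowed_downtime:.2f} hours") # -> 8.76 hours
```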

Common Availability Levels

| Availability | Downtime per year | Downtime per month | Downtime per day | Description |
| --- | --- | --- | --- | --- |
| 99% (2 nines) | ~3.65 days | ~7.3 hours | ~14.4 min | Basic level, often acceptable for internal apps |
| 99.9% (3 nines) | ~8.76 hours | ~43.2 min | ~1.44 min | Standard SLA for many SaaS applications |
| 99.99% (4 nines) | ~52.6 min | ~4.32 min | ~8.6 sec | High-availability systems; e-commerce, banking |
| 99.999% (5 nines) | ~5.26 min | ~25.9 sec | ~0.86 sec | Mission-critical, extremely resilient systems |
| 99.9999% (6 nines) | ~31.5 sec | ~2.59 sec | ~0.086 sec | Ultra-critical systems like flight control, stock exchanges |