
As our system continues to grow, we have distributed it across multiple servers, multiple databases, and even multiple data centers.

But as soon as you distribute your data, one thing becomes clear:

You can't have everything – you have to choose what you want to sacrifice.

This trade-off is captured beautifully in one of the core principles of distributed systems:

The CAP Theorem

What Is the CAP Theorem?

The CAP Theorem (also known as Brewer's Theorem) states that a distributed data system can guarantee at most two of the following three properties at the same time:

| Letter | Property | Meaning |
| --- | --- | --- |
| C | Consistency | Every node sees the same data at the same time. |
| A | Availability | Every request gets a response (even if some nodes fail), without a guarantee that it contains the most recent write. |
| P | Partition Tolerance | The system continues to work even if network communication breaks between nodes. |

A distributed system must handle partitions (network failures are inevitable).

So in practice, you usually choose between Consistency and Availability when partitions occur, because Partition Tolerance is mandatory.

The Three Types of Systems (CA, CP, AP)

| Type | What It Prioritizes | What It Sacrifices | Example |
| --- | --- | --- | --- |
| CA (Consistency + Availability) | Consistent and available as long as there's no partition | Fails under network partition | Traditional single-node RDBMS (MySQL before sharding) |
| CP (Consistency + Partition Tolerance) | Always consistent, even during a partition | May become temporarily unavailable | HBase, MongoDB (with strict consistency), Google Spanner |
| AP (Availability + Partition Tolerance) | Always available, even if data might be temporarily inconsistent | Eventual consistency | Cassandra, DynamoDB, CouchDB |

Consistency

Every read receives the most recent write.

If you update data on one node, all other nodes must reflect that update before any other client reads it.

[Figure: Consistency]

As shown in the figure above, the user sends a request for userId: 1 to every node. In the case of consistency, the response from each node should be the same.

If any node is updated, then all the other nodes should also get updated. Thus we can say the client gets the same data at the same time, no matter which node it reads from.

Take the case of Instagram: if you change your profile picture and the update is written to db1, then anyone who requests your profile from db2 or db3 should see the updated picture.

Example:

Banking systems – When you transfer money, you can't afford inconsistent balances – it must be strongly consistent.
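Here is a minimal sketch of what strong consistency looks like in code, assuming synchronous replication; the Replica and ConsistentStore classes are made up for illustration, not any real database API:

```python
# Minimal sketch of strong consistency via synchronous replication.
# Node names (db1, db2, db3) and the in-memory dicts are illustrative only.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class ConsistentStore:
    """Acknowledge a write only after every replica has applied it."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        for replica in self.replicas:    # synchronous fan-out to all nodes
            replica.apply(key, value)    # if any replica is down, the write fails
        return "ack"                     # all replicas now agree

    def read(self, key, replica_name):
        replica = next(r for r in self.replicas if r.name == replica_name)
        return replica.read(key)


store = ConsistentStore([Replica("db1"), Replica("db2"), Replica("db3")])
store.write("user:1:profile_pic", "new_photo.jpg")
print(store.read("user:1:profile_pic", "db3"))   # "new_photo.jpg" on every node
```

Because the write only returns after every replica has applied it, a subsequent read from any node returns the latest value; the price is that the write blocks (or fails) if a replica is unreachable.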

Availability

Every request receives a (non-error) response, even if some nodes are down.

The system will never refuse to respond, even if some data might be slightly outdated.

[Figure: Availability]

Example: Amazon product pages

If one database node fails, others still serve cached data – better to show slightly old info than to be unavailable.
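A minimal sketch of that availability-first behaviour, using a hypothetical FlakyNode and an in-process cache as stand-ins for a real node and caching layer:

```python
# Minimal sketch of choosing availability: if the node is unreachable,
# serve a cached (possibly stale) copy instead of returning an error.

class NodeDownError(Exception):
    pass

class FlakyNode:
    def __init__(self, data, healthy=True):
        self.data = data
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise NodeDownError(key)
        return self.data[key]

cache = {}

def read_product(product_id, node):
    try:
        value = node.read(product_id)    # freshest data while the node is up
        cache[product_id] = value        # keep a fallback copy
        return value
    except NodeDownError:
        # Better to show slightly old info than to be unavailable.
        return cache.get(product_id, "temporarily unavailable")

node = FlakyNode({"p42": "Echo Dot, $49"})
print(read_product("p42", node))    # fresh read
node.healthy = False
print(read_product("p42", node))    # stale but still answered
```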

Partition Tolerance

The system continues to operate even when some nodes cannot communicate with each other.

In distributed systems spanning multiple data centers or regions, network partitions are inevitable – links can drop, timeouts happen, and packets get lost.

Partition tolerance means the system can recover gracefully from such failures.

[Figure: Partition tolerance]

A partition occurs when:

  • Network links fail
  • Nodes crash temporarily
  • Data centers lose connectivity

In other words, part of your distributed system is isolated from the rest.

Example:

If your US and EU databases lose connection, both should keep running (with some compromise).
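As a rough illustration, the sketch below has two regions that keep accepting writes while disconnected and then reconcile with a simple last-write-wins merge once the link returns. The Region class, the explicit timestamps, and the merge rule are all illustrative assumptions, not how any particular database resolves conflicts:

```python
# Minimal sketch of tolerating a partition: both regions keep accepting
# writes while disconnected, then reconcile once the link is restored.

class Region:
    def __init__(self, name):
        self.name = name
        self.data = {}   # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def merge(self, other):
        """Last-write-wins reconciliation after the partition heals."""
        for key, (ts, value) in other.data.items():
            if key not in self.data or ts > self.data[key][0]:
                self.data[key] = (ts, value)

us, eu = Region("us"), Region("eu")
us.write("user:1:email", "old@example.com", ts=100)   # before the partition
eu.write("user:1:email", "new@example.com", ts=200)   # written during the partition
us.merge(eu)                                           # link restored: newest write wins
print(us.data["user:1:email"][1])                      # new@example.com
```

The compromise here is temporary inconsistency: while the regions are cut off, a reader in the US sees the old email and a reader in the EU sees the new one, and the values only converge after the merge.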

The Real Trade-off: You Must Choose

When a network partition happens (P), the system must decide:

  • Should it stop serving requests until everything syncs?
    (Consistency)
  • Or serve data from available nodes even if it's outdated?
    (Availability)

That's the CAP dilemma.
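The dilemma boils down to a single decision point. The sketch below is hypothetical (real systems express this choice through quorum settings, consistency levels, and timeouts rather than a mode flag), but it shows the two possible answers when a quorum of nodes is unreachable:

```python
# Minimal sketch of the CAP decision during a partition.
# "reachable" maps node names to the value each reachable node holds.

class PartitionError(Exception):
    pass

def handle_read(key, reachable, total_nodes, mode="CP"):
    quorum = total_nodes // 2 + 1
    if len(reachable) >= quorum:
        return reachable                 # majority reachable: safe either way
    if mode == "CP":
        # Consistency first: refuse to answer until the partition heals.
        raise PartitionError(f"cannot reach a quorum for {key!r}")
    # Availability first: answer from whatever is reachable, possibly stale.
    return reachable

# Only 1 of 3 nodes reachable: AP serves the (possibly stale) value, CP raises.
print(handle_read("user:1", {"db1": "v3"}, total_nodes=3, mode="AP"))
```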

Availability Numbers

In system design, availability is the percentage of time a system is operational and serving requests over a given period (usually a year).

It is usually expressed in nines, like "99.9% availability" or "five nines".

Calculating Availability

Formula:

Availability (%) = Uptime / (Uptime + Downtime) * 100
  • Uptime = Time system is operational
  • Downtime = Time system is unavailable
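In code, the formula is a one-liner; the only assumption is that uptime and downtime are measured in the same unit:

```python
# Direct translation of the availability formula (any single time unit works,
# as long as uptime and downtime use the same one; hours here).

def availability(uptime, downtime):
    return uptime / (uptime + downtime) * 100

# Example: 8,756 hours up and 4 hours down over a year (8,760 hours total)
print(round(availability(8756, 4), 3))   # 99.954 (%)
```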

Common Availability Levels

| Availability | Downtime per year | Downtime per month | Downtime per day | Description |
| --- | --- | --- | --- | --- |
| 99% (2 nines) | ~3.65 days | ~7.3 hours | ~14.4 min | Basic level, often acceptable for internal apps |
| 99.9% (3 nines) | ~8.76 hours | ~43.2 min | ~1.44 min | Standard SLA for many SaaS applications |
| 99.99% (4 nines) | ~52.6 min | ~4.32 min | ~8.6 sec | High availability systems; e-commerce, banking |
| 99.999% (5 nines) | ~5.26 min | ~25.9 sec | ~0.86 sec | Mission-critical, extremely resilient systems |
| 99.9999% (6 nines) | ~31.5 sec | ~2.59 sec | ~0.086 sec | Ultra-critical systems like flight control, stock exchanges |
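You can sanity-check these numbers yourself. The sketch below assumes a 365-day year and a 30-day month, which is why the results differ slightly from rounded SLA figures:

```python
# Quick check of the table above: allowed downtime for a given availability.

def downtime(availability_pct, period_seconds):
    return (1 - availability_pct / 100) * period_seconds

YEAR, MONTH, DAY = 365 * 24 * 3600, 30 * 24 * 3600, 24 * 3600

for nines in (99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{nines}% -> {downtime(nines, YEAR) / 3600:.2f} h/year, "
          f"{downtime(nines, MONTH) / 60:.2f} min/month, "
          f"{downtime(nines, DAY):.2f} s/day")
```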