Updated on 19 May, 2026 | 21 mins read

Introduction

Modern distributed systems are incredibly complex.

A single user request may travel through:

  • API gateways,
  • authentication services,
  • microservices,
  • databases,
  • caches,
  • queues,
  • third-party APIs,
  • serverless functions.

When something fails in production, engineers need answers to questions like:

  • What happened?
  • Where did the failure occur?
  • Why did it happen?
  • Which service caused it?
  • How many users are affected?
  • Is the system still healthy?

This is where Observability becomes essential.

Observability is the ability to understand the internal state of a system by analyzing the data it produces.

The foundation of observability is built on three major components known as:

The Three Pillars of Observability

  1. Logs
  2. Metrics
  3. Traces

These three pillars work together to provide visibility into distributed systems.

What Are the Three Pillars of Observability?

The three pillars represent the primary types of telemetry data generated by modern systems.

Each pillar answers a different kind of question.

| Pillar | Purpose |
| --- | --- |
| Logs | What exactly happened? |
| Metrics | Is the system healthy overall? |
| Traces | How did a request travel through the system? |

Together they create a complete operational understanding of production systems.

Why These Pillars Exist

In distributed systems:

  • failures are partial,
  • logs are fragmented,
  • services are independent,
  • debugging is difficult.

No single telemetry type is sufficient.

Example:

  • Metrics may show latency spiked,
  • Logs may show database errors,
  • Traces may reveal which services caused the slowdown.

Only together do they provide full observability.

Pillar 1 – Logs

What Are Logs?

Logs are timestamped records of events generated by applications or infrastructure.

They describe:

  • system behavior,
  • errors,
  • warnings,
  • state changes,
  • request activity.

Logs are the most detailed form of telemetry.

Example Log:

{
  "timestamp": "2026-05-19T10:00:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "correlationId": "abc123"
}

Characteristics of Logs

Logs are:

  • event-based,
  • highly detailed,
  • timestamped,
  • unstructured or structured.

Types of Logs

1 Application Logs

Generated by application code.

Example:

  • user login,
  • payment processed,
  • API request failed.

2 System Logs

Generated by operating systems.

Example:

  • disk failures,
  • memory warnings,
  • kernel events.

3 Access Logs

Track incoming requests.

Example:

  • HTTP request logs,
  • API gateway logs.

4 Audit Logs

Track security-sensitive actions.

Example:

  • password changes,
  • admin actions,
  • permission updates.

Log Levels

Most logging systems use levels:

| Level | Meaning |
| --- | --- |
| DEBUG | Detailed development information |
| INFO | Normal system operation |
| WARN | Potential issues |
| ERROR | Failures occurred |
| FATAL | Critical system failure |
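To make the effect of a level threshold concrete, here is a minimal sketch using Python's standard `logging` module. The logger name, the messages, and the `ListHandler` helper are illustrative assumptions. Note that Python names two of the levels differently: `WARNING` instead of WARN and `CRITICAL` instead of FATAL.

```python
import logging


class ListHandler(logging.Handler):
    """Collect messages in memory so the filtering is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("payment-service-levels")
logger.propagate = False          # keep the example isolated from the root logger
logger.setLevel(logging.WARNING)  # anything below WARNING is dropped
handler = ListHandler()
logger.addHandler(handler)

logger.debug("Card token parsed")                 # dropped
logger.info("Payment request received")           # dropped
logger.warning("Retrying payment provider call")  # kept
logger.error("Payment processing failed")         # kept
logger.critical("Payment database unreachable")   # kept

print(handler.messages)
```

Raising the threshold in production is a common way to cut log volume while still capturing warnings and failures.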

Structured Logging

Modern systems prefer structured logs, most commonly encoded as JSON.

Why?

  • searchable,
  • filterable,
  • machine-readable,
  • scalable.

Example:

{
  "service": "auth-service",
  "level": "INFO",
  "userId": 101,
  "message": "User authenticated"
}
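One possible way to produce entries like this is a custom formatter on top of Python's standard `logging` module. This is a sketch, not a recommended library: `JsonFormatter` and `build_logger` are hypothetical helper names, and the set of extra fields (`userId`, `correlationId`) is assumed from the examples in this article.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("userId", "correlationId"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("auth-service")
logger.info("User authenticated", extra={"userId": 101})
```

Because every entry is a JSON object with stable keys, a log platform can filter on `level` or `userId` without parsing free-form text.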

Centralized Logging

In distributed systems:

  • logs come from many servers,
  • many containers,
  • many services.

Centralized logging aggregates logs into one platform.
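The idea can be sketched in a few lines: merge the per-service streams into one timeline, then filter by a shared `correlationId`. The sample entries and the helper names `aggregate` and `by_correlation_id` are assumptions for illustration. Real platforms also handle shipping, indexing, and retention, which this sketch ignores.

```python
import heapq
from typing import Iterable


def aggregate(*streams: Iterable[dict]) -> list[dict]:
    """Merge per-service streams (each already ordered by time) into one timeline.

    Assumes every timestamp is ISO 8601 in UTC with the same precision,
    so plain string comparison matches chronological order.
    """
    return list(heapq.merge(*streams, key=lambda entry: entry["timestamp"]))


def by_correlation_id(entries: Iterable[dict], correlation_id: str) -> list[dict]:
    """Keep only the entries that belong to one request."""
    return [e for e in entries if e.get("correlationId") == correlation_id]


# Hypothetical logs from two services.
gateway_logs = [
    {"timestamp": "2026-05-19T10:00:00Z", "service": "api-gateway",
     "message": "POST /checkout", "correlationId": "abc123"},
    {"timestamp": "2026-05-19T10:00:03Z", "service": "api-gateway",
     "message": "GET /health", "correlationId": "def456"},
]
payment_logs = [
    {"timestamp": "2026-05-19T10:00:01Z", "service": "payment-service",
     "message": "Charging card", "correlationId": "abc123"},
    {"timestamp": "2026-05-19T10:00:02Z", "service": "payment-service",
     "message": "Payment processing failed", "correlationId": "abc123"},
]

timeline = aggregate(gateway_logs, payment_logs)
request = by_correlation_id(timeline, "abc123")

for entry in request:
    print(entry["timestamp"], entry["service"], entry["message"])
```

With all entries in one place, a single query reconstructs what happened to one request across services.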

Strengths of Logs

Logs are excellent for:

  • debugging,
  • root cause analysis,
  • detailed investigation.

Weaknesses of Logs

Logs are:

  • noisy,
  • high-volume,
  • expensive to store,
  • difficult to aggregate statistically.

This is why metrics exist.

Pillar 2 – Metrics

What Are Metrics?

Metrics are numerical measurements collected over time.

They summarize system behavior.

Metrics answer:

  • Is the system healthy?
  • How much traffic exists?
  • Is latency increasing?
  • Are errors increasing?

Example Metrics

| Metric | Value |
| --- | --- |
| CPU Usage | 75% |
| Request Rate | 10,000 req/sec |
| Error Rate | 2% |
| API Latency | 120 ms |

Characteristics of Metrics

Metrics are:

  • aggregated,
  • lightweight,
  • time-series based,
  • highly efficient.

Types of Metrics

1 Counter

Only increases (typically resetting to zero when the process restarts).

Example:

  • total requests,
  • total errors.

2 Gauge

Represents current value.

Example:

  • CPU usage,
  • memory usage.

3 Histogram

Measures distribution.

Example:

  • request latency ranges.

4 Summary

Provides percentile statistics.

Example:

  • P95 latency,
  • P99 latency.
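The four metric types can be illustrated with a small in-memory sketch. The class names, bucket bounds, and latency values below are assumptions made for this example, not the API of any particular metrics library. The percentile helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
from bisect import bisect_left


class Counter:
    """A value that only goes up."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """A value that can go up or down; only the latest reading is kept."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last bucket is overflow

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile, the kind of statistic a summary reports."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [80, 120, 95, 300, 110, 2500, 130, 105, 90, 115]

requests = Counter()
cpu = Gauge()
latency = Histogram(bounds=[100, 250, 500])

for value in latencies_ms:
    requests.inc()
    latency.observe(value)
cpu.set(75.0)

print(requests.value)                # total requests seen
print(latency.counts)                # observations per latency bucket
print(percentile(latencies_ms, 95))  # P95 latency
```

A histogram stores only bucket counts, so it stays small no matter how many requests are observed, while the percentile helper here needs the raw values. That trade-off is one reason metrics systems offer both types.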

Time-Series Data

Metrics are stored as time-series data.

Example:

10:00 → 50ms
10:01 → 55ms
10:02 → 70ms

This allows:

  • trend analysis,
  • dashboards,
  • alerting.
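As a rough sketch of how trend analysis and alerting build on such a series, the snippet below reuses the three samples above. The `Sample` type, the helper names, and the 60 ms threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    time: str
    latency_ms: float


def deltas(samples: list[Sample]) -> list[float]:
    """Change between consecutive samples; rising values indicate an upward trend."""
    return [b.latency_ms - a.latency_ms for a, b in zip(samples, samples[1:])]


def breaches(samples: list[Sample], threshold_ms: float) -> list[Sample]:
    """Samples that would trigger an alert at the given threshold."""
    return [s for s in samples if s.latency_ms > threshold_ms]


series = [
    Sample("10:00", 50),
    Sample("10:01", 55),
    Sample("10:02", 70),
]

print(deltas(series))            # how fast latency is changing
print(breaches(series, 60))      # which samples cross the assumed threshold
```

Real alerting rules usually evaluate a window of samples rather than a single point, to avoid firing on brief spikes.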

Strengths of Metrics

Metrics are:

  • lightweight,
  • scalable,
  • excellent for alerting,
  • ideal for dashboards.

Weaknesses of Metrics

Metrics lack detail.

Metrics may show:

  • latency increased,

but not:

  • why latency increased.

This is where logs and traces help.

Pillar 3 – Traces

What Are Traces?

Traces track the complete journey of a request across distributed systems.

They answer:

  • Which services were involved?
  • Where was latency introduced?
  • Which service failed?
  • How long did each operation take?

Example Request Flow

Client
  ↓
API Gateway
  ↓
Auth Service
  ↓
Payment Service
  ↓
Database

A trace records the entire path.

Trace IDs and Span IDs

Trace ID

Represents:

  • one complete distributed request.

Span ID

Represents:

  • one operation inside the trace.

Example:

Trace
 ├── API Gateway Span
 ├── Auth Service Span
 ├── Payment Service Span
 └── Database Span
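A minimal sketch of how these identifiers relate: every span carries the trace ID of the request it belongs to, its own span ID, and optionally the ID of its parent span. The `Span` type and helper functions are hypothetical. The ID lengths (32 and 16 hex characters) mirror the W3C Trace Context format, and the parent links follow the request flow shown earlier, which is an assumption about how the services call each other.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def _new_span_id() -> str:
    return uuid.uuid4().hex[:16]


@dataclass
class Span:
    name: str
    trace_id: str                    # shared by every span in the request
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=_new_span_id)


def start_trace(name: str) -> Span:
    """Create the root span and a fresh trace ID."""
    return Span(name, trace_id=uuid.uuid4().hex)


def start_child(parent: Span, name: str) -> Span:
    """Create a span in the same trace, linked to its parent."""
    return Span(name, trace_id=parent.trace_id, parent_id=parent.span_id)


gateway = start_trace("API Gateway")
auth = start_child(gateway, "Auth Service")
payment = start_child(gateway, "Payment Service")
database = start_child(payment, "Database")

for span in (gateway, auth, payment, database):
    print(span.name, span.trace_id, span.span_id, span.parent_id)
```

In a real system the trace ID and parent span ID travel between services in request headers, so each service can attach its spans to the same trace.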

Example Trace Insights

Tracing can reveal:

  • slow database queries,
  • failed downstream services,
  • retry storms,
  • bottlenecks.
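One way a bottleneck can be located from span data is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below assumes children run one after another without overlapping, and the span IDs and durations are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FinishedSpan:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list[FinishedSpan]) -> dict[str, float]:
    """Time spent in each span, excluding its direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


def bottleneck(spans: list[FinishedSpan]) -> FinishedSpan:
    """The span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace of one slow request.
spans = [
    FinishedSpan("a", None, "API Gateway", 3000),
    FinishedSpan("b", "a", "Auth Service", 150),
    FinishedSpan("c", "a", "Payment Service", 2800),
    FinishedSpan("d", "c", "Database", 2500),
]

print(self_times(spans))
print(bottleneck(spans).name)
```

Looking only at total duration would blame the API Gateway, since it wraps everything else. Self time points at the span where the time was actually spent.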

Strengths of Traces

Traces are excellent for:

  • microservices debugging,
  • latency analysis,
  • request flow visualization.

Weaknesses of Traces

Tracing systems:

  • can be expensive,
  • generate large telemetry volumes,
  • require instrumentation.

Relationship Between Logs, Metrics, and Traces

The three pillars complement each other.

Example Scenario

Suppose users report:

“Checkout is slow.”

Metrics Show

P95 latency increased from 200ms to 3s

Metrics reveal:

  • that a system health issue exists.

Traces Show

Payment Service → Database query taking 2.5s

Traces identify:

  • where the slowdown occurred.

Logs Show

Database connection pool exhausted

Logs reveal:

  • the root cause.

This Is the Power of Observability

Metrics detect.

Traces locate.

Logs explain.
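The checkout scenario can be walked through as a rough sketch of this detect, locate, explain sequence. Everything here is an assumption made for illustration: the sample data, the 1000 ms alert threshold, the field names, and the `investigate` helper. It presumes spans and logs share a trace ID so they can be joined.

```python
import math
from typing import Optional


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def investigate(
    latencies_ms: list[float],
    spans: list[dict],
    logs: list[dict],
    threshold_ms: float = 1000,
) -> Optional[tuple[str, list[str]]]:
    # 1. Metrics detect: is P95 latency above the alert threshold?
    if p95(latencies_ms) <= threshold_ms:
        return None
    # 2. Traces locate: which operation took the most time?
    slowest = max(spans, key=lambda s: s["durationMs"])
    # 3. Logs explain: which errors were logged for that same trace?
    errors = [
        entry["message"]
        for entry in logs
        if entry["traceId"] == slowest["traceId"] and entry["level"] == "ERROR"
    ]
    return slowest["operation"], errors


# Hypothetical telemetry for the slow checkout.
latencies_ms = [200] * 18 + [3000] * 2
spans = [
    {"traceId": "t1", "operation": "auth-service: verify token", "durationMs": 150},
    {"traceId": "t1", "operation": "payment-service: database query", "durationMs": 2500},
]
logs = [
    {"traceId": "t1", "level": "INFO", "message": "Checkout started"},
    {"traceId": "t1", "level": "ERROR", "message": "Database connection pool exhausted"},
    {"traceId": "t2", "level": "ERROR", "message": "Unrelated failure"},
]

print(investigate(latencies_ms, spans, logs))
```

The join on a shared trace ID is what lets an engineer move from a dashboard alert to the specific log line without searching every service by hand.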

Mental Model of the Three Pillars

Think of observability like a hospital:

| Pillar | Analogy |
| --- | --- |
| Metrics | Vital signs |
| Logs | Medical records |
| Traces | Patient journey |

Each provides different visibility into system health.
