Updated on 13 Jun, 2026 · 23 min read

The Dream: Real-Time Human Communication

Imagine you want to build a video conferencing platform.

Not a toy project.

Something like:

  • Zoom
  • Google Meet
  • Microsoft Teams
  • Discord
  • WhatsApp Calling

Let's define the requirements.

Requirement 1: Very Low Latency

Human conversation is extremely sensitive to delay.

Imagine this dialogue.

Alice:

Hello Bob

Bob hears it:

2 seconds later

Now Bob responds.

Alice hears it:

2 seconds later

Every exchange now carries a gap of at least 4 seconds before Alice hears a reply, and the conversation becomes painful.

Acceptable Latency

Roughly speaking, for one-way (mouth-to-ear) delay:

Latency        Experience
< 100 ms       Excellent
100-200 ms     Good
200-400 ms     Noticeable
400-1000 ms    Annoying
> 1000 ms      Difficult

Human communication requires extremely low latency.

This immediately eliminates many traditional approaches, such as conventional HTTP video streaming, which buffers several seconds of media before playback.
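The latency bands can be turned into a small lookup. Doing so also makes the Alice and Bob arithmetic explicit: the delay is paid once in each direction, so a 2-second one-way delay means at least 4 seconds between Alice finishing a sentence and hearing Bob's reply. Below is a minimal Python sketch. The function names are hypothetical, and it assumes the bands are contiguous, with boundaries at 100, 200, 400 and 1000 ms.

```python
def classify_latency(one_way_ms: float) -> str:
    """Map a one-way delay in milliseconds to an experience band.

    Assumption: bands are contiguous with boundaries at 100, 200, 400, 1000 ms.
    """
    if one_way_ms < 100:
        return "Excellent"
    if one_way_ms < 200:
        return "Good"
    if one_way_ms < 400:
        return "Noticeable"
    if one_way_ms <= 1000:
        return "Annoying"
    return "Difficult"


def reply_gap_ms(one_way_ms: float, think_time_ms: float = 0.0) -> float:
    """Gap Alice perceives between finishing a sentence and hearing Bob's reply.

    Her words travel to Bob (one delay), Bob thinks, and his answer travels
    back to her (a second delay).
    """
    return 2 * one_way_ms + think_time_ms


print(classify_latency(2000))  # Difficult
print(reply_gap_ms(2000))      # 4000.0
```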

Requirement 2: Continuous Streaming

A chat application sends:

Message
Pause
Message
Pause

A video call sends:

Frame
Frame
Frame
Frame
Frame
Frame
...

continuously.

For example:

30 FPS video:

30 frames per second

60 FPS video:

60 frames per second

The system must handle a nonstop stream of data.
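The per-frame time budget follows directly from the frame rate. At f frames per second, a new frame arrives every 1000 / f milliseconds, so capture, encoding, transport and decoding must each keep pace with that rate. Here is a small sketch; the helper names are hypothetical.

```python
def frame_interval_ms(fps: float) -> float:
    """Milliseconds between consecutive frames at a steady frame rate."""
    return 1000.0 / fps


def frames_per_minute(fps: int) -> int:
    """Frames the pipeline must capture, encode, send and decode each minute."""
    return fps * 60


for fps in (30, 60):
    print(f"{fps} FPS: a new frame every {frame_interval_ms(fps):.1f} ms, "
          f"{frames_per_minute(fps)} frames per minute")
```

At 30 FPS this works out to about 33.3 ms per frame. At 60 FPS the budget halves to about 16.7 ms.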

Requirement 3: Audio and Video Synchronization

Suppose Alice says:

Hello

Her lips move.

The sound should match the movement.

If video arrives first:

Lips move

and audio arrives later:

Hello

the experience feels broken.

Synchronization becomes critical.
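One way to restore lip sync on the receiving side is to hold back whichever stream is ahead. The sketch below rests on two assumptions:

  • Both arrival times are already on one local clock and refer to media captured at the same instant. A real system first has to map each stream's timestamps onto a shared clock.
  • `tolerance_ms` is a hypothetical threshold, not a standardized value.

```python
from dataclasses import dataclass


@dataclass
class PlayoutDecision:
    extra_audio_delay_ms: float
    extra_video_delay_ms: float


def align_playout(audio_arrival_ms: float, video_arrival_ms: float,
                  tolerance_ms: float = 40.0) -> PlayoutDecision:
    """Delay whichever stream is ahead so matching audio and video play together.

    Assumes both arguments are arrival times, on one local clock, of audio and
    video captured at the same instant. tolerance_ms is a hypothetical
    threshold below which the skew is left uncorrected.
    """
    skew = float(audio_arrival_ms - video_arrival_ms)  # > 0: video arrived first
    if abs(skew) <= tolerance_ms:
        return PlayoutDecision(0.0, 0.0)
    if skew > 0:
        # Lips would move before "Hello" is heard: hold the video back.
        return PlayoutDecision(0.0, skew)
    # Audio arrived first: hold the audio back until the video catches up.
    return PlayoutDecision(-skew, 0.0)


# The scenario above: video frames arrive 120 ms before the matching audio.
decision = align_playout(audio_arrival_ms=1120, video_arrival_ms=1000)
print(decision.extra_video_delay_ms)  # 120.0
```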

Requirement 4: Adaptive Quality

Network conditions constantly change.

Example:

At 9:00 AM:

Available Bandwidth: 20 Mbps

At 9:05 AM:

Available Bandwidth: 3 Mbps

The communication system must adapt automatically.

Otherwise:

Video freezes
Call drops
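A simple way to picture adaptation is a ladder of quality levels. The sender steps to the highest level the current bandwidth can sustain. The ladder values, the 80% headroom factor and the function name below are illustrative assumptions, not what any particular browser or service uses.

```python
from typing import List, Tuple

# Hypothetical ladder: (minimum usable bandwidth kbps, resolution, target bitrate kbps).
# Illustrative numbers only; real services tune their own.
LADDER: List[Tuple[int, str, int]] = [
    (4000, "1080p", 3000),
    (1500, "720p", 1200),
    (600, "360p", 500),
    (0, "180p", 150),
]


def pick_rung(available_kbps: float, headroom: float = 0.8) -> Tuple[str, int]:
    """Return (resolution, target bitrate) for the highest rung that fits.

    Only a fraction (headroom) of the measured bandwidth is treated as usable,
    leaving spare capacity for audio and for estimation error.
    """
    usable = available_kbps * headroom
    for min_kbps, resolution, target_kbps in LADDER:
        if usable >= min_kbps:
            return resolution, target_kbps
    # Only reached for negative inputs: fall back to the lowest rung.
    _, resolution, target_kbps = LADDER[-1]
    return resolution, target_kbps


print(pick_rung(20_000))  # 9:00 AM, 20 Mbps -> ('1080p', 3000)
print(pick_rung(3_000))   # 9:05 AM, 3 Mbps  -> ('720p', 1200)
```

When bandwidth drops from 20 Mbps to 3 Mbps, the sketch steps down to 720p instead of letting the video freeze.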

Requirement 5: Security

Voice calls contain sensitive information.

Video calls contain sensitive information.

Screen sharing may expose:

  • Passwords
  • Emails
  • Financial data

Therefore:

Encryption Required

not optional.

Requirement 6: NAT Traversal

As we learned:

Most devices are hidden behind:

  • Routers
  • NAT
  • Firewalls

The communication system must somehow connect them anyway.

This is one of the hardest requirements.

Requirement 7: Browser Support

Before WebRTC, video communication in the browser usually required plugins.

Examples:

  • Flash
  • Java Applets
  • Proprietary software

Problems:

Install plugin
Update plugin
Security vulnerabilities
Browser incompatibility

Users hated this.

Developers hated this.

Browser vendors hated this.

A better solution was needed.

Why WebSockets Were Not Enough

Many engineers ask:

If WebSockets provide real-time communication, why didn't we just use WebSockets?

Let's examine it carefully.

What WebSockets Actually Provide

WebSockets provide:

Persistent Bidirectional Communication

Example:

Client <--> Server

Messages can flow both ways.

But WebSockets only solve one problem:

Transporting Bytes

What WebSockets Do NOT Provide

WebSockets do not provide:

NAT Traversal

  • No STUN
  • No TURN
  • No ICE

Audio Processing

  • No codecs
  • No compression
  • No encoding
  • No decoding

Video Processing

  • No frame handling
  • No synchronization
  • No bitrate adaptation

Congestion Control

  • No bandwidth management
  • No quality adaptation

Media Security

  • No media-specific encryption pipeline

Peer Discovery

  • No peer connection mechanism

Packet Loss Recovery

  • No real-time loss handling: WebSockets run over TCP, so a lost packet is retransmitted and everything queued behind it waits
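WebSockets deliver messages, but nothing in a message says when its media was captured or whether an earlier chunk went missing. WebRTC carries audio and video over RTP (one of the acronyms listed at the end of this section). Every RTP packet starts with a 12-byte fixed header, defined in RFC 3550, that holds:

  • a sequence number
  • a media timestamp
  • a source identifier (SSRC)

The sketch below packs and unpacks that header with Python's `struct` module. It is one small example of the bookkeeping you would otherwise build yourself. The payload type 111 and the other sample values are arbitrary.

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # network byte order, 12 bytes


def pack_rtp_header(payload_type: int, seq: int, timestamp: int, ssrc: int,
                    marker: bool = False) -> bytes:
    """Build an RTP fixed header (RFC 3550): version 2, no padding,
    no header extension, no contributing sources."""
    byte0 = 2 << 6                                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit payload type
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(data: bytes) -> dict:
    """Parse the first 12 bytes of an RTP packet into its fixed-header fields."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,              # lets the receiver detect loss and reordering
        "timestamp": timestamp,  # media clock, used for playout timing and sync
        "ssrc": ssrc,            # identifies the media source
    }


# 20 ms of 48 kHz audio advances the RTP timestamp by 960 ticks (48000 * 0.02).
header = pack_rtp_header(payload_type=111, seq=1, timestamp=960, ssrc=0x1234ABCD)
print(len(header))  # 12
print(unpack_rtp_header(header))
```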

The Hidden Complexity

Suppose you want to build Zoom using WebSockets.

You would need to build:

Media Engine
Codec Engine
Audio Processing
Video Processing
Encryption
Congestion Control
NAT Traversal
Bandwidth Adaptation
Peer Discovery
Connection Negotiation

yourself.

This is an enormous undertaking.

Essentially:

You would end up rebuilding WebRTC.

The Industry's Realization

Engineers around the world kept solving the same problems repeatedly.

Every communication platform needed:

Audio
Video
Security
NAT Traversal
Low Latency
Adaptive Bitrate

Again and again.

The industry needed a standardized solution.

Google's Proposal

Around 2010, Google acquired a company called:

Global IP Solutions

commonly known as GIPS.

GIPS specialized in:

  • Voice over IP
  • Video communication
  • Real-time media technologies

Google recognized something important:

Real-time communication should be built directly into browsers.

Not through plugins.

Not through third-party software.

Directly into the web platform.

This idea eventually evolved into WebRTC.

The Core Vision

The vision was simple:

Allow developers to build:

Audio Calls
Video Calls
Screen Sharing
File Transfer

using standard browser APIs.

Without plugins.

Without installations.

Without proprietary technology.

The Three Major Goals

WebRTC was designed around three primary goals.

Goal 1: Real-Time Communication

The system must support:

Audio
Video
Data

with minimal latency.

Goal 2: Peer-to-Peer First

Whenever possible:

Alice <-----> Bob

direct communication.

This reduces:

  • latency
  • bandwidth costs
  • infrastructure requirements
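ICE puts the peer-to-peer-first goal into practice by ranking candidate addresses so that direct paths outrank relayed ones. RFC 8445 recommends this priority formula:

priority = 2^24 × type preference + 2^8 × local preference + (256 - component ID)

The recommended type preferences are:

  • 126 for host candidates
  • 110 for peer-reflexive candidates
  • 100 for server-reflexive candidates (learned via STUN)
  • 0 for relayed candidates (TURN)

The sketch below assumes a single network interface (local preference 65535) and component 1. Browsers may choose other local preferences.

```python
# RFC 8445 recommended type preferences.
TYPE_PREFERENCE = {
    "host": 126,   # an address on the device itself
    "prflx": 110,  # peer-reflexive: discovered during connectivity checks
    "srflx": 100,  # server-reflexive: public address learned via STUN
    "relay": 0,    # relayed through a TURN server
}


def candidate_priority(cand_type: str, local_pref: int = 65535,
                       component_id: int = 1) -> int:
    """ICE candidate priority using the formula recommended in RFC 8445."""
    return ((1 << 24) * TYPE_PREFERENCE[cand_type]
            + (1 << 8) * local_pref
            + (256 - component_id))


candidates = ["relay", "srflx", "host"]
for cand in sorted(candidates, key=candidate_priority, reverse=True):
    print(cand, candidate_priority(cand))
```

Under these assumptions, the three candidates get these priorities:

  • host: 2130706431
  • server-reflexive: 1694498815
  • relayed: 16777215

The relayed path ranks last.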

Goal 3: Secure by Default

Unlike many older systems:

WebRTC made encryption mandatory.

Not optional.

Every WebRTC connection must be encrypted.

The WebRTC Philosophy

The designers of WebRTC asked:

What if browsers could provide all the hard parts automatically?

Instead of developers implementing:

NAT Traversal
Codecs
Encryption
Media Transport

the browser would provide them.

Developers would simply use APIs.

This philosophy became the foundation of WebRTC.

What WebRTC Actually Is

One of the biggest misconceptions:

WebRTC is a protocol

Wrong.

WebRTC is a framework.

More precisely:

A collection of standards,
protocols,
APIs,
and media technologies
working together.

It is not one thing.

It is many technologies integrated into one system.

The Major Building Blocks

At a high level, WebRTC consists of several major subsystems.

Media Capture

Responsible for obtaining:

Camera
Microphone
Screen

from the user's device.

Peer Connectivity

Responsible for:

Finding Peers
Creating Connections
Maintaining Connections

NAT Traversal

Responsible for:

STUN
TURN
ICE

operations.

Media Transport

Responsible for moving:

Audio
Video

across networks.

Security

Responsible for:

Encryption
Authentication
Key Exchange

Congestion Control

Responsible for adapting:

Bitrate
Quality
Resolution

based on network conditions.

A High-Level WebRTC Call

Let's see the entire journey before diving into details.

Imagine Alice starts a call.

Step 1

Capture media.

Camera
Microphone

become available.

Step 2

Create a peer connection.

The browser sets up the components that will handle codecs, network transport, and encryption for the call.

Step 3

Exchange connection information.

Peers share:

Capabilities
Addresses
Media Information

Step 4

Discover network routes.

Using:

STUN
TURN
ICE

Step 5

Establish secure communication.

Encryption keys are negotiated between the peers.

Step 6

Begin media transport.

Audio and video start flowing.

Step 7

Continuously adapt.

Monitor:

Bandwidth
Packet Loss
Latency

Adjust quality dynamically.
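Step 7 is a feedback loop: the receiver reports what it observed, and the sender adjusts. The sketch below shows only a loss-based rule, modeled on the one described in the Google Congestion Control Internet-Draft (draft-ietf-rmcat-gcc):

  • above 10% loss, scale the rate by (1 - 0.5 × loss)
  • below 2% loss, raise the rate by 5%
  • in between, hold steady

Real implementations combine this with delay-based estimation. The function name and the 50 to 2500 kbps clamps are hypothetical.

```python
def update_send_rate(rate_kbps: float, loss_fraction: float,
                     min_kbps: float = 50.0, max_kbps: float = 2500.0) -> float:
    """Apply one loss-based adjustment after a receiver report.

    Loosely follows the loss-based rule in the Google Congestion Control
    draft: >10% loss backs off, <2% loss probes upward, otherwise hold.
    The min/max clamps are hypothetical.
    """
    if loss_fraction > 0.10:
        rate_kbps *= 1 - 0.5 * loss_fraction  # heavier loss, bigger cut
    elif loss_fraction < 0.02:
        rate_kbps *= 1.05                     # network looks healthy: probe up 5%
    return max(min_kbps, min(max_kbps, rate_kbps))


rate = 1000.0
for loss in (0.0, 0.05, 0.20):  # simulated receiver reports
    rate = update_send_rate(rate, loss)
    print(f"loss {loss:.0%}: send at {rate:.0f} kbps")
```

Starting from 1000 kbps, the three simulated reports move the rate to about 1050, 1050 and 945 kbps.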

Why WebRTC Feels Complex

Many developers first encounter terms like:

Offer
Answer
SDP
ICE
STUN
TURN
RTP
RTCP
DTLS
SRTP

and become overwhelmed.

The reason is simple.

WebRTC combines knowledge from:

  • Networking
  • Security
  • Distributed Systems
  • Audio Engineering
  • Video Engineering
  • Browser Internals

into one platform.
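To make one of those terms concrete: SDP, the format carried in the offer and answer, is plain text. The capabilities exchanged in Step 3 appear as `m=` (media) lines and as `a=rtpmap` lines that map payload types to codecs. The sample below is a hand-trimmed illustration, not a complete offer captured from a browser, and `list_codecs` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Hand-trimmed illustration of an SDP body; a real browser offer is much longer.
SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=- 4611731400430051336 2 IN IP4 127.0.0.1",
    "s=-",
    "t=0 0",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=rtpmap:111 opus/48000/2",
    "m=video 9 UDP/TLS/RTP/SAVPF 96",
    "a=rtpmap:96 VP8/90000",
]) + "\r\n"


def list_codecs(sdp: str) -> Dict[str, List[str]]:
    """Map each media kind (audio, video, ...) to the codec names it offers.

    In this sketch a later section of the same kind replaces an earlier one.
    """
    codecs: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            # m=<media> <port> <protocol> <payload types...>
            current = line[2:].split()[0]
            codecs[current] = []
        elif line.startswith("a=rtpmap:") and current is not None:
            # a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
            _, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            codecs[current].append(encoding.split("/")[0])
    return codecs


print(list_codecs(SAMPLE_SDP))  # {'audio': ['opus'], 'video': ['VP8']}
```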

 
