Databases

Databases are a core component in system design, serving as the backbone for storing, retrieving, and managing data. In a well-designed system, the database must balance performance, scalability, consistency, and reliability while supporting the application's specific data requirements.

Types of Databases

Relational Databases (SQL)

  • Characteristics:
    Use structured schemas, support ACID transactions (Atomicity, Consistency, Isolation, Durability), and rely on SQL for querying.
  • Examples:
    MySQL, PostgreSQL, Oracle.
  • Use Cases:
    Ideal for applications that require strong consistency and structured data (e.g., financial systems, CRM).

NoSQL Databases

  • Characteristics:
    Offer flexible schemas, can scale horizontally, and are designed to handle large volumes of unstructured or semi-structured data. They often sacrifice strict consistency for improved scalability and availability.
  • Types:
    • Document Stores: (e.g., MongoDB, Couchbase)
    • Key-Value Stores: (e.g., Redis, DynamoDB)
    • Column-Family Stores: (e.g., Cassandra, HBase)
    • Graph Databases: (e.g., Neo4j, JanusGraph)
  • Use Cases:
    Suited for real-time analytics, content management, and applications that require flexible data models (e.g., social media, IoT).

1️⃣ Relational Database

A relational database is a type of database that stores data in structured tables, where relationships between the data are established through keys. It’s built on the relational model, which was introduced by E.F. Codd in 1970.

Core Components

  • Tables:
    The fundamental building blocks of a relational database. Each table represents an entity (for example, Customers or Orders) and contains rows (records) and columns (attributes).
  • Rows:
    Each row in a table is a single record or instance of the entity.
  • Columns:
    Columns define the data attributes for the entity, such as customer names, order dates, or product prices. Each column has a defined data type (e.g., integer, string, date).
  • Primary Key:
    A unique identifier for each row in a table. It ensures that each record can be uniquely distinguished.
  • Foreign Key:
    A field in one table that uniquely identifies a row of another table, establishing a relationship between the two tables.
  • Relationships:
    • One-to-One: Each record in Table A relates to one record in Table B.
    • One-to-Many: A single record in Table A relates to multiple records in Table B.
    • Many-to-Many: Records in Table A relate to multiple records in Table B and vice versa, often managed with a junction table.

Structured Query Language (SQL)

SQL is the standardized language used to manage and query relational databases.

  • Capabilities:
    • Data Definition Language (DDL): Commands like CREATE, ALTER, and DROP define and modify schema.
    • Data Manipulation Language (DML): Commands such as INSERT, UPDATE, DELETE, and SELECT handle data retrieval and manipulation.
    • Data Control Language (DCL): Commands like GRANT and REVOKE manage access control and permissions.

ACID Properties

Relational databases are known for supporting ACID transactions, ensuring data reliability and consistency:

  • Atomicity:
    A transaction is treated as a single unit that either fully completes or fully fails.
  • Consistency:
    Transactions bring the database from one valid state to another, enforcing all predefined rules (constraints).
  • Isolation:
    Concurrent transactions operate independently without interference, ensuring that intermediate transaction states are not visible to others.
  • Durability:
    Once a transaction is committed, it remains persistent, even in the event of a system failure.

Data Modeling and Normalization

  • Data Modeling:
    The process of structuring data logically within tables, ensuring that each piece of data is stored in an appropriate location.
  • Normalization:
    A systematic approach to organizing data to reduce redundancy and improve data integrity.
    • First Normal Form (1NF): Eliminate duplicate columns, and ensure each cell contains a single value.
    • Second Normal Form (2NF): Remove subsets of data that apply to multiple rows and place them in separate tables.
    • Third Normal Form (3NF): Eliminate columns that are not dependent on the primary key.
    • Boyce-Codd Normal Form (BCNF): A stricter version of 3NF to handle specific types of anomalies.

N + 1 Problem

The N + 1 query problem in SQL is a common performance pitfall that occurs when an application executes one query to retrieve a set of records (the 1 query) and then, for each of those records (the N records), it executes an additional query to fetch related data. This results in a total of N + 1 queries, which significantly degrade performance, especially when N is large.

How the Problem Occurs

Imagine you have two related tables, such as users and posts, where each user can have multiple posts. The goal is to display a list of users along with their posts.

Naive Approach

  1. Initial Query:
    1. You run a query to retrieve all users:

      SELECT * FROM users;
      
    2. Suppose this return 100 users
  2. N Additional Queries
    1. Then, for each user, you run another query to fetch that user's posts:

      SELECT * FROM posts WHERE user_id = ?;
      
    2. If you execute this query 100 times (once per user), you end up with 1 (initial) + 100 (once per user) = 101 queries.

This pattern of one query to load the parent records and one additional queries per parent to load the child records is what constitutes the N+1 query problem.