Memory Alignment

The Basics of Memory Access

The CPU reads data from memory and writes data to it. Although memory is byte-addressable, modern CPUs are not optimized to fetch one byte at a time. Instead, they typically transfer data in chunks of 2, 4, 8, 16, or even 32 bytes. A chunk that matches the CPU's natural data width is called a "word", and its size usually corresponds to the width of the CPU's registers and data path.

What is Memory Alignment?

Memory alignment refers to the way data is arranged in memory so that it falls on specific address boundaries. A data item is naturally aligned when it is stored at an address that is a multiple of its size. For instance, a 4-byte integer is ideally stored at an address that is a multiple of 4 (e.g., 0x0004, 0x0008, etc.). Properly aligned data lets the CPU read and write it with fewer memory accesses, which speeds up operations.

Alignment Boundaries:

Data is typically aligned on boundaries that are multiples of its size. Here are some common alignment rules for basic types on a typical 32-bit system:

  • char: 1-byte alignment (no specific alignment requirement).
    • It can be placed at any address.
  • short: 2-byte alignment (must be stored at addresses divisible by 2).
    • It should be placed at even addresses.
  • int: 4-byte alignment (must be stored at addresses divisible by 4).
    • It should be placed at addresses that are multiples of 4.
  • float: 4-byte alignment (must be stored at addresses divisible by 4).
    • It should be placed at addresses that are multiples of 4.
  • double: 8-byte alignment (must be stored at addresses divisible by 8).
    • It should be placed at addresses that are multiples of 8.
  • long: 4-byte alignment (on a 32-bit system).
    • It should be placed at addresses divisible by 4.
  • long long: 8-byte alignment.
    • It should be placed at addresses divisible by 8.
  • long double: 8-byte or 16-byte alignment (depends on the system).
    • It should be placed at addresses divisible by 8 or 16.
  • pointer: 4-byte alignment (on a 32-bit system).
    • It should be placed at addresses divisible by 4.

Below is the table of Aligned and Unaligned memory access based on address and access size.

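| Address | 1 Byte (8 bits) | 2 Bytes (16 bits) | 4 Bytes (32 bits) | 8 Bytes (64 bits) |
|---------|-----------------|-------------------|-------------------|-------------------|
| 0x0     | aligned         | aligned           | aligned           | aligned           |
| 0x1     | aligned         | unaligned         | unaligned         | unaligned         |
| 0x2     | aligned         | aligned           | unaligned         | unaligned         |
| 0x3     | aligned         | unaligned         | unaligned         | unaligned         |
| 0x4     | aligned         | aligned           | aligned           | unaligned         |
| 0x5     | aligned         | unaligned         | unaligned         | unaligned         |
| 0x6     | aligned         | aligned           | unaligned         | unaligned         |
| 0x7     | aligned         | unaligned         | unaligned         | unaligned         |
| 0x8     | aligned         | aligned           | aligned           | aligned           |
| 0x9     | aligned         | unaligned         | unaligned         | unaligned         |
| 0xA     | aligned         | aligned           | unaligned         | unaligned         |
| 0xB     | aligned         | unaligned         | unaligned         | unaligned         |
| 0xC     | aligned         | aligned           | aligned           | unaligned         |
| 0xD     | aligned         | unaligned         | unaligned         | unaligned         |
| 0xE     | aligned         | aligned           | unaligned         | unaligned         |
| 0xF     | aligned         | unaligned         | unaligned         | unaligned         |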

As a practical note, for access sizes up to 16 bytes, an access is aligned if the rightmost digit of the address (written in hexadecimal) is divisible by the number of bytes accessed.

Why Alignment Matters

  1. Performance: Aligned memory accesses are faster because the CPU can read or write an entire word in a single operation. Misaligned accesses may require multiple operations, additional processing, and memory fetches, leading to performance degradation.
  2. Correctness: Some CPUs enforce alignment requirements and generate faults or exceptions on misaligned accesses. Ensuring proper alignment prevents such issues and enhances software stability.
  3. Hardware Optimization: Modern CPUs and memory subsystems are optimized for aligned accesses. Proper alignment allows the use of hardware features like cache lines and prefetching, further boosting performance.

Memory Alignment in Different Architectures

The alignment requirements and the way CPUs handle misaligned accesses vary across different architectures:

  1. x86 Architecture: Generally tolerant of misaligned accesses but at the cost of performance penalties due to additional processing.
  2. ARM Architecture: Earlier versions strictly required aligned accesses, while modern ARM CPUs can handle misaligned accesses but with a performance hit.
  3. PowerPC Architecture: Behavior depends on the implementation and the instruction used. Some misaligned accesses are handled in hardware at a performance cost, while others raise an alignment exception.

The Importance of Memory Alignment

To fully understand the significance of memory alignment, consider a scenario where data is misaligned. Suppose you have a 4-byte int stored at an address that is not divisible by 4. The CPU would need to perform two memory accesses to read or write this data, first accessing part of the data from one memory address and then accessing the remaining part from the next address. This not only doubles the number of memory accesses but also introduces additional computational overhead to merge or split the data.

In contrast, when data is aligned properly, the CPU can read or write the entire unit in a single, efficient memory access. This is why programming languages and compilers often include mechanisms to ensure proper alignment, and why developers need to be mindful of alignment when optimizing performance-critical code.

CPU Aligned and Misaligned Memory Read:

The CPU tries to read data in word-sized units for efficiency. The word size of a CPU typically refers to the number of bits it can process in a single instruction. For example, the word size of a 32-bit system is 32 bits (4 bytes), and the word size of a 64-bit system is 64 bits (8 bytes).

For example, consider a struct in memory that looks like this:

struct Example {
    char a;  // one byte
    int b;   // four bytes
    short c; // two bytes
};

On a 32-bit processor, it would most likely be laid out as shown here:

image-202.png

The processor can read each of these members in a single memory access.

Suppose you access char a: the CPU reads it in a single access, since 0x0000 is 4-byte aligned. Accessing int b is just as easy, because it starts at 0x0004, which is also 4-byte aligned. Finally, short c starts at 0x0008, another 4-byte-aligned address, so it too is read in one access.

If you apply the packed attribute to the structure, the compiler does not insert padding to align its members to 4-byte boundaries.

image-203.png

In the provided image, the structure is laid out in memory as follows:

  1. char a is at address 0x0000.
  2. int b starts at address 0x0001 (this is misaligned since int typically needs to be on a 4-byte boundary).
  3. short c starts at address 0x0005 (this is also misaligned since short typically needs to be on a 2-byte boundary).

| Address | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|---------|--------|--------|--------|--------|--------|--------|--------|--------|
| Data    | a      | b1     | b2     | b3     | b4     | c1     | c2     | ...... |

Here, b1, b2, b3, and b4 are the bytes of the int b, with b1 being the LSB and b4 the MSB (little-endian byte order).

-: Reading char a :-

  • Address: 0x0000
  • Size: 1 byte

Since char a is only 1 byte, it can be read directly from memory without any issues, regardless of alignment. The CPU fetches the byte at 0x0000.

| Address | 0x0000 | 0x0001 | 0x0002 | 0x0003 |
|---------|--------|--------|--------|--------|
| Data    | a      | b1     | b2     | b3     |

  • Single Fetch:
    • The CPU performs a read operation at the address 0x0000 to fetch the byte a.
    • The CPU reads the 4-byte word starting at 0x0000, which fetches the data [a, b1, b2, b3]; only the first byte, a, is used.

-: Reading int b :-

  • First Fetch:
    • The CPU reads the 4-byte word starting at 0x0000, getting the data: [a, b1, b2, b3].
  • Second Fetch:
    • The CPU reads the 4-byte word starting at 0x0004, getting the data: [b4, c1, c2, ......].
  • Combining Data:
    • The CPU then extracts the relevant bytes from these reads to form the 4-byte int b.
    • From the first fetch: it takes b1, b2, b3.
    • From the second fetch: it takes b4.
  • Shifting and Combining:
    • The CPU aligns these bytes correctly to form the integer.
    • int b = (b4 << 24) | (b3 << 16) | (b2 << 8) | b1;

| Address | 0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|---------|--------|--------|--------|--------|--------|--------|--------|--------|
| Data    | a      | b1     | b2     | b3     | b4     | c1     | c2     | ...... |

The first fetch covers addresses 0x0000 to 0x0003; the second fetch covers addresses 0x0004 to 0x0007.

Extra Operations Required:

  • Two read operations (one starting at 0x0000 and another at 0x0004).
  • Additional steps to extract and combine bytes (this is typically handled internally by the CPU but can be considered as extra processing overhead).

-: Reading short c :-

c1 and c2 are the bytes of the short c, with c1 being the LSB and c2 the MSB in a little-endian system.

  • Single Fetch:
    • The CPU reads the 4-byte word starting at 0x0004, getting the data: [b4, c1, c2, ......].
  • Extracting Data:
    • From the fetched data, the CPU needs the bytes starting from 0x0005 to form the 2-byte short c.
    • It extracts c1 and c2.
  • Combining Bytes:
    • In a little-endian system, the combination is done as follows:
    • short c = (c2 << 8) | c1;

| Address | 0x0004 | 0x0005 | 0x0006 | 0x0007 |
|---------|--------|--------|--------|--------|
| Data    | b4     | c1     | c2     | ...... |

A single 4-byte fetch covers addresses 0x0004 to 0x0007.

Visualizing Memory Alignment

Let's visualize memory alignment with an example. Consider a structure in C:

struct Example {
    char a;   // 1 byte
    int b;    // 4 bytes
    short c;  // 2 bytes
};

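With natural alignment, the compiler lays out the structure as follows:

| Byte Offset | Data      |
|-------------|-----------|
| 0           | a         |
| 1           | padding   |
| 2           | padding   |
| 3           | padding   |
| 4           | b (start) |
| 5           | b         |
| 6           | b         |
| 7           | b (end)   |
| 8           | c (start) |
| 9           | c (end)   |
| 10          | padding   |
| 11          | padding   |

Here, a occupies the first byte, but b must start at a 4-byte boundary, so bytes 1, 2, and 3 are padding. c starts at byte 8, which satisfies its 2-byte alignment. Bytes 10 and 11 are trailing padding: the size of a structure is rounded up to a multiple of its strictest member alignment (4 here), so that every element in an array of such structures stays aligned.

Visualizing this:

| Address | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---------|---|---|---|---|---|---|---|---|---|---|----|----|
| Data    | a | P | P | P | b | b | b | b | c | c | P  | P  |

Here, P denotes padding bytes. On a typical system with a 4-byte int, the size of the structure is 12 bytes, even though its members add up to only 7 bytes.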

Implications of Misalignment

If data is misaligned, the CPU may need to perform more operations to access the data. For example, accessing a misaligned int might require two memory accesses instead of one, significantly impacting performance.

Consider accessing an int that is not aligned:

| Address | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---------|---|---|---|---|---|---|---|---|
| Data    | a | b | b | b | b | c | c |   |

Accessing the int that starts at address 1 would require the CPU to read parts of the int from two different memory locations, increasing the number of cycles needed.