All About the Linking Process

The build process in software development involves several stages that transform source code into an executable program. The details matter most in systems programming, such as kernel development, where the developer controls the layout of the final binary. Below, we'll go through the stages of the build process, including compiling, assembling, and linking, with a focus on how these stages apply in a typical kernel development environment.

1 Overview of the Build Process

  1. Preprocessing: Handles macro substitution, file inclusion, and conditional compilation.
  2. Compilation: Converts preprocessed source code into assembly language.
  3. Assembly: Translates assembly code into machine code, producing object files.
  4. Linking: Combines object files and libraries into a single executable or binary.
  5. Loading: Places the executable in memory and starts its execution.

Detailed Stages of the Build Process

1 Preprocessing:

The preprocessing stage handles the directives in the source code that start with # (e.g., #include, #define). The preprocessor performs the following tasks:

  • File Inclusion: Replaces #include directives with the contents of the included files.
  • Macro Expansion: Replaces macro names with their definitions.
  • Conditional Compilation: Includes or excludes parts of the code based on #ifdef, #ifndef, and similar directives.

Example source code (main.c):

#include <stdio.h>
#define MESSAGE "Hello, World!"

int main() {
    printf("%s\n", MESSAGE);
    return 0;
}

Preprocessed output (main.i), abbreviated to the one declaration from stdio.h that the program needs:

extern int printf(const char *, ...);
int main() {
    printf("%s\n", "Hello, World!");
    return 0;
}
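
Conditional compilation is the one preprocessing task the example above does not exercise. As a minimal sketch, the hypothetical function build_mode below returns a different string depending on whether a macro named DEBUG is defined (an assumed name, for example passed as -DDEBUG on the compiler command line). The preprocessor removes the branch that is not selected before the compiler sees the code.

```c
/* DEBUG is an assumed macro name; compile with -DDEBUG to select the first branch. */
const char *build_mode(void) {
#ifdef DEBUG
    return "debug";     /* kept only when DEBUG is defined */
#else
    return "release";   /* kept when DEBUG is not defined */
#endif
}
```

Compiled without -DDEBUG, only the "release" branch reaches the compiler.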

2 Compilation

The compilation stage converts preprocessed source code into assembly language. Each source file is compiled independently into an assembly file.

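Example assembly output (main.s). This is illustrative 32-bit x86 code in AT&T syntax; the exact output depends on the compiler and its options. GCC typically replaces a printf call that only prints a string followed by a newline with a call to puts:

.file "main.c"
.section .rodata
.LC0:
    .string "Hello, World!"
.text
.globl main
.type main, @function
main:
    pushl %ebp
    movl %esp, %ebp
    subl $8, %esp
    movl $.LC0, (%esp)
    call puts
    movl $0, %eax
    leave
    ret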

3 Assembly

The assembly stage translates the assembly code into machine code, producing object files. An assembler takes the assembly code and generates binary instructions for the target architecture.

The resulting object file (main.o) is binary and is not shown here; tools such as objdump and readelf can display its contents.

4 Linking

Purpose: To combine object files into a single executable, resolving symbol references and arranging sections in memory.

Tools: Linker (e.g., ld)

Steps:

  • Combine multiple object files.
  • Resolve external symbols (e.g., function calls between object files).
  • Arrange code and data in memory according to the linker script.
  • Generate the final executable or binary file.

5 Loading

Purpose: To load the executable into memory and start its execution.

Steps:

  • Load the executable into the appropriate memory location.
  • Set up the initial execution context (stack, registers).
  • Jump to the entry point of the executable (e.g., the start symbol named by ENTRY in the linker script).

2 Overview of the Linking Process

The linking process can be divided into several key tasks:

  1. Symbol Resolution
  2. Relocation
  3. Section Merging
  4. Memory Layout and Address Assignment
  5. Generation of Executable or Library

1 Symbol Resolution

Purpose: To resolve all symbols (functions, variables) referenced in the code but defined in different modules or libraries.

Steps:

  • The linker collects all object files and libraries specified in the link command.
  • It builds a symbol table from these files, mapping symbol names to their addresses.
  • If a symbol is referenced in one object file but defined in another, the linker ensures it knows where to find the definition.
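
The lookup described in these steps can be sketched in C. This is a simplified model, not how ld is implemented: the struct names and the resolve_symbol function are hypothetical, and each symbol already carries a final address, whereas a real linker works with section-relative values until layout is done.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry in an object file's symbol table (simplified, hypothetical). */
struct symbol {
    const char *name;
    int defined;            /* 1 = defined in this object, 0 = only referenced */
    unsigned long address;  /* meaningful only when defined */
};

struct object_file {
    const char *filename;
    const struct symbol *symbols;
    size_t count;
};

/*
 * Search every object for a definition of `name`.
 * Returns  0 and stores the address if exactly one object defines it,
 *         -1 if no object defines it (an "undefined reference"),
 *         -2 if more than one object defines it ("multiple definition").
 */
int resolve_symbol(const struct object_file *objs, size_t nobjs,
                   const char *name, unsigned long *address_out) {
    int found = 0;
    assert(name != NULL && address_out != NULL);
    for (size_t i = 0; i < nobjs; i++) {
        for (size_t j = 0; j < objs[i].count; j++) {
            const struct symbol *s = &objs[i].symbols[j];
            if (s->defined && strcmp(s->name, name) == 0) {
                if (found)
                    return -2;
                *address_out = s->address;
                found = 1;
            }
        }
    }
    return found ? 0 : -1;
}
```

The two failure codes correspond to the two most common link errors: a reference with no definition, and a name defined in more than one object file.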

2 Relocation

Purpose: To adjust addresses within the code and data sections to reflect their actual locations in memory.

Steps:

  • Each object file contains relocation entries that tell the linker where to adjust addresses.
  • The linker calculates the final memory addresses for code and data sections.
  • It updates the addresses in the code and data sections based on the calculated addresses.
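
As a rough sketch of what "updating an address" means, the C functions below apply the two most common 32-bit x86 relocation types to a section held in a byte buffer. The formulas are the standard ones (R_386_32 computes S + A, R_386_PC32 computes S + A - P, where S is the symbol address, A the addend already stored in the field, and P the address of the field itself); the function names and buffer-based interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read and write a 32-bit little-endian value inside a section image. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void write_le32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Absolute relocation (R_386_32): field = S + A. */
void reloc_abs32(uint8_t *section, size_t size, size_t offset,
                 uint32_t sym_addr) {
    assert(offset + 4 <= size);
    uint32_t addend = read_le32(section + offset);  /* A is stored in place */
    write_le32(section + offset, sym_addr + addend);
}

/* PC-relative relocation (R_386_PC32): field = S + A - P. */
void reloc_pc32(uint8_t *section, size_t size, uint32_t section_addr,
                size_t offset, uint32_t sym_addr) {
    assert(offset + 4 <= size);
    uint32_t addend = read_le32(section + offset);     /* A, usually -4 for a call */
    uint32_t place = section_addr + (uint32_t)offset;  /* P */
    write_le32(section + offset, sym_addr + addend - place);
}
```

For a call instruction the assembler stores -4 in the field, so the patched value ends up being the distance from the next instruction to the target, which is what the CPU expects.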

3 Section Merging

Purpose: To combine sections of the same type (e.g., .text, .data) from different object files into single sections.

Steps:

  • The linker merges all .text sections from different object files into a single .text section.
  • Similarly, it merges .data sections, .bss sections, and other relevant sections.
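
Merging amounts to placing each input section after the previous one while respecting its alignment. The following C sketch computes the offset each input section receives inside the merged output section; the input_section type and merge_sections function are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round `value` up to the next multiple of `align` (a power of two). */
static size_t align_up(size_t value, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* An input section contributed by one object file (hypothetical type). */
struct input_section {
    size_t size;        /* bytes of content */
    size_t align;       /* required alignment, a power of two */
    size_t out_offset;  /* filled in by merge_sections */
};

/*
 * Place each input section after the previous one, honoring alignment.
 * Returns the total size of the merged output section.
 */
size_t merge_sections(struct input_section *in, size_t count) {
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        offset = align_up(offset, in[i].align);
        in[i].out_offset = offset;
        offset += in[i].size;
    }
    return offset;
}
```

With three .text inputs of 0x1d, 0x30, and 5 bytes (alignments 16, 16, and 4), the inputs land at offsets 0, 0x20, and 0x50, and the merged section is 0x55 bytes long.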

4 Memory Layout and Address Assignment

Purpose: To define the memory layout of the final executable, specifying where each section should be loaded into memory.

Steps:

  • The linker script (if used) provides detailed instructions on how to arrange sections in memory.
  • The linker uses this script to assign starting addresses to each section.
  • It ensures that sections are aligned correctly and follow the specified memory layout.

5 Generation of Executable or Library

Purpose: To create the final output file (an executable or a shared library) that can be loaded and executed by the operating system. Static libraries are archives of object files created by an archiver such as ar, not by the linker.

Steps:

  • The linker writes the combined and relocated sections to the output file.
  • It generates necessary headers and tables (e.g., symbol table, relocation table) required by the operating system.

3 Example: Linking Process with Linker Script

To illustrate these concepts, let's consider an example using a linker script to create an executable for a simple kernel.

Linker Script Example (link.ld)

OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;

SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}

Breakdown of the Linker Script:

  • OUTPUT_FORMAT(elf32-i386): Specifies the output format as ELF for a 32-bit Intel architecture.
  • ENTRY(start): Defines the entry point of the executable as the symbol start.
    • When the OS or bootloader loads this executable, it will begin execution at the address associated with the start symbol.
    • The start symbol is typically defined in one of the source files, usually in assembly or C, and marks the beginning of the program's execution flow.
    • ENTRY only records the address of start; it does not move the symbol. start sits at 0x00100000 only if its code is the first thing placed in the .text section.
  • phys = 0x00100000;: Defines the base address (1 MB) at which the sections of the executable are placed and loaded.
  • SECTIONS { ... }: Defines the memory layout of the executable.

Sections Defined:

  • .text: Contains the code and read-only data (*(.text) and *(.rodata)).
    • It starts at the address given by phys, 0x00100000, and AT(phys) sets its load address to the same value.
  • .data: Contains initialized data (*(.data)), positioned right after the .text section.
    • Its load address is phys + (data - code), which keeps the section at the same offset from phys in the loaded image as in the linked layout.
  • .bss: Contains uninitialized data (*(.bss)), positioned after the .data section.
    • Its load address is phys + (bss - code), computed the same way.
  • end = .;: Marks the end address of the entire memory layout.
  • /DISCARD/: Excludes sections that are not needed in the final executable (here .comment, .eh_frame, and .note.gnu.build-id).

Linking Process Steps:

  1. Symbol Resolution: The linker collects all object files (e.g., kernel.o) and libraries specified in the link command. It builds a symbol table mapping symbol names to their addresses.
  2. Relocation: The linker reads relocation entries in the object files and adjusts addresses based on the final memory layout.
  3. Section Merging: It merges .text, .data, and .bss sections from different object files into single sections as specified in the linker script.
  4. Memory Layout and Address Assignment: Using the linker script, the linker assigns starting addresses to each section and ensures proper alignment.
  5. Generation of Executable: The linker writes the combined and relocated sections to the output file (e.g., kernel.elf) and generates the headers and tables the file needs.

Practical Outcome

Every time the linker script is used:

1 Placement of start:

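  • The start symbol, which is typically the first instruction of your kernel or program, will be located at the physical address 0x00100000, provided its code is the first input placed in .text (for example, because its object file is listed first on the link command line).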

2 Loading:

  • When a bootloader loads the ELF file generated by this linker script, it will load the .text section (and thus the start symbol) into memory starting at 0x00100000.

3 Execution:

  • The bootloader will then jump to the entry point recorded in the ELF header, which is the address of start (0x00100000 when start is first in .text).

Example:

.section .text
.globl start
start:
    cli                 # Disable maskable interrupts
    hlt                 # Halt the CPU

.section .rodata
message:
    .ascii "Hello, kernel!\n"

.section .data
my_data:
    .long 0xdeadbeef

.section .bss
.lcomm my_bss, 4

When you link this code using the provided linker script:

  • The .text section, containing the start label and its instructions, is placed at 0x00100000.
  • The .data section is placed after .text.
  • The .bss section is placed after .data.
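
To make "placed after" concrete, the C sketch below mirrors the address arithmetic of link.ld: each section starts at the current location, and ". = ALIGN(4096);" pads it to the next page boundary. The struct and function names are hypothetical. The sizes in the usage note assume that .text and .rodata together occupy 17 bytes (two one-byte instructions plus the 15-byte string) and that .data and .bss hold 4 bytes each, as in the assembly example above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE 4096u

/* Mirror of ". = ALIGN(4096);": round up to the next page boundary. */
static uint32_t page_align(uint32_t addr) {
    return (addr + PAGE - 1) & ~(PAGE - 1);
}

/* Values the script assigns to the symbols code, data, bss and end. */
struct layout {
    uint32_t code;
    uint32_t data;
    uint32_t bss;
    uint32_t end;
};

struct layout compute_layout(uint32_t phys, uint32_t text_size,
                             uint32_t data_size, uint32_t bss_size) {
    struct layout l;
    assert(phys % PAGE == 0);
    l.code = phys;                             /* .text starts at phys      */
    l.data = page_align(l.code + text_size);   /* .text padded to a page    */
    l.bss  = page_align(l.data + data_size);   /* .data padded to a page    */
    l.end  = page_align(l.bss + bss_size);     /* .bss padded to a page     */
    return l;
}
```

Under those size assumptions, compute_layout(0x00100000, 17, 4, 4) gives data = 0x00101000, bss = 0x00102000, and end = 0x00103000.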

Loading and Execution by the Bootloader:

  • The bootloader reads the ELF file and loads the .text section at 0x00100000.
  • It then jumps to the entry point recorded in the ELF header, here 0x00100000, and execution begins at the start label.
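
A loader finds the address to jump to in the e_entry field of the ELF header, which the linker fills in from ENTRY(start). As a minimal sketch, the hypothetical function elf32_entry below reads that field from a 32-bit little-endian ELF image held in memory. It parses the header bytes by hand rather than relying on a system header, and it returns 0 for anything that does not look like such an image, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the entry point (e_entry) of a 32-bit little-endian ELF image,
 * or 0 if the buffer does not look like one.
 */
uint32_t elf32_entry(const uint8_t *image, size_t len) {
    static const uint8_t magic[4] = {0x7f, 'E', 'L', 'F'};
    assert(image != NULL);
    if (len < 52 || memcmp(image, magic, 4) != 0)   /* 52 = ELF32 header size */
        return 0;
    if (image[4] != 1 || image[5] != 1)             /* ELFCLASS32, ELFDATA2LSB */
        return 0;
    /* e_entry sits at offset 24: after e_ident[16], e_type, e_machine, e_version. */
    const uint8_t *p = image + 24;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

A real bootloader would also walk the program headers to copy each loadable segment to its load address before jumping.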

4 Assumptions in Code

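Let's assume you have the following C code. start is defined first because, without optimization, compilers usually emit functions in source order:

int print(void);    /* defined below, after start */

int start(void) {
    /* Function implementation */
    return 0;
}

int print(void) {
    /* Function implementation */
    return 0;
}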

4.1 Compilation and Linking

1 Compilation:

  • When the source files are compiled, the print and start functions will be placed in the .text section of the compiled object files.

2 Linking:

  • During linking, the .text sections of all object files are merged together and placed in the .text section defined in the linker script.

4.2 Address Calculation

The linker script specifies that the .text section starts at 0x00100000:

.text phys : AT(phys) {
    code = .;
    *(.text)
    *(.rodata)
    . = ALIGN(4096);
}
  • phys is defined as 0x00100000.
  • .text section starts at phys, which is 0x00100000.

Address of start

  • The start function is marked as the entry point with ENTRY(start).
  • Given that .text starts at 0x00100000, start is at that address only if its code is the first thing in the merged .text section, that is, it comes first in the first object file on the link command line.
  • Under that assumption, start will be at address 0x00100000.

Address of print

  • The print function will follow in the .text section after start.
  • The exact address of print depends on the size of the start function and on any alignment padding the compiler inserts between functions.
  • If start is, for example, 0x20 bytes long, then print will be located at 0x00100020.

4.3 Example Memory Layout

Assuming the following:

  • start function is placed first and is 32 bytes long (0x20).
  • print function follows immediately after start.

The memory layout would be:

  • start at 0x00100000
  • print at 0x00100020 (assuming start is 32 bytes)

4.4 Graphical Representation

Here's a graphical representation of the memory layout:

0x00100000  -->  start function
                +-------------------+
                | start() code      |
                | (32 bytes)        |
                +-------------------+
0x00100020  -->  print function
                +-------------------+
                | print() code      |
                | (next function)   |
                +-------------------+
                | ...               |
                +-------------------+

5 Linker Script Overview

OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;

SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}

5.1 Linker Script Breakdown

1. OUTPUT_FORMAT(elf32-i386)

  • Purpose: Specifies the format of the output file.
  • Usage: OUTPUT_FORMAT(elf32-i386)
  • Explanation: This tells the linker to produce an ELF (Executable and Linkable Format) file for the 32-bit x86 architecture.

2. ENTRY(start)

  • Purpose: Defines the entry point of the program.
  • Usage: ENTRY(start)
  • Explanation: This specifies that the symbol start is the entry point of the executable. When the program starts executing, it begins at the start symbol.

3. phys = 0x00100000;

  • Purpose: Defines a variable representing a physical address.
  • Usage: phys = 0x00100000;
  • Explanation: Sets the phys variable to 0x00100000 (1 MB). This is a common address for loading the kernel in x86 architecture, as it is above the memory used by BIOS and other system functions.

4. SECTIONS { ... }

  • Purpose: Defines the memory layout and sections of the output file.
  • Usage: SECTIONS { ... }
  • Explanation: The SECTIONS command defines how different sections of the input files should be mapped into the output file.

5. .text phys : AT(phys) { ... }

  • Purpose: Specifies the .text section's location and content.
  • Usage: .text phys : AT(phys) { ... }
  • Explanation:
    • .text phys specifies that the .text section should start at the address phys, which is 0x00100000. This is the address the code is linked to run at (its virtual memory address, VMA).
    • AT(phys) tells the linker that the load address (LMA) for this section is also 0x00100000.
    • Inside the braces {}, the code = .; line sets the variable code to the current address, marking the start of the .text section.
    • * wildcard pattern: This includes all input sections named .text and .rodata from the input object files.
    • . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.

6. .data : AT(phys + (data - code)) { ... }

  • Purpose: Specifies the .data section's location and content.
  • Usage: .data : AT(phys + (data - code)) { ... }
  • Explanation:
    • .data specifies the start of the .data section.
    • AT(phys + (data - code)) calculates the load address for the .data section as the physical address plus the offset between the data and code symbols.
    • Inside the braces {}, data = .; sets the variable data to the current address, marking the start of the .data section.
    • * wildcard pattern: This includes all input sections named .data from the input object files.
    • . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.

7. .bss : AT(phys + (bss - code)) { ... }

  • Purpose: Specifies the .bss section's location and content.
  • Usage: .bss : AT(phys + (bss - code)) { ... }
  • Explanation:
    • .bss specifies the start of the .bss section.
    • AT(phys + (bss - code)) calculates the load address for the .bss section as the physical address plus the offset between the bss and code symbols.
    • Inside the braces {}, bss = .; sets the variable bss to the current address, marking the start of the .bss section.
    • * wildcard pattern: This includes all input sections named .bss from the input object files.
    • . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.

8. end = .;

  • Purpose: Defines a symbol marking the end of the last section.
  • Usage: end = .;
  • Explanation: Sets the end symbol to the current address, which marks the end of all sections defined so far. This is useful for calculating the size of the program or for placing data after all sections.
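
Symbols such as code, bss, and end have no storage of their own; only their addresses are meaningful, so C code usually declares them as arrays and uses the array name as the address. The sketch below shows two small helpers (hypothetical names) that a kernel might use with those symbols, for instance to clear .bss when the loader has not already done so.

```c
#include <assert.h>
#include <stddef.h>

/* Zero the bytes in [from, to); a kernel would call this on its .bss. */
void zero_range(char *from, char *to) {
    assert(from <= to);
    while (from < to)
        *from++ = 0;
}

/* Number of bytes between two linker-defined addresses. */
size_t span_bytes(const char *from, const char *to) {
    assert(from <= to);
    return (size_t)(to - from);
}

/*
 * Inside the kernel, the addresses come from link.ld:
 *
 *     extern char code[], bss[], end[];
 *
 *     zero_range(bss, end);
 *     size_t image_size = span_bytes(code, end);
 */
```

The helpers take plain pointers so they can be exercised on an ordinary buffer outside the kernel.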

9. /DISCARD/ : { ... }

  • Purpose: Discards unwanted sections.
  • Usage: /DISCARD/ : { ... }
  • Explanation: Sections listed within the /DISCARD/ block are not included in the final output file. This is typically used to remove debugging or unnecessary sections.
    • * wildcard pattern: This includes all input sections named .comment, .eh_frame, and .note.gnu.build-id from the input object files and discards them.

6 Linker Script

A linker script is a text file that tells the linker how the sections of the input object files should be merged to create the output file.

It controls the layout and organization of the output file by specifying how the linker should place the sections of the input files in the output file, how to handle memory regions, how to define symbols, and more.

  • GNU linker scripts conventionally use the .ld file extension.
  • It specifies how different sections of code and data should be placed in memory.
  • You pass a custom linker script to the linker with the -T option; without it, the linker falls back to its built-in default script.

Key Fields and Directives in Linker Scripts

  1. OUTPUT_FORMAT
  2. ENTRY
  3. SECTIONS
  4. MEMORY
  5. PHDRS
  6. SYMBOLS
  7. ASSERT
  8. INCLUDE
  9. STARTUP
  10. OUTPUT
  11. REGION_ALIAS
  12. SEARCH_DIR
  13. EXTERN
  14. FORCE_COMMON_ALLOCATION
  15. FORCE_SHARED_ALLOCATION

1 OUTPUT_FORMAT

Specifies the format of the output file.

Usage:

OUTPUT_FORMAT(format)

Example:

OUTPUT_FORMAT(elf32-i386)

This specifies that the output file should be in the ELF format for a 32-bit x86 architecture.

2 ENTRY

Defines the entry point of the program where execution starts.

Usage:

ENTRY(symbol)

Example:

ENTRY(start)

This specifies that the start symbol is the entry point of the executable.

3 SECTIONS

Describes the layout of the output file in memory, specifying where each section should be placed.

Usage:

SECTIONS
{
    ...
}

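Example (the > RAM suffix assigns each section to a memory region named RAM, which must be defined with the MEMORY command):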

SECTIONS
{
    .text : {
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    } > RAM

    .data : {
        *(.data)
        . = ALIGN(4096);
    } > RAM

    .bss : {
        *(.bss)
        . = ALIGN(4096);
    } > RAM

    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
    }
}

Key Terms:

  • Section Name: The name of the section, such as .text, .data, or .bss.
  • Address Assignment: Specifies the start address of a section.
    • Example: .text 0x00100000 : { ... }
  • Wildcard Patterns: Used to match section names from input files.
    • Example: *(.text), *(.data)
  • Alignment: Ensures sections start at aligned addresses.
    • Example: . = ALIGN(4096);

4 MEMORY

Defines memory regions for placing sections.

Usage:

MEMORY
{
    name (attr) : ORIGIN = origin, LENGTH = length
    ...
}

Example:

MEMORY
{
    RAM (wx) : ORIGIN = 0x00100000, LENGTH = 0x400000
    ROM (rx) : ORIGIN = 0x00000000, LENGTH = 0x10000
}

Key Terms:

  • Name: The name of the memory region.
  • Attributes: Specify the permissions, such as r (read), w (write), x (execute).
  • ORIGIN: The start address of the memory region.
  • LENGTH: The size of the memory region.
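
When the sections assigned to a region do not fit in it, the linker reports an error instead of producing an output file. As a rough sketch of that check, the hypothetical C function below reports by how many bytes a section placed at a given address would overflow a region described by its ORIGIN and LENGTH.

```c
#include <assert.h>
#include <stdint.h>

/* A memory region as declared in MEMORY (hypothetical C mirror). */
struct region {
    uint32_t origin;  /* ORIGIN */
    uint32_t length;  /* LENGTH */
};

/*
 * Return how many bytes of [addr, addr + size) fall past the end of
 * the region; 0 means the section fits.
 */
uint64_t region_overflow(struct region r, uint32_t addr, uint32_t size) {
    assert(addr >= r.origin);  /* the section must start inside the region */
    uint64_t region_end  = (uint64_t)r.origin + r.length;
    uint64_t section_end = (uint64_t)addr + size;
    return section_end > region_end ? section_end - region_end : 0;
}
```

For the RAM region above (ORIGIN = 0x00100000, LENGTH = 0x400000), a 0x2000-byte section placed at 0x004ff000 would overflow by 0x1000 bytes.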

5 PHDRS

Describes the program headers for the output file.

Usage:

PHDRS
{
    name type [attributes]
    ...
}

Example:

PHDRS
{
    text PT_LOAD FILEHDR PHDRS;
    data PT_LOAD;
}

Key Terms:

  • Name: The name of the program header.
  • Type: The type of the segment, such as PT_LOAD.
  • Attributes: Additional attributes, like FILEHDR, PHDRS.

6 SYMBOLS

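Defines symbols and assigns values to them. Unlike the other entries in this list, SYMBOLS is not a keyword: symbols are defined with plain assignment statements.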

Usage:

symbol = expression;

Example:

_start = 0x100000;

This defines the _start symbol with the value 0x100000.

Advanced Usage:

  • PROVIDE: Defines a symbol only if it is not already defined.
    • Example: PROVIDE(_stack = 0x200000);
  • ASSERT: Ensures certain conditions are met.
    • Example: ASSERT(_stack > 0x200000, "Stack too low!");

7 ASSERT

Ensures certain conditions are met during linking.

Usage:

ASSERT(condition, message)

Example:

ASSERT(_stack > 0x200000, "Stack too low!");

This asserts that the _stack symbol is greater than 0x200000; otherwise the linker stops with the error message "Stack too low!".

8 INCLUDE

Includes another linker script within the current script.

Usage:

INCLUDE "filename"

Example:

INCLUDE "common.ld"

This includes the contents of common.ld into the current linker script.

9 STARTUP

Specifies the startup file to be linked first.

Usage:

STARTUP(filename)

Example:

STARTUP(startup.o)

This ensures that startup.o is linked first.

10 OUTPUT

Specifies the name of the output file.

Usage:

OUTPUT(filename)

Example:

OUTPUT("kernel.bin")

This sets the output file name to kernel.bin.

11 REGION_ALIAS

Defines an alias for a memory region.

Usage:

REGION_ALIAS(alias, region)

Example:

REGION_ALIAS(RAM_ALIAS, RAM)

This defines RAM_ALIAS as an alias for the RAM memory region.

12 SEARCH_DIR

Adds a directory to the search path for libraries and object files.

Usage:

SEARCH_DIR("directory")

Example:

SEARCH_DIR("/usr/local/lib")

This adds /usr/local/lib to the search path.

13 EXTERN

Forces a symbol to be entered in the output file as an undefined symbol.

Usage:

EXTERN(symbol)

Example:

EXTERN(_start)

This enters _start as an undefined symbol, which can make the linker pull in the library member that defines it.

14 FORCE_COMMON_ALLOCATION

Makes the linker assign space to common symbols even when it produces a relocatable output file (-r). It has the same effect as the -d command-line option.

Usage:

FORCE_COMMON_ALLOCATION

Example:

FORCE_COMMON_ALLOCATION

This ensures that common symbols are allocated even in a relocatable link.

15 FORCE_SHARED_ALLOCATION

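The GNU ld manual does not document a command named FORCE_SHARED_ALLOCATION, so treat this entry with caution. The documented commands in this family are FORCE_COMMON_ALLOCATION (above), FORCE_GROUP_ALLOCATION, which makes the linker place section group members like normal input sections, and INHIBIT_COMMON_ALLOCATION, which stops the linker from assigning addresses to common symbols.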

7 Various Symbols and Commands

1 . (Dot):

Represents the current location counter within the memory layout. It's used to specify the current memory address.

2 Wildcards:

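Wildcard patterns follow shell conventions: * matches any sequence of characters, ? matches a single character, and [chars] matches any one of the listed characters. In *(.text), the * before the parentheses matches every input file, and .text selects the sections with exactly that name in those files.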
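
One practical consequence is that a pattern without a wildcard matches only that exact section name. The C sketch below uses the POSIX fnmatch function as a stand-in for the linker's shell-style matching (an assumption made for illustration; section_matches is a hypothetical helper) to show which section names a pattern would select.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Returns 1 if `section` matches the shell-style `pattern`, 0 otherwise. */
int section_matches(const char *pattern, const char *section) {
    assert(pattern != NULL && section != NULL);
    return fnmatch(pattern, section, 0) == 0;
}
```

Here section_matches(".text", ".text.startup") is 0 while section_matches(".text*", ".text.startup") is 1, which is why scripts often write *(.text .text.*) to collect sections such as .text.startup as well.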

3 Comments:

Comments in a linker script use C syntax, /* like this */, and provide explanations and annotations within the script.