The build process in software development involves several stages that transform source code into an executable program. This process is especially detailed and crucial in systems programming, such as kernel development. Below, we'll go through the stages of the build process, including compiling, assembling, and linking, with a focus on how these stages apply in a typical kernel development environment.
1 Overview of the Build Process
- Preprocessing: Handles macro substitution, file inclusion, and conditional compilation.
- Compilation: Converts preprocessed source code into assembly language.
- Assembly: Translates assembly code into machine code, producing object files.
- Linking: Combines object files and libraries into a single executable or binary.
- Loading: Places the executable in memory and starts its execution.
Detailed Stages of the Build Process
1 Preprocessing:
The preprocessing stage handles the directives in the source code that start with # (e.g., #include, #define). The preprocessor performs the following tasks:
- File Inclusion: Replaces #include directives with the contents of the included files.
- Macro Expansion: Replaces macro names with their definitions.
- Conditional Compilation: Includes or excludes parts of the code based on #ifdef, #ifndef, and similar directives.
Example source code (main.c):
#include <stdio.h>
#define MESSAGE "Hello, World!"
int main(void) {
    printf("%s\n", MESSAGE);
    return 0;
}
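Preprocessed output (main.i), abridged to the one declaration this program uses (the real output contains everything pulled in from stdio.h):
extern int printf(const char *, ...);
int main(void) {
    printf("%s\n", "Hello, World!");
    return 0;
}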
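The macro expansion and conditional compilation tasks described above can be tried out with a small sketch. KERNEL_DEBUG, BUILD_MODE, and SQUARE are hypothetical names chosen for illustration; they do not come from any real kernel tree.

```c
/* KERNEL_DEBUG is a hypothetical macro; pass -DKERNEL_DEBUG to the
 * compiler to select the other branch. */
#ifdef KERNEL_DEBUG
#define BUILD_MODE "debug"
#else
#define BUILD_MODE "release"
#endif

/* Function-like macro: expanded textually before compilation starts. */
#define SQUARE(x) ((x) * (x))

/* After preprocessing, the body is simply `return "release";`
 * (or "debug" when KERNEL_DEBUG is defined). */
const char *build_mode(void) {
    return BUILD_MODE;
}

/* SQUARE(a + b) expands to ((a + b) * (a + b)); the parentheses in the
 * macro definition keep the precedence intact. */
int square_of_sum(int a, int b) {
    return SQUARE(a + b);
}
```

Running only the preprocessor (for example with gcc -E) on this file shows the expanded text that the compilation stage actually receives.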
2 Compilation
The compilation stage converts preprocessed source code into assembly language. Each source file is compiled independently into an assembly file.
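Example assembly output (main.s, 32-bit x86, AT&T syntax, simplified):
    .file "main.c"
    .section .rodata
.LC0:
    .string "Hello, World!"
    .text
    .globl main
    .type main, @function
main:
    pushl %ebp
    movl %esp, %ebp
    subl $8, %esp
    movl $.LC0, (%esp)
    call puts           # GCC typically turns printf("%s\n", s) into puts(s)
    movl $0, %eax       # return 0
    leave
    ret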
3 Assembly
The assembly stage translates the assembly code into machine code, producing object files. An assembler takes the assembly code and generates binary instructions for the target architecture.
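Example object file (main.o): the contents are binary machine code and are not human-readable; tools such as objdump or readelf can display them.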
4 Linking
Purpose: To combine object files into a single executable, resolving symbol references and arranging sections in memory.
Tools: Linker (e.g., ld)
Steps:
- Combine multiple object files.
- Resolve external symbols (e.g., function calls between object files).
- Arrange code and data in memory according to the linker script.
- Generate the final executable or binary file.
5 Loading
Purpose: To load the executable into memory and start its execution.
Steps:
- Load the executable into the appropriate memory location.
- Set up the initial execution context (stack, registers).
- Jump to the entry point of the executable (e.g., start in the linker script).
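As a rough illustration of the last step, a loader has to find the entry point before it can jump to it. The sketch below assumes a little-endian ELF32 image already sitting in a buffer; elf32_entry is a hypothetical helper name, and a real loader would also walk the program headers to copy segments into place.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the entry address stored in a little-endian ELF32 header,
 * or 0 if the buffer does not look like such a file. */
uint32_t elf32_entry(const uint8_t *image, size_t len) {
    if (len < 28)
        return 0;                       /* too short to hold e_entry */
    if (image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L'  || image[3] != 'F')
        return 0;                       /* missing ELF magic */
    if (image[4] != 1 || image[5] != 1)
        return 0;                       /* not ELFCLASS32 / little-endian */

    /* e_entry is a 32-bit field at byte offset 24 of the ELF32 header. */
    return (uint32_t)image[24]
         | (uint32_t)image[25] << 8
         | (uint32_t)image[26] << 16
         | (uint32_t)image[27] << 24;
}
```

For an image linked with ENTRY(start), the returned value is the address the linker assigned to start.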
2 Overview of the Linking Process
The linking process can be divided into several key tasks:
- Symbol Resolution
- Relocation
- Section Merging
- Memory Layout and Address Assignment
- Generation of Executable or Library
1 Symbol Resolution
Purpose: To resolve all symbols (functions, variables) referenced in the code but defined in different modules or libraries.
Steps:
- The linker collects all object files and libraries specified in the link command.
- It builds a symbol table from these files, mapping symbol names to their addresses.
- If a symbol is referenced in one object file but defined in another, the linker ensures it knows where to find the definition.
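The lookup at the heart of these steps can be sketched as a toy symbol table. The symbol names and addresses below are hypothetical, and a real linker uses hash tables and handles weak, local, and common symbols, which this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct symbol {
    const char *name;
    uint32_t address;
};

/* Hypothetical definitions collected from the input object files. */
static const struct symbol symtab[] = {
    { "start", 0x00100000u },
    { "print", 0x00100020u },
};

/* Looks up `name`; returns 1 and stores its address on success,
 * or 0 if no object file defines the symbol. */
int resolve_symbol(const char *name, uint32_t *address) {
    for (size_t i = 0; i < sizeof symtab / sizeof symtab[0]; i++) {
        if (strcmp(symtab[i].name, name) == 0) {
            *address = symtab[i].address;
            return 1;
        }
    }
    return 0;   /* a real linker reports an undefined reference here */
}
```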
2 Relocation
Purpose: To adjust addresses within the code and data sections to reflect their actual locations in memory.
Steps:
- Each object file contains relocation entries that tell the linker where to adjust addresses.
- The linker calculates the final memory addresses for code and data sections.
- It updates the addresses in the code and data sections based on the calculated addresses.
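A minimal sketch of the patching step, assuming the simplest relocation kind: a 32-bit absolute address written little-endian at a given offset. The struct reloc layout is a simplification made up for this example, not the on-disk ELF relocation format.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified relocation entry: where to patch, and with which address. */
struct reloc {
    size_t offset;            /* byte offset inside the section */
    uint32_t symbol_address;  /* final address of the referenced symbol */
};

/* Writes each resolved address into the section bytes, little-endian. */
void apply_relocations(uint8_t *section, size_t size,
                       const struct reloc *relocs, size_t count) {
    for (size_t i = 0; i < count; i++) {
        size_t off = relocs[i].offset;
        if (size < 4 || off > size - 4)
            continue;         /* skip entries that would overrun the section */
        uint32_t v = relocs[i].symbol_address;
        section[off]     = (uint8_t)(v & 0xff);
        section[off + 1] = (uint8_t)((v >> 8) & 0xff);
        section[off + 2] = (uint8_t)((v >> 16) & 0xff);
        section[off + 3] = (uint8_t)((v >> 24) & 0xff);
    }
}
```

Real object files also carry PC-relative relocations (used by most call instructions on x86), where the patched value is the distance to the target rather than its absolute address.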
3 Section Merging
Purpose: To combine sections of the same type (e.g., .text, .data) from different object files into single sections.
Steps:
- The linker merges all .text sections from different object files into a single .text section.
- Similarly, it merges .data sections, .bss sections, and other relevant sections.
4 Memory Layout and Address Assignment
Purpose: To define the memory layout of the final executable, specifying where each section should be loaded into memory.
Steps:
- The linker script (if used) provides detailed instructions on how to arrange sections in memory.
- The linker uses this script to assign starting addresses to each section.
- It ensures that sections are aligned correctly and follow the specified memory layout.
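The address-assignment steps above can be sketched as a walk over the sections with a running location counter. The struct and function names are hypothetical, and the section sizes in the usage note are made-up values.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of align (a power of two). */
uint32_t align_up(uint32_t addr, uint32_t align) {
    return (addr + align - 1) & ~(align - 1);
}

struct section {
    const char *name;
    uint32_t size;      /* size after merging all input sections */
    uint32_t address;   /* filled in by assign_addresses() */
};

/* Places sections back to back starting at base, each on an `align`
 * boundary, and returns the aligned end address of the layout. */
uint32_t assign_addresses(struct section *s, size_t count,
                          uint32_t base, uint32_t align) {
    uint32_t dot = base;              /* plays the role of "." in a script */
    for (size_t i = 0; i < count; i++) {
        dot = align_up(dot, align);
        s[i].address = dot;
        dot += s[i].size;
    }
    return align_up(dot, align);
}
```

With a base of 0x00100000, 4096-byte alignment, and sizes of 0x123, 0x10, and 0x2000 bytes for .text, .data, and .bss, the three sections land at 0x00100000, 0x00101000, and 0x00102000.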
5 Generation of Executable or Library
Purpose: To create the final output file (executable, shared library, or static library) that can be loaded and executed by the operating system.
Steps:
- The linker writes the combined and relocated sections to the output file.
- It generates necessary headers and tables (e.g., symbol table, relocation table) required by the operating system.
3 Example: Linking Process with Linker Script
To illustrate these concepts, let's consider an example using a linker script to create an executable for a simple kernel.
Linker Script Example (link.ld)
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;
SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}
Breakdown of the Linker Script:
- OUTPUT_FORMAT(elf32-i386): Specifies the output format as ELF for a 32-bit Intel architecture.
- ENTRY(start): Defines the entry point of the executable as the symbol start.
  - When the OS or bootloader loads this executable, it will begin execution at the address associated with the start symbol.
  - The start symbol is typically defined in one of the source files, usually in assembly or C, and marks the beginning of the program's execution flow.
  - If start is the first item placed in .text, it will be located at the physical address 0x00100000.
- phys = 0x00100000;: Defines the base physical address at which the sections of the executable are loaded.
- SECTIONS { ... }: Defines the memory layout of the executable.
Sections Defined:
- .text: Contains the code and read-only data (*(.text) and *(.rodata)), starting at the physical address specified by phys, which is 0x00100000.
- .data: Contains initialized data (*(.data)), positioned right after the .text section. Its load address is phys + (data - code), which ensures that it follows the .text section.
- .bss: Contains uninitialized data, positioned after the .data section. Its load address is phys + (bss - code), which ensures that it follows the .data section.
- end = .;: Marks the end address of the entire memory layout.
- /DISCARD/: Excludes sections that are not needed in the final executable.
Linking Process Steps:
- Symbol Resolution: The linker collects all object files (e.g., kernel.o) and libraries specified in the link command. It builds a symbol table mapping symbol names to their addresses.
- Relocation: The linker reads relocation entries in the object files and adjusts addresses based on the final memory layout.
- Section Merging: It merges .text, .data, and .bss sections from different object files into single sections as specified in the linker script.
- Memory Layout and Address Assignment: Using the linker script, the linker assigns starting addresses to each section and ensures proper alignment.
- Generation of Executable: The linker writes the combined and relocated sections to the output file (e.g., kernel.elf) and generates the necessary headers and tables.
Practical Outcome
Every time the linker script is used:
1 Placement of start:
- The start symbol, which is typically the first instruction of your kernel or program, will be located at the physical address 0x00100000, provided its code is the first item placed in .text.
2 Loading:
- When a bootloader loads the ELF file generated by this linker script, it will load the .text section (and thus the start symbol) into memory starting at 0x00100000.
3 Execution:
- The bootloader will then jump to the entry address recorded in the ELF file, here 0x00100000, to start executing the code at start.
Example:
.section .text
.globl start
start:
    cli                 # Clear interrupts
    hlt                 # Halt the CPU
.section .rodata
message:
    .ascii "Hello, kernel!\n"
.section .data
my_data:
    .long 0xdeadbeef
.section .bss
    .lcomm my_bss, 4
When you link this code using the provided linker script:
- The .text section, containing the start label and its instructions, is placed at 0x00100000.
- The .data section is placed after .text.
- The .bss section is placed after .data.
Loading and Execution by the Bootloader:
- The bootloader reads the ELF file and loads the .text section at 0x00100000.
- It then jumps to 0x00100000, starting execution from the start label.
4 Assumptions in Code
Let's assume you have the following C code:
void print(void) {
    /* Function implementation */
}
void start(void) {
    /* Function implementation */
}
4.1 Compilation and Linking
1 Compilation:
- When the source files are compiled, the print and start functions will be placed in the .text section of the compiled object files.
2 Linking:
- During linking, the .text sections of all object files are merged together and placed in the .text section defined in the linker script.
4.2 Address Calculation
The linker script specifies that the .text section starts at 0x00100000:
.text phys : AT(phys) {
    code = .;
    *(.text)
    *(.rodata)
    . = ALIGN(4096);
}
- phys is defined as 0x00100000.
- The .text section starts at phys, which is 0x00100000.
Address of start
- The start function is marked as the entry point with ENTRY(start).
- Given that .text starts at 0x00100000, start will be the first symbol in the .text section only if its code is placed first during linking, for example because its object file comes first on the link command line.
- In that case, start will be at address 0x00100000.
Address of print
- The print function will follow in the .text section after start.
- The exact address of print depends on the size of the start function.
- If start is, for example, 0x20 bytes long, then print will be located at 0x00100020.
4.3 Example Memory Layout
Assuming the following:
- The start function is placed first and is 32 bytes long (0x20).
- The print function follows immediately after start.
The memory layout would be:
- start at 0x00100000
- print at 0x00100020 (assuming start is 32 bytes)
4.4 Graphical Representation
Here's a graphical representation of the memory layout:
0x00100000 --> start function
+-------------------+
| start() code |
| (32 bytes) |
+-------------------+
0x00100020 --> print function
+-------------------+
| print() code |
| (next function) |
+-------------------+
| ... |
+-------------------+
5 Linker Script Overview
OUTPUT_FORMAT(elf32-i386)
ENTRY(start)
phys = 0x00100000;
SECTIONS
{
    .text phys : AT(phys) {
        code = .;
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    }
    .data : AT(phys + (data - code)) {
        data = .;
        *(.data)
        . = ALIGN(4096);
    }
    .bss : AT(phys + (bss - code)) {
        bss = .;
        *(.bss)
        . = ALIGN(4096);
    }
    end = .;
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
        *(.note.gnu.build-id)
    }
}
5.1 Linker Script Breakdown
1. OUTPUT_FORMAT(elf32-i386)
- Purpose: Specifies the format of the output file.
- Usage: OUTPUT_FORMAT(elf32-i386)
- Explanation: This tells the linker to produce an ELF (Executable and Linkable Format) file for the 32-bit x86 architecture.
2. ENTRY(start)
- Purpose: Defines the entry point of the program.
- Usage: ENTRY(start)
- Explanation: This specifies that the symbol start is the entry point of the executable. When the program starts executing, it begins at the start symbol.
3. phys = 0x00100000;
- Purpose: Defines a variable representing a physical address.
- Usage: phys = 0x00100000;
- Explanation: Sets the phys variable to 0x00100000 (1 MB). This is a common address for loading the kernel in x86 architecture, as it is above the memory used by BIOS and other system functions.
4. SECTIONS { ... }
- Purpose: Defines the memory layout and sections of the output file.
- Usage: SECTIONS { ... }
- Explanation: The SECTIONS command defines how different sections of the input files should be mapped into the output file.
5. .text phys : AT(phys) { ... }
- Purpose: Specifies the .text section's location and content.
- Usage: .text phys : AT(phys) { ... }
- Explanation:
  - .text phys specifies that the .text section should start at the address phys, which is 0x00100000.
  - AT(phys) tells the linker that the load address for this section is 0x00100000.
  - Inside the braces {}, the code = .; line sets the variable code to the current address, marking the start of the .text section.
  - *(.text) and *(.rodata) use the * wildcard pattern to include all input sections named .text and .rodata from the input object files.
  - . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.
6. .data : AT(phys + (data - code)) { ... }
- Purpose: Specifies the .data section's location and content.
- Usage: .data : AT(phys + (data - code)) { ... }
- Explanation:
  - .data specifies the start of the .data section.
  - AT(phys + (data - code)) calculates the load address for the .data section as the physical address plus the offset between the data and code symbols.
  - Inside the braces {}, data = .; sets the variable data to the current address, marking the start of the .data section.
  - *(.data) uses the * wildcard pattern to include all input sections named .data from the input object files.
  - . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.
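The AT(...) expression is plain address arithmetic, which the following sketch reproduces in C. The function name and the higher-half addresses in the usage note are assumptions made for illustration; the script in this section links the kernel at its physical address, so the load address and the run-time address coincide.

```c
#include <stdint.h>

/* Load address of a section whose run-time address is `vma`, given that
 * the image's first byte (run-time address `code_vma`) is loaded at `phys`.
 * Mirrors the linker-script expression phys + (data - code). */
uint32_t load_address(uint32_t phys, uint32_t vma, uint32_t code_vma) {
    return phys + (vma - code_vma);
}
```

With phys = 0x00100000, code = 0x00100000, and data = 0x00101000, the result is 0x00101000, identical to the run-time address. The formula starts to matter when a kernel is linked to run at a different address than it is loaded at: for a hypothetical code = 0xC0100000 and data = 0xC0101000, the load address is still 0x00101000.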
7. .bss : AT(phys + (bss - code)) { ... }
- Purpose: Specifies the .bss section's location and content.
- Usage: .bss : AT(phys + (bss - code)) { ... }
- Explanation:
  - .bss specifies the start of the .bss section.
  - AT(phys + (bss - code)) calculates the load address for the .bss section as the physical address plus the offset between the bss and code symbols.
  - Inside the braces {}, bss = .; sets the variable bss to the current address, marking the start of the .bss section.
  - *(.bss) uses the * wildcard pattern to include all input sections named .bss from the input object files.
  - . = ALIGN(4096); ensures the next section starts at a 4096-byte aligned address.
8. end = .;
- Purpose: Defines a symbol marking the end of the last section.
- Usage: end = .;
- Explanation: Sets the end symbol to the current address, which marks the end of all sections defined so far. This is useful for calculating the size of the program or for placing data after all sections.
9. /DISCARD/ : { ... }
- Purpose: Discards unwanted sections.
- Usage: /DISCARD/ : { ... }
- Explanation: Sections listed within the /DISCARD/ block are not included in the final output file. This is typically used to remove debugging or unnecessary sections.
  - The * wildcard pattern matches all input sections named .comment, .eh_frame, and .note.gnu.build-id from the input object files, and the linker discards them.
6 Linker Script
A linker script is a text file used by the linker that explains how different sections of the object files should be merged to create an output file.
It controls the layout and organization of the output file by specifying how the linker should place the sections of the input files in the output file, how to handle memory regions, how to define symbols, and more.
- GNU linker scripts conventionally use the file extension .ld.
- It specifies how different sections of code and data should be placed in memory.
- To use a custom linker script, pass it to the linker at the linking phase with the -T option.
Key Fields and Directives in Linker Scripts
- OUTPUT_FORMAT
- ENTRY
- SECTIONS
- MEMORY
- PHDRS
- SYMBOLS
- ASSERT
- INCLUDE
- STARTUP
- OUTPUT
- REGION_ALIAS
- SEARCH_DIR
- EXTERN
- FORCE_COMMON_ALLOCATION
- FORCE_SHARED_ALLOCATION
1 OUTPUT_FORMAT
Specifies the format of the output file.
Usage:
OUTPUT_FORMAT(format)
Example:
OUTPUT_FORMAT(elf32-i386)
This specifies that the output file should be in the ELF format for a 32-bit x86 architecture.
2 ENTRY
Defines the entry point of the program where execution starts.
Usage:
ENTRY(symbol)
Example:
ENTRY(start)
This specifies that the start symbol is the entry point of the executable.
3 SECTIONS
Describes the layout of the output file in memory, specifying where each section should be placed.
Usage:
SECTIONS
{
...
}
Example:
SECTIONS
{
    .text : {
        *(.text)
        *(.rodata)
        . = ALIGN(4096);
    } > RAM
    .data : {
        *(.data)
        . = ALIGN(4096);
    } > RAM
    .bss : {
        *(.bss)
        . = ALIGN(4096);
    } > RAM
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
    }
}
Key Terms:
- Section Name: The name of the section, such as .text, .data, or .bss.
- Address Assignment: Specifies the start address of a section.
  - Example: .text 0x00100000 : { ... }
- Wildcard Patterns: Used to match section names from input files.
  - Example: *(.text), *(.data)
- Alignment: Ensures sections start at aligned addresses.
  - Example: . = ALIGN(4096);
4 MEMORY
Defines memory regions for placing sections.
Usage:
MEMORY
{
name (attr) : ORIGIN = origin, LENGTH = length
...
}
Example:
MEMORY
{
    RAM (wx) : ORIGIN = 0x00100000, LENGTH = 0x400000
    ROM (rx) : ORIGIN = 0x00000000, LENGTH = 0x10000
}
Key Terms:
- Name: The name of the memory region.
- Attributes: Specify the permissions, such as r (read), w (write), x (execute).
- ORIGIN: The start address of the memory region.
- LENGTH: The size of the memory region.
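When a section is assigned to a region, the linker checks that it fits and reports an error if the region overflows. The check amounts to a bounds test, sketched below with hypothetical struct and function names; the numbers in the usage note come from the RAM region in the example above.

```c
#include <stdint.h>

struct region {
    const char *name;
    uint32_t origin;   /* ORIGIN: first address of the region */
    uint32_t length;   /* LENGTH: size of the region in bytes */
};

/* Returns 1 if the range [addr, addr + size) lies entirely inside the
 * region, 0 otherwise. Written to avoid 32-bit overflow. */
int fits_in_region(const struct region *r, uint32_t addr, uint32_t size) {
    if (addr < r->origin)
        return 0;
    uint32_t offset = addr - r->origin;
    return offset <= r->length && size <= r->length - offset;
}
```

For RAM with ORIGIN = 0x00100000 and LENGTH = 0x400000, a 0x1000-byte section at 0x00100000 fits, while a 0x2000-byte section at 0x004FF000 does not, because it would end at 0x00501000, past the region's end at 0x00500000.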
5 PHDRS
Describes the program headers for the output file.
Usage:
PHDRS
{
name type [attributes]
...
}
Example:
PHDRS
{
    text PT_LOAD FILEHDR PHDRS;
    data PT_LOAD;
}
Key Terms:
- Name: The name of the program header.
- Type: The type of the segment, such as PT_LOAD.
- Attributes: Additional attributes, like FILEHDR, PHDRS.
6 SYMBOLS
Defines symbols and assigns values to them. SYMBOLS is not itself a keyword; symbols are defined with plain assignment statements.
Usage:
symbol = expression;
Example:
_start = 0x100000;
This defines the _start symbol with the value 0x100000.
Advanced Usage:
- PROVIDE: Defines a symbol only if it is referenced but not defined by any object file.
  - Example: PROVIDE(_stack = 0x200000);
- ASSERT: Ensures certain conditions are met.
  - Example: ASSERT(_stack > 0x200000, "Stack too low!");
7 ASSERT
Ensures certain conditions are met during linking.
Usage:
ASSERT(condition, message)
Example:
ASSERT(_stack > 0x200000, "Stack too low!");
This asserts that the _stack symbol is greater than 0x200000; otherwise the linker stops with the error message Stack too low!.
8 INCLUDE
Includes another linker script within the current script.
Usage:
INCLUDE "filename"
Example:
INCLUDE "common.ld"
This includes the contents of common.ld into the current linker script.
9 STARTUP
Specifies the startup file to be linked first.
Usage:
STARTUP(filename)
Example:
STARTUP(startup.o)
This ensures that startup.o is linked first.
10 OUTPUT
Specifies the name of the output file.
Usage:
OUTPUT(filename)
Example:
OUTPUT("kernel.bin")
This sets the output file name to kernel.bin.
11 REGION_ALIAS
Defines an alias for a memory region.
Usage:
REGION_ALIAS(alias, region)
Example:
REGION_ALIAS(RAM_ALIAS, RAM)
This defines RAM_ALIAS as an alias for the RAM memory region.
12 SEARCH_DIR
Adds a directory to the search path for libraries and object files.
Usage:
SEARCH_DIR("directory")
Example:
SEARCH_DIR("/usr/local/lib")
This adds /usr/local/lib to the search path.
13 EXTERN
Forces undefined symbols to be added to the symbol table.
Usage:
EXTERN(symbol)
Example:
EXTERN(_start)
This ensures that _start is included in the symbol table even if it is undefined.
14 FORCE_COMMON_ALLOCATION
Forces the linker to assign space to common symbols even when a relocatable output file is requested (-r). It has the same effect as the -d command-line option.
Usage:
FORCE_COMMON_ALLOCATION
Example:
FORCE_COMMON_ALLOCATION
This ensures that common symbols are allocated even in a relocatable link.
15 FORCE_SHARED_ALLOCATION
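Forces allocation of shared symbols. This command is not listed in the GNU ld manual, so check your linker's documentation before relying on it.
Usage:
FORCE_SHARED_ALLOCATION
Example:
FORCE_SHARED_ALLOCATION
This ensures that shared symbols are allocated.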
7 Various Symbols and Commands
1 . (Dot):
Represents the current location counter within the memory layout. It's used to read or set the current output address.
2 Wildcards:
Wildcards like * and ? are used to match multiple files or sections with similar names. For example, *(.text) matches all sections named .text in all input files.
3 Comments:
Comments in a linker script are written in C style, between /* and */, and are used to provide explanations and annotations within the script.