How Characters are Stored in Memory

We know that decimal numbers are stored in computer memory in binary format; for instance, 7 is stored as 0111. But have you ever wondered how characters, such as English letters, are stored in memory?

We will uncover all of this in this article.

How are characters stored in computer memory🤔?

Characters are stored in computer memory in binary format as well, but their storage process differs from that of numbers. Each character (a letter, digit, or symbol) is assigned a unique numeric code through character encoding standards such as ASCII (American Standard Code for Information Interchange) or Unicode.

In C/C++, when we declare variables like int integer = 65; and char alphabet = 'A';, the compiler determines the type of data (integer or character) based on the variable’s declaration, and this influences how the data is stored in memory.

int integer = 65;
char alphabet = 'A';

Here's what happens in each case:

  1. Integer Declaration (int integer = 65;):
    • The compiler recognizes integer as an int type, which is stored as a binary number in memory.
    • The decimal value 65 is converted directly into binary (for instance, 00000000 00000000 00000000 01000001 in a 32-bit system) and stored in memory.
  2. Character Declaration (char alphabet = 'A';):
    • The compiler identifies alphabet as a char type, so it interprets 'A' as a character rather than a numerical value.
    • To store the character 'A', the compiler refers to the ASCII encoding, where 'A' corresponds to the decimal value 65.
    • This ASCII code (65) is then converted to binary (e.g., 01000001 for an 8-bit char) and stored in memory.
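
To make the difference concrete, here is a minimal C sketch (assuming an ASCII-based system) that prints both variables; printing the char with %d reveals the numeric code that is actually kept in memory:

#include <stdio.h>

int main() {
    int integer = 65;     // stored directly as the binary number 65
    char alphabet = 'A';  // the compiler stores the ASCII code 65 in one byte

    // %zu prints the size reported by sizeof
    printf("integer  = %d (%zu bytes)\n", integer, sizeof(integer));
    printf("alphabet = '%c', stored as %d (%zu byte)\n", alphabet, alphabet, sizeof(alphabet));
    return 0;
}

On a typical system both variables print 65, showing that the letter 'A' and the number 65 are the same bit pattern in memory; only the declared type tells the compiler how to interpret it.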

Different Character Encoding:

Character encoding schemes define how characters are represented as binary data in computers.

1️⃣ ASCII:

ASCII is a character encoding standard that was developed to represent text in computers, communications equipment, and other devices that use text. ASCII assigns a unique numeric code to each character, symbol, or control command, allowing computers to store and process text data.

History of ASCII

  • Developed in 1963: The ASCII standard was created by a committee led by Robert W. Bemer and adopted as a U.S. standard in 1963.
  • Goal: The main aim was to create a common code for text that different types of computers could recognize, facilitating communication across systems.
  • Updates: ASCII has gone through several minor revisions but has remained largely unchanged since the 1960s.

Structure of ASCII

Standard ASCII (7-bit ASCII)
  • Range: ASCII originally used 7 bits to represent each character, giving it a range of 128 possible codes (0–127).
  • Characters: These include:
    • Control Characters (0–31): Non-printable commands used for text control, like newline, carriage return, and tab.
    • Printable Characters (32–126): Visible symbols, including:
      • Numbers (48–57): '0'–'9'
      • Uppercase Letters (65–90): 'A'–'Z'
      • Lowercase Letters (97–122): 'a'–'z'
      • Punctuation and Symbols: Various symbols like !, @, #, $, etc.
Extended ASCII (8-bit ASCII)
  • Range: Extends ASCII to use all 8 bits, adding another 128 characters (128–255).
  • Usage: Adds support for additional symbols, graphical characters, and accents for various Western European languages.
  • Limitations: Extended ASCII is not standardized, meaning there are different "extended" ASCII versions (e.g., ISO-8859-1 for Latin characters).

ASCII Table Breakdown

Here’s a summary of the key ASCII character ranges:

Decimal   Binary      Character   Description
0         00000000    NUL         Null
9         00001001    TAB         Horizontal Tab
10        00001010    LF          Line Feed (New Line)
13        00001101    CR          Carriage Return
32        00100000    (space)     Space
48-57     00110000-   '0'-'9'     Numbers
65-90     01000001-   'A'-'Z'     Uppercase Letters
97-122    01100001-   'a'-'z'     Lowercase Letters
127       01111111    DEL         Delete

ASCII Control Characters

  • Control Characters (0–31): ASCII includes 32 control characters used for device commands and text control. They’re often non-printable:
    • NUL (0): Null character, used to indicate end-of-string in C/C++ programming.
    • LF (10) and CR (13): Line feed and carriage return, used in text formatting.
    • BEL (7): Causes a "bell" sound or visual signal.
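
In C, the common escape sequences map directly onto these control codes. A quick sketch to verify this on your own machine:

#include <stdio.h>

int main() {
    // Escape sequences are just characters, so %d prints their ASCII codes
    printf("NUL '\\0' = %d\n", '\0');  // 0
    printf("BEL '\\a' = %d\n", '\a');  // 7
    printf("LF  '\\n' = %d\n", '\n');  // 10
    printf("CR  '\\r' = %d\n", '\r');  // 13
    return 0;
}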

Printable ASCII Characters

  • Numbers (48–57): Represent digits '0' through '9'.
  • Uppercase Letters (65–90): Represent 'A' through 'Z'.
  • Lowercase Letters (97–122): Represent 'a' through 'z'.
  • Symbols and Punctuation: Includes characters like @, #, $, %, &, etc.

How ASCII Characters Are Stored in Memory

  • ASCII characters are stored as 7-bit or 8-bit binary numbers.
  • In a 7-bit system, the values are stored directly.
  • In 8-bit systems, the highest bit is often set to 0 for standard ASCII characters, or used to represent extended ASCII.
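
If you want to see those bits yourself, a small helper (purely an illustrative sketch) can print the 8-bit pattern of any character:

#include <stdio.h>

// Print the 8 bits of a character, most significant bit first
void print_bits(unsigned char c) {
    for (int i = 7; i >= 0; i--) {
        putchar(((c >> i) & 1) ? '1' : '0');
    }
    putchar('\n');
}

int main() {
    print_bits('A');  // prints 01000001 on an ASCII system; the high bit is 0
    return 0;
}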

ASCII in Programming

  • Strings in ASCII: In many programming languages, strings are arrays of ASCII characters, with each character represented by its ASCII value.
  • Null Terminator: In languages like C/C++, strings are typically null-terminated, meaning they end with a NUL (0) byte.
  • Character Encoding: ASCII values are used directly for character representation. For instance:

    char letter = 'A';   // Stored as 65 in ASCII (binary: 01000001)
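
Both points above (strings as arrays of characters and the null terminator) can be seen in a short sketch that walks a string and prints the ASCII code of each character until the terminating 0 byte:

#include <stdio.h>

int main() {
    char word[] = "Hi!";  // stored in memory as 72, 105, 33, 0 (the NUL terminator)

    // Walk the array until the NUL byte that marks the end of the string
    for (int i = 0; word[i] != '\0'; i++) {
        printf("word[%d] = '%c' (ASCII %d)\n", i, word[i], word[i]);
    }
    return 0;
}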
    

ASCII Applications and Usage

  • Text Files: Plain text files (.txt) use ASCII encoding to store characters.
  • Programming Languages: ASCII codes are widely used in programming for text manipulation.
  • Communication Protocols: ASCII is still used in protocols, such as HTTP and SMTP, which transmit plain text over networks.
  • Legacy Systems: Many older or embedded systems still rely on ASCII due to its simplicity and compatibility.

ASCII Limitations

  • Limited Character Set: ASCII lacks support for many characters, especially those outside the English language, as it only encodes 128 or 256 characters.
  • Lack of Internationalization: ASCII does not support characters in non-Latin alphabets, such as Cyrillic, Chinese, Arabic, or other languages, which led to the development of Unicode and other encoding standards.

2️⃣ Unicode:

Unicode is a universal character encoding standard that assigns unique codes to characters, symbols, and scripts from all languages and writing systems, as well as technical symbols, emojis, and more. It was developed to overcome the limitations of ASCII and other early encoding systems, allowing consistent representation of text across different platforms and languages.

Key Concepts of Unicode

  1. Code Points:
    • Unicode assigns a unique code point to each character. A code point is a number, often represented in hexadecimal, that serves as the character's unique identifier.
    • Code points are written in the form U+XXXX, where XXXX is a hexadecimal number. For example, U+0041 is the code point for the letter ‘A’.
  2. Unicode Range:
    • Unicode supports a vast range of code points, from U+0000 to U+10FFFF, which allows for over a million unique codes.
    • These code points cover most written languages, mathematical symbols, currency symbols, and even emojis.
  3. Unicode Planes:
    • Unicode code points are divided into 17 planes. The most commonly used characters are in the Basic Multilingual Plane (BMP), which covers code points from U+0000 to U+FFFF.
    • Other planes include additional symbols, ancient scripts, and more specialized characters.
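
Since each plane spans 0x10000 code points, the plane number of any code point is simply the code point divided by 0x10000 (a right shift by 16). A small illustrative sketch:

#include <stdio.h>

int main() {
    // Sample code points: 'A', the Euro sign, and the face-with-tears-of-joy emoji
    unsigned int code_points[] = { 0x0041, 0x20AC, 0x1F602 };

    for (int i = 0; i < 3; i++) {
        // Plane number = code point / 0x10000; plane 0 is the BMP
        printf("U+%04X is in plane %u\n", code_points[i], code_points[i] >> 16);
    }
    return 0;
}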

Unicode Encoding Forms

Unicode can be encoded in various ways, depending on how many bytes are needed to store each character. The three primary encoding forms are UTF-8, UTF-16, and UTF-32:

  1. UTF-8:
    • Bits: Variable-length (1 to 4 bytes per character)
    • Compatibility: Backward-compatible with ASCII; ASCII characters use 1 byte.
    • Usage: The most widely used encoding on the web and for file storage, as it is space-efficient for common characters.
    • Examples:
      • ‘A’ (U+0041): 01000001 (1 byte)
      • € (Euro sign, U+20AC): 11100010 10000010 10101100 (3 bytes)
      • 😂 (Face with tears of joy, U+1F602): 11110000 10011111 10011000 10000010 (4 bytes)
  2. UTF-16:
    • Bits: Variable-length (2 or 4 bytes per character)
    • Usage: Common in certain operating systems, including Windows, and ideal for characters in the BMP.
    • Encoding: Uses 2 bytes for BMP characters; supplementary characters (outside the BMP) use 4 bytes, using surrogate pairs.
    • Examples:
      • ‘A’: 00000000 01000001 (2 bytes)
      • €: 00100000 10101100 (2 bytes)
      • 😂: Represented with a surrogate pair, 11011000 00111101 11011110 00000010 (4 bytes)
  3. UTF-32:
    • Bits: Fixed-length (4 bytes per character)
    • Usage: Sometimes used for internal processing in applications due to its simplicity in indexing characters.
    • Encoding: Every character, regardless of its complexity, is represented by a fixed 4 bytes.
    • Examples:
      • ‘A’: 00000000 00000000 00000000 01000001
      • €: 00000000 00000000 00100000 10101100
      • 😂: 00000000 00000001 11110110 00000010
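
To tie the descriptions above together, here is a minimal sketch of the UTF-8 rules: a function that encodes a single code point into 1 to 4 bytes. It is only illustrative and does not reject surrogate code points or other invalid input.

#include <stdio.h>

// Encode a single Unicode code point (up to U+10FFFF) as UTF-8.
// Returns the number of bytes written to out (1 to 4).
int utf8_encode(unsigned int cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                        // 1 byte: 0xxxxxxx (plain ASCII)
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main() {
    unsigned int samples[] = { 0x0041, 0x20AC, 0x1F602 };  // 'A', Euro sign, 😂
    unsigned char bytes[4];

    for (int i = 0; i < 3; i++) {
        int n = utf8_encode(samples[i], bytes);
        printf("U+%04X ->", samples[i]);
        for (int j = 0; j < n; j++) {
            printf(" %02X", bytes[j]);
        }
        printf("\n");
    }
    return 0;
}

Running this prints 41, E2 82 AC, and F0 9F 98 82, the same byte sequences listed in the UTF-8 examples above.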

Conversion of Character to Integer

To convert a character representing a digit (like '7') into its corresponding integer value, we need to understand how characters are stored. In ASCII, characters like '0', '1', ..., '9' are stored with specific decimal values:

  • The ASCII value of '0' is 48 in decimal.
  • The ASCII value of '1' is 49 in decimal.
  • Similarly, the ASCII value of '7' is 55 in decimal.

So, if we have a character like '7', it is stored as 55 in decimal (or 0011 0111 in binary).

To convert this character to the integer value 7, we can use the following approach:

  1. Subtract the ASCII value of '0' from the ASCII value of the character.
  2. This works because ASCII values for digits are sequentially ordered from '0' to '9'.

Conversion Example:

To convert the character '7' to its integer value:

  1. Get the ASCII value of '7', which is 55.
  2. Subtract the ASCII value of '0' (which is 48) from 55:

    55 - 48 = 7

  3. The result, 7, is the integer representation of the character '7'.

Code Example in C/C++:

char digitChar = '7';            // stored as ASCII code 55
int digitInt = digitChar - '0';  // 55 - 48 = 7

Here, digitInt will hold the integer value 7.

Why This Works:

The ASCII values for the characters '0' to '9' are in a contiguous range (48 to 57). Subtracting '0' from any digit character aligns it with its corresponding integer value:

  • '1' - '0' results in 1
  • '2' - '0' results in 2
  • and so on up to '9' - '0', which results in 9

Converting a String into an Integer:

Now, if we have a string like "123", how do we convert it to an integer?

C's standard library provides a function for this: atoi (ASCII to integer). It converts a string to an integer and is declared in the stdlib.h header. It is commonly used to convert strings representing numbers, like "123", into actual integers (123).

How it is actually done:
  • Initialize the Result:
    • Start with an integer variable, say result, initialized to 0. This variable will store the final integer value.
  • Iterate Through Each Character:
    • Walk through the string from left to right until the null terminator is reached.
  • Convert and Accumulate:
    • For each character c in the string:
      • Convert c to its integer value by subtracting '0'.
      • Multiply result by 10 and add the integer value of c.
      • This shifts result left by one decimal place, making room for the new digit.
Example: Converting "123" to Integer:

Let's go through an example with the string "123":

  1. Initialize result = 0.
  2. Process each character:
    • For the first character, '1':
      • Convert '1' to integer: '1' - '0' = 1.
      • Update result: result = result * 10 + 1 = 0 * 10 + 1 = 1.
    • For the second character, '2':
      • Convert '2' to integer: '2' - '0' = 2.
      • Update result: result = result * 10 + 2 = 1 * 10 + 2 = 12.
    • For the third character, '3':
      • Convert '3' to integer: '3' - '0' = 3.
      • Update result: result = result * 10 + 3 = 12 * 10 + 3 = 123.

At the end of this process, result holds the integer value 123.

#include <stdio.h>

int stringToInt(const char* str) {
    int result = 0;
    
    // Loop through each character in the string until the null terminator
    for (int i = 0; str[i] != '\0'; i++) {
        // Convert the character to integer and accumulate in result
        result = result * 10 + (str[i] - '0');
    }
    
    return result;
}

int main() {
    const char* str = "123";
    int number = stringToInt(str);
    printf("The integer value is: %d\n", number);
    return 0;
}
Explanation of Code:
  • Loop through characters: The for loop iterates through each character in the string str until it encounters the null terminator '\0', which marks the end of the string.
  • Convert and Accumulate: For each character, we subtract '0' to get its integer equivalent, then update result by shifting it left by one decimal place (multiplying by 10) and adding the new digit.
Important Considerations:
  • Negative Numbers: For negative numbers, you may need to check for a leading '-' character and adjust the logic accordingly.
  • Non-digit Characters: This method assumes that the input string contains only valid digit characters. To handle non-digit characters, you would add a check to ensure each character is between '0' and '9'.
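
As a sketch of how those two considerations could be handled (one possible approach, not the only one), the function below accepts an optional leading '-' and reports failure through an ok flag when it meets a non-digit character:

#include <stdio.h>

// Variant of stringToInt that handles an optional leading '-' and sets
// *ok to 0 (returning 0) if it meets a character outside '0'..'9'.
int stringToIntChecked(const char* str, int* ok) {
    int result = 0;
    int sign = 1;
    int i = 0;

    if (str[0] == '-') {  // optional leading minus sign
        sign = -1;
        i = 1;
    }

    for (; str[i] != '\0'; i++) {
        if (str[i] < '0' || str[i] > '9') {  // reject non-digit characters
            *ok = 0;
            return 0;
        }
        result = result * 10 + (str[i] - '0');
    }

    *ok = 1;
    return sign * result;
}

int main() {
    int ok;
    printf("%d\n", stringToIntChecked("-123", &ok));  // prints -123, ok = 1
    stringToIntChecked("12a", &ok);                   // ok becomes 0
    printf("valid? %d\n", ok);
    return 0;
}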

Using Standard Library Functions

In many programming languages, there are standard functions for converting strings to integers:

  • C: atoi() function
  • C++: std::stoi() in <string> header
  • Python: int() function
  • Java: Integer.parseInt()

These functions handle additional cases like negative numbers, but understanding the manual conversion process is helpful for learning how numbers are represented and manipulated in programming.
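
For example, in C the manual conversion above can be replaced by a single call to atoi. Note that atoi provides no error reporting (it simply returns 0 when no conversion can be performed), so strtol is often preferred when the input may be malformed:

#include <stdio.h>
#include <stdlib.h>  // declares atoi()

int main() {
    int n = atoi("123");  // 123
    int m = atoi("-42");  // atoi also handles a leading sign
    printf("%d %d\n", n, m);
    return 0;
}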

The Admin