We know that decimal numbers are stored in computer memory in binary format. For instance, 7 is stored as 0111 in memory. But have you ever wondered how characters like English letters are stored in memory? We will uncover all of this in this article.
How are characters stored in computer memory🤔?
Well, characters are stored in computer memory in binary format as well, but their storage process is different from that of numbers. Each character, like a letter, number, or symbol, is assigned a unique numeric code through character encoding standards such as ASCII (American Standard Code for Information Interchange) or Unicode.
In C/C++, when we declare variables like int integer = 65; and char alphabet = 'A';, the compiler determines the type of data (integer or character) based on the variable's declaration, and this influences how the data is stored in memory.
int integer = 65;
char alphabet = 'A';
Here's what happens in each case:
- Integer Declaration (int integer = 65;):
  - The compiler recognizes integer as an int type, which is stored as a binary number in memory.
  - The decimal value 65 is converted directly into binary (for instance, 00000000 00000000 00000000 01000001 in a 32-bit system) and stored in memory.
- Character Declaration (char alphabet = 'A';):
  - The compiler identifies alphabet as a char type, so it interprets 'A' as a character rather than a numerical value.
  - To store the character 'A', the compiler refers to the ASCII encoding, where 'A' corresponds to the decimal value 65.
  - This ASCII code (65) is then converted to binary (e.g., 01000001 for an 8-bit char) and stored in memory.
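To make this concrete, here is a minimal C sketch (not from the original text; the variable names simply mirror the declarations above) that stores the value 65 as an int and as a char and prints both interpretations:

#include <stdio.h>

int main(void) {
    int integer = 65;      /* stored as a 32-bit binary number on most systems */
    char alphabet = 'A';   /* stored as the 8-bit ASCII code 65 */

    /* The same underlying value can be viewed as a number or as a character */
    printf("integer  as number: %d, as character: %c\n", integer, integer);
    printf("alphabet as number: %d, as character: %c\n", alphabet, alphabet);
    return 0;
}

Both lines print 65 and 'A', because the only difference between the two variables is how the program interprets the stored bits.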
Different Character Encodings:
Character encoding schemes define how characters are represented as binary data in computers.
1️⃣ ASCII:
ASCII is a character encoding standard that was developed to represent text in computers, communications equipment, and other devices that use text. ASCII assigns a unique numeric code to each character, symbol, or control command, allowing computers to store and process text data.
History of ASCII
- Developed in 1963: The ASCII standard was created by a committee led by Robert W. Bemer and adopted as a U.S. standard in 1963.
- Goal: The main aim was to create a common code for text that different types of computers could recognize, facilitating communication across systems.
- Updates: ASCII has gone through several minor revisions but has remained largely unchanged since the 1960s.
Structure of ASCII
Standard ASCII (7-bit ASCII)
- Range: ASCII originally used 7 bits to represent each character, giving it a range of 128 possible codes (0–127).
- Characters: These include:
- Control Characters (0–31): Non-printable commands used for text control, like newline, carriage return, and tab.
- Printable Characters (32–126): Visible symbols, including:
- Numbers (48–57): '0'–'9'
- Uppercase Letters (65–90): 'A'–'Z'
- Lowercase Letters (97–122): 'a'–'z'
- Punctuation and Symbols: Various symbols like !, @, #, $, etc.
Extended ASCII (8-bit ASCII)
- Range: Extends ASCII to use all 8 bits, adding another 128 characters (128–255).
- Usage: Adds support for additional symbols, graphical characters, and accents for various Western European languages.
- Limitations: Extended ASCII is not standardized, meaning there are different "extended" ASCII versions (e.g., ISO-8859-1 for Latin characters).
ASCII Table Breakdown
Here’s a summary of the key ASCII character ranges:
Decimal | Binary | Character | Description |
---|---|---|---|
0 | 00000000 | NUL | Null |
9 | 00001001 | TAB | Horizontal Tab |
10 | 00001010 | LF | Line Feed (New Line) |
13 | 00001101 | CR | Carriage Return |
32 | 00100000 | (space) | Space |
48-57 | 00110000-00111001 | 0-9 | Numbers |
65-90 | 01000001-01011010 | A-Z | Uppercase Letters |
97-122 | 01100001-01111010 | a-z | Lowercase Letters |
127 | 01111111 | DEL | Delete |
ASCII Control Characters
- Control Characters (0–31): ASCII includes 32 control characters used for device commands and text control. They’re often non-printable:
- NUL (0): Null character, used to indicate end-of-string in C/C++ programming.
- LF (10) and CR (13): Line feed and carriage return, used in text formatting.
- BEL (7): Causes a "bell" sound or visual signal.
Printable ASCII Characters
- Numbers (48–57): Represent digits '0' through '9'.
- Uppercase Letters (65–90): Represent 'A' through 'Z'.
- Lowercase Letters (97–122): Represent 'a' through 'z'.
- Symbols and Punctuation: Includes characters like @, #, $, %, &, etc.
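As a quick way to see these ranges for yourself, here is a small C sketch (not from the original text) that prints a few printable characters next to their ASCII codes:

#include <stdio.h>

int main(void) {
    /* Digits '0'..'9' occupy codes 48..57 */
    for (int code = 48; code <= 57; code++) {
        printf("%3d -> %c\n", code, code);
    }
    /* A few uppercase letters, starting at code 65 ('A') */
    for (int code = 65; code <= 69; code++) {
        printf("%3d -> %c\n", code, code);
    }
    return 0;
}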
How ASCII Characters Are Stored in Memory
- ASCII characters are stored as 7-bit or 8-bit binary numbers.
- In a 7-bit system, the values are stored directly.
- In 8-bit systems, the highest bit is often set to 0 for standard ASCII characters, or used to represent extended ASCII.
ASCII in Programming
- Strings in ASCII: In many programming languages, strings are arrays of ASCII characters, with each character represented by its ASCII value.
- Null Terminator: In languages like C/C++, strings are typically null-terminated, meaning they end with a NUL (0) byte.
- Character Encoding: ASCII values are used directly for character representation. For instance:
char letter = 'A'; // Stored as 65 in ASCII (binary: 01000001)
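To see both points at once, here is a minimal sketch that prints the ASCII code of every byte in a short string, including the terminating NUL; the string "Hi" is just an example value:

#include <stdio.h>

int main(void) {
    char word[] = "Hi";   /* stored in memory as {72, 105, 0} -> 'H', 'i', NUL */

    /* sizeof includes the null terminator, so all three bytes are printed */
    for (size_t i = 0; i < sizeof(word); i++) {
        printf("word[%zu] = %d\n", i, word[i]);
    }
    return 0;
}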
ASCII Applications and Usage
- Text Files: Plain text files (.txt) use ASCII encoding to store characters.
- Programming Languages: ASCII codes are widely used in programming for text manipulation.
- Communication Protocols: ASCII is still used in protocols, such as HTTP and SMTP, which transmit plain text over networks.
- Legacy Systems: Many older or embedded systems still rely on ASCII due to its simplicity and compatibility.
ASCII Limitations
- Limited Character Set: ASCII lacks support for many characters, especially those outside the English language, as it only encodes 128 or 256 characters.
- Lack of Internationalization: ASCII does not support characters in non-Latin alphabets, such as Cyrillic, Chinese, Arabic, or other languages, which led to the development of Unicode and other encoding standards.
2️⃣ Unicode:
Unicode is a universal character encoding standard that assigns unique codes to characters, symbols, and scripts from all languages and writing systems, as well as technical symbols, emojis, and more. It was developed to overcome the limitations of ASCII and other early encoding systems, allowing consistent representation of text across different platforms and languages.
Key Concepts of Unicode
- Code Points:
  - Unicode assigns a unique code point to each character. A code point is a number, often represented in hexadecimal, that serves as the character's unique identifier.
  - Code points are written in the form U+XXXX, where XXXX is a hexadecimal number. For example, U+0041 is the code point for the letter 'A'.
- Unicode Range:
  - Unicode supports a vast range of code points, from U+0000 to U+10FFFF, which allows for over a million unique codes.
  - These code points cover most written languages, mathematical symbols, currency symbols, and even emojis.
- Unicode Planes:
  - Unicode code points are divided into 17 planes. The most commonly used characters are in the Basic Multilingual Plane (BMP), which covers code points from U+0000 to U+FFFF.
  - Other planes include additional symbols, ancient scripts, and more specialized characters.
Unicode Encoding Forms
Unicode can be encoded in various ways, depending on how many bytes are needed to store each character. The three primary encoding forms are UTF-8, UTF-16, and UTF-32:
- UTF-8:
  - Bits: Variable-length (1 to 4 bytes per character)
  - Compatibility: Backward-compatible with ASCII; ASCII characters use 1 byte.
  - Usage: The most widely used encoding on the web and for file storage, as it is space-efficient for common characters.
  - Examples:
    - 'A' (U+0041): 01000001 (1 byte)
    - € (Euro sign, U+20AC): 11100010 10000010 10101100 (3 bytes)
    - 😂 (Face with tears of joy, U+1F602): 11110000 10011111 10011000 10000010 (4 bytes)
- UTF-16:
  - Bits: Variable-length (2 or 4 bytes per character)
  - Usage: Common in certain operating systems, including Windows, and ideal for characters in the BMP.
  - Encoding: Uses 2 bytes for BMP characters; supplementary characters (outside the BMP) use 4 bytes, encoded as surrogate pairs.
  - Examples:
    - 'A': 00000000 01000001 (2 bytes)
    - €: 00100000 10101100 (2 bytes)
    - 😂: Represented with the surrogate pair 11011000 00111101 11011110 00000010 (4 bytes)
- UTF-32:
  - Bits: Fixed-length (4 bytes per character)
  - Usage: Sometimes used for internal processing in applications due to its simplicity in indexing characters.
  - Encoding: Every character, regardless of its complexity, is represented by a fixed 4 bytes.
  - Examples:
    - 'A': 00000000 00000000 00000000 01000001
    - €: 00000000 00000000 00100000 10101100
    - 😂: 00000000 00000001 11110110 00000010
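As a small illustration of the variable-length UTF-8 encodings listed above, the sketch below hardcodes those byte sequences as hex escapes and prints how many bytes each character occupies; it is a rough demonstration rather than a full Unicode-handling example.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* UTF-8 byte sequences written out explicitly as hex escapes */
    const char *letter = "A";                  /* U+0041,  1 byte  */
    const char *euro   = "\xE2\x82\xAC";       /* U+20AC,  3 bytes */
    const char *emoji  = "\xF0\x9F\x98\x82";   /* U+1F602, 4 bytes */

    printf("'A'   uses %zu byte(s) in UTF-8\n", strlen(letter));
    printf("Euro  uses %zu byte(s) in UTF-8\n", strlen(euro));
    printf("Emoji uses %zu byte(s) in UTF-8\n", strlen(emoji));
    return 0;
}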
Conversion of Character to Integer
To convert a character representing a digit (like '7') into its corresponding integer value, we need to understand how characters are stored. In ASCII, characters like '0', '1', ..., '9' are stored with specific decimal values:
- The ASCII value of '0' is 48 in decimal.
- The ASCII value of '1' is 49 in decimal.
- Similarly, the ASCII value of '7' is 55 in decimal.
So, if we have a character like '7', it is stored as 55 in decimal (or 0011 0111 in binary).
To convert this character to the integer value 7, we can use the following approach:
- Subtract the ASCII value of '0' from the ASCII value of the character.
- This works because ASCII values for digits are sequentially ordered from '0' to '9'.
Conversion Example:
To convert the character '7' to its integer value:
- Get the ASCII value of '7', which is 55.
- Subtract the ASCII value of '0' (which is 48) from 55: 55 - 48 = 7
- The result, 7, is the integer representation of the character '7'.
Code Example in C/C++:
char digitChar = '7';
int digitInt = digitChar - '0';
Here, digitInt will hold the integer value 7.
Why This Works:
The ASCII values for the characters '0' to '9' are in a contiguous range (48 to 57). Subtracting '0' from any digit character aligns it with its corresponding integer value:
- '1' - '0' results in 1
- '2' - '0' results in 2
- and so on up to '9' - '0', which results in 9
Converting a String into an Integer:
Now, if we have a string like "123", how do we convert it to an integer?
C's standard library provides a function for this: atoi (ASCII to integer). It converts a string to an integer and is declared in the stdlib.h header. It is commonly used to convert strings representing numbers, like "123", into actual integers (123).
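For reference, here is a small usage sketch of atoi from stdlib.h; the string literals are just illustrative values.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = atoi("123");   /* converts the string "123" to the integer 123 */
    int m = atoi("-42");   /* atoi also handles a leading minus sign */

    printf("%d %d\n", n + 1, m);   /* prints: 124 -42 */
    return 0;
}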
How it is actually done:
- Initialize the Result:
  - Start with an integer variable, say result, initialized to 0. This variable will store the final integer value.
- Iterate Through Each Character:
  - For each character in the string, convert it to an integer (similar to the method we discussed for single characters) and add it to result after shifting result left by one decimal place (multiplying by 10).
- Convert and Accumulate:
  - For each character c in the string:
    - Convert c to its integer value by subtracting '0'.
    - Multiply result by 10 and add the integer value of c.
    - This shifts result left by one decimal place, making space for the new digit.
Example: Converting "123" to an Integer:
Let's go through an example with the string "123":
- Initialize result = 0.
- Process each character:
  - For the first character, '1':
    - Convert '1' to an integer: '1' - '0' = 1.
    - Update result: result = result * 10 + 1 = 0 * 10 + 1 = 1.
  - For the second character, '2':
    - Convert '2' to an integer: '2' - '0' = 2.
    - Update result: result = result * 10 + 2 = 1 * 10 + 2 = 12.
  - For the third character, '3':
    - Convert '3' to an integer: '3' - '0' = 3.
    - Update result: result = result * 10 + 3 = 12 * 10 + 3 = 123.
At the end of this process, result holds the integer value 123.
#include <stdio.h>

int stringToInt(const char* str) {
    int result = 0;
    // Loop through each character in the string until the null terminator
    for (int i = 0; str[i] != '\0'; i++) {
        // Convert the character to integer and accumulate in result
        result = result * 10 + (str[i] - '0');
    }
    return result;
}

int main() {
    const char* str = "123";
    int number = stringToInt(str);
    printf("The integer value is: %d\n", number);
    return 0;
}
Explanation of Code:
- Loop through characters: The for loop iterates through each character in the string str until it encounters the null terminator '\0', which marks the end of the string.
- Convert and Accumulate: For each character, we subtract '0' to get its integer equivalent, then update result by shifting it left by one decimal place (multiplying by 10) and adding the new digit.
Important Considerations:
- Negative Numbers: For negative numbers, you may need to check for a leading '-' character and adjust the logic accordingly (see the sketch after this list).
- Non-digit Characters: This method assumes that the input string contains only valid digit characters. To handle non-digit characters, you would add a check to ensure each character is between '0' and '9'.
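As a rough illustration of those two considerations, here is one way the stringToInt function above could be extended; returning 0 on invalid input is an arbitrary choice for this sketch, not part of the original explanation.

#include <stdio.h>

// Sketch: string-to-int with a leading '-' check and digit validation
int stringToIntChecked(const char* str) {
    int result = 0;
    int sign = 1;
    int i = 0;

    // Check for an optional leading minus sign
    if (str[0] == '-') {
        sign = -1;
        i = 1;
    }

    for (; str[i] != '\0'; i++) {
        // Reject anything that is not a digit '0'..'9'
        if (str[i] < '0' || str[i] > '9') {
            return 0;
        }
        result = result * 10 + (str[i] - '0');
    }
    return sign * result;
}

int main() {
    printf("%d\n", stringToIntChecked("-456"));  // prints -456
    printf("%d\n", stringToIntChecked("12a3"));  // prints 0 (invalid input)
    return 0;
}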
Using Standard Library Functions
In many programming languages, there are standard functions for converting strings to integers:
- C: atoi() function
- C++: std::stoi() in the <string> header
- Python: int() function
- Java: Integer.parseInt()
These functions handle additional cases like negative numbers, but understanding the manual conversion process is helpful for learning how numbers are represented and manipulated in programming.
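If you want more control over error handling in C than atoi provides, strtol from stdlib.h is a common alternative; this is just a brief sketch of how it is typically called, with an illustrative input value.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *input = "123abc";   /* illustrative input with trailing junk */
    char *end;

    /* strtol reports where parsing stopped via the 'end' pointer */
    long value = strtol(input, &end, 10);

    printf("parsed value: %ld\n", value);   /* prints 123 */
    printf("unparsed tail: %s\n", end);     /* prints abc */
    return 0;
}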