Unicode is a standard for representing text in computers. It assigns a unique number (called a code point) to every character, regardless of platform, program, or language, covering everything from English letters and emoji to ancient scripts.
The arrow → is assigned the code point U+2192.
Important
Code points are expressed in hexadecimal numbers. Code points from U+0000 to U+007F correspond exactly to the original ASCII characters.
Code point ranges
The largest Unicode code point is U+10FFFF, a 6-digit hexadecimal number, so there are 1,114,112 possible code points (just over 1.1 million). Most characters in everyday use fall within the range U+0000 to U+FFFF (the Basic Multilingual Plane, BMP), which requires only 4 hex digits.
Characters outside this range, such as most emoji and many historical scripts, live in the supplementary planes and need 5 or 6 hex digits (e.g., U+1F600 for 😀).
| Code Point | Short Form | Full 6-Digit Form | Description |
|---|---|---|---|
| U+0000 | U+0000 | U+000000 | Null character |
| U+0041 | U+0041 | U+000041 | Letter ‘A’ |
| U+2600 | U+2600 | U+002600 | Sun symbol (☀) |
| U+1F600 | U+1F600 | U+01F600 | Grinning face emoji (😀) |
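As a rough illustration of the table above, the following Python sketch uses the built-in `ord` and `chr` to move between characters and code points. The helper names `format_code_point` and `plane` are made up for this example.

```python
def format_code_point(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    # At least four hex digits; more only when the code point needs them.
    return f"U+{ord(ch):04X}"


def plane(ch: str) -> int:
    """Return the Unicode plane number (0 is the BMP)."""
    return ord(ch) >> 16


for ch in ["A", "\u2600", "\U0001F600"]:  # 'A', sun symbol, grinning face
    print(format_code_point(ch), "plane", plane(ch))

# chr() goes the other way, from a code point to a character.
assert chr(0x2192) == "\u2192"  # the arrow from the introduction
```

Characters in the BMP report plane 0, while the grinning face emoji reports plane 1, which is why it needs a fifth hex digit.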
Unicode itself is just a set of code points. To actually store or transmit text, we need an encoding—a way to represent these code points in binary.
Normalization forms
Unicode provides multiple ways to represent certain characters. Normalization ensures consistency by converting these forms into a standard.
NFC (Normalization Form C): Combines characters into a single, precomposed form.
Example: é (U+00E9) is a single code point in NFC.
NFD (Normalization Form D): Splits characters into decomposed form.
Example: é becomes two code points: e (U+0065) and a combining acute accent (U+0301).
Normalization forms help ensure that text is treated consistently, especially when comparing or searching for strings. Different systems or tools might store text in different forms:
macOS has traditionally stored file names in a decomposed form close to NFD.
Linux and Windows generally keep text as it was entered, which is typically NFC.
This can cause issues when comparing files or rendering characters, especially if one system expects a precomposed form and encounters a decomposed one.
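A minimal sketch of the comparison problem, using the `unicodedata` module from Python's standard library. The helper `same_text` is a hypothetical name introduced only for this example.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ code point by code point.
print(precomposed == decomposed)  # False

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E9']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0065', 'U+0301']


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, decomposed))  # True
```

Normalizing both sides to the same form before comparing is one common way to avoid mismatches between systems; which form to pick is a design choice, and NFC is used here only as an example.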
UTF (Unicode Transformation Format) is a family of encodings for Unicode:
UTF-8: Variable-length encoding (1–4 bytes per character). It’s compact for ASCII characters (1 byte each) and widely used on the web.
UTF-16: Variable-length encoding (2 or 4 bytes per character). Used internally by Windows and Java.
UTF-32: Fixed-length (4 bytes for every character). Not commonly used due to high memory use.
Tip
Unicode is the abstract standard, and UTF-8, UTF-16, UTF-32 are concrete encodings that implement it.
Example
The sun symbol ☀ (official name BLACK SUN WITH RAYS) has the code point U+2600: the hexadecimal number 2600 identifies this character in Unicode.
Unicode transformation format (UTF)
UTF-8
UTF-8 is a variable-length encoding built on templates. Each template specifies how many bytes are used and how the bits of the code point are split across those bytes. The template defines:
leading bits in the first byte that indicate how many bytes are used for the encoding
continuation bytes, each starting with 10
The remaining bits in the template (marked x) are filled with the binary representation of the code point.
| Code Point Range | Bytes | Encoding Template |
|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Example
2-byte template: the leading 110 of the first byte signals that 2 bytes are allocated for this code point, and the 10 at the start of the second byte marks it as a continuation byte. That leaves 5 + 6 = 11 bits for the code point itself.
Example
U+2600 lies in the range U+0800 to U+FFFF, which requires 3 bytes. The value 0x2600 (binary 0010 0110 0000 0000) is split across the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx, which once filled is 11100010 10011000 10000000, or E2 98 80 in hex (3 bytes).
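The template filling can be sketched by hand in Python. This is an illustrative implementation rather than how any particular library does it, and `utf8_encode` is a name chosen for this example; in practice `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the UTF-8 templates by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:
        return bytes([code_point])                         # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([
            0b11000000 | (code_point >> 6),                # 110xxxxx
            0b10000000 | (code_point & 0b111111),          # 10xxxxxx
        ])
    if code_point <= 0xFFFF:
        return bytes([
            0b11100000 | (code_point >> 12),               # 1110xxxx
            0b10000000 | ((code_point >> 6) & 0b111111),   # 10xxxxxx
            0b10000000 | (code_point & 0b111111),          # 10xxxxxx
        ])
    return bytes([
        0b11110000 | (code_point >> 18),                   # 11110xxx
        0b10000000 | ((code_point >> 12) & 0b111111),      # 10xxxxxx
        0b10000000 | ((code_point >> 6) & 0b111111),       # 10xxxxxx
        0b10000000 | (code_point & 0b111111),              # 10xxxxxx
    ])


encoded = utf8_encode(0x2600)
print(" ".join(f"{b:02X}" for b in encoded))  # E2 98 80

# Cross-check against Python's built-in encoder.
assert encoded == "\u2600".encode("utf-8")
```

Each branch corresponds to one row of the template table: the shifts pick out the groups of bits, and the `|` operations add the fixed prefix of each byte.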
UTF-16
In UTF-16, characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) require 2 bytes, while code points above U+FFFF use 4 bytes (a surrogate pair). UTF-16 can also be big-endian or little-endian, which affects the order in which the bytes are stored.
U+2600 is below U+FFFF, so it is encoded in 2 bytes, typically written as four hexadecimal digits. Another example is code point U+0041 (Latin letter 'A'), which is 0000 0000 0100 0001 in binary and 0x0041 in hex (2 bytes: 00 41 in big-endian).
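To make the byte-order point concrete, here is a small sketch using Python's built-in `utf-16-be` and `utf-16-le` codecs. The variable names are illustrative only.

```python
letter_a = "A"  # U+0041

big_endian = letter_a.encode("utf-16-be")     # most significant byte first
little_endian = letter_a.encode("utf-16-le")  # least significant byte first

print(big_endian.hex())     # 0041
print(little_endian.hex())  # 4100

# The plain "utf-16" codec prepends a byte order mark (U+FEFF) so a reader
# can tell which order was used; the order itself follows the platform.
with_bom = letter_a.encode("utf-16")
print(len(with_bom))        # 4: two bytes of BOM plus two bytes for 'A'
```

The same two bytes appear in both encodings, only swapped, which is why a reader has to know (or detect) the byte order before decoding UTF-16 data.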
Characters outside the BMP require 4 bytes in UTF-16, encoded as two 16-bit units called surrogate pairs. These are created using the following steps:
Normalize the Code Point: Subtract 0x10000 from the code point to shift its range to 0x00000–0xFFFFF (20 bits). This ensures the value fits into two 10-bit halves.
Split into 10-Bit Values: Divide the 20-bit value into two parts:
High 10 bits (most significant).
Low 10 bits (least significant).
Add Surrogate Prefixes:
High surrogates: Add 0xD800 to the high 10 bits (0xD800–0xDBFF).
Low surrogates: Add 0xDC00 to the low 10 bits (0xDC00–0xDFFF).
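The three steps above can be sketched in a few lines of Python, using U+1F600 (😀) as the input. The function name `to_surrogate_pair` is an assumption made for this example, not a standard library call.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000   # step 1: shift into a 20-bit range
    high_bits = offset >> 10        # step 2: most significant 10 bits
    low_bits = offset & 0x3FF       #         least significant 10 bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # step 3: add the prefixes


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's built-in UTF-16 encoder (big-endian, no BOM).
expected = "\U0001F600".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

For U+1F600 the offset is 0x0F600, whose high 10 bits are 0x03D and low 10 bits are 0x200, giving the pair D83D DE00.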
UTF-32
UTF-32 adopts fixed-length encoding: every character is 4 bytes, regardless of code point.
UTF-8 is the most widely used, compact for ASCII and Latin scripts, slightly larger for others. Ideal for text files, web content, and databases.
UTF-16 is efficient for languages with many non-ASCII characters (e.g., East Asian languages) and is used internally by many systems, like Windows and Java.
UTF-32 simplifies processing since each character is a fixed size and is used in some low-level systems where memory usage is less critical.
Example U+2600
| Encoding | Representation (Hex) | Bytes Used | Notes |
|---|---|---|---|
| UTF-8 | E2 98 80 | 3 | Compact, ASCII-compatible |
| UTF-16 | 26 00 (BE) | 2 | Efficient for most common characters |
| UTF-32 | 00 00 26 00 (BE) | 4 | Simplest but least space-efficient |
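The values in this comparison can be reproduced with Python's built-in codecs. This is only a quick check; the big-endian codec names are chosen here so that no byte order mark is added to the output.

```python
sun = "\u2600"  # the sun symbol used in the comparison above

# Explicit big-endian codecs keep Python from adding a byte order mark.
encoded = {codec: sun.encode(codec) for codec in ("utf-8", "utf-16-be", "utf-32-be")}

for codec, data in encoded.items():
    hex_bytes = " ".join(f"{b:02X}" for b in data)
    print(f"{codec}: {hex_bytes} ({len(data)} bytes)")

# utf-8: E2 98 80 (3 bytes)
# utf-16-be: 26 00 (2 bytes)
# utf-32-be: 00 00 26 00 (4 bytes)
```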
Using xxd on Linux to inspect characters
xxd is a Linux command-line utility that creates a hex dump of a file or input, showing the binary representation of data in a human-readable format.
echo -n '→' | xxd -ps -u
Output:
E28692
This tells you that the character → is encoded in UTF-8 as three bytes: E2 86 92.