Unicode and UTF

Unicode is a standard for representing text in computers. It assigns a unique number (called a code point) to every character, regardless of platform, program, or language, covering everything from English letters and emoji to ancient scripts.

The arrow → is assigned the code point U+2192.

Important

Code points are expressed in hexadecimal numbers. Code points from U+0000 to U+007F correspond exactly to the original ASCII characters.
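As a quick illustration, Python's built-in `ord` returns a character's code point, which can then be printed in the usual U+XXXX hexadecimal notation. The helper name `code_point_label` is only an illustrative choice, not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))       # U+0041
print(code_point_label("\u2192"))  # U+2192 (the arrow →)

# In the ASCII range the code point equals the ASCII byte value.
print("A".encode("ascii")[0] == ord("A"))  # True
```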

Code point ranges

The largest Unicode code point is U+10FFFF, a six-digit hexadecimal number, so there are 1,114,112 possible code points (0x110000). Most characters in everyday use fall within the range U+0000 to U+FFFF (the Basic Multilingual Plane, BMP), which requires only four hex digits.

Characters outside this range, such as emoji and historical scripts, live in the supplementary planes and need five or six hex digits (e.g., U+1F600 for 😀).
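One way to see the split between the BMP and the supplementary planes is to look at the bits above the low 16: a plane holds 0x10000 code points, so shifting the code point right by 16 gives its plane number. This is a small sketch; `plane_of` and `TOTAL_CODE_POINTS` are names made up for the example.

```python
def plane_of(ch: str) -> int:
    """Return the plane number: 0 is the BMP, 1 to 16 are supplementary."""
    return ord(ch) >> 16


# U+0000 through U+10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1

print(plane_of("\u2600"))      # 0 (sun symbol, inside the BMP)
print(plane_of("\U0001F600"))  # 1 (grinning face, supplementary plane)
print(TOTAL_CODE_POINTS)       # 1114112
```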

Code Point	Short Form	Full 6-Digit Form	Description
`U+0000`	`U+0000`	`U+000000`	Null character
`U+0041`	`U+0041`	`U+000041`	Letter ‘A’
`U+2600`	`U+2600`	`U+002600`	Sun symbol (☀)
`U+1F600`	`U+1F600`	`U+01F600`	Grinning face emoji (😀)

Unicode itself is just a set of code points. To actually store or transmit text, we need an encoding: a way to represent these code points in binary.

Normalization forms

Unicode provides multiple ways to represent certain characters. Normalization ensures consistency by converting these forms into a standard.

NFC (Normalization Form C): Combines characters into a single, precomposed form.
- Example: é (U+00E9) is a single code point in NFC.
NFD (Normalization Form D): Splits characters into decomposed form.
- Example: é becomes two code points: e (U+0065) and a combining acute accent (U+0301).
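The difference between the two forms can be checked with Python's standard `unicodedata` module. The sketch below builds both spellings of é from escapes so the code points are unambiguous; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point (NFC form)
decomposed = "e\u0301"  # e followed by a combining acute accent (NFD form)

# The two strings render the same but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalizing both to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```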

Normalization forms help ensure that text is treated consistently, especially when comparing or searching for strings. Different systems or tools might store text in different forms:

macOS tends to use NFD.
Linux and Windows typically use NFC.

This can cause issues when comparing files or rendering characters, especially if one system expects a precomposed form and encounters a decomposed one.

UTF (Unicode Transformation Format) is a family of encodings for Unicode:

UTF-8: Variable-length encoding (1–4 bytes per character). It’s compact for ASCII characters (1 byte each) and widely used on the web.
UTF-16: Variable-length encoding (2 or 4 bytes per character). Used internally by Windows and Java.
UTF-32: Fixed-length encoding (4 bytes for every character). Not commonly used because of its high memory use.

Tip

Unicode is the abstract standard, and UTF-8, UTF-16, UTF-32 are concrete encodings that implement it.

Example

The sun symbol ☀ (Unicode name: BLACK SUN WITH RAYS) has the code point U+2600, meaning the hexadecimal number 2600 represents this character in Unicode.

Unicode transformation format (UTF)

UTF-8

UTF-8 is a variable-length encoding built on templates. Each template specifies how many bytes are used and how the bits of the code point are split across those bytes. A template defines:

leading bits in the first byte that indicate how many bytes the sequence uses
continuation bytes, each starting with 10

The remaining bits in the template (the x positions) are filled with the binary representation of the code point.

Code Point Range	Bytes	Encoding Template
`U+0000` to `U+007F`	1	`0xxxxxxx`
`U+0080` to `U+07FF`	2	`110xxxxx 10xxxxxx`
`U+0800` to `U+FFFF`	3	`1110xxxx 10xxxxxx 10xxxxxx`
`U+10000` to `U+10FFFF`	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

Example

2-byte template: the leading 110 in the first byte signals a 2-byte sequence, and the second byte carries the continuation prefix 10. That leaves 5 + 6 = 11 bits for the code point.

Example

U+2600 lies in the range U+0800 to U+FFFF, which requires 3 bytes. The value 0x2600 (binary 0010 0110 0000 0000) is split across the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx, which gives 11100010 10011000 10000000, or E2 98 80 in hex (3 bytes).
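The template-filling procedure can be written out by hand with shifts and masks. The function below is a minimal sketch for illustration (in practice `str.encode("utf-8")` does this); the name `utf8_encode` is made up for the example, and rejecting surrogate code points is a choice made here to match what valid UTF-8 allows.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the UTF-8 templates by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


print(utf8_encode(0x2600).hex(" ").upper())  # E2 98 80
```

Each `& 0x3F` keeps six payload bits for a continuation byte, and the `|` adds the fixed prefix from the template.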

UTF-16

In UTF-16, characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) require 2 bytes, while code points above U+FFFF use 4 bytes (a surrogate pair). Additionally, UTF-16 can be big-endian or little-endian, which affects the order in which the bytes are stored.

U+2600 is below U+FFFF, so it is encoded in 2 bytes, typically written as four hexadecimal digits: 0x2600 (26 00 in big-endian). Another example is the code point U+0041 (Latin letter 'A'), which becomes 0000 0000 0100 0001 in binary, or 0x0041 (2 bytes: 00 41 in big-endian).
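The effect of byte order is easy to see with Python's `utf-16-be` and `utf-16-le` codecs, which encode without a byte order mark. This is only a small demonstration; the helper `utf16_hex` is a name invented for it.

```python
def utf16_hex(text: str, byteorder: str) -> str:
    """Hex dump of text in UTF-16; byteorder is 'be' or 'le'."""
    return text.encode(f"utf-16-{byteorder}").hex(" ").upper()


print(utf16_hex("\u2600", "be"))  # 26 00
print(utf16_hex("\u2600", "le"))  # 00 26
print(utf16_hex("A", "be"))       # 00 41
print(utf16_hex("A", "le"))       # 41 00
```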

Characters outside the BMP require 4 bytes in UTF-16, encoded as two 16-bit units called surrogate pairs. These are created using the following steps:

Normalize the code point: Subtract 0x10000 from the code point to shift it into the range 0x00000–0xFFFFF (20 bits). This ensures the value fits into two 10-bit halves.
Split into 10-bit values: Divide the 20-bit value into two parts:
- High 10 bits (most significant).
- Low 10 bits (least significant).
Add surrogate prefixes:
- High surrogate: Add 0xD800 to the high 10 bits (result in 0xD800–0xDBFF).
- Low surrogate: Add 0xDC00 to the low 10 bits (result in 0xDC00–0xDFFF).
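The three steps above translate almost line for line into code. The following is a sketch under the assumption of Python 3.9+ (for the `tuple[int, int]` annotation); `to_surrogates` is a hypothetical helper name, and U+1F600 is used as the worked example.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000                # step 1: shift into a 20-bit range
    high = 0xD800 + (offset >> 10)       # steps 2 and 3: top 10 bits + prefix
    low = 0xDC00 + (offset & 0x3FF)      # steps 2 and 3: bottom 10 bits + prefix
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

For U+1F600 the offset is 0xF600, whose high 10 bits are 0x03D and low 10 bits are 0x200, giving D83D and DE00.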

UTF-32

UTF-32 adopts fixed-length encoding: every character is 4 bytes, regardless of code point.

Example

U+2600 fits directly into 4 bytes. Binary: 0000 0000 0000 0000 0010 0110 0000 0000. Final UTF-32 encoding (hex): 00 00 26 00 (4 bytes, big-endian).
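Because UTF-32 stores the code point as a plain 4-byte integer, the encoding can be reproduced with `int.to_bytes` and compared against Python's `utf-32-be` codec. The helper name `utf32_be` is only illustrative.

```python
def utf32_be(cp: int) -> bytes:
    """Write a code point as a 4-byte big-endian integer (UTF-32BE)."""
    return cp.to_bytes(4, "big")


manual = utf32_be(0x2600)
print(manual.hex(" ").upper())                   # 00 00 26 00
print(manual == "\u2600".encode("utf-32-be"))    # True
```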

Comparison

UTF-8 is the most widely used, compact for ASCII and Latin scripts, slightly larger for others. Ideal for text files, web content, and databases.

UTF-16 is efficient for languages with many non-ASCII characters (e.g., East Asian languages), where most BMP characters take 2 bytes instead of the 3 that UTF-8 needs. It is used internally by many systems, such as Windows and Java.

UTF-32 simplifies processing since each character is a fixed size and is used in some low-level systems where memory usage is less critical.

Example U+2600

Encoding	Representation (Hex)	Bytes Used	Notes
UTF-8	`E2 98 80`	3	Compact, ASCII-compatible
UTF-16	`26 00` (BE)	2	Efficient for most common characters
UTF-32	`00 00 26 00` (BE)	4	Simplest but least space-efficient
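To compare the encodings beyond a single character, the byte counts can be measured directly. This sketch assumes Python 3.9+ and uses the big-endian codecs so that no byte order mark is counted; `encoded_sizes` is a name chosen for the example.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Number of bytes text occupies in each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(text.encode(enc)) for enc in encodings}


# An ASCII letter, the sun symbol (BMP), and an emoji (supplementary plane).
for sample in ("A", "\u2600", "\U0001F600"):
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

The ASCII letter takes 1, 2, and 4 bytes respectively; the sun symbol takes 3, 2, and 4; the emoji takes 4 bytes in all three.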

Using xxd on Linux to inspect characters

xxd is a Linux command-line utility that creates a hex dump of a file or input, showing the binary representation of data in a human-readable format.

```
echo -n '→' | xxd -ps -u
```

Output:

```
E28692
```

This tells you that the character → is encoded in UTF-8 as three bytes: E2 86 92.
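Where xxd is not available, roughly the same check can be done in Python with `str.encode` and `bytes.hex`. This is an approximation of the command above rather than a replacement for xxd; `hex_dump` is a hypothetical helper name.

```python
def hex_dump(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of text as uppercase hex, like xxd -ps -u."""
    return text.encode(encoding).hex().upper()


print(hex_dump("\u2192"))  # E28692 (the arrow → in UTF-8)
```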

Practical Tips for Working with Unicode

Check the Encoding of a File:
```
file -i your_file
```
Example output: `your_file: text/plain; charset=utf-8`.

Normalize Text to NFC:

```
uconv -x any-nfc -o normalized_file your_file
```

Or, in Python:

```
import unicodedata

# Replace 'your_text' with the string to normalize to NFC.
normalized_text = unicodedata.normalize('NFC', 'your_text')
print(normalized_text)
```

Use UTF-8 Everywhere: Ensure your tools and files consistently use UTF-8 encoding.
Inspect Characters with xxd:
```
echo -n 'your_character' | xxd -ps -u
```
This reveals the exact byte sequence of the character in its encoding.
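To back up the "UTF-8 everywhere" tip programmatically, one simple option is to try a strict decode and see whether it fails. This is a minimal sketch: it only tells you whether the bytes are well-formed UTF-8, not which other encoding they might be in, and `is_valid_utf8` is a name invented for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xe2\x86\x92"))  # True  (the arrow →)
print(is_valid_utf8(b"\xe2\x86"))      # False (truncated 3-byte sequence)
```

For a file, the same function could be applied to the result of reading it in binary mode, for example `open(path, "rb").read()` with a path of your choosing.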
