Unicode is a standard for representing text in computers. It assigns a unique number (called a code point) to every character, regardless of platform, program, or language, covering everything from English letters and emoji to ancient scripts.

The arrow → is assigned the code point U+2192.

Important

Code points are expressed in hexadecimal numbers. Code points from U+0000 to U+007F correspond exactly to the original ASCII characters.
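
As a minimal sketch of the idea, Python's built-in ord() and chr() convert between a character and its code point. The variable names here are illustrative only.

```python
# Look up the code point of a character and convert it back again.
arrow = "\u2192"                      # the arrow character →
code_point = ord(arrow)               # integer value of the code point
label = f"U+{code_point:04X}"         # conventional hexadecimal notation
print(label)                          # U+2192

# chr() is the inverse of ord().
assert chr(0x2192) == arrow

# The first 128 code points match the ASCII byte values.
ascii_bytes = "A".encode("ascii")
assert ord("A") == ascii_bytes[0] == 0x41
```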

Code point ranges

The largest Unicode code point is U+10FFFF, a 6-digit hexadecimal number, so there are 1,114,112 possible code points (just over 1.1 million). Most commonly used characters fall within the range U+0000 to U+FFFF (the Basic Multilingual Plane, BMP), which requires only 4 hex digits.

Characters outside this range, such as emoji and historical scripts, live in the supplementary planes and need 5 or 6 hex digits (e.g., U+1F600 for 😀, or U+01F600 when padded to the full 6-digit form).
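
The ranges above can be sketched in a few lines of Python. The helper name describe is hypothetical; it only formats a code point in the short and 6-digit forms and reports its plane (plane 0 is the BMP).

```python
def describe(code_point: int) -> str:
    """Return the short form, 6-digit form, and plane of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    plane = code_point >> 16            # each plane holds 0x10000 code points
    short = f"U+{code_point:04X}"       # at least 4 hex digits
    full = f"U+{code_point:06X}"        # always 6 hex digits
    return f"{short} {full} plane {plane}"

for ch in ("A", "\u2600", "\U0001F600"):   # A, ☀, 😀
    print(describe(ord(ch)))

# Total number of possible code points: U+0000 through U+10FFFF.
total = 0x10FFFF + 1
print(total)                            # 1114112
```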

| Code Point | Short Form | Full 6-Digit Form | Description |
|---|---|---|---|
| U+0000 | U+0000 | U+000000 | Null character |
| U+0041 | U+0041 | U+000041 | Letter ‘A’ |
| U+2600 | U+2600 | U+002600 | Sun symbol (☀) |
| U+1F600 | U+1F600 | U+01F600 | Grinning face emoji (😀) |

Unicode itself is just a set of code points. To actually store or transmit text, we need an encoding: a way to represent these code points in binary.

Normalization forms

Unicode provides multiple ways to represent certain characters. Normalization ensures consistency by converting these forms into a standard.

  • NFC (Normalization Form C): Combines characters into a single, precomposed form.
    • Example: é (U+00E9) is a single code point in NFC.
  • NFD (Normalization Form D): Splits characters into decomposed form.
    • Example: é becomes two code points: e (U+0065) and a combining acute accent (U+0301).
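
The two forms of é described above can be compared with Python's standard unicodedata module. This is a small sketch; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"       # é as one precomposed code point (NFC)
decomposed = "e\u0301"    # e followed by a combining acute accent (NFD)

# The strings render the same but are different code point sequences.
print(composed == decomposed)             # False
print(len(composed), len(decomposed))     # 1 2

# Normalizing both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)  # True True
```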

Normalization forms help ensure that text is treated consistently, especially when comparing or searching for strings. Different systems or tools might store text in different forms:

  • macOS tends to use NFD (notably for file names).
  • Linux and Windows typically use NFC.

This can cause issues when comparing files or rendering characters, especially if one system expects a precomposed form and encounters a decomposed one.

UTF (Unicode Transformation Format) is a family of encodings for Unicode:

  • UTF-8: Variable-length encoding (1–4 bytes per character). It’s compact for ASCII characters (1 byte each) and widely used on the web.
  • UTF-16: Variable-length encoding (2 or 4 bytes per character). Used internally by Windows and Java.
  • UTF-32: Fixed-length encoding (4 bytes for every character). Rarely used for storage or transmission because of its high memory use.
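
To get a feel for these sizes, the following sketch counts the bytes each encoding needs for a few sample characters. It uses the explicit big-endian codecs so that Python does not prepend a byte order mark; the sample list is an arbitrary choice.

```python
samples = ["A", "\u00e9", "\u2600", "\U0001F600"]   # A, é, ☀, 😀
codecs = ("utf-8", "utf-16-be", "utf-32-be")

byte_counts = {}
for ch in samples:
    # Number of bytes needed in UTF-8, UTF-16, and UTF-32 respectively.
    byte_counts[ch] = tuple(len(ch.encode(codec)) for codec in codecs)
    print(f"U+{ord(ch):04X}", byte_counts[ch])
```

Only the character outside the BMP (😀) needs 4 bytes in UTF-16, while UTF-32 always uses 4.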

Tip

Unicode is the abstract standard, and UTF-8, UTF-16, UTF-32 are concrete encodings that implement it.

Example

The sun symbol ☀ (official name: BLACK SUN WITH RAYS) has the code point U+2600, meaning the hexadecimal number 2600 identifies this character in Unicode.

Unicode transformation format (UTF)

UTF-8

UTF-8 uses a variable-length encoding built on templates. Each template specifies how many bytes are used and how the bits of the code point are split across those bytes. The template defines:

  • leading bits in the first byte that indicate how many bytes are used for the encoding
  • continuation bytes, each starting with 10

The remaining bits in the template (marked x) are filled with the binary representation of the code point.

| Code Point Range | Bytes | Encoding Template |
|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |

Example

2-byte template: the leading 110 in the first byte signals that 2 bytes are allocated for this code point, and the 10 at the start of the second byte marks it as a continuation byte. That leaves 5 + 6 = 11 bits for the code point itself.

Example

U+2600 lies in the range U+0800 to U+FFFF, which requires 3 bytes. The value 0x2600 (binary 0010 0110 0000 0000) is distributed over the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx, which gives 11100010 10011000 10000000, or E2 98 80 in hex (3 bytes).
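
The template filling can also be written out by hand. The function below is a teaching sketch, not how any particular library implements UTF-8; the name utf8_encode is hypothetical, and in practice str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the UTF-8 templates by hand."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return bytes([code_point])                       # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([
            0b11000000 | (code_point >> 6),              # 110xxxxx
            0b10000000 | (code_point & 0x3F),            # 10xxxxxx
        ])
    if code_point <= 0xFFFF:
        return bytes([
            0b11100000 | (code_point >> 12),             # 1110xxxx
            0b10000000 | ((code_point >> 6) & 0x3F),     # 10xxxxxx
            0b10000000 | (code_point & 0x3F),            # 10xxxxxx
        ])
    if code_point <= 0x10FFFF:
        return bytes([
            0b11110000 | (code_point >> 18),             # 11110xxx
            0b10000000 | ((code_point >> 12) & 0x3F),    # 10xxxxxx
            0b10000000 | ((code_point >> 6) & 0x3F),     # 10xxxxxx
            0b10000000 | (code_point & 0x3F),            # 10xxxxxx
        ])
    raise ValueError("outside the Unicode range")

print(utf8_encode(0x2600).hex(" ").upper())   # E2 98 80
```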

UTF-16

In UTF-16, characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) require 2 bytes, while code points above U+FFFF use 4 bytes (surrogate pairs). Additionally, UTF-16 can be big-endian or little-endian, which affects the order in which the bytes are stored.

U+2600 is below U+FFFF, so it's encoded in 2 bytes, typically expressed as four hexadecimal digits: 0x2600 (26 00 in big-endian). Another example is code point U+0041 (Latin letter 'A'), which becomes 0000 0000 0100 0001 in binary, or 0x0041 (2 bytes: 00 41 in big-endian).

Characters outside the BMP require 4 bytes in UTF-16, encoded as two 16-bit units called a surrogate pair. A surrogate pair is created using the following steps:

  • Offset the Code Point: Subtract 0x10000 from the code point to shift its range to 0x00000–0xFFFFF (20 bits). This ensures the value fits into two 10-bit halves.
  • Split into 10-Bit Values: Divide the 20-bit value into two parts:
    • High 10 bits (most significant).
    • Low 10 bits (least significant).
  • Add Surrogate Prefixes:
    • High surrogate: Add 0xD800 to the high 10 bits, giving a value in the range 0xD800–0xDBFF.
    • Low surrogate: Add 0xDC00 to the low 10 bits, giving a value in the range 0xDC00–0xDFFF.
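
The steps above can be sketched as a short function. The name to_surrogate_pair is hypothetical; the result is checked against Python's own big-endian UTF-16 codec.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # now a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)     # 😀
print(f"{high:04X} {low:04X}")             # D83D DE00

# Cross-check against the built-in codec.
assert "\U0001F600".encode("utf-16-be") == bytes.fromhex("D83DDE00")
```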

UTF-32

UTF-32 adopts fixed-length encoding: every character is 4 bytes, regardless of code point.

Example

U+2600 fits directly into 4 bytes. Binary: 0000 0000 0000 0000 0010 0110 0000 0000. Final UTF-32 encoding (hex): 00 00 26 00 (4 bytes, big-endian).

Comparison

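UTF-8 is the most widely used encoding. It is compact for ASCII (1 byte per character) and for most other Latin-script text (2 bytes for most accented letters), but larger for others, such as East Asian scripts (typically 3 bytes per character). Ideal for text files, web content, and databases.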

UTF-16 is efficient for languages with many non-ASCII characters (e.g., East Asian languages) and is used internally by many systems, like Windows and Java.

UTF-32 simplifies processing since each character is a fixed size and is used in some low-level systems where memory usage is less critical.

Example U+2600

| Encoding | Representation (Hex) | Bytes Used | Notes |
|---|---|---|---|
| UTF-8 | E2 98 80 | 3 | Compact, ASCII-compatible |
| UTF-16 | 26 00 (BE) | 2 | Efficient for most common characters |
| UTF-32 | 00 00 26 00 (BE) | 4 | Simplest but least space-efficient |
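
The table can be reproduced with Python's built-in codecs, which also makes the effect of byte order visible. This is a sketch using the explicit -be and -le codec names; the plain "utf-16" and "utf-32" codecs would additionally prepend a byte order mark.

```python
sun = "\u2600"   # ☀

encoded = {}
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    encoded[name] = sun.encode(name)
    hex_bytes = encoded[name].hex(" ").upper()
    print(f"{name:10} {hex_bytes:12} {len(encoded[name])} bytes")
```

Note how the little-endian variants store the same bytes in reverse order (00 26 instead of 26 00).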

Using xxd on Linux to inspect characters

xxd is a Linux command-line utility that creates a hex dump of a file or input, showing the binary representation of data in a human-readable format.

echo -n '→' | xxd -ps -u

Output:

E28692

This tells you that the character is encoded in UTF-8 as three bytes: E2 86 92.
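
If xxd is not available, roughly the same check can be done in Python. This is only an equivalent sketch, not part of xxd itself.

```python
arrow = "\u2192"                                  # the arrow character →
hex_dump = arrow.encode("utf-8").hex().upper()    # same bytes xxd prints
print(hex_dump)                                   # E28692

# Going the other way: turn the hex dump back into the character.
decoded = bytes.fromhex(hex_dump).decode("utf-8")
assert decoded == arrow
```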


Practical Tips for Working with Unicode

  1. Check the Encoding of a File:
    file -i your_file
    Example output: your_file: text/plain; charset=utf-8
  2. Normalize Text to NFC:
    uconv -x any-nfc -o normalized_file your_file
    Or, in Python:
    import unicodedata
    normalized_text = unicodedata.normalize('NFC', 'your_text')
    print(normalized_text)
  3. Use UTF-8 Everywhere: Ensure your tools and files consistently use UTF-8 encoding.
  4. Inspect Characters with xxd:
    echo -n 'your_character' | xxd -ps -u
    This reveals the exact byte sequence of the character in its encoding.
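
Tips 2 and 3 can be combined into one small helper. The sketch below assumes the input file is valid UTF-8 (reading raises UnicodeDecodeError otherwise); the function name normalize_file_to_nfc and the file names in the usage comment are hypothetical.

```python
import unicodedata
from pathlib import Path

def normalize_file_to_nfc(source: Path, target: Path) -> bool:
    """Write an NFC-normalized UTF-8 copy of source; return True if the text changed."""
    text = source.read_text(encoding="utf-8")        # fails loudly on non-UTF-8 input
    normalized = unicodedata.normalize("NFC", text)
    target.write_text(normalized, encoding="utf-8")  # always write UTF-8
    return normalized != text

# Example usage (paths are placeholders):
# changed = normalize_file_to_nfc(Path("your_file"), Path("normalized_file"))
```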