ASCII, Unicode, UTF-8? How do computers process text?

Introduction

I realized that I don’t fully understand how strings work in Go, so I dug into the documentation to understand how computers process text.

How did computers historically process text?

  1. Computers store and compute using numbers, and text is encoded using a series of numbers.
  2. Early computer manufacturers each had their own encoding standards (for example, EBCDIC🔗).
  3. ASCII🔗 (American Standard Code for Information Interchange) standardized the commonly used English characters, encoding each one in 7 bits; the highest bit of the byte was reserved for parity checks, because transmission error rates were high and proper protocols were lacking at the time.
  4. Different vendors later used the 8th bit to represent additional characters, and these incompatible extensions led to widespread confusion and “garbled text” problems.

The need to support more characters

But how do we deal with the thousands of Chinese characters? A new standard, the Universal Character Set Unicode🔗, assigns a unique codepoint🔗 to every character in the world.
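
As a quick illustration in Go (the language that prompted this note), a rune is simply a Unicode codepoint, and fmt’s %U verb prints it in the usual U+XXXX notation. A minimal sketch:

package main

import "fmt"

func main() {
    // Every character, in any language, maps to exactly one codepoint.
    for _, r := range "A中😀" {
        fmt.Printf("%c = %U\n", r, r)
    }
    // Output:
    // A = U+0041
    // 中 = U+4E2D
    // 😀 = U+1F600
}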

UTF-8 vs UTF-16 vs UTF-32

So far we have talked about standards that map “a number to a character”, but more concretely, how should that text be stored as bytes? That is the job of the Unicode Transformation Formats (UTF), where the number after “UTF-” is the size, in bits, of the smallest unit used for storage. A few practical constraints shaped which format won out:

  1. Using more space than necessary to store characters is wasteful.
  2. The new storage format must be backward compatible with ASCII.
  3. Too many existing programs were already built on 8-bit encodings.

Considering the above, UTF-8 has become the most popular encoding format on the internet.
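
To make the size trade-off concrete, here is a rough Go sketch that measures how many bytes a few characters take in each format, using the standard unicode/utf16 package and treating a []rune as a stand-in for UTF-32:

package main

import (
    "fmt"
    "unicode/utf16"
)

func main() {
    for _, s := range []string{"A", "中", "😀"} {
        r := []rune(s)
        fmt.Printf("%s  UTF-8: %d bytes  UTF-16: %d bytes  UTF-32: %d bytes\n",
            s,
            len(s),                 // Go strings are UTF-8, so len counts UTF-8 bytes
            2*len(utf16.Encode(r)), // each UTF-16 code unit is 2 bytes
            4*len(r))               // UTF-32 spends 4 bytes per codepoint
    }
    // "A" → 1 / 2 / 4,  "中" → 3 / 2 / 4,  "😀" → 4 / 4 / 4
}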

Features of UTF-32:

  • Each character is represented using 4 Bytes.
  • Not compatible with ASCII.
  • Easy to process: the nth character can be accessed directly by index (see the Go sketch after this list).
  • Wastes space and is rarely used in practice.
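
Go strings are not UTF-32, but converting one to []rune gives the same fixed-width view, which is why direct indexing by character works there. A small sketch (the string is just an example):

package main

import "fmt"

func main() {
    s := "héllo, 世界"
    runes := []rune(s) // fixed width: one codepoint per element, like UTF-32

    // Direct index access: the 8th character, no decoding required.
    fmt.Printf("%c\n", runes[7]) // 世

    // Indexing the raw UTF-8 bytes does not work the same way:
    // s[7] is a single byte, not the 8th character.
}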

Features of UTF-16:

  • Each character is represented using 2 or 4 Bytes.
  • Not compatible with ASCII.
  • Most common characters are represented using 2 Bytes.
  • Languages like JavaScript and Java use this encoding internally, which is why string lengths can be surprising:
const s = "😀"; // outside the 16-bit range, stored as a surrogate pair (two code units)
console.log(s.length); // 2 (not 1!): length counts UTF-16 code units

Features of UTF-8:

  • Each character is represented using 1 to 4 Bytes.
  • ASCII compatible: ASCII characters use only 1 Byte, and the encoding is exactly the same.
  • Self-synchronization: decoding can start from any position, because the first few bits of each Byte indicate whether it is a single-byte character or part of a multi-byte character:
Where am I in the character?
See 0xxxxxxx → "I'm in a single-byte character"
See 110xxxxx → "I'm at the beginning of a two-byte character"
See 1110xxxx → "I'm at the beginning of a three-byte character"
See 11110xxx → "I'm at the beginning of a four-byte character"
See 10xxxxxx → "I'm somewhere in the middle of a multi-byte character"
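
Coming back to the original Go question: a Go string is a read-only slice of UTF-8 bytes, so len counts bytes, while ranging over the string (or using unicode/utf8) decodes codepoints. A small sketch tying the pieces above together:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "Go語😀"

    fmt.Println(len(s))                    // 9 bytes (1 + 1 + 3 + 4)
    fmt.Println(utf8.RuneCountInString(s)) // 4 characters

    // Printing each byte in binary shows the leading-bit patterns above.
    for i := 0; i < len(s); i++ {
        fmt.Printf("%08b\n", s[i])
    }

    // Ranging over a string decodes UTF-8 for you, yielding runes (codepoints).
    for i, r := range s {
        fmt.Printf("byte offset %d: %c (%U)\n", i, r, r)
    }
}

This also explains the earlier UTF-16 surprise: Go’s for-range counts runes (codepoints), while JavaScript’s .length counts UTF-16 code units.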

Further Reading