The World Beyond ASCII: Unicode and UTF-8

How do you represent every language in the world?

ASCII was published in 1963 and, for almost 30 years, it was enough. The computing world was smaller, more closed off, and the people who used computers back then used them mostly in English. ASCII worked, and it worked very well. To this day it's still present everywhere: any code you write in Rust will use ASCII characters, any website address you type into your browser will use ASCII characters. It hasn't disappeared, and it's not going anywhere anytime soon.

See:

urust.org contains only characters present in ASCII. So does https://.

But as you saw in the subsection The Limitation of ASCII in the previous section, as computers spread across the world and communication networks connected different countries and cultures, ASCII's 128 symbols revealed a limit that, sooner or later, would have to be dealt with.

It's also worth remembering another detail you saw in the previous section: one of the reasons ASCII limited itself to 7 bits was cost. In the 1960s, every extra bit of hardware was too expensive to waste, and the ASA committee had to be smart about how many symbols to include. 128 was enough for the need, and 128 was what fit the era's budget.

Three decades is a long time in the world of technology. Between 1961 and the late 1980s, computers went from being enormous machines that filled entire rooms and cost fortunes - accessible only to governments and large corporations - to equipment that fit on a desk and that ordinary people could buy. The capacity to store information and represent bits got so much cheaper that nobody needed to lose sleep anymore over how many bits they were using per character. The technological obstacle had disappeared. What was missing now wasn't making computers cheaper. It was exactly what had been missing before ASCII: an agreement.

In 1987, three engineers (Joe Becker, who worked at Xerox, and Lee Collins and Mark Davis, who worked at Apple) started working on an idea that, at the time, sounded almost impossible: a single table that included every writing system of all humanity. Not just the Western alphabets, but also Arabic, Japanese, Chinese, Korean, Devanagari, Egyptian hieroglyphs, the cuneiform script of the Sumerians, and hundreds of others. That includes writing systems of civilizations that don't even exist anymore, but whose symbols would be worth having available, so they could be represented on a computer if anyone wanted to. A project that, if it worked, would solve once and for all the problem ASCII couldn't.

Four years later, in 1991, the project's first version was published. The project's name was Unicode.

Unicode

There's nothing magical or extraordinarily different about Unicode compared to ASCII. It works the same way, each character receiving a unique number. What changes is the scale of the question the project was trying to answer.

While in 1961 the ASA committee asked "how many symbols does written English need", and the answer was 95, the three engineers in 1987 asked: how many symbols does all of humanity need? And the answer to that question is much bigger than it seems. Do you have an estimate?

Most of the world uses the Latin alphabet, with 26 letters.
Russian uses Cyrillic, with 33.
Arabic has 28 base letters, and is written from right to left.
Japanese mixes three writing systems within the same text, and just one of them alone has thousands of characters.
Chinese has tens of thousands.

And that's not counting hieroglyphs, cuneiform script, alphabets of extinct civilizations whose texts researchers still need to represent digitally...

For each symbol, of each system, in each language, Unicode needed to assign a unique numeric entry. The table they produced had more than 7,000 symbols. 7 thousand assigned symbols, spread across roughly two dozen writing systems (a count that has since grown past 160). Impressive, right? A giant leap from what ASCII was.

But it doesn't stop there. It's important to know that the fact that Unicode represented 7 thousand symbols doesn't mean it could only represent 7 thousand symbols. Think with me: ASCII represented 128 symbols because that was its maximum capacity. It used 7 bits, so it only had 128 available slots, and it used them all. If they had used 8 bits but hadn't assigned the remaining numbers to any symbols, they would have had capacity for 256 symbols, but would still be representing only 128.

With that understood, if you found 7 thousand impressive, here's an even more impressive number: unlike ASCII, which used all of its possible combinations without a single one left over, Unicode has room to grow over the years. Its code space has a total capacity of 1,114,112 code points, or approximately 1,100,000 symbols.
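If you'd like to check those capacities yourself, here is a minimal sketch. The function name `capacity` is made up for this illustration; `char::MAX` is the standard library's constant for the highest code point a Rust `char` can hold.

```rust
/// Number of distinct values that `bits` bits can represent: 2^bits.
/// (Hypothetical helper, only valid for bits < 64.)
fn capacity(bits: u32) -> u64 {
    1u64 << bits
}

fn main() {
    // ASCII: 7 bits, every slot used.
    println!("7 bits: {} slots", capacity(7));
    // The "what if" scenario: 8 bits would have offered twice as many slots.
    println!("8 bits: {} slots", capacity(8));

    // The highest Unicode code point is U+10FFFF. Counting from zero,
    // that gives the total number of slots in the Unicode code space.
    let unicode_slots = char::MAX as u32 + 1;
    println!("Unicode: {} slots", unicode_slots);
}
```

Notice the difference between capacity and usage: the last number is how many slots exist, not how many have a symbol assigned.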

And I have one more surprise: that figure of 7 thousand symbols in the table that I mentioned above was when Unicode was published. Back in 1991. Over the years, up to today, it has kept growing more and more, and today it has more than 150 thousand symbols.

Now that's impressive.

What About the Computers That Already Existed?

A project of this scale really is admirable, but if you look closely, there was something they had to think about before launching: in 1991, three decades had passed since ASCII was established. That means the world was full of machines using ASCII. For three decades, computers had been built primarily around ASCII. For 30 years, people had been writing and distributing programs that printed letters or symbols on screen based entirely on ASCII. Decades of text files, programs, databases - all assuming ASCII's 128 symbols in ASCII's positions. If Unicode simply redistributed the numbers differently, all of that would become garbage overnight and would make Unicode's adoption much harder.

So the creators of Unicode made a decision that only seems obvious after you hear it: the first 128 code points of Unicode are exactly the same 128 from ASCII, at the same numbers. The A is 65 in ASCII and remains 65 in Unicode. The Enter remains a control character with number 10. Any text file written since 1963 remained valid without a single change. The past didn't need to be rewritten for the future to work.

From 128 onward, Unicode simply... kept going. The ã got code point 227. The kanji 日 got 26085. The emoji 🌍 got 127757. The same logic as ASCII, extended as far as necessary.
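You don't have to take those numbers on faith. As a small sketch, Rust lets you convert a `char` into its code point with `as u32` (the helper name `code_point` is just for this example):

```rust
/// Returns the Unicode code point of a character as a plain number.
fn code_point(c: char) -> u32 {
    c as u32
}

fn main() {
    // One ASCII letter, then three characters from beyond ASCII.
    for c in "Aã日🌍".chars() {
        println!("{} -> {}", c, code_point(c));
    }
}
```

The `A` comes out as 65, exactly as in ASCII, and the others come out as the code points mentioned above.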

But here a problem emerges that may not be so obvious to you. A problem ASCII never had to face, and to understand it, you need a piece of information that I've deliberately withheld until now.

This whole time, I've been saying ASCII uses 7 bits. And that's true; that's precisely why the table has 128 symbols, because 2⁷ = 128. But I have to tell you: that's not exactly how the computer handles it in practice. In the computer, each character takes up 8 bits. To understand that, you need to understand how the computer stores bits. It doesn't store them one by one; it stores them in groups called bytes.

Going deeper

You're probably thinking right now: if the computer stores ASCII in 8 bits, what happens to the eighth bit? Why doesn't ASCII use all 8 bits, then?

The reason ASCII didn't claim that eighth bit is the same as before: cost. Not the cost of building 8-bit computers, but the cost of what would come next. Even with a 7-bit ASCII, machines already had to handle 8 bits per character, because the eighth bit had a job of its own. If ASCII had taken all 8 bits for symbols, those machines would have needed a ninth.

To see what that job was, remember the context in which ASCII was created. It wasn't made only for computers, but also for telecommunications machines and teletypes. That equipment needed the 8th bit for another purpose: error checking. On regular computers, the eighth bit just stayed zeroed out.
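To make "error checking" a bit more concrete, here is a sketch of one classic scheme, the parity bit. This is an assumption for illustration: the text above doesn't say which scheme any particular machine used, and the function names are hypothetical. The idea is to set the eighth bit so that the total count of 1 bits is always even; if one bit flips along the way, the count becomes odd and the receiver notices.

```rust
/// Sets the eighth bit so that the byte has an even number of 1 bits.
/// Assumes `ascii` is a 7-bit value (0 to 127); any eighth bit is ignored.
fn with_even_parity(ascii: u8) -> u8 {
    let seven_bits = ascii & 0x7F;
    if seven_bits.count_ones() % 2 == 1 {
        seven_bits | 0x80 // odd count: turn the eighth bit on
    } else {
        seven_bits // already even: leave the eighth bit off
    }
}

/// The receiver's check: is the number of 1 bits still even?
fn parity_ok(byte: u8) -> bool {
    byte.count_ones() % 2 == 0
}

fn main() {
    let sent = with_even_parity(b'C');
    println!("sent:      {:08b} ok = {}", sent, parity_ok(sent));

    // Simulate a transmission error by flipping the lowest bit.
    let corrupted = sent ^ 0b0000_0001;
    println!("corrupted: {:08b} ok = {}", corrupted, parity_ok(corrupted));
}
```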

The Computer Doesn't Think in Bits, It Thinks in Bytes

Remember how a computer stores anything? Deep down, everything is billions of bits scattered throughout its components, and the computer needs to be able to find any one of them when it needs to. But how do you locate one specific bit in a sea of billions of bits when you need to know its value?

It's not a trivial question. Imagine you have a notebook with billions of pages, and on one of those pages is a piece of information you need. Without any organization, the only way to find it would be to flip through page by page, which would take forever. The obvious solution is to number the pages. With numbered pages, if someone tells you "it's on page 3,000,000", you go straight there.

IMPORTANT CONCEPT

The computer solves this in exactly the same way: each storage position receives a unique number, which in computing we call an address.

But now comes the question that really matters: how much information goes at each address?

The First Constraint

Think of it this way: if each address pointed to just 1 bit, you'd have the maximum possible precision. It would be fantastic and the ideal solution. Right? Right! Every bit would be individually addressable. But think about the size of the problem: a simple 1-megabyte file contains 8 million bits. To address those 8 million bits individually, you'd need 8 million different addresses. And a 1-terabyte hard drive? That's 8 trillion addresses. Each of those addresses is a number the computer needs to know, manage, and process. The circuitry responsible for that would end up absurdly large, complex, and expensive.

The other extreme doesn't work either. If each address pointed to an enormous quantity of bits - say, 1000 - you really would have very few addresses to manage, and it would be easier to design a computer where each set of 1000 bits has a specific address. However, every time you needed any piece of information, you'd be forced to load all 1000 bits at once, even if you only needed a few of them.
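The arithmetic behind those figures is short enough to sketch. This assumes decimal units (1 megabyte = 1,000,000 bytes), and the names `BITS_PER_BYTE` and `bits_in` are invented for the example.

```rust
const BITS_PER_BYTE: u64 = 8;

/// How many individual bits (and so, how many one-bit addresses)
/// a storage of `bytes` bytes would contain.
fn bits_in(bytes: u64) -> u64 {
    bytes * BITS_PER_BYTE
}

fn main() {
    let megabyte: u64 = 1_000_000;
    let terabyte: u64 = 1_000_000_000_000;

    println!("1 MB -> {} one-bit addresses", bits_in(megabyte));
    println!("1 TB -> {} one-bit addresses", bits_in(terabyte));
}
```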

The ideal was a size that balanced both sides. Small enough not to waste, large enough not to multiply addresses unnecessarily. That size had to exist somewhere between "1 bit" and "1000 bits". But where?

First constraint

Okay, we have the first constraint: balance.

The Second Constraint

The next constraint comes from the very nature of digital circuits. You already know that each bit has two states: 0 or 1. And you already know that when you combine bits, the combinations multiply: 2 bits give 4 combinations, 3 bits give 8, 4 bits give 16, 5 give 32... Each bit you add doubles the total number of combinations.
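The doubling is easy to see in a small sketch (the function name `combinations` is just for illustration):

```rust
/// Number of combinations that `bits` bits can form: 2 multiplied by itself `bits` times.
fn combinations(bits: u32) -> u32 {
    2u32.pow(bits)
}

fn main() {
    // Each extra bit doubles the number on the previous line.
    for bits in 1..=8 {
        println!("{} bits -> {} combinations", bits, combinations(bits));
    }
}
```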

Physical circuits are built following exactly that structure. Each layer of the circuit doubles the capacity of the previous layer, following the same multiply-by-2 pattern. That means the circuit works efficiently and cleanly when the number of bits it manages is a power of 2. If you tried to build a circuit for a quantity of bits that wasn't a power of 2 - like 6 or 10, for example - you'd break that naturally occurring doubling pattern, and the hardware would need extra, irregular layers to compensate. It works, but it's more complex, more expensive, and with no benefit whatsoever.

Second constraint

We have the second constraint too: the number of bits we carry at each address has to be a power of 2 (2, 4, 8, 16, 32...).

In early computers, 8 bits proved sufficient to represent a character (an ASCII symbol, for instance), a number from 0 to 255, or a basic instruction, which covered most of the era's needs. 8-bit computers were cheaper to manufacture than 16-bit ones, and the cost of every bit was a significant factor at the time. Consequently, the first successful personal computers, like the Apple II and the Commodore 64, were built to work with 8 bits at a time. That created a cascade effect: programs, operating systems, programming languages, and peripherals (the devices you connect to a computer) were all written and manufactured with 8 bits in mind, making it harder and harder to change that number. Once the whole world was building around it, the 8-bit byte became the standard.

So we arrive at an ideal number of bits for each address: 8 bits.
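You can observe this from inside a Rust program. The sketch below uses `u8`, Rust's 8-bit number type, and looks at the addresses of two neighbouring bytes. The helper `address_step` is hypothetical, and the addresses themselves will differ from run to run; what stays the same is the distance between them.

```rust
/// Distance, in addresses, between the first two bytes of a slice.
/// Panics if the slice has fewer than two elements.
fn address_step(bytes: &[u8]) -> usize {
    let first = &bytes[0] as *const u8 as usize;
    let second = &bytes[1] as *const u8 as usize;
    second - first
}

fn main() {
    // A byte holds 8 bits, so its values run from 0 to 255.
    println!("smallest byte value: {}", u8::MIN);
    println!("largest byte value:  {}", u8::MAX);

    let data: [u8; 3] = [72, 105, 33];
    // Neighbouring bytes sit at neighbouring addresses: one address per byte.
    println!("address step: {}", address_step(&data));
}
```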

Apple II (1977), one of the first successful personal computers. Photo: Rama, CC BY-SA 2.0 FR, via Wikimedia Commons.

Commodore 64 (1982), the best-selling personal computer in history. Photo: Evan-Amos, public domain, via Wikimedia Commons.

How Unicode Characters Are Represented

But do you remember how we got here and why you learned that the computer stores information in bytes? If not, no problem. Let me remind you:

From 128 onward, Unicode simply... kept going. The ã got code point 227. The kanji 日 got 26085. The emoji 🌍 got 127757. The same logic as ASCII, extended as far as necessary.
But here a problem emerges that may not be so obvious to you. A problem ASCII never had to face, and to understand it, you need a piece of information that I've deliberately withheld until now.

The information I had withheld was precisely that each ASCII symbol uses 8 bits (1 byte) to be stored, not 7, like you probably thought. Both ASCII and Unicode represent their symbols in bytes.

Well then, so what's the problem Unicode would need to solve that ASCII didn't?

It was precisely HOW to store its symbols in bytes. 1 byte could only represent 256 symbols. The ã is 227, which fits comfortably in one byte. But the Japanese kanji 日 is code point 26085 - that number doesn't fit in a single byte (which only goes up to 255). And the emoji 🌍, which is code point 127757? Not even close.
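Here is the same check as a sketch in code. The function name `fits_in_one_byte` is made up for the example; `u8::MAX` is the largest value a single byte can hold.

```rust
/// Does this code point fit in a single byte (0 to 255)?
fn fits_in_one_byte(code_point: u32) -> bool {
    code_point <= u8::MAX as u32
}

fn main() {
    for c in "ã日🌍".chars() {
        let code_point = c as u32;
        println!(
            "{} (code point {}) fits in one byte: {}",
            c,
            code_point,
            fits_in_one_byte(code_point)
        );
    }
}
```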

Unicode solves the problem of the what: it defines which number corresponds to which character. But a new problem arises: the how.

So, how do you store those numbers in bytes? There are a few different ways to do it, and each one has a name. These ways are called encodings, and the most used one in the world (used by roughly 98% of all websites) is UTF-8.

UTF-8: The Elegant Solution

The simplest solution would be to always use 4 bytes per character, since 4 bytes is enough to represent any Unicode code point. It would work. But it would have an enormous cost - the same one behind the reason we don't just use 1000 bits per address: you'd be wasting space. A letter that used to take up 1 byte would now take up 4. English text files would get 4 times bigger, and would take 4 times as long to send over the internet.
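To put a number on that waste, here is a rough sketch comparing a short English sentence stored at 1 byte per letter against the same sentence at a fixed 4 bytes per character. The function `fixed_width_size` is hypothetical and only does the multiplication; it doesn't actually encode anything.

```rust
/// Size in bytes if every character always took 4 bytes.
fn fixed_width_size(text: &str) -> usize {
    text.chars().count() * 4
}

fn main() {
    let text = "Hello, Rust!";
    // For pure ASCII text, len() is the size at 1 byte per letter.
    println!("1 byte per letter:     {} bytes", text.len());
    println!("4 bytes per character: {} bytes", fixed_width_size(text));
}
```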

UTF-8 does something smarter: it uses a variable number of bytes depending on the size of the code point.

Characters with code points from 0 to 127 (the same ones as ASCII) take up 1 byte.
Characters with code points from 128 to 2047, which include most of the accented letters in Portuguese, Spanish, French, German, and other European alphabets, take up 2 bytes.
Characters with code points from 2048 to 65535, which include practically all the characters of Japanese, Korean, and Chinese, take up 3 bytes.
The rest (emojis, hieroglyphs, and rarer characters) take up 4 bytes.

An easier way to visualize it

1 Byte: Standard ASCII characters (values 0-127).
2 Bytes: Letters with diacritics, additional Latin characters, Arabic, Hebrew, Cyrillic.
3 Bytes: Asian characters (Korean, Japanese, Chinese...) and other characters from the Basic Multilingual Plane.
4 Bytes: Emojis, rare mathematical symbols, and historical characters.
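The ranges above translate almost word for word into code. In this sketch, `utf8_len` is a hand-written helper (not part of Rust) that applies the four ranges, and the loop compares it with `len_utf8`, the standard library method that reports how many bytes a `char` takes in UTF-8.

```rust
/// How many bytes UTF-8 uses for a given code point, following the ranges above.
/// Assumes `code_point` is a valid Unicode code point.
fn utf8_len(code_point: u32) -> usize {
    match code_point {
        0..=127 => 1,
        128..=2047 => 2,
        2048..=65535 => 3,
        _ => 4,
    }
}

fn main() {
    for c in "Aã日🌍".chars() {
        let by_hand = utf8_len(c as u32);
        let by_rust = c.len_utf8();
        println!("{}: {} byte(s) by hand, {} according to Rust", c, by_hand, by_rust);
    }
}
```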

The result is that a pure English text in UTF-8 takes up exactly the same space it took up in ASCII. A text in Portuguese takes up a little more. A text in Japanese takes up even more. Each language uses only what it needs, nothing more.
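A quick sketch to see that in numbers: the helper below (a made-up name, `chars_and_bytes`) counts how many characters a text has and how many bytes it takes up in UTF-8.

```rust
/// Returns (number of characters, number of bytes in UTF-8).
fn chars_and_bytes(text: &str) -> (usize, usize) {
    (text.chars().count(), text.len())
}

fn main() {
    for text in ["Rust", "Olá", "日本"] {
        let (chars, bytes) = chars_and_bytes(text);
        println!("{}: {} characters, {} bytes", text, chars, bytes);
    }
}
```

In the English word, characters and bytes match. In the other two, the byte count pulls ahead.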

The compatibility masterstroke

The most important consequence of UTF-8 using 1 byte for the first 128 code points is this: any text file created in ASCII is automatically a valid UTF-8 file. The bytes are identical. A program that reads UTF-8 can read an ASCII file without any conversion, and vice versa for texts that only use those characters.
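Here is a small sketch of that compatibility. The bytes below are the word "Rust" exactly as an ASCII file would store it, and `std::str::from_utf8` is the standard library function that checks whether a sequence of bytes is valid UTF-8. The wrapper `decode` is just a convenience for the example.

```rust
/// Tries to read raw bytes as UTF-8 text; returns None if they aren't valid UTF-8.
fn decode(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // R = 82, u = 117, s = 115, t = 116 in the ASCII table.
    let ascii_bytes: [u8; 4] = [82, 117, 115, 116];

    match decode(&ascii_bytes) {
        Some(text) => println!("valid UTF-8: {}", text),
        None => println!("not valid UTF-8"),
    }
}
```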

It's no exaggeration to say this design decision was the main reason UTF-8 won. In a world where millions (or billions) of files already existed in ASCII, an encoding that broke compatibility would have met enormous resistance. UTF-8 didn't ask anyone to migrate anything; it simply worked with what already existed.

Are Unicode and UTF-8 the Same Thing?

No! And this confusion is extremely common - even among experienced programmers.

Unicode is the agreement: the table that defines which number corresponds to which character. It's an abstract specification. Unicode itself says nothing about bytes.

UTF-8 is one of the ways to turn those numbers into bytes so they can be stored in a file or transmitted over the internet. There are other encodings that also use the Unicode table, like UTF-16, which uses 2 or 4 bytes, and UTF-32, which always uses 4 for everything, but UTF-8 is by far the most common.
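To compare the three side by side, here is a sketch that computes how many bytes the same text would take in each encoding. The function `encoded_sizes` is hypothetical; `encode_utf16` is a standard library method that yields the 2-byte units UTF-16 would use.

```rust
/// Returns the size in bytes of `text` in (UTF-8, UTF-16, UTF-32).
fn encoded_sizes(text: &str) -> (usize, usize, usize) {
    let utf8 = text.len(); // Rust strings are already UTF-8
    let utf16 = text.encode_utf16().count() * 2; // each UTF-16 unit is 2 bytes
    let utf32 = text.chars().count() * 4; // always 4 bytes per character
    (utf8, utf16, utf32)
}

fn main() {
    for text in ["A", "日", "🌍"] {
        let (utf8, utf16, utf32) = encoded_sizes(text);
        println!("{}: UTF-8 = {}, UTF-16 = {}, UTF-32 = {}", text, utf8, utf16, utf32);
    }
}
```

Same characters, same code points, different numbers of bytes: that's the difference between the table (Unicode) and the encodings.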

Why Does This Matter to You as a Programmer?

When you start programming in Rust, one of the most common things your program will manipulate is text (words, sentences, or any sequence of characters).

In programming, that kind of data is called a string. A string is simply a piece of text stored inside a program, like "Olá", "Rust", or "123".

For a computer to be able to store and understand text, it needs to use an encoding system, and Rust has a very clear stance on encoding: its standard string types, String and &str, are always UTF-8. There is no other option for them.

This means you'll never have to worry about "remembering to use UTF-8"; it simply is the default. But it also means that when you manipulate text in Rust, you'll run into behaviors that only make sense if you understand what just happened here.

For example: in Rust, you can't access the third character of a string the same way you access a list of numbers, and the reason is precisely what you've learned: different characters take up different numbers of bytes. The "third character" might start at the third byte, or the fifth, or the seventh. It depends on which characters came before it. Rust forces you to think about this explicitly, instead of sweeping the problem under the rug.
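If you're curious, here is a preview as a sketch. It uses `char_indices`, a standard library method that walks through a string and reports the byte position where each character starts. The helper `third_char_start` is invented for this example. Keep in mind that positions are counted from 0, so position 2 is the third byte.

```rust
/// Byte position where the third character of `text` starts,
/// or None if the text has fewer than three characters.
fn third_char_start(text: &str) -> Option<usize> {
    text.char_indices().nth(2).map(|(byte_index, _)| byte_index)
}

fn main() {
    // Writing text[2] would not compile: Rust strings can't be indexed by a number.
    for text in ["abc", "ééc", "日本c"] {
        println!(
            "{}: third character is {:?}, starting at byte position {:?}",
            text,
            text.chars().nth(2),
            third_char_start(text)
        );
    }
}
```

The third character is always the same `c`, but where it starts depends on how many bytes the two characters before it took.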

You'll understand this in practice when you get there. For now, what matters is that you already have the conceptual foundation for it to make sense when it happens.

Before moving on

Can you explain the difference between ASCII and Unicode? Between Unicode and UTF-8? And can you say why UTF-8 was the encoding the world adopted, instead of simply using 4 bytes for every single character?

If so, you're ready for the next section.
