Encoding schemes are ways to store and retrieve characters in computers. For example, in ISO-8859-1 (one of many encoding schemes), we use 01100001 to denote a and 00111111 to denote ?. Because computers only know how to store and fetch 0s and 1s, we need a way to map those bits to other characters such as letters, or even Chinese characters.
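ISO-8859-1 is also called Latin-1, and it is now a legacy encoding scheme. It uses exactly one byte per character, so it can represent at most 256 characters: the 128 ASCII characters plus letters and symbols used in Western European languages.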
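As a quick check of the bit patterns quoted above, here is a minimal Python sketch. The helper name to_bits is made up for this illustration; it only relies on the built-in str.encode and format.

```python
def to_bits(text, encoding="iso-8859-1"):
    """Return the encoded bytes of text as space-separated 8-bit groups.

    to_bits is a hypothetical helper written for this illustration.
    """
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


print(to_bits("a"))  # 01100001
print(to_bits("?"))  # 00111111

# Every one of the 256 possible byte values decodes to exactly one character.
all_chars = bytes(range(256)).decode("iso-8859-1")
print(len(all_chars))  # 256
```

One byte has 256 possible values, which is why a one-byte-per-character scheme can never go beyond 256 characters.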
For example, let’s write a text file with the following contents:
Let’s see how the computer stores it. Run the command xxd -b resources/test/ASCII.txt to display the bits the computer actually stores.
We can see that & was stored as 00100110 and 1 was stored as 00110001. This is exactly how ISO-8859-1 processes characters: it maps each character to a unique byte. Refer to the code page displayed in Wikipedia to see the map. Because ISO-8859-1 doesn’t have a way to store Chinese characters, we can see that 中文 was mapped to 00111111 00111111, which are just two question marks in ISO-8859-1.
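The same behaviour can be reproduced in Python. This is only a sketch: the sample string below is an assumption (it is not the actual file), and errors="replace" stands in for whatever the editor did when it saved characters that ISO-8859-1 cannot represent.

```python
# Sample text chosen for illustration; not the real contents of the file.
text = "&1中文"

# With errors="replace", unencodable characters become a question mark.
encoded = text.encode("iso-8859-1", errors="replace")
bits = " ".join(format(byte, "08b") for byte in encoded)
print(encoded)  # b'&1??'
print(bits)     # 00100110 00110001 00111111 00111111

# Without a fallback, Python refuses to encode the Chinese characters at all.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```

Once the question marks are written, the original characters are gone for good: decoding the bytes gives back &1??, not &1中文.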
You may notice that ISO-8859-1 can only store 256 characters, which is not enough. How can we store Chinese, which has far more than 256 characters? This is where Unicode comes in.
Unicode is a computing industry standard with room for over a million characters. Basically it’s just a huge map that assigns every character in the world a number, called a code point. Code points range from 0 to 0x10FFFF, so the number of bytes needed to write one down varies from character to character.
For example, A is mapped to 0x41, w is mapped to 0x77, and 中 is mapped to 0x4E2D. Every character has its own number. You can find the whole mapping table here.
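In Python, the built-in ord and chr functions expose this mapping directly, so the numbers are easy to verify. A small sketch (the variable name code_points is just for illustration):

```python
# ord() gives the Unicode number of a character; chr() goes the other way.
code_points = {ch: ord(ch) for ch in "Aw中"}

for ch, number in code_points.items():
    print(ch, hex(number))
# A 0x41
# w 0x77
# 中 0x4e2d

print(chr(0x4E2D))  # 中
```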
Unicode seems to have solved the problem, right? So why do we need UTF-8, and what is UTF-8? To answer that, we first need to check whether Unicode numbers can be stored directly.
Say I want to store AA on my disk. In Unicode, A is mapped to 0x41, so all I need to do is store 0x4141 in my computer, right? No, that’s not going to work. How would we know what 0x4141 means when we try to decode it? Is it AA, or a single character whose number is exactly 0x4141? Because Unicode numbers vary in length, you can never know where one character ends and the next begins if you store them directly on disk.
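The ambiguity is easy to demonstrate. The sketch below takes the two raw bytes 0x41 0x41 and reads them in both ways; neither reading is "wrong", which is exactly the problem.

```python
raw = bytes([0x41, 0x41])

# Reading 1: every byte is its own character.
as_two_chars = "".join(chr(byte) for byte in raw)

# Reading 2: both bytes together form one big-endian number, i.e. one character.
as_one_char = chr(int.from_bytes(raw, "big"))

print(as_two_chars)           # AA
print(hex(ord(as_one_char)))  # 0x4141
print(as_two_chars == as_one_char)  # False
```

The bytes on disk are identical in both cases, so without extra rules a decoder has no way to choose between the two readings.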
How can we solve the problem? The simplest method is to store every character in 4 bytes: if a character’s number is shorter than 4 bytes, pad it on the left with zeros, so A becomes 0x00000041 instead of just 0x41. That works (this fixed-width scheme is known as UTF-32), but since the most common characters need far fewer than 4 bytes, it wastes a lot of space. This is where UTF-8 comes in.
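Python ships this fixed-width scheme as the utf-32-be codec (big-endian, no byte order mark), so the padding and the wasted space can be seen directly. A minimal sketch; split_fixed is a hypothetical helper, not a library function:

```python
def split_fixed(data, width=4):
    """Decode fixed-width data by cutting it into width-byte numbers.

    split_fixed is a made-up helper for this illustration.
    """
    return [chr(int.from_bytes(data[i:i + width], "big"))
            for i in range(0, len(data), width)]


padded = "A".encode("utf-32-be")
print(padded.hex())  # 00000041

text = "AA"
fixed_size = len(text.encode("utf-32-be"))
utf8_size = len(text.encode("utf-8"))
print(fixed_size, utf8_size)  # 8 2

# The boundaries are now unambiguous: always cut every 4 bytes.
print(split_fixed(text.encode("utf-32-be")))  # ['A', 'A']
```

For plain English text, the fixed-width form is four times larger than the UTF-8 form.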
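In UTF-8, the first 128 characters, which are the ASCII characters, take only 1 byte each. For those characters, the first bit of the byte is 0. When we need to denote a character outside that range, like 中, whose Unicode number 0x4E2D needs 2 bytes, UTF-8 spreads the number over several bytes and sets the first bit of each of them to 1. The leading bits of the first byte say how many bytes the character occupies, and the remaining bits carry the Unicode number as laid out in the Unicode Standard, so 中 ends up as the 3 bytes 0xE4 0xB8 0xAD. More details can be seen here.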
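To make the bit layout concrete, here is a hand-written encoder sketch compared against Python’s built-in codec. The function utf8_encode is written only for this illustration and skips the validity checks a real encoder performs (for example, it does not reject surrogate code points).

```python
def utf8_encode(code_point):
    """Encode one Unicode number as UTF-8 bytes (illustrative sketch only)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x4E2D)  # 中
print(encoded.hex())  # e4b8ad
print(" ".join(format(b, "08b") for b in encoded))
# 11100100 10111000 10101101

# The sketch agrees with the built-in codec on these samples.
print(encoded == "中".encode("utf-8"))  # True
print(utf8_encode(0x41) == "A".encode("utf-8"))  # True
```

The first byte starts with 1110, which announces a 3-byte character, and each following byte starts with 10, which marks it as a continuation.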
So we can say that Unicode is just a standard, and UTF-8 is one way to implement it: UTF-8 specifies in detail how to store Unicode numbers on disk.
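To close the loop on the boundary problem from earlier, the sketch below stores AA中 as UTF-8 and then finds the character boundaries again by looking only at the leading bits of each first byte. The helper split_utf8 is hypothetical and assumes well-formed input.

```python
def split_utf8(data):
    """Cut UTF-8 bytes into one chunk per character (assumes valid input)."""
    chunks = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:              # 0xxxxxxx: single byte
            length = 1
        elif first >> 5 == 0b110:     # 110xxxxx: 2-byte character
            length = 2
        elif first >> 4 == 0b1110:    # 1110xxxx: 3-byte character
            length = 3
        elif first >> 3 == 0b11110:   # 11110xxx: 4-byte character
            length = 4
        else:
            raise ValueError("unexpected continuation or invalid byte")
        chunks.append(data[i:i + length])
        i += length
    return chunks


stored = "AA中".encode("utf-8")
chunks = split_utf8(stored)
print(stored.hex())  # 4141e4b8ad
print(chunks)        # [b'A', b'A', b'\xe4\xb8\xad']
print([chunk.decode("utf-8") for chunk in chunks])  # ['A', 'A', '中']
```

Here the two 0x41 bytes can only mean AA, because a byte that starts with 0 is always a complete character on its own.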