Unicode and UTF-8
There are many discrepancy towards unicode and UTF=8, for very long time I did tno figure out what the difference between thiese two things.
character set
Also known as char set, defined some specific word or character on this world into sequential number.
There ever have some sort of char set in this world:
Universal Character Set(UCS)
Which is obsoletedUnicode
This is currently prominentASCII
Ancient but still affect currentcharset
But what on Earth is the word char set?
Just as I have mentioned at the begining of this chapter, char set is just to define a map in which sequential number and any word or character in this world.
ASCII
At ancient time, ASCII was first introduced on the meeting of the American Standards Association’s (ASA) X3.2 subcommittee, to include all alphabetic character and digit character, as well as some other control characters. ASCII unified the character set at that time, even today, ASCII still affecting our new generation character set like Unicode.
ASCII use 7 bits to represent 128 characters.
sample ASCII character set table
| Binary | Oct | Dec | Hex | Glyph |
|---|---|---|---|---|
| 010 0100 | 044 | 36 | 24 | $ |
| 100 0000 | 100 | 64 | 40 | @ |
| 100 0001 | 101 | 65 | 41 | A |
| 110 0001 | 141 | 97 | 61 | a |
| 011 0000 | 060 | 48 | 30 | 0 |
But as you can also image, ASCII character set can not fit current requirement for much larger words and characters, thus we have to emplify and fill more character into a new generation character set.
UNICODE
Unicode is a prominent and new generation character set for today, it try alot hard to include almost all characters and words in this world, since its versatile representation capability, more companies and organization keen on it as their main character set.
Just like ASCII, Unicode not only provide all ASCII character set with its original position for better compatibility with legacy charset, but also extends many more characters like Chinese, Japanese, Indian and others.
Unicode use 24 bits to represent, that is almost 3 bytes.
sample UNICODE character set table
| Dec | Hex | Glyph |
|---|---|---|
| 36 | U+0024 | $ |
| 163 | U+00A2 | ¢ |
| 8364 | U+20AC | € |
character encoding
In this chapter I just want to talk about encoding with Unicode for simplicity.
After defining all those available characters as a set, what I do in information transmission? Of course you can throw those 3 three bytes onto network or hard disk, but there could have a better project.
You may find that alphabets and digits are used much more frequently than other rare characters, since Huffman Coding already give us a great practice, we can reduce cost or transmission burden by shorten encoding of different character for disparate length.
For instance, since A is used so frequently than Chinese character, we could encoding A in one byte or 8 bits, but encoding Chinese in 4 bytes.
That is how and why character encoding works:
UTF-8
UTF-8 is a variable-width encoding that can represent every character in the Unicode character set.
For original ASCII character, UTF-8 obey its originality, use only one byte to shorten encoding.
While spend more byte for unicode’s larger and less frequently used character.
| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|---|---|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | |||||
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | ||||
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |||
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | ||
| 26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
So use this binary code point table along with Unicode character set, we can now have better performed encoding system:
sample UTF-8 mapping table
| Glyph | unicode | Binary code point | Binary UTF-8 | Hexadecimal UTF-8 |
|---|---|---|---|---|
| $ | U+0024 | 0100100 | 00100100 | 24 |
| ¢ | U+00A2 | 000 10100010 | 11000010 | 10100010 |
| € | U+20AC | 00100000 10101100 | 11100010 10000010 10101100 | E2 82 AC |
Some languages
| Language | Range |
|---|---|
| Chinese | u4e00-u9fa5 |
| Korean | x3130-x318F |
| Korean | xAC00-xD7A3 |
| Japaness | u0800-u4e00 |
UTF-16
UTF-16 is a fix-width encoding, any character set from Unicode is literally copied onto UTF-16 mapping table.
I will skip this paragraph since this is much simpler than UTF-8.
Now
Do you understand what the disparity between Unicode and UTF-8?
You need to ask yourself, what about the other charset and encoding?