Unicode and UTF-8
There is a lot of confusion around Unicode and UTF-8; for a very long time I could not figure out what the difference between these two things was.
A character set, also known as a charset, defines a mapping from every specific word or character in this world to a sequential number.
There have been several character sets in this world:

- Universal Character Set (UCS): obsolete
- Unicode: the currently prominent one
- ASCII: ancient, but still influencing today's character sets
But what on Earth is a character set? Just as I mentioned at the beginning of this chapter, a character set simply defines a map between sequential numbers and the words and characters of this world.
In ancient times, ASCII was first introduced at a meeting of the American Standards Association's (ASA) X3.2 subcommittee, to include all alphabetic and digit characters, as well as some other control characters.
ASCII unified the character sets of its time, and even today it still influences new-generation character sets like Unicode. ASCII needs only 7 bits to represent a character:
ASCII character set table
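As a quick check (a small Python sketch of my own, using only the built-in ord()), every ASCII code point indeed fits into 7 bits:

```python
# Every ASCII character has a code point in 0..127,
# so 7 bits are enough to represent it.
for ch in ["A", "z", "0", "\n"]:
    code = ord(ch)
    print(f"{ch!r}: code point {code}, binary {code:07b}")
# 'A': code point 65, binary 1000001
# 'z': code point 122, binary 1111010
# '0': code point 48, binary 0110000
# '\n': code point 10, binary 0001010
```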
But as you can imagine, the ASCII character set cannot satisfy today's need for far more words and characters, so we have to amplify it and fill more characters into a new-generation character set.
Unicode is the prominent, new-generation character set of today. It tries very hard to include almost all the characters and words in this world, and thanks to this versatile representation capability, more and more companies and organizations have adopted it as their main character set. Unicode not only keeps the whole ASCII character set at its original positions for better compatibility with the legacy charset, but also extends to many more characters, such as Chinese, Japanese, Indian, and other scripts. A Unicode code point needs at most 21 bits to represent, which fits into 3 bytes.
Unicode character set table
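To make this concrete, here is a small illustrative Python sketch printing the code points of a few characters; note that ASCII's A keeps its old position, while other scripts live at higher code points:

```python
# Unicode assigns a unique code point to each character.
# ASCII characters keep their original positions.
for ch in ["A", "¢", "€", "中"]:
    print(f"{ch}: U+{ord(ch):04X} ({ord(ch).bit_length()} bits)")
# A: U+0041 (7 bits)
# ¢: U+00A2 (8 bits)
# €: U+20AC (14 bits)
# 中: U+4E2D (15 bits)
```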
In this chapter I will only talk about Unicode, for simplicity.

After defining all those available characters as a set, what do we do for information transmission? Of course you could throw those 3 bytes per character onto the network or the hard disk as-is, but there is a better scheme.
You may find that alphabets and digits are used much more frequently than other, rarer characters. Huffman coding already gives us a great precedent: we can reduce storage cost and transmission burden by giving different characters encodings of disparate lengths.
For instance, since A is used so much more frequently than a Chinese character, we could encode A in one byte (8 bits), but encode a Chinese character in three or even four bytes.
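Here is a quick Python demonstration of exactly this variable cost (using the standard encode() method; the characters are just examples I picked):

```python
# Frequent ASCII letters cost one byte in UTF-8, while characters
# from other scripts cost two, three, or four bytes.
for ch in ["A", "¢", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
# A: 1 byte(s) -> 41
# ¢: 2 byte(s) -> c2 a2
# 中: 3 byte(s) -> e4 b8 ad
# 😀: 4 byte(s) -> f0 9f 98 80
```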
That is how and why character encodings work. UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. UTF-8 stays true to this idea: it spends only one byte on the frequent ASCII characters, while spending more bytes on Unicode's larger, less frequently used characters:
| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Using this binary code-point layout table together with the Unicode character set, we now get a better-performing encoding system:
UTF-8 mapping table
| Glyph | Unicode | Binary code point | Binary UTF-8 | Hexadecimal UTF-8 |
|---|---|---|---|---|
| ¢ | U+00A2 | 000 10100010 | 11000010 10100010 | C2 A2 |
| € | U+20AC | 00100000 10101100 | 11100010 10000010 10101100 | E2 82 AC |
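To see the table in action, here is a minimal toy encoder in Python (an illustrative sketch following the layout above, not production code), checked against the built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8, following the layout table."""
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ["¢", "€"]:
    mine = utf8_encode(ord(ch))
    assert mine == ch.encode("utf-8")   # agrees with the built-in encoder
    print(f"{ch}: {mine.hex(' ').upper()}")
# ¢: C2 A2
# €: E2 82 AC
```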
UTF-16 is a mostly fixed-width encoding: for any character in Unicode's Basic Multilingual Plane, the code point is literally copied into the UTF-16 code unit (characters beyond U+FFFF take a four-byte surrogate pair).
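A short Python sketch to illustrate (again just my own example): for BMP characters, the big-endian UTF-16 bytes are exactly the code point:

```python
# For Basic Multilingual Plane characters, big-endian UTF-16
# bytes equal the code point itself; beyond U+FFFF a four-byte
# surrogate pair is used.
for ch in ["A", "€", "😀"]:
    print(f"{ch}: U+{ord(ch):04X} -> {ch.encode('utf-16-be').hex(' ')}")
# A: U+0041 -> 00 41
# €: U+20AC -> 20 ac
# 😀: U+1F600 -> d8 3d de 00
```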
I will skip the details here, since UTF-16 is much simpler than UTF-8.
Do you now understand the disparity between Unicode and UTF-8?
You may still want to ask yourself: what about the other encodings?