Unicode and UTF-8

There are many discrepancy towards unicode and UTF=8, for very long time I did tno figure out what the difference between thiese two things.

character set

Also known as char set, defined some specific word or character on this world into sequential number.
There ever have some sort of char set in this world:

Universal Character Set(UCS)
Which is obsoleted
Unicode
This is currently prominent
ASCII
Ancient but still affect current charset

But what on Earth is the word char set?
Just as I have mentioned at the begining of this chapter, char set is just to define a map in which sequential number and any word or character in this world.

ASCII

At ancient time, ASCII was first introduced on the meeting of the American Standards Association’s (ASA) X3.2 subcommittee, to include all alphabetic character and digit character, as well as some other control characters. ASCII unified the character set at that time, even today, ASCII still affecting our new generation character set like Unicode.
ASCII use 7 bits to represent 128 characters.

sample `ASCII` character set table

Binary	Oct	Dec	Hex	Glyph
010 0100	044	36	24	$
100 0000	100	64	40	@
100 0001	101	65	41	A
110 0001	141	97	61	a
011 0000	060	48	30	0

But as you can also image, ASCII character set can not fit current requirement for much larger words and characters, thus we have to emplify and fill more character into a new generation character set.

UNICODE

Unicode is a prominent and new generation character set for today, it try alot hard to include almost all characters and words in this world, since its versatile representation capability, more companies and organization keen on it as their main character set.

Just like ASCII, Unicode not only provide all ASCII character set with its original position for better compatibility with legacy charset, but also extends many more characters like Chinese, Japanese, Indian and others.
Unicode use 24 bits to represent, that is almost 3 bytes.

sample `UNICODE` character set table

Dec	Hex	Glyph
36	U+0024	$
163	U+00A2	¢
8364	U+20AC	€

character encoding

In this chapter I just want to talk about encoding with Unicode for simplicity.

After defining all those available characters as a set, what I do in information transmission? Of course you can throw those 3 three bytes onto network or hard disk, but there could have a better project.
You may find that alphabets and digits are used much more frequently than other rare characters, since Huffman Coding already give us a great practice, we can reduce cost or transmission burden by shorten encoding of different character for disparate length.
For instance, since A is used so frequently than Chinese character, we could encoding A in one byte or 8 bits, but encoding Chinese in 4 bytes.

That is how and why character encoding works:

UTF-8

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set.
For original ASCII character, UTF-8 obey its originality, use only one byte to shorten encoding.
While spend more byte for unicode’s larger and less frequently used character.

Bits of code point	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	0xxxxxxx
11	U+0080	U+07FF	2	110xxxxx	10xxxxxx
16	U+0800	U+FFFF	3	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000	U+1FFFFF	4	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
26	U+200000	U+3FFFFFF	5	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
31	U+4000000	U+7FFFFFFF	6	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

So use this binary code point table along with Unicode character set, we can now have better performed encoding system:

sample `UTF-8` mapping table

Glyph	unicode	Binary code point	Binary UTF-8	Hexadecimal UTF-8
$	U+0024	0100100	00100100	24
¢	U+00A2	000 10100010	11000010	10100010
€	U+20AC	00100000 10101100	11100010 10000010 10101100	E2 82 AC

Some languages

Language	Range
Chinese	u4e00-u9fa5
Korean	x3130-x318F
Korean	xAC00-xD7A3
Japaness	u0800-u4e00

UTF-16

UTF-16 is a fix-width encoding, any character set from Unicode is literally copied onto UTF-16 mapping table.
I will skip this paragraph since this is much simpler than UTF-8.

Now

Do you understand what the disparity between Unicode and UTF-8?
You need to ask yourself, what about the other charset and encoding?

development

#charset

Unicode and UTF-8

https://rug.al/2014/2014-07-17-unicode-and-utf8/

Author

Rugal Bernstein

Posted on

July 17, 2014

Licensed under

logistic regression Previous

linear regression Next

Unicode and UTF-8

character set

ASCII

sample ASCII character set table

UNICODE

sample UNICODE character set table

character encoding

UTF-8

sample UTF-8 mapping table

Some languages

UTF-16

Now

sample `ASCII` character set table

sample `UNICODE` character set table

sample `UTF-8` mapping table