Gnutella Forums - View Single Post

bmk · #25 · March 10th, 2002

As to why the new protocol should extend to UNICODE, and why implement this using UTF-8:

UNICODE aspires to define all characters of all languages. Right now, a 2-byte address space (65,536 possible character numbers) has been defined to cover most languages. This is being extended to 4 bytes, but let's keep it at 2 bytes for now.

UTF (more correctly UTF-8) and UCS are two ways to express the 2-byte number of a character (I skip the 4-byte UNICODE here). UCS simply stores the number in 2 bytes, so it may contain null bytes. Normally, when people talk about UNICODE, they are referring to UCS-2 (= 2 bytes) as the way of expressing it.

UTF or more correctly UTF-8 uses 1, 2 or 3 bytes to express the 2byte number for a UNICODE character. Null bytes do not occur. This works as follows:
<table border=1 cols=4>
<tr><td>UNICODE character number range (in hex)</td><td>UTF byte 1 (in binary)</td><td>UTF byte 2 (in binary)</td><td>UTF byte 3 (in binary)</td></tr>
<tr><td>0000 - 007f</td><td>0xxxxxxx</td><td>(none)</td><td>(none)</td></tr>
<tr><td>0080 - 07ff</td><td>110xxxxx</td><td>10xxxxxx</td><td>(none)</td></tr>
<tr><td>0800 - ffff</td><td>1110xxxx</td><td>10xxxxxx</td><td>10xxxxxx</td></tr>
</table>
<i><font color=red>UTF can also have 4 bytes and, using the same scheme, express a character number up to U+10ffff. That won't be relevant right now, but may be in the future. Provision should be made for upward compatibility with possible 4-byte UTF code sequences.</font></i>
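The bit layout in the table can be written out by hand. The following is only a minimal sketch in Python, limited to the 2-byte range discussed here; the function name `utf8_encode_bmp` is made up for this illustration and is not part of any client or of the protocol.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one UNICODE character number (0000 - ffff) as UTF-8 by hand.

    Hypothetical helper for illustration; it follows the three rows of
    the table above and ignores 4-byte sequences.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only the 2-byte UNICODE range is handled here")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate numbers are not characters and have no UTF-8 form.
        raise ValueError("surrogate numbers cannot be encoded")

    if code_point <= 0x7F:
        # 0xxxxxxx: identical to ASCII
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for ch in ("A", "ä", "€"):
        encoded = utf8_encode_bmp(ord(ch))
        # Cross-check against Python's built-in encoder.
        assert encoded == ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Python's own `str.encode("utf-8")` does the same job; the hand-written version is only there to make the bit shifting visible.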

The first byte of a multi-byte UTF sequence gives the sequence length as the number of leading 1-bits before the first 0-bit. The following 1 or 2 bytes are easily recognizable as belonging to a UTF sequence by their two highest bits (10), which puts their value between 80 and BF. The bits marked 'x' here carry the number of the character in the UNICODE table.
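A receiver could use these bit patterns to tell how many bytes belong to one character. This is a rough sketch under that assumption; `sequence_length` and `is_continuation` are names invented for the example.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (hex 80 to BF)."""
    return byte & 0xC0 == 0x80


def sequence_length(first_byte: int) -> int:
    """Return the length of the UTF-8 sequence started by first_byte.

    Hypothetical helper: it counts the leading 1-bits of the first byte.
    """
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110xxx, the 4-byte form
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")


if __name__ == "__main__":
    data = "aä€".encode("utf-8")
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        # Every byte after the first must be a continuation byte.
        assert all(is_continuation(b) for b in data[i + 1:i + n])
        print(data[i:i + n].hex(" "), "->", n, "byte(s)")
        i += n
```

Because continuation bytes can never be mistaken for a first byte, a reader that starts in the middle of a sequence can skip forward to the next character boundary.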

Thus, a UTF character of 1 byte length has exactly the same number as the corresponding ASCII character. However, every Latin-1 character outside ASCII has a number beyond 7f. So it's not possible to say whether a single byte above 7f is a Latin-1 character or the start of a UTF sequence.
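The ambiguity can be shown with a single byte. The snippet below is only an illustration using Python's built-in codecs; `is_valid_utf8` is a helper name made up here.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_a_umlaut = "ä".encode("latin-1")   # the single byte e4
utf8_a_umlaut = "ä".encode("utf-8")       # the two bytes c3 a4

# e4 looks like 1110xxxx, the start of a 3-byte UTF sequence,
# so on its own it is not valid UTF-8.
assert latin1_a_umlaut == b"\xe4"
assert not is_valid_utf8(latin1_a_umlaut)

# The UTF-8 bytes are also perfectly legal Latin-1, just with a
# different meaning: two characters instead of one.
assert utf8_a_umlaut.decode("latin-1") == "Ã¤"

if __name__ == "__main__":
    print(latin1_a_umlaut.hex(), utf8_a_umlaut.hex(" "))
```

A receiver that sees such bytes without knowing the encoding can only guess, which is why mixing both on the wire needs some kind of marker.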

In conclusion, extending the encoding scheme of the protocol from ASCII to UTF would leave current clients still working, as null bytes do not occur. Old clients would of course treat each byte of a UTF sequence as a separate character, leading to funny names in the search results. But you get that even now, and searches containing e.g. German special characters do not really work at present: these characters will normally just be ignored. Moving to UCS might make some old clients fail, as a character might contain a null byte. Compared to UCS, UTF for a single character takes either less space (for ASCII text), exactly the same space (for the special European characters and any characters up to 07ff UNICODE, for example Russian), or 1 byte more (most notably for Asian languages).
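The size and null-byte claims are easy to check. In this sketch Python's `utf-16-be` codec stands in for UCS-2, which is an assumption that holds only for characters in the 2-byte range; the helper names are invented for the example.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UCS-2 size) in bytes for text.

    Assumption: "utf-16-be" is used as a stand-in for UCS-2, which is
    only equivalent for characters up to ffff.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


def has_null_byte(data: bytes) -> bool:
    """True if the encoded data contains a 00 byte."""
    return b"\x00" in data


if __name__ == "__main__":
    for ch in ("A", "ä", "Я", "日"):
        utf8_size, ucs2_size = encoded_sizes(ch)
        print(f"U+{ord(ch):04X}: UTF-8 {utf8_size} byte(s), "
              f"UCS-2 {ucs2_size} byte(s)")

    # Plain ASCII text in UCS-2 carries a null byte per character,
    # which is what could trip up clients expecting C-style strings.
    print(has_null_byte("A".encode("utf-16-be")))
    print(has_null_byte("A日".encode("utf-8")))
```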

As the bulk of the traffic will very probably remain ASCII for a long time to come, the increase in load from using UTF should be tolerable. You gain a worldwide audience, and you stay compatible with the current standard. Keep in mind that Latin-1 right now is neither standard nor does it work well. Lastly, if at some point in the future you desire an extension to cover UNICODE characters up to U+10ffff, then UTF-8 can still be used.

Please have a look at the <a href=http://www.unicode.org/>UNICODE Consortium</a>. Demonstration pages for UNICODE (always UTF encoded) can be found anywhere on the web. One such is <a href=http://www.geocities.com/Tokyo/Pagoda/1675/unicode-page.html>here</a>.

If you go for Latin-1, then you need a mechanism to identify the message as Latin-1 or as UNICODE. If you use UCS, then you probably cannot maintain downward compatibility. You will also get new problems when, at some point in the future, characters up to U+10ffff need to be supported.