Hello! We want to be able to point to one more question

verdyp · #9 · December 11th, 2004

Chinese characters, or more precisely Han ideographs, are most often used in Chinese to write a single syllable, not a word or a concept as the term "ideograph" would imply. Linguists prefer the term "sinograph" for these characters.

What makes the Han script complex is the number of syllables it can encode, and the fact that the set of syllables in Chinese is extremely rich, with distinctive diphthongs, stress, tones, and multiple consonants. Compared with other syllabaries (like the Hiragana or Katakana also used in Japanese), an individual "letter" of this script is as expressive as one or two syllables in a Latin-based language. That is why most Chinese words are written with no more than two sinographs. Han is not considered a syllabary, although it arguably should be (with the exception of some historic, rarely used sinographs that represent concepts, and some traditional sinographs that are frequent in Chinese texts and represent a complete word or concept).

The size of the extended Han syllabary is not a problem for LimeWire, which simply inherits the encoding work done for Han in Unicode. In Unicode, the most frequent sinographs are encoded in the "BMP" above the U+07FF code point limit, meaning that each of them is represented with 3 bytes in the UTF-8 encoding scheme. A search for these characters will be selective as soon as the search string contains more than one common sinograph.
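To make the byte counts concrete, here is a minimal Java sketch that measures the UTF-8 length of a few strings. The class and method names (`Utf8Length`, `utf8Length`) are made up for illustration and are not taken from the LimeWire sources.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not LimeWire code).
class Utf8Length {
    /** Returns the number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // U+0061, ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9"));       // U+00E9, accented Latin letter: prints 2
        System.out.println(utf8Length("\u4e2d"));       // U+4E2D, common sinograph: prints 3
        System.out.println(utf8Length("\u4e2d\u6587")); // two common sinographs: prints 6
    }
}
```

So a two-sinograph query already weighs 6 bytes in UTF-8, twice as much as a three-letter ASCII query.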

The rule for allowing searches with:
- at least 3 ASCII-only chars,
- or at least 2 chars if at least one is not ASCII,
- or at least 4 bytes in the UTF-8 representation
would work for Chinese, as well as other languages.
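The three conditions above could be sketched in Java roughly as follows. This is only an illustration under my own assumptions: the names `QueryLengthRule` and `isAllowed` are hypothetical, "chars" is taken to mean Unicode code points rather than UTF-16 code units, and the real LimeWire check may also trim or normalize the query first.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed rule (not the actual LimeWire code).
class QueryLengthRule {
    /** Returns true if the query is long enough to be searched under the proposed rule. */
    static boolean isAllowed(String query) {
        // Assumption: "chars" means code points, so a surrogate pair counts once.
        int chars = query.codePointCount(0, query.length());
        int utf8Bytes = query.getBytes(StandardCharsets.UTF_8).length;

        // A string is ASCII-only when every UTF-16 code unit is below U+0080.
        boolean asciiOnly = true;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) >= 0x80) {
                asciiOnly = false;
                break;
            }
        }

        return (asciiOnly && chars >= 3)      // at least 3 ASCII-only chars
            || (!asciiOnly && chars >= 2)     // at least 2 chars, one of them non-ASCII
            || utf8Bytes >= 4;                // at least 4 bytes in UTF-8
    }
}
```

With this sketch, `isAllowed("ab")` and `isAllowed("\u4e2d")` (one common sinograph, 3 bytes) are rejected, while `isAllowed("abc")` and `isAllowed("\u4e2d\u6587")` (two common sinographs) are accepted.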

Many rarer Han sinographs are encoded outside the BMP, in the Supplementary Ideographic Plane (SIP). Within Java and LimeWire, strings are encoded internally as UTF-16, so each of these characters is stored as a pair of "surrogates". In UTF-8, however, each of them becomes 4 bytes. If such characters were present in a search string, each of them would be highly selective for searches, so a single character would be enough.
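As a quick check of these sizes, the following Java sketch measures one SIP character, U+20000, in three ways. The class name `SupplementaryDemo` and its `measure` method are hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class SupplementaryDemo {
    /** Returns { UTF-16 code units, Unicode code points, UTF-8 bytes } for the string. */
    static int[] measure(String s) {
        return new int[] {
            s.length(),                                 // UTF-16 code units
            s.codePointCount(0, s.length()),            // Unicode code points
            s.getBytes(StandardCharsets.UTF_8).length   // UTF-8 bytes
        };
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of the Supplementary Ideographic Plane.
        String sip = new String(Character.toChars(0x20000));
        int[] m = measure(sip);
        System.out.println(m[0] + " code units, " + m[1] + " code point, " + m[2] + " bytes");
        // prints: 2 code units, 1 code point, 4 bytes
        System.out.println(Character.isHighSurrogate(sip.charAt(0))); // prints: true
        System.out.println(Character.isLowSurrogate(sip.charAt(1)));  // prints: true
    }
}
```

Note that `String.length()` reports 2 for this single character, so a length check written in terms of `length()` would count UTF-16 code units rather than characters.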

So the proposed rule to allow searches would also work well for these extended sinographs in the SIP...
