Chinese characters, more exactly Han ideographs, are most often used in Chinese to write one syllable, not to write words or concepts as the term "ideograph" would imply. Linguists prefer the term "sinograph" to designate these characters.

What makes the Han script complex is the number of syllables that the script allows to encode, and the fact that the set of syllables in Chinese is extremely rich, with distinctive diphthongs, stress, tones, and multiple consonants... Compared to other syllabaries (like Hiragana or Katakana, also used in Japanese), an individual "letter" of that script is as expressive as 1 or 2 syllables in a Latin-based language. That's why most Chinese words are written with no more than 2 sinographs.

Han is usually not considered a syllabary, although it arguably should be (with the exception of some historic and rarely used sinographs that represent concepts, and some traditional sinographs that are frequent in Chinese texts and represent a complete word or concept).

The size of the extended Han syllabary is not a problem for LimeWire, which simply inherits from the encoding efforts for Han performed in Unicode. In Unicode the most frequent sinographs are encoded in the "BMP" above the U+07FF code point limit, meaning that they are represented with 3 bytes in the UTF-8 encoding scheme. A search for these characters will be selective if there is a bit more than 1 common sinograph in the search string.

The rule allowing a search when the string has:
- at least 3 ASCII-only chars,
- or at least 2 chars if at least one is not ASCII,
- or at least 4 bytes in the UTF-8 representation
would work for Chinese, as well as other languages.

Many more rare Han sinographs are encoded outside the BMP, in a supplementary "ideographic" plane (SIP). Within Java and in LimeWire, strings are encoded internally with UTF-16, where each of these supplementary characters is represented as a pair of "surrogates". In UTF-8 each of them becomes 4 bytes.
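To make the byte counts concrete, here is a small Java sketch that prints the UTF-16 and UTF-8 lengths of an ASCII letter, a common BMP sinograph (U+4E2D) and a SIP sinograph (U+20000). The class name `EncodingLengths` and the helper `utf8Length` are made up for this illustration; they are not LimeWire code.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical class, not part of LimeWire.
final class EncodingLengths {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                      // U+0041, ASCII
        String bmpHan = "\u4E2D";                                // U+4E2D, in the BMP
        String sipHan = new String(Character.toChars(0x20000));  // U+20000, in the SIP

        for (String s : new String[] { ascii, bmpHan, sipHan }) {
            // String.length() counts UTF-16 code units, so a surrogate pair counts as 2.
            System.out.printf("U+%04X: %d UTF-16 unit(s), %d UTF-8 byte(s)%n",
                    s.codePointAt(0), s.length(), utf8Length(s));
        }
    }
}
```

Running it should report 1 unit and 1 byte for U+0041, 1 unit and 3 bytes for U+4E2D, and 2 units (one surrogate pair) and 4 bytes for U+20000.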
If such SIP characters were present in a search string, each of them would be highly selective for searches, so a single character would be enough. The proposed rule to allow searches would therefore also work well for these extended sinographs in the SIP...
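The proposed rule could be sketched in Java roughly like this. It is only a sketch under a few assumptions of mine: the names `SearchLengthRule`, `isAsciiOnly` and `isSearchable` are hypothetical, "chars" is counted in Unicode code points rather than UTF-16 units, and the query is taken as-is (no trimming or keyword splitting). The actual LimeWire check may differ.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical class, not the actual LimeWire implementation.
final class SearchLengthRule {

    /** True if every character is in the ASCII range U+0000..U+007F. */
    static boolean isAsciiOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** Applies the three alternatives of the proposed rule. */
    static boolean isSearchable(String query) {
        // Assumption: "chars" means code points, so a surrogate pair counts once.
        int chars = query.codePointCount(0, query.length());
        int utf8Bytes = query.getBytes(StandardCharsets.UTF_8).length;
        boolean asciiOnly = isAsciiOnly(query);

        return (asciiOnly && chars >= 3)     // at least 3 ASCII-only chars
            || (!asciiOnly && chars >= 2)    // at least 2 chars, one of them not ASCII
            || utf8Bytes >= 4;               // at least 4 bytes in UTF-8
    }

    public static void main(String[] args) {
        String sipHan = new String(Character.toChars(0x20000));
        System.out.println("abc           -> " + isSearchable("abc"));
        System.out.println("ab            -> " + isSearchable("ab"));
        System.out.println("U+4E2D        -> " + isSearchable("\u4E2D"));
        System.out.println("U+4E2D U+6587 -> " + isSearchable("\u4E2D\u6587"));
        System.out.println("U+20000       -> " + isSearchable(sipHan));
    }
}
```

With this sketch a single common BMP sinograph (3 bytes) is rejected, two of them are accepted, and a single SIP sinograph is accepted through the 4-byte clause.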
__________________
LimeWire is international. Help translate LimeWire to your own language.
Visit: http://www.limewire.org/translate.shtml