Gnutella Forums  

Go Back   Gnutella Forums > Current Gnutella Client Forums > LimeWire+WireShare (Cross-platform) > Open Discussion topics
Register FAQ The Twelve Commandments Members List Calendar Arcade Find the Best VPN Today's Posts

Open Discussion topics Discuss the time of day, whatever you want to. This is the hangout area. If you have LimeWire problems, post them here too.


 
 
LinkBack Thread Tools Display Modes
Prev Previous Post   Next Post Next
  #9 (permalink)  
Old December 11th, 2004
verdyp's Avatar
LimeWire is International
 
Join Date: January 13th, 2002
Location: Nantes, FR; Rennes, FR
Posts: 306
verdyp is flying high
Default

Chinese characters, more exactly Han ideographs, are used in Chinese most often to write one syllable, not to write words or concepts as the term "ideograph" would imply. Linguists prefer the term "sinograph" to designate these characters.

What makes the Han script complex is the number of syllables that the script allow to encode, and the fact that the set of syllables in Chinese is extremely rich, with distinctive diphtongs, stress, tones, and multiple consonnants... When you compare it to other syllabaries (like Hiragana or Katakana used also in Japanese), the individual "letters" of that script becomes as much expressive as 1 or 2 syllables in a Latin-based language. That's why most Chinese words are written with no more than 2 sinographs. This is why Han is not considered as a syllabary, although it should (with the exception of some historic and rarely used sinographs used to represent concepts, or some tradtional sinographs which are widely used and frequent in Chinese texts, and represent a complete word or concept).

The size of the extended Han syllabary is not a problem for LimeWire, which inherits simply from the encoding efforts for Han performed in Unicode. In Unicode the most frenquent sinographs are encoded in the "BMP" after the U+03FF code point limit, meaning that they are represented with 3 bytes in a UTF-8 encoding scheme. A search for these characters will be selective if there is a bit more than 1 common sinograph in the search string.

The rule for allowing searches with:
- at least 3 ASCII-only chars,
- or at least 2 chars if at least one is not ASCII,
- or at least 4 bytes in the UTF-8 representation
would work for Chinese, as well as other languages.

Many more rare Han sinographs are encoded out of the BMP in a supplementary "ideographic" plane (SIP). Within Java and in LimeWire, all Unicode characters in strings are encoded internally with UTF-16 as a pair of "surrogates". But in UTF-8 they will become 4 bytes. If those characters were present in a search string, each of them would highly selective for searches. So a single character would be enough.

So the proposed rule to allow searches would also work well for these extended sinographs in the SIP...
__________________
LimeWire is international. Help translate LimeWire to your own language.
Visit: http://www.limewire.org/translate.shtml
Reply With Quote
 


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
everybody point and laugh wrestlingles Open Discussion topics 2 November 10th, 2005 04:30 AM
Hallo Deutschland DaleKaufm Deutsch 1 June 21st, 2005 02:02 PM
What is the point if i cant transfer???? Unregistered General Mac OSX Support 9 November 1st, 2002 09:14 AM
What is the point of the MP3 player in version 1.8? Unregistered Open Discussion topics 2 November 13th, 2001 07:54 AM
Point Server with P2P allautoweb General Gnutella / Gnutella Network Discussion 0 April 29th, 2001 06:09 PM


All times are GMT -7. The time now is 10:24 PM.


Powered by vBulletin® Version 3.8.7
Copyright ©2000 - 2025, vBulletin Solutions, Inc.
SEO by vBSEO 3.6.0 ©2011, Crawlability, Inc.

Copyright © 2020 Gnutella Forums.
All Rights Reserved.