Hello! In LimeWire, searches of 2 characters or less cannot be performed. Will that ever become possible?
Should I be typing in Japanese? Hey, I could be wrong. The error message that comes up says that a 2-character search would congest the network: it would search through far more items than it could possibly need, without necessarily finding all possible desired sources before the search is exhausted. In other words, a lot of search effort would simply be wasted searching through unnecessary & unwanted files. One thing the p2p & particularly the LW designers try to do is limit unnecessary traffic along the Gnutella network so it doesn't work at a snail's pace. Which would you prefer, LW to work as the tortoise or the hare? lol If you're referring to Asian characters such as those used in Chinese or Japanese, then perhaps you should put in a request for a new feature in LW. Post here: New Feature Requests. And explain your reasons in detail. I can see how it would be a nuisance, particularly for some names or even simple/short titles.
I would like to ask one more question. Thank you for answering my previous question! Some things can only be written with 2 characters, so they never come up in searches. In that case, what should I do?
The way LimeWire is set up at present, you will need to think very carefully about how you can search using more than 2 characters to find what you want. Because you can search using different types of criteria, you can try to find a way of expressing it using 3 or more characters. As far as results go, I don't know whether results with only 2 characters are shown; I would need to change my set-up & experiment to find out. Or perhaps you could tell us. Thank you for bringing this to our attention & I'll try to leave a note for the developers (not that they ever listen to me anyway, lol :D ) I see that you are in fact using a Japanese IP. Out of curiosity, which city are you from?
The limit of 3 characters was designed at a time when only ASCII searches were reliable. But since we now support Unicode for handling any language, this rule should be rewritten so that it requires a minimum of 3 UTF-8 encoded bytes for a search. This won't change anything for ASCII searches: they will still need 3 characters. But for general European Latin/Greek searches it will mean that 2 characters are enough if at least one is not ASCII (note however that searches ignore and drop accents, even though combining accents are still returned in the results). For Asian languages, 3 UTF-8 bytes encode 1 ideograph or 1 Hiragana or Katakana. Maybe this limit of 3 bytes is too little. So as a prudent alternative, I would say that 3 ASCII-only characters, or 4 bytes of UTF-8 encoding, should be needed to perform a search. (For European languages, this means 3 ASCII characters, or 2 ASCII and 1 extended character, or 2 extended characters; for Asian texts, this means a minimum of 2 ideographs or 2 hiragana/katakana, ignoring the combining voice or tone marks.)
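To make the proposed rule concrete, here is a minimal Java sketch of such a check. The class and method names (QueryLengthRule, isAllowed) are hypothetical, not LimeWire's actual API: a query passes if it has at least 3 ASCII characters, or if its UTF-8 encoding is at least 4 bytes long.

```java
import java.nio.charset.StandardCharsets;

/**
 * Minimal sketch of the proposed minimum-length rule discussed above:
 * a query is accepted if it contains at least 3 ASCII characters, or if
 * its UTF-8 encoding is at least 4 bytes long. Names are hypothetical,
 * not LimeWire's actual API.
 */
public final class QueryLengthRule {

    static boolean isAllowed(String query) {
        String q = query.trim();
        // Pure-ASCII queries keep the historical 3-character minimum.
        if (q.chars().allMatch(c -> c < 0x80)) {
            return q.length() >= 3;
        }
        // Otherwise require at least 4 bytes of UTF-8: e.g. 2 ASCII letters plus
        // 1 accented letter, or 2 kana, or 2 Han ideographs.
        return q.getBytes(StandardCharsets.UTF_8).length >= 4;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("the"));   // true  (3 ASCII chars)
        System.out.println(isAllowed("mp"));    // false (2 ASCII chars)
        System.out.println(isAllowed("café"));  // true  (5 UTF-8 bytes)
        System.out.println(isAllowed("日本"));  // true  (6 UTF-8 bytes)
        System.out.println(isAllowed("日"));    // false (3 UTF-8 bytes)
    }
}
```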
Nihon-go desu ka. Warui kedo ore-tachi no nihon-go wa mada mada desu ("Is that Japanese? Sorry, but our Japanese still has a long way to go"; I still can't read, not to mention write, kanji). It is really funny to see how those translator applications work (or don't work).
Although I can't read Japanese other than the Hiragana and Katakana characters, for which an approximate phonetic transliteration into the Latin script is easy to perform (like you did), I can still recognize that "nihon-go" means "Japanese" (the language name). So I won't be helpful unless there's a translator for support questions in Japanese (even more difficult when Japanese users send us a question in Japanese using in their email some unknown variant of EUC/ISO-2022-JP, instead of the more widely portable Shift-JIS, or even Unicode UTF-8)... So I have a small support question in Japanese to which I can't reply. Here it is (sorry, this forum only supports Unicode in UTF-8 form, so the characters may be shown incorrectly unless you explicitly select UTF-8 in your browser): 文字の部分が□□□になります。なぜでしょうか? ("The text part turns into □□□. Why is that?") The question comes with a screen snapshot of LimeWire in Japanese, where the title shown at the top of the search box appears only as a string of square boxes. Apparently that user has a problem in his configuration of fonts for displaying Japanese, but I'm not sure how I can help, given that his display is not the one I get when testing LimeWire in Japanese, where I don't see these square boxes (which mean missing glyphs in the selected font). So if there are inaccuracies in the encoding of the Japanese translation of LimeWire, there's little I can do. (Some months ago, a Japanese student was working in the LimeWire offices in New York; he helped improve this translation and created the complete translation of the LimeWire web site into Japanese; he also worked with me to define the rules allowing better handling of Japanese in keyword searches.) Can someone come to the rescue? Aren't there any experienced Japanese users out there?
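As an aside on the encoding problem mentioned above (EUC/ISO-2022-JP vs. Shift-JIS vs. UTF-8), here is a small, purely illustrative Java sketch of the most common cause of such mojibake: UTF-8 bytes decoded as Latin-1, and how re-decoding can recover the original text when that is indeed what happened. This is not forum or LimeWire code, just a demonstration.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration only: UTF-8 bytes mistakenly decoded as Latin-1 (ISO-8859-1)
 * turn Japanese text into "mojibake". Re-encoding the garbled string as
 * Latin-1 recovers the original bytes, which can then be decoded as UTF-8.
 */
public final class MojibakeDemo {
    public static void main(String[] args) {
        String original = "文字の部分が□□□になります";

        // What happens when the UTF-8 bytes are misinterpreted as Latin-1:
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("Garbled:   " + garbled);

        // Recovery: get the raw bytes back via Latin-1, then decode as UTF-8.
        byte[] recoveredBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        String recovered = new String(recoveredBytes, StandardCharsets.UTF_8);
        System.out.println("Recovered: " + recovered);
    }
}
```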
How about Chinese characters? 榆
Chinese characters, more exactly Han ideographs, are used in Chinese most often to write one syllable, not to write words or concepts as the term "ideograph" would imply. Linguists prefer the term "sinograph" to designate these characters. What makes the Han script complex is the number of syllables that the script can encode, and the fact that the set of syllables in Chinese is extremely rich, with distinctive diphthongs, stress, tones, and multiple consonants... When you compare it to other syllabaries (like the Hiragana or Katakana also used in Japanese), the individual "letters" of that script become about as expressive as 1 or 2 syllables in a Latin-based language. That's why most Chinese words are written with no more than 2 sinographs. This is why Han is not considered a syllabary, although it arguably could be (with the exception of some historic and rarely used sinographs that represent concepts, and some traditional sinographs which are widely used and frequent in Chinese texts and represent a complete word or concept).

The size of the extended Han syllabary is not a problem for LimeWire, which simply inherits the encoding work for Han done in Unicode. In Unicode the most frequent sinographs are encoded in the "BMP" above U+07FF (the common ideographs start at U+4E00), meaning that they are represented with 3 bytes in the UTF-8 encoding scheme. A search for these characters will be selective if the search string contains a bit more than 1 common sinograph. The rule for allowing searches with:
- at least 3 ASCII-only chars,
- or at least 2 chars if at least one is not ASCII,
- or at least 4 bytes in the UTF-8 representation
would work for Chinese, as well as for other languages.

Many more rare Han sinographs are encoded outside the BMP, in the Supplementary Ideographic Plane (SIP). Within Java, and so in LimeWire, all Unicode characters in strings are encoded internally with UTF-16, so each of these becomes a pair of "surrogates"; in UTF-8 they become 4 bytes. If such characters were present in a search string, each of them would be highly selective for searches, so a single character would be enough. The proposed rule to allow searches would therefore also work well for these extended sinographs in the SIP...
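A short, purely illustrative Java sketch of the sizes mentioned above: a common BMP ideograph takes one UTF-16 code unit and 3 UTF-8 bytes, while a rare SIP ideograph takes a surrogate pair (two code units) and 4 UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration of the encoding sizes discussed above: BMP vs. SIP ideographs,
 * counted in code points, UTF-16 code units (Java chars) and UTF-8 bytes.
 */
public final class HanEncodingSizes {
    static void describe(String s) {
        System.out.printf("%s  code points=%d  UTF-16 units=%d  UTF-8 bytes=%d%n",
                s,
                s.codePointCount(0, s.length()),
                s.length(),
                s.getBytes(StandardCharsets.UTF_8).length);
    }

    public static void main(String[] args) {
        describe("中");                                    // BMP ideograph: 1 char, 3 bytes
        describe(new String(Character.toChars(0x20000)));  // SIP ideograph: 2 chars, 4 bytes
    }
}
```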
That's very impressive about 中文 use in LW! I guess it depends upon which Japanese text you use. But the Japanese rough equivalent of Han might succeed in finding some items in searches (depending on labelling & source, etc.). I don't know Japanese, as is obvious from 石灰ワイヤー (a literal "lime wire"). So it's a difficult task.
You seem to have interesting and valuable knowledge about Asian scripts. If you have some programming experience, could you join our team of open-source contributors to help further improve the internationalization of LimeWire? Unfortunately, my knowledge of these scripts is only theoretical, based on the work done in the Unicode standard and related projects such as ICU and the UniHan properties, not on linguistics or semantics.

One thing for which we have no clue is the support of Thai (which unfortunately has a visual ordering in Unicode, because of the support of the legacy national TIS-620 standard, instead of the logical ordering used in other scripts; and also because Thai, like many other Asian languages, does not use any space to separate words). In the past, I proposed to index Asian filenames by splitting them arbitrarily into units of 2 or 3 character positions, but the number of generated keywords would have been a bit too high. Suppose that the title "ABCDEFG" is present, where each letter is an ideograph, a Hiragana or Katakana letter, or a Thai letter; the generated searchable and indexed keywords would have been (see the sketch after this post):
"AB", "BC", "CD", "DE", "EF", "FG" (if these two-letter "keywords" respect the minimum UTF-8 length restrictions given in my previous message)
"ABC", "BCD", "CDE", "DEF", "EFG"
Note that there may exist situations where a SINGLE character is a significant keyword.

In LimeWire, we currently detect keyword separations with:
- spaces and controls,
- the general category of characters, so that punctuation or symbols become equivalent to spaces,
- the script type of the character: a transition in Japanese between Hiragana, Katakana, Kanji or Latin implies a keyword separation.

What we really need is a lexer. There are several open-source projects related to such lexical analysis of Asian texts (notably for implementing automatic translators, input method editors or word processors). The problem is that they often depend on a local database storing the lexical entities, or on long lists of lexical rules. Some projects do something else: lexical analysis is performed automatically, from an initially empty dictionary, by statistical analysis of frequent lexical radicals, so that frequently used prefixes and suffixes can be identified (this is also useful for non-Asian languages, like German, or for other Latin-written European or African languages like Hungarian or Berber). This is a research domain which is heavily protected by many patents, notably those owned by famous dictionary publishers or web search engines like Google... Documents on this subject which are freely available and would allow royalty-free redistribution in open-source software are difficult to find... But I suppose that this has been studied for centuries in some old books whose text is now in the public domain. My searches within public libraries like the BDF in France have not found anything significant (and getting copies of these documents is often complicated or even expensive, unless these books have been converted to digital formats available online). Also, most of these books require at least a good knowledge of the referenced languages, something I don't have... It's probably easier for natives of countries speaking and writing those languages; that's why we call for contributions from open-sourcers...
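Here is a minimal Java sketch of the 2-/3-character splitting proposal described above. The class name NGramKeywords is hypothetical; this is an illustration of the idea, not LimeWire's indexing code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the proposal above: for a run of Asian letters with no word
 * separators, emit every 2-character and 3-character substring as a
 * candidate keyword.
 */
public final class NGramKeywords {

    static Set<String> keywords(String run) {
        Set<String> result = new LinkedHashSet<>();
        List<Integer> codePoints = new ArrayList<>();
        run.codePoints().forEach(codePoints::add);
        for (int n = 2; n <= 3; n++) {
            for (int i = 0; i + n <= codePoints.size(); i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = i; j < i + n; j++) {
                    sb.appendCodePoint(codePoints.get(j));
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // With the letters A..G standing in for ideographs or kana:
        System.out.println(keywords("ABCDEFG"));
        // [AB, BC, CD, DE, EF, FG, ABC, BCD, CDE, DEF, EFG]
    }
}
```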
Well, I never studied Thai but I did study Khmer, though I wouldn't suppose you'd have too many people of that language who'd even use a computer, let alone LW. There are some similarities to Thai. There's also a person here who is Thai, but none of the keyboards here are set up to use Thai; I simply use the OSX Font Book to help with translations of text (or to make such fonts available). Likewise for Chinese. The other language I studied is Vietnamese, but since that's almost Latin-based it is not quite as complex. Yes, in Thai you have all the extra accents & what-have-you, in principle the same as Khmer. Lao also has some similarities to Thai, but in a different way than Khmer. As far as programming goes, although I loved it at the time, I am now too far removed from it. I am no master of Java & have only a very basic & extremely limited knowledge of it. In my work I became too distracted with a locally designed language, working as a sys prog, & ended up nowhere except in frustration. It was a rare language at the time, so to speak. The company was dissolved, as have many over the years. I guess that's why I studied business & marketing, so I would know what I was walking into. lol But that's not to say I am not keen to help! If I can, I will. I just need some instructions, preferably by PM. I wanted to say the above things first! Technical terms can be a stumbling block; often they're based directly on English, spoken and/or written.
In principle, Lao and Khmer will be less complex than Thai, because they were encoded in Unicode using the logical model, which makes full-text searches easier to implement. For LimeWire, it means that Lao and Khmer can be handled like the other Indian/Brahmic scripts (we don't care if the visual order differs from the logical order, or if there exist input methods that use a visual order, given that the conversion of these texts to Unicode uses a logical order).

But that's not true for Thai, because Thai is supported in Thailand by an old and widely used standard, TIS-620, made long ago in the 70s by IBM for the Thai government, which made it mandatory for the representation of Thai texts. Unicode then inherited this situation, because it wanted to keep round-trip compatibility with the very large existing corpus of Thai texts in computers, files, databases, and input methods, long encoded with TIS-620 or one of its predecessors. Thai has always been encoded in the visual order, where some letters that logically come after another one (including in phonetics, structure, or collation) must be entered before it, because the text is written graphically in a single left-to-right direction (this visual ordering comes from the legacy limitations of font and display technologies in the 70s). India chose to preserve the logical ordering, which looks more like the way Indian users think about their language and how they spell it orally (some typewriters in India used the visual order, but these were considered difficult or illogical to use; for computers, the ISCII standard was created with the correct assumption that computers would perform the visual ordering themselves).

Thai is complex because there is no reliable algorithm to convert from the visual order (as encoded) to the logical order (which would be useful for searches and collation). In practice, a Thai collation system comes with a large database containing most Thai words or radicals encoded in logical order. This database could be used to create useful lexical entities in LimeWire, but it is large, and may be subject to some copyright restrictions (I don't know exactly the status of the Thai database that comes with IBM's ICU, i.e. whether it can be redistributed freely, like ICU itself). If LimeWire incorporated this database for Thai users, maybe it should be an optional download, because of its size.

As far as I know, no other Asian language needs such a database for collation, but such a database may be needed by a lexer to split sentences into keywords, due to the absence of mandatory spaces. But it's possible that these Asian users have learned to insert spaces or punctuation within their filenames to help the indexation of their files. We have no feedback about this from Chinese or Japanese users, so I don't know whether their attempts to search for files in their language are successful or not. If not, maybe we should consider implementing at least a basic lexer like the one I described above (every 2 or 3 or 4 characters within a sequence of letters of the same Asian script).
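For comparison, here is a minimal Java sketch of dictionary-based word splitting using java.text.BreakIterator, which in recent JDKs ships with a dictionary-backed word iterator for Thai. This is only an illustration of the approach, not a proposal for LimeWire's code, and its quality depends on the runtime's bundled dictionary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Sketch: split spaceless text into words with the locale-aware word
 * BreakIterator. For Thai, recent JDKs use a dictionary-based iterator;
 * coverage and quality vary by runtime.
 */
public final class ThaiWordSplit {

    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "Hello" in Thai, written without a space between the two words.
        System.out.println(words("สวัสดีครับ", new Locale("th")));
    }
}
```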
I may try to research around to see what I can find out (from those who search in their native language) & see what their responses are. But I won't reply in a hurry; it could take quite some time, particularly this time of year (people on breaks & on holidays overseas, etc.). A licensed Thai database! Well, I guess that would explain the limitation of characters. Some of my fonts apparently come freely from a university. So how does Java handle these fonts? They seem to be fine from here; every time I visit hotm'l it's in Thai text, though not all of it displays correctly. How much do you think LW would pay me if I translated a version totally into Thai & had OSX-type iTunes support? lol (ummm, just a joke!)
I won't reply to you about the payment; someone from the LimeWire crew would be better placed to reply to you privately. I contribute to LimeWire as a free open-source contributor, managing most of the work with candidate translators, validating their input, testing the translations in LimeWire, helping LimeWire with beta-tests, or helping with proposed optimizations or corrections. I'm not paid to do that, and I have no contractual obligation with LimeWire; I just need to respect its working etiquette. For these reasons, LimeWire has granted me limited access to their development platform, and I have received a couple of checks and some goodies as a way to thank me for my past contributions.
Note that I already have some open-source lexers, but they are implemented in C, not Java (however, this source is not difficult to port). The only real question is the validity of these lexicon extractors, but some of them are used in well-known search engines, such as mnoGoSearch, which comes with a lexicon extractor that:
- first extracts tokens from text using separators, returning 1-character tokens for Han and Thai characters;
- then uses dynamic programming to determine lexical units from multiple paths, based on word frequencies with a maximum path length (this requires a word-frequency dictionary, available for Thai and Chinese; Thai is the smallest, Mandarin/simplified Chinese contains 3 times more entries, and traditional Chinese is twice the size of Mandarin).
The algorithm implementation is small and efficient, but the main problem is to encode the dictionaries in a small and efficient way. For Thai, this could fit in less than 100 KB, but for Mandarin it would require about 300 to 400 KB, and for traditional Chinese about 1 MB (in memory at run-time)...
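Here is a minimal Java sketch of that second, dynamic-programming step, with a tiny made-up frequency dictionary standing in for the real Thai or Chinese word lists. The class and the inline data are hypothetical; this is not mnoGoSearch's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/**
 * Sketch of frequency-based segmentation: given a word-frequency dictionary,
 * pick the split of a spaceless string that maximises the sum of
 * log-frequencies of its words.
 */
public final class FrequencySegmenter {

    static List<String> segment(String text, Map<String, Double> freq, int maxWordLen) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLen); start < end; start++) {
                String word = text.substring(start, end);
                // Unknown single characters get a strong penalty but remain usable.
                double logF = freq.containsKey(word)
                        ? Math.log(freq.get(word))
                        : (word.length() == 1 ? -20.0 : Double.NEGATIVE_INFINITY);
                if (best[start] + logF > best[end]) {
                    best[end] = best[start] + logF;
                    prev[end] = start;
                }
            }
        }
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical frequencies for a toy Chinese example: 中文 "Chinese", 歌曲 "song".
        Map<String, Double> freq = Map.of("中文", 0.02, "歌曲", 0.01, "中", 0.005, "文", 0.004);
        System.out.println(segment("中文歌曲", freq, 4)); // expected: [中文, 歌曲]
    }
}
```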
Poor Yukino: if he/she is still trying to follow this thread, and is using some kind of translating software to understand what is being talked about, then he/she is pretty much done for. :)
How can we avoid being technical in such discussions? At the time the 3-character limit was added, it was because there was a huge and unnecessary amount of traffic caused by searches with too many results, like "the", or "mp3", or even "*". LimeWire integrated some filters forcing its users to be more selective, so that requests with too many results would not be randomly routed throughout a large part of the network and completely fill the available bandwidth. The problem with those requests is not that all results will not be returned (due to bandwidth limitation), but the fact that they completely fill the shared bandwidth that would otherwise allow routing more specific and more useful requests. This limitation is inherent to the propagation model of these requests: statistically, they behave exactly like network-wide broadcasts, consuming an extremely large amount of bandwidth, both for the broadcast requests themselves and, even worse, for the responses spread from throughout the network.

I note however that, since Gnutella has evolved, routing responses directly to the requester uses less shared bandwidth, because many intermediate nodes no longer have to support this traffic: the only node overwhelmed by it is the requesting host itself, whose input bandwidth will be completely saturated by responses. But LimeWire now has an algorithm to easily limit and control this incoming flow, so these replies are less of a problem than they were in the past. It remains true that these requests still propagate as broadcasts, without efficient routing. But LimeWire integrates, in its router, some algorithms that limit the impact of broadcasts by not routing a request immediately in all candidate directions. If you look at the "What's New?" feature, you'll note that it looks like a request that can match replies from many hosts, possibly nearly all! This feature now works marvelously, and it's possible that these "greedy" short requests would no longer be a problem in LimeWire. These algorithms are mostly heuristics; they are not perfect. So we still need to study carefully the impact of relaxing the limits.

What is clear is that the most problematic greedy requests in the past were requests containing only ASCII letters. The 3-character limit was imagined at a time when only ASCII requests were possible on Gnutella, so it works well only for English and some languages rarely present on the web. Under this old limit, we had too many files containing "words" with 1 or 2 ASCII letters or digits; in other words, these requests were not selective enough. But as we can now search for international strings, with accents on Latin letters, or using other scripts, the 3-character minimum becomes excessive, because a search for a 1-letter or 2-letter Han ideographic word will most often be more selective than a search for a 3-letter English word. In fact, Latin letters with accents, or Greek or Cyrillic letters, are much less present on the network, even on hosts run by users who use those scripts natively to name part of their shared files.
Note however that if I search for "Pô", this search may look selective for the French name of a river in the north of Italy, but in fact the effective search string will be "po" (because searches are matched and routed so that minor differences of orthography are hidden: I can search for "café" and get results with "CAFE" or "Café" or "cafes", or results where the acute accent above the e is encoded separately after the e letter as a combining diacritic, because of the way various hosts and systems encode or strip this accent in their filenames...). So a search for "Pô" is still not selective enough. This is a place for later improvement, with a more dynamic behavior based on actual frequencies of results.

For now, we must keep this limit for ASCII letters (which applies to nearly all Latin letters, with a few exceptions like "ø" or "æ" which may be decomposed into "o" and "ae", or Latin letters like the Icelandic/old English "thorn", the Nordic "eth", or the rare "esh"), but I don't see why we should not relax it for non-Latin scripts: notably Cyrillic letters (for Russian, Bulgarian, Serbian...), Han ideographs (for Chinese Hanzi characters, Japanese Kanjis or Korean Hanjas), Hiragana and Katakana syllables (for Japanese), and less urgently for the Greek alphabet, the Hebrew and Arabic abjads, and some Indian scripts.
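A minimal Java sketch of the kind of normalization described above (NFD decomposition, stripping combining marks, lower-casing), with hypothetical names; LimeWire's actual normalization code may differ in its details.

```java
import java.text.Normalizer;
import java.util.Locale;

/**
 * Sketch: decompose the keyword (NFD), drop combining marks, and lower-case
 * it, so that "Pô" and "café" are matched as "po" and "cafe".
 */
public final class KeywordNormalizer {

    static String normalize(String keyword) {
        String decomposed = Normalizer.normalize(keyword, Normalizer.Form.NFD);
        // \p{M} matches combining marks such as the acute accent in "café".
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Pô"));    // po
        System.out.println(normalize("café"));  // cafe
        System.out.println(normalize("CAFE"));  // cafe
    }
}
```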