Hello! We want to be able to point to one more question

verdyp · #1 (**permalink**) December 11th, 2004

You seem to have interesting and valuable knowledge about Asian scripts. If you have some programming experience, could you join our team of open-source contributors to help further improve the internationalization of LimeWire?

Unfortunately, my knowledge of these scripts is only theoretical, based on the work done in the Unicode standard and related projects such as ICU and the UniHan properties, not on linguistics and semantics.

One thing for which we have no clue is the support of Thai (which unfortunately has a visual ordering in Unicode because of the support of the legacy national TIS-620 standard, instead of the logical one used in other scripts, and also because Thai, like many other Asian languages, does not use any space to separate words).

In the past, I proposed to index Asian filenames by splitting them arbitrarily into units of 2 or 3 character positions, but the number of generated keywords would have been a bit too high:

Suppose that the title "ABCDEFG" is present, where each letter is an ideograph, a Hiragana or Katakana letter, or a Thai letter. The generated searchable and indexed keywords would have been:
"AB", "BC", "CD", "DE", "EF", "FG" (if these two-letter "keywords" respect the minimum UTF-8 length restrictions given in my previous message)
"ABC", "BCD", "CDE", "DEF", "EFG"
Note that there may be situations where a SINGLE character is a significant keyword.
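
To make the count concrete, here is a minimal Python sketch of that splitting scheme. It is only an illustration, not LimeWire code: the function name `ngram_keywords` is made up, and it ignores the minimum UTF-8 length restriction and any combining marks.

```python
def ngram_keywords(title, sizes=(2, 3)):
    """Return every contiguous run of n characters, for each n in `sizes`."""
    keywords = []
    for n in sizes:
        for start in range(len(title) - n + 1):
            keywords.append(title[start:start + n])
    return keywords


# Each letter stands for one ideograph, kana or Thai letter.
print(ngram_keywords("ABCDEFG"))
# ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'ABC', 'BCD', 'CDE', 'DEF', 'EFG']
```

A title of n characters yields (n-1) + (n-2) = 2n-3 keywords with this scheme, so 11 keywords for the 7-character title above.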

In LimeWire, we currently detect keyword separations using any of:
- spaces and controls
- the general category of characters, so that punctuation or symbols become equivalent to spaces.
- the script type of the character: a transition in Japanese between Hiragana or Katakana or Kanji or Latin implies a keyword separation.

What we really need is a lexer. There are several open-source projects related to such lexical analysis of Asian texts (notably for implementing automatic translators, input method editors or word processors). The problem is that they often depend on a local database that stores the lexical entities, or on long lists of lexical rules. Some projects do something else: lexical analysis is performed automatically, from an initially empty dictionary, by statistical analysis of frequent lexical radicals, so that frequently used prefixes and suffixes can be identified (this is also useful for non-Asian languages, like German, or for other Latin-written European or African languages like Hungarian or Berber).

This is a research domain which is heavily protected by many patents, notably those owned by famous dictionary publishers, or web search engines like Google... Documents on this subject that are freely available and that would allow royalty-free redistribution in open-source software are difficult to find... But I suppose that this has been studied for centuries in some old books whose text is now available in the public domain.

My searches within public libraries like the BDF in France have not found anything significant (and getting copies of these documents is often complicated or even expensive, unless these books have been converted to digital formats available online). Also, most of these books assume at least a good knowledge of the referenced languages, something I don't have... It's probably easier for natives of countries speaking and writing those languages, which is why we call for contributions by open-sourcers...

Lord of the Rings · #2 (**permalink**) December 11th, 2004

Well, I never studied Thai but I did study Khmer, though I wouldn't suppose you'd have too many speakers of that language who'd even use a computer, let alone LW. There are some similarities to Thai. But then there's a person here who is Thai. But none of the keyboards are set up here to use Thai. Simply using OSX Font Book to help with translations of text (or make such fonts available.) Likewise for Chinese. The other language I studied is Vietnamese, but since that's almost Latin-based it is not quite as complex. Yes, in Thai you have all the extra accents & what-have-you. In principle the same as Khmer. Lao also has some similarities to Thai but in a different way to Khmer.

As far as programming goes, although I loved it at the time, I am now too far removed from it. I am no master of Java & only have a very basic & extremely limited knowledge of it. In my work I became too distracted with a locally designed language for work as a sys prog & ended up nowhere except in frustration. It was a rare language at the time, so to speak. The co. was dissolved, as have many over the years.

I guess that's why I studied business & marketing so I would know what I was walking into. lol

But that's not to say I am not keen to help! If I can I will. I just need some instructions, preferably by pm. I wanted to say the above things 1st! Technical terms can be a stumbling block. Often they're based directly on English; spoken &/or written.

verdyp · #3 (**permalink**) December 11th, 2004

In principle, Lao and Khmer will be less complex than Thai, because they were encoded in Unicode using the logical model, which makes full-text searches easier to implement. For LimeWire, it means that Lao and Khmer can be handled like other Indian/Brahmic scripts (we don't care if the visual order differs from the logical order, or if there exist input methods that use a visual order, given that the conversion of these texts to Unicode will use a logical order.)

But it's not true for Thai, because Thai is supported in Thailand by the old and widely used TIS-620 standard, made long ago in the 70s by IBM for the Thai government, which made it mandatory for the representation of Thai texts. Unicode then inherited this situation, because it wanted to keep roundtrip compatibility with the very large existing corpus of Thai texts in computers, files, databases, and input methods, long encoded with TIS-620 or one of its predecessors.

Thai has always been encoded in visual order, where some letters that logically come after another one (in phonetics, structure, or collation) must be entered before it, because the text is written graphically in a single left-to-right direction (this visual ordering comes from the legacy limitations of font and display technologies in the 70s). India chose to preserve the logical ordering, which is closer to the way Indian users think about their language and how they spell it orally (there have been some typewriters in India using the visual order, but these were considered difficult or illogical to use; for computers, the ISCII standard was created with the correct assumption that computers would do the visual reordering themselves.)
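
The difference is easy to see by listing the stored code points. The short Python sketch below (the helper name `code_point_names` is made up for illustration) compares a Thai syllable with a Devanagari one. In both, the vowel sign is drawn to the left of the consonant, but only Thai stores it first.

```python
import unicodedata


def code_point_names(text):
    """List the Unicode name of each code point, in stored (encoded) order."""
    return [unicodedata.name(ch) for ch in text]


thai = "\u0e40\u0e01"        # displayed as เก: vowel sign stored first (visual order)
devanagari = "\u0915\u093f"  # displayed as कि: consonant stored first (logical order)

print(code_point_names(thai))
# ['THAI CHARACTER SARA E', 'THAI CHARACTER KO KAI']
print(code_point_names(devanagari))
# ['DEVANAGARI LETTER KA', 'DEVANAGARI VOWEL SIGN I']
```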

Thai is complex because there is no reliable algorithm to convert from the visual order (encoded) to the logical order (that would be useful for searches and collation). In practice, a Thai collation system comes with a large database containing most Thai words or radicals encoded in logical order. This database could be used to create useful lexical entities in LimeWire, but it is large, and may be subject to some copyright restrictions (I don't know the exact status of the Thai database that comes with IBM's ICU, i.e. whether it can be redistributed freely, like ICU itself).

If LimeWire incorporated this database for Thai users, maybe this should be an optional download, because of its size.

As far as I know, no other Asian language needs such a database for collation, but such a database may be needed by a lexer to split sentences into keywords, due to the absence of mandatory spaces. But it's possible that these Asian users have learned to insert spaces or punctuation within their filenames to help the indexing of their files. We have no feedback about this from Chinese or Japanese users, so I don't know if their attempts to search files in their language are successful or not. If not, maybe we should consider implementing at least some basic lexer like the one I described above (every 2 or 3 or 4 characters within a sequence of letters of the same Asian script).

Lord of the Rings · #4 (**permalink**) December 11th, 2004

I may try to research around to see what I can find out (of those that search in their native language) & see what their responses are. But I won't reply in a hurry it could take quite some time. Particularly this time of year (people on breaks & on holidays overseas, etc.)

Thai font licenced! Well I guess that would explain the limitation of characters. Some of mine come freely from a university apparently. So how does Java handle these fonts? Seem to be fine from here. Everytime I visit hotm'l it's in thai text. Not all correctly displayed but.

How much do you think LW would pay me if I translated a version totally to Thai & had OSX type iTunes support. lol (ummm just a joke!)

verdyp · #5 (**permalink**) December 11th, 2004

I won't reply to you about the payment. Someone from the LimeWire crew would be better placed to reply to you privately.

I contribute to LimeWire as a free open-source contributor, managing most of the work with candidate translators, validating their input, testing it in LimeWire, helping LimeWire with beta-tests, or helping with proposed optimizations or corrections.

I'm not paid to do that, and I have no contractual obligation with LimeWire; I just need to respect its working etiquette. For these reasons, LimeWire has granted me limited access to their development platform, and I have received a couple of checks and some goodies as a way to thank me for my past contributions.

verdyp · #6 (**permalink**) December 11th, 2004

Note that I already have some open-source lexers, but they are implemented in C, not Java (however this source is not difficult to port).
The only real question is the validity of these lexicon extractors, but some of them are used in well-known search engines, such as mnoGoSearch, which comes with a lexicon extractor that:
- first extracts tokens from text using separators, returning 1-char tokens for Han and Thai characters
- then uses dynamic programming to determine the words from multiple candidate paths, based on word frequencies, with a maximum path length (this requires a word frequency dictionary, available for Thai and Chinese, Thai being the smallest, Mandarin/simplified Chinese containing 3 times more entries, and Traditional Chinese being twice the size of Mandarin).
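
As a rough idea of what the second step looks like, here is a small Python sketch of frequency-based segmentation by dynamic programming. It is not the mnoGoSearch code: the function name `segment`, the `max_word_len` parameter, the penalty for unknown characters and the toy frequency counts are all assumptions made for the example.

```python
import math


def segment(text, freq, max_word_len=4):
    """Split `text` into the most probable word sequence (unigram model)."""
    total = sum(freq.values())
    # Unknown single characters get a low score so that a path always exists.
    unknown = math.log(0.5 / total)
    # best[i] = (score, start of last word) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                score = math.log(freq[word] / total)
            elif len(word) == 1:
                score = unknown
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk back from the end to recover the chosen words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy counts, invented for this example only.
toy_freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 10, "起源": 30}
print(segment("研究生命起源", toy_freq))
# ['研究', '生命', '起源']
```

With these counts the path through "研究" and "生命" scores higher than the one through "研究生" and "命", which is the kind of ambiguity a plain longest-match splitter gets wrong.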

The algorithm implementation is small and efficient, but the main problem is to encode the dictionaries in a small and efficient way. For Thai, this could fit in less than 100KB, but for Mandarin it would require about 300 to 400KB, and for Traditional Chinese about 1MB (in memory at run-time)...

murasame · #7 (**permalink**) December 12th, 2004

Poor Yukino: if he/she is still trying to follow this thread and she/he is using some kind of translating software to understand what is being talked about then he/she is pretty much done for.

verdyp · #8 (**permalink**) December 12th, 2004

How can we avoid being technical in such discussions?
At the time when this 3-character limit was added, it was because there was a huge and unnecessary traffic caused by searches with too many results like "the", or "mp3", or even "*".
LimeWire has integrated some filters forcing its users to be more selective, so that requests will not be randomly routed throughout a large part of the network, where there are too many results, so that these requests will completely fill the available bandwidth.

The problem with those requests is not that not all results will be returned (due to bandwidth limitation), but the fact that they will completely fill the shared bandwidth that would otherwise allow routing more specific and more useful requests. This limitation is inherent to the propagation model of these requests: they behave, statistically, exactly like network-wide broadcasts, consuming an extremely large bandwidth both for the broadcast requests themselves and, even worse, for the distributed responses spread throughout the network.

I note however that, since Gnutella has evolved, the routing of responses directly to the requester will use less shared bandwidth, because many intermediate nodes will not have to support this traffic: the only node that will be overwhelmed by this traffic will be the requesting host itself, whose input bandwidth will be completely saturated by responses. But LimeWire now has an algorithm to easily limit and control this incoming flow. So these replies are less of a problem than they were in the past. It remains true that these requests will still propagate as broadcasts, without efficient routing. But LimeWire integrates in its router some algorithms that limit the impact of broadcasts, by not routing a request immediately to all candidate directions.

If you look at the "what's new?" feature, you'll note that it looks like a request that can match many replies from many hosts, possibly nearly all! This feature now works marvelously, and it's possible that these "greedy" short requests would no longer be a problem in LimeWire.

These algorithms are mostly heuristics; they are not perfect. So we still need to carefully study the impact if we lower these limits. What is clear is that the most problematic greedy requests in the past were requests containing only ASCII letters. The 3-character limit was imagined at a time when only ASCII requests were possible on Gnutella, so it works well only in English and some languages rarely present on the web. Without this limit, we had too many files containing "words" with 1 or 2 ASCII letters or digits. In other words, these requests were not selective enough.

But as we can now search for international strings, with accents on Latin letters, or using other scripts, the 3-character minimum becomes excessive, because a search for a 1-letter or 2-letter Han ideographic word will most often be more selective than a search for a 3-letter English word. In fact Latin letters with accents, or even Greek or Cyrillic letters, are much less present on the network, even on hosts run by users who use these scripts natively for naming a part of their shared files.

Note however that if I search for "Pô", this search may look selective for the French name of a river in the north of Italy, but in fact the effective search string will be "po" (because searches are matched and routed so that minor differences of spelling are hidden: I can search for "café" and I get results with "CAFE" or "Café" or "cafes", or results where the acute accent above the e is encoded separately after the letter e as a combining diacritic, because of the way various hosts and systems encode or strip this accent in their filenames...). So a search for "Pô" is still not selective enough. This is a place for later improvement, with more dynamic behavior based on actual frequencies of results.
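
The folding described above can be approximated in a few lines of Python. This is a sketch under assumptions, not the exact rules LimeWire applies: the function name `fold` is made up, and it handles only case and accent differences on Latin text, not plural forms such as "cafes".

```python
import unicodedata


def fold(text):
    """Decompose, drop combining marks, then ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("Pô"))          # po
print(fold("café"))        # cafe
print(fold("CAFE"))        # cafe
print(fold("Cafe\u0301"))  # cafe (accent encoded as a separate combining mark)
```

Letters such as "ø" and "æ" have no Unicode decomposition, so a folding like this leaves them unchanged unless an explicit mapping table is added.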

For now, we must keep this limit for ASCII letters (which applies to nearly all Latin letters, with a few exceptions like "ø" or "æ", which may be decomposed into "o" and "ae", or Latin letters like the Icelandic/Old English "thorn", the Nordic "eth", or the rare "esh"), but I don't see why we should not relax it for non-Latin scripts:

Notably Cyrillic letters (for Russian, Bulgarian, Serbian...), Han ideographs (Chinese Hanzi, Japanese Kanji or Korean Hanja), Hiragana and Katakana syllables (for Japanese), and less urgently the Greek alphabet, the Hebrew and Arabic abjads, and some Indian scripts.
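
A relaxed rule of this kind could look like the Python sketch below. Everything in it is hypothetical: the thresholds in `MIN_LENGTH` are invented for illustration and are not values used or proposed by LimeWire, and the script is guessed from the first word of the Unicode character name of the keyword's first character.

```python
import unicodedata

# Hypothetical per-script minimum keyword lengths, for illustration only.
MIN_LENGTH = {"LATIN": 3, "CYRILLIC": 2, "GREEK": 2,
              "HIRAGANA": 2, "KATAKANA": 2, "CJK": 1}
DEFAULT_MIN = 3  # anything unlisted keeps the old 3-character limit


def is_selective_enough(keyword):
    """Apply a minimum length that depends on the script of the keyword."""
    if not keyword:
        return False
    # First word of the character name, e.g. "LATIN", "CYRILLIC", "CJK".
    script = unicodedata.name(keyword[0], "").split(" ")[0]
    return len(keyword) >= MIN_LENGTH.get(script, DEFAULT_MIN)


print(is_selective_enough("po"))   # False: 2 Latin letters
print(is_selective_enough("水"))   # True: a single Han ideograph
print(is_selective_enough("кот"))  # True: 3 Cyrillic letters
```

Such a check would need to run on the folded search string, so that "Pô" is still treated as the 2-letter "po".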
