December 11th, 2004
verdyp
LimeWire is International
 
Join Date: January 13th, 2002
Location: Nantes, FR; Rennes, FR
Posts: 306

You seem to have interesting and valuable knowledge about Asian scripts. If you have some programming experience, could you join our team of open-source contributors to help further improve the internationalization of LimeWire?

Unfortunately, my knowledge of these scripts is only theoretical, based on the work done in the Unicode standard and related projects such as ICU and the UniHan properties, not on linguistics and semantics.

One thing for which we have no clue is the support of Thai (which unfortunately uses visual ordering in Unicode, for compatibility with the legacy national TIS-620 standard, instead of the logical ordering used in other scripts; in addition, Thai, like many other Asian languages, does not use spaces to separate words).

In the past, I proposed indexing Asian filenames by splitting them arbitrarily into units of 2 or 3 character positions, but the number of generated keywords would have been a bit too high:

Suppose the title "ABCDEFG" is present, where each letter is an ideograph, a Hiragana or Katakana letter, or a Thai letter; the generated searchable and indexed keywords would have been:
"AB", "BC", "CD", "DE", "EF", "FG" (if these two-letter "keywords" respect the minimum UTF-8 length restrictions given in my previous message)
"ABC", "BCD", "CDE", "DEF", "EFG"
Note that there may exist situations where a SINGLE character is a significant keyword.
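For illustration only, here is a minimal Java sketch of that proposal (not the actual LimeWire indexer; the class and method names are mine). It generates the overlapping 2- and 3-character units for a run of text, working on code points so that supplementary ideographs are not cut in half:

[code]
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Illustrative sketch of the 2-/3-character splitting idea described above.
 * Not the real LimeWire indexer; names are hypothetical.
 */
public class NGramSketch {

    /** Returns every substring of the given lengths, e.g. "AB".."FG" and "ABC".."EFG". */
    public static Set<String> ngrams(String run, int minLen, int maxLen) {
        // Work on code points so a surrogate pair counts as a single character.
        int[] cps = run.codePoints().toArray();
        Set<String> keywords = new LinkedHashSet<String>();
        for (int len = minLen; len <= maxLen; len++) {
            for (int i = 0; i + len <= cps.length; i++) {
                keywords.add(new String(cps, i, len));
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // "ABCDEFG" stands in for seven ideographs / kana / Thai letters.
        System.out.println(ngrams("ABCDEFG", 2, 3));
        // -> [AB, BC, CD, DE, EF, FG, ABC, BCD, CDE, DEF, EFG]
    }
}
[/code]

For an n-character run this produces (n-1) + (n-2) = 2n-3 keywords, i.e. roughly two indexed keywords per character, which is why the keyword count was considered too high.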

In LimeWire, we currently detect keyword separations with:
- spaces and controls;
- the general category of characters, so that punctuation or symbols become equivalent to spaces;
- the script of the character: in Japanese, a transition between Hiragana, Katakana, Kanji or Latin implies a keyword separation.
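As a rough sketch only (the real LimeWire code is organised differently, and Character.UnicodeScript only appeared in Java 7, long after this post), these three heuristics could look like this:

[code]
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the splitting heuristics listed above; NOT the actual LimeWire code.
 * Handling of the COMMON and INHERITED pseudo-scripts is deliberately simplified.
 */
public class KeywordSplitSketch {

    /** Separator = whitespace, control, punctuation or symbol (general category). */
    static boolean isSeparator(int cp) {
        if (Character.isWhitespace(cp) || Character.isISOControl(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    /** Splits on separators and on any script transition (Hiragana/Katakana/Han/Latin...). */
    static List<String> split(String text) {
        List<String> keywords = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript previousScript = null;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp)) {
                flush(current, keywords);
                previousScript = null;
                continue;
            }
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (previousScript != null && script != previousScript) {
                flush(current, keywords); // script change => keyword boundary
            }
            current.appendCodePoint(cp);
            previousScript = script;
        }
        flush(current, keywords);
        return keywords;
    }

    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }
}
[/code]

Note that this sketch treats every script change as a boundary, so digits and other COMMON-script characters also trigger a split; a production splitter would handle those cases more carefully.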

What we really need is a lexer. There are several open-source projects related to such lexical analysis of Asian texts (notably for implementing automatic translators, input method editors or word processors). The problem is that they often depend on a local database storing the lexical entities, or on long lists of lexical rules. Some projects take another approach: lexical analysis is performed automatically, starting from an initially empty dictionary, by statistical analysis of frequent lexical radicals, so that frequently used prefixes and suffixes can be identified (this is also useful for non-Asian languages, like German, or for other Latin-written European or African languages like Hungarian or Berber).
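For what it's worth, ICU (already mentioned above) ships a dictionary-based word BreakIterator that segments Thai out of the box. Here is a minimal sketch using ICU4J (class and method names of my own choosing); it is just one off-the-shelf option, not the statistical lexer described above:

[code]
import com.ibm.icu.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Minimal sketch: Thai word segmentation with ICU4J's dictionary-based
 * word BreakIterator. One possible off-the-shelf approach, not LimeWire code.
 */
public class ThaiWordsSketch {

    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(new Locale("th", "TH"));
        it.setText(text);
        List<String> words = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String word = text.substring(start, end).trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "สวัสดีชาวโลก" ("hello world") contains no spaces; the bundled dictionary splits it.
        System.out.println(words("สวัสดีชาวโลก"));
    }
}
[/code]

The drawback is exactly the one mentioned above: it relies on a bundled word list (a compiled Thai dictionary) shipped with the library, which adds to the size of the download.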

This is a research domain which is heavily protected by patents, notably those owned by famous dictionary publishers or web search engines like Google... Documents on this subject that are freely available and allow royalty-free redistribution in open-source software are difficult to find... But I suppose this has been studied for centuries in old books whose text is now in the public domain.

My searches in public libraries like the BDF in France have not turned up anything significant (and getting copies of these documents is often complicated or even expensive, unless the books have been converted to digital formats available online). Also, most of these books assume at least a good knowledge of the referenced languages, something I don't have... It is probably easier for native speakers and writers of those languages, which is why we call for contributions from open-source developers...
__________________
LimeWire is international. Help translate LimeWire to your own language.
Visit: http://www.limewire.org/translate.shtml