Gnutella Forums  

#11
verdyp, December 11th, 2004

You seem to have interesting and valuable knowledge about Asian scripts. If you have some programming experience, could you join our team of open-source contributors to help further improve the internationalization of LimeWire?

Unfortunately, my knowledge of these scripts is only theoretical: it is based on the work done in the Unicode standard and on related projects such as ICU and the Unihan properties, not on linguistics or semantics.

One thing for which we have no clue is the support of Thai. Unfortunately, Thai uses a visual ordering in Unicode, inherited from the legacy national TIS-620 standard, instead of the logical ordering used in other scripts. In addition, Thai, like many other Asian languages, does not use spaces to separate words.

In the past, I proposed indexing Asian filenames by splitting them arbitrarily into overlapping units of 2 or 3 character positions, but the number of generated keywords would have been a bit too high:

Suppose that the title "ABCDEFG" is present, where each letter stands for an ideograph, a Hiragana or Katakana letter, or a Thai letter. The generated searchable and indexed keywords would have been:
"AB", "BC", "CD", "DE", "EF", "FG" (if these two-letter "keywords" respect the minimum UTF-8 length restrictions given in my previous message)
"ABC", "BCD", "CDE", "DEF", "EFG"
Note that there may exist situations where a SINGLE character is a significant keyword.
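
To make the keyword count concrete, here is a minimal sketch of that splitting in Python. The function name char_ngrams and its sizes parameter are made up for this illustration and are not LimeWire code. The sketch works on code points, so it ignores combining sequences, and it does not apply the minimum UTF-8 length restriction mentioned above.

```python
def char_ngrams(title, sizes=(2, 3)):
    """Return every overlapping substring of the given lengths, shortest size first.

    Hypothetical helper for illustration only: it slices by code point and
    applies no minimum-length filter.
    """
    grams = []
    for n in sizes:
        # A title of length L yields L - n + 1 substrings of length n.
        for i in range(len(title) - n + 1):
            grams.append(title[i:i + n])
    return grams


keywords = char_ngrams("ABCDEFG")
print(keywords)
print(len(keywords))  # 6 two-letter units + 5 three-letter units = 11
```

For a title of L characters this produces (L - 1) + (L - 2) = 2L - 3 keywords, which is why the index grows so quickly.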

In LimeWire, we currently detect keyword separations using:
- spaces and control characters;
- the general category of characters, so that punctuation and symbols become equivalent to spaces;
- the script of each character: in Japanese, a transition between Hiragana, Katakana, Kanji or Latin implies a keyword separation.
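
The three rules above can be sketched roughly as follows in Python. This is not the LimeWire implementation, only an illustration. Python's standard library has no Unicode script property, so the hypothetical script_of helper approximates the script by the first word of the character name ("HIRAGANA", "KATAKANA", "CJK", "LATIN"), which is an assumption that breaks for digits and for shared marks such as the Katakana-Hiragana prolonged sound mark.

```python
import unicodedata


def script_of(ch):
    # Assumption: the first word of the Unicode character name is used as a
    # rough script label. A real implementation would use the Script property.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_separator(ch):
    # General categories: Z* = spaces/separators, C* = controls,
    # P* = punctuation, S* = symbols. All are treated like spaces.
    return unicodedata.category(ch)[0] in "ZCPS"


def split_keywords(text):
    keywords = []
    current = ""
    current_script = None
    for ch in text:
        if is_separator(ch):
            if current:
                keywords.append(current)
            current, current_script = "", None
            continue
        script = script_of(ch)
        # A change of script inside a run also ends the current keyword.
        if current and script != current_script:
            keywords.append(current)
            current = ""
        current += ch
        current_script = script
    if current:
        keywords.append(current)
    return keywords


print(split_keywords("日本語のテキスト-sample file"))
```

With this approximation a name like "mp3" is split into "mp" and "3", because the digit does not carry the "LATIN" label; a real script lookup would treat digits as common to all scripts.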

What we really need is a lexer. There are several open-source projects related to such lexical analysis of Asian texts (notably for implementing automatic translators, input method editors or word processors). The problem is that they often depend on a local database that stores the lexical entities, or on long lists of lexical rules. Some projects take a different approach: lexical analysis is performed automatically, starting from an initially empty dictionary, by statistical analysis of frequent lexical radicals, so that frequently used prefixes and suffixes can be identified (this is also useful for non-Asian languages like German, or for other Latin-written European or African languages like Hungarian or Berber).
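
As a very crude illustration of the statistical idea (and not the method of any of those projects), one can start from an empty dictionary and simply count how many titles each substring occurs in; substrings shared by several titles become candidate lexical units. The function name frequent_units and its thresholds are assumptions made for this sketch. A real analyzer would need much more than raw counts.

```python
from collections import Counter


def frequent_units(titles, min_len=2, max_len=4, min_count=2):
    """Return substrings that occur in at least min_count different titles.

    Hypothetical sketch: no dictionary is supplied; candidates emerge only
    from how often they recur across the input.
    """
    counts = Counter()
    for title in titles:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(title) - n + 1):
                seen.add(title[i:i + n])
        # Count each substring once per title, not once per occurrence.
        counts.update(seen)
    return {s: c for s, c in counts.items() if c >= min_count}


units = frequent_units(["unhappy", "unkind", "kindness", "happiness"])
print(sorted(units, key=len, reverse=True))
```

On this small Latin-script sample the prefix "un" and the suffix "ness" both come out as shared units, along with many accidental fragments, which shows why raw frequency alone is not enough.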

This is a research domain that is heavily protected by patents, notably those owned by well-known dictionary publishers, or by web search engines like Google... Documents on this subject that are freely available and that would allow royalty-free redistribution in open-source software are difficult to find... But I suppose that this has been studied for centuries in old books whose text is now in the public domain.

My searches in public libraries like the BDF in France have not turned up anything significant (and getting copies of these documents is often complicated or even expensive, unless the books have been converted to digital formats available online). Also, most of these books assume at least a good knowledge of the languages concerned, something I don't have... This is probably easier for natives of the countries where those languages are spoken and written, which is why we call for contributions from open-source developers...
__________________
LimeWire is international. Help translate LimeWire to your own language.
Visit: http://www.limewire.org/translate.shtml