Gnutella Forums - Can someone give us Japanese feedback?

Can someone give us Japanese feedback?

We are in need of feedback on the latest Japanese version of LimeWire - mainly on Windows and OSX. Some users have reported still seeing squares in table headers rather than a full Japanese character set everywhere.

We are looking for confirmation of whether this problem is widespread or an isolated incident. Any feedback from Japanese users would be greatly appreciated. If you respond, please include the version of LimeWire you are using, your OS, and any special setup you have.

Thanks
-greg

That was from me BTW

Just identifying myself a little more ....

Well, it seems like a fix is on the way ...

I am trying to get some resources for better support of Chinese in LimeWire (I am not talking about the interface, which does not cause problems). In fact, this appears to be a very complex issue that is notoriously difficult to solve, even if we add a Chinese dictionary to help lexicalize the searchable items.

Lots of search projects have adopted a simple strategy, which is however quite valid statistically (even if there are some false matches):

For LimeWire, what we can do is generalize the concept of "Token Streams" that parse a stream of one or more tokens and return subtokens. This is the approach taken within the Mozilla Search project: it is flexible enough to allow fine-grained analysis of indexable items, and it allows the generation of indexable lexemes (using pluggable syntaxes that allow parsing and removing common word inflections in various languages, including some infixes like "-ge-" and "-zu-" in German).
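To make the "token stream" idea a bit more concrete, here is a minimal sketch in Java. It is not LimeWire or Mozilla code: the class `TokenChain`, its methods and the two example stages are hypothetical names made up for illustration. Each stage takes one token and returns zero or more subtokens, and the stages are applied in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: a chain of pluggable tokenizer stages.
class TokenChain {
    /** Applies each stage in order; a stage maps one token to zero or more subtokens. */
    @SafeVarargs
    static List<String> run(String input, Function<String, List<String>>... stages) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input);
        for (Function<String, List<String>> stage : stages) {
            List<String> next = new ArrayList<>();
            for (String token : tokens) {
                next.addAll(stage.apply(token));
            }
            tokens = next;
        }
        return tokens;
    }

    /** Example stage: splits a token on a regular expression and drops empty parts. */
    static List<String> splitOn(String token, String regex) {
        List<String> out = new ArrayList<>();
        for (String part : token.split(regex)) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stage 1 splits on whitespace, stage 2 on common file-name punctuation.
        System.out.println(run("My_Song-Title live.mp3",
                t -> splitOn(t, "\\s+"),
                t -> splitOn(t, "[_\\-.]")));
    }
}
```

A language-specific stage (for example one that strips inflections) would plug in as one more function in the chain. This sketch passes plain strings between stages for readability and drops the token types.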

This approach uses typed tokens, which keep the context in which they were identified, to avoid losing too much contextual information. With this typed-token approach, an indexable string would first be split by script type (the separators and punctuation would naturally be isolated), then within each script by script-specific sub-tokenization. Tokenization should not split inside default grapheme clusters (i.e. combining sequences in Latin/Greek/Cyrillic, or LVT syllables in Hangul).

For efficient processing, tokens should not be new String objects but should just be defined by pairs of integer offsets within a shared string buffer or array (this pair can be packed into a single long in method return values without allocating a new object for each token, or stored in fields of a reusable iterator object).
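A rough sketch of these two points together (splitting by script, and returning tokens as offset pairs packed into a long) could look like the following. This is only an illustration under assumptions: the class `ScriptRuns` and its methods are hypothetical, and it relies on `Character.UnicodeScript`, which only exists in Java 7 and later.

```java
import java.lang.Character.UnicodeScript;
import java.util.Arrays;

// Hypothetical illustration only: split a string into runs of one script,
// each run returned as a (start, end) offset pair packed into one long.
class ScriptRuns {
    static long pack(int start, int end) {
        return ((long) start << 32) | (end & 0xFFFFFFFFL);
    }

    static int start(long span) {
        return (int) (span >>> 32);
    }

    static int end(long span) {
        return (int) span;
    }

    /** Returns the script runs of s as packed [start, end) offsets into s. */
    static long[] split(String s) {
        long[] runs = new long[s.length()]; // there are never more runs than chars
        int count = 0;
        int runStart = 0;
        UnicodeScript runScript = null;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Combining marks (script INHERITED) stay with the preceding run,
            // so that a base letter and its marks are never separated.
            if (script == UnicodeScript.INHERITED && runScript != null) {
                script = runScript;
            }
            if (runScript != null && script != runScript) {
                runs[count++] = pack(runStart, i);
                runStart = i;
            }
            runScript = script;
            i += Character.charCount(cp);
        }
        if (runStart < s.length()) {
            runs[count++] = pack(runStart, s.length());
        }
        return Arrays.copyOf(runs, count);
    }

    public static void main(String[] args) {
        String s = "abc \u4E2D\u6587";
        for (long run : split(s)) {
            System.out.println(start(run) + "-" + end(run) + " "
                    + UnicodeScript.of(s.codePointAt(start(run))));
        }
    }
}
```

Spaces and punctuation have the script COMMON, so they end up isolated in their own runs. Some characters used inside words are also COMMON (for example the Katakana-Hiragana prolonged sound mark U+30FC), so a real tokenizer would need extra rules for them.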

Then for Asian texts that typically don't use word separators (including Han, Hiragana, Katakana, Hangul, Thai, Lao, Khmer, Myanmar, Tibetan), the long tokens cannot be split into words algorithmically; but they can become indexable if they are split into fixed-length tuples with sliding starts:

- Han words have an average of 1.7 ideographs, meaning that most words are 1 or 2 ideographs. Words with 1 ideograph are the most common ones and are poor indexable items; we can ignore them and just concentrate on indexing all sequences of 2 successive ideographs (there are Han words made of 3 to 5 ideographs, but indexing any two of them will still retrieve and isolate these words easily: a pair of ideographs can distinctly index more than 400,000 words, much more than what a typical Chinese user will ever learn in a lifetime and use in texts, and also much more than what we index in our keyword QRP tables for efficient routing of queries).

- Hiragana and Katakana words have an average length above 3 (each letter encodes a simple syllable, with few inflections, fewer than in European languages including English or French, so the words tend to be "longer"; however, each Hiragana or Katakana syllable corresponds mostly to about 1 consonant and 1 vowel, so Hiragana and Katakana words have an equivalent average length of 6 to 7 European letters). Combining voicing marks in Hiragana and Katakana modify the leading consonant; they are very significant compared to the smaller importance of diacritics in European languages (where multiple vowels, diphthongs, or even new consonants can also be written and differentiated without always using diacritics, or by using digraphs). For this reason, voicing marks in Hiragana and Katakana must be kept and treated as if they were true letters, like the consonantal digraphs of European languages (rr, ss, ll, ch, sh, ...).

- For Thai (encoded in visual order), we have some difficulties, because a simple "sliding window" would not perfectly match the logical syllabic order of the language. However, Thai sequences can be preprocessed to be reordered into logical order, and then processed like other Hmong-Khmer languages (based on the Southern branch of Brahmic languages also spoken and written today in India) that are typically written without word separators. What is significant here is the average length of words, which can be computed from the average length in syllabic clusters: roughly 2.6, because these languages have a rich system of inflections on vowels. The basic indexing of these languages should be by groups of 2 syllables (keeping the delimitation of syllabic boundaries). [If syllable breaks are kept, reordering Thai syllables from physical to logical order is no longer needed.] There are cases where we could better use 3 syllables, but false matches would be rare: Thai is linguistically close to Bengali (but the latter does use explicit word separation and keeps the logical encoding order), and uses a similar phonetic and syllabic structure, with similar statistical occurrences.

- For Korean, the sliding window strategy can also work, provided that it does not break in the middle of an LVT syllable and that it indexes every group of 2 successive syllables: Korean words are short, as in Chinese, but with simpler phonetics. Modern Korean words tend to be a bit longer, but the script itself supports these imports because it is effectively coded like an alphabet (the basic consonant or vowel jamos) with explicit syllable breaks (an old Korean standard used the same code for leading and trailing consonant jamos, so the syllable breaks were hard to determine, causing lots of problems in rendering the syllabic squares correctly; these texts are deprecated because they are too hard to read even for native Koreans, unless at least word breaks are explicitly marked with spaces). The alternative to separate leading and trailing consonants would have been to encode syllable breaks separately and keep the consonants unified. For indexing, Korean should be parsed by syllabic clusters.
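The sliding-window indexing described in the list above can be sketched as follows for the simple cases (Han, kana, precomposed Hangul). This is an illustrative sketch, not existing LimeWire code; the class `SlidingBigrams` is a hypothetical name. It assumes the input is already a single-script run, and it normalizes to NFC first so that a kana letter followed by a combining voicing mark becomes one letter and conjoining jamos become one Hangul syllable. Thai and the other scripts that need syllable-cluster detection are not handled here.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: index a run of text without word
// separators as overlapping pairs of adjacent code points.
class SlidingBigrams {
    /** Returns every pair of adjacent code points of text, after NFC normalization. */
    static List<String> bigrams(String text) {
        String s = Normalizer.normalize(text, Normalizer.Form.NFC);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int first = s.codePointAt(i);
            int next = i + Character.charCount(first);
            if (next >= s.length()) {
                break; // no second element left for a pair
            }
            int second = s.codePointAt(next);
            out.add(s.substring(i, next + Character.charCount(second)));
            i = next; // slide the window by one element
        }
        return out;
    }

    public static void main(String[] args) {
        // Three ideographs give two overlapping bigrams.
        System.out.println(bigrams("\u4E2D\u6587\u5B57"));
        // KA + combining voicing mark + KI: NFC merges the first two into GA.
        System.out.println(bigrams("\u304B\u3099\u304D"));
    }
}
```

A run of a single ideograph yields no bigram at all, which matches the suggestion above to ignore one-ideograph words. For efficiency the pairs could be returned as offsets instead of new strings; strings are used here only to keep the sketch short.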

Another thing to consider: given that the current indexing of East and South-East Asian languages is so poor and that we'll need to change the tokenizers for them, we should also change the way we compute the keyword hashes for them at the same time. There is no need to change the "fast-hash" function for alphabetic scripts (using letters allocated mostly in blocks below U+0800: see http://www.eki.ee/letter/ for the UCS collections comprising the Multilingual European Subset No. 3, for which no change is needed).

But this hash function (which only keeps the low 8 bits of each Unicode code point) will not work well for Asian texts: we must improve the way they are hashed, notably because we will hash not true keywords but "bigrams" or "trigrams", encoded in a larger domain of code points. For this reason, "keywords" need to have all their bits considered. The hashing function should then take the "high byte" of each UTF-16 code unit into account for these scripts. This will also limit the number of hash collisions with Latin keywords.
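To illustrate why the high byte matters, here is a small sketch. It is not the actual LimeWire hash function: `FoldDemo` and both methods are hypothetical, and they only show a simplified byte-folding step that XORs bytes into a 32-bit value. One version keeps only the low byte of each UTF-16 code unit, the other folds in both bytes.

```java
// Hypothetical illustration only: compare folding the low byte of each
// UTF-16 code unit against folding both bytes into a 32-bit value.
class FoldDemo {
    /** XORs only the low byte of each code unit into successive byte positions. */
    static int foldLowBytes(String s) {
        int xor = 0;
        int j = 0;
        for (int i = 0; i < s.length(); i++) {
            xor ^= (s.charAt(i) & 0xFF) << (j * 8);
            j = (j + 1) % 4;
        }
        return xor;
    }

    /** XORs the low byte and then the high byte of each code unit. */
    static int foldAllBytes(String s) {
        int xor = 0;
        int j = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            xor ^= (c & 0xFF) << (j * 8);
            j = (j + 1) % 4;
            xor ^= (c >>> 8) << (j * 8);
            j = (j + 1) % 4;
        }
        return xor;
    }

    public static void main(String[] args) {
        // Two different Han bigrams whose first ideographs share the low byte 0x2D.
        String a = "\u4E2D\u6587";
        String b = "\u4F2D\u6587";
        System.out.println(foldLowBytes(a) == foldLowBytes(b)); // true: they collide
        System.out.println(foldAllBytes(a) == foldAllBytes(b)); // false: high bytes differ
    }
}
```

With only the low bytes kept, any two ideographs that differ only in their high byte are indistinguishable before the hash is even computed, and a Han bigram can also fold to the same value as a two-letter Latin string. Folding both bytes removes these particular collisions for a single bigram.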

greetings,

I have been attempting to type in nippon. I know one of my problems is the keyboard; I am in the process of searching for the correct one. I'm not all that bright on software issues. I am running LW 4.3.3 beta, English installer version. I like this version. My question is: do I need to install the international version, and did you receive any feedback on special setup requirements? I can read the Japanese (nippon) from this http://www.gnutellaforums.com/showth...threadid=32410 but have been unable to respond in kind.

The "international" version is in fact the same as the "english-only" one, as far as localization support in the LimeWire software itself is concerned.
The only difference is that the "international" version bundles an international version of Java.
But if you have already installed the international Java JRE from the Sun web site, you don't need the "international" installer of LimeWire to work in Japanese.
Just download the "english-only" installer, and you'll get the same version of LimeWire, supporting the same set of languages.
None of the localization issues are in LimeWire itself; they depend on your OS support for international fonts and keyboard input methods, and on the version of Java you have installed.

So, on Windows:

- install or update Java to the latest version (using the control panel, you'll update to the latest release of Java 1.4, but you can install the Java 1.5 JRE, which is extremely stable although Sun still labels it "Beta", only because some of its new features, which LimeWire does not use, are not fully tested; Java 1.5 is still in development by Sun, but all of the Java 1.4 compatible API is already extremely stable and even better than in the Java 1.4 JRE implementation)

- then run LimeWire's "english-only" installer. The installer is smaller because it does not ship Java (Java is detected and the JRE 1.4 is downloaded separately if it is not already installed on your system).

The "international installer" was needed on Windows only to help the most basic users, who don't know how to install Java on their system and want something that will run immediately on a fresh new system.

For MacOS, OSX and Linux, a Java JRE must be installed prior to installing LimeWire (but these users are generally less basic users, and know better how to manage their systems).

On MacOS and MacOSX, Apple provides excellent help for installing a compliant Java JRE, either bundled with the OS or through its online update service. Such active support is not done as well on Windows.

(Microsoft does not want to support Sun's Java, even if it has removed its own non-compliant Java implementation from Windows; things should change because Java development is so important for enterprise applications on Windows today: look at the many IT jobs that DO require a good knowledge of Java J2EE and Oracle today, and compare them to the very few job offers that need .Net skills; the standard Sun Java is a must on Windows and other systems, whatever Microsoft thinks and declares everywhere, and it is used in so many enterprise-critical missions that you can be confident it is extremely secure, unlike the now obsolete and very insecure Microsoft JavaVM).

Note: J2EE is a superset of the J2SE environment in Java. LimeWire does not require the large J2EE (used in enterprises for N-tier applications running on application servers), and just uses the basic J2SE edition running locally on your OS.

Thank you, verdyp. That sounds easy enough to do; just waiting on the keyboard.

Japanese feed back LW pro 4.4.1

Just received new japanese keyboard and hooked it up. In LW I was effectively able to do searches and d/ls. I only ran into a few problems, which were on my end. I am still having problems correctly reading the forum here, but again that is on my end, and a M$ issue, which I will get worked out.

If there is anything in particular you would like me to try, let me know. I will have to switch the keyboards as I normally have the DELL keyboard attached.