I am trying to get some resources for better support of Chinese in LimeWire (I'm not talking about the interface, which does not cause problems). In fact, this appears to be a very complex issue, notoriously difficult to solve, even if we add a Chinese dictionary to help lexicalize the searchable items.
Many search projects have adopted a simple strategy which is nonetheless statistically quite valid (even if it produces some false matches):
For LimeWire, what we can do is generalize the concept of "token streams" that parse a stream of one or more tokens and return subtokens. This is the approach taken within the Mozilla Search project: it is flexible enough to allow fine-grained analysis of indexable items, and it allows the generation of indexable lexemes (using pluggable syntaxes that parse and remove common word inflections in various languages, including infixes like "-ge-" and "-zu-" in German).
This approach uses typed tokens, which keep track of the context in which they were identified, to avoid losing too much contextual information. With this typed-token approach, an indexable string would first be split by script type (separators and punctuation would naturally be isolated), then each script-specific run would get its own sub-tokenization. Tokenization should never split default grapheme cluster boundaries (i.e. combining sequences in Latin/Greek/Cyrillic, or LVT syllables in Hangul).
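To illustrate that first pass (splitting by script before any script-specific sub-tokenization), here is a minimal sketch in Java, assuming a modern JRE with Character.UnicodeScript; the ScriptRun and ScriptSplitter names are illustrative only, not existing LimeWire classes:

import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptRun {
    final UnicodeScript script;
    final int start, end;              // offsets into the shared text, [start, end)
    ScriptRun(UnicodeScript script, int start, int end) {
        this.script = script; this.start = start; this.end = end;
    }
}

final class ScriptSplitter {
    /** First pass: split the text into runs of a single script.
     *  Combining marks (script INHERITED) stay attached to the run of their
     *  base character, so grapheme clusters are never cut; COMMON code points
     *  (spaces, punctuation) come out as separate runs that later stages can
     *  simply discard. */
    static List<ScriptRun> split(String text) {
        List<ScriptRun> runs = new ArrayList<ScriptRun>();
        UnicodeScript current = null;
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s == UnicodeScript.INHERITED && current != null) {
                s = current;           // keep combining sequences in their base's run
            }
            if (current == null) {
                current = s;
            } else if (s != current) {
                runs.add(new ScriptRun(current, runStart, i));
                runStart = i;
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (current != null) {
            runs.add(new ScriptRun(current, runStart, text.length()));
        }
        return runs;
    }
}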
For efficient processing, tokens should not be new String objects; they should just be defined by a pair of integer offsets within a shared string buffer or array (this pair can be packed into a single long in method return values, without allocating a new object for each token, or stored in the fields of a reusable iterator object).
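As a sketch of that "no new String per token" idea (the TokenRef name and its static helpers are only illustrative), a token can be represented like this:

final class TokenRef {
    /** Packs two non-negative 32-bit offsets into a single long. */
    static long pack(int start, int end) {
        return ((long) start << 32) | (end & 0xFFFFFFFFL);
    }
    static int start(long token) {
        return (int) (token >>> 32);   // high 32 bits
    }
    static int end(long token) {
        return (int) token;            // low 32 bits
    }
    /** Only materialize character data when the token is really needed as text. */
    static CharSequence text(CharSequence buffer, long token) {
        return buffer.subSequence(start(token), end(token));
    }
}

A tokenizer's next() method can then return such a packed long (or a reserved sentinel value to signal the end of the stream), and the caller only extracts text() for the few tokens it actually needs as character data.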
Then for Asian texts that typically don't use word separators (including Han, Hiragana, Katakana, Hangul, Thai, Lao, Khmer, Myanmar, Tibetan), the long tokens cannot be split into words algorithmically; but they can become indexable if they are split into fixed-length tuples with sliding starts (see the sketch after this list):
- Han words have an average length of 1.7 ideographs, meaning that most words are 1 or 2 ideographs. Words of a single ideograph are the most common ones and are poor indexable items; we can ignore them and just concentrate on indexing all sequences of 2 successive ideographs (there are Han words made of 3 to 5 ideographs, but indexing any 2 successive ideographs will still isolate these words easily: a pair of ideographs can distinctly index more than 400,000 words, far more than a typical Chinese user will ever learn in his life and use in his texts, and also much more than what we index in our keyword QRP tables for efficient routing of queries).
- Hiragana and Katakana words have an average length above 3 (each letter encodes a simple syllable, with little inflection, less than in European languages including English or French, so the words tend to be "longer"; however, each Hiragana or Katakana syllable corresponds roughly to 1 consonant plus 1 vowel, so Hiragana and Katakana words have an equivalent average length of 6 to 7 European letters). Combining voicing marks in Hiragana and Katakana modify the leading consonant; they are very significant, compared to the smaller importance of diacritics in European languages (where multiple vowels, diphthongs or even new consonants can be written and differentiated without always using diacritics, or by using digraphs). For these reasons, voicing marks in Hiragana and Katakana must be kept and treated as if they were true letters, much like the consonantal digraphs of European languages (rr, ss, ll, ch, sh, ...).
- For Thai (encoded in visual order), we have some difficulties, because a simple "sliding window" would not perfectly match the logical syllabic order of the language. However, Thai sequences can effectively be preprocessed to reorder them into logical order, and then processed like the other Mon-Khmer languages (written with scripts derived from the Southern branch of the Brahmic scripts also used today in India) that are typically written without word separators. What matters here is the average length of words, which can be computed from the average length in syllabic clusters: roughly 2.6, because these languages have a rich system of vowel inflections. The basic indexing of these languages should be by groups of 2 syllables (keeping the delimitation of syllabic boundaries). [If syllable breaks are kept, reordering Thai syllables from visual to logical order is no longer needed.] There are cases where 3 syllables would work better, but false matches would be rare: Thai is linguistically close to Bengali (though the latter does use explicit word separation and keeps the logical encoding order) and uses a similar phonetic and syllabic structure, with similar statistical occurrences.
- For Korean, the sliding-window strategy can also work, provided that it does not break in the middle of an LVT syllable and that it indexes every group of 2 successive syllables: Korean words are short, as in Chinese, but with a simpler phonology. Modern Korean words (notably imported ones) tend to be a bit longer, but the script supports them well because it is effectively coded like an alphabet (the basic consonant and vowel jamos) with explicit syllable breaks. (An old Korean standard used the same codes for leading and trailing consonant jamos, so the syllable breaks were hard to determine, causing lots of problems for rendering the syllabic squares correctly; such texts are deprecated because they are too hard to read even for native Koreans, unless at least word breaks are explicitly marked with spaces. The alternative to distinct leading and trailing consonants would have been to encode syllable breaks separately and keep the consonants unified.) For indexing, Korean should be parsed by syllabic clusters.
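Here is a minimal sketch of the sliding-window tokenization described in the list above, assuming we iterate on default grapheme clusters with java.text.BreakIterator (so combining sequences and precomposed Hangul syllables are never cut). For Han runs each cluster is one ideograph, so the output is exactly the 2-ideograph index terms discussed above; passing n = 3 gives the trigram variant mentioned for Thai. The NGramTokenizer name is illustrative only, and a real implementation would of course return packed offsets rather than Strings:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class NGramTokenizer {
    /** Returns every group of n adjacent grapheme clusters in the run
     *  (a "sliding window" of size n advancing one cluster at a time). */
    static List<String> slidingWindow(String run, int n) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(run);
        List<Integer> breaks = new ArrayList<Integer>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            breaks.add(b);
        }
        // breaks holds N+1 boundaries for N clusters; each n-gram spans
        // the boundaries [i, i + n].
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n < breaks.size(); i++) {
            grams.add(run.substring(breaks.get(i), breaks.get(i + n)));
        }
        return grams;
    }
}

For example, slidingWindow("你好世界", 2) yields "你好", "好世", "世界": any 2-ideograph word in the run is guaranteed to appear among the indexed pairs.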
Another thing to consider: given that the current indexing of East and South-East Asian languages is so poor and that we'll need to change the tokenizers for them, we should at the same time change the way we compute the keyword hashes for them. There is no need to change the "fast-hash" function for alphabetic scripts (which use letters allocated mostly in blocks below U+0800: see
http://www.eki.ee/letter/ for the UCS collections comprising the Multilingual European Subset No. 3, for which no change is needed).
But this hash function (which only keeps the low 8 bits of each Unicode code point) will not work well for Asian texts: we must improve the way they are hashed, notably because we will hash not true keywords but "bigrams" or "trigrams", encoded in a much larger range of code points. For this reason, these "keywords" need to have all their bits considered: the hashing function should take the high byte of each UTF-16 code unit into account for these scripts. This will also limit the number of hash collisions with Latin keywords.
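As an illustration of that point only (this is not LimeWire's actual QRP HashFunction, whose mixing and bit-reduction details would have to be kept compatible), a hash that folds both bytes of every UTF-16 code unit could look like this:

final class WideCharHash {
    /** Illustrative sketch: mixes both bytes of every UTF-16 code unit into
     *  the hash, then keeps only 'bits' bits, as a QRP table position of
     *  size 2^bits would. */
    static int hash(CharSequence keyword, int bits) {
        int h = 0;
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            h = 31 * h + (c & 0xFF);          // low byte (sufficient for Latin)
            h = 31 * h + ((c >>> 8) & 0xFF);  // high byte, significant for CJK
        }
        return h & ((1 << bits) - 1);         // reduce to the table size
    }
}

For Latin keywords the high byte is almost always zero, so it contributes nothing to the mix, while CJK bigrams or trigrams get all 16 bits of each code unit taken into account instead of only the low 8.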