Gnutella Forums - View Single Post - Hallo! We want being able to point to one more question

verdyp · #13 (**permalink**) December 11th, 2004

In principle, Lao and Khmer will be less complex than Thai, because they were encoded in Unicode using the logical model, which makes full-text searches easier to implement. For LimeWire, it means that Lao and Khmer can be handled like other Indian/Brahmic scripts (we don't care if the visual order differs from the logical order, or if there exists input methods that use a visual order, given that the conversion of these texts to Unicode will use a logical order.)

But it's not true for Thai, because Thai is supported in Thailand by an old and widely used TIS-620 standard, made since long in the 70's by IBM for the Thai government which made it mandatory for the representation of Thai texts. Unicode has then borrowed this situation, because it wanted to keep a roundtrip compatibility with the very large existing corpus of Thai texts in computers, files, databases, and input methods, encoded since long with TIS-620 or one of its predecessors.

Thai has always been encoded with the visual order where some letters that are logically after another one (including in phonetic, structure, or collation) must be entered before it, because it will be written graphically in a single right-to-left direction (this visual ordering comes from the legacy limitations of font and display technologies in the 70s). India has chosen to preserve the logical ordering, which looks more like the way Indian users think about their language, and how they spell it orally (there has existed some typewriters in India using the visual orer, but these were considered difficult or illogical to use; for computers, the ISCII standard was created with the correct assumption that computers would make themselves the visual ordering.)

Thai is complex because there does not exist a reliable algorithm to convert from the visual order (encoded) to the logical order (that would be useful for searches and collation). In practice, a Thai collation system comes with a large database containing most Thai words or radicals encoded in logical order. This database could be used to create useful lexical entities in LimeWire, but it is large, and may be subject to some copyright restrictions (I don't know exactly the status of the Thai database that comes with IBM's ICU, i.e. if it can be redistributed freely, like ICU itself).

If LimeWire incorporated this database for Thai users, may be this should be an optional download, because of its size.

As far as I know, no other Asian language needs such a database for collation, but such a database may be needed by a lexer to split sentences into keywords, due to the absence of mandatory spaces. But it's possible that these Asian users have learned to insert spaces or punctuation within their filenames to help the indexation of their files. We have no feedback about this from Chinese or Japanese users, so I don't know if their attempts to search files in their language is successful or not. If not, may be we should consider implementing at least some basic lexer like the one I exposed above (every 2 or 3 or 4 characters within a sequence of letters of the same Asian script).