Note that I already have some open-source lexers, but they are implemented in C, not Java (the source would not be difficult to port, though).
The only real question is the validity of these lexicon extractors, but some of them are used in well-known search engines such as mnoGoSearch, which ships with a lexicon extractor that:
- first extracts tokens from the text using separators, falling back to 1-character tokens for Han and Thai characters;
- then uses dynamic programming to choose the word segmentation among the candidate paths, based on word frequencies and a maximum path length (this requires a word-frequency dictionary, available for Thai and Chinese; the Thai dictionary is the smallest, the Mandarin/Simplified Chinese one contains about three times more entries, and the Traditional Chinese one is about twice the size of the Mandarin one). A sketch of the segmentation step follows this list.
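This is not mnoGoSearch's actual code, only a minimal Java sketch of the dictionary-plus-dynamic-programming idea under assumed details: a Map-based frequency table, unigram log-probability scoring, and a maxWordLength bound standing in for the maximum path length.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DictionarySegmenter {
    private final Map<String, Integer> freq;  // hypothetical word -> frequency table
    private final int maxWordLength;          // maximum word length explored by the DP
    private final double total;               // sum of all frequencies, for probabilities

    public DictionarySegmenter(Map<String, Integer> freq, int maxWordLength) {
        this.freq = freq;
        this.maxWordLength = maxWordLength;
        this.total = freq.values().stream().mapToDouble(Integer::doubleValue).sum();
    }

    /** Segments a run of Han or Thai characters into the most probable word sequence. */
    public List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1];  // best log-probability of a segmentation of text[0..i)
        int[] prev = new int[n + 1];        // start index of the last word in that segmentation
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                if (best[start] == Double.NEGATIVE_INFINITY) continue;
                Integer f = freq.get(text.substring(start, end));
                double logProb;
                if (f != null) {
                    logProb = Math.log(f / total);
                } else if (end - start == 1) {
                    logProb = Math.log(0.5 / total);  // assumed smoothing for unknown characters
                } else {
                    continue;                          // unknown multi-character strings are skipped
                }
                if (best[start] + logProb > best[end]) {
                    best[end] = best[start] + logProb;
                    prev[end] = start;
                }
            }
        }

        // Walk back through the best path to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(0, text.substring(prev[end], end));
        }
        return words;
    }
}

Called on each run of Han or Thai characters produced by the separator pass, it returns the highest-probability word sequence in O(n * maxWordLength) dictionary lookups.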
The algorithm implementation is small and efficient, but the main problem is encoding the dictionaries in a small and efficient way. For Thai this could fit in less than 100 KB, but Mandarin would require about 300 to 400 KB, and Traditional Chinese about 1 MB (in memory at run time).
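For a rough idea of how such figures can be reached (this is only an assumed encoding, not the one mnoGoSearch uses), the entries can be concatenated into a single sorted char array with parallel offset and frequency arrays, so each entry costs about 2 bytes per character plus 8 bytes of integer overhead, and lookup is a plain binary search:

import java.util.Map;
import java.util.TreeMap;

/** Compact read-only dictionary: all words concatenated in one char[],
 *  with parallel offset and frequency arrays; lookup is a binary search. */
public class CompactDictionary {
    private final char[] chars;   // all words, sorted, concatenated
    private final int[] offsets;  // offsets[i] = start of word i, offsets[n] = chars.length
    private final int[] freqs;    // freqs[i] = frequency of word i

    public CompactDictionary(Map<String, Integer> entries) {
        TreeMap<String, Integer> sorted = new TreeMap<>(entries);
        int n = sorted.size();
        offsets = new int[n + 1];
        freqs = new int[n];
        chars = new char[sorted.keySet().stream().mapToInt(String::length).sum()];
        int i = 0, pos = 0;
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            offsets[i] = pos;
            freqs[i] = e.getValue();
            String word = e.getKey();
            for (int j = 0; j < word.length(); j++) chars[pos++] = word.charAt(j);
            i++;
        }
        offsets[n] = pos;
    }

    /** Returns the frequency of the word, or -1 if it is not in the dictionary. */
    public int frequency(CharSequence word) {
        int lo = 0, hi = freqs.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(mid, word);
            if (cmp == 0) return freqs[mid];
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    // Compares stored word at index against the query, in UTF-16 code unit order.
    private int compare(int index, CharSequence word) {
        int start = offsets[index], end = offsets[index + 1];
        int common = Math.min(end - start, word.length());
        for (int k = 0; k < common; k++) {
            int d = chars[start + k] - word.charAt(k);
            if (d != 0) return d;
        }
        return (end - start) - word.length();
    }
}

A trie or DAWG would compress shared prefixes further, at the cost of a more involved build step.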