If you need a complete NLP analyzer, check out http://nlp.stanford.edu
If you need a simple one-off solution for Chinese, here is what I used.
First, load the Chinese dictionary into a trie (prefix tree) to keep memory usage down. Then walk through each sentence character by character: starting at the current character, grow the substring and check whether it exists in the dictionary. Whenever a match is found, keep the longest one as a token and continue after it. The algorithm could probably be improved a great deal, but it has served me well. :)
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.StringUtils;

public class ChineseWordTokenizer implements WordTokenizer {

    // Give up on a starting position after this many consecutive dictionary misses.
    private static final int MAX_MISSES = 6;

    // example implementation: http://www.kennycason.com/posts/2012-03-20-java-trie-prefix-tree.html
    private StringTrie library;
    private boolean loadTraditional;

    public ChineseWordTokenizer() {
        this(true);
    }

    public ChineseWordTokenizer(boolean loadTraditional) {
        this.loadTraditional = loadTraditional;
        loadLibrary();
    }

    @Override
    public String[] tokenize(String sentence) {
        final List<String> words = new ArrayList<>();
        String word;
        for (int i = 0; i < sentence.length(); i++) {
            int len = 1;
            boolean loop = false;
            int misses = 0;
            int lastCorrectLen = 1;
            boolean somethingFound = false;
            do {
                // Grow the candidate substring and test it against the dictionary.
                word = sentence.substring(i, i + len);
                if (library.contains(word)) {
                    somethingFound = true;
                    lastCorrectLen = len;
                    loop = true;
                } else {
                    misses++;
                    loop = misses < MAX_MISSES;
                }
                len++;
                if (i + len > sentence.length()) {
                    loop = false;
                }
            } while (loop);
            if (somethingFound) {
                // Keep the longest match and skip past it.
                word = sentence.substring(i, i + lastCorrectLen);
                if (StringUtils.isNotBlank(word)) {
                    words.add(word);
                    i += lastCorrectLen - 1;
                }
            }
        }
        return words.toArray(new String[words.size()]);
    }

    private void loadLibrary() {
        library = new StringTrie();
        library.loadFile("classify/nlp/dict/chinese_simple.list");
        if (loadTraditional) {
            library.loadFile("classify/nlp/dict/chinese_traditional.list");
        }
    }
}
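The WordTokenizer interface and StringTrie class come from the surrounding project and are not shown in the answer; a trie implementation is linked in the comment above. Below is a minimal, hypothetical sketch of both, assuming the dictionary files contain one word per line and are loaded from the classpath, just so the tokenizer above compiles and runs on its own.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

interface WordTokenizer {
    String[] tokenize(String sentence);
}

class StringTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a single word, one trie node per character. */
    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns true only if this exact word was added. */
    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    /** Loads a dictionary resource from the classpath, one word per line (assumed format). */
    public void loadFile(String resourcePath) {
        InputStream in = getClass().getClassLoader().getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalArgumentException("Dictionary not found: " + resourcePath);
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    add(line);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to load dictionary: " + resourcePath, e);
        }
    }
}

A map-backed trie like this keeps lookups proportional to the word length, which is what makes the repeated contains calls in the tokenizer's inner loop cheap even with a dictionary of ~100k entries.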
Here is the unit test:
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestChineseWordTokenizer {

    @Test
    public void test() {
        long time = System.currentTimeMillis();
        WordTokenizer tokenizer = new ChineseWordTokenizer();
        System.out.println("load time: " + (System.currentTimeMillis() - time) + " ms");

        String[] words = tokenizer.tokenize("弹道导弹");
        print(words);
        assertEquals(1, words.length);

        words = tokenizer.tokenize("美国人的文化.dog");
        print(words);
        assertEquals(3, words.length);

        words = tokenizer.tokenize("我是美国人");
        print(words);
        assertEquals(3, words.length);

        words = tokenizer.tokenize("政府依照法律行使执法权,如果超出法律赋予的权限范围,就是“滥用职权”;如果没有完全行使执法权,就是“不作为”。两者都是政府的错误。");
        print(words);

        words = tokenizer.tokenize("国家都有自己的政府。政府是税收的主体,可以实现福利的合理利用。");
        print(words);
    }

    private void print(String[] words) {
        System.out.print("[ ");
        for (String word : words) {
            System.out.print(word + " ");
        }
        System.out.println("]");
    }
}
And here are the results:
Load Complete: 102135 Entries
load time: 236 ms
[ 弹道导弹 ]
[ 美国人 的 文化 ]
[ 我 是 美国人 ]
[ 政府 依照 法律 行使 执法 权 如果 超出 法律 赋予 的 权限 范围 就是 滥用职权 如果 没有 完全 行使 执法 权 就是 不 作为 两者 都 是 政府 的 错误 ]
[ 国家 都 有 自己 的 政府 政府 是 税收 的 主体 可以 实现 福利 的 合理 利用 ]