How to tokenize a document in Chinese

I have a document written in Chinese that I need to tokenize and store in a database table. I tried Lucene's CJKBigramFilter, but all it does is join every two adjacent characters, and the resulting pairs mean something different from what the document says. Suppose the file contains the line "Hello My name is Pradeep", which in Chinese is "你好 我 的 名字 是 普拉迪普". When I tokenize it, I get the two-character tokens below:

你好 - Hello
名字 - Name
好 我 - Well, I
字 是 - The word is
我 的 - My
拉迪 - Radi
是 普 - Is S and P
普拉 - Pula
的 名 - In the name of
迪普 - Dipp

All I want is for the tokens to match the words of the English translation. I am using Lucene for this; if you know of a better open-source option, please point me to it. Thanks in advance.
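The overlapping pairs listed in the question can be reproduced with a few lines of plain Java. This is only an illustrative sketch, not Lucene's implementation: BigramDemo and bigrams are made-up names, and it assumes the spaces have already been removed from the input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene's CJKBigramFilter.
class BigramDemo {

    // Returns every pair of adjacent characters in the text.
    // Assumes no whitespace and no characters outside the Basic Multilingual Plane.
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("你好我的名字是普拉迪普"));
    }
}
```

For the input 你好我的名字 this yields 你好, 好我, 我的, 的名, 名字. Pairs such as 好我 straddle a word boundary, which is why they do not correspond to any word in the sentence.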

2 answers

It may be too late, but you could try U-Tokenizer, a free online API. See http://tokenizer.tool.uniwits.com/


If you need a complete NLP analyzer, check out http://nlp.stanford.edu

If you need a simple one-off solution for Chinese, here is what I used.

First, load the Chinese dictionary into a trie (prefix tree) to reduce the memory footprint. Then I walked through each sentence one character at a time, checking which substrings starting at that character exist in the dictionary. The longest match found became a token. The algorithm could probably be improved a lot, but it served me well. :)

    import java.util.ArrayList;
    import java.util.List;

    // Assumption: StringUtils comes from Apache Commons Lang 3
    import org.apache.commons.lang3.StringUtils;

    public class ChineseWordTokenizer implements WordTokenizer {

        // Stop extending a candidate after this many failed dictionary lookups
        private static final int MAX_MISSES = 6;

        // example implementation: http://www.kennycason.com/posts/2012-03-20-java-trie-prefix-tree.html
        private StringTrie library;

        private boolean loadTraditional;

        public ChineseWordTokenizer() {
            this(true);
        }

        public ChineseWordTokenizer(boolean loadTraditional) {
            // Set the flag before loading, because loadLibrary() reads it
            this.loadTraditional = loadTraditional;
            loadLibrary();
        }

        @Override
        public String[] parse(String sentence) {
            final List<String> words = new ArrayList<>();
            String word;
            for (int i = 0; i < sentence.length(); i++) {
                int len = 1;
                boolean loop = false;
                int misses = 0;
                int lastCorrectLen = 1;
                boolean somethingFound = false;
                // Grow the substring starting at i and remember the longest dictionary match
                do {
                    word = sentence.substring(i, i + len);
                    if (library.contains(word)) {
                        somethingFound = true;
                        lastCorrectLen = len;
                        loop = true;
                    } else {
                        misses++;
                        loop = misses < MAX_MISSES;
                    }
                    len++;
                    if (i + len > sentence.length()) {
                        loop = false;
                    }
                } while (loop);

                if (somethingFound) {
                    word = sentence.substring(i, i + lastCorrectLen);
                    if (StringUtils.isNotBlank(word)) {
                        words.add(word);
                        // Skip the characters consumed by this token
                        i += lastCorrectLen - 1;
                    }
                }
            }
            return words.toArray(new String[words.size()]);
        }

        private void loadLibrary() {
            library = new StringTrie();
            library.loadFile("classify/nlp/dict/chinese_simple.list");
            if (loadTraditional) {
                library.loadFile("classify/nlp/dict/chinese_traditional.list");
            }
        }
    }

Here is the unit test:

    // Assumption: JUnit 4
    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class TestChineseWordTokenizer {

        @Test
        public void test() {
            long time = System.currentTimeMillis();
            WordTokenizer tokenizer = new ChineseWordTokenizer();
            System.out.println("load time: " + (System.currentTimeMillis() - time) + " ms");

            String[] words = tokenizer.parse("弹道导弹");
            print(words);
            assertEquals(1, words.length);

            words = tokenizer.parse("美国人的文化.dog");
            print(words);
            assertEquals(3, words.length);

            words = tokenizer.parse("我是美国人");
            print(words);
            assertEquals(3, words.length);

            words = tokenizer.parse("政府依照法律行使执法权,如果超出法律赋予的权限范围,就是“滥用职权”;如果没有完全行使执法权,就是“不作为”。两者都是政府的错误。");
            print(words);

            words = tokenizer.parse("国家都有自己的政府。政府是税收的主体,可以实现福利的合理利用。");
            print(words);
        }

        private void print(String[] words) {
            System.out.print("[ ");
            for (String word : words) {
                System.out.print(word + " ");
            }
            System.out.println("]");
        }
    }

And here are the results:

    Load Complete: 102135 Entries
    load time: 236 ms
    [ 弹道导弹 ]
    [ 美国人 的 文化 ]
    [ 我 是 美国人 ]
    [ 政府 依照 法律 行使 执法 权 如果 超出 法律 赋予 的 权限 范围 就是 滥用职权 如果 没有 完全 行使 执法 权 就是 不 作为 两者 都 是 政府 的 错误 ]
    [ 国家 都 有 自己 的 政府 政府 是 税收 的 主体 可以 实现 福利 的 合理 利用 ]
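The StringTrie used by the tokenizer is not shown here; the real one is described in the post linked in the code comment and may well differ. As a rough idea of what the dictionary lookup needs, here is a minimal sketch. SimpleStringTrie, add and contains are names chosen for illustration, and the file loading is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal stand-in for StringTrie; not the original class.
class SimpleStringTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Adds one word, creating a node per character as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    // True only if this exact word was added; a mere prefix does not count.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}
```

Words that share a prefix, such as 美国 and 美国人, share the nodes for that prefix, which is where the memory saving comes from. A loadFile method could presumably read one dictionary entry per line and call add for each.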

Source: https://habr.com/ru/post/1434935/

