Get word count from a string in Unicode (in any language)

I want to count the words in a string. That sounds simple; the catch is that the string can be in any language, and I cannot predict which one.

So I need a function with the signature int getWordCount(String) that produces the following sample output:

    getWordCount("供应商代发发货") => 7
    getWordCount("This is a sentence") => 4

Any help on how to proceed would be appreciated :)

+6
5 answers

The concept of a word can be trivial or complex. Here is what the Apache Stanbol documentation says:

Word Tokenization: The Stanbol Enhancer needs to detect single words in order to process text. Although this is trivial for most languages, it is quite a challenge for some East Asian languages, e.g. Chinese, Japanese, Korean. Unless otherwise configured, Stanbol will use whitespace to tokenize words.

So if your notion of a word is linguistic rather than syntactic, you should use an NLP toolkit.

My preferred Java solution: Apache OpenNLP

Note: I used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It suggests that there are 4 words, not 7. Here is the (rather fragmented) result, cut and pasted:

Simplified / Pinyin / English definition / Traditional / HSK

供应商 / gōng yìng shāng / supplier / 供應商

代 / dài / to replace / to act on behalf of others / generation / dynasty / age / period / (historical) era / (geological) eon

发 / fā / to send out / to show (one's feeling) / to issue / to develop / classifier for gunshots (rounds) / 發 / HSK 4

发 / fà / hair / Taiwan pr. [fa3] / 髮

发货 / fā huò / to dispatch / to send out goods / 發貨

The first three characters form a single word.

+5
source

The standard API provides BreakIterator for this kind of boundary analysis, but the locale support in Oracle Java 7 does not break up the sample string.

When I used the ICU4J v51.1 BreakIterator, it broke the sample into [供应, 商代, 发, 发, 货].

    // import com.ibm.icu.text.BreakIterator;
    String sentence = "\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27";
    BreakIterator iterator = BreakIterator.getWordInstance(Locale.CHINESE);
    iterator.setText(sentence);

    List<String> words = new ArrayList<>();
    int start = iterator.first();
    int end = iterator.next();
    while (end != BreakIterator.DONE) {
        words.add(sentence.substring(start, end));
        start = end;
        end = iterator.next();
    }

    System.out.println(words);

Note: I used Google Translate to guess that "供应 商代 发 发货" is Chinese. Obviously, I do not speak the language, so I cannot comment on whether the output is correct.
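To turn that boundary loop into the getWordCount the question asks for, something like the following might work. It is only a sketch using the JDK's own java.text.BreakIterator rather than ICU4J; the class and method names are made up for illustration, and skipping segments that contain no letter or digit is my own assumption about what should count as a word. How the JDK segments Chinese text depends on the runtime, so only the English result is shown.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class BreakIteratorWordCount {

    // Counts the segments between word boundaries that contain
    // at least one letter or digit (so spaces and punctuation are skipped).
    static int count(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        int count = 0;
        int start = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
            start = end;
            end = iterator.next();
        }
        return count;
    }

    // Walks the segment code point by code point.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        int i = start;
        while (i < end) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(count("This is a sentence", Locale.ENGLISH)); // 4
    }
}
```

Swapping the import for com.ibm.icu.text.BreakIterator should give ICU4J's segmentation with the same loop.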

+6
source

Assuming that each language has one (or more) word separators, and you can create a regular expression for this separator, then the problem can be solved as follows:

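    public String separatorForLanguage(char unicodeChar) {
        // Find out which language unicodeChar belongs to
        return ""; // return the separator regex of that language
    }

    public int wordCount(String sentence) {
        if (sentence.isEmpty()) {
            return 0; // charAt(0) would throw on an empty string
        }
        char unicodeChar = sentence.charAt(0);
        String separator = separatorForLanguage(unicodeChar);
        if (separator.isEmpty()) {
            // No separator: every character counts as a word. Counting code
            // points avoids the leading empty string that split("") returns
            // on Java 7 but not on Java 8 and later.
            return sentence.codePointCount(0, sentence.length());
        }
        return sentence.split(separator).length;
    }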
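One possible way to fill in the lookup step, as a rough sketch only: the JDK cannot tell you the language of a character, but Character.UnicodeScript can tell you its script, which may be close enough for picking a separator. The class name, the method names and the two-way split between Han and everything else are assumptions made for illustration.

```java
// Hypothetical sketch: chooses the separator from the script of the first character.
class ScriptAwareWordCount {

    // Returns a separator regex for the script the code point belongs to.
    // An empty string means "no separator, count every character".
    static String separatorForScript(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        if (script == Character.UnicodeScript.HAN) {
            return "";
        }
        return "\\s+"; // assumption: every other script separates words with whitespace
    }

    static int wordCount(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        String separator = separatorForScript(trimmed.codePointAt(0));
        if (separator.isEmpty()) {
            return trimmed.codePointCount(0, trimmed.length());
        }
        return trimmed.split(separator).length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("This is a sentence")); // 4
        System.out.println(wordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27")); // 7
    }
}
```

This reproduces the two sample outputs from the question, but it treats every Han character as a word, which is not how a Chinese speaker would segment the text.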
+2
source

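Here is a snippet in Java:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static int getWordCount(String string) {
        // A run of ASCII word characters or apostrophes is one word;
        // each CJK ideograph in the listed ranges is one word.
        Pattern pattern = Pattern.compile("[\\w']+|[\\u3400-\\u4DB5\\u4E00-\\u9FCC]");
        Matcher matcher = pattern.matcher(string);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }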

Example

    // count is 5
    int wordCount = getWordCount("this is popcorny 電腦");
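One caveat with the pattern above: in Java, \w matches only ASCII letters, digits and underscore by default, so accented words such as "café" get cut apart. A possible variant, offered as a sketch rather than a finished solution, uses Unicode property classes instead. The class name is hypothetical, and counting each Han character as one word is the same simplification the snippet above makes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the regex approach using Unicode properties.
class UnicodeRegexWordCount {

    // Either a single Han character, or a run of letters, digits and
    // apostrophes that are not Han characters.
    private static final Pattern WORD =
            Pattern.compile("\\p{IsHan}|(?:(?!\\p{IsHan})[\\p{L}\\p{N}'])+");

    static int getWordCount(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("this is popcorny \u96fb\u8166")); // 5
        System.out.println(getWordCount("caf\u00e9 na\u00efve"));          // 2
    }
}
```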
+2
source

English version

For the English version, you can make a fairly simple regular expression. I might have missed some custom separators, but:

    public static int getWordCount(String str) {
        return str.split("[\\s,;-]+").length;
    }

Regex explanation:

Split wherever one or more characters from the group [] are found:

    [       start of the group
    \\s     any whitespace character
    ,       or a comma
    ;       or a semicolon
    -       or a hyphen
    ]+      end of the group, repeated one or more times

Chinese version

For the Chinese version, you need to determine what the separators are. If you look up the Unicode code points of the Chinese delimiters and add them to the regex above, you will get the desired results.
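As a rough illustration of that idea, the sketch below adds a handful of common CJK punctuation marks to the separator group. Which delimiters to include is my own guess, and the class name is hypothetical. Note that Chinese text does not put delimiters between individual words, so this counts delimiter-separated segments, not words in the linguistic sense.

```java
// Hypothetical sketch: the English separator group extended with CJK punctuation.
class DelimiterWordCount {

    // Whitespace, comma, semicolon and hyphen, plus: ideographic space (U+3000),
    // ideographic comma (U+3001), ideographic full stop (U+3002),
    // fullwidth comma (U+FF0C) and fullwidth semicolon (U+FF1B).
    private static final String SEPARATORS =
            "[\\s,;\\-\\u3000\\u3001\\u3002\\uFF0C\\uFF1B]+";

    static int getWordCount(String str) {
        if (str.isEmpty()) {
            return 0;
        }
        return str.split(SEPARATORS).length;
    }

    public static void main(String[] args) {
        // Two segments separated by a fullwidth comma, ending in a full stop.
        System.out.println(getWordCount("\u4f60\u597d\uff0c\u4e16\u754c\u3002")); // 2
    }
}
```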

Test

    System.out.println(getWordCount("This is a sentence")); // 4
    System.out.println(getWordCount("This is a sentence")); // 4
    System.out.println(getWordCount("This is a ,,sentence")); // 4
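Two edge cases are worth knowing about with the split approach: String.split returns a one-element array for an empty string, and a separator at the very start of the input produces an empty first element, so both cases are over-counted by one. A minimal sketch of a guard, with a hypothetical class name, assuming the same separator group as above:

```java
// Hypothetical sketch: the split-based count with guards for empty and padded input.
class TrimmedSplitWordCount {

    private static final String SEPARATORS = "[\\s,;-]+";

    static int getWordCount(String str) {
        // Strip separators from both ends so split() yields no empty first element.
        String trimmed = str.replaceAll("^" + SEPARATORS + "|" + SEPARATORS + "$", "");
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split(SEPARATORS).length;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount(""));                      // 0
        System.out.println(getWordCount("  This is a sentence ")); // 4
        System.out.println(getWordCount("This is a ,,sentence"));  // 4
    }
}
```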
+1
source

Source: https://habr.com/ru/post/945374/