Get word count from a string in Unicode (in any language)

Question

I want to get the number of words in a string. That part is simple; the trick is that the string can be in any language, and I cannot predict which.

So I need a function with the signature int getWordCount(String) that produces the following sample output:

    getWordCount("供应商代发发货") => 7
    getWordCount("This is a sentence") => 4

Any help on how to proceed would be appreciated :)

+6

java string unicode word-count multilingual

jaibatrik May 19, '13 at 17:36

5 answers

The standard API provides BreakIterator for this kind of boundary analysis, but the locale support in Oracle Java 7 does not break up the sample string.

When I used the ICU4J v51.1 BreakIterator, it broke the sample into [供应, 商代, 发, 发, 货].

    // import com.ibm.icu.text.BreakIterator;
    // import java.util.ArrayList;
    // import java.util.List;
    // import java.util.Locale;
    String sentence = "\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27";
    BreakIterator iterator = BreakIterator.getWordInstance(Locale.CHINESE);
    iterator.setText(sentence);

    // Walk the boundaries and collect the text between each pair.
    List<String> words = new ArrayList<>();
    int start = iterator.first();
    int end = iterator.next();
    while (end != BreakIterator.DONE) {
        words.add(sentence.substring(start, end));
        start = end;
        end = iterator.next();
    }
    System.out.println(words);

Note: I used Google Translate to guess that "供应商代发发货" is Chinese. I do not speak the language, so I cannot comment on whether this output is correct.
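To turn the boundary list into the requested getWordCount, one option is to count only the segments that contain a letter or digit, so that spaces and punctuation between boundaries are skipped. The sketch below uses the JDK's own java.text.BreakIterator rather than ICU4J so that it has no dependencies; the class name and the extra Locale parameter are illustrative assumptions, not part of the original answer. As noted above, the count for Chinese text depends on the dictionary support of the BreakIterator implementation, so no Chinese result is claimed here.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorWordCount {

    /** Counts boundary-delimited segments that contain at least one letter or digit. */
    static int getWordCount(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        int count = 0;
        int start = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
            start = end;
            end = iterator.next();
        }
        return count;
    }

    /** Returns true if any code point in text[start, end) is a letter or digit. */
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        int i = start;
        while (i < end) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("This is a sentence", Locale.ENGLISH)); // 4
        // The result for the Chinese sample varies with the BreakIterator implementation.
        System.out.println(getWordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27", Locale.CHINESE));
    }
}
```

Swapping the import for com.ibm.icu.text.BreakIterator should give the ICU4J segmentation instead, since the two classes expose the same iteration methods used here.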

+6

Mcdowell May 19 '13 at 18:18

Assuming that each language has one (or more) word separators, and you can create a regular expression for this separator, then the problem can be solved as follows:

    public String separatorForLanguage(char unicodeChar) {
        // Find out which language unicodeChar belongs to and
        // return the regex for that language's word separator.
        return "";
    }

    public int wordCount(String sentence) {
        char unicodeChar = sentence.charAt(0);
        String separator = separatorForLanguage(unicodeChar);
        if (separator.isEmpty()) {
            // No separator: every character counts as a word. Returning the length
            // directly avoids split("") behaving differently before and after Java 8.
            return sentence.length();
        }
        return sentence.split(separator).length;
    }
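One way the stub could be filled in is with Character.UnicodeScript, which the JDK has provided since Java 7. The sketch below is only an illustration under a strong assumption: text whose first character is in the Han script has no separator (each character is counted as a word, which matches the 7 expected in the question), and everything else is split on whitespace. The class name and that two-way mapping are hypothetical; a real mapping would need a rule per script.

```java
import java.lang.Character.UnicodeScript;

class ScriptWordCount {

    /** Returns the separator regex for the script of codePoint ("" means no separator). */
    static String separatorForLanguage(int codePoint) {
        // Assumption: only Han text is treated as separator-free.
        if (UnicodeScript.of(codePoint) == UnicodeScript.HAN) {
            return "";
        }
        return "\\s+";
    }

    static int wordCount(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        String separator = separatorForLanguage(trimmed.codePointAt(0));
        if (separator.isEmpty()) {
            // No separator: count every code point as one word.
            return trimmed.codePointCount(0, trimmed.length());
        }
        return trimmed.split(separator).length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("This is a sentence")); // 4
        System.out.println(wordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27")); // 7
    }
}
```

Like the original, this looks only at the first character, so mixed-language strings are counted by whichever rule the first character selects.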

+2

Mohayemin May 19, '13 at 18:07

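Here is a snippet in Java:

    // import java.util.regex.Matcher;
    // import java.util.regex.Pattern;
    public static int getWordCount(String string) {
        // A run of word characters/apostrophes, or a single CJK ideograph.
        Pattern pattern = Pattern.compile("[\\w']+|[\\u3400-\\u4DB5\\u4E00-\\u9FCC]");
        Matcher matcher = pattern.matcher(string);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }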

Example

    // count is 5
    int wordCount = getWordCount("this is popcorny 電腦");
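By default \w in Java matches only ASCII letters, digits and underscore, so a word such as "naïve" would be split in two by the pattern above. A possible variant, sketched here as an assumption rather than as the answerer's method, uses Unicode property classes instead: \p{IsHan} for a single Han character, and any run of other letters, digits or apostrophes as one word. The class name is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeRegexWordCount {

    // Either one Han character, or a run of non-Han letters, digits and apostrophes.
    private static final Pattern WORD =
            Pattern.compile("\\p{IsHan}|[\\p{L}\\p{N}'&&[^\\p{IsHan}]]+");

    static int getWordCount(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "this is popcorny" followed by two Han characters
        System.out.println(getWordCount("this is popcorny \u96fb\u8166")); // 5
        // "naive cafe" written with accented letters
        System.out.println(getWordCount("na\u00efve caf\u00e9")); // 2
    }
}
```

This still counts every Han character as its own word, so it shares the original snippet's definition of a Chinese "word".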

+2

popcorny Aug 1 '13 at 10:24

English version

For the English version, you can make a fairly simple regular expression. I might have missed some custom separators, but:

    public static int getWordCount(String str) {
        return str.split("[\\s,;-]+").length;
    }

Regex explanation:

Split wherever one or more characters from the group [] appear:

    [      start of the character group
    \\s    any whitespace character
    ,      or a comma
    ;      or a semi-colon
    -      or a hyphen
    ]      end of the group
    +      one or more of these characters in a row

Chinese version

For the Chinese version, you need to decide what the separators are. If you look up the Unicode code points of the Chinese delimiters and add them to the regex above, you will get the desired results.
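As a rough sketch of that idea, the regex can be extended with a few common CJK punctuation marks: the ideographic comma (U+3001), the ideographic full stop (U+3002), the fullwidth comma (U+FF0C) and the fullwidth semicolon (U+FF1B). Which delimiters to include is an assumption made for this example, and the class name is hypothetical. Note that this counts delimiter-separated chunks, so an unpunctuated string such as the one in the question comes out as a single word.

```java
class CjkPunctuationWordCount {

    // Whitespace, ASCII comma/semicolon/hyphen, plus U+3001, U+3002, U+FF0C, U+FF1B.
    private static final String SEPARATORS = "[\\s,;\\-\\u3001\\u3002\\uFF0C\\uFF1B]+";

    static int getWordCount(String str) {
        int count = 0;
        for (String piece : str.split(SEPARATORS)) {
            // split() can yield an empty leading piece when the text starts with a separator.
            if (!piece.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("This is a ,,sentence")); // 4
        // Two chunks separated by a fullwidth comma and ended by an ideographic full stop.
        System.out.println(getWordCount("\u4f60\u597d\uff0c\u4e16\u754c\u3002")); // 2
    }
}
```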

Test

    System.out.println(getWordCount("This is a sentence")); // 4
    System.out.println(getWordCount("This is a sentence")); // 4
    System.out.println(getWordCount("This is a ,,sentence")); // 4

+1

flavian May 19, '13 at 17:42

peter.murray.rust · Accepted Answer · 2013-05-19T18:00:13+0000

The concept of a word can be trivial or complex. Here is what the Apache Stanbol toolkit says about it:

Word Tokenization: The detection of single words is required by the Stanbol Enhancer to process text. Although this is trivial for most languages, it is quite a challenge for some oriental languages, for example Chinese, Japanese and Korean. Unless otherwise configured, Stanbol will use whitespace to tokenize words.

So, if your concept of a word is linguistic rather than syntactic, you should use an NLP toolkit.

My preferred Java solution: Apache OpenNLP

Note: I used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It suggests that there are 4 words, not seven. I cut and pasted the result (rather fragmented):

Simplified (Traditional, HSK level), Pinyin: English definition

    供应商 (供應商), gōng yìng shāng: supplier
    代, dài: replace / act on behalf of others / generation / dynasty / age / period / (historical) era / (geological) eon
    发 (發, HSK 4), fā: send / show (one's feeling) / release / develop / classifier for shots (rounds)
    发 (髮), fà: hair / Taiwan pr. [fa3]
    发货 (發貨), fā huò: send / send goods

These first three characters form one word.
