How to count Japanese words in Go

A walk through the Go Tour gives the impression that Unicode is supported out of the box.

Counting words in text that does not use standard delimiters such as spaces, notably Japanese and Chinese, is painful in other programming languages (PHP, for example), so I'm curious: can words written in Japanese (for example, in katakana) be counted using the Go programming language?

If so, how?

1 answer

Yes: "You can count words written in Japanese (for example, in katakana) using the Go programming language." But first you need to improve your question.

Someone reading your phrase "standard delimiters, such as spaces" might think that counting words is a well-defined operation. It is not, even for languages such as English. In the phrase "testing 1 2 3 testing", does the string "1 2 3" count as one word, three words, or zero? Is the answer different for "testing 123 testing"? And how many words are there in "testing <mytag class="numbers"> 1 2 3 </mytag> testing"?
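To make that ambiguity concrete, here is a minimal sketch (my own illustration, not part of the original question) showing that even a naive whitespace-based counter silently bakes in one particular answer: every digit group becomes a word.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveCount treats every whitespace-separated token as a
// word, which is the definition most people assume exists.
func naiveCount(s string) int {
	return len(strings.Fields(s))
}

func main() {
	fmt.Println(naiveCount("testing 1 2 3 testing")) // 5
	fmt.Println(naiveCount("testing 123 testing"))   // 3
}
```

The two phrases contain the same "words" by most human intuitions, yet the counts differ, which is exactly the point: the definition does the work, not the code.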

Someone might also assume that Japanese has a concept of "word" similar to English's, just with a different writing convention. That is not true for many languages, including Japanese, Chinese, and Thai.

So first you should improve your question by specifying what counts as a word, starting with Latin-script languages like English.

Do you need a simple lexical definition based on the presence of spacing characters? Then consider Unicode TR 29 Version 4.1.0, Text Boundaries, Section 4, Word Boundaries. It defines "word boundaries" in terms of regular expressions over Unicode character properties. The GMX-V industry localization standard, in its Word Boundaries section, relies on TR 29.
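As a hedged sketch of what a character-property-based definition looks like (a deliberate simplification of the full TR 29 rules, not an implementation of them), one could define a word as a maximal run of Unicode letters:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// countWords sketches one possible lexical definition:
// a word is a maximal run of letter characters, so digits
// and punctuation act as delimiters. The real TR 29 rules
// are considerably more nuanced than this.
func countWords(s string) int {
	return len(strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	}))
}

func main() {
	fmt.Println(countWords("testing 1 2 3 testing")) // 2: digits are not letters
	fmt.Println(countWords("don't panic"))           // 3: the apostrophe splits "don't"
}
```

Note how this definition answers the "1 2 3" question differently from a whitespace-based one, and mishandles English contractions; TR 29 has dedicated rules (MidLetter) precisely for cases like the apostrophe.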

Once you have a definition you like, I'm sure you can implement it using Go packages such as unicode and text/scanner. I haven't done it myself. From a quick review of the official package list, it appears that no existing package implements TR 29. But your question asks whether this is possible, not whether it is already implemented in an official package.

Next, for Japanese: do you need just a simple lexical definition of a word? If so, Unicode TR 29 supplies one. It says:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply well-defined default behavior.

If you need a linguistically sophisticated definition of a word in a Japanese context, you need to start by addressing the issues raised by @Jhilke Dai, Sergio Tulentsev, and others. You will need to develop your own specification of what a word is, and then implement it. I am sure you will not find such an implementation in an official Go package as of July 2014. However, I am also sure that if you can develop a clear specification, it is possible to implement it in Go.

Now: how many words does this answer contain? And how did you count them?


Source: https://habr.com/ru/post/971770/

