I am having trouble using the ICU BreakIterator on an entire UTF-8 Khmer (Cambodian) text file to find word boundaries for line breaking (Khmer, like Thai, has no spaces between words).
I took the sample I was given and changed it to read the text file line by line. The problem is that when a line contains only one word, BreakIterator does not work well, because we configured it to look for at least 3 words in a row (this is necessary for Khmer; without it BreakIterator is not very accurate).
Can someone help me figure out how to overcome this problem? I thought the easiest way would be to read the entire text file into a buffer, but I cannot get that to work.
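For reference, this is a minimal sketch of the "read the entire file into a buffer" idea, using only the C++ standard library. The helper name `readWholeFile` is hypothetical and is not part of ICU or of the sample I was given; it just returns the raw UTF-8 bytes of the file as one string.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an ICU function): read the whole file at `path`
// into one std::string holding the raw UTF-8 bytes.
// Binary mode keeps the bytes untouched, so newlines stay in the buffer.
std::string readWholeFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Copy every byte from the stream until end of file.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The returned string could then presumably be converted with `icu::UnicodeString::fromUTF8` and handed to the iterator once via `setText`, so that the iterator sees the text on both sides of a line end instead of one short line at a time. I have not confirmed that this fixes the accuracy problem for Khmer.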
Here is the full code I have, which reads a text file line by line and breaks each line into words:
/*
 * Written by George Rhoten and SBBIC to test how word segmentation works.
 * Code inspired by the ICU break sample.
 *
 * Here is an example of running this code on Ubuntu:
 *     ./a.out input.txt output.txt
 *
 * Encode input.txt as UTF-8. The output text is UTF-8.
 */