How to use ICU BreakIterator with a Unicode text file in C++

I am having trouble using the ICU BreakIterator on an entire UTF-8 Khmer (Cambodian) text file to find word boundaries for line breaking (Khmer, like Thai, has no spaces between words).

I took the sample I was given and changed it to read the text file line by line. The problem is that when a line contains only one word, BreakIterator does not work well, because we configured it to look for at least 3 words in a row (this is necessary for Khmer; without it BreakIterator is not very accurate).

Can someone help me figure out how to overcome this problem? I thought the easiest way would be to read the entire text file into a buffer, but I cannot get that to work.

Here is all the code I have, which segments the words of a text file line by line:

    /*
     * Written by George Rhoten and SBBIC to test how word segmentation works.
     * Code inspired by the ICU break sample.
     *
     * Example run on Ubuntu:
     *     ./a.out input.txt output.txt
     * Encode input.txt as UTF-8. The output text is UTF-8.
     */
    #include <string>
    #include <iostream>
    #include <fstream>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unicode/brkiter.h>
    #include <unicode/ucnv.h>
    #include <unicode/unistr.h>

    using namespace icu;  // ICU C++ classes live in the icu namespace

    #define ZW_SPACE "\xE2\x80\x8B"

    void printUnicodeString(const UnicodeString &s) {
        int32_t len = s.length() * U8_MAX_LENGTH + 1;
        char *charBuf = new char[len];
        len = s.extract(0, s.length(), charBuf, len, NULL);
        charBuf[len] = 0;
        printf("%s", charBuf);
        delete[] charBuf;  // allocated with new[], so release with delete[]
    }

    int main(int argc, char **argv) {
        // Check the arguments before argv[1] and argv[2] are used.
        if (argc < 3) {
            std::cout << "Please provide an input file name as well as an output file name\n(example: ./a.out input.txt output.txt)" << std::endl;
            return 1;
        }

        std::ifstream input(argv[1]);
        std::ofstream o(argv[2]);
        std::string line;

        // The input file cannot be found
        if (!input) {
            std::cout << "Cannot find the input file you specified (" << argv[1] << ").\nPlease provide an input file name as well as an output file name\n(example: ./a.out input.txt output.txt)" << std::endl;
            return 1;
        }
        // The output file cannot be created
        if (!o) {
            std::cout << "Cannot write to output file (" << argv[2] << ").\nPlease check folder permissions." << std::endl;
            return 1;
        }

        // Make UTF-8 the default charset (used by printUnicodeString).
        ucnv_setDefaultName("UTF-8");

        // Create the line break iterator once and reuse it for every line.
        UErrorCode status = U_ZERO_ERROR;
        BreakIterator *boundary = BreakIterator::createLineInstance(Locale::getDefault(), status);
        if (U_FAILURE(status)) {
            printf("Failed to create line break iterator. status = %s\n", u_errorName(status));
            exit(1);
        }

        // Declared outside the loop so it outlives each setText() call.
        UnicodeString stringToExamine;

        while (std::getline(input, line)) {
            // Convert the standard string to an ICU UnicodeString
            stringToExamine = UnicodeString::fromUTF8(StringPiece(line.c_str()));
            if (stringToExamine.charAt(0) == 0xFEFF) {
                // Remove the BOM
                stringToExamine = UnicodeString(stringToExamine, 1);
            }

            // Walk the boundaries and output each segment in order
            boundary->setText(stringToExamine);
            int32_t start = boundary->first();
            int32_t end = boundary->next();
            while (end != BreakIterator::DONE) {
                if (start != 0) {
                    // Separate segments with a zero-width space, on the console and in the output file
                    printf("%s", ZW_SPACE);
                    o << ZW_SPACE;
                }
                // Current segment
                UnicodeString NathanOut(stringToExamine, start, end - start);
                printUnicodeString(NathanOut);  // print to console (once)

                // Convert the UnicodeString to a normal string and write it to the output file
                std::string cs;
                NathanOut.toUTF8String(cs);
                o << cs;

                start = end;
                end = boundary->next();
            }
        }

        delete boundary;
        return 0;
    }

Source: https://habr.com/ru/post/1395728/

