How to apply <cctype> functions to text files with different encodings in C++

I would like to split several files (about 1000) into words and remove numbers and punctuation marks. I will then process these tokenized words... However, the files are mostly in German and come in different encodings:

  • ISO-8859-1
  • ISO Latin-1
  • ASCII
  • UTF-8

The problem I am facing is that I cannot find the correct way to apply character conversion functions such as tolower(), and I also get some weird symbols in the terminal when I use std::cout on Ubuntu Linux.

For example, in files without UTF-8, the word is französischedisplayed as franz sische, fürlike f r, etc. In addition, words Örebroor are Österreichignored tolower(). From what I know, it is inserted "Unicode replacement character" (U+FFFD)for any character that the program cannot decode correctly when trying to process Unicode.

When I open UTF-8 files I don't get any strange characters, but I still can't convert special uppercase characters like Ö to lowercase... I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired result.

My guess on how I should solve this is:

  • Check the encoding of the file to be opened
  • Open the file according to its specific encoding
  • Convert the file input to UTF-8
  • Process the file and apply tolower() etc.

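Is the above algorithm feasible?

What is the correct approach to this problem? How can I open the files with a specific encoding?

1. Does my OS need to have the corresponding locale enabled in order to process the text (regardless of how the console displays it)? (On Linux, for example, de_DE is not listed by locale -a.)

2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before processing the extracted string normally in C++?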

My Linux locale settings:

LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=

Available locales (locale -a):

C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX

Here is some sample code I wrote; it does not work the way I want at the moment.

#include <cctype>
#include <clocale>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> tokens; // filled by processFiles()

void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        std::exit(1);
    }

    // read the whole file into one string
    std::string s;
    std::string line;
    while (std::getline(inFile, line)) {
        s.append(line + "\n");
    }
    inFile.close();

    std::cout << s << std::endl;
    // remove punctuation and numbers, convert to lowercase
    // TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    // setlocale returns nullptr if the requested locale is not installed
    if (std::setlocale(LC_ALL, "de_DE.iso88591") == nullptr)
        std::cerr << "Locale de_DE.iso88591 is not available" << std::endl;
    for (std::size_t i = 0; i < s.length(); ++i) {
        // <cctype> functions require a value representable as unsigned char
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::ispunct(c) || std::isdigit(c))
            s[i] = ' ';
        else if (std::isupper(c))
            s[i] = static_cast<char>(std::tolower(c));
    }
    // tokenize string
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto& i : tokens)
        std::cout << i << std::endl;

    // PROCESS TOKENS
    return;
}
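
Unicode defines "code points" for characters. A code point fits in a 32-bit value.

Several encodings exist. ASCII uses only 7 bits, which gives 128 different characters. Microsoft used the 8th bit to define another 128 characters, depending on the locale, and called these sets "code pages". Nowadays MS uses UTF-16, with 2-byte code units. Two bytes are not enough for the whole Unicode set, so UTF-16 encodes the remaining characters with two such units. Legacy 8-bit encodings have names such as "Latin-1" or "ISO-8859-1"; these two names refer to the same encoding, whose 256 values match the first 256 Unicode code points.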

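The encoding most used on Linux (typically for files) is UTF-8, which uses a variable number of bytes per character. The first 128 characters are exactly the ASCII characters and take one byte each. UTF-8 uses up to 4 bytes to represent a character. More info is on Wikipedia.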
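
To make the variable-width point concrete, here is a small sketch that counts characters (code points) in a UTF-8 string, as opposed to bytes. The function name utf8_length is made up for this illustration, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8; no validation is done here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // a lead byte or a plain ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the UTF-8 bytes of "für" (66 C3 BC 72), std::string::size() reports 4 while utf8_length reports 3. This is also why byte-wise <cctype> calls cannot handle ü or Ö in UTF-8 text: each of them is two bytes, and neither byte is a letter on its own.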

While MS uses UTF-16 for both files and memory, Linux typically uses UTF-32 in memory (wchar_t is 32 bits wide there).

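To read a file you need to know its encoding. Trying to detect it is difficult and may not succeed. std::basic_ios::imbue lets you set the desired locale for a stream, as described in another SO answer.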
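
For the specific mix in the question (ASCII, ISO-8859-1 and UTF-8), a simple heuristic may be enough: if the bytes are valid UTF-8, keep them; otherwise treat them as ISO-8859-1 and convert. The sketch below does this without any locale. All three function names are invented for this example, and the heuristic is an assumption that fits German text, not a general detector: a Latin-1 file could in principle happen to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (correct continuation bytes,
// no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[4] = {0x0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        unsigned long cp = 0;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;                      // invalid lead byte
        if (i + extra >= n) return false;       // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[extra]) return false;             // overlong encoding
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += extra + 1;
    }
    return true;
}

// Converts ISO-8859-1 bytes to UTF-8. Every Latin-1 byte value equals its
// Unicode code point, so bytes >= 0x80 become a two-byte sequence.
std::string latin1_to_utf8(const std::string& s) {
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char b : s) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Heuristic: valid UTF-8 (which includes pure ASCII) is kept as-is,
// anything else is assumed to be ISO-8859-1.
std::string to_utf8_guess(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}
```

With this, both the Latin-1 bytes 66 FC 72 and the UTF-8 bytes 66 C3 BC 72 come out as the same UTF-8 string for "für", so the rest of the processing only has to deal with one encoding.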

tolower and similar functions can work with a locale, for example:

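#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; // Latin capital letter O with diaeresis, decimal 214
    // std::locale throws std::runtime_error if "en_US.UTF-8" is not installed
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex 00F6, decimal 246
    // print the numeric values; a narrow stream cannot print wchar_t as text
    std::cout << "s = " << static_cast<unsigned long>(s) << std::endl;
    std::cout << "sL= " << static_cast<unsigned long>(sL) << std::endl;

    return 0;
}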

This prints:

s = 214
sL= 246
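
If you would rather not depend on which locales are installed, a hand-written mapping is an option for German text. The following is only a sketch under that assumption: it covers ASCII and the Latin-1 Supplement block and nothing else, it expects text already decoded to code points, and the names lower_code_point and lower_text are made up here.

```cpp
#include <string>

// Lowercases one code point in the ASCII and Latin-1 Supplement ranges.
// In both ranges the lowercase letter sits 0x20 above the uppercase one.
char32_t lower_code_point(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') {
        return cp + 0x20;
    }
    // U+00C0..U+00DE are uppercase letters, except U+00D7 (multiplication sign).
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) {
        return cp + 0x20;
    }
    return cp;  // everything else (including U+00DF, sharp s) is left alone
}

// Applies lower_code_point to every element of a UTF-32 string.
std::u32string lower_text(const std::u32string& text) {
    std::u32string out = text;
    for (char32_t& cp : out) {
        cp = lower_code_point(cp);
    }
    return out;
}
```

This maps U+00D6 (Ö) to U+00F6 (ö), the same result as the locale-based call, but characters outside these two ranges are passed through unchanged.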

Another SO answer lists good solutions, such as the iconv library on Linux or iconv for W32.
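
As a rough idea of what using iconv looks like, here is a minimal sketch built on the POSIX iconv_open / iconv / iconv_close calls. The wrapper name convert_encoding and the output-buffer sizing are assumptions of this example: four output bytes per input byte is enough for converting single-byte encodings to UTF-8, but not a general rule. On glibc the functions are part of libc; other platforms may need to link a separate libiconv.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Converts `input` from encoding `from` to encoding `to` in a single iconv call.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the argument order: target first
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open: unsupported conversion");
    }

    std::string in = input;                    // iconv needs a mutable buffer
    std::string out(in.size() * 4 + 4, '\0');  // assumed upper bound, see above
    char* in_ptr = &in[0];
    char* out_ptr = &out[0];
    std::size_t in_left = in.size();
    std::size_t out_left = out.size();

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv: conversion failed");
    }

    out.resize(out.size() - out_left);  // keep only the bytes actually written
    return out;
}
```

A call such as convert_encoding(raw, "ISO-8859-1", "UTF-8") would then turn the Latin-1 bytes of a file into UTF-8 before any further processing.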

On Linux the terminal can be set to use a locale through LC_ALL, LANG and LANGUAGE, for example:

# Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"

# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"
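
Keep in mind that these variables only take effect if the locale is actually installed, and that a C++ program starts in the "C" locale until it calls std::setlocale itself. A small sketch for checking this from inside the program follows; the helper name first_available_locale is made up, and as a side effect it leaves LC_CTYPE set to the locale it returns.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Tries each candidate in order and returns the first name that
// std::setlocale accepts for LC_CTYPE, or an empty string if none works.
// Side effect: LC_CTYPE stays set to the returned locale.
std::string first_available_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        if (std::setlocale(LC_CTYPE, name.c_str()) != nullptr) {
            return name;
        }
    }
    return std::string();
}
```

On the system from the question, first_available_locale({"de_DE.iso88591", "de_DE.utf8", "en_US.utf8"}) should skip the two German names, since locale -a does not list them, and fall through to en_US.utf8.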

Source: https://habr.com/ru/post/1695938/