How to apply tolower() to the German capital letters Ä, Ö, Ü, ẞ in C++

I am a little confused since opening my earlier question, so I would like to be more specific here.

I have many files that contain German letters, mainly encoded in iso-8859-15 or UTF-8. To process them, I need to convert all letters to lowercase.

For example, I have a file (encoded in iso-8859-15) that contains:

R. Rose in M. Das Sogen. Baptisterium zu Winland, eins der im Art. "Baukunst" (S. 496) erwähnten Rundgebäude in Greenland, soll nach Palfreys "History of the New England" eine von dem Gouverneur Arnold um 1670 erbaute Windmühle sein. VgL. Rush. Storm on the den "Jahrbüchern der königlichen Gesellschaft für nordische Altertumskunde in Copenhagen" 1887, S. 296.

Ää Öö Üü ẞß Örebro

The text Ää Öö Üü ẞß Örebro should read as follows: ää öö üü ßß örebro.

However, tolower() does not seem to apply to uppercase letters such as Ä, Ö, Ü, ẞ, although I tried to force the locale as described on this SO page.

Here is the same code as in my other question:

std::vector<std::string> tokens;
std::string filename = "10223-8.txt";
//std::string filename = "test-UTF8.txt";
std::ifstream inFile;

//std::setlocale(LC_ALL, "en_US.iso88591");
//std::setlocale(LC_ALL, "de_DE.iso88591");
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::locale::global(std::locale(""));

inFile.open(filename);
if (!inFile) { std::cerr << "Failed to open file" << std::endl; exit(1); }

std::string s = "";
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
    s.append(line + "\n");
}
inFile.close();

std::cout << s << std::endl;

//std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
    if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
    if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i], std::locale("de_DE.utf8"))
}

std::cout << s << std::endl;

//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};

//PROCESS TOKENS...

This is really frustrating, and there are not many examples of how to use <locale>.

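So my questions are:

  • Is it wrong to use the character-classification functions (isupper(), ispunct()...)?
  • Do I need to install the de_DE locale in my Linux environment?
  • Is it safe to process the text as std::string regardless of the file encoding (iso-8859-15 or UTF-8)?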

EDIT: The files may be UTF-8 or iso-8859-15, so a solution should cover both encodings in C++.

+4
1

You can use std::ctype::tolower instead of std::tolower:

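#include <iostream>
#include <locale>
#include <string>

int main() {
    // Requires the de_DE.UTF-8 locale to be installed; otherwise this throws.
    std::locale::global(std::locale("de_DE.UTF-8"));
    std::wcout.imbue(std::locale());
    // The ctype facet of the global locale provides the case mapping.
    auto& f = std::use_facet<std::ctype<wchar_t>>(std::locale());
    std::wstring str = L"Ää Öö Üü ẞß Örebro";
    // Lowercase the whole range [begin, end) in place.
    f.tolower(&str[0], &str[0] + str.size());
    std::wcout << "'" << str << "'\n";
}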

Alternatively, if you do not want to change the global locale:

std::locale loc("de_DE.UTF-8");
std::wcout.imbue(loc);
auto& f = std::use_facet<std::ctype<wchar_t>>(loc);
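
The output depends on the locale data installed on the system. The capital ẞ (U+1E9E) in particular is lowered to ß only if the locale's case tables include that mapping.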

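Also note that case mapping is not always 1 to 1. Unicode maps the uppercase of "ß" to "SS", which std::ctype::toupper cannot do because it converts each character to exactly one character.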

+1

Source: https://habr.com/ru/post/1695936/
