How to apply <cctype> functions to text files with different encodings in C++
I would like to split several files (about 1000) into words and remove numbers and punctuation marks. I will then process these tokenized words further. However, the files are mostly in German and use different encodings:
- ISO-8859-1
- ISO Latin-1
- ASCII
- UTF-8
The problem I am facing is that I cannot find the correct way to apply character conversion functions such as tolower(), and I also get some weird symbols in the terminal when I use std::cout on Ubuntu Linux.
For example, in files without UTF-8, the word is französischedisplayed as franz sische, fürlike
f r, etc. In addition, words Örebroor are Österreichignored tolower(). From what I know, it is inserted "Unicode replacement character" (U+FFFD)for any character that the program cannot decode correctly when trying to process Unicode.
When I open UTF-8 files, I don't get any strange characters, but I still can't convert special uppercase characters like Ö to lowercase. I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired result.
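My guess on how I should solve this is:
- Check the encoding of the file to be opened
- Open the file according to its specific encoding
- Convert the file contents to UTF-8
- Process the text and apply tolower() etc.
Is this algorithm feasible?
What is the correct approach to this problem? How can I open a file with a specific encoding?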
1. ( , )? ( Linux, , de_DE, -locale -a)
2. - ? - , ++?
My locale settings on Linux (output of locale, followed by the list printed by locale -a):
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is the code I have atm.
#include <cctype>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

// tokens and filesize() are declared elsewhere (not shown)
void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        std::exit(1);
    }
    // calculate file size and read the whole file into s
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while (std::getline(inFile, line)) {
        s.append(line + "\n");
    }
    inFile.close();
    std::cout << s << std::endl;
    // remove punctuation, numbers, tolower,
    // TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");  // returns nullptr if this locale is not installed
    for (std::size_t i = 0; i < s.length(); ++i) {
        // <cctype> functions require a value representable as unsigned char
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::ispunct(c) || std::isdigit(c))
            s[i] = ' ';
        if (std::isupper(c))
            s[i] = static_cast<char>(std::tolower(c));
    }
    // std::cout << s << std::endl;
    // tokenize string on whitespace
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto & i : tokens)
        std::cout << i << std::endl;
    // PROCESS TOKENS
    return;
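}
Unicode assigns every character a number called a "code point". Code points range from U+0000 to U+10FFFF, so each one fits in a 32-bit value.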
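There are several encodings. ASCII uses only 7 bits, which gives 128 characters. The 8-bit encodings, such as the Microsoft "code pages", use the eighth bit for another 128 characters that depend on the locale. Nowadays MS uses UTF-16, built from 2-byte units; one unit cannot cover all of Unicode, so some characters take two units. The 8-bit encodings cover only a small slice of Unicode and go by names such as "Latin-1" or "ISO-8859-1" (these two names denote the same encoding).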
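Because every ISO-8859-1 (Latin-1) byte value equals the Unicode code point of the character it stands for, converting such a file to UTF-8 is mechanical. A minimal sketch, assuming the input really is Latin-1; the name latin1_to_utf8 is my own and not a library function:

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Each Latin-1 byte value equals its Unicode code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: the byte is copied unchanged
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For instance, the Latin-1 byte 0xFC (ü) becomes the two bytes 0xC3 0xBC. Feeding it text that is already UTF-8 would double-encode it, so the encoding has to be known first.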
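The encoding most used on Linux (typically for files) is UTF-8, which uses a variable number of bytes per character. The first 128 characters are exactly the ASCII characters and take one byte each. UTF-8 uses up to 4 bytes to represent a character. More info is on Wikipedia.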
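To make the "one to four bytes" rule concrete, here is a small sketch of a UTF-8 encoder. The name encode_utf8 is hypothetical, and the function assumes it is handed a valid code point (it does not reject surrogates or values above U+10FFFF):

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Assumes cp is a valid code point; no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, 'A' stays a single byte, Ö (U+00D6) takes two bytes, the euro sign (U+20AC) three, and code points beyond U+FFFF four.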
While MS uses UTF-16 both for files and in memory, Linux typically uses UTF-32 in memory (wchar_t is 32 bits wide there).
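To read a file correctly you need to know its encoding. Trying to detect it is a real nightmare and may not succeed. std::basic_ios::imbue lets you set the desired locale for your stream (see the related SO answers).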
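One heuristic that often helps with a mix of UTF-8 and Latin-1 files: text in an 8-bit encoding that contains umlauts almost never happens to be well-formed UTF-8. The sketch below is only such a heuristic, not real detection; the name looks_like_utf8 is my own, and the check is deliberately simplified (it does not reject overlong three- and four-byte forms or surrogates):

```cpp
#include <cstddef>
#include <string>

// Returns true if s consists only of structurally well-formed UTF-8 sequences.
// Pure ASCII also passes, since ASCII is a subset of UTF-8.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // number of continuation bytes expected
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not of the form 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

If the check fails, treating the file as ISO-8859-1 is a guess that fits the files described in the question, but it remains a guess.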
tolower and similar functions can work with a locale, for example:
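#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6';  // Latin capital letter O with diaeresis, decimal 214
    // The std::locale constructor throws std::runtime_error if the locale is not installed
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8"));  // hex 00F6, decimal 246
    // Print the numeric values; streaming a wchar_t to std::cout is ill-formed since C++20
    std::cout << "s = " << static_cast<unsigned long>(s) << std::endl;
    std::cout << "sL= " << static_cast<unsigned long>(sL) << std::endl;
    return 0;
}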
Output:
s = 214
sL= 246
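The locale-based std::tolower only works when the named locale is installed. For German text, whose letters all sit in the Latin-1 range, a locale-free fallback is possible, because the uppercase letters U+00C0..U+00DE (except the multiplication sign U+00D7) differ from their lowercase forms by 0x20. A sketch under that assumption; both function names are hypothetical, and the text is assumed to be already decoded into code points:

```cpp
#include <string>

// Lowercase a single code point from the ASCII / Latin-1 range.
// Code points outside that range are returned unchanged.
char32_t to_lower_latin1(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') return cp + 0x20;                 // A..Z
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;   // À..Þ, skipping ×
    return cp;
}

// Lowercase a whole string of code points.
std::u32string lower_text_latin1(std::u32string text) {
    for (char32_t& cp : text) cp = to_lower_latin1(cp);
    return text;
}
```

This turns Ö (U+00D6) into ö (U+00F6) and leaves ß (U+00DF) alone. It does not cover letters outside Latin-1, so it is no substitute for a proper Unicode library on arbitrary text.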
On Linux the terminal's locale can be set with the environment variables LC_ALL, LANG and LANGUAGE, for example:
# German
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"
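Setting these variables has no effect if the named locale is not installed (the locale -a list in the question contains no de_DE entry, for instance). From C++ this can be checked before use. A minimal sketch; the helper name locale_available is my own:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns true if a std::locale with the given name can be constructed.
// std::locale's constructor throws std::runtime_error for unknown names.
bool locale_available(const std::string& name) {
    try {
        std::locale loc(name);
        (void)loc;  // only the successful construction matters
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

A program could call locale_available("de_DE.UTF-8") first and fall back to "en_US.UTF-8" or to a locale-free routine when it returns false.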