Why do characters become garbled? libcurl, C++, UTF-8 encoded HTML

First of all, sorry for my poor English. I did my research, but found no related answers that solve my problem. I have read about code pages, UTF-8 and related material for C and C++, and I know that a std::string can hold UTF-8. The console code page on my English Windows XP development machine is set to 1254 (Windows Turkish), and I can use Turkish extended characters (İığşçüö) in a std::string, count them, and send them to the mysqlpp API to write them to databases. No problems. But my problem starts when I use curl to fetch some HTML and write it to a std::string.

    #include <iostream>
    #include <windows.h>
    #include <wincon.h>
    #include <curl.h>
    #include <string>

    int main()
    {
        SetConsoleCP(1254);
        SetConsoleOutputCP(1254);
        std::string s;
        std::cin >> s;
        std::cout << s << std::endl;
        return 0;
    }

When I run it and type ğşçöüİı, the output is the same: ğşçöüİı.

    #include <iostream>
    #include <windows.h>
    #include <wincon.h>
    #include <curl.h>
    #include <string>

    // libcurl write callback: appends the received bytes to the std::string
    // passed via CURLOPT_WRITEDATA and returns the number of bytes handled.
    size_t writer(char *data, size_t size, size_t nmemb, std::string *buffer)
    {
        size_t res = 0;
        if (buffer != NULL)
        {
            buffer->append(data, size * nmemb);
            res = size * nmemb;
        }
        return res;
    }

    int main()
    {
        SetConsoleOutputCP(1254);
        std::string html;
        CURL *curl;
        CURLcode result;
        curl = curl_easy_init();
        if (curl)
        {
            curl_easy_setopt(curl, CURLOPT_URL, "http://site.com");
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
            result = curl_easy_perform(curl);
            if (result == CURLE_OK)
            {
                std::cout << html << std::endl;
            }
            curl_easy_cleanup(curl);
        }
        return 0;
    }

When I compile and run it, the output is garbled: if the HTML contains 'ı', cmd prints 'Ä±'; 'ö' prints as 'Ã¶', 'ğ' prints as 'ÄŸ', 'İ' prints as 'Ä°', and so on.

If I change the code page to 65000,

 ... SetConsoleOutputCP(65000);//For utf8 ... 

the result is the same, so I concluded that the cmd code page is not the cause of the problem.

The HTTP response headers state that the encoding is UTF-8, and the HTML meta tag says the same.

As I understand it, the source of the problem is the "writer" function or curl itself. The incoming data is handled one char at a time, so extended characters such as ı, İ, ğ are split into 2 chars and stored in the std::string's char array that way. The code page equivalents of these half-characters are then printed, or used wherever else the string goes in the code (e.g. when mysqlpp writes this string to the db).

I do not know how to solve this, or what to change in the writer function or elsewhere. Am I thinking along the right lines? If so, what can I do about it? Or is the source of the problem somewhere else?

I'm using MinGW32 on 32-bit Windows XP with the Code::Blocks IDE.

2 answers

The correct code page for UTF-8 is 65001, not 65000.

Also, have you checked that setting the code page actually succeeded? The SetConsoleOutputCP function indicates success or failure through its return value.


The returned string is UTF-8, so you should set the console code page to 65001 (as sth recommended). Alternatively, convert the string to code page 1254 and keep using code page 1254 for console output as before.


Source: https://habr.com/ru/post/1383308/

