file_get_contents() breaks UTF-8 characters

I download HTML from an external server. The HTML markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž, etc. When I load the HTML with file_get_contents() as follows:

$html = file_get_contents('http://example.com/foreign.html'); 

it garbles the UTF-8 characters, and I get Å, ¾, ¤ and similar nonsense instead of the correct characters.

How can I solve this?

UPDATE:

I tried saving the downloaded HTML to a file and also outputting it with UTF-8 encoding. Neither works, which suggests that file_get_contents() is already returning broken HTML.
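One way to test that suspicion is to look at the raw bytes rather than at how they are displayed. The following is only a sketch: `describe_bytes` is a hypothetical helper, not part of the question, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the question): reports whether a byte string
// is valid UTF-8 and shows its raw bytes in hex. Requires the mbstring extension.
function describe_bytes(string $bytes): string
{
    $label = mb_check_encoding($bytes, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return $label . ': ' . bin2hex($bytes);
}

echo describe_bytes("\xC5\xA1"), "\n"; // "š" encoded as UTF-8 (two bytes)
echo describe_bytes("\xB9"), "\n";     // a lone high byte, as a single-byte charset would send
```

If the string returned by file_get_contents() passes this check, the bytes themselves are probably intact and the damage happens later, when they are processed or displayed.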

UPDATE2:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <meta http-equiv="Content-Style-Type" content="text/css" />
        <meta http-equiv="Content-Language" content="sk" />
        <title>Test</title>
    </head>
    <body>
    <?php
    $html = file_get_contents('http://example.com');
    echo htmlentities($html);
    ?>
    </body>
    </html>
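A detail in the test page above that may matter: htmlentities() takes the input charset as its third argument, and before PHP 5.4 that argument defaulted to ISO-8859-1. The sketch below passes the same UTF-8 bytes under both charsets to show the difference. `entities_as` is a name made up for this illustration, and whether this applies to the asker's setup is an assumption.

```php
<?php
// Sketch: the same UTF-8 bytes passed to htmlentities() under two declared charsets.
// entities_as() is an illustrative wrapper, not something from the question.
function entities_as(string $bytes, string $charset): string
{
    return htmlentities($bytes, ENT_QUOTES, $charset);
}

$s = "\xC5\xA1"; // "š" encoded as UTF-8

echo entities_as($s, 'ISO-8859-1'), "\n"; // each byte is treated as a separate character
echo entities_as($s, 'UTF-8'), "\n";      // both bytes are treated as one character
```

With ISO-8859-1 the two bytes come out as two unrelated entities, which is the kind of Å-style garbage described in the question. Passing 'UTF-8' explicitly avoids depending on the PHP version's default.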
+58
php utf-8 file-get-contents
Feb 10 2018-10-10
7 answers

OK. I found out that file_get_contents() is not what causes this problem. The real cause is something else, which I describe in another question. Silly me.

See this question: Why does the DOM change encoding?

+6
Feb 10 '10 at 13:05

I had a similar problem with Polish text.

I tried:

 $fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true)); 

I tried:

 $fileEndEnd = utf8_encode($fileEndEnd); 

I tried:

 $fileEndEnd = iconv('UTF-8', 'UTF-8', $fileEndEnd); 

And then:

 $fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8"); 

This last one worked great!
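Note that the 'HTML-ENTITIES' pseudo-encoding used above is deprecated as of PHP 8.2. A roughly equivalent sketch uses mb_encode_numericentity(), which replaces every non-ASCII character with a numeric entity. The helper name `to_numeric_entities` is made up for this example, and the output uses numeric entities where the original call would use named ones.

```php
<?php
// Sketch: replace every non-ASCII character with a decimal numeric entity.
// to_numeric_entities() is an illustrative name. Requires the mbstring extension.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo to_numeric_entities("\xC5\x82"), "\n";        // "ł" (U+0142) becomes &#322;
echo to_numeric_entities('<p>plain ASCII</p>'), "\n"; // ASCII markup is left untouched
```

Because only code points from 0x80 upward are mapped, the HTML tags themselves stay intact, which matters when the result is later fed to a DOM parser.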

+102
03 Mar. '13 at 8:20

A solution suggested in the comments on the PHP manual entry for file_get_contents:

    function file_get_contents_utf8($fn) {
        $content = file_get_contents($fn);
        return mb_convert_encoding($content, 'UTF-8',
            mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
    }

You can also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
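The helper above relies on mb_detect_encoding() to guess the source charset. A variant that avoids guessing is sketched below: it checks whether the bytes are already valid UTF-8 and converts only when they are not. This is a different technique from the manual comment, `to_utf8` is a hypothetical name, and the ISO-8859-1 fallback is an assumption that has to match what the remote server really sends.

```php
<?php
// Sketch: leave valid UTF-8 alone, otherwise convert from an assumed fallback charset.
// to_utf8() is an illustrative name. Requires the mbstring extension.
function to_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, so converting again would corrupt it
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8("\xC5\xA1")), "\n"; // valid UTF-8 input is returned unchanged
echo bin2hex(to_utf8("\xE9")), "\n";     // a lone ISO-8859-1 "é" is converted to UTF-8
```

In use, the result of file_get_contents() would be passed straight through, for example `$html = to_utf8(file_get_contents($fn));`.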

+69
Feb 10 2018-10-10

I think you simply have a double charset conversion :D

This may be because you printed one HTML document inside another HTML document, so you end up with something like this:

    <!DOCTYPE html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title></title>
    </head>
    <body>
    <!DOCTYPE html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>Test</title>.......

Therefore, using mb_detect_encoding can lead to other problems.
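To make "double conversion" concrete, here is a small sketch of what happens when bytes that are already UTF-8 are treated as ISO-8859-1 and converted to UTF-8 a second time. `encode_again` is a name invented for this illustration, and the example assumes the mbstring extension.

```php
<?php
// Sketch of a double conversion: UTF-8 bytes wrongly declared as ISO-8859-1
// and converted to UTF-8 a second time. Requires the mbstring extension.
function encode_again(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$once  = "\xC5\xA1";           // "š" as UTF-8 (2 bytes)
$twice = encode_again($once);  // 4 bytes, displayed as two separate characters

echo bin2hex($once), ' -> ', bin2hex($twice), "\n"; // c5a1 -> c385c2a1

// Converting back in the opposite direction restores the original bytes.
echo bin2hex(mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8')), "\n"; // c5a1
```

The four-byte result is rendered as "Å" followed by "¡", which matches the kind of output described in the question.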

+5
Nov 10 '12 at 18:59

Try this too:

    $url = 'http://www.domain.com/';
    $html = file_get_contents($url);
    // iconv(from, to, string): converts the UTF-8 input to ISO-8859-1,
    // transliterating characters that ISO-8859-1 cannot represent.
    $html = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $html);
+2
Nov 19 '14 at 1:55

For Turkish, neither mb_convert_encoding nor any other charset conversion worked.

urlencode did not work either, because it converts the space character to +. It has to be %20, as in percent-encoding.

This worked:

    $url = rawurlencode($url);
    $url = str_replace("%3A", ":", $url);
    $url = str_replace("%2F", "/", $url);
    $data = file_get_contents($url);
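The same steps can be wrapped in a small function to see what they do to a URL containing a space and a Turkish character. This is a sketch only: `encode_url` is a made-up name and the example URL is hypothetical.

```php
<?php
// Sketch of the approach above as a function: percent-encode everything,
// then restore the ":" and "/" that give the URL its structure.
// encode_url() is an illustrative name, not a built-in.
function encode_url(string $url): string
{
    return str_replace(['%3A', '%2F'], [':', '/'], rawurlencode($url));
}

echo urlencode('a b'), "\n";    // a+b   (space becomes "+")
echo rawurlencode('a b'), "\n"; // a%20b (space becomes "%20")

// "\xC5\x9F" is "ş" encoded as UTF-8.
echo encode_url("http://example.com/a b/\xC5\x9F.html"), "\n";
// http://example.com/a%20b/%C5%9F.html
```

One limitation of this approach: characters such as ?, & and = are encoded as well, so a URL with a query string would need those restored too, or would need its parts encoded separately.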
+2
Oct 26 '16 at 8:24

I work with 35,000 rows of data.

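    $f = fopen("veri1.txt", "r");
    $i = 0;
    // Loop until fgets() returns false, so a failed read is never converted.
    while (($line = fgets($f)) !== false) {
        $i++; // line counter
        echo mb_convert_encoding($line, 'HTML-ENTITIES', "UTF-8");
    }
    fclose($f);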

This code turns my garbled characters back into normal ones.

0
Nov 15 '17 at 10:49


