file_get_contents() breaks UTF-8 characters

Question

I download HTML from an external server. The HTML markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž, etc. When I load the HTML with file_get_contents() as follows:

$html = file_get_contents('http://example.com/foreign.html');

it mangles the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of the correct UTF-8 characters.

How can I solve this?

UPDATE:

I tried saving the HTML to a file and also outputting it with UTF-8 encoding. Neither works, which suggests that file_get_contents() is already returning broken HTML.
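One way to check whether the bytes returned by file_get_contents() are really broken, before any output or DOM step touches them, is to validate them as UTF-8 and look at the raw hex. This is only a minimal sketch: the helper name isValidUtf8 is hypothetical, and the sample strings stand in for the downloaded HTML.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
// The /u modifier makes PCRE validate the subject as UTF-8; on invalid
// input preg_match() returns false instead of 1.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$good = "\xC4\xBE\xC5\xA1"; // "ľš" encoded as UTF-8
$bad  = "\xB5\xB9";         // "ľš" encoded as ISO-8859-2, not valid UTF-8

var_dump(isValidUtf8($good)); // bool(true)
var_dump(isValidUtf8($bad));  // bool(false)
echo bin2hex($good), "\n";    // c4bec5a1
```

If the downloaded string passes this check, the damage is probably happening in a later step rather than in file_get_contents() itself.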

UPDATE2:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <meta http-equiv="Content-Style-Type" content="text/css" />
        <meta http-equiv="Content-Language" content="sk" />
        <title>Test</title>
    </head>
    <body>
    <?php
    $html = file_get_contents('http://example.com');
    echo htmlentities($html);
    ?>
    </body>
    </html>
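Note that the test page calls htmlentities() without an explicit encoding. Before PHP 5.4 that function assumed ISO-8859-1 by default, so on an older installation (an assumption, since the PHP version is not stated in the question) the escaping step alone could turn valid UTF-8 into Å-style garbage. A small sketch of the difference, with the encoding passed explicitly and a sample character chosen for illustration:

```php
<?php
$s = "\xC5\xA1"; // "š" as UTF-8 (two bytes)

// Treating the bytes as ISO-8859-1 escapes each byte separately.
$asLatin1 = htmlentities($s, ENT_QUOTES, 'ISO-8859-1');

// Declaring UTF-8 keeps the two bytes together as one character.
$asUtf8 = htmlentities($s, ENT_QUOTES, 'UTF-8');

echo $asLatin1, "\n"; // &Aring;&iexcl;
echo $asUtf8, "\n";   // &scaron;
```

Passing 'UTF-8' as the third argument makes the behavior independent of the PHP version and of the default_charset setting.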

+58

php utf-8 file-get-contents

Richard Knop Feb 10 2018-10-10

7 answers

I had a similar problem with Polish text.

I tried:

 $fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true));

I tried:

 $fileEndEnd = utf8_encode ( $fileEndEnd );

I tried:

 $fileEndEnd = iconv( "UTF-8", "UTF-8", $fileEndEnd );

And then:

 $fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");

This last one worked great!
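The 'HTML-ENTITIES' pseudo-encoding used above is deprecated as of PHP 8.2. A rough equivalent for the non-ASCII characters, assuming the input is already valid UTF-8, is mb_encode_numericentity(). The helper name toNumericEntities is hypothetical, and the result uses numeric entities rather than named ones:

```php
<?php
function toNumericEntities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // This maps everything outside ASCII to a decimal entity.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo toNumericEntities("\xC4\x8Daj"), "\n"; // &#269;aj
```

Like the original line, this only hides an encoding mismatch at display time; the page still renders correctly because entities are plain ASCII.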

+102

ugniesdebesys 03 Mar. '13 at 8:20

A solution suggested in the comments on the PHP manual page for file_get_contents:

    function file_get_contents_utf8($fn) {
        $content = file_get_contents($fn);
        return mb_convert_encoding($content, 'UTF-8',
            mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
    }

You can also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
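Guessing with mb_detect_encoding() can misfire, so it may be worth looking first at the charset the server declares. After an HTTP request through file_get_contents(), PHP fills $http_response_header with the raw header lines. A sketch of pulling the charset out of them follows; charsetFromHeaders is a hypothetical helper and the header values are made up for illustration:

```php
<?php
// Hypothetical helper: returns the declared charset, or null if none.
function charsetFromHeaders(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^\s;"]+)/i', $line, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null;
}

// Sample header lines in the shape of $http_response_header.
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=iso-8859-2'];
$charset = charsetFromHeaders($headers);
echo $charset ?? 'none', "\n"; // ISO-8859-2
```

When a charset other than UTF-8 is declared, it can be passed as the source encoding to mb_convert_encoding() instead of a detected one.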

+69

Gordon Feb 10 2018-10-10

I think you simply have a double character-encoding conversion :D

This may be because you embedded one HTML document inside another. So you end up with something like this:

    <!DOCTYPE html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title></title>
    </head>
    <body>
    <!DOCTYPE html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>Test</title>.......

Therefore, using mb_detect_encoding can lead to other problems.
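The double-conversion effect is easy to reproduce: converting text that is already UTF-8 as though it were ISO-8859-1 yields the kind of Å-prefixed garbage described in the question. A sketch, with the sample string chosen for illustration:

```php
<?php
$utf8 = "\xC5\xA1"; // "š", already UTF-8

// Wrong: pretend the UTF-8 bytes are ISO-8859-1 and convert again.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c385c2a1, displayed as "Å¡"

// Undo exactly one layer by converting back the other way.
$fixed = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($fixed === $utf8); // bool(true)
```

The reverse step is only safe when the text really was converted one time too many; applied to correct UTF-8 it loses every character that ISO-8859-1 cannot represent.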

+5

Dr. Dama Nov 10 '12 at 18:59

Try this too:

    $url = 'http://www.domain.com/';
    $html = file_get_contents($url);
    // Convert from UTF-8 to ISO-8859-1 (the second argument of iconv() is the target encoding)
    $html = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $html);

+2

Mohamm6d Nov 19 '14 at 1:55

For Turkish, neither mb_convert_encoding nor any other character-set conversion worked for me.

urlencode did not work either, because it converts the space character to +. It must be %20 (percent-encoding).

This worked:

    $url = rawurlencode($url);
    // Restore the characters that must stay literal in a URL
    $url = str_replace("%3A", ":", $url);
    $url = str_replace("%2F", "/", $url);
    $data = file_get_contents($url);
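Encoding the whole URL and then undoing ":" and "/" also encodes "?", "&" and "=" in any query string. An alternative sketch that encodes only the path segments is shown here; encodeUrlPath is a hypothetical helper, and it assumes an absolute URL whose path has not been percent-encoded yet (an existing %20 would become %2520):

```php
<?php
// Hypothetical helper: percent-encode each path segment of an absolute URL,
// leaving scheme, host, query string and fragment untouched.
function encodeUrlPath(string $url): string
{
    if (!preg_match('~^([a-z][a-z0-9+.\-]*://[^/?#]*)([^?#]*)(.*)$~is', $url, $m)) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $segments = array_map('rawurlencode', explode('/', $m[2]));
    return $m[1] . implode('/', $segments) . $m[3];
}

echo encodeUrlPath("http://example.com/my file/\xC3\xBC.html?q=1"), "\n";
// http://example.com/my%20file/%C3%BC.html?q=1
```

The result can then be passed to file_get_contents() in place of the original URL.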

+2

Mustafa Ergüven Oct 26 '16 at 8:24

I work with 35,000 rows of data.

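    $f = fopen("veri1.txt", "r");
    // Read line by line; fgets() returns false at end of file
    while (($line = fgets($f)) !== false) {
        // Convert each UTF-8 line to HTML entities before output
        echo mb_convert_encoding($line, 'HTML-ENTITIES', "UTF-8");
    }
    fclose($f);

This code converts my weird characters to normal ones.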

0

matasoy Nov 15 '17 at 10:49

Richard Knop · Accepted Answer · 2010-02-10 13:05

OK. I found that file_get_contents() is not causing this problem. There is a different cause, which I describe in another question. Silly me.

See this question: Why does the DOM change encoding?
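The linked question is not reproduced here. Assuming it refers to DOMDocument::loadHTML(), a commonly cited cause is that the parser falls back to a single-byte encoding when the markup does not declare a charset early enough. One widely used workaround is to put a charset declaration first. The snippet below is a sketch under that assumption, with a made-up fragment:

```php
<?php
$fragment = "<p>\xC5\xA1</p>"; // "š" as UTF-8, no charset declared

$doc = new DOMDocument();
// Prepend a meta tag so the parser knows the bytes are UTF-8.
$doc->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $fragment
);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n"; // c5a1
```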
