I am scraping a UTF-8 site using Goutte, which uses Guzzle internally. The site declares UTF-8 in a meta tag, thus:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
However, the Content-Type response header is as follows:
Content-Type: text/html
and not:
Content-Type: text/html; charset=utf-8
Thus, when I scrape, Goutte does not detect that the page is UTF-8 and decodes the data incorrectly. The remote site is not under my control, so I cannot fix the problem there! Here is a set of scripts to replicate the problem. Firstly, the scraper:
<?php
require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';

use Goutte\Client;

// Local test page that omits the charset from its Content-Type header
$url = 'http://crawler-tests.local/utf-8.php';

$client = new Client();
$crawler = $client->request('GET', $url);

// Extract the text of the page and print it
$text = $crawler->text();
echo 'Whole page: ' . $text . "\n";
Next, the test page, which is hosted on the web server:
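<?php
// Incomplete header: no charset is declared, so the pound sign breaks
header('Content-Type: text/html');
// Complete header: sending this one instead fixes the output
// header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
<title>UTF-8 test</title>
<meta charset="utf-8" />
</head>
<body>
<p>When the Content-Header header is incomplete, the pound sign breaks: £15,216</p>
</body>
</html>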
Here is the output of the Goutte test:
Whole page: UTF-8 test When the Content-Header header is incomplete, the pound sign breaks: Â£15,216
As you can see from the comments in the last script, declaring the character set correctly in the header fixes everything. I have hunted around in Goutte for a way to force a character set, but to no avail. Any ideas?
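In case it helps with diagnosis: the stray Â is what I would expect if the two UTF-8 bytes of the pound sign (0xC2 0xA3) were being decoded as ISO-8859-1. I have not confirmed that this is what Goutte does internally, so treat the fallback charset as an assumption. Here is a minimal sketch of that theory, independent of Goutte; it assumes the mbstring extension is available, and the variable names are only illustrative:

```php
<?php
// Assumption: the scraper falls back to ISO-8859-1 when the header declares no charset.

// "£" encoded as UTF-8 is the two bytes 0xC2 0xA3
$poundUtf8 = "\xC2\xA3";

// Reinterpret each byte as an ISO-8859-1 character and re-encode the result as UTF-8
$misdecoded = mb_convert_encoding($poundUtf8, 'UTF-8', 'ISO-8859-1');

// On a UTF-8 terminal this shows an "Â" in front of the pound sign
echo $misdecoded . "15,216\n";

// Hex dump of the mangled string, to make the extra bytes visible
echo bin2hex($misdecoded) . "\n"; // c382c2a3
```

If that theory is right, the bytes coming back from the server are fine and only the decoding step needs to be told that the page is UTF-8.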