Can Goutte / Guzzle be put into UTF-8 mode?

I am scraping a UTF-8 site using Goutte, which internally uses Guzzle. The site declares a UTF-8 meta tag, thus:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 

However, the Content-Type header it sends is just:

 Content-Type: text/html 

and not:

 Content-Type: text/html; charset=utf-8 

Thus, when I scrape, Goutte does not see that the page is UTF-8 and garbles the data. The remote site is not under my control, so I cannot fix the problem there! Here is a pair of scripts to replicate the problem. First, the scraper:

 <?php

 require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';

 use Goutte\Client;

 $url = 'http://crawler-tests.local/utf-8.php';

 $client = new Client();
 $crawler = $client->request('get', $url);
 $text = $crawler->text();
 echo 'Whole page: ' . $text . "\n";

Next, the test page to host on a web server:

 <?php
 // Correct
 #header('Content-Type: text/html; charset=utf-8');
 // Incorrect
 header('Content-Type: text/html');
 ?>
 <!DOCTYPE html>
 <html>
 <head>
     <title>UTF-8 test</title>
     <meta charset="utf-8" />
 </head>
 <body>
     <p>When the Content-Type header is incomplete, the pound sign breaks: £15,216</p>
 </body>
 </html>

Here is the output of the Goutte test:

Whole page: UTF-8 test When the Content-Type header is incomplete, the pound sign breaks: Â£15,216

As you can see from the comments in the last script, declaring the character set correctly in the header fixes everything. I hunted around in Goutte to see if there is a way to force a character set, but to no avail. Any ideas?

3 answers

The problem lies in symfony/browser-kit and symfony/dom-crawler. The BrowserKit Client does not look at HTML meta tags to determine the encoding at all, only at the Content-Type header. When the response body is handed over to the DomCrawler, it is therefore treated with the default encoding, ISO-8859-1. After examining the meta tags, that decision ought to be reverted and the DomDocument rebuilt, but that never happens.

A simple solution is to wrap $crawler->text() in utf8_decode():

 $text = utf8_decode($crawler->text()); 

This works if the input is UTF-8. I believe that for other encodings you can do something similar with iconv(). However, you have to remember to do this every time you call text().
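For illustration, here is a minimal sketch of the iconv() variant; the Windows-1251 source encoding is a hypothetical example, not something from this thread:

 // text() returns the raw bytes mis-read as ISO-8859-1 and then
 // re-encoded as UTF-8; converting back to ISO-8859-1 recovers the
 // original bytes (for UTF-8 input this is exactly utf8_decode()).
 $raw = iconv('UTF-8', 'ISO-8859-1', $crawler->text());

 // If the page is really in some other encoding, convert the
 // recovered bytes to UTF-8 yourself:
 $text = iconv('Windows-1251', 'UTF-8', $raw);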

A more general approach is to make the DomCrawler believe it is dealing with UTF-8. To that end, I came up with a Guzzle plugin that overwrites (or adds) the charset in the response's Content-Type header. You can find it at https://gist.github.com/pschultz/6554265. Usage looks like this:

 <?php

 use Goutte\Client;

 $plugin = new ForceCharsetPlugin();
 $plugin->setForcedCharset('utf-8');

 $client = new Client();
 $client->getClient()->addSubscriber($plugin);
 $crawler = $client->request('get', $url);
 echo $crawler->text();
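For reference, here is a rough sketch of what such a plugin might look like; the linked gist is the authoritative version, and the event name and header calls below are my assumptions about Guzzle 3's event-subscriber API:

 <?php

 use Guzzle\Common\Event;
 use Symfony\Component\EventDispatcher\EventSubscriberInterface;

 class ForceCharsetPlugin implements EventSubscriberInterface
 {
     private $forcedCharset;

     public function setForcedCharset($charset)
     {
         $this->forcedCharset = $charset;
     }

     public static function getSubscribedEvents()
     {
         // Run once the response has been received.
         return array('request.complete' => 'onRequestComplete');
     }

     public function onRequestComplete(Event $event)
     {
         // Drop any charset the server did send, then append ours, so
         // the DomCrawler sees the encoding we want to force.
         $response = $event['response'];
         $contentType = preg_replace(
             '/;\s*charset=[^;]*/i',
             '',
             (string) $response->getHeader('Content-Type')
         );
         $response->setHeader(
             'Content-Type',
             $contentType . '; charset=' . $this->forcedCharset
         );
     }
 }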

It seems I ran into two mistakes here, one of which Peter identified. The other was the way I was using the Symfony Crawler class on its own to parse HTML snippets.

I was doing this (to parse the HTML of a table row):

 $subCrawler = new Crawler($rowHtml); 

Adding the HTML via the constructor, however, offers no way to specify a character set, and I believe ISO-8859-1 is again assumed by default.

The fix is simply to use addHtmlContent instead; its second parameter specifies the character set, which defaults to UTF-8 when not given.

 $subCrawler = new Crawler();
 $subCrawler->addHtmlContent($rowHtml);
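If a snippet is in some other encoding, that second parameter lets you say so explicitly (ISO-8859-1 here is just a hypothetical example):

 $subCrawler = new Crawler();
 $subCrawler->addHtmlContent($rowHtml, 'ISO-8859-1'); // snippet known to be Latin-1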

The Crawler tries to detect the encoding from the <meta charset> tag, but that tag is often missing, and then the Crawler falls back to its default encoding (ISO-8859-1). This is the source of the problem described in this thread.

When we pass content to the Crawler through its constructor, we lose the Content-Type header, which usually carries the encoding.

Here is how we can handle this:

 $crawler = new Crawler();
 $crawler->addContent(
     $response->getBody()->getContents(),
     $response->getHeaderLine('Content-Type')
 );

With this solution, we use the correct encoding from the server response, we do not tie our code to any single encoding, and, of course, we no longer need to decode every string we get back from the Crawler (with utf8_decode() or anything else).
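For completeness, a minimal self-contained sketch of this approach, assuming a PSR-7 HTTP client such as Guzzle 6+ (the client setup and URL are illustrative, not part of the original answer):

 <?php

 use GuzzleHttp\Client;
 use Symfony\Component\DomCrawler\Crawler;

 $client = new Client();
 $response = $client->get('http://crawler-tests.local/utf-8.php');

 // Hand the crawler both the body and the Content-Type header, so it
 // can use whatever charset the server actually declared.
 $crawler = new Crawler();
 $crawler->addContent(
     $response->getBody()->getContents(),
     $response->getHeaderLine('Content-Type')
 );

 echo $crawler->text();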

