Friday, 13 September 2013

Can Goutte/Guzzle be forced into UTF-8 mode?

I'm scraping a UTF-8 site using Goutte, which uses Guzzle internally.
The site declares UTF-8 in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
However, the Content-Type response header omits the character set:
Content-Type: text/html
rather than:
Content-Type: text/html; charset=utf-8
As a result, when I scrape, Goutte does not detect that the page is UTF-8
and decodes the text incorrectly. The remote site is not under my control,
so I can't fix the problem there. Here is a pair of scripts that
reproduces the problem. First, the scraper:
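<?php
use Goutte\Client;

// Load Goutte from the phar in the project's vendor directory
require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';

// Local test page, served by the second script
$url = 'http://crawler-tests.local/utf-8.php';

$client = new Client();
$crawler = $client->request('GET', $url);

// Text content of the fetched document
$text = $crawler->text();
echo 'Whole page: ' . $text . "\n";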
Now a test page to be placed on a web server:
<?php
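// Correct: declares the character set, so the scrape works
// header('Content-Type: text/html; charset=utf-8');
// Incorrect: omits the character set, which triggers the problem
header('Content-Type: text/html');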
?>
<!DOCTYPE html>
<html>
<head>
<title>UTF-8 test</title>
<meta charset="utf-8" />
</head>
<body>
<p>When the Content-Type header is incomplete, the pound sign
breaks:
£15,216</p>
</body>
</html>
As the comments in the test page show, declaring the character set
properly in the header fixes things. I've hunted around in Goutte for
anything that would force the character set, but to no avail. Any ideas?
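
To make the symptom concrete, here is a minimal sketch in plain PHP that reproduces the mangling without Goutte or a web server. It rests on two assumptions: that the crawler takes the character set only from the Content-Type header, and that it falls back to ISO-8859-1 when the header names none. I have not confirmed that this is what Goutte actually does. Both functions are hypothetical helpers written for this illustration, not part of Goutte or Guzzle, and the script needs the mbstring extension.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// value, or return a fallback when the header does not name one.
// The ISO-8859-1 fallback is an assumption, not confirmed Goutte behaviour.
function charsetFromContentType($contentType, $fallback = 'ISO-8859-1')
{
    if (preg_match('/charset\s*=\s*"?([\w-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $fallback;
}

// Hypothetical helper: interpret raw body bytes as $charset and return
// the result as UTF-8.
function decodeBody($bytes, $charset)
{
    return mb_convert_encoding($bytes, 'UTF-8', $charset);
}

// "£15,216" as the UTF-8 bytes the test page sends
$body = "\xC2\xA315,216";

// Header with a charset: the bytes are read as UTF-8 and survive intact
$good = decodeBody($body, charsetFromContentType('text/html; charset=utf-8'));

// Header without a charset: each UTF-8 byte is read as one Latin-1 character
$bad = decodeBody($body, charsetFromContentType('text/html'));

echo $good, "\n"; // £15,216
echo $bad, "\n";  // Â£15,216
```

The stray "Â" in front of the pound sign is the first byte of the two-byte UTF-8 sequence being read as a character in its own right, which matches the breakage on the test page.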
