Getting URL data with cURL gives unexpected characters

I have run into a problem when fetching URL data with cURL, specifically for websites in other languages such as Arabic. My cURL function:

    function file_get_contents_curl($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $data = curl_exec($ch);
        $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

        // checking mime types
        if (strstr($info, 'text/html')) {
            curl_close($ch);
            return $data;
        } else {
            return false;
        }
    }

And this is how I extract the data:

    $html = file_get_contents_curl($checkurl);
    $grid = '';
    if ($html) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        @$title = $nodes->item(0)->nodeValue;
        @$metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description')
                $description = $meta->getAttribute('content');
        }
    }

I get all the data correctly from some Arabic websites such as http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873, but when I give it this YouTube URL, http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAAFAA, it shows garbled characters. What setting do I need to change so that the description is displayed exactly as it appears on the page?

3 answers

Introduction

Handling Arabic text can be tricky, but these basic steps are needed:

  • Your document should output UTF-8
  • Your DOMDocument should read its input as UTF-8

Problem

When the information is fetched from YouTube it is already UTF-8 encoded, and the retrieval process applies an additional UTF-8 encoding on top of it. I am not sure why this happens, but a simple utf8_decode fixes the problem.
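The double encoding described above can be reproduced in isolation. The following is only a sketch of one plausible mechanism (UTF-8 bytes being read as ISO-8859-1 and converted to UTF-8 a second time), not a confirmed account of what happens inside DOMDocument. It assumes the mbstring extension is available and uses mb_convert_encoding() in place of utf8_decode(), which is deprecated as of PHP 8.2; the variable names are illustrative.

```php
<?php
// Assumption: the mbstring extension is loaded.
$original = 'بنات سعوديات'; // already valid UTF-8

// Simulate the bug: treat the UTF-8 bytes as ISO-8859-1 and encode them as UTF-8 again.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo it: this conversion is equivalent to utf8_decode() for this input.
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($doubleEncoded === $original); // bool(false): the text is now mojibake
var_dump($restored === $original);      // bool(true): the original bytes are back
```

Decoding like this is only safe when the text really was encoded twice; applied to text that was encoded once, it would corrupt it, which is why the answer restricts the fix to the YouTube host.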

Example

    header('Content-Type: text/html; charset=UTF-8');
    echo displayMeta("http://www.emaratalyoum.com/multimedia/videos/2012-04-08-1.474873");
    echo displayMeta("http://www.youtube.com/watch?v=Eyxljw31TtU&feature=g-logo&context=G2c4f841FOAAAAAAAFAA");

Output

emaratalyoum.com

 التقطت عدسات الكاميرا حارس مرمى ريال مدريد إيكر كاسياس في موقف محرج قبل لحظات من بداية مباراة النادي الملكي مع أبويل القبرصي في ذهاب دور الثمانية لدوري أبطال أوروبا.ففي النفق المؤدي إلى الملعب، قام كاسياس بوضع إصبعه في أنفه، وبعدها قام بمسح يده في وجه أحد

youtube.com

 بنات سعوديات: أريد "شايب يدللني ولا شاب يعللني"

Functions used

displayMeta

    function displayMeta($checkurl) {
        $html = file_get_contents_curl($checkurl);
        $grid = '';
        if ($html) {
            $doc = new DOMDocument("1.0", "UTF-8");
            @$doc->loadHTML($html);
            $nodes = $doc->getElementsByTagName('title');
            $title = $nodes->item(0)->nodeValue;
            $metas = $doc->getElementsByTagName('meta');
            for ($i = 0; $i < $metas->length; $i++) {
                $meta = $metas->item($i);
                if ($meta->getAttribute('name') == 'description') {
                    $description = $meta->getAttribute('content');
                    // YouTube descriptions come back double-encoded, so decode them once
                    if (stripos(parse_url($checkurl, PHP_URL_HOST), "youtube") !== false)
                        return utf8_decode($description);
                    else {
                        return $description;
                    }
                }
            }
        }
    }

file_get_contents_curl

    function file_get_contents_curl($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $data = curl_exec($ch);
        $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

        // checking mime types
        if (strstr($info, 'text/html')) {
            curl_close($ch);
            return $data;
        } else {
            return false;
        }
    }

I believe this will work: apply utf8_decode() to your attribute value.

    function file_get_contents_curl($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $data = curl_exec($ch);
        $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

        // checking mime types
        if (strstr($info, 'text/html')) {
            curl_close($ch);
            return $data;
        } else {
            return false;
        }
    }

    $html = file_get_contents_curl($checkurl);
    $grid = '';
    if ($html) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        @$title = $nodes->item(0)->nodeValue;
        @$metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description')
                $description = utf8_decode($meta->getAttribute('content'));
        }
    }

What happens here is that you discard the Content-Type header that cURL returns in your function file_get_contents_curl(); DOMDocument needs this information to determine the character set used on the page.

A somewhat ugly but widely applicable hack is to prefix the returned page with a <meta> tag containing the character set from the response headers:

    if (strstr($info, 'text/html')) {
        curl_close($ch);
        return '<meta http-equiv="Content-Type" content="' . $info . '" />' . $data;
    }

DOMDocument will accept the misplaced meta tag and automatically perform the appropriate character set conversion.
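The effect of the prefix can be tried without any network access. This is a minimal sketch, not the answer's exact code: extractTitle() is a hypothetical helper, its $contentType argument stands in for the value returned by curl_getinfo($ch, CURLINFO_CONTENT_TYPE), and the sample page is made up. The htmlspecialchars() call and the use of libxml_use_internal_errors() instead of the @ operator are additions of this sketch.

```php
<?php
// Hypothetical helper: parse $html and return the text of its <title>, or null.
// $contentType stands in for curl_getinfo($ch, CURLINFO_CONTENT_TYPE).
function extractTitle(string $html, ?string $contentType = null): ?string
{
    if ($contentType !== null) {
        // Prefix the page with a meta tag so the parser learns the charset.
        $html = '<meta http-equiv="Content-Type" content="'
            . htmlspecialchars($contentType, ENT_QUOTES) . '" />' . $html;
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? $titles->item(0)->nodeValue : null;
}

// A UTF-8 page that does not declare its own charset (illustrative data).
$page = '<html><head><title>بنات سعوديات</title></head><body></body></html>';

echo extractTitle($page, 'text/html; charset=utf-8'), PHP_EOL;
```

If the response header carries no charset parameter (for example plain text/html), the prefix gives the parser nothing to work with, so a fallback charset would still have to be chosen by the caller.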


Source: https://habr.com/ru/post/913004/

