How to parse the <media:content> tag in RSS using SimpleXML
My RSS feed comes from http://rss.cnn.com/rss/edition.rss and has this structure:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://rss.cnn.com/~d/styles/itemcontent.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title><![CDATA[CNN.com - RSS Channel - Intl Homepage - News]]></title>
<description><![CDATA[CNN.com delivers up-to-the-minute news and information on the latest top stories, weather, entertainment, politics and more.]]></description>
<link>http://www.cnn.com/intl_index.html</link>
...
<item>
<title><![CDATA[Russia responds to claims it has damaging material on Trump]]></title>
<description><![CDATA[The Kremlin denied it has compromising information about US President-elect Donald Trump, describing the allegations as "pulp fiction".]]></description>
<link>http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</link>
<guid isPermaLink="true">http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</guid>
<pubDate>Wed, 11 Jan 2017 14:44:49 GMT</pubDate>
<media:group>
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg" height="619" width="1100" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-large-11.jpg" height="300" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-large-gallery.jpg" height="552" width="414" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-video-synd-2.jpg" height="480" width="640" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-live-video.jpg" height="324" width="576" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-gallery.jpg" height="360" width="270" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-story-body.jpg" height="169" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-assign.jpg" height="186" width="248" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-hp-video.jpg" height="144" width="256" />
</media:group>
</item>
...
</channel>
</rss>
If you parse this XML with simplexml as follows:
$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);
you will see that <media:content> is simply missing from the elements of $rssarray. I found a tutorial whose solution relies on namespaces. However, in its example the author uses:
foreach ($xml->channel->item as $item) { ... }
but I use the following (for some reason I cannot use foreach):
$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);
So, I changed the solution for my case as follows:
$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$namespaces = $rss->getNamespaces(true); // get namespaces
$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);
if (isset($rssarray['channel']['item'])) {
    foreach ($rssarray['channel']['item'] as $key => $item) {
        $media_content = $rss->channel->item[$key]->children($namespaces['media']);
        foreach ($media_content as $tag) {
            $tagjson = json_encode($tag);
            $tagarray = json_decode($tagjson, TRUE);
        }
    }
}
But that does not work. For each item, $tagarray ends up as an array with this structure:
Array(
    'content' => array(
        '0' => array(null),
        '1' => array(null),
        ...
        '11' => array(null),
    )
)
How can I read the attributes of <media:content>, for example url?
The <media:content> tags are empty elements, which is why you get empty arrays:
<media:content ... />
                   ^^
What you want are the attributes, which you can read with SimpleXMLElement::attributes(), for example:
$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$namespaces = $rss->getNamespaces(true);
$media_content = $rss->channel->item[0]->children($namespaces['media']);
foreach ($media_content->group->content as $i) {
    var_dump((string)$i->attributes()->url);
}
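To apply the same idea to every item rather than only item[0], a minimal sketch might look like the following. It assumes a trimmed-down inline feed with example.com URLs in place of the CNN feed, and the helper name collectMediaUrls() is hypothetical. It walks the media children in nested loops, so an item without a <media:group> simply yields an empty list.

```php
<?php
// Namespace URI declared as xmlns:media in the feed.
const MEDIA_NS = 'http://search.yahoo.com/mrss/';

// Stand-in for the real feed (assumption: same structure, shortened).
$xml = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<item>
<title>First</title>
<media:group>
<media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
<media:content medium="image" url="http://example.com/b.jpg" height="300" width="300" />
</media:group>
</item>
<item>
<title>Second</title>
</item>
</channel>
</rss>
XML;

// Hypothetical helper: map each item title to the list of its media URLs.
function collectMediaUrls(SimpleXMLElement $rss): array
{
    $result = [];
    foreach ($rss->channel->item as $item) {
        $urls = [];
        // children(MEDIA_NS) yields only the children in the media namespace.
        foreach ($item->children(MEDIA_NS) as $mediaChild) {
            if ($mediaChild->getName() !== 'group') {
                continue;
            }
            foreach ($mediaChild->children(MEDIA_NS) as $content) {
                // attributes() without arguments returns the un-namespaced attributes.
                $urls[] = (string) $content->attributes()->url;
            }
        }
        $result[(string) $item->title] = $urls;
    }
    return $result;
}

$urlsByTitle = collectMediaUrls(simplexml_load_string($xml));
print_r($urlsByTitle);
```

With the real feed, simplexml_load_string($xml) would be replaced by the simplexml_load_file() call shown above.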
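Note that there is no need to convert to JSON and back. SimpleXML objects (like many PHP objects) rely on magic behavior that cannot be inspected reliably with standard PHP tools such as print_r() or json_encode(). For example: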
var_dump($i, json_encode($i), (string)$i->attributes()->url);
object(SimpleXMLElement)#2 (0) {
}
string(2) "{}"
string(91) "http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg"
...
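If a plain PHP array is still needed, one option is to copy the attributes explicitly instead of going through json_encode(). The sketch below is only an illustration: attributesToArray() is a hypothetical helper, and the inline XML with an example.com URL stands in for the real feed.

```php
<?php
// Hypothetical helper: copy every un-namespaced attribute into a plain array.
function attributesToArray(SimpleXMLElement $element): array
{
    $result = [];
    foreach ($element->attributes() as $name => $value) {
        // Cast explicitly; $value is a SimpleXMLElement, not a string.
        $result[$name] = (string) $value;
    }
    return $result;
}

// Stand-in for the real feed (assumption: same structure, shortened).
$xml = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<item>
<title>Example item</title>
<media:group>
<media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
</media:group>
</item>
</channel>
</rss>
XML;

$mediaNs = 'http://search.yahoo.com/mrss/';
$rss = simplexml_load_string($xml);

// Same access path as in the answer: item -> media:group -> media:content.
$content = $rss->channel->item[0]->children($mediaNs)->group->content[0];
$attrs = attributesToArray($content);
print_r($attrs);
```

The resulting array could then be attached to the matching entry of $rssarray under a key of your choosing.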