I am trying to write a script that reads a remote sitemap.xml file, parses the URLs inside it, and then loads each one in turn to pre-cache the pages for faster viewing.
The reason for this: the system we are developing writes DITA XML to the browser on the fly, and the first time a page loads the wait can be from 8 to 10 seconds. Subsequent loads can take as little as 1 second. Obviously, for a better UX, pre-cached pages are a bonus.
Every time we prepare a new publication on this server, or perform any testing or correction, we must clear the cache, so the idea is to write a script that walks the sitemap and loads each URL.
After a little reading, I decided that the best way is to use PHP and cURL. Whether this is a good idea or not, I don't know. I am more familiar with Perl, but neither PHP nor Perl is currently installed on the system, so I thought it would be nice to plunge into the PHP pool.
The code I grabbed from "teh internets" so far reads the sitemap.xml file, writes it to an XML file on our server, and also displays it in the browser. As far as I can tell, it just slurps the whole file in one go:
<?php
$ver = "Sitemap Parser version 0.2";
echo "<p><strong>" . $ver . "</strong></p>";

// Fetch the remote sitemap.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://ourdomain.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
curl_close($ch);

// If the response parses as XML, save a local copy and echo the lot.
if (@simplexml_load_string($xml)) {
    $fp = fopen('feed.xml', 'w');
    fwrite($fp, $xml);
    echo $xml;
    fclose($fp);
}
?>
Instead of dumping the entire document to a file or to the screen, it would be better to walk the XML structure and grab just the required URLs.

The XML is in this format:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4</loc>
    <lastmod>2011-03-31T11:25:01.984+01:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_9</loc>
    <lastmod>2011-03-31T11:25:04.734+01:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
I tried using SimpleXML:
curl_setopt($ch, CURLOPT_URL, 'http://onlineservices.letterpart.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

$xml = new SimpleXMLElement($data);
$url = $xml->url->loc;
echo $url;
and it printed the first URL to the screen, which was great news!
http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4
My next step was to try to read every url node in the document, so I tried:
foreach ($xml->url) {
    $url = $xml->url->loc;
    echo $url;
}
hoping this would grab the loc from every url, but it outputs nothing, and this is where I am stuck.
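From the SimpleXML docs, I suspect my loop is missing the "as" clause and should read loc from the loop variable rather than from $xml each time. This is a minimal sketch of what I think it should look like (untested, so I may still be misunderstanding the iteration):

foreach ($xml->url as $url) {
    // $url is one <url> element; print its <loc> child.
    echo $url->loc . "<br/>";
}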
Please can someone help me grab this child (loc) from each of its many parents (url), and suggest the best way to then load each page so that it gets cached? I assume this is just a GET request.
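In case it helps, this is the sort of cache-warming loop I have in mind once the locs are collected: a plain cURL GET of each page, discarding the body. The $locs array name is mine and this is an untested sketch:

// Untested sketch: request each page once so the server renders and caches it.
// Assumes $locs is an array of URL strings collected from the sitemap loop above.
foreach ($locs as $loc) {
    $ch = curl_init($loc);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep the page body out of our output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow any redirects
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo "<p>Warmed " . htmlspecialchars($loc) . " (HTTP " . $status . ")</p>";
}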
Hope I have provided enough information. If anything is missing (apart from my ability to write PHP, that is :-), please let me know.
Thanks.