I am trying to write a script that reads a remote sitemap.xml file, parses the URLs inside it, and then loads each one in turn to pre-cache the pages for faster viewing.
The reason for this: the system we are developing writes DITA XML to the browser on the fly, and the first time a page loads the wait can be from 8 to 10 seconds. Subsequent loads can take as little as 1 second. Obviously, for a better UX, pre-cached pages are a bonus.
Every time we prepare a new publication on this server, or perform any testing or correction, we must clear the cache, so the idea is to write a script that walks the sitemap and loads each URL.
After a little reading, I decided that the best way is to use PHP and cURL. Whether this is a good idea or not, I don't know. I am more familiar with Perl, but neither PHP nor Perl is currently installed on the system, so I thought it would be nice to plunge into the PHP pool.
The code I grabbed from "teh internets" so far reads the sitemap.xml file, writes it to an XML file on our server, and also displays it in the browser. As far as I can tell, it just slurps the whole file in one go:
<?php
$ver = "Sitemap Parser version 0.2";
echo "<p><strong>" . $ver . "</strong></p>";

// Fetch the remote sitemap.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://ourdomain.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
curl_close($ch);

// If the response parses as XML, save a local copy and echo the lot.
if (@simplexml_load_string($xml)) {
    $fp = fopen('feed.xml', 'w');
    fwrite($fp, $xml);
    echo $xml;
    fclose($fp);
}
?>
Instead of dumping the entire document to a file or to the screen, it would be better to walk the XML structure and grab just the required URLs.

The XML is in this format:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4</loc>
    <lastmod>2011-03-31T11:25:01.984+01:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_9</loc>
    <lastmod>2011-03-31T11:25:04.734+01:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
I tried using SimpleXML:
curl_setopt($ch, CURLOPT_URL, 'http://onlineservices.letterpart.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

$xml = new SimpleXMLElement($data);
$url = $xml->url->loc;
echo $url;
and it printed the first URL to the screen, which was great news!
http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4
My next step was to try to read every url node in the document, so I tried:
foreach ($xml->url) {
    $url = $xml->url->loc;
    echo $url;
}
hoping this would grab the loc from every url, but it outputs nothing, and this is where I am stuck.
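From the SimpleXML docs, I suspect my loop is missing the "as" clause and should read loc from the loop variable rather than from $xml each time. This is a minimal sketch of what I think it should look like (untested, so I may still be misunderstanding the iteration):

foreach ($xml->url as $url) {
    // $url is one <url> element; print its <loc> child.
    echo $url->loc . "<br/>";
}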
Please can someone help me grab this child (loc) from each of its many parents (url), and suggest the best way to then load each page so that it gets cached? I assume this is just a GET request.
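In case it helps, this is the sort of cache-warming loop I have in mind once the locs are collected: a plain cURL GET of each page, discarding the body. The $locs array name is mine and this is an untested sketch:

// Untested sketch: request each page once so the server renders and caches it.
// Assumes $locs is an array of URL strings collected from the sitemap loop above.
foreach ($locs as $loc) {
    $ch = curl_init($loc);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep the page body out of our output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow any redirects
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo "<p>Warmed " . htmlspecialchars($loc) . " (HTTP " . $status . ")</p>";
}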
Hope I have provided enough information. If anything is missing (apart from my ability to write PHP, that is :-), please let me know.
Thanks.