How to parse XML on Wikipedia using PHP?

How do I parse the XML returned by the Wikipedia API using PHP? I tried it with a simple request, but got nothing. Here is the URL I want to fetch:

http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml

Here is my code:

    <?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com");

    $ch = curl_init();
    $cv = curl_version();
    $user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} "
                . "${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">";
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
    $xml = curl_exec($ch);

    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");
    echo $xml->api->query->pages->page->rev;
    ?>
3 answers

I usually use a combination of cURL and XMLReader to parse the XML generated by the MediaWiki API.

Please note that you must specify your email address in the User-Agent header, otherwise the API will respond with HTTP 403 Forbidden.

This is how I initialize the cURL handle:

    define("EMAIL_ADDRESS", "my@email.com");

    $ch = curl_init();
    $cv = curl_version();
    $user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} "
                . "{$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

Then you can use this code, which fetches the XML and initializes a new XMLReader object in $xml_reader :

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
    $xml = curl_exec($ch);

    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");

EDIT: Here is a working example:

    <?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com");

    $ch = curl_init();
    $cv = curl_version();
    $user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} "
                . "{$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
    $xml = curl_exec($ch);

    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");

    // Reads forward until the first <rev> element of the current page.
    function extract_first_rev(XMLReader $xml_reader)
    {
        while ($xml_reader->read()) {
            if ($xml_reader->nodeType == XMLReader::ELEMENT) {
                if ($xml_reader->name == "rev") {
                    return htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                }
            } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
                if ($xml_reader->name == "page") {
                    throw new Exception("Unexpectedly found `</page>`");
                }
            }
        }
        throw new Exception("Reached the end of the XML document without finding revision content");
    }

    // Collect the latest revision of each page, keyed by page title.
    $latest_revs = array();
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT && $xml_reader->name == "page") {
            $latest_revs[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
        }
    }

    // Asks the API to render a revision's wikitext into HTML.
    function parse($rev)
    {
        global $ch;
        curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
        curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=parse&text="
            . rawurlencode($rev) . "&prop=text&format=xml");
        sleep(3); // be polite to the API
        $xml = curl_exec($ch);
        $xml_reader = new XMLReader();
        $xml_reader->xml($xml, "UTF-8");
        while ($xml_reader->read()) {
            if ($xml_reader->nodeType == XMLReader::ELEMENT && $xml_reader->name == "text") {
                return htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
            }
        }
        throw new Exception("Failed to parse");
    }

    foreach ($latest_revs as $title => $rev) {
        echo parse($rev) . "\n";
    }

You can use SimpleXML:

 $xml = simplexml_load_file($url); 

See an example here: http://php.net/manual/en/simplexml.examples-basic.php
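As a minimal sketch of what that gives you for the query above (using a hardcoded, trimmed sample string in place of the live response; the real response nests revision text under revisions inside each page):

    <?php
    // Illustrative sample of the API's XML shape; a live script would use
    // simplexml_load_file($url) instead.
    $sample = '<api><query><pages>'
            . '<page pageid="1" title="Rebar">'
            . '<revisions><rev xml:space="preserve">Rebar is a steel bar...</rev></revisions>'
            . '</page>'
            . '</pages></query></api>';

    $xml = simplexml_load_string($sample);
    foreach ($xml->query->pages->page as $page) {
        // Attributes use array syntax, child elements use ->
        echo $page['title'], ": ", (string) $page->revisions->rev, "\n";
    }
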

Or DOM:

    $xml = new DOMDocument();
    $xml->load($url);
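With DOM you can grab the rev elements by tag name. A sketch against a hardcoded sample string (hypothetical content; a live script would load() the API URL as above):

    <?php
    // Trimmed, illustrative response shape.
    $sample = '<api><query><pages>'
            . '<page title="Rebar"><revisions><rev>Steel reinforcement bar...</rev></revisions></page>'
            . '</pages></query></api>';

    $doc = new DOMDocument();
    $doc->loadXML($sample);
    foreach ($doc->getElementsByTagName('rev') as $rev) {
        echo $rev->textContent, "\n"; // the raw wikitext of each revision
    }
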

Or XMLReader for huge XML documents that you don't want to read completely into memory.


You should look at the PHP XMLReader class.
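A minimal streaming sketch: XMLReader visits one node at a time, so only the current node is held in memory. The sample string here stands in for a large downloaded response:

    <?php
    // Illustrative sample; in practice you would pass the downloaded XML string.
    $sample = '<api><query><pages>'
            . '<page title="Rebar"><revisions><rev>Steel reinforcement bar...</rev></revisions></page>'
            . '</pages></query></api>';

    $reader = new XMLReader();
    $reader->xml($sample, 'UTF-8');
    $revs = array();
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'rev') {
            $revs[] = $reader->readString(); // text content of this <rev>
        }
    }
    $reader->close();
    echo implode("\n", $revs), "\n";
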


Source: https://habr.com/ru/post/1387861/

