Fetching parts of a loaded page in PHP (RegEx)

I have a mailing system that I am trying to include in a PHP site. The site loads a content area and also loads scripts into the top of the page. This works great for the code generated for the site, but now I have a newsletter page that I am trying to include.

I was originally going to use an iframe, but the number of AJAX and jQuery calls makes this pretty complicated.

So I thought I could use cURL to load the newsletter page into a variable. Then I was going to use RegEx to capture the content between the body tags and put it in the content area. Finally, I was going to use RegEx again to search the head and capture any scripts.

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $config_live_site."lib/alerts/user/update.php?email=test@test.com.au"); # URL to post to
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); # return into a variable
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $loaded_result = curl_exec($ch); # run!
    curl_close($ch);

    // Capture the body content and place in $_content
    if (preg_match('%<body>([\s\S]*)</body>%', $loaded_result, $regs)) {
        $_content .= $regs[1];
    } else {
        $_content .= "<p>No content to display.</p>";
    }

    // Capture the scripts and place in the head
    if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $loaded_result, $regs)) {
        $headDetails .= $regs[0];
    }

This works most of the time, but if there is also a script in the document body, the match runs on all the way to the last '</script>'.

My question is twofold, I think:

a. Is there a better general approach (my deadline is very short, so it should be a quick solution that does not require editing the newsletter too much)?

b. Which RegEx do I need to use to grab just the first script?

+4
4 answers

I think you need to add a ? after the * in the script regex so that it is not greedy. A greedy quantifier matches as much as possible (everything between the first opening tag and the last closing tag); a non-greedy one matches as little as possible (only what lies between the opening tag and the first closing tag). Try:

 %(<script type="text/javascript">[\s\S]*?</script>)% 

As already mentioned, also change it to preg_match_all; you should then match each individual script block rather than everything between the first and last script tags.
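To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The $html sample is made up for illustration and is not the asker's newsletter page.

```php
<?php
// Hypothetical sample page: one script in the head, one in the body.
$html = '<head><script type="text/javascript">var a = 1;</script></head>'
      . '<body><p>Hello</p><script type="text/javascript">var b = 2;</script></body>';

// Greedy: [\s\S]* runs on to the LAST </script> in the string,
// so the match swallows </head>, <body> and the paragraph too.
preg_match('%<script type="text/javascript">[\s\S]*</script>%', $html, $greedy);

// Non-greedy: [\s\S]*? stops at the FIRST </script> after the opening tag.
preg_match('%<script type="text/javascript">[\s\S]*?</script>%', $html, $lazy);

echo $greedy[0], "\n";
echo $lazy[0], "\n";
```

With this sample, the non-greedy match is just the first script element, while the greedy match also contains the paragraph from the body.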

+2

A: I see no problem using regular expressions to extract the bits you need from HTML pages that are not necessarily valid. In fact, some of the spidering solutions I worked with did just that.

B: Use preg_match_all() instead of preg_match(). preg_match() only captures the first match, while preg_match_all() keeps going to the end of the string and returns all matches.
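As a rough illustration of how the two functions differ in what they return, the sketch below runs the same non-greedy pattern through both. The $html sample and the variable names $first, $all, $countFirst and $countAll are assumptions made for this example; $headDetails is the variable from the question.

```php
<?php
// Hypothetical sample with two inline scripts.
$html = '<script type="text/javascript">one();</script>'
      . '<p>text</p>'
      . '<script type="text/javascript">two();</script>';

$pattern = '%<script type="text/javascript">[\s\S]*?</script>%';

// preg_match stops after the first match; $first[0] is a string.
$countFirst = preg_match($pattern, $html, $first);

// preg_match_all keeps scanning; $all[0] is an array of every full match.
$countAll = preg_match_all($pattern, $html, $all);

// Join all captured script blocks so they can be written into the head.
$headDetails = implode("\n", $all[0]);

echo $countFirst, " vs ", $countAll, "\n";
```

Note that this collects every inline script in the page, including those in the body, so it may need to be combined with a step that restricts the search to the head.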

+1

A quick and dirty answer could be: right after capturing the body, capture the contents of the head as well. So continue with

    if (preg_match('%<head>([\s\S]*)</head>%', $loaded_result, $regs)) {
        $_header .= $regs[1];
    } else {
        $_header .= "<p>No content to display.</p>";
    }

and then apply the script regex only to the header:

    if (preg_match('%(<script type="text/javascript">[\s\S]*</script>)%', $_header, $regs)) {
        $headDetails .= $regs[0];
    }

If the HTML obtained from cURL is well-formed (that is, it parses as XML), you could use SimpleXML to perform the extraction. As the name suggests, it is very simple to use.

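    // $loaded_result is the cURL response from the question
    $xml = simplexml_load_string($loaded_result);
    $body = $xml->body->asXML();
    $scripts = $xml->xpath('//head/script');
    $_scripts = ''; // initialise before appending
    foreach ($scripts as $script) {
        $_scripts .= $script->asXML();
    }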

If your HTML is not well-formed, then you will need to resort to something like Tidy to normalize it first (or, better, fix the scripts that output the invalid HTML).

0
    $doc = new DOMDocument();
    $doc->loadHTML($loaded_result);
    $xpath = new DOMXPath($doc);
    $kod = $xpath->query("//head/script");
    $i = 0;
    foreach ($kod as $node) {
        echo 'im the script nº'.(++$i).' in the head and this is my content: ';
        echo $doc->saveXML($node)."\n";
    }
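Building on the DOMDocument idea, the following is one possible sketch of how both parts of the question (body content and head scripts) might be pulled out without regular expressions. The sample string stands in for the cURL response and is an assumption. $loaded_result, $_content and $headDetails are the names used in the question. Calling libxml_use_internal_errors() here is a suggested way to keep parser warnings quiet on sloppy markup, not something the original answer included.

```php
<?php
// Hypothetical stand-in for the cURL response ($loaded_result in the question).
$loaded_result = '<html><head><title>News</title>'
    . '<script type="text/javascript">var a = 1;</script></head>'
    . '<body><p>Hello</p><script type="text/javascript">var b = 2;</script></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($loaded_result);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Scripts that live in the head only; scripts in the body are not touched here.
$headDetails = '';
foreach ($xpath->query('//head/script') as $script) {
    $headDetails .= $doc->saveHTML($script);
}

// Inner HTML of the body: serialize each child node of <body> in turn.
$_content = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $_content .= $doc->saveHTML($child);
    }
} else {
    $_content = '<p>No content to display.</p>';
}

echo $headDetails, "\n", $_content, "\n";
```

Because the XPath query is anchored at //head, a script inside the body stays with the body content rather than being pulled into the head, which avoids the greedy-match problem described in the question.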
-1

Source: https://habr.com/ru/post/1300528/

