Extracting the body of an HTML document using PHP

Question

I know that it is better to use the DOM for this purpose, but I am trying to extract the text as follows:

 <?php
 $html=<<<EOD
 <html>
 <head>
 </head>
 <body>
 <p>Some text</p>
 </body>
 </html>
 EOD;

 preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
 if (empty($matches))
     exit;

 $matched_body_start_tag = $matches[0][0];
 $index_of_body_start_tag = $matches[0][1];
 $index_of_body_end_tag = strpos($html, '</body>');

 $body = substr(
     $html,
     $index_of_body_start_tag + strlen($matched_body_start_tag),
     $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
 );
 echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I get more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I use:

 $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I do not see anything wrong with this formula.

Can anyone suggest where the problem is?

Many thanks to all of you.

EDIT:

Thank you all very much. The mistake was in my own reasoning. After reading your answers, I now understand what the problem is. The length should be:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
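For comparison, the DOM-based approach mentioned at the top of the question might look roughly like this. It is only a minimal sketch, assuming the DOM extension (ext-dom) is enabled and that the input is the same seven-line document; the variable names are illustrative and not taken from the thread.

```php
<?php
$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

// Assumption: ext-dom is available, as it is in most PHP builds.
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

// First (and normally only) <body> element, or null if there is none
$bodyNode = $doc->getElementsByTagName('body')->item(0);

$body = '';
if ($bodyNode !== null) {
    // Serialize each child of <body> to rebuild its inner HTML
    foreach ($bodyNode->childNodes as $child) {
        $body .= $doc->saveHTML($child);
    }
}

echo trim($body), "\n";
```

Because the parser locates the element, no index arithmetic is needed, and attributes on the <body> tag do not matter.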

+4

php regex text text-processing html-content-extraction

bobo Feb 06 '11 at 1:42

4 answers

Personally, I would not use a regex.

 <?php
 $html = <<<EOD
 <html>
 <head>
 <title>Example</title>
 </head>
 <body>
 <h1>foobar</h1>
 </body>
 </html>
 EOD;

 $s = strpos($html, '<body>') + strlen('<body>');
 $f = '</body>';

 echo trim(substr($html, $s, strpos($html, $f) - $s));
 ?>

returns <h1>foobar</h1>
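Note that searching for the literal string '<body>' only works when the opening tag has no attributes. One possible way to relax that while still avoiding regex is sketched below. The helper name extract_body is hypothetical, and the approach is deliberately naive: it assumes there is no '>' inside an attribute value and no other tag whose name starts with "body".

```php
<?php
// Hypothetical helper (not from the answer): extract the inner HTML of <body>
// using only string functions, allowing attributes on the opening tag.
function extract_body(string $html): ?string
{
    $open = stripos($html, '<body');            // start of the opening tag
    if ($open === false) {
        return null;
    }
    $openEnd = strpos($html, '>', $open);       // end of the opening tag
    if ($openEnd === false) {
        return null;
    }
    $start = $openEnd + 1;                      // first character after the tag
    $close = stripos($html, '</body>', $start); // start of the closing tag
    if ($close === false) {
        return null;
    }
    // The length is the end index minus the start index.
    return trim(substr($html, $start, $close - $start));
}

$html = "<html>\n<head>\n<title>Example</title>\n</head>\n"
      . "<body class=\"x\">\n<h1>foobar</h1>\n</body>\n</html>";

echo extract_body($html), "\n"; // prints <h1>foobar</h1>
```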

+4

jhine Feb 06 '11 at 2:07

The problem is in how you calculate the length argument for substr. You must subtract both terms:

 $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are adding the second term:

 + strlen($matched_body_start_tag)

That said, you can do this with preg_match alone. You just need to make sure that . matches newlines by using the s modifier:

 preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
 echo $matches[1];

Outputs:

 <p>Some text</p>
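To see the effect of the s modifier in isolation, the pattern can be tried with and without it. The following is a small sketch rather than part of the answer: the wrapper name body_via_regex and the extra i flag are additions made for illustration.

```php
<?php
// Hypothetical wrapper (not part of the answer) around the pattern above.
function body_via_regex(string $html): ?string
{
    // s: let . match newlines; i: also accept <BODY> (the i flag is an addition)
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null; // no <body>...</body> pair found
}

$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

// Without the s modifier, . cannot cross the newline after <body>,
// so the pattern does not match at all.
$withoutS = preg_match('/<body[^>]*>(.*?)<\/body>/', $html);

var_dump($withoutS);                    // int(0)
echo trim(body_via_regex($html)), "\n"; // <p>Some text</p>
```

Returning null instead of reading $matches[1] unconditionally avoids an undefined-index notice when the document has no body element.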

+2

netcoder Feb 06 '11 at 1:59

Somebody has probably already found your error; I did not read all the answers.
The algebra is wrong.

code here

Btw, seeing ideone.com for the first time, it's pretty cool.

 $body = substr(
     $html,
     $index_of_body_start_tag + strlen($matched_body_start_tag),
     $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
 );

or..

 $body = substr(
     $html,
     $index_of_body_start_tag + strlen($matched_body_start_tag),
     $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
 );

+1

sln Feb 06 '11 at 5:33

ludesign · Accepted Answer · 2011-02-06T02:02:50+0000

The problem is that your string contains newlines. By default, . does not match a newline character, so the pattern only matches within a single line. You need to add the /s modifier to match across multiple lines.

Here is my solution; I prefer doing it this way.

 <?php
 $html=<<<EOD
 <html>
 <head>
 </head>
 <body buu="grger" ga="Gag">
 <p>Some text</p>
 </body>
 </html>
 EOD;

 // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
 if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
     $body = $matches[1];
 }

 // outputting all matches for debugging purposes
 var_dump($matches);
 ?>

Edit: I am updating my answer to give you a more complete explanation of why your code is not working.

You have this string:

 <html>
 <head>
 </head>
 <body>
 <p>Some text</p>
 </body>
 </html>

Everything seems to be in order, but at the end of each line there are non-printable characters (line breaks). You have 55 printable characters (counting the space in "Some text") and 6 line breaks. The positions below assume Windows-style \r\n line endings, so each line break counts as 2 characters.

When you reach this part of the code:

 $index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (it starts at position 51), but this count includes the line breaks.

So, when you reach this line of code:

 $index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 31 (including newlines) and:

 $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 51 - 25 + 6 = 32 (the number of characters substr will read), but there are only 16 printable characters between <body> and </body>, plus 4 non-printable characters (the line break after <body> and the line break before </body>, 2 characters each). That is the problem: you need to group the calculation with parentheses, as follows:

 $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).

:) I hope this helps you understand why grouping with parentheses matters. (Sorry for the misleading remark about newline characters; it applies only to the regex example I gave above.)
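The arithmetic above can be checked with a short script. This is a sketch under one explicit assumption: the string uses Windows-style \r\n line endings, which is what the quoted positions 25, 31 and 51 imply. With plain \n line endings the indices would be smaller, but the grouping error would be the same.

```php
<?php
// Assumption: CRLF line endings, matching the positions quoted in the answer.
$html = implode("\r\n", [
    '<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>',
]);

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
$tag   = $matches[0][0];           // "<body>", 6 characters
$start = $matches[0][1];           // 25
$end   = strpos($html, '</body>'); // 51

$wrong = $end - $start + strlen($tag);   // 51 - 25 + 6   = 32
$right = $end - ($start + strlen($tag)); // 51 - (25 + 6) = 20

var_dump($start, $end, $wrong, $right);

// With the corrected length, only the content between the tags is returned.
$body = substr($html, $start + strlen($tag), $right);
var_dump($body === "\r\n<p>Some text</p>\r\n"); // bool(true)
```

With the wrong length of 32, substr reads 12 characters too many, which is why the original output continued past </body>.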
