How to determine if an HTML tag is split across multiple lines

I am writing a PHP script that, among other things, cleans up web pages. Currently the script parses the page line by line, but it breaks when a tag spans several lines, for example:

<img src="example.jpg"
alt="example">

If worst comes to worst, I could pre-process the page by deleting all line breaks and then reinserting them at the nearest >, but this feels like a hack.

Ideally, I could detect a tag that spans lines, join only those lines, and continue processing.
So what is the best way to detect this?

+3
6 answers

Here rstrpos is a reverse strpos: it searches backward from a given position. Usage:

for ($i = 0; $i < count($lines); $i++)
{
    // $i is taken by reference, so it advances past any lines merged into $line
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}

The functions:

// Returns the position of the last occurrence of $charToFind that starts
// before $relativePos, or FALSE if there is none.
function rstrpos($string, $charToFind, $relativePos)
{
    $searchPos = $relativePos - 1;

    while ($searchPos >= 0)
    {
        if (substr($string, $searchPos, strlen($charToFind)) === $charToFind)
        {
            return $searchPos;
        }
        $searchPos--;
    }

    return FALSE;
}

function handle_multiline_tags(&$i, $line, $lines)
{
    // If a tag is opened but not closed before the line break,
    // append the following line and check the joined line again.
    $open = rstrpos($line, '<', strlen($line));
    $close = rstrpos($line, '>', strlen($line));
    $unclosed = ($open !== FALSE) && ($close === FALSE || $open > $close);

    if ($unclosed && isset($lines[$i + 1]))
    {
        $i++;
        // The parameter is declared as &$i, so no & is needed (or allowed) at the call site.
        // The space keeps attributes on adjacent lines from running together.
        return handle_multiline_tags($i, trim($line).' '.trim($lines[$i]), $lines);
    }
    else
    {
        return trim($line);
    }
}


+1

: HTML . HTML . HTML . HTML HTML - .

, - PHP, PHP5- PHP5.

+7
+2


, (, , ) , HTML- , HTML, : < > pairs.

, :

  • , , , < br/" >
  • , , , <p> text </p>

(p): mutiline , , .

+1

Why not read a line into a string and then check that string for opening and closing tags? If a tag spans more than one line, append the following line to the string and move the part before the opening bracket to your processed output. Then just parse the whole file this way. It isn't pretty, but it should work.
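
The buffering idea above could be sketched roughly as follows. This is only an illustration, not a tested solution: the helper name join_split_tags is made up here, and it assumes that < and > never appear literally inside attribute values, comments, or scripts.

```php
<?php
// Hypothetical helper: merges physical lines so that no tag is split
// across two entries of the returned array.
function join_split_tags(array $lines)
{
    $result = array();
    $buffer = '';

    foreach ($lines as $line) {
        // Append the next line to whatever is still pending.
        $buffer = ($buffer === '') ? trim($line) : $buffer . ' ' . trim($line);

        $open  = strrpos($buffer, '<');
        $close = strrpos($buffer, '>');

        // A tag is still open if the last '<' has no '>' after it.
        $tagIsOpen = ($open !== false) && ($close === false || $open > $close);

        if (!$tagIsOpen) {
            $result[] = $buffer;
            $buffer = '';
        }
    }

    // Emit an unterminated remainder instead of silently dropping it.
    if ($buffer !== '') {
        $result[] = $buffer;
    }

    return $result;
}

$lines = array('<p>text</p>', '<img src="example.jpg"', 'alt="example">', 'done');
print_r(join_split_tags($lines));
```

With the sample input, the two img lines come back as a single entry, <img src="example.jpg" alt="example">, and the other lines pass through unchanged.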

0

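If you must keep your current parsing method and it is based on regular expressions, run the pattern over the whole page instead of one line at a time so that a match can cross line breaks. Note that in PHP's PCRE it is the "s" flag that lets . match newlines; the multi-line flag "m" only changes where ^ and $ match.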
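
A minimal sketch of the regular-expression route, assuming the whole page has already been read into one string (for example with file_get_contents). The helper name unwrap_tags is hypothetical. The character class [^>] already matches newlines, so this particular pattern needs no flag; it will still be fooled by a literal > inside an attribute value.

```php
<?php
// Hypothetical helper: collapses line breaks that occur inside a tag,
// leaving line breaks in ordinary text untouched.
function unwrap_tags($html)
{
    return preg_replace_callback(
        '/<[^>]*>/',
        function ($match) {
            // Replace each whitespace run that contains a line break with one space.
            return preg_replace('/\s*[\r\n]+\s*/', ' ', $match[0]);
        },
        $html
    );
}

$page = "<p>one\ntwo</p>\n<img src=\"example.jpg\"\nalt=\"example\">\n";
echo unwrap_tags($page);
```

After this pass every tag sits on a single line, so the existing line-by-line processing can run on the result without further changes.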

0

Source: https://habr.com/ru/post/1696583/

