Preg_match to find the contents of the string between the <html> and </html> tags

I am working on a PHP script that reads the contents of emails and pulls out certain information for storage in a database.

Using imap_fetchbody($imap_stream, $msg_number, 1), I can get the text of the message. In some cases (especially emails sent via SMS from mobile phones), the body of the message looks like this:

===------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<html> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>Multimedia Message</title> 
    </head> 
    <body leftmargin="0" topmargin="0"> 


                <tr height="15" style="border-top: 1px solid #0F7BBC;"> 
                    <td> 
                        SMS to email test
                    </td> 
                </tr> 


     </body> 
</html> 


------=_Part_110734_170079945.1283532109852--===

I want to pull out the "content" of the message. So my plan is this:

Check whether the body contains html tags. If not, I can read it as is (it is not an HTML email).

If it does, extract the content between the html tags, then strip all the remaining HTML tags; whatever is left is the "content".

My first attempt:

$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]

This doesn't work (presumably because $body contains newlines). So I tried:

$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);

That doesn't work either.

So, what $pattern should I use to get the content between the html tags?

Edit: this is what I ended up using:

$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';


Edit 2: As Gumbo points out, this approach has its limits when it comes to HTML. See also: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm

+3
4

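[.\s] does not do what you expect: inside a character class, . is a literal dot, not "any character". Use (.|\s) or [\s\S] instead, or keep the plain . and add the s modifier so that it also matches newlines.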
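To make the difference concrete, here is a minimal sketch that runs both character classes against a multi-line body. The sample string in $body is made up for illustration and is not taken from a real message.

```php
<?php
// Hypothetical sample body; the newlines are what break the naive patterns.
$body = "<html>\n<body>\nSMS to email test\n</body>\n</html>";

// Inside [...] the dot is a literal ".", so [.\s] only matches dots and
// whitespace. The "<" of "<body>" stops the match, and the pattern fails.
$dotClass = preg_match('/<html[^>]*>([.\s]*?)<\/html>/i', $body, $m1);

// [\s\S] is "whitespace or non-whitespace", i.e. any character at all,
// including newlines.
$anyClass = preg_match('/<html[^>]*>([\s\S]*?)<\/html>/i', $body, $m2);

var_dump($dotClass); // int(0)
var_dump($anyClass); // int(1)
```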

That said, regular expressions are a poor fit for parsing HTML. An HTML parser is the more reliable tool.

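Also, the parts of a multipart message aren't delimited by <html>…</html>. They are delimited by the boundary. Within each part, the headers are separated from the body by CRLF + CRLF.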
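As a rough illustration of the CRLF + CRLF rule, the sketch below splits one MIME part into its headers and its body. The $part string is a hypothetical sample modelled on the message in the question. The header loop is deliberately simplistic and assumes no folded (multi-line) headers.

```php
<?php
// Hypothetical raw MIME part with CRLF line endings.
$part = "Content-Type: text/html;charset=UTF-8;\r\n"
      . "Content-Transfer-Encoding: 7bit\r\n"
      . "Content-Disposition: inline\r\n"
      . "\r\n"
      . "<html>\r\n<body>SMS to email test</body>\r\n</html>\r\n";

// Split once at the first blank line: headers before it, body after it.
[$rawHeaders, $partBody] = explode("\r\n\r\n", $part, 2);

// Build a lowercase name => value map (assumes one header per line).
$headers = [];
foreach (explode("\r\n", $rawHeaders) as $line) {
    [$name, $value] = explode(':', $line, 2);
    $headers[strtolower(trim($name))] = trim($value);
}

echo $headers['content-transfer-encoding'], "\n"; // 7bit
```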

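Finally, since you are already going through IMAP, doesn't the PHP IMAP API let you fetch the individual parts of the message directly?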

+2
$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

This assumes that the byte 0x00 never appears in the body.

+3

To parse the html, you could use a parser such as: http://php-html.sourceforge.net/

To remove the tags, use strip_tags: php.net/strip_tags
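A short sketch of what strip_tags does to a fragment like the one in the question. The $fragment string is a made-up sample, and the html_entity_decode step is an extra assumption on top of the answer, added because strip_tags leaves entities such as &amp; untouched.

```php
<?php
// Hypothetical fragment resembling the table row from the question.
$fragment = '<tr height="15"><td> SMS to email test &amp; more </td></tr>';

// strip_tags removes the markup but keeps the text between the tags.
$text = strip_tags($fragment);

// Entities survive strip_tags, so decode them separately, then trim.
$text = trim(html_entity_decode($text, ENT_QUOTES, 'UTF-8'));

echo $text, "\n"; // SMS to email test & more
```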

+2

Use the s modifier, which makes the dot match newlines as well. For example:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);
+1
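Putting the plan from the question together with the s modifier, one possible sketch looks like this. The function name extractMessageText is hypothetical, and dropping the head block before stripping tags is an assumption made here so that the title text does not end up in the result.

```php
<?php
// Hypothetical helper; the name and the <head> handling are illustrative only.
function extractMessageText(string $body): string
{
    // "s" lets "." match newlines; "i" matches the tags case-insensitively.
    if (preg_match('/<html[^>]*>(.*?)<\/html>/si', $body, $matches)) {
        $body = $matches[1];
        // Assumption: discard <head> so the <title> text is not included.
        $body = preg_replace('/<head[^>]*>.*?<\/head>/si', '', $body);
        $body = strip_tags($body);
    }
    // Collapse runs of whitespace and trim; plain-text bodies pass through.
    return trim(preg_replace('/\s+/', ' ', $body));
}

$body = "<html>\n<head><title>Multimedia Message</title></head>\n"
      . "<body>\n<tr><td>\n  SMS to email test\n</td></tr>\n</body>\n</html>";

echo extractMessageText($body), "\n";              // SMS to email test
echo extractMessageText("plain text body"), "\n";  // plain text body
```

This stays within the regex approach the question asks about; for messy real-world markup, a proper HTML parser is likely to be more robust.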

Source: https://habr.com/ru/post/1763151/

