Processing word-wrapped email (Content-Type: text/plain)

I process email in my application, and everything works fine until I get an email from a user whose mail server word-wraps plain-text messages. I know word wrap is part of the RFC specification, so I'm just looking for the best way to handle it so that the message displays nicely.

Original Email:

Here is my main problem. When I send an email, the text breaks pretty weird. It seems that the message itself is broken. I am not sure why this is so, because my original letter does not look like that.

Here's what the received email looks like (CRLF marks where their mail server inserts a line break):

Here is my main problem. When I send an email, the text breaks pretty CRLF
weird. It seems that the message itself is broken. I am not sure CRLF
why this is so, because my original letter does not look like CRLF
that.

My processing code does the following and then inserts the result into the database.

    // Convert line breaks to <br> so they survive as HTML.
    $dirty_string = nl2br($dirty_string);

    $config = HTMLPurifier_Config::createDefault();
    $config->set('AutoFormat.RemoveEmpty', true);
    $config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);
    // Allow only links, line breaks and paragraphs.
    $config->set('HTML.Allowed', 'a[href],br,p');

    $purifier = new HTMLPurifier($config);
    $clean_string = $purifier->purify($dirty_string);

Below is the result that is displayed. If the div on my page is not wide enough for a line, the browser wraps it automatically, but the line break from nl2br() then causes the next line to be short.

Here is my main problem. When I send an email, the text
breaks pretty
weird. It seems that the message itself is
broken. I am not sure
why this is so, because my original letter does not
look like
that.

I thought that maybe I could just turn each double CRLF into a new paragraph and strip all the single CRLFs, merging the lines into one that will display correctly. But if someone sends a bulleted list like the following in an email, this will break the list.

This is my list CRLF
- Item 1 CRLF
- Item 2 CRLF
etc...
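
For illustration, here is a minimal sketch of that merge-everything idea. The function name naiveUnwrap is hypothetical (it does not come from any library), and the sketch assumes blank lines contain no stray whitespace.

```php
<?php
// Naive sketch: blank lines delimit paragraphs, single line breaks become spaces.
// naiveUnwrap is a hypothetical helper written for this example only.
function naiveUnwrap(string $text): string
{
    // Normalise CRLF and bare CR to LF first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Split into paragraphs on runs of two or more line breaks.
    $paragraphs = preg_split('/\n{2,}/', trim($text));
    $out = [];
    foreach ($paragraphs as $p) {
        // Collapse each remaining line break (plus surrounding spaces) to one space.
        $out[] = preg_replace('/[ \t]*\n[ \t]*/', ' ', $p);
    }
    return implode("\n\n", $out);
}
```

Wrapped paragraphs come out as single lines, but a list such as "This is my list", "- Item 1", "- Item 2" is flattened to "This is my list - Item 1 - Item 2", which is exactly the breakage described above.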

Any help would be greatly appreciated.

+6
5 answers

Parsing messages is a typical example of a problem that seems simple but is actually full of edge cases that break simple parsers. It is also not a new problem, so there are many existing solutions that work fine. Some options:

You may already have written a great parser that needs only this small change to be perfect, but most likely you will save a lot of time and suffering by using existing tools to do the job.

+1

How about this: for any line where the next line contains words and does not start with a whitespace character (for example, the indent of a list item), check whether the line length is between 65 and 80 characters. If so, remove the trailing line break (and add a space if the end of the line does not already contain a space or punctuation). This will catch most of your word-wrapping cases and leave most of your lists alone.
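
A rough sketch of this heuristic, under a few assumptions: the function name unwrapByLength and its parameters are made up for the example, strlen() counts bytes (multibyte text may need mb_strlen()), and the sketch always inserts a space unless the line already ends in whitespace. That last point deviates from the punctuation exception above, since joining after a full stop without a space would run two sentences together.

```php
<?php
// Sketch of the length-based heuristic; unwrapByLength is a hypothetical name.
// The default bounds of 65 and 80 come from the answer above.
function unwrapByLength(string $text, int $min = 65, int $max = 80): string
{
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $count = count($lines);
    $result = '';
    for ($i = 0; $i < $count; $i++) {
        $line = $lines[$i];
        $result .= $line;
        if ($i === $count - 1) {
            break; // no separator after the last line
        }
        $len = strlen($line);
        // The next line must be non-empty and must not start with whitespace.
        $nextIsPlainText = preg_match('/^\S/', $lines[$i + 1]) === 1;
        if ($nextIsPlainText && $len >= $min && $len <= $max) {
            // Probably a wrap point: join, adding a space unless one is already there.
            $result .= preg_match('/\s$/', $line) === 1 ? '' : ' ';
        } else {
            // Short line, blank line or indented continuation: keep the break.
            $result .= "\n";
        }
    }
    return $result;
}
```

Short lines such as list items fall below the lower bound and keep their line breaks, which is why lists mostly survive. A list item that happens to be 65 to 80 characters long would still be merged with the next one.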

0

You can try using TinyMCE to display the email message. It will format it correctly. I have used TinyMCE several times to enter data and save it to the database, and every time it displayed the data correctly when I read it back, no matter how strange the formatting was.

0

How about this hack: delete CRLF characters at any position that is a multiple of 78, plus up to 5 characters to account for the fact that the mail server won't cut a line mid-word.

So you should look for CRLF characters at these positions:

  • 78 or 79 or 80 or 81 or 82 or 83, and
  • 156 or 157 or 158 or 159 or 160 or 161, and
  • etc.

This, of course, assumes that the longest words are 5 characters long. You should tune this based on the emails you need to parse.
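
A literal sketch of this hack, with assumptions spelled out: the name unwrapByPosition is invented for the example, positions are 0-based byte offsets, and they are measured in the rebuilt text, where each removed CRLF counts as one space. The answer does not say how to count positions, so that part is a guess.

```php
<?php
// Sketch of the position-based hack; unwrapByPosition is a hypothetical name.
// $width (78) and $slack (5) are the values suggested in the answer above.
function unwrapByPosition(string $text, int $width = 78, int $slack = 5): string
{
    $parts = explode("\r\n", $text);
    $result = array_shift($parts);
    foreach ($parts as $part) {
        // Offset at which this CRLF sits in the text rebuilt so far.
        $pos = strlen($result);
        // True for 78..83, 156..161, and so on.
        $nearMultiple = $pos >= $width && ($pos % $width) <= $slack;
        // Replace the break with a space inside the window, keep it otherwise.
        $result .= ($nearMultiple ? ' ' : "\r\n") . $part;
    }
    return $result;
}
```

Note that the windows only line up if every wrapped line is close to the full width. One short line, or a paragraph break, shifts all later offsets, so this is best treated as a starting point for experiments rather than a dependable rule.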

0

Here's a function that does the job pretty well:

    function PlaintextEmailBrokenLineCombine($lineSet, $startIndex = 0) {
        $result = '';
        $lineCount = count($lineSet);
        for ($i = $startIndex; $i < $lineCount; $i++) {
            $thisLine = $lineSet[$i];
            $nextLine = ($i < $lineCount - 1 ? $lineSet[$i + 1] : '');
            // First word of the next line; the whole line if it contains no space.
            $spacePos = strpos($nextLine, ' ');
            $nextLineFirstWord = ($spacePos === false ? $nextLine : substr($nextLine, 0, $spacePos));
            $lineSeparator = "\n"; // we assume until we detect invocation of the 78char rule
            if (strlen($thisLine) + strlen($nextLineFirstWord) + 1 > 75) {
                // A line break was PROBABLY put in here where a space once was, so switch back:
                $lineSeparator = ' ';
            }
            $result .= $thisLine . ($i == $lineCount - 1 ? '' : $lineSeparator); // no separator for the last line
        }
        return $result;
    }

It is slightly unusual in that it expects an array of lines from the plain-text message. Here's how to use it:

    $Parser = new MimeMailParser();
    $Parser->setText($rawEmailText);
    $plaintext = $Parser->getMessageBody('text'); // or however you get it, many ways
    $lineSet = preg_split('/\r?\n/', $plaintext); // split on LF or CRLF so no stray "\r" remains
    $niceText = PlaintextEmailBrokenLineCombine($lineSet);

$niceText is what you want: this is a fairly accurate way to take text containing those intrusive server-inserted breaks and replace them with the original spaces.

0

Source: https://habr.com/ru/post/912418/

