Convert a UTF-8 Encoded Mail Subject to Plain Text

I would like to convert the following raw mail subject to plain UTF-8 text:

=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=

The expected plain text is:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

My first conversion approach:

    $mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
    mb_internal_encoding("UTF-8");
    echo mb_decode_mimeheader($mime);

This gives me the following result:

Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet

(Questions here: What am I doing wrong? Why do these underscores occur?)

My second conversion approach:

    $mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
    echo imap_utf8($mime);

This gives me the following (correct) result:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

Why does this work? Which method should I rely on?

The reason I ask is that I previously asked another question about decoding mail subjects, where mb_decode_mimeheader was the solution, whereas here imap_utf8 is the way to go. How can I guarantee that both of these examples are decoded correctly:

=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=

and

=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=

Should give me the expected results:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

and

Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"

4 answers

Based on hbit's answer, I improved the imapUtf8() function so that it converts the subject text to UTF-8 using the charset information of each decoded part. The result looks like this:

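    function imapUtf8($str){
        $convStr = '';
        $subLines = preg_split('/[\r\n]+/', $str); // split multi-line subjects
        for ($i=0; $i < count($subLines); $i++) {
            $convLine = '';
            // split the line into parts, each tagged with its charset
            $linePartArr = imap_mime_header_decode($subLines[$i]);
            for ($j=0; $j < count($linePartArr); $j++) {
                if ($linePartArr[$j]->charset === 'default') {
                    // unencoded part: skip it when it is only a single space
                    // (such as the gap between two encoded words)
                    if ($linePartArr[$j]->text != " ") {
                        $convLine .= ($linePartArr[$j]->text);
                    }
                } else {
                    // encoded part: convert from its declared charset to UTF-8
                    $convLine .= iconv($linePartArr[$j]->charset, 'UTF-8', $linePartArr[$j]->text);
                }
            }
            $convStr .= $convLine;
        }
        return $convStr;
    }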

This is also mentioned in the comments on the manual page for mb_decode_mimeheader, and I really think it is a bug. It is not in the bug database, so I would report it as a new one.

However, AFAIK imap_mime_header_decode handles your encodings without any problems, so your code is safe with it.


This function works for both examples:

    function imapUtf8($str){
        $convStr = '';
        $subLines = preg_split('/[\r\n]+/',$str); // split multi-line subjects
        for($i=0; $i < count($subLines); $i++){ // go through lines
            $convLine = '';
            $linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
            for($j=0; $j < count($linePartArr); $j++){
                $convLine .= ($linePartArr[$j]->text); // append sub-parts of line together
            }
            $convStr .= $convLine; // append to whole subject
        }
        return $convStr; // return converted subject
    }

Tests:

    $sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
    $sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
    echo imapUtf8($sub1);
    echo imapUtf8($sub2);

Result:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
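To see what a decoder has to do with these two subjects, here is a rough sketch in plain PHP that does not use the imap extension. It is only an illustration under some assumptions: decodeEncodedWords() is a hypothetical helper name, the regular expressions cover only well-formed encoded words like the ones in the question, and iconv is assumed to be available for the charset conversion. It is not a replacement for imap_mime_header_decode.

```php
<?php
// Hypothetical helper for illustration; not part of PHP or of the answers here.
function decodeEncodedWords(string $header): string
{
    // Whitespace that only separates two adjacent encoded words is dropped.
    $header = preg_replace('/(\?=)\s+(=\?)/', '$1$2', $header);

    // Match =?charset?encoding?text?= and decode each word on its own.
    return preg_replace_callback(
        '/=\?([^?]+)\?([BbQq])\?([^?]*)\?=/',
        function (array $m): string {
            [, $charset, $encoding, $text] = $m;
            if (strtoupper($encoding) === 'B') {
                // "B" encoding is plain base64.
                $bytes = base64_decode($text);
            } else {
                // "Q" encoding: "_" means space, "=XX" is a hex-encoded byte.
                $bytes = quoted_printable_decode(str_replace('_', ' ', $text));
            }
            // Convert from the declared charset; fall back to the raw bytes on failure.
            $utf8 = iconv($charset, 'UTF-8', $bytes);
            return $utf8 === false ? $bytes : $utf8;
        },
        $header
    );
}

$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo decodeEncodedWords($sub1), "\n";
echo decodeEncodedWords($sub2), "\n";
```

Each encoded word is decoded separately and the pieces are simply concatenated, which is why "abgem" and "eldet" end up joined without a space.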


About the mysterious underscore in the Subject header field:

RFC 2047, section 4.2 (2), explicitly states:

The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.). (This character may not pass through some internetwork mail gateways, but its use will greatly enhance readability of "Q" encoded data with mail readers that do not support this encoding.) Note that the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.

The encoding rules for the Subject header are documented in RFC 2047 itself.
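As a small illustration of that rule, the text part of a "Q" encoded word can be decoded by first turning every underscore into a space and then decoding the =XX hex escapes. This is only a sketch: decodeQText() is a made-up name, and it expects just the text between the second "?" and the closing "?=", not a whole header.

```php
<?php
// Hypothetical helper: decodes only the text portion of a "Q" encoded word.
function decodeQText(string $text): string
{
    // Per the rule quoted above, "_" always stands for byte 0x20 (space).
    $withSpaces = str_replace('_', ' ', $text);
    // The remaining =XX sequences are ordinary quoted-printable escapes.
    return quoted_printable_decode($withSpaces);
}

echo decodeQText('Schuker_hat_sich_vom_=C3=9Cbungsabend'), "\n";
// prints: Schuker hat sich vom Übungsabend
```

A decoder that skips the first step leaves the underscores in place, which matches the output seen in the question.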


Source: https://habr.com/ru/post/1397299/

