Parsing MIME mail, extracting binary attachments, and converting text

Question

I started using the mime package to parse email and extract attachments. Whatever I did, the binary attachment was always corrupted when I wrote it to disk. Then I realized that, for some strange reason, all base64 attachments are already decoded when the message is parsed into data types. This is where my problem begins.

If the attachment is an image, it does not work. The first thing I did was convert the extracted text attachment to ByteString using TE.encodeUtf8. No luck. I tried all the Data.Text.Encoding functions for converting Text to ByteString, and nothing worked. Then, for some stupid reason, I encoded the extracted text back to base64 and then decoded it from base64 again, and this time it worked. Why?

So if I encode the extracted attachment to base64 and decode it again, it works:

    B.writeFile "tmp/test.jpg" $ B.pack $ decode $ encodeRawString True $ T.unpack attachment

Why? Why does the simple conversion from Text to ByteString not work, while this nonsense does?

In the end I played with it a little more and got it working with Data.ByteString.Char8, like this:

    B.writeFile "tmp/test.jpg" $ BC.pack $ T.unpack attachment

So I still need to convert Text to String, then String to a Char8 ByteString, and only then does it work and I get an uncorrupted image.

Can someone explain all this? Why is handling a binary image so painful? Why can't I convert the decoded base64 text directly to ByteString? What am I missing?

Thanks.

UPDATE

Here is the code that extracts the attachments, as requested. I did not think it was relevant to the text encoding/decoding problem.

    import Codec.MIME.Parse
    import Codec.MIME.Type
    import Data.Maybe
    import Data.Text (Text, unpack, strip)
    import qualified Data.Text as T (null)
    import Data.Text.Encoding (encodeUtf8)
    import Data.ByteString (ByteString)

    data Attachment = Attachment
      { attName :: Text
      , attSize :: Int
      , attBody :: Text
      } deriving (Show)

    genAttach :: Text -> [Attachment]
    genAttach m =
      let prs v =
            if isAttach v
              then [Just (mkAttach v)]
              else case mime_val_content v of
                     Single c -> if T.null c then [Nothing] else prs $ parseMIMEMessage c
                     Multi vs -> concatMap prs vs
      in let atts = filter isJust $ prs $ parseMIMEMessage m
         in if null atts then [] else map fromJust atts

    isAttach :: MIMEValue -> Bool
    isAttach mv = maybe False check $ mime_val_disp mv
      where check d = if (dispType d) == DispAttachment then True else False

    mkAttach :: MIMEValue -> Attachment
    mkAttach v =
      let prms = dispParams $ fromJust $ mime_val_disp v
          Single cont = mime_val_content v
          name = check . filter isFn
            where isFn (Filename _) = True
                  isFn _ = False
                  check = maybe "" (\(Filename n) -> n) . listToMaybe
          size = check . filter isSz
            where isSz (Size _) = True
                  isSz _ = False
                  check = maybe "" (\(Size n) -> n) . listToMaybe
      in Attachment { attName = name prms
                    , attSize = let s = size prms in if T.null s then 0 else read $ unpack s
                    , attBody = cont }

haskell

r.sendecky Jan 08 '15 at 21:16

2 answers

Erikr · Answer 1 · 2015-01-09T17:09:33+0000

Note that the mime package chooses to represent binary content as a Text value. The way to get the corresponding ByteString is to latin1-encode the text. In this case, all code points in the Text are guaranteed to be in the range 0 to 255.

Create a file with this content:

    Content-Type: image/gif
    Content-Transfer-Encoding: base64

    R0lGODlhAQABAIABAP8AAP///yH5BAEAAAEALAAAAAABAAEAAAICRAEAOw==

This is the base64 encoding of the red 1x1 GIF image at http://commons.wikimedia.org/wiki/File:1x1.GIF

Here is the code that uses parseMIMEMessage to recreate this file.

    import Codec.MIME.Parse
    import Codec.MIME.Type
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO
    import qualified Data.ByteString.Char8 as BS
    import System.IO

    test1 path = do
      msg <- TIO.readFile path
      let mval = parseMIMEMessage msg
          Single img = mime_val_content mval
      withBinaryFile "out-io" WriteMode $ \h -> do
        hSetEncoding h latin1
        TIO.hPutStr h img

    test2 path = do
      msg <- TIO.readFile path
      let mval = parseMIMEMessage msg
          Single img = mime_val_content mval
          bytes = BS.pack $ T.unpack img
      BS.writeFile "out-bs" bytes

In test2 the latin1 encoding is performed by BS.pack . T.unpack.
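BS.pack from Data.ByteString.Char8 silently truncates any character above 255 to its low 8 bits, which is harmless only as long as the guarantee about the code point range holds. If you would rather have that assumption checked, a minimal sketch could look like the following. The name textToLatin1 is hypothetical and not part of the mime package; only the text and bytestring packages are assumed.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Char (ord)

-- Hypothetical helper (not from the mime package): latin1-encode a Text,
-- returning Nothing if any code point does not fit into a single byte.
textToLatin1 :: T.Text -> Maybe B.ByteString
textToLatin1 t
  | T.all ((<= 0xFF) . ord) t =
      Just (B.pack (map (fromIntegral . ord) (T.unpack t)))
  | otherwise = Nothing

main :: IO ()
main = do
  -- Code points FF D8 FF map one-to-one onto bytes.
  print (fmap B.unpack (textToLatin1 (T.pack "\xFF\xD8\xFF")))  -- Just [255,216,255]
  -- U+20AC (the euro sign) has no single-byte latin1 representation.
  print (textToLatin1 (T.pack "\x20AC"))                        -- Nothing
```

For attachment bodies produced by parseMIMEMessage the Nothing case should not occur, so this is mainly a guard against passing in Text from some other source.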

Dave turner · Answer 2 · 2017-02-18T10:06:40+0000

The mime package uses Text everywhere, but undoes any base64 (or quoted-printable) encoding of message body parts according to their Content-Transfer-Encoding MIME header field.

This means that the body of the resulting message part has type Text, but (unless the MIME type is text/*) this is not the appropriate type, since the body should really be a sequence of bytes, not characters. The characters it uses are the Unicode code points 00 to FF, which have an obvious mapping to bytes, but they are not the same thing. (Moreover, if the MIME type is text/* but the charset is not us-ascii or iso-8859-1, then I think mime will garble the content.)

I suspect that to write the Text to disk you used Data.Text.IO.writeFile or similar, which uses the character encoding specified in your environment to convert characters to bytes. Many common character encodings map characters 00 to 7F to bytes 00 to 7F, but are unlikely to map the remaining characters 80 to FF to their corresponding bytes. On many systems these days the environment encoding is UTF-8, which does not even map these characters to single bytes. (Were the file sizes different from what you expected?)
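To make the size difference concrete, here is a small sketch that assumes only the text and bytestring packages. It encodes the same three characters (standing in for the bytes FF D8 FF at the start of a JPEG) once with encodeUtf8 and once through Data.ByteString.Char8. The names sample, viaUtf8, and viaChar8 are made up for the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Three characters with code points FF, D8, FF, as a decoded
-- binary body would hold them in a Text value.
sample :: T.Text
sample = T.pack "\xFF\xD8\xFF"

-- UTF-8 encodes each code point in 80..FF as two bytes.
viaUtf8 :: B.ByteString
viaUtf8 = TE.encodeUtf8 sample

-- Char8 maps each code point in 00..FF to the byte with the same value.
viaChar8 :: B.ByteString
viaChar8 = BC.pack (T.unpack sample)

main :: IO ()
main = do
  print (B.unpack viaUtf8)   -- [195,191,195,152,195,191]
  print (B.unpack viaChar8)  -- [255,216,255]
```

The UTF-8 result is twice as long and no longer starts with the original byte values, which is consistent with a corrupted image whose file size is larger than expected.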

To write it correctly, you must first convert these characters to the right bytes. The easiest way to do this is to use the functions in the Data.ByteString.Char8 module, which are designed to conflate bytes and characters in exactly this way. But they work on String, not Text, so you need to unpack and repack everything.

I'm not sure how you managed to base-64 encode the Text value, since base-64 encoding also works on bytes, not characters. However you did it, you must have managed to map the characters 80 to FF directly to the corresponding bytes rather than encoding them in some other way.

If you want to know more, there is a good article here: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
