Determining whether a MIME email part is an attached file or message text

As part of some batch email processing, we need to decode and clean up the messages. One important part of this process is separating the message bodies from the mail attachments. The hardest part is deciding when a part with Content-Disposition: inline should be treated as an alternative message body and when as a file.

So far, this code has handled most cases:

    from email import message_from_string

    def split_parts(raw):
        msg = message_from_string(raw)
        bodies = []
        files = []
        for sub in msg.walk():
            if sub.is_multipart():
                continue
            cd = sub.get("Content-Disposition", "")
            if cd.startswith("attachment") or (cd.startswith("inline") and sub.get_filename()):
                files.append(sub)
            else:
                bodies.append(sub)
        return bodies, files

Note that this relies on inline parts having a file name specified in their headers, which Outlook seems to do for all of its multipart/related messages. The Content-ID could also be used as a hint, but according to RFC 2387 it is not a reliable indicator of this.

Therefore, if an embedded image is encoded as a message part that has Content-Disposition: inline, defines a Content-ID and has no file name, the code above may incorrectly classify it as an alternative message body.
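To make the failure case concrete, here is a minimal sketch that builds such a message with the standard library and runs it through the function above. The message layout, the Content-ID value and the dummy image bytes are made up for illustration; real mail from a given user agent may look different.

```python
from email import message_from_string
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_parts(raw):
    # Same logic as the function shown earlier in the question.
    msg = message_from_string(raw)
    bodies, files = [], []
    for sub in msg.walk():
        if sub.is_multipart():
            continue
        cd = sub.get("Content-Disposition", "")
        if cd.startswith("attachment") or (cd.startswith("inline") and sub.get_filename()):
            files.append(sub)
        else:
            bodies.append(sub)
    return bodies, files


# Hypothetical multipart/related message: an HTML body plus an embedded image
# that is inline, has a Content-ID, and carries no file name.
root = MIMEMultipart("related")
root.attach(MIMEText('<img src="cid:logo@example.invalid">', "html"))
image = MIMEImage(b"not really a PNG", _subtype="png")  # dummy bytes
image.add_header("Content-Disposition", "inline")
image.add_header("Content-ID", "<logo@example.invalid>")
root.attach(image)

bodies, files = split_parts(root.as_string())
print([part.get_content_type() for part in bodies])  # ['text/html', 'image/png']
print([part.get_content_type() for part in files])   # []
```

The image ends up in the list of bodies, which is exactly the misclassification described above.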

From what I have read in the RFCs, there is no hope of finding a simple check (especially since coding to the RFC is practically useless in the real world, because nobody follows it); but I was wondering how likely it is to run into the misclassification case.


Justification

I could have a set of functions to handle each multipart/* case and let them recurse indirectly. However, we do not care about faithful rendering; in fact, we filter all HTML messages through tidy. Instead, we are more interested in choosing one of the alternative message bodies and saving as many attachments as possible, even those intended to be embedded.

In addition, some user agents do really strange things when composing multipart/alternative messages with inline attachments that are not meant to be displayed inline (for example, PDF files), as a result of the user dragging an arbitrary file into the composition window.

1 answer

I don't quite follow you, but if you want the bodies, I would suggest that anything with a text/plain or text/html content type and an inline Content-Disposition, with no file name or no Content-ID, may be part of the body.
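A rough sketch of that heuristic follows. The names `looks_like_body` and `split_by_type` are hypothetical, and two choices are assumptions rather than part of the suggestion: a part with no Content-Disposition is treated like an inline one, and the sketch takes the stricter reading that requires both the file name and the Content-ID to be absent.

```python
from email import message_from_string
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BODY_TYPES = {"text/plain", "text/html"}


def looks_like_body(part):
    """Hypothetical helper: guess whether a leaf part is a message body."""
    if part.get_content_type() not in BODY_TYPES:
        return False
    # get_content_disposition() returns None or the lowercased disposition
    # value, e.g. "inline" or "attachment".
    # Assumption: a missing disposition is treated like "inline".
    if part.get_content_disposition() == "attachment":
        return False
    # Stricter reading of the suggestion: no file name and no Content-ID.
    return part.get_filename() is None and part.get("Content-ID") is None


def split_by_type(raw):
    """Hypothetical variant of split_parts() built on looks_like_body()."""
    bodies, files = [], []
    for part in message_from_string(raw).walk():
        if part.is_multipart():
            continue
        (bodies if looks_like_body(part) else files).append(part)
    return bodies, files


# Made-up test message: two alternative bodies plus an unnamed inline image.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("Hello", "plain"))
alternative.attach(MIMEText('<img src="cid:logo@example.invalid">', "html"))
image = MIMEImage(b"not really a PNG", _subtype="png")  # dummy bytes
image.add_header("Content-Disposition", "inline")
image.add_header("Content-ID", "<logo@example.invalid>")
root = MIMEMultipart("related")
root.attach(alternative)
root.attach(image)

bodies, files = split_by_type(root.as_string())
print([part.get_content_type() for part in bodies])  # ['text/plain', 'text/html']
print([part.get_content_type() for part in files])   # ['image/png']
```

Because the check starts from the content type, an unnamed inline image is saved as a file instead of being mistaken for a body. The trade-off is that a text part meant as an attachment but sent without a file name would still be counted as a body.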


Source: https://habr.com/ru/post/1437304/

