Is there a CPAN module to extract the top-level content from an email?

I am looking for a module that tries to extract the top level of content (i.e. drops any quoted content and the signature block) from the text component of an email.

We already have some code that makes an attempt at this, so if there is no existing module that does it, ideas for a name for the new module would also be appreciated (Text::ExtractImmediateLevelOfContentFromEmail seems a little cumbersome).

1 answer

I believe there is no such module, because the task is very application-specific and there is a huge variety of message formatting styles. A minimal implementation takes only a few lines of code:

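    use Email::MIME;

    # $message holds the raw text of the email
    my $email = Email::MIME->new($message);

    # collect the text/plain parts (a missing Content-Type means text/plain)
    my $body = '';
    $email->walk_parts(sub {
        my ($part) = @_;
        return unless ($part->content_type // 'text/plain') =~ m[text/plain];
        $body .= $part->body;
    });

    # strip quoted lines and attribution line
    $body =~ s/^.+ wrote:\n(?=\n* ?>)//m;
    $body =~ s/^ ?>.*\n//gm;

    # strip signature: everything from the "-- " delimiter line onwards
    $body =~ s/^-- \R.*//ms;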

Of course, you can add further heuristics to remove attribution lines written in other languages, as well as Outlook-style quoted text. I would also suggest a heuristic that leaves the quoted text in place when the message is recognized as using interleaved-style quoting, because interleaved replies can lose their meaning once the quoted text is removed.
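
To make those extra heuristics more concrete, here is a rough sketch using only core Perl (5.10 or later). All the sub names, the list of attribution patterns, the "-----Original Message-----" separator check, and the "two or more quoted runs followed by own text" threshold are assumptions chosen for illustration, not something taken from an existing module, so treat them as a starting point to tune against real mail.

```perl
use 5.010;
use strict;
use warnings;

# Illustrative attribution patterns; the non-English formats are assumptions
# based on what common mail clients produce and will need extending.
my @ATTRIBUTION = (
    qr/.+ wrote:/,            # English
    qr/Am .+ schrieb .+:/,    # German
    qr/.+ ha scritto:/,       # Italian
);

# Remove an attribution line that is directly followed by quoted text.
sub strip_attribution {
    my ($body) = @_;
    for my $re (@ATTRIBUTION) {
        $body =~ s/^$re\h*\R(?=\R* ?>)//m;
    }
    return $body;
}

# Drop everything from an Outlook-style separator line onwards.
sub strip_outlook_quote {
    my ($body) = @_;
    $body =~ s/^-{2,} ?Original Message ?-{2,}.*//ms;
    return $body;
}

# Guess whether the reply is interleaved: count runs of quoted lines that are
# followed by the author's own text. Two or more such runs (an assumed
# threshold) suggests answers written between the quotes.
sub looks_interleaved {
    my ($body) = @_;
    my ($quoted_runs, $in_quote) = (0, 0);
    for my $line (split /\R/, $body) {
        last if $line eq '-- ';             # stop at the signature delimiter
        if ($line =~ /^ ?>/) {
            $in_quote = 1;
        }
        elsif ($line =~ /\S/) {
            $quoted_runs++ if $in_quote;    # own text after a quoted run
            $in_quote = 0;
        }
    }
    return $quoted_runs >= 2;
}

# Combine the heuristics: keep quotes in interleaved replies, strip otherwise.
sub extract_own_text {
    my ($body) = @_;
    $body = strip_outlook_quote($body);
    unless (looks_interleaved($body)) {
        $body = strip_attribution($body);
        $body =~ s/^ ?>.*\R?//gm;           # quoted lines
    }
    $body =~ s/^-- \R.*//ms;                # signature block
    return $body;
}
```

With this sketch, a top-posted reply loses its attribution line, quoted lines and signature, while a reply that answers point by point between quotes keeps the quoted lines and loses only the signature.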

If you want to turn this into a module, I would call it Email::ExtractBody or Email::ExtractText. I would emphasize in the POD that the module is heuristic and works on a best-effort basis.


Source: https://habr.com/ru/post/1339608/

