Parsing a raw email message, which may be in HTML or various strange encodings, and converting it to plain text, the way that, say, pine would display it

The reason I want to do this is to simplify parsing instructions sent by email to a bot, much as majordomo parses commands such as subscribe and unsubscribe. It turns out there are a lot of crazy formats and details to deal with, such as quoted text, distinguishing between headers and body, etc.

A Perl module would be ideal for this, but solutions in any language are welcome.

+3
4 answers

Python has an email module.

>>> from email.parser import Parser
>>> p = Parser()
>>> msg = p.parsestr("From: me@example.com\nSubject: Hello\n\nDear Sir or Madam...")
>>> msg.get("Subject")
'Hello'
>>> msg.get_payload()
'Dear Sir or Madam...'

For MIME messages, use Python's email module. For HTML, use BeautifulSoup or Tidy + ElementTree.
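A minimal sketch of that combination, using only the Python 3 standard library: the email package picks the text/plain part if there is one and otherwise falls back to text/html, which is then stripped of tags. Here html.parser stands in for BeautifulSoup, and the names html_to_text and to_plain_text are made up for this illustration. It does not handle quoted replies or attachments.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, turning <br> and <p> into line breaks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    # Hypothetical helper: crude tag stripping, enough for short commands.
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks).strip()


def to_plain_text(raw):
    # Hypothetical helper: raw is the full message as a str.
    msg = email.message_from_string(raw, policy=policy.default)
    # Prefer a text/plain part; fall back to text/html.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    # get_content() undoes the transfer encoding (quoted-printable, base64)
    # and decodes the declared charset.
    content = body.get_content()
    if body.get_content_type() == "text/html":
        return html_to_text(content)
    return content.strip()
```

Called on a raw message string, to_plain_text returns just the body text, so a bot can match commands such as "subscribe" against it without caring how the message was encoded.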

+4


MIME HTML

+2
0

See: http://news.ycombinator.com/item?id=666607

Here is my incomplete solution, which does work for my purposes (parsing commands emailed to a bot). I am keeping it here for reference until a better answer comes along.

# Take an email as a big string and turn it into a plain ASCII equivalent.
# TODO: leave any html tags inside of quotes alone.
sub plainify {
  my ($email) = @_;

  # Translate the most common quoted-printable escapes to plain text.
  $email =~ s/=0D=0A/\n/g;   # encoded CRLF
  $email =~ s/=0A/\n/g;      # encoded LF
  $email =~ s/=A0/ /g;       # non-breaking space (Latin-1)
  $email =~ s/=2E/./g;       # period
  $email =~ s/=20/ /g;       # space
  # Soft line breaks ("=" at end of line). Two-character line endings are
  # tried first so the whole line ending is removed, not just the \r.
  $email =~ s/=(?:\r\n|\n\r|\r|\n)//g;

  # Translate html to plain text (or enough of it to parse commands).
  $email =~ s/&nbsp;/ /g;
  $email =~ s/<br\s*\/?>/\n/gi;
  $email =~ s/(<[^>]+>)/\n$1\n/g;   # put each remaining tag on its own line

  return $email;
}
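For comparison, the quoted-printable step can be done for every escape at once instead of one regex per escape. This is a small sketch in Python using the standard quopri module; decode_quoted_printable is a made-up name, and the default charset of utf-8 is an assumption (the =A0 case above suggests Latin-1 for some senders).

```python
import quopri


def decode_quoted_printable(text, charset="utf-8"):
    # Hypothetical helper. Quoted-printable is defined over ASCII, so any
    # stray non-ASCII character in the input is replaced with "?".
    raw = quopri.decodestring(text.encode("ascii", errors="replace"))
    # Bytes that are invalid in the assumed charset become U+FFFD.
    return raw.decode(charset, errors="replace")
```

This decodes =XX escapes and removes soft line breaks in one pass, but it should only be applied to a body that is actually marked quoted-printable, not to the whole raw message.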
-1

Source: https://habr.com/ru/post/1697542/

