Detect if email is essentially text

I am writing an Outlook add-in that saves emails for historical purposes. The Outlook MSG format, unfortunately, is too verbose, even when compressed. As a result, saved MSG files are many times larger than their text equivalents. However, saving all messages as text has obvious drawbacks: attachments, images, and any meaningful formatting are lost.

For most emails this is not a problem, but emails with a certain degree of complex formatting, images, attachments, and so on should be saved in MSG format.

Most user emails are sent as HTML, which makes my algorithm look something like this:

1. If email has attachment(s), save as MSG and be done.
2. If email is stored as text, save as text and be done.
3. If email is not stored as HTML, store as MSG and be done.
4. Decide if the HTML should be converted to text: store it as text if so, store it as MSG if not.
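The four steps above can be sketched as a small decision function. This is only an illustration in Python, not add-in code: `SaveFormat`, `choose_format`, the `body_format` strings, and the `html_is_simple` callback are all hypothetical names, and a real add-in would read the attachment count and body format from the Outlook object model.

```python
from enum import Enum


class SaveFormat(Enum):
    MSG = "msg"
    TEXT = "txt"


def choose_format(has_attachments, body_format, html_is_simple):
    """Return the format an email should be saved in.

    body_format is assumed to be "text", "html", or anything else (e.g. "rtf").
    html_is_simple is a zero-argument callable that makes the step 4 decision;
    it is only called when the body is HTML and there are no attachments.
    """
    if has_attachments:            # step 1
        return SaveFormat.MSG
    if body_format == "text":      # step 2
        return SaveFormat.TEXT
    if body_format != "html":      # step 3
        return SaveFormat.MSG
    # Step 4: the open question, delegated to the caller-supplied test.
    return SaveFormat.TEXT if html_is_simple() else SaveFormat.MSG
```

Passing step 4 in as a callback keeps the open question isolated, so different heuristics can be tried without touching the first three rules.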

This is simple, except for step 4: how can I decide which format an HTML email should be saved in?

1 answer

Idea: calculate the weighted density of HTML tags in a message. Select a threshold based on existing data. Messages with an HTML density above the threshold are saved as MSG; messages with a density below the threshold are saved as plain text.

How do you calculate weighted density? Use an HTML parsing library. Have it parse the document and count the number of each type of HTML tag in the document. That is all you need from the library. Multiply each tag count by its weight and add the results together. Then convert the message to plain text and count the number of characters in it. Divide the weighted tag-count sum by this number, and you have the density.
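A minimal sketch of that calculation, using Python's standard `html.parser` module as the parsing library. The names `TagCounter` and `weighted_density` are made up for this example, and two details are assumptions the answer does not specify: the contents of `<script>` and `<style>` are left out of the plain text, and whitespace is collapsed before counting characters.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags and collects the visible text of a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def weighted_density(html, weights=None, default_weight=1.0):
    """Weighted tag-count sum divided by the plain-text character count."""
    weights = weights or {}
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation in the markup is not counted.
    text = " ".join("".join(parser.text_parts).split())
    weighted_sum = sum(
        weights.get(tag, default_weight) * count
        for tag, count in parser.tag_counts.items()
    )
    # Guard against division by zero for messages with no visible text.
    return weighted_sum / max(len(text), 1)
```

With no weights supplied, every tag counts as 1.0, which gives a plain unweighted tag density as a baseline to compare against.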

How should the density be weighted? According to a table that you create with the importance of each type of HTML tag. I would suggest that losing bold and italics is not so bad. Losing ordered and unordered lists is a little worse, unless bullets and numbers are preserved when messages are converted to plain text. Tables should be weighted high because they are important for formatting. Choose a weight for unrecognized tags as well.
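One way such a table could look, as a sketch only. Every number here is an illustrative guess rather than a measured value, and `BASE_WEIGHTS`, `tag_weight`, and the `lists_survive_conversion` flag are hypothetical names introduced for this example.

```python
# Illustrative weights only: low for inline emphasis, higher for lists,
# highest for tables. The numbers are guesses and should be tuned on real mail.
BASE_WEIGHTS = {
    "b": 0.1, "strong": 0.1, "i": 0.1, "em": 0.1,
    "ul": 1.0, "ol": 1.0, "li": 0.5,
    "table": 5.0, "tr": 1.0, "td": 1.0, "th": 1.0,
}
LIST_TAGS = {"ul", "ol", "li"}


def tag_weight(tag, lists_survive_conversion=False, default=0.5):
    """Look up the weight of a tag, falling back to a default for unknown tags."""
    tag = tag.lower()
    if lists_survive_conversion and tag in LIST_TAGS:
        # If the text converter keeps bullets and numbers, little is lost.
        return 0.1
    return BASE_WEIGHTS.get(tag, default)
```

The default weight matters more than it looks: HTML email often contains many layout tags such as `div` and `span`, so a high default would push almost every message over the threshold.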

How do you choose a threshold? Run the density calculation on a sample of emails. Also check these emails manually to see whether each would be better saved as MSG or as plain text, and record that choice for each email. Use some algorithm on this data to find the boundary value. I think naive Bayes classification would work, but in this case there may be a simpler algorithm. Or an educated guess by a human may be good enough. I think you could make that guess by looking at a scatter plot of the human-chosen format against the weighted HTML tag density and picking the density value that roughly separates the two formats.
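As one example of a "simpler algorithm", the sketch below tries every midpoint between neighbouring density values in the hand-labelled sample and keeps the one that misclassifies the fewest emails. This is a plain exhaustive search, not naive Bayes, and `best_threshold` and the sample data are invented for illustration.

```python
def best_threshold(samples):
    """Find the density cut-off that best separates hand-labelled emails.

    samples is a list of (density, should_be_msg) pairs, where should_be_msg
    is the human judgement. Emails with density above the returned threshold
    would be saved as MSG. Returns (threshold, number_of_misclassified).
    """
    if not samples:
        raise ValueError("need at least one labelled sample")

    densities = sorted({density for density, _ in samples})
    # Candidates: below everything, between each neighbouring pair, above everything.
    candidates = [densities[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(densities, densities[1:])]
    candidates.append(densities[-1] + 1.0)

    def errors(threshold):
        return sum((density > threshold) != should_be_msg
                   for density, should_be_msg in samples)

    best = min(candidates, key=errors)
    return best, errors(best)


# Invented sample: three emails judged fine as text, two judged to need MSG.
labelled = [(0.01, False), (0.02, False), (0.05, False), (0.20, True), (0.30, True)]
threshold, mistakes = best_threshold(labelled)
```

If the two groups overlap heavily in the sample, the error count returned alongside the threshold shows how well density alone can separate them.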


Source: https://habr.com/ru/post/1398312/

