How do I extract a list of email addresses or mailbox strings from text, or check whether a string is a valid email address?

Question

Given some arbitrary text, I would like to extract all email addresses and "mailbox specifiers" (for example, "Fred Smith" <fred@me.com>). I looked at NSDataDetector, but it does not handle email addresses.

ios objective-c macos nsdatadetector

David h Mar 21 '13 at 15:19

1 answer

David h · Answer 1 · 2013-03-21T15:19:20+0000

The way to approach this is to find a really good algorithm that detects as many valid addresses as possible and rejects invalid ones. The best solution would probably be a parser built with lex and yacc, but reasonable solutions exist using regular expressions.

See this site for a list of validated regular expressions as well as a more detailed discussion of the problem and possible solutions.

The regular expressions shown on the above site are formatted for PHP: they have leading and trailing "/" delimiters, as well as trailing "flags" that indicate case insensitivity and so on (see this site for more information), so you must strip these before using an expression in an Objective-C project. In addition, any anchors ("^" and "$") must be removed, since we want to find several addresses inside a larger text rather than match exactly one.
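The clean-up described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `BarePatternFromPHPLiteral` is hypothetical, and the code handles only the simple case of a single leading "^" and a single trailing "$"; anchors buried inside alternations would still need to be removed by hand.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: converts a PHP-style literal such as "/^foo$/i" into a
// bare pattern ("foo") that can be handed to NSRegularExpression.
static NSString *BarePatternFromPHPLiteral(NSString *literal) {
    NSString *pattern = [literal stringByTrimmingCharactersInSet:
                         [NSCharacterSet whitespaceAndNewlineCharacterSet]];

    // Drop the leading "/" and everything from the last "/" onward (the flags).
    NSRange lastSlash = [pattern rangeOfString:@"/" options:NSBackwardsSearch];
    if ([pattern hasPrefix:@"/"] &&
        lastSlash.location != NSNotFound && lastSlash.location > 0) {
        pattern = [pattern substringWithRange:NSMakeRange(1, lastSlash.location - 1)];
    }

    // Drop a leading "^" anchor.
    if ([pattern hasPrefix:@"^"]) {
        pattern = [pattern substringFromIndex:1];
    }
    // Drop a trailing "$" anchor, unless it is preceded by a backslash
    // (a simplification: "\\$" after an escaped backslash is not detected).
    if ([pattern hasSuffix:@"$"] && ![pattern hasSuffix:@"\\$"]) {
        pattern = [pattern substringToIndex:pattern.length - 1];
    }
    return pattern;
}
```

The flags are discarded rather than translated; a case-insensitivity flag, for instance, would be expressed through the options passed when the NSRegularExpression object is created.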

NSRegularExpression is the class to use here. What I found useful is to save the regular expression in a text file in the project, so you don't have to worry about escaping all the backslashes and quotes. The code then reads the expression into a string and creates the object as follows:

 // self.regex holds the base name of the bundled .txt file containing the pattern
 NSString *fullPath = [[NSBundle mainBundle] pathForResource:self.regex ofType:@"txt"];
 NSString *pattern = [NSString stringWithContentsOfFile:fullPath encoding:NSUTF8StringEncoding error:NULL];

 __autoreleasing NSError *error = nil;
 // some patterns may not need NSRegularExpressionCaseInsensitive
 reg = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
 assert(reg && !error);

Once you have an initialized expression, you use it to return a list of matches (NSTextCheckingResult objects), each of whose range covers one address:

 NSArray *ret = [reg matchesInString:str options:0 range:NSMakeRange(0, [str length])];
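The returned array holds NSTextCheckingResult objects rather than strings. A minimal sketch of turning them into the matched substrings follows; the helper name `MatchedSubstrings` is hypothetical and not part of the original code.

```objective-c
#import <Foundation/Foundation.h>

// Returns the substrings of `text` matched by `regex`, in order of appearance.
static NSArray<NSString *> *MatchedSubstrings(NSRegularExpression *regex,
                                              NSString *text) {
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        // Each result's range identifies one matched address inside `text`.
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

With the variables from the snippets above, this would be called as `MatchedSubstrings(reg, str)`.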

However, we know that every email address contains an "@", so it is probably worth checking that the string has at least one before processing it. In addition, since linefeeds and/or carriage returns may appear in the text, you may want to deal with those first. It is probably best to remove them entirely, since some email program might have split the string at some point inside an address.
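Both pre-checks fit in one small function. This is a sketch only, and the name `PreparedTextForAddressScan` is hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Returns nil when `text` cannot contain an address (it has no "@");
// otherwise returns a copy of `text` with every CR and LF removed.
static NSString *PreparedTextForAddressScan(NSString *text) {
    if ([text rangeOfString:@"@"].location == NSNotFound) {
        return nil;
    }
    NSCharacterSet *lineBreaks =
        [NSCharacterSet characterSetWithCharactersInString:@"\r\n"];
    // Split on CR/LF and rejoin with nothing in between.
    NSArray<NSString *> *pieces =
        [text componentsSeparatedByCharactersInSet:lineBreaks];
    return [pieces componentsJoinedByString:@""];
}
```

One trade-off to keep in mind: deleting line breaks also glues together words that were separated only by a newline, so an address at the end of one line may run into the first word of the next. Whether that matters depends on the input.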

Once you have a list of address ranges, the task is mostly done, if all you need is the addresses. Often, however, addresses appear in the "mailbox specifier" format, where a name precedes the address and the address is wrapped in '<' and '>'. This format is described in RFC 5322, section 3.4.

To recover the name from a mailbox specifier, check whether the address is wrapped in '<' and '>'; if so, look at the string preceding the '<', skipping whitespace until you reach the first non-space character. Most names are wrapped in double quotes (common practice), but they can also be bare alphanumeric strings that use a backslash to include a space or other special character.
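The quoted-name case can be sketched as follows. The function name `DisplayNameBeforeAddress` is hypothetical, and the sketch makes simplifying assumptions: it handles only names in double quotes, does not treat `\"` inside a name specially, and returns nil for bare (unquoted) names.

```objective-c
#import <Foundation/Foundation.h>

// Given `text` and the range of an address found in it, returns the quoted
// display name that precedes "<address>", or nil if there is none.
static NSString *DisplayNameBeforeAddress(NSString *text, NSRange addressRange) {
    NSCharacterSet *space = [NSCharacterSet whitespaceCharacterSet];
    NSUInteger start = addressRange.location;
    NSUInteger end = NSMaxRange(addressRange);

    // Tolerate whitespace inside the brackets, as in "< fred@me.com >".
    while (start > 0 && [space characterIsMember:[text characterAtIndex:start - 1]]) {
        start--;
    }
    while (end < text.length && [space characterIsMember:[text characterAtIndex:end]]) {
        end++;
    }

    // The address must be wrapped in '<' and '>'.
    if (start == 0 || [text characterAtIndex:start - 1] != '<') return nil;
    if (end >= text.length || [text characterAtIndex:end] != '>') return nil;

    // Walk left from '<', skipping whitespace.
    NSUInteger nameEnd = start - 1;  // index of '<'
    while (nameEnd > 0 && [space characterIsMember:[text characterAtIndex:nameEnd - 1]]) {
        nameEnd--;
    }
    if (nameEnd == 0 || [text characterAtIndex:nameEnd - 1] != '"') {
        return nil;  // no name, or a bare name (not handled in this sketch)
    }

    // Find the opening quote that pairs with the closing quote at nameEnd - 1.
    NSRange open = [text rangeOfString:@"\""
                               options:NSBackwardsSearch
                                 range:NSMakeRange(0, nameEnd - 1)];
    if (open.location == NSNotFound) return nil;
    return [text substringWithRange:
            NSMakeRange(open.location + 1, nameEnd - open.location - 2)];
}
```

The `addressRange` argument is meant to be the `range` of one of the match results returned by the regular expression.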

The same technique can be used for real-time validation, say, to enable a submit button once a text string becomes a valid email address. In this case, you evaluate the string each time the user changes it and enable or disable the submit button accordingly.
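For validation the whole string has to be one address, so a match somewhere inside it is not enough. One way to check this with an unanchored expression is to compare the first match's range with the full string range. This is a sketch, and the function name `IsCompleteAddress` is hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Returns YES only when the first match of `regex` covers all of `candidate`.
static BOOL IsCompleteAddress(NSRegularExpression *regex, NSString *candidate) {
    NSRange whole = NSMakeRange(0, candidate.length);
    // Yields {NSNotFound, 0} when there is no match at all.
    NSRange first = [regex rangeOfFirstMatchInString:candidate
                                             options:0
                                               range:whole];
    return candidate.length > 0 && NSEqualRanges(first, whole);
}
```

The result would typically be assigned to the submit button's enabled state from whatever handler fires when the text field's contents change.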

If all this seems like a lot of code to write, you can grab an open source project on GitHub.

EDIT1: for a faster, but less rigorous method, see CodaFi's comment.

EDIT2: it turns out that the contents of a mailto: URL can be quite complex; the GitHub project handles only the simplest ones and does not decode the address. This will be addressed in a future update.

EDIT3: the project has been updated to fully handle mailto: URLs, and returns the to, cc, bcc, subject and body fields, all URL-decoded.
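For readers who only need the basic fields, the following sketch shows one way to pull them out with NSURLComponents. It is not the GitHub project's implementation; the function name `FieldsFromMailtoString` is hypothetical, and the sketch assumes each field appears at most once (a "to" query field would overwrite the addressee taken from the URL path).

```objective-c
#import <Foundation/Foundation.h>

// Returns the percent-decoded fields of a mailto: URL keyed by lowercase
// field name; the addressee from the URL path is stored under "to".
// Returns nil for strings that are not mailto: URLs.
static NSDictionary<NSString *, NSString *> *FieldsFromMailtoString(NSString *urlString) {
    NSURLComponents *components = [NSURLComponents componentsWithString:urlString];
    if (![components.scheme.lowercaseString isEqualToString:@"mailto"]) {
        return nil;
    }
    NSMutableDictionary<NSString *, NSString *> *fields =
        [NSMutableDictionary dictionary];
    // The part between "mailto:" and "?" is exposed as the (decoded) path.
    if (components.path.length > 0) {
        fields[@"to"] = components.path;
    }
    // cc, bcc, subject, body and so on arrive as decoded query items.
    for (NSURLQueryItem *item in components.queryItems) {
        if (item.value != nil) {
            fields[item.name.lowercaseString] = item.value;
        }
    }
    return fields;
}
```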
