I need to find all the keywords in a large NSString (for parsing source code). My current implementation is too slow, and I'm not sure how to improve it.
I use NSRegularExpression on the assumption that it is more optimized than anything I could write, but performance is slower than I expected. Does anyone know a faster way to implement this?
The target string will contain UTF-8 characters, but the keywords themselves will always be plain alphanumeric ASCII. I guess this could be used to optimize things quite a bit?
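To make the ASCII idea concrete, here is a rough sketch in plain C (which also compiles inside an Objective-C file). I have not measured it, so treat it as an assumption, not a proven speed-up. It walks the UTF-8 bytes, splits them into words, and looks each all-ASCII word up in a sorted table with `bsearch`. Bytes >= 0x80 only ever occur inside multi-byte UTF-8 sequences, so they are treated as word characters and a word like "éat" never matches "at". The names `kKeywords`, `is_word_byte` and `find_next_keyword` are made up for this illustration, and the keyword table is a shortened sample.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sample table: lowercase and sorted, as bsearch() requires. */
static const char *const kKeywords[] = {
    "amet", "at", "dolor", "et", "id", "ipsum", "leo"
};
enum {
    kKeywordCount = sizeof kKeywords / sizeof kKeywords[0],
    kMaxKeywordLen = 16 /* longer words cannot be keywords in this sample */
};

static int compare_keyword(const void *key, const void *entry) {
    return strcmp((const char *)key, *(const char *const *)entry);
}

/* Word bytes: ASCII letters, digits, '_', and any byte of a multi-byte
   UTF-8 sequence (>= 0x80). */
static int is_word_byte(unsigned char c) {
    return c >= 0x80 || isalnum(c) || c == '_';
}

/* Finds the next keyword in utf8[start, len), case-insensitively.
   Returns 1 and writes the match's BYTE offset and length, or returns 0. */
static int find_next_keyword(const char *utf8, size_t len, size_t start,
                             size_t *out_offset, size_t *out_length) {
    size_t i = start;

    /* If the search starts in the middle of a word, skip the rest of it. */
    if (i > 0 && i < len && is_word_byte((unsigned char)utf8[i - 1])) {
        while (i < len && is_word_byte((unsigned char)utf8[i])) i++;
    }

    while (i < len) {
        /* Skip separators, then collect one word. */
        while (i < len && !is_word_byte((unsigned char)utf8[i])) i++;
        size_t word_start = i;
        int ascii_only = 1;
        while (i < len && is_word_byte((unsigned char)utf8[i])) {
            if ((unsigned char)utf8[i] >= 0x80) ascii_only = 0;
            i++;
        }
        size_t word_len = i - word_start;
        if (word_len == 0 || word_len >= kMaxKeywordLen || !ascii_only) continue;

        /* Lowercase a copy so the lookup is case-insensitive. */
        char lowered[kMaxKeywordLen];
        for (size_t k = 0; k < word_len; k++) {
            lowered[k] = (char)tolower((unsigned char)utf8[word_start + k]);
        }
        lowered[word_len] = '\0';

        if (bsearch(lowered, kKeywords, kKeywordCount,
                    sizeof kKeywords[0], compare_keyword) != NULL) {
            *out_offset = word_start;
            *out_length = word_len;
            return 1;
        }
    }
    return 0;
}
```

The bytes could come from `-[NSString UTF8String]`. The offsets this returns are byte offsets into the UTF-8 buffer, not NSString (UTF-16) indexes, so they would still have to be mapped back to an NSRange.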
Here is my current code:

    @implementation MyClass

    // I'm storing the regular expression in a static variable, since it never
    // changes and I need to re-use it often.
    static NSRegularExpression *keywordsExpression;

    + (void)initialize
    {
        [super initialize];

        NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet",
                             @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur",
                             @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus",
                             @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et",
                             @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet",
                             @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil];

        // \b(accumsan|adipiscing|aliquam|…)\b
        NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b",
                             [keywords componentsJoinedByString:@"|"]];
        keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                        options:NSRegularExpressionCaseInsensitive
                                                                          error:NULL];
    }

    // ...

    @end
EDIT: In response to @CodeBrickie, I updated my code to run the regular expression search once over the entire string and cache the matched ranges in an NSIndexSet. Every later call then looks up keyword ranges in the NSIndexSet instead of searching the string again. The result is about an order of magnitude faster:
    @implementation MyClass

    static NSRegularExpression *keywordsExpression;
    static NSIndexSet *keywordIndexes = nil;

    + (void)initialize
    {
        [super initialize];

        NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet",
                             @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur",
                             @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus",
                             @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et",
                             @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet",
                             @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil];

        // \b(accumsan|adipiscing|aliquam|…)\b
        NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b",
                             [keywords componentsJoinedByString:@"|"]];
        keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                        options:NSRegularExpressionCaseInsensitive
                                                                          error:NULL];
    }

    // Runs the regular expression once and caches every matched index.
    - (void)prepareToFindKeywordsInString:(NSString *)string
    {
        NSMutableIndexSet *keywordIndexesMutable = [NSMutableIndexSet indexSet];

        [keywordsExpression enumerateMatchesInString:string
                                             options:0
                                               range:NSMakeRange(0, string.length)
                                          usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
            [keywordIndexesMutable addIndexesInRange:match.range];
        }];

        keywordIndexes = [keywordIndexesMutable copy];
    }

    // Returns the first run of cached keyword indexes inside `range`,
    // or {NSNotFound, 0} if there is none.
    - (NSRange)findNextKeyword:(NSString *)string inRange:(NSRange)range
    {
        // Only scan the requested range, clamped to the string's length.
        NSUInteger foundKeywordMax = MIN(NSMaxRange(range), string.length);
        NSRange foundKeywordRange = NSMakeRange(NSNotFound, 0);

        for (NSUInteger index = range.location; index < foundKeywordMax; index++) {
            if ([keywordIndexes containsIndex:index]) {
                if (foundKeywordRange.location == NSNotFound) {
                    // First index of a keyword: start a new range.
                    foundKeywordRange.location = index;
                    foundKeywordRange.length = 1;
                } else {
                    foundKeywordRange.length++;
                }
            } else if (foundKeywordRange.location != NSNotFound) {
                // Just walked past the end of the keyword.
                break;
            }
        }

        return foundKeywordRange;
    }

    @end
This seems to work well, and performance is in the range where I want it. I would like to wait a little longer to see if there are any more suggestions before accepting this.
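One further idea I have not measured: the loop in `findNextKeyword:inRange:` still tests every index one at a time. If the match ranges were cached as a sorted array instead, the next keyword could be found with a binary search. Below is a minimal sketch in plain C, assuming the ranges are sorted and never overlap (regex matches are produced in that order). `KeywordRange`, `first_range_ending_after` and `next_keyword_in` are hypothetical names for this illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for NSRange, so the sketch is self-contained. */
typedef struct {
    size_t location;
    size_t length;
} KeywordRange;

/* Binary search: index of the first cached range that ends after `from`,
   or `count` if every range ends at or before `from`. */
static size_t first_range_ending_after(const KeywordRange *ranges,
                                       size_t count, size_t from) {
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ranges[mid].location + ranges[mid].length <= from) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Returns the first keyword range intersecting [from, to), clipped to that
   window. A result with length 0 means no keyword was found. */
static KeywordRange next_keyword_in(const KeywordRange *ranges, size_t count,
                                    size_t from, size_t to) {
    KeywordRange none = { (size_t)-1, 0 };
    size_t i = first_range_ending_after(ranges, count, from);
    if (i == count || ranges[i].location >= to) {
        return none;
    }
    size_t start = ranges[i].location > from ? ranges[i].location : from;
    size_t end = ranges[i].location + ranges[i].length;
    if (end > to) {
        end = to;
    }
    KeywordRange found = { start, end - start };
    return found;
}
```

Clipping to the window mirrors what the index-by-index loop returns when a search starts or stops in the middle of a keyword. Staying with NSIndexSet, `indexGreaterThanOrEqualToIndex:` should also let the loop jump straight to the next keyword index instead of testing each one.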