Effectively find the first of many keywords in a large NSString

I need to find all the keywords in a large NSString (for the parsing source code), and my current implementation is too slow, but I'm not sure how to improve it.

I use NSRegularExpression on the assumption that it is more optimized than anything I could write, but performance is slower than I expected. Does anyone know a faster way to implement this?

The target string will contain utf-8 characters, but the keywords themselves will always be plain alphanumeric ascii. I guess this can be used to optimize things quite a bit?

 @implementation MyClass // i'm storing the regular expression in a static variable, since it never changes and I need to re-use it often static NSRegularExpression *keywordsExpression; + (void)initialize { [super initialize]; NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet", @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur", @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus", @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et", @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet", @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil]; NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b", [keywords componentsJoinedByString:@"|"]; // \b(accumsan|adipiscing|aliquam|โ€ฆ)\b keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern] options:NSRegularExpressionCaseInsensitive error:NULL]; } // this method will be called in quick succession, I need it to be a able to run tens // of thousands of times per second. The target string is big (50KB or so), but the // search range is short, rarely more than 30 characters - (NSRange)findNextKeyword:(NSString *)string inRange:(NSRange)range { return [keywordsExpression rangeOfFirstMatchInString:string options:0 range:range]; } @end 

EDIT . In response to @CodeBrickie, I updated my code to perform a regular expression search once in the entire string and keep matches with the cached NSIndexSet , and then every time the method calls it searching in NSIndexSet for keyword ranges instead of searching for a string. The result is about an order of magnitude higher:

 @implementation MyClass static NSRegularExpression *keywordsExpression; static NSIndexSet *keywordIndexes = nil; + (void)initialize { [super initialize]; NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet", @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur", @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus", @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et", @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet", @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil]; NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b", [keywords componentsJoinedByString:@"|"]; // \b(accumsan|adipiscing|aliquam|โ€ฆ)\b keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern] options:NSRegularExpressionCaseInsensitive error:NULL]; } - (void)prepareToFindKeywordsInString:(NSString *)string { NSMutableIndexSet *keywordIndexesMutable = [[NSIndexSet indexSet] mutableCopy]; [keywordsExpression enumerateMatchesInString:string options:0 range:NSMakeRange(0, string.length) usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop){ [keywordIndexesMutable addIndexesInRange:match.range]; }]; keywordIndexes = [keywordIndexesMutable copy]; } - (NSRange)findNextKeyword:(NSString *)string inRange:(NSRange)range { NSUInteger foundKeywordMax = (foundCharacterSetRange.location == NSNotFound) ? string.length : foundCharacterSetRange.location; NSRange foundKeywordRange = NSMakeRange(NSNotFound, 0); for (NSUInteger index = startingAt; index < foundKeywordMax; index++) { if ([keywordIndexes containsIndex:index]) { if (foundKeywordRange.location == NSNotFound) { foundKeywordRange.location = index; foundKeywordRange.length = 1; } else { foundKeywordRange.length++; } } else { if (foundKeywordRange.location != NSNotFound) { break; } } } return foundKeywordRange; } @end 

This seems to work well, and performance is in the range where I want it. I would like to wait a little longer to see if there are any more suggestions before accepting this.

+4
source share
3 answers

Since you need keywords along with their ranges, I would use enumerateMatchesInString:options:range:usingBlock: and implement a block that adds the keyword as a key and range as a value in NSMutableDictionary .

Thus, you have only one call for the entire line and all keywords with their ranges in the dictionary after this call.

+3
source

Instead of collecting a regular expression to match all keywords, I suggest you use a very basic regular expression to match any word, and then look for a matching word in the dictionary containing your keywords; if this word is not there, ignore it.

You can adapt the regular expression to what you know about keywords for maximum effectiveness. For example, if you know that you are looking for words from three to twelve ASCII letters in lower case, you can use @"\\b[az]{3,12}+\\b" . Compared to your monster regex with dozens of alternatives.

I have used this technique with great success in the highlighter syntax project. It was in Java, but a quick look at NSRegularExpression documents NSRegularExpression about a surprisingly similar set of functions.

+1
source

What happens if you switch the search to some NSLiteral search rather than case insensitive?

Of course you need a case insensitive, but if it is faster, it will give you ideas.

Maximum speed would be much more efficient and include the bottom line of c and strstr (), I would put.

0
source

Source: https://habr.com/ru/post/1381863/


All Articles