Efficient string search

I have about 500-1000 objects, each of which has a name and a string content field. To find out how these objects are connected, every name has to be searched for in every content field. Objects can be edited, so I may have to rebuild the connections for an edited object by searching for its name again in all content fields.
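To make the bookkeeping concrete, here is a minimal sketch of what I mean by rebuilding connections after an edit. Everything in it is an assumption for illustration: the class name `ConnectionIndex`, the in-memory maps, and the `mentions` predicate, which stands in for whichever matching technique turns out to be suitable.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

// Hypothetical sketch: tracks which objects mention which names.
class ConnectionIndex {
    // Object name -> its content.
    private final Map<String, String> contentByName = new HashMap<>();
    // Object name -> names of the objects whose content mentions it.
    private final Map<String, Set<String>> mentionedIn = new HashMap<>();
    // (content, name) -> true if the content mentions the name (pluggable matcher).
    private final BiPredicate<String, String> mentions;

    ConnectionIndex(BiPredicate<String, String> mentions) {
        this.mentions = mentions;
    }

    // Adds a new object or replaces the content of an edited one.
    void put(String name, String content) {
        contentByName.put(name, content);

        // Outgoing links: forget what this object used to mention, then rescan its content.
        for (Set<String> sources : mentionedIn.values()) {
            sources.remove(name);
        }
        for (String other : contentByName.keySet()) {
            if (!other.equals(name) && mentions.test(content, other)) {
                mentionedIn.computeIfAbsent(other, k -> new TreeSet<>()).add(name);
            }
        }

        // Incoming links: search for this object's name in all other contents.
        Set<String> incoming = mentionedIn.computeIfAbsent(name, k -> new TreeSet<>());
        incoming.clear();
        for (Map.Entry<String, String> e : contentByName.entrySet()) {
            if (!e.getKey().equals(name) && mentions.test(e.getValue(), name)) {
                incoming.add(e.getKey());
            }
        }
    }

    // Names of the objects whose content mentions the given name.
    Set<String> mentionedIn(String name) {
        return mentionedIn.getOrDefault(name, Collections.emptySet());
    }
}
```

With this layout, one edit costs roughly two matcher calls per stored object, so the open question is mainly how the matcher itself should work.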

Exact string matching (.indexOf or .contains) is not an option, as there are additional rules:

  • names can consist of several words and predefined special characters (_, /, - ,, ...)
  • names can be surrounded by special characters and must still be recognized
  • names can end with one of several predefined endings (s, es, ...) and must still be recognized

Example names: small apple juice, apple, app, _n, n

Example content: apps are like fine apple juice_n

This content should match all of the example names.

Edit (clarification of rule 2): a name must not match inside something like "appxxy" or other gibberish; it has to be separated from neighboring words by spaces (or special characters).
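To pin the rules down, here is a minimal sketch of how I currently read them, written as one regular expression per name. It is only an illustration under stated assumptions: the class `NameMatcher` and its methods are hypothetical, the ending list is limited to "s" and "es", every character that is not a letter or digit is treated as a separator, and names are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the matching rules, one compiled pattern per name.
class NameMatcher {
    // Assumed ending list; only "s" and "es" are given explicitly.
    private static final String ENDINGS = "(?:es|s)?";
    // Assumption: letters and digits form words, anything else separates them.
    private static final String WORD_CHAR = "[\\p{L}\\p{N}]";

    static Pattern compile(String name) {
        // A name that starts with a special character (e.g. "_n") may directly
        // follow a word, so the left boundary applies only to names that start
        // with a letter or digit.
        String left = Character.isLetterOrDigit(name.charAt(0))
                ? "(?<!" + WORD_CHAR + ")"
                : "";
        // After the name and its optional ending, no further word character may follow.
        String right = "(?!" + WORD_CHAR + ")";
        // Pattern.quote keeps characters such as "/" or "-" in the name literal.
        return Pattern.compile(left + Pattern.quote(name) + ENDINGS + right);
    }

    // Returns the names that occur in the content, in the order they were given.
    static List<String> findNames(List<String> names, String content) {
        List<String> found = new ArrayList<>();
        for (String name : names) {
            if (compile(name).matcher(content).find()) {
                found.add(name);
            }
        }
        return found;
    }
}
```

Under these assumptions, `findNames` finds `apple juice`, `apple`, `app`, `_n` and `n` in the content `apps are like fine apple juice_n`, and finds none of them in `appxxy`. Compiling one pattern per name and scanning every content field is the naive approach; whether that is fast enough, or whether something like Aho-Corasick is needed, is exactly what I am unsure about.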

I looked at various possible solutions, such as Aho-Corasick, using regex, a search string, a regular expression pattern, Apache Lucene, or using a custom Scanner with WordDetector. However, I cannot tell which of them suits my purpose and performs best, because I am not very experienced in programming.


Source: https://habr.com/ru/post/1532010/
