At a high level, it seems that phrases consisting of just nouns or adjective-noun combinations would give much better results.
Examples:
- "Blue shirt"
- "Happy people"
- "Book"
First of all, this problem can be as complex as you want it to be. For third-party reading / solutions, I came across:
If you need 100% accuracy, I would not write such a tool myself.
However, if the problem area is limited ...
I would start by throwing out conjunctions, prepositions, abbreviations, state-of-being verbs, etc. This is a fairly short list in English (and looks a lot like the stop words suggested by @HappyTimeGopher).
After that, you can build a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives, and compare each word in the raw phrases against it. Anything that does not appear in the dictionary, or does not occur in the correct sequence, can be thrown out or ranked lower.
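To make the stop list plus dictionary idea concrete, here is a minimal sketch. The word lists, `strip_stop_words` and `known_fraction` are made-up names for illustration, and the tiny sets stand in for whatever real stop list and dictionary you end up with.

```python
# Hypothetical, tiny word lists for illustration only; a real stop list
# and dictionary would be far larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "is", "are", "was"}

# A dict gives the "indexed structure": lookups are O(1) on average.
DICTIONARY = {
    "blue": "ADJ", "happy": "ADJ", "greatest": "ADJ",
    "shirt": "NOUN", "people": "NOUN", "book": "NOUN", "hits": "NOUN",
}

def strip_stop_words(phrase):
    """Lowercase the phrase, split on whitespace, and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]

def known_fraction(phrase):
    """Share of the remaining words that appear in the dictionary (0.0 to 1.0)."""
    words = strip_stop_words(phrase)
    if not words:
        return 0.0
    known = sum(1 for w in words if w in DICTIONARY)
    return known / len(words)
```

A phrase with a low `known_fraction` could be thrown out or simply ranked below the others, depending on how strict you need to be.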
This can be useful if you were given 100 input values and had to choose the best one. Finding a value in the dictionary would suggest that the word / phrase is probably good.
I have generated such a dictionary automatically by building a raw index from thousands of documents related to a specific industry vertical. I then spent several hours with SQL and Excel fixing the problems that are easy for a human to spot. The resulting list was not perfect, but it eliminated most of the frankly dumb / meaningless terminology.
As you might have guessed, none of this is foolproof, although checking for an adjective-noun sequence would help somewhat. Consider the case of "Greatest Hits" compared to "Car Hits [Wall]".
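A rough sketch of such a sequence check, assuming the pattern you want is "zero or more adjectives followed by a single noun". The `POS` table and `matches_adj_noun` are hypothetical names, and the table is hand-made for this example only.

```python
# Hypothetical part-of-speech table; a word may carry more than one tag.
POS = {
    "greatest": {"ADJ"}, "blue": {"ADJ"}, "happy": {"ADJ"},
    "hits": {"NOUN", "VERB"}, "car": {"NOUN"}, "wall": {"NOUN"},
    "shirt": {"NOUN"}, "book": {"NOUN"},
}

def matches_adj_noun(phrase):
    """True if every word is known, the last word can be a noun,
    and every word before it can be an adjective."""
    words = phrase.lower().split()
    if not words or any(w not in POS for w in words):
        return False
    *modifiers, head = words
    return "NOUN" in POS[head] and all("ADJ" in POS[w] for w in modifiers)

print(matches_adj_noun("Greatest Hits"))   # True: adjective + noun
print(matches_adj_noun("Car Hits Wall"))   # False: "car" is not an adjective
```

Note the trade-off: a pattern this strict also rejects legitimate noun-noun compounds, so you may prefer to rank such phrases lower rather than discard them.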
Proper nouns (for example, people's names) do not work well with the dictionary approach, since it is probably not feasible to build a dictionary of every given name / surname variant.
Summarizing:
- use a stop list
- generate a dictionary of words, classifying each by its part(s) of speech
- run the raw phrases through the dictionary and the stop list
- (optional) rank how confident you are in the match
- if necessary, accept phrases that do not violate any known pattern (this could handle many of the proper nouns)
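
Putting the steps above together, one possible shape for the whole thing is sketched below. Everything here is an assumption for illustration: the word lists, the `confidence` and `rank` names, and the scoring formula (0.0 for a pattern violation, otherwise 0.5 plus half the fraction of known words) are arbitrary choices, not a tested recipe.

```python
# Hypothetical stop list and part-of-speech dictionary, kept tiny on purpose.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "is", "are"}
POS = {
    "blue": "ADJ", "happy": "ADJ",
    "shirt": "NOUN", "people": "NOUN", "book": "NOUN",
    "runs": "VERB",
}

def confidence(phrase):
    """Score a phrase from 0.0 (rejected) to 1.0 (fully recognised)."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    if not words:
        return 0.0
    tags = [POS.get(w) for w in words]  # None means "not in the dictionary"
    # Known words must fit "adjectives, then one noun"; unknown words are
    # allowed anywhere, which lets proper nouns through.
    *modifier_tags, head_tag = tags
    if head_tag not in (None, "NOUN"):
        return 0.0
    if any(t not in (None, "ADJ") for t in modifier_tags):
        return 0.0
    known = sum(1 for t in tags if t is not None)
    return 0.5 + 0.5 * known / len(tags)

def rank(phrases):
    """Order phrases from most to least confident (ties keep input order)."""
    return sorted(phrases, key=confidence, reverse=True)

print(rank(["John Smith", "shirt blue", "The blue shirt"]))
# ['The blue shirt', 'John Smith', 'shirt blue']
```

With these toy lists, "John Smith" gets a middling score because nothing in it contradicts the pattern, while "shirt blue" is rejected because known words appear in the wrong order.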