Filtering meaningless phrases

I have an algorithm (which I cannot change) that outputs a list of phrases. These phrases are meant to be "themes." However, some of them are meaningless on their own. Take this list:

is the fear freesat are more likely to first sight an hour of sue apple depression and itunes 


How can I filter out phrases that do not make sense on their own in order to leave the list as follows?

 freesat first sight sue apple itunes 


This will apply to phrase sets in many languages, but English is a priority.

+4
4 answers

It should be grammatically acceptable in the sense that it cannot rely on other words in the original sentence from which it was extracted; for example, it cannot end with "and".

Although this is not a well-defined question yet, it looks like you want some kind of grammar check. I suggest applying a part-of-speech (POS) tagger to each phrase, compiling a list of acceptable POS tag patterns (for example, anything that ends in a preposition would be unacceptable), and using that list to filter your input.
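A minimal sketch of that idea, under loud assumptions: `TOY_TAGS` is a hand-made lexicon standing in for a real POS tagger, unknown words are treated as nouns, the tag pattern is just one plausible choice, and the phrase boundaries are my guess, since the list in the question is run together.

```python
import re

# Hypothetical mini-lexicon standing in for a real POS tagger (assumption).
TOY_TAGS = {
    "is": "VERB", "are": "VERB", "sue": "VERB",
    "the": "DET", "an": "DET",
    "of": "ADP", "to": "ADP",
    "and": "CONJ",
    "more": "ADV", "likely": "ADJ", "first": "ADJ",
    "fear": "NOUN", "sight": "NOUN", "hour": "NOUN", "depression": "NOUN",
}

# One acceptable shape: optional verb, any adjectives, then one or more nouns.
ACCEPTABLE = re.compile(r"(VERB )?(ADJ )*NOUN( NOUN)*")


def tag(phrase):
    """Tag each word; words missing from the lexicon default to NOUN."""
    return [TOY_TAGS.get(word, "NOUN") for word in phrase.lower().split()]


def is_acceptable(phrase):
    """True if the whole tag sequence matches the acceptable pattern."""
    return ACCEPTABLE.fullmatch(" ".join(tag(phrase))) is not None


# Assumed phrase boundaries for the example in the question.
phrases = ["is the fear", "freesat", "are more likely to", "first sight",
           "an hour of", "sue apple", "depression and", "itunes"]
kept = [p for p in phrases if is_acceptable(p)]
print(kept)
```

With a real tagger, only `tag` would need to change; the pattern list is where most of the tuning would happen.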

+3

At a high level, it seems that phrases consisting only of nouns, or of adjective-noun combinations, would give much better results.

Examples:

  • "Blue shirt"
  • "Happy people"
  • "Book"

First of all, this problem can be as complex as you want it to be. For third-party reading and solutions, I came across:

If you need 100% accuracy, I would not write such a tool myself.

However, if the problem area is limited ...

I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stop words suggested by @HappyTimeGopher).

After that, you can build a dictionary (stored in an indexed structure, of course) of all acceptable nouns and adjectives, and compare each word in the unprocessed phrases against it. Any phrase whose words are not in the dictionary, or do not occur in the correct sequence, can be thrown away or ranked lower.

This can be useful if you are given 100 input values and have to choose the best one. Finding the values in the dictionary would mean that the word or phrase is probably good.

I have auto-generated such a dictionary by building a raw index from thousands of documents specific to an industry vertical. I then spent several hours in SQL and Excel fixing problems that a human could easily spot. The resulting list was not perfect, but it eliminated most of the blatantly dumb or meaningless terminology.
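As a rough sketch of the dictionary lookup and ranking, assuming a tiny hand-written `DICTIONARY` (hypothetical; a real one would be generated from a corpus) and a score that is simply the fraction of words found in it:

```python
# Hypothetical dictionary of acceptable words and their parts of speech.
DICTIONARY = {
    "blue": {"ADJ"}, "happy": {"ADJ"}, "greatest": {"ADJ"},
    "shirt": {"NOUN"}, "people": {"NOUN"}, "car": {"NOUN"},
    "book": {"NOUN", "VERB"}, "hits": {"NOUN", "VERB"},
}


def score(phrase):
    """Fraction of the phrase's words that are known nouns or adjectives."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    known = sum(1 for w in words if DICTIONARY.get(w, set()) & {"NOUN", "ADJ"})
    return known / len(words)


def rank(phrases):
    """Best-scoring phrases first; ties keep their input order."""
    return sorted(phrases, key=score, reverse=True)


candidates = ["is the", "blue and", "Blue shirt", "Book"]
for phrase in rank(candidates):
    print(f"{score(phrase):.2f}  {phrase}")
```

Ranking rather than hard filtering leaves room to pick the top few candidates when nothing scores perfectly. Note that this score ignores word order, so it would not separate "Greatest Hits" from "Car Hits".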

As you might have guessed, none of this is reliable, although checking for an adjective-noun sequence would help somewhat. Consider the case of "Greatest Hits" compared to "Car Hits [Wall]".

Proper nouns (for example, names of people) do not work well with the dictionary approach, since it is probably not feasible to build a dictionary of all variants of given names and surnames.

Summarizing:

  • use a stop list
  • generate a dictionary of words, classifying each by its part(s) of speech
  • run the raw phrases through the dictionary and the stop words
  • (optional) rank how confident you are in the match
  • if necessary, accept phrases that do not violate known patterns (this could handle many of your proper nouns)
+2

If you have access to the text from which these phrases were generated, it may be easier to simply create your own tags.

Otherwise, I would simply remove every phrase that contains a stop word. See this list, for example: http://www.ranks.nl/resources/stopwords.html

I would not break out POS tagging or anything heavier for this.
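For illustration, a sketch of that stop-word filter. `STOP_WORDS` here is a small hand-picked subset, not the linked list, and the phrase boundaries are assumed, since the example in the question is run together:

```python
# Small illustrative subset of an English stop list (assumption, not the
# full list from the link above).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def contains_stop_word(phrase):
    """True if any word of the phrase is a stop word."""
    return any(word in STOP_WORDS for word in phrase.lower().split())


def filter_phrases(phrases):
    """Keep only the phrases that contain no stop word at all."""
    return [p for p in phrases if not contains_stop_word(p)]


# Assumed phrase boundaries for the example in the question.
phrases = ["is the fear", "freesat", "are more likely to", "first sight",
           "an hour of", "sue apple", "depression and", "itunes"]
print(filter_phrases(phrases))
```

A longer stop list may be more aggressive than this subset, so it is worth checking which phrases it would drop before adopting one.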

+1

It seems you could create a list that filters out three things:

If you filter these things, you will get pretty far. Are you more concerned with false negatives or false positives? If false negatives are not a huge problem, this is how I would approach it.

0

Source: https://habr.com/ru/post/1432089/

