Filtering meaningless phrases

Question

I have an algorithm (which I cannot change) that outputs a list of phrases. These phrases are meant to serve as "themes." However, some of them are meaningless on their own. Take this list:

is the fear
freesat
are more likely to
first sight
an hour of
sue apple
depression and
itunes

How can I filter out phrases that do not make sense on their own in order to leave the list as follows?

freesat
first sight
sue apple
itunes

This will apply to phrase sets in many languages, but English is a priority.

nlp

Max Sep 03 '12 at 13:15

4 answers

Fred foo · Answer 1 · 2012-09-03T13:29:36+0000

It should be grammatically acceptable in the sense that it cannot rely on other words in the original sentence from which it was extracted; for example, it cannot end with "and".

Although this is not a well-defined problem yet, it looks like you want some kind of grammar check. I suggest you apply a part-of-speech (POS) tagger to each phrase, compile a list of acceptable POS tag patterns (for example, anything ending in a preposition would be unacceptable), and use that list to filter your input.
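A minimal sketch of that idea in Python follows. It is only an illustration: the `TOY_TAGS` lexicon is a hypothetical stand-in for a real part-of-speech tagger, unknown words are assumed to be nouns, and the single accepted pattern (optional adjectives followed by one or more nouns) is just one possible choice.

```python
import re

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_TAGS = {
    "is": "VERB", "are": "VERB",
    "the": "DET", "an": "DET",
    "of": "ADP", "to": "ADP",
    "and": "CONJ",
    "blue": "ADJ", "first": "ADJ",
    "shirt": "NOUN", "sight": "NOUN", "hour": "NOUN",
    "fear": "NOUN", "depression": "NOUN",
}

# One acceptable tag pattern: zero or more adjectives, then one or more nouns.
ACCEPTABLE = re.compile(r"(ADJ )*(NOUN )*NOUN")


def tag(phrase):
    """Tag each word; unknown words default to NOUN (an assumption)."""
    return [TOY_TAGS.get(word, "NOUN") for word in phrase.lower().split()]


def is_acceptable(phrase):
    """True if the whole tag sequence matches the acceptable pattern."""
    return ACCEPTABLE.fullmatch(" ".join(tag(phrase))) is not None


def filter_phrases(phrases):
    """Keep only the phrases whose tag sequence is acceptable."""
    return [p for p in phrases if is_acceptable(p)]


if __name__ == "__main__":
    candidates = ["blue shirt", "an hour of", "depression and",
                  "first sight", "itunes", "is the fear"]
    print(filter_phrases(candidates))
```

With a real tagger, only `tag` would need to change; the pattern list is where the per-language tuning would go.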

Tim medora · Answer 2 · 2012-09-03T20:28:48+0000

At a high level, it seems that phrases consisting only of nouns, or of adjective-noun combinations, would give much better results.

Examples:

"Blue shirt"
"Happy people"
"Book"

First of all, this problem can be as complex as you want to make it. For third-party reading and solutions, I came across:

http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
http://research.microsoft.com/en-us/groups/nlp/
http://sharpnlp.codeplex.com/ (note the part-of-speech tagger)

If you need 100% accuracy, I would not write such a tool myself.

However, if the problem area is limited ...

I would start by throwing out conjunctions, prepositions, abbreviations, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stop words suggested by @HappyTimeGopher).

After that, you can build a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives, and compare each word in the unprocessed phrases against it. Anything that does not both occur in the dictionary and occur in the right sequence can be thrown away or ranked lower.

This can be useful if you are given 100 input values and have to choose the best one. Finding the values in the dictionary would suggest that the word / phrase is probably good.

I have built such a dictionary automatically by generating a raw index from thousands of documents related to an industry vertical. I then spent several hours in SQL and Excel fixing problems that were easy for a human to spot. The resulting list was not perfect, but it eliminated most of the frankly dumb / meaningless terminology.

As you might have guessed, none of this is reliable, although checking the adjective-noun sequence would help somewhat. Consider the case of "Greatest Hits" compared to "Car Hits [Wall]".

Proper nouns (for example, names of people) do not work well with the dictionary approach, since it is probably not feasible to build a dictionary of all variants of given names / surnames.

Summarizing:

use a stop list
generate a dictionary of words, classifying each by its part(s) of speech
run the raw phrases through the dictionary and the stop words
(optional) rank how confident you are in the match
if necessary, accept phrases that do not violate well-known patterns (this could handle many of your proper nouns)
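The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool described in this answer: `STOP_WORDS` and `VOCAB` are tiny hypothetical samples, and the three score levels (0.0, 0.5, 1.0) are arbitrary.

```python
# Hypothetical, deliberately tiny stop list (assumption).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "are"}

# Hypothetical dictionary: word -> set of possible parts of speech (assumption).
VOCAB = {
    "greatest": {"ADJ"},
    "hits": {"NOUN", "VERB"},
    "car": {"NOUN"},
    "blue": {"ADJ"},
    "shirt": {"NOUN"},
}


def follows_adj_noun_order(words):
    """True if the words can be read as adjectives followed by >= 1 noun."""
    for split in range(len(words)):
        adjectives, nouns = words[:split], words[split:]
        if (all("ADJ" in VOCAB[w] for w in adjectives)
                and all("NOUN" in VOCAB[w] for w in nouns)):
            return True
    return False


def score_phrase(phrase):
    """0.0 = rejected, 0.5 = kept with low confidence, 1.0 = good match."""
    words = phrase.lower().split()
    if not words or any(w in STOP_WORDS for w in words):
        return 0.0
    if all(w in VOCAB for w in words) and follows_adj_noun_order(words):
        return 1.0
    # Unknown words may be proper nouns, so keep them but rank them lower.
    return 0.5


def rank_phrases(phrases, threshold=0.5):
    """Return (phrase, score) pairs at or above threshold, best first."""
    scored = [(p, score_phrase(p)) for p in phrases]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


if __name__ == "__main__":
    for phrase, score in rank_phrases(
            ["greatest hits", "an hour of", "sue apple", "blue shirt"]):
        print(f"{score:.1f}  {phrase}")
```

Note that "car hits" also scores 1.0 here because "hits" can be a noun, which is exactly the kind of unreliability mentioned above.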

HappyTimeGopher · Answer 3 · 2012-09-03T20:11:44+0000

If you have access to the text from which these phrases were generated, it may be easier to simply generate your own theme tags.

Otherwise, I would simply remove any phrase that contains a stop word. See this list, for example: http://www.ranks.nl/resources/stopwords.html

I would not bring in POS tagging or anything heavier for this.
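As a rough sketch, the stop-word filter could look like this in Python. The stop list below is a small illustrative subset rather than the linked list, and the phrase boundaries are one assumed segmentation of the example in the question.

```python
# Small illustrative stop list, not the full list linked above (assumption).
STOP_WORDS = {"is", "the", "are", "more", "to", "an", "of", "and"}


def contains_stop_word(phrase):
    """True if any whitespace-separated word of the phrase is a stop word."""
    return any(word in STOP_WORDS for word in phrase.lower().split())


def drop_stop_word_phrases(phrases):
    """Keep only the phrases that contain no stop word at all."""
    return [p for p in phrases if not contains_stop_word(p)]


if __name__ == "__main__":
    # Assumed segmentation of the example list from the question.
    phrases = ["is the fear", "freesat", "are more likely to", "first sight",
               "an hour of", "sue apple", "depression and", "itunes"]
    print(drop_stop_word_phrases(phrases))
```

A longer stop list may contain words such as "first", in which case "first sight" would be dropped too, so the list probably needs some trimming per language.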

Noah clark · Answer 4 · 2012-09-03T21:18:37+0000

It seems you could create a list that filters out three things:

Prepositions: https://en.wikipedia.org/wiki/List_of_English_prepositions
Conjunctions: https://en.wikipedia.org/wiki/Conjunction_(grammar)
Forms of the verb "to be": http://www.englishplus.com/grammar/00000040.htm

If you filter out these things, you will get pretty far. Are you more concerned with false negatives or false positives? If false negatives are not a huge problem, this is how I would approach it.
