Detecting and removing noise text

Question

Given a database table with a huge amount of data in it, what is the best way to remove noise text? For example:

fghfghfghfg
qsdqsdqsd
rtyrtyrty

This noise is stored in the name field.

I work on the data with standard Java structures.

+4

java text noise

Youssef May 13, '10 at 13:22

11 answers

Removing such things is not as easy as it might seem.

It is easy for us humans to see that "djkhfkjh" makes no sense. But how would a computer detect such noise? How would it know whether Eyjafjallajökull is just someone mashing the keyboard or the most talked-about mountain of the last couple of years?

You cannot do this reliably without a lot of false positives, so in the end you are back to sorting the false positives from the true positives by hand.

+8

Luken May 13, '10 at 13:33

Get a dictionary with as many names as you can find and filter your data to show the entries that are not in the dictionary. Then review and delete them one at a time to make sure that you are not deleting valid data. Sorting the list by name can help you delete more rows at a time.

+3

Ovidiu pacurar May 13, '10 at 13:28

If the rest of the text is English, you can use a word list. If more than a given percentage (say, 50%) of the words in the text are not in the word list, the text is probably noise.

You might want to set a minimum length, such as 5 words, to avoid deleting short messages such as "LOL".

On most Linux installations, you can extract a list of words from aspell as follows:

 aspell --lang=en dump master
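
A minimal Java sketch of this heuristic, assuming the word list has already been loaded into a `Set<String>` of lowercase words. The class name, method name, and parameters are illustrative, not part of any library.

```java
import java.util.Locale;
import java.util.Set;

/** Sketch of the word-list heuristic; all names here are illustrative. */
final class WordListNoiseCheck {

    /**
     * Returns true when the text has at least minWords words and more than
     * maxUnknownFraction of them are missing from knownWords.
     */
    static boolean looksLikeNoise(String text, Set<String> knownWords,
                                  int minWords, double maxUnknownFraction) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        if (words.length < minWords) {
            return false; // too short to judge, e.g. "LOL"
        }
        int unknown = 0;
        for (String word : words) {
            // Strip punctuation so that "hello," still matches "hello".
            String letters = word.replaceAll("[^\\p{L}']", "");
            if (!letters.isEmpty() && !knownWords.contains(letters)) {
                unknown++;
            }
        }
        return (double) unknown / words.length > maxUnknownFraction;
    }
}
```

With the values suggested above, the call would be `looksLikeNoise(text, words, 5, 0.5)`. Note that a single-token value such as "fghfghfghfg" falls below the five-word minimum, so for a name field the minimum would probably need to be lowered.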

+2

Thomas May 13, '10 at 13:30

You will need to start with a more precise definition of "noise text". Defining the problem is the hard part here. You cannot write code that says, "Get rid of strings that look like _____." It looks like the pattern you identified is "a set of three consecutive characters, repeated at least once, where the string may not end cleanly (it may end partway through the set)."

Now write a regex matching this pattern and test it.

But I'm sure there are other patterns you are looking for ...
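
One possible regex for the pattern described above, as a sketch using `java.util.regex`. It assumes the repeated group is exactly three letters and that the whole field value must match; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

/** Sketch: matches a three-letter group repeated at least once, with an optional partial tail. */
final class RepeatedTripleCheck {

    // Three captured letters, then one or more full repeats of them,
    // then optionally the first letter or the first two letters of the group.
    private static final Pattern REPEATED_TRIPLE = Pattern.compile(
            "(\\p{L})(\\p{L})(\\p{L})(?:\\1\\2\\3)+(?:\\1\\2?)?");

    static boolean isRepeatedTriple(String value) {
        // matches() requires the entire string to fit the pattern.
        return REPEATED_TRIPLE.matcher(value).matches();
    }
}
```

All three examples from the question match this pattern, while an ordinary name such as "Youssef" does not. The match is case-sensitive, and a string of one repeated letter such as "aaaaaa" also matches.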

+2

Jim kiley May 13, '10 at 13:31

Inspect each word and see how much redundancy it contains. If there are more than three consecutive repeating groups of letters, the word is a good candidate for noise. In addition, look for groups of letters that usually do not belong together, and for runs of consecutive letters that also sit next to each other on the keyboard. If the whole word is made up of letters that are keyboard neighbors, it is also a candidate for the noise list.
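
The keyboard-neighbor check could be sketched in Java as follows. This is one interpretation only: it treats a word as suspicious when all of its distinct letters form one contiguous stretch of a single keyboard row. The QWERTY rows are an assumption about the layout, and the class and method names are illustrative.

```java
import java.util.Locale;

/** Sketch: flags words whose letters all sit side by side in one keyboard row. */
final class KeyboardRunCheck {

    // Letter rows of a QWERTY layout (an assumption); pass other rows for other layouts.
    static final String[] QWERTY_ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    /** True when the distinct letters of word fill a contiguous stretch of one row. */
    static boolean isKeyboardRun(String word, String[] rows) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String row : rows) {
            boolean[] seen = new boolean[row.length()];
            int min = Integer.MAX_VALUE;
            int max = -1;
            int distinct = 0;
            boolean allInRow = true;
            for (int i = 0; i < lower.length(); i++) {
                int idx = row.indexOf(lower.charAt(i));
                if (idx < 0) {
                    allInRow = false; // this character is not in the current row
                    break;
                }
                if (!seen[idx]) {
                    seen[idx] = true;
                    distinct++;
                }
                min = Math.min(min, idx);
                max = Math.max(max, idx);
            }
            // Contiguous means no unused key between the leftmost and rightmost key hit.
            if (allInRow && distinct >= 2 && max - min + 1 == distinct) {
                return true;
            }
        }
        return false;
    }
}
```

On QWERTY rows this flags "fghfghfghfg" and "rtyrtyrty" but not "qsdqsdqsd", whose letters are only adjacent on an AZERTY layout. Real words such as "were" are flagged too, so the result is a signal to combine with other checks rather than a verdict.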

+2

luvieere May 13, '10 at 13:35

Training an NLP classifier is likely to be the best way. However, a simpler method would be to verify that each word exists in a list of all known "correct" words. Most Unix systems have a file called /usr/share/dict/words that you can use for this purpose. In addition, Ubuntu extends this with /usr/share/dict/american-english, /usr/share/dict/american-english-huge and /usr/share/dict/american-english-insane, each of which contains more words than the last. These lists also include many common misspellings, so you will not filter out text that is not technically a word but is clearly recognizable as one.

If you are truly ambitious, you can combine these approaches and use these word lists to train a Bayesian or maximum entropy classifier.

+1

Cerin May 13, '10 at 13:47

There are many good answers here. Which one will work for you depends on the specifics of your problem: for example, whether the input should be English words, user names, people's last names, and so on.

One approach: write a program to analyze text that you consider "valid." Count how often each three-letter sequence appears in that valid text. Then, when you have input to check, look at each three-letter sequence in the input and look up its expected frequency. Something like "xzt" probably has a frequency near zero. If the input has too many such low-frequency sequences, mark it as garbage.

Problems with this:

You may treat bad spelling as garbage, for example if someone forgets to put a "u" after the letter "q" in a word.
You will not catch input like "thethethethe".
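
The three-letter-sequence approach above might look like this in Java. It is a sketch under simple assumptions: training text arrives as a collection of valid words, and "low frequency" means seen fewer than `minCount` times. The class name, method names, and `minCount` are all illustrative.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: counts three-letter sequences in valid text and scores new input against them. */
final class TrigramModel {

    private final Map<String, Integer> counts = new HashMap<>();

    /** Counts every three-character sequence in each valid word. */
    void train(Iterable<String> validWords) {
        for (String word : validWords) {
            String w = word.toLowerCase(Locale.ROOT);
            for (int i = 0; i + 3 <= w.length(); i++) {
                counts.merge(w.substring(i, i + 3), 1, Integer::sum);
            }
        }
    }

    /** Fraction of the input's trigrams that were seen fewer than minCount times. */
    double rareTrigramFraction(String input, int minCount) {
        String w = input.toLowerCase(Locale.ROOT);
        int total = w.length() - 2; // number of trigrams in the input
        if (total <= 0) {
            return 0.0; // too short to contain any trigram
        }
        int rare = 0;
        for (int i = 0; i < total; i++) {
            if (counts.getOrDefault(w.substring(i, i + 3), 0) < minCount) {
                rare++;
            }
        }
        return (double) rare / total;
    }
}
```

A value would then be marked as garbage when `rareTrigramFraction` exceeds some cutoff chosen by inspecting real data. Both problems listed above still apply to this sketch.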

+1

Dan May 13, '10 at 13:51

Submit snippets of the text to Google and see how many results you get.

+1

Chris dennett May 13, '10 at 23:42

You can try to get the database to return the field stripped of everything except letters and spaces, with all letters lowercased. Then, in your program, build a hash set of valid lowercase words. For a given database field value, split it on the space character and check whether each substring exists in the hash set.

Create a table of the original field values with a flag indicating whether each one passed the test, and review it.

It looks like you need to do something like this as a preliminary check before moving on to more advanced methods.
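
A rough Java sketch of this preliminary check. Here the normalization is done in Java rather than in the database, which is an assumption made to keep the example self-contained; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch: flags each field value by whether all of its words are in a set of valid words. */
final class FieldReview {

    /** Keeps only letters and spaces, then lowercases and trims. */
    static String normalize(String raw) {
        return raw.replaceAll("[^\\p{L} ]", "").toLowerCase(Locale.ROOT).trim();
    }

    /** True when every space-separated token of the normalized value is a valid word. */
    static boolean passes(String raw, Set<String> validWords) {
        String normalized = normalize(raw);
        if (normalized.isEmpty()) {
            return false;
        }
        for (String token : normalized.split(" +")) {
            if (!validWords.contains(token)) {
                return false;
            }
        }
        return true;
    }

    /** Maps each original field value to its pass/fail flag, in input order. */
    static Map<String, Boolean> review(List<String> values, Set<String> validWords) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String value : values) {
            flags.put(value, passes(value, validWords));
        }
        return flags;
    }
}
```

The map returned by `review` plays the role of the review table: it keeps the original values, so failed entries can be inspected by hand before anything is deleted.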

0

Ian May 13, '10 at 14:01

Examples #1 and #2 can be caught by a parser that tries to work out how to pronounce the text. Regardless of language, they are unpronounceable and therefore not words.

0

Loren pechtel May 13, '10 at 23:46

bmargulies · Accepted Answer · 2010-05-13T13:29:02+0000

Well, you can build a classifier using NLP methods and train it on examples of noise and non-noise. One example of this could be the Apache Tika language detector. If the language detector says "beats me," that may be good enough.
