How to determine keyword stuffing?

We are working on a document search engine, primarily focused on indexing user-submitted MS Word documents.

We noticed that there is abuse of keywords.

We identified two main types of abuse:

  • Repeating the same term over and over
  • Many irrelevant terms added to a document en masse

Both forms of abuse are hidden by adding text with the same font color as the document's background color, or by setting the font size to 1px.

Determining whether the text color matches the background color is difficult given the subtleties of MS Word layouts, and the same goes for font size - any cutoff we pick seems somewhat arbitrary, and if we set the threshold too aggressively we risk accidentally deleting legitimate text.

My question is: are there standardized pre-processing or statistical analysis methods that can be used to reduce the impact of this kind of keyword stuffing?

Any guidance would be appreciated!

4 answers

There is a surprisingly simple solution to your problem using the concept of compressibility.

If you convert the Word documents to text (which you can easily do on the fly), you can compress them (for example, using the free zlib library) and look at the compression ratios. Normal text documents usually have a compression ratio of about 2, so any significant deviation suggests they have been "stuffed". The analysis is also very fast: I analyzed about 100 thousand texts and it took about 1 minute using Python.
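A minimal sketch of that check in Python, assuming the document has already been converted to plain text; the stuffing threshold is purely illustrative and should be tuned on your own corpus:

    import zlib

    def compression_ratio(text):
        """Original size divided by compressed size for a piece of text."""
        raw = text.encode("utf-8")
        if not raw:
            return 0.0
        return len(raw) / len(zlib.compress(raw))

    # Ordinary prose compresses at a ratio of roughly 2; documents full of
    # repeated keywords compress much better, so their ratio climbs higher.
    STUFFED_THRESHOLD = 4.0  # illustrative cut-off

    def looks_stuffed(text):
        return compression_ratio(text) > STUFFED_THRESHOLD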

Another option is to look at the statistical properties of the documents / words. To do this, you need a sample of "clean" documents and to compute the mean frequency of each word, as well as its standard deviation.

Once you have done this, you can take a new document and compare it against those means and deviations. Stuffed documents will show up either as a few words with a very high deviation from that word's mean (documents where one or two words are repeated many times) or as many words with large deviations (documents with repeated blocks of text).
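A rough sketch of how such a reference could be built and used, assuming simple whitespace tokenization and an illustrative z-score cut-off:

    from collections import Counter
    import math

    def relative_freqs(text):
        words = text.lower().split()
        total = len(words) or 1
        return {w: c / total for w, c in Counter(words).items()}

    def build_reference(clean_texts):
        """Per-word mean and standard deviation of relative frequency
        across a sample of known-clean documents."""
        per_doc = [relative_freqs(t) for t in clean_texts]
        n = len(per_doc)
        stats = {}
        for w in set().union(*per_doc):
            vals = [d.get(w, 0.0) for d in per_doc]
            mean = sum(vals) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
            stats[w] = (mean, std)
        return stats

    def suspicious_words(text, stats, z_cutoff=5.0):
        """Words whose frequency deviates strongly from the clean-corpus mean."""
        out = []
        for w, f in relative_freqs(text).items():
            mean, std = stats.get(w, (0.0, 0.0))
            if std > 0 and (f - mean) / std > z_cutoff:
                out.append(w)
        return out

A document that triggers on one or two words suggests simple repetition; one that triggers on many words suggests repeated blocks of text.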

Here are some useful compressibility links:

http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf

http://www.ispras.ru/en/proceedings/docs/2011/21/isp_21_2011_277.pdf

Perhaps you can also use the concept of entropy, for example by calculating the Shannon entropy: http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
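For example, a small character-level version (one of several reasonable ways to define it); stuffed, highly repetitive text tends to score noticeably lower than ordinary prose:

    import math
    from collections import Counter

    def shannon_entropy(text):
        """Shannon entropy of the character distribution, in bits per character."""
        if not text:
            return 0.0
        total = len(text)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(text).values())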

Another possible solution would be to use Part-of-Speech (POS) tagging. I believe that the average percentage of nouns is similar across "normal" documents (around 37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true ). If the percentage is noticeably higher or lower for some POS tags, you may have found "stuffed" documents.
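A possible sketch using NLTK's stock tokenizer and tagger (resource names vary slightly between NLTK versions); the 37% figure comes from the linked paper, and the 10-point tolerance is just an illustration:

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def noun_percentage(text):
        tokens = nltk.word_tokenize(text)
        if not tokens:
            return 0.0
        tags = nltk.pos_tag(tokens)
        nouns = sum(1 for _, tag in tags if tag.startswith("NN"))
        return 100.0 * nouns / len(tags)

    def pos_looks_odd(text, expected=37.0, tolerance=10.0):
        # Flags documents whose noun share is far from the expected value.
        return abs(noun_percentage(text) - expected) > tolerance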


As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even those get it wrong, which is why there is an appeal process), it is best to flag keyword-stuffed documents for further review by a human.

If a page has 100 words and you parse it counting keyword occurrences (which makes the 1px / bgcolor hiding irrelevant), you get a keyword density figure. There is no hard and fast rule for what percentage is "always" keyword stuffing; typically 3-7% is normal. Perhaps if you find 10%+ you flag the document as "potentially stuffed" and set it aside for review by a person.
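A minimal density check along those lines, using the illustrative 10% threshold mentioned above:

    def keyword_density(text, keyword):
        """Percentage of the document's words that are the given keyword."""
        words = text.lower().split()
        if not words:
            return 0.0
        return 100.0 * words.count(keyword.lower()) / len(words)

    def flag_for_review(text, keywords, threshold=10.0):
        """Return the keywords whose density exceeds the threshold."""
        return [k for k in keywords if keyword_density(text, k) > threshold]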

Also, consider these scenarios (taken from here):

  • Lists of phone numbers without significant added value
  • A text block that lists cities and states that the webpage is trying to rank for

and consider the context in which the keywords appear.

Pretty hard to get right.


Detect hidden text using forecolor / backcolor matching, as you are already doing. For font size, calculate the average text size in the document and flag outliers; you can also set predefined limits for the size (as you have already done).
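One way to sketch the size-outlier idea, assuming you have already extracted (text, size-in-points) pairs from the document; the cut-off is illustrative:

    import statistics

    def undersized_runs(runs, z_cutoff=2.0):
        """runs: list of (text, font_size_pt) pairs. Flags runs whose size
        is an outlier on the small side of the document's average."""
        sizes = [size for _, size in runs if size is not None]
        if len(sizes) < 2:
            return []
        mean = statistics.mean(sizes)
        std = statistics.pstdev(sizes) or 1.0
        return [(text, size) for text, size in runs
                if size is not None and (mean - size) / std > z_cutoff]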

Tag stuffing usually has some structure, as described below. For your first case, you can simply count the words, and if one appears far too often (perhaps 5 times more often than the second most frequent word), you can flag it as a repeated tag.
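A small sketch of that frequency check, with the 5x ratio from above as the default:

    from collections import Counter

    def dominant_word(text, ratio=5.0):
        """Return the most frequent word if it occurs `ratio` times more
        often than the runner-up, otherwise None."""
        top_two = Counter(text.lower().split()).most_common(2)
        if len(top_two) < 2:
            return None
        (top, top_n), (_, second_n) = top_two
        return top if top_n >= ratio * max(second_n, 1) else None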

When tags are added en masse, the user often adds them all in one place, so you will see similar "fraud tags" appearing next to each other (possibly with one or two words between them).

If you can identify at least some common "fraud tags" and want to get a little more advanced, you can do the following:

  • Divide the document into parts that use the same formatting / font and analyze each part separately. For more accurate results, group parts that use nearly the same font / size, not just those that have EXACTLY the same font / size.
  • Count the occurrences of each known tag in each part; when the limit you have defined is exceeded, delete that part of the document or mark the whole document as "bad" (as in "uses extra tags"), as in the sketch below.
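A rough sketch of the per-part counting step, assuming the document has already been split into same-font parts; both the tag list and the limit are hypothetical placeholders:

    from collections import Counter

    KNOWN_FRAUD_TAGS = {"casino", "viagra", "cheap", "free"}  # hypothetical examples

    def flag_stuffed_parts(parts, limit=3):
        """parts: list of text chunks, each made of runs with (nearly) the
        same font/size. Returns the indices of parts whose count of known
        fraud tags exceeds the limit."""
        flagged = []
        for i, part in enumerate(parts):
            counts = Counter(part.lower().split())
            if sum(counts[tag] for tag in KNOWN_FRAUD_TAGS) > limit:
                flagged.append(i)
        return flagged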

No matter how advanced your detection is, once people know it is there and more or less know how it works, they will find ways to get around it.

When that happens, you just have to flag the offending documents and review them yourself. Then, whenever you notice that your detection algorithm produced a false positive, you can improve it.


If you notice a pattern where the habitual stuffers always use a font size below a certain threshold, say 1-5, which is not actually readable, then you can assume that text is "stuffed".

Then you can also check whether the font color matches the background color and drop that section.
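A first-pass sketch with the python-docx library (assuming a plain white page background); note that formatting is often inherited from styles, so run-level size and color can be None and this check will miss those cases:

    from docx import Document
    from docx.shared import RGBColor

    MIN_READABLE_PT = 6                       # illustrative cut-off
    BACKGROUND = RGBColor(0xFF, 0xFF, 0xFF)   # assumes a white page

    def suspicious_runs(path):
        doc = Document(path)
        hits = []
        for para in doc.paragraphs:
            for run in para.runs:
                size = run.font.size          # Length, or None if inherited
                color = None
                if run.font.color is not None and run.font.color.type is not None:
                    color = run.font.color.rgb
                too_small = size is not None and size.pt < MIN_READABLE_PT
                invisible = color == BACKGROUND
                if too_small or invisible:
                    hits.append(run.text)
        return hits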


Source: https://habr.com/ru/post/1484816/

