N-gram language models
You could try building one n-gram language model on auto-generated spam pages and another on a collection of other, non-spam web pages.
Then you can simply score new pages with both language models to see whether the text looks more like the spam pages or like regular web content.
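As a minimal sketch of that idea (assuming hypothetical, already-tokenized training documents; a real system would use a proper toolkit with better smoothing), here is a toy add-one-smoothed bigram model and a log-probability scorer:

```python
import math
from collections import Counter

def train_bigram_model(docs):
    """Train an add-one-smoothed bigram model from lists of tokens (toy sketch)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for tokens in docs:
        padded = ["<s>"] + tokens          # sentence-start marker
        vocab.update(padded)
        unigrams.update(padded[:-1])       # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    return bigrams, unigrams, vocab

def log_score(tokens, model):
    """Estimate log P(Text | model) under the bigram model."""
    bigrams, unigrams, vocab = model
    v = len(vocab)
    lp, prev = 0.0, "<s>"
    for tok in tokens:
        # add-one smoothing so unseen bigrams get nonzero probability
        lp += math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + v))
        prev = tok
    return lp
```

Scoring a page under both models and comparing the results is then just two calls to `log_score`.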
Getting a Bayesian score
When you score text with the spam language model, you get an estimate of the probability of seeing that text on a spam web page, P(Text|Spam) . The notation is read "the probability of Text given Spam (page)". The score from the non-spam language model is an estimate of the probability of seeing the text on a non-spam web page, P(Text|Non-Spam) .
However, what you probably really want is P(Spam|Text) or, equivalently, P(Non-Spam|Text) . That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it .
To get either of these, you'll need to use Bayes' rule, which states
P(A|B) = P(B|A)P(A) / P(B)
Using Bayes' rule, we have
P(Spam|Text) = P(Text|Spam)P(Spam) / P(Text)
and
P(Non-Spam|Text) = P(Text|Non-Spam)P(Non-Spam) / P(Text)
P(Spam) is your prior belief that a page drawn at random from the web is spam. You can estimate this quantity by counting the spam web pages in some sample, or you can even treat it as a parameter that you tune by hand to trade off precision and recall . For example, setting this parameter high means fewer spam pages are mistakenly classified as non-spam, while setting it low means fewer legitimate pages are mistakenly classified as spam.
The term P(Text) is the overall probability of seeing Text on any web page. If we ignore for the moment that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, it can be computed as P(Text) = P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam) . This sums over the binary Spam / Non-Spam variable.
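If you do want a full posterior probability, that marginalization can be done directly in log space. A small sketch (taking the two log-likelihoods and the prior as inputs, since they come from your models):

```python
import math

def posterior_spam(log_p_text_spam, log_p_text_ham, p_spam):
    """P(Spam|Text) via Bayes' rule, marginalizing over Spam/Non-Spam."""
    joint_spam = log_p_text_spam + math.log(p_spam)        # log P(Text|Spam)P(Spam)
    joint_ham = log_p_text_ham + math.log(1.0 - p_spam)    # log P(Text|Non-Spam)P(Non-Spam)
    # log P(Text) = log of the sum of the two joint terms
    log_p_text = math.log(math.exp(joint_spam) + math.exp(joint_ham))
    return math.exp(joint_spam - log_p_text)
```

For very long texts the joint terms can underflow `math.exp`; a log-sum-exp trick (or `math.fsum` over rescaled terms) fixes that, but is omitted here for brevity.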
Classification only
However, if you are not going to use the probabilities for anything else, you do not need to compute P(Text) at all. Instead, you can simply compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam) . If the first is larger, the page is most likely spam; if the second is larger, it is most likely not. This works because the expressions above for P(Spam|Text) and P(Non-Spam|Text) are both normalized by the same P(Text) .
Tools
In terms of software toolkits you could use for something like this, SRILM would be a good place to start, and it's free for non-commercial use. If you want to use something commercially and don't want to pay for a license, you could use IRSTLM , which is distributed under the LGPL.