N-gram language models
You could try building one n-gram language model on auto-generated spam pages and another on a collection of other, non-spam web pages.
Then you can simply score new pages with both language models to see whether the text looks more like the spam pages or like regular web content.
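As a minimal sketch of that idea (assuming hypothetical, already-tokenized training documents; a real system would use a proper toolkit with better smoothing), here is a toy add-one-smoothed bigram model and a log-probability scorer:

```python
import math
from collections import Counter

def train_bigram_model(docs):
    """Train an add-one-smoothed bigram model from lists of tokens (toy sketch)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for tokens in docs:
        padded = ["<s>"] + tokens          # sentence-start marker
        vocab.update(padded)
        unigrams.update(padded[:-1])       # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    return bigrams, unigrams, vocab

def log_score(tokens, model):
    """Estimate log P(Text | model) under the bigram model."""
    bigrams, unigrams, vocab = model
    v = len(vocab)
    lp, prev = 0.0, "<s>"
    for tok in tokens:
        # add-one smoothing so unseen bigrams get nonzero probability
        lp += math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + v))
        prev = tok
    return lp
```

Scoring a page under both models and comparing the results is then just two calls to `log_score`.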
Getting a Bayesian score
When you score text with the spam language model, you get an estimate of the probability of seeing that text on a spam web page, P(Text|Spam) . The notation is read "the probability of Text given Spam (page)". The score from the non-spam language model is an estimate of the probability of seeing the text on a non-spam web page, P(Text|Non-Spam) .
However, what you probably really want is P(Spam|Text) or, equivalently, P(Non-Spam|Text) . That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it .
To get either of these, you'll need to use Bayes' rule, which states
P(A|B) = P(B|A)P(A) / P(B)
Using Bayes' rule, we have
P(Spam|Text) = P(Text|Spam)P(Spam) / P(Text)
and
P(Non-Spam|Text) = P(Text|Non-Spam)P(Non-Spam) / P(Text)
P(Spam) is your prior belief that a page drawn at random from the web is spam. You can estimate this quantity by counting the spam web pages in some sample, or you can even treat it as a parameter that you tune by hand to trade off precision and recall . For example, setting this parameter high means fewer spam pages are mistakenly classified as non-spam, while setting it low means fewer legitimate pages are mistakenly classified as spam.
The term P(Text) is the overall probability of seeing Text on any web page. If we ignore for the moment that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, it can be computed as P(Text) = P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam) . This sums over the binary Spam / Non-Spam variable.
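If you do want a full posterior probability, that marginalization can be done directly in log space. A small sketch (taking the two log-likelihoods and the prior as inputs, since they come from your models):

```python
import math

def posterior_spam(log_p_text_spam, log_p_text_ham, p_spam):
    """P(Spam|Text) via Bayes' rule, marginalizing over Spam/Non-Spam."""
    joint_spam = log_p_text_spam + math.log(p_spam)        # log P(Text|Spam)P(Spam)
    joint_ham = log_p_text_ham + math.log(1.0 - p_spam)    # log P(Text|Non-Spam)P(Non-Spam)
    # log P(Text) = log of the sum of the two joint terms
    log_p_text = math.log(math.exp(joint_spam) + math.exp(joint_ham))
    return math.exp(joint_spam - log_p_text)
```

For very long texts the joint terms can underflow `math.exp`; a log-sum-exp trick (or `math.fsum` over rescaled terms) fixes that, but is omitted here for brevity.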
Classification only
However, if you are not going to use the probabilities for anything else, you do not need to compute P(Text) at all. Instead, you can simply compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam) . If the first is larger, the page is most likely spam; if the second is larger, it is most likely not. This works because the expressions above for P(Spam|Text) and P(Non-Spam|Text) are both normalized by the same P(Text) .
Tools
In terms of software toolkits you could use for something like this, SRILM would be a good place to start, and it's free for non-commercial use. If you want to use something commercially and don't want to pay for a license, you could use IRSTLM , which is distributed under the LGPL.