Detect duplicate web pages among a large number of URLs

From a quote on the official Google blog:

"In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates . . . " 

How does Google detect these exact duplicates among web pages or documents? Any idea what algorithm Google uses?

2 answers

According to http://en.wikipedia.org/wiki/MinHash :

In 2006, Google conducted a large-scale evaluation [10] to compare the performance of the MinHash and SimHash algorithms [11]. In 2007, Google reported using SimHash for duplicate detection in web crawling [12] and using MinHash and LSH for Google News personalization [13].

A search for SimHash turns up these pages:

https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/

https://github.com/leonsim/simhash

which link to a paper written by Google employees: Detecting Near-Duplicates for Web Crawling

Abstract:

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
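For illustration, here is a minimal sketch of Charikar's SimHash idea in plain Python. This is not the code from the linked repositories or the paper; the whitespace tokenizer, the salt-free MD5 feature hash, and the 64-bit fingerprint size are assumptions made only for the example.

```python
# Minimal SimHash sketch (illustrative, not the paper's production code).
# Assumption: features are whitespace-separated tokens, hashed to 64 bits with MD5.
import hashlib

def simhash(text, bits=64):
    # Running score per bit position: +1 if the feature's bit is 1, -1 if it is 0.
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Final fingerprint: bit i is 1 iff the accumulated score is positive.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two pages are treated as near-duplicates if their fingerprints differ
# in at most k bit positions (the paper works with small k).
page_a = "cheap flights to london book now"
page_b = "cheap flights to london book today"
print(hamming_distance(simhash(page_a), simhash(page_b)))  # small for similar text
```

The paper's second contribution, quickly finding all stored f-bit fingerprints within Hamming distance k of a new one, relies on an additional table-permutation scheme described there; the sketch above only shows how a fingerprint is built and compared.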

Another SimHash paper:

http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf


Possible solutions

Exact methods

1) brute force: compare each new page against every previously visited page (very slow and inefficient)

2) compute a hash of each visited page (e.g. MD5 or SHA-1), store the hashes in a database, and look up the hash of each new page in that database (see the sketch after this list)

3) standard Boolean information retrieval model (BIR)

........ many other possible methods
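Method 2 above can be illustrated in a few lines of Python. This is only a sketch under assumptions: an in-memory set stands in for the hash database, and the page body is hashed as-is without any normalization.

```python
# Sketch of exact-duplicate detection by content hashing (method 2 above).
# Assumption: an in-memory set stands in for the hash database.
import hashlib

seen_hashes = set()

def is_exact_duplicate(page_content: str) -> bool:
    """Return True if an identical page body has been seen before."""
    digest = hashlib.sha1(page_content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

print(is_exact_duplicate("<html>hello</html>"))  # False: first time seen
print(is_exact_duplicate("<html>hello</html>"))  # True: exact copy
```

Note that this only catches byte-for-byte duplicates; a single changed character produces a completely different digest, which is why the near-exact methods below exist.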

Near-exact methods

1) fuzzy hashing (e.g. locality-sensitive schemes such as SimHash or MinHash; see the sketch after this list)

2) latent semantic indexing

....
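As an illustration of the near-exact family (and of the MinHash algorithm mentioned in the other answer), here is a minimal sketch that estimates the Jaccard similarity of two pages' shingle sets. The 3-word shingles, the 128 hash functions, and the salted-MD5 construction are arbitrary choices for this example, not anyone's production setup.

```python
# MinHash sketch: estimate Jaccard similarity of two pages' shingle sets.
# Assumptions: 3-word shingles, 128 hash functions built by salting MD5 with a seed.
import hashlib

NUM_HASHES = 128

def shingles(text, k=3):
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items):
    # For each salted hash function, keep the minimum hash value over the set.
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16)
            for item in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of hash functions whose minima agree approximates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river bend"
print(estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b))))
```

Pages whose estimated similarity exceeds a chosen threshold can then be flagged as near-duplicates; LSH banding of the signatures avoids comparing every pair.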

