Possible solutions:
Exact methods:
1) brute force: compare each new page with every page already visited (very slow and inefficient)
2) hashing: compute a hash of each visited page (md5, sha1), store the hashes in a database, and look up each new page's hash there
3) standard Boolean information retrieval model (BIR)
... and many other possible methods
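As a rough illustration of method 2, here is a minimal sketch in Python. The class name SeenPages and the in-memory set are assumptions made for the example; the set stands in for the database table of hashes, and sha1 is just one of the hashes mentioned above.

```python
import hashlib


class SeenPages:
    """Hypothetical helper: remembers the hash of every page seen so far.

    The in-memory set is a stand-in for the database of hashes.
    """

    def __init__(self):
        self._digests = set()

    def is_duplicate(self, content: str) -> bool:
        """Return True if a page with exactly the same content was seen before."""
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in self._digests:
            return True
        # First time this exact content appears: remember it.
        self._digests.add(digest)
        return False


seen = SeenPages()
first = seen.is_duplicate("<html>hello</html>")    # new content
second = seen.is_duplicate("<html>hello</html>")   # same bytes again
third = seen.is_duplicate("<html>hello!</html>")   # one character differs
```

Note that this only catches byte-for-byte identical pages: a single changed character gives a completely different hash, which is why the near-exact methods exist.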
Near-exact methods:
1) fuzzy hash
2) latent semantic indexing (LSI)
....
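To give an idea of what a fuzzy hash can look like, below is a small sketch of a SimHash-style fingerprint in Python. SimHash is only one possible fuzzy hash, chosen here as an assumption for the example; the function names, the word-level tokenization, and the 64-bit size are illustrative choices, not a reference implementation.

```python
import hashlib
import re

BITS = 64  # fingerprint size, an arbitrary choice for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit SimHash-style fingerprint of the words in text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Use the first 8 bytes of md5 as a 64-bit hash of the token.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumped over the lazy dog"
distance = hamming_distance(simhash(page_a), simhash(page_b))
```

Pages that share most of their words tend to get fingerprints with a small Hamming distance, so a new page can be treated as a near-duplicate when the distance to a stored fingerprint falls below some threshold. The threshold has to be tuned on real data.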