Web crawler: keeping visited URLs in a file

I'm having trouble figuring out how to store and check the visited URLs of a web crawler. The idea is that the number of visited URLs will eventually be too large to hold in memory, so I have to store them in a file, but I wonder whether this becomes very inefficient: every time I receive a batch of URLs and want to know whether one has already been visited, I would have to scan the visited file sequentially and look for a match.

I was thinking about using a cache, but the problem remains when a URL is not found in the cache: I still have to check the file. Do I really have to scan the file for every URL, or is there a better / more efficient way to do this?

2 answers

Use a Bloom filter, for example the implementation in Guava. A Bloom filter tells you that an element (here, a URL) has either definitely not been seen or has possibly been seen. If it reports a possible match, you verify against the file; if it reports no match, you can treat the URL as new, visit it, and add it to both the file and the Bloom filter. To save space, store a hash of each URL as a byte[] (for example, md5) instead of the URL itself.

byte[] hash = md5(url);              // hash the URL to keep the stored entries small
if (bloomFilter.mightContain(hash)) {
    // possible false positive: confirm against the file on disk
    checkTheFile(hash);
} else {
    // definitely not visited yet
    visitUrl(url);
    addToFile(hash);
    bloomFilter.put(hash);
}
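
For reference, a minimal sketch of how the bloomFilter and the md5 helper used above could be set up with Guava (the class name and the expected-insertion count are assumptions for illustration):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class VisitedSet {
    // Size the filter for the number of URLs you expect and the false-positive
    // rate you can tolerate; 10 million entries at 1% is only a placeholder.
    final BloomFilter<byte[]> bloomFilter =
            BloomFilter.create(Funnels.byteArrayFunnel(), 10_000_000, 0.01);

    // The md5(url) helper from the snippet above, via Guava's Hashing.
    static byte[] md5(String url) {
        return Hashing.md5().hashString(url, StandardCharsets.UTF_8).asBytes();
    }
}

Guava marks md5() as deprecated for security-sensitive uses, but for deduplicating URLs any fast hash (for example Hashing.murmur3_128()) works just as well.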

Note that a lookup in a plain file is not O(1); if you do have to fall back to the file, keep the stored hashes sorted or maintain an index over them so that the check does not turn into a full scan.
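
A minimal sketch of one such indexing scheme, assuming hash-prefix bucketing: the hashes are spread over bucket files keyed by their first byte, so a fallback check only reads roughly 1/256 of the data (class name and file layout are illustrative, not from the answer; HexFormat needs Java 17+):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HexFormat;

public class HashBuckets {
    private final Path dir;

    public HashBuckets(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // One bucket file per first byte of the hash.
    private Path bucketFor(byte[] hash) {
        return dir.resolve(String.format("%02x.txt", hash[0]));
    }

    public void add(byte[] hash) throws IOException {
        Files.writeString(bucketFor(hash),
                HexFormat.of().formatHex(hash) + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public boolean contains(byte[] hash) throws IOException {
        Path bucket = bucketFor(hash);
        if (!Files.exists(bucket)) {
            return false;
        }
        String hex = HexFormat.of().formatHex(hash);
        try (var lines = Files.lines(bucket)) {
            return lines.anyMatch(hex::equals);
        }
    }
}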


How many URLs are you expecting? Unless the number is genuinely enormous, a set of URL hashes may well still fit in memory.

Also, you will probably want to store more than just the fact that a URL was visited, e.g. the fetch status (404s, redirects, the time of the last fetch, the final URL, etc.).

Besides the visited URLs, you also need to keep the URLs that have been discovered but not yet crawled, so the crawler knows what to fetch next. A small in-memory sketch of both follows below.
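
A minimal in-memory sketch of that per-URL state plus the queue of not-yet-crawled URLs (class and field names are assumptions for illustration, not from the answer; a real crawler would persist this):

import java.time.Instant;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class CrawlState {
    // Per-URL metadata: more than just "visited".
    public record UrlRecord(String url, int httpStatus, Instant lastFetched) {}

    // Visited URLs with their metadata, plus the frontier of URLs still to crawl.
    private final Map<String, UrlRecord> visited = new HashMap<>();
    private final Queue<String> frontier = new ArrayDeque<>();

    public void markVisited(String url, int httpStatus) {
        visited.put(url, new UrlRecord(url, httpStatus, Instant.now()));
    }

    public void discover(String url) {
        // frontier.contains() is O(n); acceptable for a sketch only.
        if (!visited.containsKey(url) && !frontier.contains(url)) {
            frontier.add(url);
        }
    }

    public String nextToCrawl() {
        return frontier.poll();   // null when there is nothing left to fetch
    }
}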

As an example, node-nutch (https://www.npmjs.com/package/node-nutch) is a crawler for node that keeps this crawl state either on the local file system or in S3.


Source: https://habr.com/ru/post/1608788/

