Web crawler: keeping visited URLs in a file

I'm having trouble figuring out how to store and check the visited URLs of a web crawler. The idea is that the number of visited URLs will eventually be too large to hold in memory, so I have to store them in a file, but I wonder whether this becomes very inefficient: every time I receive a batch of URLs and want to know whether one has already been visited, I would have to scan the visited file sequentially and look for a match.

I was thinking about using a cache, but the problem remains when a URL is not found in the cache: I still have to check the file. Do I really have to scan the file for every URL, or is there a better / more efficient way to do this?

2 answers

Use a Bloom filter, for example the implementation in Guava. A Bloom filter tells you that an element (here, a URL) has either definitely not been seen or has possibly been seen. If it reports a possible match, you verify against the file; if it reports no match, you can treat the URL as new, visit it, and add it to both the file and the Bloom filter. To save space, store a hash of each URL as a byte[] (for example, md5) instead of the URL itself.

byte[] hash = md5(url);              // hash the URL to keep the stored entries small
if (bloomFilter.mightContain(hash)) {
    // possible false positive: confirm against the file on disk
    checkTheFile(hash);
} else {
    // definitely not visited yet
    visitUrl(url);
    addToFile(hash);
    bloomFilter.put(hash);
}
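
For reference, a minimal sketch of how the bloomFilter and the md5 helper used above could be set up with Guava (the class name and the expected-insertion count are assumptions for illustration):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class VisitedSet {
    // Size the filter for the number of URLs you expect and the false-positive
    // rate you can tolerate; 10 million entries at 1% is only a placeholder.
    final BloomFilter<byte[]> bloomFilter =
            BloomFilter.create(Funnels.byteArrayFunnel(), 10_000_000, 0.01);

    // The md5(url) helper from the snippet above, via Guava's Hashing.
    static byte[] md5(String url) {
        return Hashing.md5().hashString(url, StandardCharsets.UTF_8).asBytes();
    }
}

Guava marks md5() as deprecated for security-sensitive uses, but for deduplicating URLs any fast hash (for example Hashing.murmur3_128()) works just as well.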

Note that a lookup in a plain file is not O(1); if you do have to fall back to the file, keep the stored hashes sorted or maintain an index over them so that the check does not turn into a full scan.
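
A minimal sketch of one such indexing scheme, assuming hash-prefix bucketing: the hashes are spread over bucket files keyed by their first byte, so a fallback check only reads roughly 1/256 of the data (class name and file layout are illustrative, not from the answer; HexFormat needs Java 17+):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HexFormat;

public class HashBuckets {
    private final Path dir;

    public HashBuckets(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // One bucket file per first byte of the hash.
    private Path bucketFor(byte[] hash) {
        return dir.resolve(String.format("%02x.txt", hash[0]));
    }

    public void add(byte[] hash) throws IOException {
        Files.writeString(bucketFor(hash),
                HexFormat.of().formatHex(hash) + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public boolean contains(byte[] hash) throws IOException {
        Path bucket = bucketFor(hash);
        if (!Files.exists(bucket)) {
            return false;
        }
        String hex = HexFormat.of().formatHex(hash);
        try (var lines = Files.lines(bucket)) {
            return lines.anyMatch(hex::equals);
        }
    }
}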


How many URLs are you expecting? Unless the number is genuinely enormous, a set of URL hashes may well still fit in memory.

Also, you will probably want to store more than just the fact that a URL was visited, e.g. the fetch status (404s, redirects, the time of the last fetch, the final URL, etc.).

Besides the visited URLs, you also need to keep the URLs that have been discovered but not yet crawled, so the crawler knows what to fetch next. A small in-memory sketch of both follows below.
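
A minimal in-memory sketch of that per-URL state plus the queue of not-yet-crawled URLs (class and field names are assumptions for illustration, not from the answer; a real crawler would persist this):

import java.time.Instant;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class CrawlState {
    // Per-URL metadata: more than just "visited".
    public record UrlRecord(String url, int httpStatus, Instant lastFetched) {}

    // Visited URLs with their metadata, plus the frontier of URLs still to crawl.
    private final Map<String, UrlRecord> visited = new HashMap<>();
    private final Queue<String> frontier = new ArrayDeque<>();

    public void markVisited(String url, int httpStatus) {
        visited.put(url, new UrlRecord(url, httpStatus, Instant.now()));
    }

    public void discover(String url) {
        // frontier.contains() is O(n); acceptable for a sketch only.
        if (!visited.containsKey(url) && !frontier.contains(url)) {
            frontier.add(url);
        }
    }

    public String nextToCrawl() {
        return frontier.poll();   // null when there is nothing left to fetch
    }
}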

As an example, node-nutch (https://www.npmjs.com/package/node-nutch) is a crawler for node that keeps this crawl state either on the local file system or in S3.


Source: https://habr.com/ru/post/1608788/

