I'm having trouble figuring out how to store and look up a large number of visited URLs from a web crawler. The number of visited URLs will eventually be too large to hold in memory, so I have to store them in a file, but I'm worried this becomes very inefficient: every time I receive a batch of URLs and want to check whether a URL has already been visited, I have to scan the visited file sequentially and look for a match.
I was thinking about using a cache, but the problem remains whenever a URL is not in the cache: I still have to check the file. Do I really have to scan the file sequentially for each URL, or is there a better / more efficient way to do this?
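Roughly, this is what I have in mind right now (just a minimal sketch; `visited.txt` and the function names are placeholders I made up, not actual code from my crawler):

```python
# In-memory cache of recently seen URLs, backed by a plain text file
# with one visited URL per line.

visited_cache = set()

def is_visited(url, path="visited.txt"):
    # Fast path: URL is already in the in-memory cache.
    if url in visited_cache:
        return True
    # Slow path: scan the whole file line by line looking for a match.
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip() == url:
                    visited_cache.add(url)  # remember it for next time
                    return True
    except FileNotFoundError:
        pass
    return False

def mark_visited(url, path="visited.txt"):
    # Append the URL to the file and cache it in memory.
    if not is_visited(url):
        with open(path, "a", encoding="utf-8") as f:
            f.write(url + "\n")
        visited_cache.add(url)
```

The slow path is what worries me, since it rereads the whole file for every cache miss.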