Nutch - how to crawl in small batches?

I can't get Nutch to crawl in small batches. I run it with the bin/nutch crawl command with the -depth 7 and -topN 10000 options, and it never finishes. It stops only when my hard drive is full. What I need to do:

  • Start crawling from my seeds, with the ability to follow outlinks further.
  • Crawl 20,000 pages, then index them.
  • Crawl another 20,000 pages, index them and merge the result with the first index.
  • Repeat step 3 n times.

I also tried the scripts found in the wiki, but none of them go any further. If I run them again, they do everything from the very beginning, and at the end of the script I have the same index I had when I started crawling. But I need to continue my crawl.

1 answer

You need to understand Nutch's generate / fetch / update cycle.

The generate step of the cycle takes URLs from the crawl db (you can set a maximum number with the topN parameter) and generates a new segment. Initially, the crawl db contains only the seed URLs.

The fetch step does the actual crawling. The content of the fetched pages is stored in the segment.

Finally, the update step updates the crawl db with the results of the fetch (adds new URLs, sets the last fetch time for each URL, sets the HTTP status code of the fetch for each URL, etc.).

The crawl tool runs this cycle n times, where n is set with the depth parameter.
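To make the cycle concrete, here is a toy Python model of it. It is only a sketch and not Nutch code: the link graph, the function names and the in-memory dictionary standing in for the crawl db are all made up for illustration, and the real generate step ranks URLs by score rather than taking them in insertion order.

```python
# Hypothetical link graph standing in for the web: page -> outlinks.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
    "c": [],
    "d": ["f"],
    "e": [],
    "f": [],
}


def generate(crawl_db, top_n):
    """Select up to top_n not-yet-fetched URLs from the crawl db as a new segment."""
    unfetched = [url for url, fetched in crawl_db.items() if not fetched]
    return unfetched[:top_n]


def fetch(segment):
    """'Fetch' each URL in the segment; here that just looks up its outlinks."""
    return {url: LINKS.get(url, []) for url in segment}


def update(crawl_db, fetched):
    """Mark fetched URLs as done and add newly discovered URLs as unfetched."""
    for url, outlinks in fetched.items():
        crawl_db[url] = True
        for link in outlinks:
            crawl_db.setdefault(link, False)


def crawl(seeds, depth, top_n):
    """Run the generate / fetch / update cycle up to depth times."""
    crawl_db = {url: False for url in seeds}  # initially only the seeds
    segments = []
    for _ in range(depth):
        segment = generate(crawl_db, top_n)
        if not segment:
            break  # nothing left to fetch
        update(crawl_db, fetch(segment))
        segments.append(segment)
    return crawl_db, segments


crawl_db, segments = crawl(["seed"], depth=3, top_n=2)
print(segments)  # [['seed'], ['a', 'b'], ['c', 'd']]
```

With top_n=2 each round fetches at most two pages, so after three rounds "e" and "f" are known to the crawl db but still unfetched. A further round would pick them up, which is how the crawl keeps growing for as long as the cycle is repeated.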

Once all cycles are complete, the crawl tool deletes all indexes in the folder from which it was launched and creates a new one from all the segments and the crawl db.

So, to do what you are asking, you probably shouldn't use the crawl tool, but instead invoke the individual Nutch commands, which is what the crawl tool does behind the scenes. That way you can control how many times you crawl, and also make sure that the indexes are always merged rather than deleted at each iteration.
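As a rough sketch of that approach, the Python wrapper below drives one batch at a time through the individual bin/nutch commands. It makes several assumptions: a Nutch 1.x style command line (generate with -topN, fetch, updatedb), a crawl db that has already been seeded once with bin/nutch inject, and made-up helper names (list_segments, run_batch, crawl_in_batches). Indexing and merging are left as a caller-supplied hook because those commands differ between Nutch versions. Depending on version and configuration, a separate parse step may also be needed between fetch and updatedb.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # launcher path as used in the question


def list_segments(segments_dir):
    """Return the set of segment directory names, or an empty set if none exist yet."""
    if not os.path.isdir(segments_dir):
        return set()
    return set(os.listdir(segments_dir))


def run_batch(crawl_db, segments_dir, top_n, run=subprocess.call):
    """Run one generate / fetch / update round.

    Returns the path of the new segment, or None if generate produced nothing.
    Error handling is omitted: exit codes returned by run are ignored.
    """
    before = list_segments(segments_dir)
    run([NUTCH, "generate", crawl_db, segments_dir, "-topN", str(top_n)])
    new = sorted(list_segments(segments_dir) - before)
    if not new:
        return None  # no URLs were due for fetching
    segment = os.path.join(segments_dir, new[-1])
    run([NUTCH, "fetch", segment])
    run([NUTCH, "updatedb", crawl_db, segment])
    return segment


def crawl_in_batches(crawl_db, segments_dir, batches, top_n,
                     index_batch=None, run=subprocess.call):
    """Crawl in several batches, calling index_batch(segment) after each one."""
    segments = []
    for _ in range(batches):
        segment = run_batch(crawl_db, segments_dir, top_n, run=run)
        if segment is None:
            break
        segments.append(segment)
        if index_batch is not None:
            index_batch(segment)  # hypothetical hook: index and merge here
    return segments


# Example call (paths are assumptions, not run here):
# crawl_in_batches("crawl/crawldb", "crawl/segments", batches=5, top_n=20000)
```

Because the crawl db persists between calls, running the function again continues from where the previous run stopped instead of starting over from the seeds.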

I suggest you start with the script defined here and adapt it to your needs.


Source: https://habr.com/ru/post/1305448/

