You need to understand the Nutch generate/fetch/update cycle.
The generate step of the cycle takes the top URLs (you can cap the number with the topN parameter) from the crawl db and generates a new segment. Initially, the crawl db contains only the seed URLs.
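For example, with a typical Nutch 1.x directory layout (the crawl/crawldb and crawl/segments paths below are just placeholders), the generate step looks roughly like this:

    # take the top 1000 URLs from the crawl db and create a new segment under crawl/segments
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000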
The fetch step does the actual fetching. The actual content of the pages is stored in the segment.
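A minimal sketch of the fetch step, assuming the segment just created by generate is the newest directory under crawl/segments:

    # pick the most recently generated segment and fetch its pages
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch $SEGMENT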
Finally, the update step updates the crawl db with the results of the fetch (adds new URLs, sets the last fetch time for each URL, sets the http status code of the fetch for each URL, etc.).
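The corresponding update for that segment would be something like:

    # merge the fetch results back into the crawl db (new URLs, fetch times, status codes)
    bin/nutch updatedb crawl/crawldb $SEGMENT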
The crawl tool runs this cycle n times, configurable with the depth parameter.
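For comparison, a typical crawl tool invocation (flags may vary by Nutch version) looks like:

    # run the generate/fetch/update cycle 3 times, 1000 URLs per cycle
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000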
Upon completion of all the cycles, the crawl tool deletes all indexes in the directory from which it is launched and creates a new one from all the segments.
So, in order to do what you're asking, you should probably not use the crawl tool, but instead call the individual Nutch commands that the crawl tool runs behind the scenes. That way you can control how many times you crawl, and also make sure that the indexes are always merged and never deleted at each iteration.
I suggest you start from the script defined here and modify it to your needs.
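As a starting point, here is a rough sketch of what such a script could look like for an old Lucene-based Nutch 1.x setup; the directory names, number of iterations, and exact indexing commands are assumptions you will need to adapt to your version:

    #!/bin/bash
    # inject the seed URLs into the crawl db once
    bin/nutch inject crawl/crawldb urls

    # run the generate/fetch/update cycle as many times as you want
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      SEGMENT=$(ls -d crawl/segments/* | tail -1)
      bin/nutch fetch $SEGMENT
      bin/nutch updatedb crawl/crawldb $SEGMENT
    done

    # build the link db and index all segments once at the end, instead of
    # letting the crawl tool delete and recreate the index on every run
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*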