How to speed up crawling in Nutch

I am trying to develop an application in which I give Nutch a limited set of URLs in a urls file. I can crawl these URLs and retrieve their contents by reading the data from the segments.

I crawl with a depth of 1, as I am not interested in outlinks or inlinks of the pages. I only need the contents of the pages listed in the urls file.
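A depth-1 run like that boils down to the standard Nutch 1.x command-line steps, sketched here with placeholder directory names (crawl/ and dump_output are just examples):

# One depth-1 round with the Nutch 1.x CLI; no indexing step, paths are placeholders.
bin/nutch inject crawl/crawldb urls              # seed the crawldb from the urls file
bin/nutch generate crawl/crawldb crawl/segments  # build a single fetch list (one segment)
SEGMENT=$(ls -d crawl/segments/* | tail -1)      # pick the segment that was just generated
bin/nutch fetch "$SEGMENT"                       # download the pages
bin/nutch parse "$SEGMENT"                       # parse out the page contents
bin/nutch readseg -dump "$SEGMENT" dump_output   # dump the segment as plain text for reading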

But this crawl takes a long time, so please suggest a way to reduce the crawl time and increase the crawl speed. I also don't need indexing, because I'm not interested in the search part.

Does anyone have any suggestions on how to speed up the crawl?

+3
6 answers

In nutch-site.xml, raise the number of fetcher threads allowed per queue:

<property>
<name>fetcher.threads.per.queue</name>
   <value>50</value>
   <description></description>
</property>
+7

In nutch-site.xml, tune fetcher.threads.per.host and fetcher.threads.fetch. Raising them lets more fetches run in parallel, which is usually what limits a small, fixed URL list.
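The shipped defaults for these (and the other properties mentioned on this page) live in conf/nutch-default.xml; it can be worth checking them before copying an override into nutch-site.xml. A quick way to do that, assuming the usual Nutch layout:

# Show the current defaults (value plus the first lines of the description)
grep -A 3 'fetcher.threads.fetch' conf/nutch-default.xml
grep -A 3 'fetcher.threads.per.host' conf/nutch-default.xml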

+6

Start with this property:

 <property>
  <name>generate.max.count</name>
  <value>50</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
 </property>

Also look at fetcher.max.crawl.delay: if a site's robots.txt declares a Crawl-Delay longer than this value (in seconds), Nutch skips its pages instead of waiting. Tune it together with generate.max.count.

Another property worth checking:

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads less
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

And then, as already mentioned, there is fetcher.threads.per.queue as well...

+4

I had the same problem; these are the settings in nutch-site.xml that made the difference for me:

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
 <description>The number of seconds the fetcher will delay between 
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if 
   fetcher.threads.per.queue is set to 1.
 </description>

</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>


<property>
  <name>fetcher.threads.per.host</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

+1

If all you need is the raw content of a fixed list of URLs, you may not need Nutch at all: a simple HTTP script around curl can download them directly.
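For a small, fixed list, something like the sketch below is enough (pages/ is an arbitrary output directory; every URL from the urls file is fetched in parallel, which is fine when the list is short):

# Fetch every URL listed in the urls file with curl, all in parallel.
mkdir -p pages
i=0
while read -r url; do
  i=$((i + 1))
  curl -sL "$url" -o "pages/page_$i.html" &   # background each download
done < urls
wait                                          # block until all downloads finish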

0

I had similar issues and was able to improve speed with the advice at https://wiki.apache.org/nutch/OptimizingCrawls

It has useful information about what can slow your crawl down and what you can do to improve each of those issues.

Unfortunately, in my case the queues are rather unbalanced, and I cannot hit the larger ones much faster without getting blocked, so I will probably need to look into a cluster or Tor before adding more threads helps.

0
