How to make Nutch crawl the file system?

Question

How can I make Nutch crawl the file system?

That is, not over HTTP (like http://localhost:81 etc.), but by directly scanning a specific directory on the local file system.

Is there a way to do this?

+4

filesystems web-crawler nutch

omg Jun 2 '09 at 19:44


2 answers

Robert Nickens · Answer 1 · 2009-07-12T03:39:23+0000

From the Nutch Wiki:

How to index the local file system?

http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6

1) crawl-urlfilter.txt needs to be changed to allow file: URLs while not following http: ones; otherwise it will either not index anything, or it will jump off your disk onto websites. Change this line:

  -^(file|ftp|mailto|https):

to this:

  -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom that reject some URLs. If it ends with this fragment, it is probably fine:

  # accept anything else
  +.*

3) I modified my nutch.xml to include the following:

 <Parameter override="false" name="plugin.includes" value="protocol-file|protocol-http|urlfilter-regex|parse-(msword|pdf|text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"/>
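To see what the filter change in steps 1 and 2 does, here is a minimal Python sketch of how such a rule list is evaluated. It assumes that the first matching rule decides and that a URL matching no rule is dropped, which is how I understand the regex URL filter to behave. The function name accepts and the sample URLs are made up for illustration; this is not Nutch code.

```python
import re

# Rules in file order; '-' rejects, '+' accepts (assumed semantics).
ORIGINAL_RULES = [
    ("-", r"^(file|ftp|mailto|https):"),
    ("+", r".*"),
]
EDITED_RULES = [
    ("-", r"^(http|ftp|mailto|https):"),
    ("+", r".*"),
]

def accepts(url, rules):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped

if __name__ == "__main__":
    # Hypothetical sample URLs, one local and one over HTTP.
    for url in ("file:///home/user/docs/report.pdf", "http://localhost:81/"):
        print(url, accepts(url, ORIGINAL_RULES), accepts(url, EDITED_RULES))
```

With the original rules the file: URL is rejected and the http: URL passes; with the edited rules it is the other way round, so the crawl stays on the local disk.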

Sumit ghosh · Answer 2 · 2009-06-12T18:25:53+0000

Nutch has intranet crawling. You can read the details here
