How to scan .pdf links using Apache Nutch

Question

I am crawling a site that includes some links to PDF files. I want Nutch to crawl these links and dump the documents as .pdf files. I am using Apache Nutch 1.6, and I run it from Java as follows:

    ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
    SegmentReader.main(tokenize(dumpArg));

Can someone help me with this?

+4

apache hadoop nutch

sudheer Jul 03 '13 at 7:25

2 answers

nimeshjm · Answer 1 · 2013-10-12T15:06:07+0000

If you want Nutch to crawl and index your PDF documents, you need to enable document crawling and the Tika plugin:

Document crawling

1.1 Edit regex-urlfilter.txt and remove every occurrence of "pdf", so that the suffix rule looks like this:

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
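To see why removing "pdf" from this rule matters, here is a minimal Java sketch that applies the same expression to two made-up URLs. It assumes the rule is applied with "find" semantics (a match anywhere in the URL rejects it); the class name, the helper `passes`, and the example.com URLs are hypothetical and not part of Nutch. It models only this one exclusion rule; in a real regex-urlfilter.txt a URL must still be accepted by a later "+" rule.

```java
import java.util.regex.Pattern;

public class SuffixRuleSketch {
    // The exclusion rule from regex-urlfilter.txt, without the leading "-".
    // It contains no "pdf" entry, so this rule does not reject PDF URLs.
    public static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
        + "|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE"
        + "|jpeg|JPEG|bmp|BMP|js|JS)$");

    // Hypothetical helper: true when the URL is NOT rejected by the skip rule.
    // Assumption: the filter rejects a URL if the pattern is found anywhere in it.
    public static boolean passes(String url) {
        return !SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/docs/report.pdf", // made-up PDF link
            "http://example.com/img/logo.png"     // made-up image link
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (passes(url) ? "kept" : "skipped"));
        }
    }
}
```

If "pdf|PDF" were still present in the alternation, the first URL would be skipped as well, and Nutch would never fetch the document.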

1.2 Edit suffix-urlfilter.txt and remove every occurrence of "pdf"

1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>
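Since plugin.includes is a regular expression over plugin directory names, a quick way to check which plugins a given value enables is to test names against it. The Java sketch below does that for the value above. It assumes a plugin is enabled only when its whole name matches the expression; the class and the helper `isIncluded` are hypothetical, and "parse-pdf" is used purely as an example of a name the value does not cover.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // Value of plugin.includes copied from the property above.
    public static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Hypothetical helper.
    // Assumption: a plugin is enabled when its entire name matches the expression.
    public static boolean isIncluded(String pluginName) {
        return Pattern.matches(INCLUDES, pluginName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"parse-tika", "parse-html", "parse-pdf"}) {
            System.out.println(name + " included: " + isIncluded(name));
        }
    }
}
```

With this value, parse-tika and parse-html are matched, so PDF parsing is delegated to Tika rather than to a dedicated PDF plugin.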

If what you really want is to download all the PDF files from a page, you can use something like Teleport on Windows or Wget on *nix.

olzhas · Answer 2 · 2013-10-10T06:41:22+0000

You can either write your own plugin for the PDF MIME type, or use the built-in Apache Tika parser, which can extract text from PDF files.
