Nutch 2.x does not crawl sites like Flipkart and Jabong

I ran some Nutch experiments crawling sites that don't make any Ajax calls, and I got all the data.

I followed these steps to get the data:

  • user@localhost:~/sample/nutch/runtime/local/bin$ ./nutch inject /path/to/the/seed.txt
  • $ ./nutch generate -batchId 321
  • $ ./nutch fetch 321
  • $ ./nutch parse 321
  • $ ./nutch updatedb
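
For repeated experiments, the five steps above can be wrapped in a small shell function. This is only a sketch: the function name `run_cycle` and the `NUTCH` variable are made up for illustration, and the subcommands and arguments are exactly the ones listed above, so they may need adjusting for other Nutch 2.x releases.

```shell
#!/bin/sh
# Path to the nutch launcher script; assumed to be ./nutch when run from runtime/local/bin.
NUTCH="${NUTCH:-./nutch}"

# run_cycle SEED_FILE BATCH_ID  (hypothetical helper, not part of Nutch)
# Runs one inject/generate/fetch/parse/updatedb cycle and stops at the first failing step.
run_cycle() {
  seed="$1"
  batch="$2"
  "$NUTCH" inject "$seed" &&
  "$NUTCH" generate -batchId "$batch" &&
  "$NUTCH" fetch "$batch" &&
  "$NUTCH" parse "$batch" &&
  "$NUTCH" updatedb
}
```

It would be called as `run_cycle /path/to/the/seed.txt 321`. Because the steps are chained with `&&`, a failing step stops the cycle and its exit status is returned, which makes it easier to see which stage produced no data.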

I use HBase as the data store, which keeps its files on HDFS. If I follow these 5 steps with the URL http://www.naaptol.com/brands/nokia/mobile-phones.html, I get all the data, but if I change it to http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes, I get nothing.

My nutch-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>storage.data.store.class</name>
                <value>org.apache.gora.hbase.store.HBaseStore</value>
                <description>Default class for storing data</description>
        </property>
        <property>
                <name>http.agent.name</name>
                <value>com.datametica.agent</value>
                <description>this is just an agent name</description>
        </property>
        <property>
                <name>http.robots.agents</name>
                <value>datametica_robot</value>
                <description>this is just a robot</description>
        </property>
        <property>
                <name>plugin.folders</name>
                <value>/home/sachin/source_codes/svn/nutch/nutch_2.x/build/plugins</value>
        </property>
</configuration>
1 answer

regex-urlfilter.txt blocks URLs that contain query parameters:

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Modify this file so that URLs with query parameters are accepted:

# skip URLs containing certain characters as probable queries, etc.

-[*!@]
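
To see why this change matters for the Flipkart URL in the question, the two character classes can be tried against it outside Nutch. The sketch below uses `grep -E` as a stand-in for Nutch's regex matching and models only this single exclusion rule, not the rest of the filter file; the helper name `is_skipped` is made up for illustration.

```shell
#!/bin/sh
# Character classes taken from the two versions of the rule (without the leading "-").
default_rule='[?*!@=]'
relaxed_rule='[*!@]'

# is_skipped RULE URL  (hypothetical helper)
# Succeeds (exit status 0) if the URL contains any character from the rule's class,
# i.e. if an exclusion rule with that class would reject the URL.
is_skipped() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# URLs from the question.
flipkart_url='http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes'
naaptol_url='http://www.naaptol.com/brands/nokia/mobile-phones.html'
```

With these definitions, `is_skipped "$default_rule" "$flipkart_url"` succeeds because the URL contains `?` and `=`, while `is_skipped "$relaxed_rule" "$flipkart_url"` fails, so the URL is no longer rejected by this rule. The Naaptol URL contains none of these characters, which is consistent with it being crawled under the default filter.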

Nutch also probably lacks support for crawling Ajax pages; see https://issues.apache.org/jira/browse/NUTCH-1323


Source: https://habr.com/ru/post/1548235/

