How to get an image file using Scrapy

I just started using Scrapy, and I'm trying to scrape an image file. Here is my code.

items.py

    from scrapy.item import Item, Field

    class TutorialItem(Item):
        image_urls = Field()
        images = Field()
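(These two field names are the ones the stock ImagesPipeline expects: it reads image_urls and writes the download results into images, so this part looks right.)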

settings.py

    BOT_NAME = 'tutorial'

    SPIDER_MODULES = ['tutorial.spiders']
    NEWSPIDER_MODULE = 'tutorial.spiders'

    ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
    IMAGE_STORE = '/Users/rnd/Desktop/Scrapy-0.16.5/tutorial/images'
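Aside from the missing-PIL issue below, note the setting name: the Scrapy docs spell the storage setting IMAGES_STORE, not IMAGE_STORE, so with the file above the pipeline would not know where to save images even once PIL is fixed. A minimal sketch of the documented configuration:

    # settings.py sketch, per the Scrapy 0.16 images-pipeline docs
    ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
    IMAGES_STORE = '/Users/rnd/Desktop/Scrapy-0.16.5/tutorial/images'  # note the S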

pipelines.py

    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.exceptions import DropItem
    from scrapy.http import Request

    class TutorialPipeline(object):

        def process_item(self, item, spider):
            return item

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield Request(image_url)
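One thing worth knowing about this file: Scrapy only calls get_media_requests() on pipelines that inherit from a media pipeline, so defining it on a plain object subclass has no effect. If the goal is to customize the image requests, a sketch of that would subclass ImagesPipeline itself (TutorialImagesPipeline is a made-up name here, and it would need to be listed in ITEM_PIPELINES in place of the stock class):

    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.http import Request

    class TutorialImagesPipeline(ImagesPipeline):
        # called by the media-pipeline machinery for each scraped item
        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield Request(image_url)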

tutorial_spider.py

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item
    from tutorial.items import TutorialItem

    class TutorialSpider(BaseSpider):
        name = "tutorial"
        allowed_domains = ["roxie.com"]
        start_urls = ["http://www.roxie.com/events/details.cfm?eventid=581D228B%2DB338%2DF449%2DBD69027D7D878A7F"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            item = TutorialItem()
            link = hxs.select('//div[@id="eventdescription"]//img/@src').extract()
            item['image_urls'] = ["http://www.roxie.com" + link]
            return item
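There is also a latent bug here that will surface once the pipeline loads: extract() returns a list of strings, so "http://www.roxie.com" + link raises a TypeError. A sketch of a fixed parse(), assuming the same names as in the spider above (urljoin also copes with src values that are already absolute):

    from urlparse import urljoin  # Python 2 stdlib

    # drop-in replacement for TutorialSpider.parse above
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        links = hxs.select('//div[@id="eventdescription"]//img/@src').extract()
        item['image_urls'] = [urljoin("http://www.roxie.com", link) for link in links]
        return item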

Console log (command: scrapy crawl tutorial -o roxie.json -t json)

    2013-06-19 17:29:06-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
    /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/web/microdom.py:181: SyntaxWarning: assertion is always true, perhaps remove parentheses?
      assert (oldChild.parentNode is self,
    2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    Traceback (most recent call last):
      File "/usr/local/bin/scrapy", line 5, in <module>
        pkg_resources.run_script('Scrapy==0.16.5', 'scrapy')
      File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 489, in run_script
      File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 1207, in run_script
        # we assume here that our metadata may be nested inside a "basket"
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
        execute()
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 131, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 76, in _run_print_help
        func(*a, **kw)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
        cmd.run(args, opts)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/commands/crawl.py", line 43, in run
        spider = self.crawler.spiders.create(spname, **opts.spargs)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/command.py", line 33, in crawler
        self._crawler.configure()
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/crawler.py", line 41, in configure
        self.engine = ExecutionEngine(self, self._spider_closed)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/engine.py", line 63, in __init__
        self.scraper = Scraper(crawler)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/scraper.py", line 66, in __init__
        self.itemproc = itemproc_cls.from_crawler(crawler)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 50, in from_crawler
        return cls.from_settings(crawler.settings, crawler)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 29, in from_settings
        mwcls = load_object(clspath)
      File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/utils/misc.py", line 39, in load_object
        raise ImportError, "Error loading object '%s': %s" % (path, e)
    ImportError: Error loading object 'scrapy.contrib.pipeline.images.ImagesPipeline': No module named PIL
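The key line is the last one: Scrapy fails while loading the ImagesPipeline class itself, before any crawling starts, because the pipeline imports PIL at import time. A quick diagnostic sketch to see what the interpreter running Scrapy can actually import (run it with the same Python 2.6 shown in the traceback paths):

    # which flavor of PIL import works for this interpreter?
    try:
        from PIL import Image            # the import Scrapy is failing on
        print "PIL package import OK:", Image.VERSION
    except ImportError:
        try:
            import Image                 # old-style top-level PIL import
            print "only top-level 'import Image' works:", Image.VERSION
        except ImportError:
            print "no PIL at all"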

It looked like PIL was needed, so I installed it:

    PIL 1.1.7 is already the active version in easy-install.pth
    Installing pilconvert.py script to /usr/local/bin
    Installing pildriver.py script to /usr/local/bin
    Installing pilfile.py script to /usr/local/bin
    Installing pilfont.py script to /usr/local/bin
    Installing pilprint.py script to /usr/local/bin
    Using /Library/Python/2.6/site-packages/PIL-1.1.7-py2.6-macosx-10.6-universal.egg
    Processing dependencies for pil
    Finished processing dependencies for pil
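Note that easy_install reports classic PIL 1.1.7 installed as an egg under /Library/Python/2.6/site-packages. Depending on how that egg was built, classic PIL sometimes exposes only the old top-level modules (import Image) rather than a PIL package, in which case from PIL import Image, and therefore Scrapy, keeps failing with "No module named PIL" exactly as above; the diagnostic snippet earlier distinguishes the two cases.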

However, it still does not work. Could you tell me what I missed? Thanks in advance!

1 answer

Yes, I had the same problem when I started scraping photos from some sites. I was working on CentOS 6.5 with Python 2.7.6, and I solved it as follows:

yum install python-setuptools (the CentOS package that provides easy_install)

easy_install pip

Then log in as the root user and run pip install image. It just worked.

If you are working on Ubuntu, I think the trick is similar: sudo apt-get install python-setuptools (to get easy_install), and the rest should be the same.
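For what it's worth, the image package on PyPI appears to work here simply because it depends on Pillow, and Pillow installs itself under the PIL package name that Scrapy imports; installing it directly with pip install Pillow should have the same effect. Either way, the from PIL import Image check shown earlier should succeed afterwards.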

I hope this will be helpful.

