Ruby on Rails XPath Json Scraper Images

I am trying to clear images from websites. I have been using Nokogiri and XPath so far with limited success. For a typical website whose HTML has img and src, I can use:

tmp2 = Nokogiri::HTML(open(site_url)) tmp2.xpath("//img/@src").each do |src| ...do whatever end 

However, some sites, such as Amazon and eBay, only run certain images using javascript. If I look at the code, I can see the data in arrays. For example, from Amazon (source: http://www.amazon.com/Threads-Thought-Womens-Dreams-X-Small/dp/B00T46V758/ref = s = sr_1_5 clothes and amplifier, i.e. = UTF-8 & QID = 1433555447 & avg = 1-5 ):

  <script type="text/javascript"> P.when('jQuery', 'cf').execute(function($, cf){ P.load.js('http://z-ecx.images-amazon.com/images/G/01/browser-scripts/imageBlock-udp-airy/imageBlock-udp-airy-4060168860._V1_.js'); }); P.when('A', 'jQuery', 'ImageBlockATF', 'cf').register('ImageBlockBTF', function(A, $, imageBlockATF, cf){ var data = {"indexToColor":[],"burjImageBlock":0,"isSwatchHoverConsistent":1,"heroFocalPoint":null,"visualDimensions":["color_name"],"productGroupID":"apparel_display_on_website","newVideoMissing":0,"useIV":0,"useClickZoom":null,"useChildVideos":0,"numColors":7,"logMetrics":0,"defaultColor":"initial","airyConfig":{"enableContinuousPlay":null,"installFlashButtonText":"Install Flash Player","contentTitle":null,"autoplayCutOffTimeSeconds":null,"ageGate":{"monthNames":["January","February","March","April","May","June","July","August","September","October","November","December"],"deniedPrompt":"We're sorry. You are not old enough to watch this video.","submitText":"Submit","prompt":"This video is not intended for all audiences. What date were you born?"},"videoAds":null,"videoUnsupportedPrompt":"Sorry, this video is unsupported on this browser.","desiredMode":null,"swfUrl":"http://g-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/flash/AiryBasicRenderer._V304902271_.swf","isAutoplayEnabled":null,"installFlashPrompt":"Adobe Flash Player is required to watch this video.","isLiveStream":null,"regionCode":"NA","contentId":null,"playbackErrorPrompt":"Sorry, an error has occurred while attempting video playback. Please try again later.","contentMinAge":null,"isForesterTrackingDisabled":null,"streamingUrls":null,"parentId":null,"foresterMetadataParams":{"client":"Dpx","requestId":"1MX7VHFRVAS6TWY64BXC","marketplaceId":"ATVPDKIKX0DER","session":"182-9511970-7757812","method":"Apparel.ImageBlock"},"jsUrl":"http://z-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/js/airy.chromeless._V304902265_.js"},"mainImageMaxSizes":null,"staticStrings":{"playVideo":"Click to play video","rollOverToZoom":"Roll over image to zoom in","images":"Images","video":"video","clickToZoom":"Click on image to zoom in","touchToZoom":"Touch the image to zoom in","videos":"Videos","close":"Close","pleaseSelect":"Please select","clickToExpand":"Click to open expanded view","allMedia":"All Media"},"notThumbnailClickImmersiveView":1,"gIsNewTwister":1,"title":"Threads 4 Thought Women Tabitha Basic Tank Top","ivRepresentativeAsin":{"6":"B00T46V76W","4":"B00WM3O7ES","1":"B00T46YZES","3":"B00WM3NLPE","2":"B00T46VD16","5":"B00T46VGXQ"},"mainImageSizes":[[342,445],[385,500],[425,550],[466,606],[522,679]],"isQuickview":0,"ipadVideoSizes":[[340,444],[384,500]],"colorToAsin":{"Coral Dreams":{"asin":"B00T46V76W"},"Heather Grey":{"asin":"B00WM3NLPE"},"Black":{"asin":"B00T46YZES"},"White":{"asin":"B00T46VGXQ"},"Deep Blue Sea":{"asin":"B00T46VD16"},"Sea Glass":{"asin":"B00WM3O7ES"}},"thumbExperimentEnabledValue":1,"showLITBOnClick":0,"videoSizes":[[342,445],[384,500]],"stretchyGoodnessWidth":[1280,1440,1640,1800],"autoplayVideo":0,"hoverZoomIndicator":"","sitbReftag":"","useHoverZoom":1,"staticImages":{"zoomOut":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoom-out._V184888738_.bmp","hoverZoomIcon":"http://g-ecx.images-amazon.com/images/G/01/img11/apparel/UX/DP/icon_zoom._V138923886_.png","zoomIn":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoom-in._V184888790_.bmp","zoomLensBackground":"http://g-ecx.images-amazon.com/images/G/01/apparel/rcxgs/tile._V211431200_.gif","videoThumbIcon":"http://g-ecx.images-amazon.com/images/G/01/Quarterdeck/en_US/images/video._V183716339_SX38_SY50_CR,0,0,38,50_.gif","spinner":"http://g-ecx.images-amazon.com/images/G/01/ui/loadIndicators/loading-large_labeled._V192238949_.gif","zoomInCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomIn._V323082799_.cur","videoSWFPath":"http://g-ecx.images-amazon.com/images/G/01/Quarterdeck/en_US/video/20110518115040892/Video._V178668404_.swf","arrow":"http://g-ecx.images-amazon.com/images/G/01/javascripts/lib/popover/images/light/sprite-vertical-popover-arrow._V186877868_.png","zoomOutCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomOut._V323082798_.cur"},"videos":[],"gPreferChildVideos":0,"altsOnLeft":1,"ivImageSetKeys":{"Coral Dreams":"6","Heather Grey":"3","Black":"1","initial":0,"White":"5","Deep Blue Sea":"2","Sea Glass":"4"},"useHoverZoomIpad":"","isUDP":1,"alwaysIncludeVideo":0,"widths":[1280,1440,1640,1800],"maxAlts":7,"useChromelessVideoPlayer":1,"mainImageHeightPartitions":null}; data["customerImages"] = eval('[]'); data["colorImages"] = {"Coral Dreams":[{"large":"http://ecx.images-amazon.com/images/I/41FGlhksmtL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41FGlhksmtL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81iXQbkcpiL._UY500_.jpg":["385","500"]}},{"large":"http://ecx.images-amazon.com/images/I/41XR9o0cV-L.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41XR9o0cV-L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81bVmFiRu0L._UY550_.jpg":["423","550"]}}],"Heather Grey":[{"large":"http://ecx.images-amazon.com/images/I/41f-8R8Eu-L.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41f-8R8Eu-L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81dTYkBL%2BxL._UX342_.jpg":["342","445"]}},{"large":"http://ecx.images-amazon.com/images/I/41gLiFBbcdL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41gLiFBbcdL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81ua3AXCpJL._UX466_.jpg":["466","606"]}}],"Black":[{"large":"http://ecx.images-amazon.com/images/I/41BxSpfEM7L.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41BxSpfEM7L._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81%2BTW8762BL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/41Gf%2BW-cPTL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41Gf%2BW-cPTL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81SJwuaCspL._UY550_.jpg":["423","550"]}}],"White":[{"large":"http://ecx.images-amazon.com/images/I/41tElK2wPKL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41tElK2wPKL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81kKgU75rIL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/31lEDIs4cqL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/31lEDIs4cqL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81OBgvbUR7L._UY550_.jpg":["423","550"]}}],"Deep Blue Sea":[{"large":"http://ecx.images-amazon.com/images/I/41oNq3KmSGL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41oNq3KmSGL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UY550_.jpg":["423","550"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81MtZtmxVLL._UX466_.jpg":["466","606"]}},{"large":"http://ecx.images-amazon.com/images/I/41AJgd1OuYL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41AJgd1OuYL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81uLEksrYFL._UY550_.jpg":["423","550"]}}],"Sea Glass":[{"large":"http://ecx.images-amazon.com/images/I/418vg-re8oL.jpg","variant":"MAIN","hiRes":"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/418vg-re8oL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/81YgtD-bEwL._UY550_.jpg":["423","550"]}},{"large":"http://ecx.images-amazon.com/images/I/41lcpC41VSL.jpg","variant":"BACK","hiRes":"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UL1500_.jpg","thumb":"http://ecx.images-amazon.com/images/I/41lcpC41VSL._SR38,50_.jpg","main":{"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UY500_.jpg":["385","500"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX342_.jpg":["342","445"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX522_.jpg":["522","679"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UX466_.jpg":["466","606"],"http://ecx.images-amazon.com/images/I/814%2B6ZLwIxL._UY550_.jpg":["423","550"]}}]}; data["heroImage"] = {}; data["landingAsinColor"] = 'Coral Dreams'; data["shouldApplyResizeFix"] = false; return data; }); </script> 

The names of the files I want to capture do not have src (ie http://ecx.images-amazon.com/images/I/81%2BTW8762BL.UY500.jpg ) In this case, the array is called "data [" colorImages "]

But ... I can’t record anything here because the same thing happens on eBay ... for example: http://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing- Short-Sleeve-Loose-T-Shirt-Blouse- / 351411949784? Pt = LH_DefaultDomain_0 & var = & hash = item51d1c8d0d8

The file names I need are in "enImgCarousel"

On the side of the note, when I use the following javascript bookmarklet for each URL to receive the images, I can get the correct images:

 a=''; for (b=0;b<document.images.length;b++){ a+='<img src='+document.images[b].src+'><br>'}; ifa=''){ document.writea+'</center>'); void(document.close()) }else{ alert('No images!') } 

Back to Nokogiri and XPath, I also tried:

 tmp2.xpath("//img").each do |src|... 

and

  tmp2.xpath("html//img").each do |src| 

Any ideas how I should do this or which direction to go?

+6
source share
3 answers

This is an alternative way to decide what you would like to achieve; you can use capybara and poltergeist and i got the one you want to get. Therefore, I assume that you do not need to dive into javascript with this solution.

If you scrape off, I recommend that you consider capybara with a poltergeist, you can find many sources for reference.

In the future, this is the code I tried. Hope this helps!

 require 'capybara' require 'capybara/dsl' require 'capybara/poltergeist' Capybara.register_driver :poltergeist_debug do |app| Capybara::Poltergeist::Driver.new(app, inspector: true) end Capybara.javascript_driver = :poltergeist_debug Capybara.current_driver = :poltergeist_debug # Amazon Case visit_site('https://www.amazon.com/dp/B00T46V758/?tag=stackoverfl08-20') doc_amazon = Nokogiri::HTML.parse(page.html) doc_amazon.xpath("//img/@src").each do |src| p src.value end #ebay case visit_site('https://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing-Short-Sleeve-Loose-T-Shirt-Blouse-/351411949784?pt=LH_DefaultDomain_0&var=&hash=item51d1c8d0d8') doc_ebay = Nokogiri::HTML.parse(page.html) doc_ebay.xpath("//img/@src").each do |src| p src.value end 

if you want to delve into it ... (it seems you do not need to)

 doc.xpath("//div[@id='imgTagWrapperId']/img").attribute('src').value # => "https://images-na.ssl-images-amazon.com/images/I/81%2BTW8762BL._UX453_.jpg" doc.xpath("//div[@id='mainImgHldr']/img[@id='icImg']").attribute('src').value # => "https://i.ebayimg.com/images/g/dtAAAOSwpdpVZuU~/s-l300.jpg" 
0
source

Sorry, since I am sending a response from a mobile phone, I cannot immediately write the full code, however, I can give you a way. You should use Mechanize with selenium-webdriver and watir instead of Nokogiri.

Using Mechanize, you can handle elements that come from JavaScript. You can make fun of the actual moves in the browser, that is, you can code to click on the links / buttons, you can wait for the image to load, and then you can clear it. And all this can be done with Mechanize very easily.

-1
source

Are you trying to create a database of competitors' products with price, etc.?
Are you trying to capture entire categories or individual sellers? The reason I'm asking is that you can get the RSS feed of the items that are listed by each seller if they have enabled this feature. This way, you don’t have to waste time cleaning the page when you can get the central data from the RSS feed.

When analyzing web pages, depending on where you are on the web page (you mentioned the carousel), the indexes you come across are taken from a list of thumbnails representing large images.
I recommend looking at the eBay and Amazon APIs and finding RSS feeds for sellers first.

As for going through any issues with Javascript, the web page loads rotating slide shows and carousels dynamically, so you have to use Mechanize (as suggested by RAJ above) or Beautiful Soup or Selenium to get fully displayed web pages in which all images are sparse.

Feel free to post your source if there is anything else I can help with.

-1
source

Source: https://habr.com/ru/post/988616/


All Articles