Competitive price web crawler

I am thinking of writing an application that will spider/crawl competing websites to make sure our prices stay competitive, etc. I briefly looked at the Google Shopping Search API, but I felt it might lack flexibility, and not all of our competitors are fully listed there or updated regularly.

My question is, where is a good place to start with a PHP web crawler? I obviously want the crawling to be respectful (even toward our competitors), so it should obey robots.txt and throttle itself. (To be fair, I am thinking of hosting this on a third-party server and having it crawl our own websites as well, so it doesn't show any bias.) I searched Google and couldn't find any mature packages - just a handful of poorly written SourceForge scripts that haven't been maintained in over a year, even though they are still flagged as beta or alpha.

We are looking for ideas or suggestions. Thanks

+4
2 answers

The crawler itself is not that complicated. You simply load a page, then parse it and follow the links you find.
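
As a rough illustration (not a production crawler), here is a minimal sketch of that load-parse-follow loop in PHP using cURL and DOMDocument; the seed URL, user-agent string, and same-host restriction are assumptions made for this example:

<?php
// Minimal sketch of the load-and-follow loop using PHP's cURL and DOMDocument.
// The seed URL and user-agent are placeholders; error handling is kept minimal.

function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'PriceBot/0.1 (+http://www.example.com/bot)');
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

function extractLinks($html) {
    $links = array();
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                       // suppress warnings from sloppy markup
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0) {        // keep it simple: absolute links only
            $links[] = $href;
        }
    }
    return array_unique($links);
}

$queue   = array('http://www.example-competitor.com/');  // placeholder seed
$visited = array();

while ($queue) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = fetchPage($url);
    if ($html === false) {
        continue;
    }
    // ... evaluate the page here (e.g. look for product/price markup) ...

    foreach (extractLinks($html) as $link) {
        // stay on the same host so the crawl does not wander off-site
        if (parse_url($link, PHP_URL_HOST) === parse_url($url, PHP_URL_HOST)) {
            $queue[] = $link;
        }
    }
    sleep(1); // be polite: roughly one request per second
}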

What you can do to be “friendly” is to build a crawler for each site you plan to parse. In other words, pick one site and look at how it is structured. Code your GET requests and HTML parsing around that structure. Rinse and repeat for the other sites.
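
For the per-site parsing step, something like the sketch below is usually enough; the XPath expressions assume an invented page structure (products in a div with class "product", price in a span with class "price") and would need to be read off each competitor's real markup:

<?php
// Sketch of a per-site price extractor. The markup assumed here is invented:
// each product sits in <div class="product"> with the name in <h2> and the
// price in <span class="price">. Adjust the XPath to the real site structure.

function extractPrices($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $results = array();
    foreach ($xpath->query('//div[@class="product"]') as $node) {
        $name  = $xpath->evaluate('string(.//h2)', $node);
        $price = $xpath->evaluate('string(.//span[@class="price"])', $node);
        // normalise e.g. "$1,299.00" to 1299.00
        $results[trim($name)] = (float) preg_replace('/[^0-9.]/', '', $price);
    }
    return $results;
}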

If they use standard shopping cart software (anything is possible here), then obviously you get a bit of reuse.

When crawling, you probably want to avoid hitting their sites during their peak hours (this will involve some guesswork). Also, don't run 500 requests per second; throttle it back a bit.
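
The throttling can be as crude as sleeping between requests and only running during an assumed off-peak window; the 2 a.m. to 6 a.m. window and the 2-5 second delay below are guesses, not measurements:

<?php
// Crude politeness controls. The off-peak window (02:00-06:00 server time)
// and the per-request delay are assumptions; tune them per target site.

function waitForOffPeak() {
    while (true) {
        $hour = (int) date('G');
        if ($hour >= 2 && $hour < 6) {
            return;
        }
        sleep(600); // not off-peak yet, check again in ten minutes
    }
}

function politePause() {
    // roughly one request every 2-5 seconds instead of hundreds per second
    usleep(rand(2000000, 5000000));
}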

One additional thing you might even consider is contacting these other sites and seeing whether they would be interested in a direct data exchange. The ideal would be for everyone to publish an RSS feed of their products.
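
If anyone did agree to that, consuming a feed is trivial compared to scraping. A sketch, assuming a plain RSS 2.0 feed where each item's title holds the product name and the description holds the price (both assumptions):

<?php
// Sketch of consuming a hypothetical product RSS feed. It assumes each <item>
// carries the product name in <title> and the price in <description>.

$feed = @simplexml_load_file('http://www.example-competitor.com/products.rss');
if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        $name  = (string) $item->title;
        $price = (float) preg_replace('/[^0-9.]/', '', (string) $item->description);
        printf("%s => %.2f\n", $name, $price);
    }
}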

Of course, depending on what you sell, this could be considered price fixing... so proceed with caution.

+1

If you are just looking for an efficient crawler, you can use the one below. It can fetch about 10,000 web pages in 300 seconds on a good server. It is written in Python; a similar curl-based implementation is also possible in PHP, but keep in mind that PHP does not support multithreading, which is an important consideration for an efficient crawler.

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $
#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#        concurrent connections>]
#

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass


# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit


# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))


# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"


# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)


# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)


# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
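
For reference, the parallel-fetch part of the script above maps fairly directly onto PHP's curl_multi interface. A simplified sketch (the urls.txt input file and the doc_NNN.dat output naming are placeholders, and unlike the Python version it opens all connections at once):

<?php
// Simplified PHP counterpart of the pycurl multi example above, using
// curl_multi to fetch a list of URLs in parallel. File names are placeholders.

$urls = array_filter(array_map('trim', file('urls.txt')));
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 300);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until they are finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // sleep until there is activity on some handle
    }
} while ($running && $status == CURLM_OK);

// Collect results and clean up.
foreach ($handles as $i => $ch) {
    if (curl_errno($ch) === 0) {
        file_put_contents(sprintf('doc_%03d.dat', $i + 1), curl_multi_getcontent($ch));
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);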

If you are looking for a complete price comparison system, then you are really looking at a custom and fairly complex web project. If you do post a request for one somewhere, share the link here; and if you are interested in getting it built by a freelancer, you can contact me :)

0

Source: https://habr.com/ru/post/1336034/

