Web scraper: expand / contract the bounding box depending on the results

A client wants to know where their competitor's stores are located, so I'm being quasi-evil and scraping the competitor's website.

The server accepts a bounding box (i.e., the coordinates of the lower-left and upper-right corners) as parameters and returns the locations found inside it. This part works fine: given a bounding box, I can successfully retrieve the store locations.

The problem is that only the first 10 locations in the bounding box are returned, so in populated areas a 10-degree bounding box contains more stores than the server will return.

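What I want is to shrink the search cell whenever 10 stores come back (since more than 10 may actually exist there), and then return to the larger cell size for the next search cell.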

That is, something like:

stores = checkForStores(bounding_box)
if len(stores) >= 10:
    # There are too many stores. Search again with a smaller bounding box
    pass
else:
    # Everything is good - process these stores
    pass

I don't need help with checkForStores itself.

What I'm unsure about is how to structure the code, for example around for loops like these:

cellsize = 10
for minLat in range(-40, -10, cellsize):
    for minLng in range(110, 150, cellsize):
        maxLat = minLat + cellsize
        maxLng = minLng + cellsize

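... so that the cell size is reduced whenever 10 stores are found. Perhaps this needs a while loop, but I don't see how to set it up.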


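This can be solved with recursion: query the bounding box and, if 10 or more stores come back, split the box into smaller boxes and run the same check on each of them. If fewer than 10 stores come back, the box is fully covered and its results can be saved.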

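A note on recursion depth: every split divides each axis by cell_axis_reduction_factor, so the depth grows only logarithmically with the size of the area. For example, for a 40 000 x 40 000 area it takes about 15 levels of splitting to get down to 1 x 1 cells with cell_axis_reduction_factor=2: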

In [1]: import math

In [2]: math.log(40000, 2)
Out[2]: 15.287712379549449

If that is too deep, the depth can be reduced by increasing cell_axis_reduction_factor.
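To make the trade-off concrete, here is a small sketch that counts the splits needed for a few values of cell_axis_reduction_factor. The helper name `split_depth` and its `target_size` parameter are made up for this illustration. It counts whole splits, so for a factor of 2 it reports 16 (the 15.29 above rounded up).

```python
def split_depth(extent, cell_axis_reduction_factor=2, target_size=1.0):
    """Count how many splits shrink `extent` to `target_size` or less.

    Hypothetical helper, not part of the solution code.
    """
    depth = 0
    size = float(extent)
    while size > target_size:
        size /= cell_axis_reduction_factor
        depth += 1
    return depth


for factor in (2, 3, 4, 10):
    # Each split of one box produces factor**2 sub-boxes, i.e. that many
    # extra requests, so a shallower tree costs more requests per level.
    print('factor={}: depth={}, sub-boxes per split={}'.format(
        factor, split_depth(40000, factor), factor ** 2))
```

A larger factor gives a shallower recursion, but every split then triggers more requests at once, so for sparsely populated areas a small factor is likely to hit the server less.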

Note: since this is Python, the code below follows PEP 8, so, for example, checkForStores has become check_for_stores.

# Save visited boxes. Only for debugging purpose.
visited_boxes = []


def check_for_stores(bounding_box):
    """Function mocking real `ckeck_fo_stores` function by returning
    random list of "stores"
    """
    import random
    randint = random.randint(1, 12)
    print 'Found {} stores for bounding box {}.'.format(randint, bounding_box)
    visited_boxes.append(bounding_box)
    return ['store'] * randint


def split_bounding_box(bounding_box, cell_axis_reduction_factor=2):
    """Returns generator of bounding box coordinates splitted
    from parent `bounding_box`

    :param bounding_box: tuple containing coordinates containing tuples of
          lower-left and upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param cell_axis_reduction_factor: divide each axis in this param,
                                       in order to produce new box,
                                       meaning that in the end it will
                                       return `cell_axis_reduction_factor`**2 boxes
    :return: generator of bounding box coordinates

    """
    box_lc, box_rc = bounding_box
    box_lc_x, box_lc_y = box_lc
    box_rc_x, box_rc_y = box_rc

    cell_width = (box_rc_x - box_lc_x) / float(cell_axis_reduction_factor)
    cell_height = (box_rc_y - box_lc_y) / float(cell_axis_reduction_factor)

    for x_factor in range(cell_axis_reduction_factor):
        lc_x = box_lc_x + cell_width * x_factor
        rc_x = lc_x + cell_width

        for y_factor in range(cell_axis_reduction_factor):
            lc_y = box_lc_y + cell_height * y_factor
            rc_y = lc_y + cell_height

            yield ((lc_x, lc_y), (rc_x, rc_y))


def get_stores_in_box(bounding_box, result=None):
    """Returns list of stores found provided `bounding_box`.

    If there are more than or equal to 10 stores found in `bounding_box`,
    recursively splits current `bounding_box` into smaller one and checks
    stores in them.

    :param bounding_box: tuple containing coordinates containing tuples of
          lower-left and upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param result: list containing found stores, found stores appended here;
                   used for recursive calls
    :return: list with found stores

    """
    if result is None:
        result = []

    print('Checking for stores...')
    stores = check_for_stores(bounding_box)
    if len(stores) >= 10:
        # The server returns at most 10 stores, so 10 may mean "truncated"
        print('Stores number is more than or equal 10. Splitting bounding box...')
        for sub_box in split_bounding_box(bounding_box):
            get_stores_in_box(sub_box, result)
    else:
        print('Stores number is less than 10. Saving results.')
        result += stores

    return result


stores = get_stores_in_box(((0, 1), (30, 20)))
print('Found {} stores in total'.format(len(stores)))
print('Visited boxes: ')
print(visited_boxes)

Output:

Checking for stores...
Found 10 stores for bounding box ((0, 1), (30, 20)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 4 stores for bounding box ((0.0, 1.0), (15.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((0.0, 10.5), (15.0, 20.0)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 10 stores for bounding box ((15.0, 1.0), (30.0, 10.5)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 1 stores for bounding box ((15.0, 1.0), (22.5, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 9 stores for bounding box ((15.0, 5.75), (22.5, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((22.5, 1.0), (30.0, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 1 stores for bounding box ((22.5, 5.75), (30.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 6 stores for bounding box ((15.0, 10.5), (30.0, 20.0)).
Stores number is less than 10. Saving results.
Found 29 stores in total
Visited boxes: 
[
((0, 1), (30, 20)), 
((0.0, 1.0), (15.0, 10.5)), 
((0.0, 10.5), (15.0, 20.0)), 
((15.0, 1.0), (30.0, 10.5)), 
((15.0, 1.0), (22.5, 5.75)), 
((15.0, 5.75), (22.5, 10.5)), 
((22.5, 1.0), (30.0, 5.75)), 
((22.5, 5.75), (30.0, 10.5)), 
((15.0, 10.5), (30.0, 20.0))
]
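Two practical details are not covered by the mock above. Adjacent boxes share an edge, so a store sitting exactly on it may be returned twice. And if 10 or more stores share one spot, splitting would never end. Below is a minimal sketch of one way to handle both, using an explicit stack instead of recursion. It rests on assumptions that are not in the code above: stores are `(store_id, x, y)` tuples, `fetch` stands in for the real request function, and `collect_stores`, `min_cell_size`, `ALL_STORES` and `fake_fetch` are names invented for this example.

```python
def split_bounding_box(bounding_box, factor=2):
    """Split a box into factor**2 equal sub-boxes."""
    (lc_x, lc_y), (rc_x, rc_y) = bounding_box
    width = (rc_x - lc_x) / float(factor)
    height = (rc_y - lc_y) / float(factor)
    for i in range(factor):
        for j in range(factor):
            x, y = lc_x + width * i, lc_y + height * j
            yield ((x, y), (x + width, y + height))


def collect_stores(bounding_box, fetch, limit=10, min_cell_size=0.01):
    """Collect stores with an explicit stack instead of recursion.

    `fetch(box)` must return at most `limit` stores as
    (store_id, x, y) tuples (an assumed shape).
    Returns (unique stores, boxes that stayed truncated).
    """
    found = {}      # store_id -> store, so duplicates collapse
    truncated = []  # boxes at the limit that are too small to split
    stack = [bounding_box]
    while stack:
        box = stack.pop()
        stores = fetch(box)
        (lc_x, lc_y), (rc_x, rc_y) = box
        too_small = min(rc_x - lc_x, rc_y - lc_y) <= min_cell_size
        if len(stores) >= limit and not too_small:
            # Possibly truncated: discard and query the sub-boxes instead
            stack.extend(split_bounding_box(box))
            continue
        if len(stores) >= limit:
            truncated.append(box)
        for store in stores:
            found[store[0]] = store
    return list(found.values()), truncated


# Hypothetical test data: 25 stores spread along a diagonal
ALL_STORES = [(i, 110 + 1.5 * i, -40 + 1.1 * i) for i in range(25)]


def fake_fetch(box, limit=10):
    """Stand-in for the real server: returns the first `limit` matches."""
    (lc_x, lc_y), (rc_x, rc_y) = box
    inside = [s for s in ALL_STORES
              if lc_x <= s[1] <= rc_x and lc_y <= s[2] <= rc_y]
    return inside[:limit]


stores, truncated = collect_stores(((110, -40), (150, -10)), fake_fetch)
print('Unique stores: {}, truncated boxes: {}'.format(len(stores), len(truncated)))
```

Any box listed in `truncated` still hit the limit at the smallest allowed size, so its results may be incomplete and are worth checking by hand.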

Source: https://habr.com/ru/post/1628681/

