Scraping data from an unknown number of pages using Beautiful Soup

I want to parse some information from a website that has data spread across multiple pages.

The problem is that I do not know how many pages there are. Maybe 2, maybe 4, or maybe just one.

How can I iterate over pages when I don’t know how many pages there will be?

I do know the URL pattern, however; it looks something like the code below.

In addition, the page identifiers are not plain numbers: they look like 'pe2' for page 2, 'pe4' for page 3, and so on, so I cannot simply loop over range(n).

This is dummy code for the loop I'm trying to fix.

import requests
from bs4 import BeautifulSoup

pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']

for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # rest of the scraping code
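Since the suffixes follow a regular pattern ('' for the first page, then 'pe2', 'pe4', ... in steps of 2), they could also be generated instead of hard-coded. A minimal sketch, assuming the pattern really does continue this way:

```python
def page_suffixes(limit):
    """Yield page suffixes: '' for page 1, then 'pe2', 'pe4', ... up to (not including) limit."""
    yield ''
    for n in range(2, limit, 2):
        yield 'pe{}'.format(n)

print(list(page_suffixes(10)))  # ['', 'pe2', 'pe4', 'pe6', 'pe8']
```

This only helps if you know an upper bound in advance; otherwise you still need a stopping condition like the one in the answer below.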

You can use an infinite while loop and break out of it as soon as a request fails.

For example:

from bs4 import BeautifulSoup
from time import sleep
import requests

i = 0
while True:
    try:
        if i == 0:
            url = "http://www.website.com/somecode/dummy?page=pe"
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        r.raise_for_status()  # raise on 404/500 so missing pages end the loop
        soup = BeautifulSoup(r.content, 'html.parser')

        # print page url
        print(url)

        # rest of the scraping code

        # don't overload the website
        sleep(2)

        # increase page number
        i += 2
    except requests.exceptions.RequestException:
        break

This prints:

http://www.website.com/somecode/dummy?page=pe
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until a request raises an exception.
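Note that some sites return a 200 response with an empty or placeholder page for out-of-range page numbers, so the exception may never fire. In that case you can stop when the page no longer contains a "next" link. A minimal sketch on a hard-coded HTML snippet, assuming the site renders such a link (the class name 'next' here is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a pagination link pointing at the next page.
html = '<html><body><a class="next" href="?page=pe2">Next</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

next_link = soup.find('a', class_='next')  # None when there is no next page
print(next_link['href'] if next_link else None)  # ?page=pe2
```

In a real loop you would fetch each page, scrape it, and break as soon as find() returns None.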

Source: https://habr.com/ru/post/1673998/
