Python & Beautiful Soup: search in a specific class only

I am writing a script to scrape the independence dates of several countries from Wikipedia.

For example, with Kazakhstan:

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')
# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")
if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))
    if formation:
        independence = formation.find_next(text = re.compile("independence"))
        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))
            if independence:
                independ_date = independence.find_next("td").text
        print(independ_date)

I get the following output:

 Almaty 

This result does not come from the infobox but from the article text after it. That is because formation.find_next(text = re.compile("independence")) matched something outside the infobox, but I do not understand why the search is not restricted to the infobox. How can I search only inside that element?

Thank you in advance for your help!

+5
2 answers

This is because formation.find_next(text = re.compile("independence")) found something outside the infobox

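Add .extract() to the soup.find() call so that the search stays inside the "infobox geography vcard" element. find_next() walks forward through the whole document, not just through the element's children; extract() detaches the table from the tree, so find_next() has nothing left to reach once the table ends.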

infobox = soup.find("table", class_="infobox geography vcard").extract()
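To see the difference without hitting Wikipedia, here is a minimal sketch on a made-up page. The HTML snippet and the variable names are assumptions for illustration only, and it uses the built-in "html.parser" so that lxml is not required. It compares find_next(), find() and find_next() after extract():

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page: an infobox table followed by article text.
HTML = """
<table class="infobox"><tr><th>Formation</th><td>1991</td></tr></table>
<p>After independence the capital was <b>Almaty</b>.</p>
"""

pattern = re.compile("independence")

soup = BeautifulSoup(HTML, "html.parser")
infobox = soup.find("table", class_="infobox")

# find_next() walks forward through the whole document, so it leaves the table.
leaked = infobox.find_next(string=pattern)

# find() only looks at the descendants of the table.
scoped = infobox.find(string=pattern)

# extract() detaches the table from the tree, so find_next() ends with the table.
detached = infobox.extract()
after_extract = detached.find_next(string=pattern)

print(repr(leaked))     # a string taken from the <p> after the table
print(scoped)           # None: the table itself has no match
print(after_extract)    # None: nothing follows the detached table
```

Note that soup.find() returns None when no table matches, so calling .extract() directly on the result raises AttributeError for pages without an infobox; checking for None first avoids that.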

+1

Your code takes the value after the first match of "independence", when the one you want is the second. The string "Formation" also does not generalize well, as I found when testing a few other countries, so I think you can search for "Independence" from the very beginning:

infobox = soup.find("table", class_="infobox geography vcard")
if infobox:
    formation = infobox.find_next(text = re.compile("Independence"))
    if formation:
        independence = formation.find_next(text = re.compile("independence"))
        if independence:
            independence = infobox.find_next(text = re.compile("Independence"))
            independ_date = independence.find_next("td").text
            print(independ_date)

This should return the first date in the independence section of the infobox on the Wikipedia page of a country that has an independence date.
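Another way to keep the search inside the infobox is to go through its rows one by one instead of chaining find_next() calls. The following is only a sketch under assumptions: the sample markup is a simplified stand-in for the real Wikipedia table, and find_independence_date is a hypothetical helper name, not something from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup; the real Wikipedia table is larger.
SAMPLE = """
<table class="infobox geography vcard">
  <tr><th>Capital</th><td>Astana</td></tr>
  <tr><th>Formation</th><td></td></tr>
  <tr><th>Declared independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved from <b>Almaty</b>.</p>
"""

def find_independence_date(html):
    """Return the cell text of the first infobox row whose header mentions independence."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    label = re.compile("independence", re.IGNORECASE)
    # find_all() only visits descendants of the table, never the text after it.
    for row in infobox.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is not None and cell is not None and label.search(header.get_text()):
            return cell.get_text(strip=True)
    return None

print(find_independence_date(SAMPLE))  # 16 December 1991
```

Matching on the row header with re.IGNORECASE covers both "independence" and "Independence" in one pass. Real infoboxes vary between countries, so the row layout assumed here may need adjusting.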

0

Source: https://habr.com/ru/post/1273584/

