How can I get information from the <a href> tag in the <div> tags using BeautifulSoup and Python?

Hi all, I have a quick question about BeautifulSoup with Python. I have a few bits of HTML that look like this (the only differences are the links and product names), and I'm trying to get the link from the "href" attribute.

    <div id="productListing1" xmlns:dew="urn:Microsoft.Search.Response.Document">
        <span id="rank" style="display:none;">94.36</span>
        <div class="productPhoto">
            <img src="/assets/images/ocpimages/87684/00131cl.gif" height="82" width="82" />
        </div>
        <div class="productName">
            <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
        </div>
        <div class="size">40 CT</div>

I currently have this Python code:

    productLinks = soup.findAll('a', attrs={'class' : 'on'})
    for link in productLinks:
        print(link['href'])

This works (for each link on the page I get something like /Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131); however, I was trying to figure out whether there is a way to get the link from the "href" attribute without searching directly for class="on". I suppose my first question should be whether this is the best way to find this information (class="on" seems too general and likely to break in the future, although my CSS and HTML skills are not great). I have tried many combinations of find, findAll, findAllNext, etc., but I cannot get it to work. This is basically what I had (I rebuilt and changed it several times):

 productLinks = soup.find('div', attrs={'class' : 'productName'}).find('a', href=True) 

If this is not a good way to do this, how can I get to the <a> tag from the <div class="productName"> ? Let me know if you need more information.

Thanks.

2 answers

Well, if you have a <div> element, you can get the <a> subelement by calling find():

    productDivs = soup.findAll('div', attrs={'class' : 'productName'})
    for div in productDivs:
        print(div.find('a')['href'])

However, since the <a> sits directly inside the <div>, you can also reach it as the a attribute of the div:

    productDivs = soup.findAll('div', attrs={'class' : 'productName'})
    for div in productDivs:
        print(div.a['href'])
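
For what it's worth, here is a small self-contained sketch of that shortcut. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser", and a trimmed-down copy of the HTML from the question; the second listing, the one without a link, is made up for illustration. As far as I know, div.a is just shorthand for div.find('a'): it gives the first <a> inside the div, or None if there isn't one.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's markup. The second listing is invented
# so that there is a div without a link to look at.
html = """
<div class="productName">
  <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN</a>
</div>
<div class="productName">No link here</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="productName")
first, second = divs

# Attribute access and find() hand back the same Tag object.
same_tag = first.a is first.find("a")

# With no <a> inside the div, div.a is None rather than an error.
missing = second.a

# Skip divs without a link so that ["href"] is never applied to None.
hrefs = [div.a["href"] for div in divs if div.a is not None]
print(same_tag, missing, hrefs)
```

Note that the parser decodes &amp; in the attribute value, so the href comes back with a plain & in it.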

Now, if you want to put all the <a> elements in a list, your code above will not work, because find() returns only the first element that matches its criteria. Instead, get the list of divs and pull the subelement out of each one, for example with a list comprehension:

    productLinks = [div.a for div in soup.findAll('div', attrs={'class' : 'productName'})]
    for link in productLinks:
        print(link['href'])
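
To make the difference between find() and the list approach concrete, here is a minimal runnable sketch. It assumes BeautifulSoup 4 (where find_all is the current name for findAll) and two made-up product listings in the same shape as the question's HTML; the /Products/A and /Products/B links are placeholders, not real URLs.

```python
from bs4 import BeautifulSoup

# Two made-up product listings in the same shape as the question's HTML.
html = """
<div class="productName"><a class="on" href="/Products/A">Product A</a></div>
<div class="productName"><a class="on" href="/Products/B">Product B</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, so this yields a single Tag, not a list.
first_link = soup.find("div", class_="productName").find("a", href=True)

# find_all() returns every matching div; the comprehension collects each link.
all_links = [div.a for div in soup.find_all("div", class_="productName")]

print(first_link["href"])
print([link["href"] for link in all_links])
```

So the chained find() call from the question is not wrong as such; it only ever reaches the first product on the page.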

Here is a solution for BeautifulSoup 4:

    for data in soup.find_all('div', class_='productName'):
        for a in data.find_all('a'):
            print(a.get('href'))  # get the link
            print(a.text)         # get the text inside the link
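
As a possible alternative that nobody in this thread suggested, BeautifulSoup 4 also has select(), which takes a CSS selector and can do the nested lookup in one call. The sketch below is only that, a sketch: the HTML is made up (including one <a> without an href, to show the difference between a.get('href') and a['href']), and it assumes a bs4 version recent enough to support this selector syntax.

```python
from bs4 import BeautifulSoup

# Made-up listings; the second link deliberately has no href attribute.
html = """
<div class="productName"><a class="on" href="/Products/A">Product A</a></div>
<div class="size">40 CT</div>
<div class="productName"><a class="on">No href on this one</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.productName > a[href]" matches <a> tags that are direct children of a
# div with class productName and that actually carry an href attribute.
links = soup.select("div.productName > a[href]")
pairs = [(a["href"], a.get_text(strip=True)) for a in links]

# a.get("href") returns None for a missing attribute,
# whereas a["href"] would raise KeyError.
no_href = soup.find_all("a")[1].get("href")
print(pairs, no_href)
```

This does not depend on class="on" at all, which was the original worry in the question.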

Source: https://habr.com/ru/post/904017/

