How to scrape elements that immediately follow a specific element?

I have an HTML document that looks like this:

<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
  <a href="interesting link"></a>
  ...
</div>

I want to scrape only the links that come after the code tag. If I call soup.findAll('a'), it returns all hyperlinks.

How can I get BS4 to start scraping only after that particular code element?

2 answers

Try soup.find_all_next():

>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]
>>> 

This is similar to soup.find_all(), but it only finds matching tags that come after the given tag in the document.
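One detail worth checking: find_all_next() keeps walking the document after the starting tag, so it is not limited to the enclosing div. The sketch below is a minimal illustration under an assumed, shortened version of the markup; the "outside link" anchor is a hypothetical addition that is not in the question. It compares find_all_next() with find_next_siblings(), which only looks at later siblings under the same parent.

```python
from bs4 import BeautifulSoup

# Shortened markup; the "outside link" anchor is a hypothetical addition
# used only to show the difference between the two methods.
html = """
<div id="whatever">
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
</div>
<a href="outside link"></a>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", id="whatever").find("code")

# find_all_next() walks everything after <code> in document order,
# so it also picks up the link outside the <div>.
all_after = [a["href"] for a in code.find_all_next("a")]

# find_next_siblings() only considers later siblings of <code>,
# i.e. elements that share its parent.
siblings_after = [a["href"] for a in code.find_next_siblings("a")]

print(all_after)
print(siblings_after)
```

If the links of interest always share a parent with the code tag, find_next_siblings() may be the safer choice; otherwise find_all_next() is the broader search.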


If you want to remove the <a> tags before <code>, there is a method called find_all_previous():

>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]

>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
...     
... 
<a href="unwanted link"></a>
<a href="unwanted link"></a>

>>> tag
<div id="whatever">


  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
<a href="interesting link"></a>
  ...
</div>
>>> 

So this:

  • Finds all <a> tags that come before <code>.
  • Uses extract() in a for loop to remove them.
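The removal loop in the session above echoes each tag because extract() returns the tag it detaches. The following is a minimal sketch of that behaviour as a standalone script, assuming a shortened version of the markup and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
soup = BeautifulSoup(
    '<div id="whatever"><a href="unwanted link"></a>'
    '<code>blah blah</code>'
    '<a href="interesting link"></a></div>',
    "html.parser",
)
code = soup.find("code")

# extract() detaches each tag from the tree and returns it,
# so the removed links can still be inspected afterwards.
removed = [a.extract() for a in code.find_all_previous("a")]

removed_hrefs = [a["href"] for a in removed]
remaining_hrefs = [a["href"] for a in soup.find_all("a")]

print(removed_hrefs)
print(remaining_hrefs)
```

If the removed tags are not needed afterwards, decompose() is an alternative that removes the tag and destroys it instead of returning it.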

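You can use a CSS selector with .select() and then decompose the matches. The ~ combinator matches following siblings, so to select the a tags after code, use: code ~ a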

soup = BeautifulSoup('''<div id="whatever">
      <a href="unwanted link"></a>
      <a href="unwanted link"></a>
      ...
      <code>blah blah</code>
      ...
      <a href="interesting link"></a>
      <a href="interesting link"></a>
      ...
      </div>''', 
     'lxml'
)

for link in soup.select('code ~ a'):
    link.decompose()     

print(soup)

Output:

<html><body><div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...


  ...
</div></body></html>
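Since the question asks about collecting the links, not deleting them, the same selector can also be used read-only. The sketch below assumes a shortened version of the markup with two distinct href values ("first interesting link" and "second interesting link" are hypothetical, chosen so the results can be told apart). It also shows the + combinator, which matches only the element directly following code.

```python
from bs4 import BeautifulSoup

# Shortened markup; the two distinct href values are hypothetical.
html = (
    '<div id="whatever">'
    '<a href="unwanted link"></a>'
    '<code>blah blah</code>'
    '<a href="first interesting link"></a>'
    '<a href="second interesting link"></a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# "code ~ a": every <a> that is a later sibling of a <code>.
later_links = [a["href"] for a in soup.select("code ~ a")]

# "code + a": only an <a> that is the very next element after a <code>.
adjacent_links = [a["href"] for a in soup.select("code + a")]

print(later_links)
print(adjacent_links)
```

Both combinators only look at siblings under the same parent, so links nested deeper or placed outside the div would need find_all_next() instead.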

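Alternatively, find the code tags with find_all, then for each "code" use find_all_next to find the a tags that follow it. Calling decompose on each of those links removes it from the tree.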

In [85]: from bs4 import BeautifulSoup

In [86]: soup = BeautifulSoup('''<div id="whatever">
   ....:   <a href="unwanted link"></a>
   ....:   <a href="unwanted link"></a>
   ....:   ...
   ....:   <code>blah blah</code>
   ....:   ...
   ....:   <a href="interesting link"></a>
   ....:   <a href="interesting link"></a>
   ....:   ...
   ....: </div>''', 'lxml')

In [87]: for code in soup.find_all('code'):
   ....:     for link in code.find_all_next('a'):
   ....:         link.decompose()
   ....:         

In [88]: soup
Out[88]: 
<html><body><div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...


  ...
</div></body></html>

Source: https://habr.com/ru/post/1621848/

