How to scrape elements that immediately follow a specific element?
I have an HTML document that looks like this:
<div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
...
<code>blah blah</code>
...
<a href="interesting link"></a>
<a href="interesting link"></a>
...
</div>
I want to scrape only the links that follow the code tag. If I do soup.findAll('a'), it returns all hyperlinks.
How can I get BS4 to start scraping only after this particular code element?
+4
2 answers
Try soup.find_all_next():
>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]
>>>
This is similar to soup.find_all(), but it only finds tags that come after the given tag.
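One thing worth knowing is that find_all_next() keeps walking to the end of the document, so it can also return links that sit outside the enclosing div. The sketch below illustrates this and one possible way to keep only the links inside the div. The function name links_after_code, the sample markup, and the use of the built-in 'html.parser' (so that lxml is not required) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration): one link sits after the div.
HTML = """<div id="whatever">
<a href="unwanted link"></a>
<code>blah blah</code>
<a href="interesting link"></a>
</div>
<a href="outside link"></a>"""


def links_after_code(html, container_id="whatever"):
    """Return (all hrefs after <code>, hrefs after <code> inside the div)."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    code = container.find("code") if container is not None else None
    if code is None:
        return [], []

    # find_all_next() searches everything after <code> in the document.
    following = code.find_all_next("a")

    # Keep only links that have the container among their ancestors.
    inside = [a for a in following
              if any(parent is container for parent in a.parents)]

    return [a["href"] for a in following], [a["href"] for a in inside]


all_hrefs, inside_hrefs = links_after_code(HTML)
print(all_hrefs)
print(inside_hrefs)
```

If the page has nothing of interest after the div, the unfiltered result is usually enough.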
If you want to remove the <a> tags before <code>, there is a method called find_all_previous():
>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]
>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
...
...
<a href="unwanted link"></a>
<a href="unwanted link"></a>
>>> tag
<div id="whatever">
...
<code>blah blah</code>
...
<a href="interesting link"></a>
<a href="interesting link"></a>
...
</div>
>>>
In short: this finds every <a> tag before <code> and removes each one by calling extract() in a for loop.
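As a self-contained version of the steps above, here is a minimal sketch. It relies on the fact that extract() returns the detached tag, so the removed links can still be inspected afterwards. The function name strip_links_before_code, the numbered sample hrefs, and the 'html.parser' backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration).
HTML = """<div id="whatever">
<a href="unwanted link 1"></a>
<a href="unwanted link 2"></a>
<code>blah blah</code>
<a href="interesting link"></a>
</div>"""


def strip_links_before_code(html):
    """Detach every <a> before the first <code>; return (removed, remaining) hrefs."""
    soup = BeautifulSoup(html, "html.parser")
    code = soup.find("code")
    if code is None:
        return [], [a["href"] for a in soup.find_all("a")]

    # find_all_previous() walks backwards from <code>; the result is built
    # before any tag is detached, so extracting while looping is safe.
    removed = [a.extract() for a in code.find_all_previous("a")]

    removed_hrefs = [a["href"] for a in removed]
    remaining_hrefs = [a["href"] for a in soup.find_all("a")]
    return removed_hrefs, remaining_hrefs


removed, remaining = strip_links_before_code(HTML)
print(removed)
print(remaining)
```

If the removed tags are not needed afterwards, decompose() can be used instead of extract(); it destroys the tag rather than returning it.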
+4
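Alternatively, use a CSS selector with .select() and then call decompose on each match. The ~ combinator matches following siblings, so the selector for a tags that follow code is: code ~ a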
from bs4 import BeautifulSoup
soup = BeautifulSoup('''<div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
...
<code>blah blah</code>
...
<a href="interesting link"></a>
<a href="interesting link"></a>
...
</div>''',
'lxml'
)
for link in soup.select('code ~ a'):
    link.decompose()
print(soup)
Prints:
<html><body><div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
...
<code>blah blah</code>
...
...
</div></body></html>
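If there is more than one code tag, use find_all to get every "code" element, then find_all_next on each one to get the a tags that follow it. Calling decompose on each link removes it from the tree.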
In [85]: from bs4 import BeautifulSoup
In [86]: soup = BeautifulSoup('''<div id="whatever">
....: <a href="unwanted link"></a>
....: <a href="unwanted link"></a>
....: ...
....: <code>blah blah</code>
....: ...
....: <a href="interesting link"></a>
....: <a href="interesting link"></a>
....: ...
....: </div>''', 'lxml')
In [87]: for code in soup.find_all('code'):
   ....:     for link in code.find_all_next('a'):
   ....:         link.decompose()
....:
In [88]: soup
Out[88]:
<html><body><div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
...
<code>blah blah</code>
...
...
</div></body></html>
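The two approaches do not always select the same links: code ~ a matches only a tags that are siblings of code (same parent), while find_all_next('a') also matches links nested deeper in later elements. A minimal sketch of the difference follows; the function name compare_selections, the sample markup with a nested link, and the 'html.parser' backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration): one link is nested in a <p>.
HTML = """<div id="whatever">
<a href="unwanted link"></a>
<code>blah blah</code>
<a href="sibling link"></a>
<p><a href="nested link"></a></p>
</div>"""


def compare_selections(html):
    """Return hrefs matched by the sibling selector and by find_all_next()."""
    soup = BeautifulSoup(html, "html.parser")

    # 'code ~ a' matches <a> tags that share a parent with <code> and come after it.
    by_selector = [a["href"] for a in soup.select("code ~ a")]

    # find_all_next() matches every later <a>, however deeply nested.
    by_traversal = [a["href"] for a in soup.find("code").find_all_next("a")]

    return by_selector, by_traversal


by_selector, by_traversal = compare_selections(HTML)
print(by_selector)   # ['sibling link']
print(by_traversal)  # ['sibling link', 'nested link']
```

For flat markup like the example in the question, both give the same result.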
+2