How to extract and ignore range in HTML markup?
My input is as follows:
<ul class="definitions">
<li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>
Required Outputs:
label = 'noun'
meaning = 'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
related_to = ['sale', 'chain', 'wine']
utag = ['product']
I tried this:
>>> from bs4 import BeautifulSoup
>>> text = '''<ul class="definitions">
... <li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a href="sale.html">sale</a> of the same <u slug="product">product</u>, as when a restaurant <a href="chain.html">chain</a> takes over a <a href="wine.html">wine</a> importer</li></ul>'''
>>> bsoup = BeautifulSoup(text)
>>> bsoup.text
u'\nnoun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
>>> label = bsoup.find('span')
>>> label
<span>noun</span>
>>> label = bsoup.find('span').text
>>> label
u'noun'
>>> bsoup.text.strip()
u'noun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
>>> bsoup.text.strip
>>> definition = bsoup.text.strip()
>>> definition = definition.partition(' ')[2] if definition.split()[0] == label else definition
>>> definition
u'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
>>> related_to = [r.text for r in bsoup.find_all('a')]
>>> related_to
[u'sale', u'chain', u'wine']
>>> related_to = [r.text for r in bsoup.find_all('u')]
>>> related_to = [r.text for r in bsoup.find_all('a')]
>>> utag = [r.text for r in bsoup.find_all('u')]
>>> related_to
[u'sale', u'chain', u'wine']
>>> utag
[u'product']
Using BeautifulSoup is fine, but a little more verbose to get what you need.
Is there any other way to achieve the same results?
Is there a way to regex with some groups to find the desired outputs?
alvas source
share