How to use BeautifulSoup4 to get all the text before the tag

Question

I am trying to clean up some data for my application and need a little help. Here is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>

I want the output to look like

This is a first sentence.
This is a second sentence.
This is a third sentence.

Can this be done?

+4

python html scrapy beautifulsoup

jack45j Feb 10 '18 at 15:50

4 answers

One way to approach this is to work with the structure of the HTML itself.

First, find the td element:

td = soup.find('td')

Then list its children:

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']

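Some of the items are strings and others are HTML elements. Most importantly, some of those elements are br.

You can tell whether an item is a tag by testing it with

isinstance(td_kids[<some k>], bs4.element.Tag)

for each item in the list.

Walk through the list and join the text of the items that belong together, starting a new sentence at each br. The strings in the list are what BeautifulSoup calls "navigable strings", so they can be joined directly, while tags contribute their text.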

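Finally, collapse the runs of whitespace with re.sub:

# replace each whitespace run with a single space so adjacent words stay separated
result = re.sub(r'\s{2,}', ' ', <joined list>)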
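
The steps in this answer can be put together roughly as follows. This is only a sketch, assuming the sample HTML from the question and the built-in html.parser; the function name sentences_from_td is made up for illustration and is not part of bs4.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def sentences_from_td(td):
    """Hypothetical helper: one string per run of children that ends at a br."""
    sentences, pieces = [], []
    for kid in td.children:
        if isinstance(kid, Tag) and kid.name == "br":
            # A br closes the sentence collected so far.
            sentences.append(" ".join(pieces))
            pieces = []
        elif isinstance(kid, Tag):
            # Any other tag (here, the a elements) contributes its text.
            pieces.append(kid.get_text())
        else:
            # Plain strings are NavigableString objects; use them as text.
            pieces.append(str(kid))
    if pieces:
        sentences.append(" ".join(pieces))
    # Collapse whitespace runs to one space and drop empty leftovers.
    cleaned = [re.sub(r"\s+", " ", s).strip() for s in sentences]
    return [s for s in cleaned if s]


soup = BeautifulSoup(html, "html.parser")
result = sentences_from_td(soup.find("td"))
print("\n".join(result))
```

Splitting at br rather than at ". " means a sentence that contains an abbreviation or a decimal number is not broken in the middle.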

+2

Bill Bell Feb 10 '18 at 18:27

You can easily do this using bs4 and basic string manipulations:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    # print() call so the snippet also runs on Python 3
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This gives the following output:

This is a first sentence.
This is a second sentence.
This is a third sentence.

+2

game0ver Feb 10 '18 at 20:57

htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())
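
The same idea can be wrapped in a small function. This is a sketch only: it assumes the built-in html.parser instead of lxml, uses a regular expression in place of the two while loops, and br_to_lines is a hypothetical name chosen for this example.

```python
import re

from bs4 import BeautifulSoup


def br_to_lines(html_text):
    """Hypothetical helper: return the text of html_text as one line per br."""
    # Flatten all whitespace first, so the only newlines left afterwards
    # are the ones inserted in place of br tags.
    flat = re.sub(r"\s+", " ", html_text)
    soup = BeautifulSoup(flat, "html.parser")
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # Strip each line and skip the empty ones.
    lines = (line.strip() for line in soup.get_text().split("\n"))
    return [line for line in lines if line]


sample = """<td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>
</td>"""

lines = br_to_lines(sample)
print("\n".join(lines))
```

Note that flattening whitespace in the raw markup also changes whitespace inside attribute values and pre elements, which is harmless for this sample but may matter for other pages.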

+1

yenter Feb 10 '18 at 16:20

SIM · Accepted Answer · 2018-02-10T20:23:24+0000

Try this. It should give you the desired result. The variable content used in the script below is assumed to hold the HTML from your question.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.
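
A slightly more defensive variant of the same sibling-based idea might look like the sketch below. It assumes the sample HTML from the question and the built-in html.parser; sentence_around is a hypothetical helper name. It skips a missing or non-text sibling instead of failing in ''.join, and it does not rely on commas as separators, so commas inside the text are left alone.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

content = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def sentence_around(link):
    """Hypothetical helper: text before the link, the link text, text after it."""
    parts = []
    before, after = link.previous_sibling, link.next_sibling
    # Only use a sibling when it is a text node; it may be None or a tag.
    if isinstance(before, NavigableString):
        parts.append(str(before))
    parts.append(link.get_text())
    if isinstance(after, NavigableString):
        parts.append(str(after))
    # Normalise all whitespace to single spaces.
    return " ".join(" ".join(parts).split())


soup = BeautifulSoup(content, "html.parser")
lines = [sentence_around(a) for a in soup.select(".tip.info")]
print("\n".join(lines))
```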
