How to use BeautifulSoup4 to get all the text before the tag

I am trying to clean up some data for my application. Here is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>

I want the output to look like

This is a first sentence.
This is a second sentence.
This is a third sentence.

Can this be done?

4 answers

Try this; it should give you the desired result. Assume that the variable content used in the script below holds the HTML shown above.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.
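The one-liner above uses commas as temporary separators, so it would split a sentence that itself contains a comma, and it assumes every link has a text node on both sides. A minimal sketch of the same sibling-based idea without those assumptions follows; the helper name text_around is hypothetical, and html.parser is used here only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

content = """
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
"""


def text_around(anchor):
    """Return the anchor text plus the text nodes directly before and after it.

    Hypothetical helper: siblings that are missing or are tags count as empty.
    """
    before = anchor.previous_sibling
    after = anchor.next_sibling
    parts = [
        str(before) if isinstance(before, NavigableString) else "",
        anchor.get_text(),
        str(after) if isinstance(after, NavigableString) else "",
    ]
    # split() + join collapses every run of whitespace into a single space
    return " ".join("".join(parts).split())


soup = BeautifulSoup(content, "html.parser")
sentences = [text_around(a) for a in soup.select(".tip.info")]
print("\n".join(sentences))
```

Each sentence is built separately, so no placeholder character has to be swapped for a newline afterwards.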

First, find the td,

td = soup.find('td')

then look at its children,

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']

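As you can see, the children are a mix of strings and HTML tags. The br tags mark where each line ends.

To tell a tag from a string, test each child with

isinstance(td_kids[<some k>], bs4.element.Tag)

Use that check to collect the strings and the text of the a tags, and join them.

Finally, collapse the runs of whitespace with re.sub:

result = re.sub(r'\s{2,}', ' ', <joined list>)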
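The steps this answer outlines (list the children of the td, test each child with isinstance, then normalize whitespace) could be put together roughly as follows. This is only a sketch: the function name sentences_from_cell is hypothetical, treating each br as a line boundary is an assumption about the intended result, and html.parser is used so that no extra parser is required.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
"""


def sentences_from_cell(cell):
    """Group the children of a cell into lines, starting a new line at each <br>."""
    lines, parts = [], []
    for child in cell.children:
        if isinstance(child, Tag) and child.name == "br":
            # a <br> closes the current line
            lines.append(parts)
            parts = []
        elif isinstance(child, Tag):
            # any other tag (here the <a> links) contributes its text
            parts.append(child.get_text())
        else:
            # plain strings are kept as they are
            parts.append(str(child))
    lines.append(parts)
    # collapse whitespace runs to one space and drop lines that end up empty
    cleaned = (re.sub(r"\s+", " ", "".join(p)).strip() for p in lines)
    return [line for line in cleaned if line]


soup = BeautifulSoup(html, "html.parser")
result = sentences_from_cell(soup.find("td"))
print("\n".join(result))
```

Splitting on the br tags, rather than on punctuation, keeps a line intact even when it contains more than one sentence.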

You can easily do this using bs4 and basic string manipulation:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    print(' '.join(i.text.split()).replace('. ', '.\n'))

This will give as output:

This is a first sentence.
This is a second sentence.
This is a third sentence.
htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())

Source: https://habr.com/ru/post/1693466/

