How do I tell BeautifulSoup to extract the contents of a specific tag as text (without modifying it)?

I need to parse an HTML document that contains "code" tags.

I get the code blocks as follows:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is that if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup treats <Person> as a nested tag, closes it, and turns the block of code into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there a way to extract the contents of code tags as text using BeautifulSoup, without letting it "fix" what it thinks is broken HTML?

+3
1 answer

Add the code tag to the QUOTE_TAGS dictionary, so that BeautifulSoup keeps its contents as plain text instead of parsing them as markup.

from BeautifulSoup import BeautifulSoup

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

# Register 'code' as a quote tag so its contents are not parsed as markup
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
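
If the old BeautifulSoup 3 package is not available, the same raw text can be pulled out with a plain regular expression from the standard library. This is a different technique from the QUOTE_TAGS approach above and only a minimal sketch: the names CODE_BLOCK and extract_code_blocks are hypothetical, and it assumes that code elements are not nested and that the literal string </code> never appears inside a block.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches an opening <code ...> tag, then lazily captures everything
# up to the first closing </code>, without parsing it as markup.
CODE_BLOCK = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the contents of every <code> element, trimmed of outer whitespace."""
    return [body.strip() for body in CODE_BLOCK.findall(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
blocks = extract_code_blocks(content)
print(blocks)
```

Because nothing is parsed here, generic type arguments such as <Person> are left exactly as they were written.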
+7

Source: https://habr.com/ru/post/1790478/

