How do I tell BeautifulSoup to extract the contents of a specific tag as text (without modifying it)?

Question

I need to parse an HTML document containing "code" tags.

I get the code blocks as follows:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is that if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup closes the "nested" tags and converts the block of code into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there a way to extract the contents of code tags as text using BeautifulSoup, without letting it fix what it thinks are HTML markup errors?

python syntax-highlighting beautifulsoup

Bfil Feb 07 '11 at 15:21

1 answer

Rod · Accepted Answer · 2011-02-07T15:47:38+0000

Add the code tag to the QUOTE_TAGS dictionary.

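from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

# Tags listed in QUOTE_TAGS have their contents treated as literal text,
# so <Person> is no longer parsed (and "repaired") as a nested tag.
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print code_blocks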

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
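
If patching QUOTE_TAGS is not an option (for example, BeautifulSoup 3 is not available), one possible fallback is to pull the raw text out of the code tags before any HTML parser sees it. The sketch below is not part of the answer above and does not use BeautifulSoup at all; it uses only the standard library's re module. The names CODE_BLOCK_RE and extract_code_blocks are made up for illustration, and it assumes code tags are not nested and that attribute values contain no ">" character.

```python
import re

# Hypothetical helper, not a BeautifulSoup API: matches <code ...>...</code>
# and captures everything between the opening and closing tag verbatim.
# Assumptions: no nested <code> tags, no ">" inside attribute values.
CODE_BLOCK_RE = re.compile(
    r"<code\b[^>]*>(.*?)</code\s*>",
    re.DOTALL | re.IGNORECASE,
)


def extract_code_blocks(html):
    """Return the raw, unmodified contents of every <code> element."""
    return [match.group(1) for match in CODE_BLOCK_RE.finditer(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
for block in extract_code_blocks(content):
    print(block)
```

Since nothing is parsed as HTML here, List<Person> comes back exactly as written. The trade-off is that a regular expression is far less tolerant of unusual markup than a real parser, so this is only reasonable when the input is known to be simple.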
