How to convert HTML to text without markup in Python?

Question

I need to get plain text from an HTML document while treating <br> elements as newlines. BeautifulSoup's .text does not turn <br> into newlines. html2text is pretty nice, but it converts the document to Markdown. How else can I approach this?

+4

python html

Sean W. Jun 09 '13 at 16:33

2 answers

You can convert the <br> tags to newlines first, then strip the remaining tags and replace them with spaces (if necessary):

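    import re

    # Turn <br>, <br/>, <br /> and </br> into newlines (any letter case).
    myString = re.sub(r"</?br\s*/?>", "\n", myString, flags=re.IGNORECASE)
    # Replace every remaining tag with a space.
    myString = re.sub(r"<[^>]*>", " ", myString)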
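As a rough, self-contained sketch of the same regex idea for Python 3: the function name `html_to_text` is made up for illustration, and the `html.unescape` call is an extra step that the snippet above does not perform. Regexes do not understand comments, scripts, or `>` inside attribute values, so this is only suitable for simple markup.

```python
import html
import re


def html_to_text(markup):
    """Hypothetical helper: regex-based tag stripping that keeps <br> as newlines."""
    # Convert <br>, <br/>, <br /> and </br> to newlines, ignoring letter case.
    text = re.sub(r"</?br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    # Replace every remaining tag with a single space.
    text = re.sub(r"<[^>]*>", " ", text)
    # Decode entities such as &amp; (an addition, not part of the original snippet).
    return html.unescape(text)


print(html_to_text("<p>Hello<br>world &amp; <b>friends</b></p>"))
```

Because each tag becomes a space, the result can contain leading, trailing, or doubled spaces; collapse them afterwards if that matters.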

0

mishik Jun 09 '13 at 16:40

That1guy · Accepted Answer · 2013-06-09T16:43:01+0000

I like to use the following method. To preserve line breaks, you can manually call .replace('<br>', '\r\n') on the string before passing it to strip_tags(html).

From this question:

    # Python 2 import; in Python 3 the class lives in html.parser.
    from HTMLParser import HTMLParser

    class MLStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.fed = []  # collected text fragments

        def handle_data(self, d):
            # Called for each run of text between tags.
            self.fed.append(d)

        def get_data(self):
            return ''.join(self.fed)

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()
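A possible Python 3 variation on this idea, sketched under a few assumptions: instead of replacing `<br>` in the string beforehand, the parser itself emits a newline whenever it sees a `br` tag. The names `BrAwareStripper` and `strip_tags_keep_breaks` are invented for this example and do not come from the answer above.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class BrAwareStripper(HTMLParser):
    """Hypothetical variant of MLStripper that maps <br> to a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br>, <BR> and <br/> all end up here.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Called for each run of text between tags (entities already decoded).
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags_keep_breaks(markup):
    parser = BrAwareStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.get_text()


print(strip_tags_keep_breaks("<p>Hello<br>world &amp; <b>friends</b></p>"))
```

Handling the tag inside the parser avoids having to enumerate spellings such as `<br/>` or `<BR />` in a string replacement. Note that the contents of `<script>` and `<style>` elements are still passed through as text in this sketch.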
