Remove the HTML formatting "&gt;" from a text file using Python csv.reader

Question

I have a text file with ; used as the delimiter. The problem is that it contains HTML text formatting, for example &gt;. Obviously, the ; in that causes problems. The text file is large, and I do not have a list of these HTML strings; there are many different examples, such as &amp;. How can I remove all of them using Python? The file is a list of names, addresses, phone numbers and a few more fields. I am looking for something like a crap.html.remove(textfile) module.

+3

python html file regex csv

Vincent Oct 28 '09 at 13:30

3 answers

Take a look at the code here:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

Of course, this only takes care of HTML entities. You may have other semicolons in the text that confuse your CSV parser. But I guess you already know that...

UPDATE: added a catch for a possible OverflowError.
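To connect this back to the csv.reader part of the question, here is a minimal sketch in Python 3 (not the Python 2 used above). It assumes the standard library's html.unescape as a stand-in for the unescape function shown here, and the helper name read_unescaped_rows and the sample line are made up for illustration. The idea is to decode entities line by line before the parser splits on ";".

```python
import csv
import html
import io

def read_unescaped_rows(lines, delimiter=";"):
    """Return a csv.reader over lines whose HTML entities were decoded first."""
    # Decode entities before parsing, so the ';' that terminates an entity
    # such as '&gt;' is gone by the time csv.reader splits on ';'.
    cleaned = (html.unescape(line) for line in lines)
    # Assumption: fields are not quoted, so a '"' produced by '&quot;'
    # should be treated as ordinary data rather than as a quote character.
    return csv.reader(cleaned, delimiter=delimiter, quoting=csv.QUOTE_NONE)

# Hypothetical sample line standing in for the real file.
sample = io.StringIO("Smith &amp; Sons;12 High St &gt; Unit 4;555-0100\n")
rows = list(read_unescaped_rows(sample))
print(rows)
```

One caveat with this order of operations: an entity that decodes to the delimiter itself (for example &#59;) would introduce an extra field, so it may be worth checking row lengths afterwards.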

+3

itsadok Oct 28 '09 at 13:39

On Unix (including Mac OS X), you can run:

recode html.. file_with_html.txt

This replaces &gt; with ">" and so on.

The command can also be called from Python.

+1

EOL 02 . '09 10:59

bobince · Accepted Answer · 2009-10-28T13:41:44+0000

The quickest way is to use the undocumented but long-stable method unescape in HTMLParser:

import HTMLParser
s = HTMLParser.HTMLParser().unescape(s)

Note that this necessarily outputs a Unicode string, so if you have any non-ASCII bytes in there, you will need to call s.decode(encoding) first.
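For readers on Python 3, where the HTMLParser module no longer exists under that name, a rough equivalent is the documented html.unescape function (available since Python 3.4). The sketch below is not part of the original answer; the helper name unescape_bytes and the utf-8 default are assumptions chosen for illustration. It shows the same decode-first-then-unescape order described above.

```python
import html

def unescape_bytes(raw, encoding="utf-8"):
    """Decode raw bytes to str, then replace HTML entities with characters."""
    # Decoding comes first: html.unescape works on str, not bytes.
    return html.unescape(raw.decode(encoding))

# Named entities and a numeric character reference in one byte string.
print(unescape_bytes(b"a &gt; b &amp; c &#233;"))
```

If the file's real encoding is something other than UTF-8, pass it explicitly, for example unescape_bytes(data, "latin-1").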
