Convert io.BytesIO to io.StringIO to parse an HTML page

Question


I'm trying to parse an HTML page that I fetched with pyCurl, but the pyCurl WRITEFUNCTION callback delivers the page as bytes, not as a string, so I cannot parse it with BeautifulSoup.

Is there a way to convert io.BytesIO to io.StringIO?

Or is there another way to parse an HTML page?

I am using Python 3.3.2.

+13

html type-conversion beautifulsoup stringio pycurl

Shipra Jul 04 '14 at 4:18


2 answers

Wrap the BytesIO in an io.TextIOWrapper, which decodes the bytes to text as they are read:

import io

# Initialize a read buffer
buffer = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
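One detail matters when the buffer was filled by a write callback rather than through the constructor: the stream position is left at the end, so the buffer has to be rewound before it is wrapped. A minimal sketch, assuming a hypothetical `fake_download` helper that stands in for pycurl writing into the buffer (it is not part of pycurl):

```python
import io


def fake_download(buffer):
    """Hypothetical stand-in for pycurl calling buffer.write() as data arrives."""
    buffer.write('<p>line one: ÁÇÊ</p>\n'.encode('utf-8'))
    buffer.write(b'<p>line two</p>\n')


buffer = io.BytesIO()
fake_download(buffer)

# After writing, the position is at the end of the buffer; rewind before reading.
buffer.seek(0)

# TextIOWrapper decodes as it reads, so the text can be consumed line by line.
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')
lines = [line.rstrip('\n') for line in wrapper]
print(lines)
```

Without the `seek(0)` call the wrapper would start reading at the end of the buffer and yield no lines.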

+20

kakarukeys 10 . '18 3:33

Anthony Sottile · Accepted Answer · 2014-07-04T04:35:36+0000

Naive approach:

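# Assume bytes_io is a `BytesIO` object, e.g. one filled by pyCurl.
# getvalue() returns the whole buffer regardless of the stream position;
# read() would return b'' if the position is still at the end after writing.
byte_str = bytes_io.getvalue()

# Convert to a text (str) object
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you a StringIO object if that is what you need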
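The steps above can be put together in a small, self-contained sketch. The helper name `bytes_io_to_string_io` and the `TitleParser` class are hypothetical, introduced only for illustration, and the standard-library `html.parser` is used here in place of BeautifulSoup so the example runs without third-party packages. UTF-8 is assumed as the page encoding.

```python
import io
from html.parser import HTMLParser


def bytes_io_to_string_io(bytes_io, encoding='utf-8'):
    """Hypothetical helper: decode the whole buffer into a new StringIO."""
    # getvalue() returns all bytes regardless of the current stream position.
    return io.StringIO(bytes_io.getvalue().decode(encoding))


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Stand-in for a page body collected by a write callback.
page = io.BytesIO()
page.write('<html><head><title>Olá</title></head><body></body></html>'.encode('utf-8'))

text_io = bytes_io_to_string_io(page)
parser = TitleParser()
parser.feed(text_io.read())
parser.close()
print(parser.title)
```

The decoded string from `text_io.read()` could equally be handed to BeautifulSoup, which accepts text input.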
