General way to open a (possibly gzipped) file with a specific text encoding in Python

Question


I am writing code that opens a (possibly gzipped) text file, and it needs to work in both Python 2 and Python 3.

If I only had regular text files (not compressed), I could do:

    import io

    for line in io.open(file_name, encoding='some_encoding'):
        pass

If I didn't want to decode (working with plain strings in Python 2 or bytes in Python 3), I could do:

    import gzip

    if file_name.endswith('.gz'):
        file_obj = gzip.open(file_name)
    else:
        file_obj = open(file_name)

    for line in file_obj:
        pass

How can I handle all of these cases cleanly? In other words, how can I seamlessly integrate decoding with gzip.open()?


python encoding gzip

Peter Smit Sep 19 '12 at 10:15


1 answer

Matti lyra · Accepted Answer · 2012-09-19T10:33:12+0000

I looked into this briefly, and it seemed that it would work. Both gzip.GzipFile and io.open accept an already-open file (gzip.GzipFile through its fileobj argument, io.open through a file descriptor), so:

    import io
    import gzip

    f_obj = open('file.gz', 'r')
    io_obj = io.open(f_obj.fileno(), encoding='UTF-8')
    gzip_obj = gzip.GzipFile(fileobj=io_obj, mode='r')
    gzip_obj.read()

This gives me a UnicodeDecodeError because the file I am reading is not really UTF-8, so it seemed to be doing the right thing.

For some reason, if I use io.open to open file.gz directly, gzip says the file is not a compressed file.

UPDATE: Yes, that was silly; the streams were the wrong way around from the start. The file has to be decompressed first and decoded second, not the other way round.
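To make the ordering concrete, here is a minimal sketch that builds a small gzip payload in memory (a stand-in for file.gz, not the original test file) and tries both orders. The variable names are illustrative assumptions.

```python
import gzip
import io

text = u'\xf6\n\xe4\nu\ny'  # the characters ö, ä, u, y on separate lines

# Build a small gzip payload in memory (stand-in for file.gz).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(text.encode('utf-8'))
compressed = buf.getvalue()

# Wrong order: the compressed bytes are not valid UTF-8 text.
try:
    compressed.decode('utf-8')
    wrong_order_failed = False
except UnicodeDecodeError:
    wrong_order_failed = True

# Right order: decompress to bytes first, then decode those bytes.
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode='rb') as gz:
    decoded = gz.read().decode('utf-8')
```

Decoding the compressed bytes fails regardless of the text's real encoding, which is why the UnicodeDecodeError in the first attempt was misleading.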

Test file:

    ö
    ä
    u
    y

The following code decodes a compressed file with a specific codec:

    import codecs
    import gzip

    gz_fh = gzip.open('file.gz')
    ascii = codecs.getreader('ASCII')
    utf8 = codecs.getreader('UTF-8')
    ascii_fh = ascii(gz_fh)
    utf8_fh = utf8(gz_fh)

    ascii_fh.readlines()
    # -> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

    gz_fh.seek(0)  # rewind: both readers share gz_fh, and the failed read consumed it
    utf8_fh.readlines()
    # -> [u'\xf6\n', u'\xe4\n', u'u\n', u'y']

codecs.StreamReader takes a stream, so you can pass it either a compressed file object (from gzip.open) or an uncompressed one (from open).

http://docs.python.org/library/codecs.html#codecs
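Putting the pieces together, one possible way to cover both cases from the question is a small helper that picks the byte stream by file extension and wraps it in a codecs reader. This is only a sketch: the name open_text and the sample file names are hypothetical, and it assumes that a '.gz' suffix reliably identifies compressed files.

```python
import codecs
import gzip
import os
import tempfile


def open_text(file_name, encoding):
    """Hypothetical helper: decoding reader for a plain or gzipped file."""
    if file_name.endswith('.gz'):
        raw = gzip.open(file_name, 'rb')  # yields decompressed bytes
    else:
        raw = open(file_name, 'rb')       # yields the raw bytes
    # Wrap the byte stream in a StreamReader that decodes on the fly.
    return codecs.getreader(encoding)(raw)


# Usage: write the same text to a plain and a gzipped file, then read both.
text = u'\xf6\n\xe4\nu\ny'
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, 'sample.txt')
gz_path = os.path.join(tmp_dir, 'sample.txt.gz')

with open(plain_path, 'wb') as fh:
    fh.write(text.encode('utf-8'))
with gzip.open(gz_path, 'wb') as fh:
    fh.write(text.encode('utf-8'))

plain_reader = open_text(plain_path, 'utf-8')
plain_lines = plain_reader.readlines()
plain_reader.close()  # close() is forwarded to the underlying file

gz_reader = open_text(gz_path, 'utf-8')
gz_lines = gz_reader.readlines()
gz_reader.close()
```

Both files are opened in binary mode so that the codecs reader is the only layer doing any decoding. If Python 2 support is not needed, gzip.open(file_name, 'rt', encoding=...) on Python 3.3 and later may be a simpler route.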
