Python string file init and strange characters

Question

I have a huge gzipped text file that I need to read line by line. I use the following:

import codecs
import gzip

for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
    print i, line

At some point near the end of the file, Python's output diverges from the file's contents: lines get broken at weird special characters that Python apparently treats as newlines. When I open the file in vim the lines are correct, although the suspicious characters are displayed oddly. Is there something I can do to fix this?

I tried other codecs, including utf-16 and latin-1. I also tried without a codec.

I looked at the file using 'od'. Sure enough, there are \n characters where they should not be, but the "wrong" ones are preceded by a strange character. I think some encoding is involved in which some characters take 2 bytes, and the trailing byte looks like \n when it is not interpreted properly.

According to "od -h file", the offending character is "1d1c".

If I replace:

gzip.open('file.gz')

with

os.popen('zcat file.gz')

It works great (and is actually pretty fast). But I would like to know where I am going wrong.

python gzip line-breaks codec

muckabout Apr 29 '10 at 13:57

2 answers

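What exactly are the "weird special characters"? Please show them with print repr(weird_special_characters). When you open the file in vim, what do you actually see? "Formatted weirdly" is not much to go on.

What did you run od on? file.gz? If so, that is the gzip-compressed data, where any byte value can turn up, including 0x0A.

Is the file really utf-8, and does the zcat output decode as utf8? Rather than describing the characters, show them using repr().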

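As DS's answer shows, the culprits are \x1c and \x1d.

Here is what is going on:

In a byte (ASCII) string, only \r and \n are treated as line breaks: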

>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 'A\x0bA\x0cA\r', # line break
 'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>

In a unicode string, however, \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR) and \x1E (RECORD SEPARATOR) are also treated as line breaks:

>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 u'A\x0bA\x0cA\r', # line break
 u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
 u'A\x1d', # line break
 u'A\x1e', # line break
 u'A\x1fBBB']
>>>

So a codec reader, which splits lines after decoding to unicode, breaks at \x1c and \x1d, while reading the raw bytes does not.
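The same difference can still be checked on Python 3, where bytes and str play the roles of the byte and unicode strings above. This is only a minimal sketch under that assumption (the session above is Python 2); the helper name splits_line and the list of code points are made up for illustration.

```python
# Python 3 sketch: which control characters count as line boundaries
# for bytes.splitlines() versus str.splitlines().
SEPARATORS = [0x0A, 0x0D, 0x1C, 0x1D, 0x1E, 0x1F]


def splits_line(code, as_text):
    """Return True if splitlines() breaks 'A<code>B' into two pieces."""
    sample = bytes([0x41, code, 0x42])  # b'A', the separator, b'B'
    if as_text:
        # Decode first, then split: this is what a text/codec reader sees.
        return len(sample.decode("ascii").splitlines()) == 2
    # Split the raw bytes: only \n and \r are boundaries here.
    return len(sample.splitlines()) == 2


for code in SEPARATORS:
    print("0x%02X  bytes: %-5s  text: %s"
          % (code, splits_line(code, False), splits_line(code, True)))
```

In this sketch \n and \r split in both cases, \x1c, \x1d and \x1e split only after decoding to text, and \x1f splits in neither.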

John Machin 30 . '10 2:28

DS. · Accepted Answer · 2010-05-02T04:20:28+0000

Try again without a codec. The following reproduces your problem when a codec is used, and shows that the problem goes away without one:

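import gzip
import os
import codecs

# Write a small gzipped test file with \x1d and \x1c ahead of the real newline.
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()

# With a codec reader: lines are also broken at \x1d and \x1c.
print list(codecs.getreader('utf-8')(gzip.open('file.gz')))
# Without a codec: lines are broken only at \n.
print list(os.popen('zcat file.gz'))
print list(gzip.open('file.gz'))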

Outputs:

[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
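If decoded text is still needed, one possible approach is to let the gzip file object split lines on the raw bytes and decode each line afterwards. The sketch below is written for Python 3 (an assumption, since the code in this thread is Python 2); read_lines and sample.gz are hypothetical names, and the data is the same test string as above.

```python
import gzip
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Yield decoded lines, split only where the binary reader splits (b'\\n')."""
    with gzip.open(path, "rb") as handle:
        for raw in handle:
            # raw is a bytes line ending at \n; decoding afterwards keeps
            # \x1c and \x1d inside the line instead of splitting on them.
            yield raw.decode(encoding)


# Build a small test file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as out:
    out.write(b"foo\x1d\x1cbar\nbaz")

lines = list(read_lines(path))
print(lines)
```

Decoding line by line is safe for utf-8 because the byte 0x0A never occurs inside a multi-byte sequence; it would not be safe for utf-16.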
