Python line-by-line file iteration and strange characters

I have a huge gzipped text file that I need to read line by line. I use the following:

import codecs
import gzip
for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
  print i, line

At some point late in the file, the Python output diverges from the file. This is because lines get broken at weird special characters that Python treats as newlines. When I open the file in "vim" the lines are correct, but the suspect characters are displayed weirdly. Is there something I can do to fix this?

I tried other codecs, including utf-16 and latin-1. I also tried without a codec.

I looked at the file using 'od'. Sure enough, there are \n characters where they should not be. But the "wrong" ones are preceded by a strange character. I think there is some encoding here in which some characters take 2 bytes, with the trailing byte looking like \n when it is not interpreted properly.

According to "od -h file", the offending character is "1d1c".

If I replace:

gzip.open('file.gz')

with

os.popen('zcat file.gz')

it works great (and is actually pretty fast). But I would like to know where I am going wrong.

2 answers

Try again without a codec. The following reproduces your problem when a codec is used, and shows that the problem goes away without one:

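import codecs
import gzip
import os

# Write a small test file with \x1d and \x1c inside the first line.
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()

print list(codecs.getreader('utf-8')(gzip.open('file.gz')))  # codec reader
print list(os.popen('zcat file.gz'))  # no codec, via zcat
print list(gzip.open('file.gz'))  # no codec, via gzip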

Outputs:

[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
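
If decoded text is still needed, one option is to iterate over the raw byte lines and decode each line afterwards. The following is a minimal sketch for Python 3 (the snippets above are Python 2), assuming the file is UTF-8 and that records are delimited by \n only. The helper name iter_text_lines is hypothetical, not part of any library. Decoding line by line is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence.

```python
import gzip


def iter_text_lines(path, encoding='utf-8'):
    """Yield decoded lines from a gzipped file, splitting on b'\\n' only.

    Hypothetical helper: the binary file object splits lines on b'\\n',
    so \\x1c and \\x1d stay inside the line they belong to.
    """
    with gzip.open(path, 'rb') as raw:
        for raw_line in raw:
            yield raw_line.decode(encoding)
```

Usage would look like: for i, line in enumerate(iter_text_lines('file.gz')): print(i, line)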

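It would help to see what the "weird special characters" actually are: print repr(weird_special_characters). When you open the file in vim, what do you see? "Formatted weirdly" does not tell us much :-(

You looked at the file using od? At file.gz?? If so, what you saw was not your text but the compressed gzip stream! Those are not newlines, just bytes that happen to be 0x0A.

If the file is encoded in utf-8, why try other codecs?

Does "works with zcat" mean that you got the expected result without a utf8 decoding step?

... In any case, please show the offending data using repr().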
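
To see exactly which bytes are involved, one possibility is to scan the decompressed data and print the repr() of every line that contains a separator control character. This is only a sketch for Python 3; the name find_separator_lines and the choice of the three bytes \x1c, \x1d and \x1e are assumptions made for illustration.

```python
import gzip
import re

# FS, GS and RS: control characters that unicode line splitting treats
# as line boundaries (assumed here to be the characters of interest).
SEPARATORS = re.compile(rb'[\x1c\x1d\x1e]')


def find_separator_lines(path):
    """Yield (line_number, raw_line) for lines containing FS/GS/RS bytes."""
    with gzip.open(path, 'rb') as raw:
        for number, line in enumerate(raw):
            if SEPARATORS.search(line):
                yield number, line
```

Usage would look like: for number, line in find_separator_lines('file.gz'): print(number, repr(line))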

It looks like, as DS points out, the characters involved are \x1c and \x1d.

Here is why they matter:

For ASCII (byte) strings, only \r and \n are treated as line breaks:

>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 'A\x0bA\x0cA\r', # line break
 'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>

For unicode strings, however, \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR) and \x1E (RECORD SEPARATOR) are also treated as line breaks:

>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 u'A\x0bA\x0cA\r', # line break
 u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
 u'A\x1d', # line break
 u'A\x1e', # line break
 u'A\x1fBBB']
>>>

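This happens whichever codec you use, because the splitting is done on the decoded unicode text. If you need decoded text, iterate over the raw lines (no codec) and decode each line afterwards. That way, \x1c and \x1d will not break your lines.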
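
For readers on Python 3 (the sessions above are Python 2), the same distinction can be sketched as follows: str.splitlines() breaks on the separator characters, bytes.splitlines() does not, and splitting a str on '\n' explicitly sidesteps the issue. The helper split_on_newline_only is a hypothetical name used only for this illustration.

```python
def split_on_newline_only(text):
    """Split text into lines on '\\n' only, keeping the line endings."""
    parts = text.split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        # Trailing data without a final newline is kept as the last line.
        lines.append(parts[-1])
    return lines


text = 'A\x1cB\x1dC\x1eD\nE'

# str: FS, GS and RS count as line boundaries.
unicode_lines = text.splitlines(True)

# bytes: only \n, \r and \r\n count as line boundaries.
byte_lines = text.encode('ascii').splitlines(True)

# str split on '\n' only.
newline_lines = split_on_newline_only(text)
```

Here unicode_lines has five elements, while byte_lines and newline_lines each have two.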


Source: https://habr.com/ru/post/1743359/

