Python 3: how do I debug a UnicodeDecodeError?

I have a text file that its publisher (the U.S. Securities and Exchange Commission) asserts is encoded in UTF-8 ( https://www.sec.gov/files/aqfs.pdf , section 4). I process the lines with the following code:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = f.readline().strip().split('\t')
            for line in f.readlines():
                yield process_tag_record(fields, line)

I get the following error:

    Traceback (most recent call last):
      File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
        main()
      File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
        all_tags = list(tags("tag.txt"))
      File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
        content = f.read()
      File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
        return self.reader.read(size)
      File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go back to the SEC and tell them that their files don't appear to be encoded in UTF-8, how do I debug and catch this error?

What I tried

I did a hexdump of the file and found that the offending text was "SUPPLEMENTAL DISCLOSURE OF NON<0xAD>CASH INVESTING". If I decode the offending byte as a hex code point (i.e., U+00AD), it makes sense in context, since it is a soft hyphen. But the following does not seem to work:

    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b"\x41".decode("utf-8")
    'A'
    >>> b"\xad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
    >>> b"\xc2ad".decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I used errors='replace', which seems to get through. But I'd like to understand what happens if I then try to insert that data into the database.
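For example, decoding a sample with errors='replace' shows that each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER, so that is the character that would end up in the database:

    >>> b"NON\xadCASH".decode("utf-8", errors="replace")
    'NON�CASH'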

Edited to add hexdump:

    0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
    0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
    0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
    0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
    0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|
1 answer

You have a corrupted data file. If that character is really meant to be U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

    >>> '\u00ad'.encode('utf8')
    b'\xc2\xad'

Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, this is indicative of a data set that may have other bytes missing as well; you just happened to hit upon the one that matters.

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable workaround, provided no delimiters (tabs, newlines, etc.) are missing.
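If you want to locate and inspect such damage before deciding, the exception itself records where the failure occurred. A minimal sketch (the filename is illustrative):

    # Read the raw bytes and attempt a strict decode; on failure,
    # UnicodeDecodeError carries the offset of the bad byte in .start.
    # This only reports the first failure; loop from exc.end to find more.
    data = open('tag.txt', 'rb').read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('bad byte 0x%02x at offset %d' % (data[exc.start], exc.start))
        print(data[exc.start - 30:exc.end + 30])  # show surrounding context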

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows Codepage 1252 and in Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file) and open tag.txt, I can't decode the data as UTF-8:

    >>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../lib/python3.6/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
    >>> from pprint import pprint
    >>> f = open('/tmp/2017q1/tag.txt', 'rb')
    >>> f.seek(3583550)
    3583550
    >>> pprint(f.read(100))
    (b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
     b'CTIVITIES:\t\nProceedsFromSaleOfIn')

There are two such non-ASCII characters in the file:

    >>> f.seek(0)
    0
    >>> pprint([l for l in f if any(b > 127 for b in l)])
    [b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
     b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
     b'NVESTING AND FINANCING ACTIVITIES:\t\n',
     b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
     b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
     b'e.\n']

Hotel Kranichh\xf6he, decoded as Latin-1, reads as Hotel Kranichhöhe.
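That reading is easy to confirm interactively:

    >>> b'Hotel Kranichh\xf6he'.decode('latin-1')
    'Hotel Kranichhöhe'
    >>> import unicodedata
    >>> unicodedata.name('\xf6')
    'LATIN SMALL LETTER O WITH DIAERESIS'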

The file also has several 0x1C / 0x1D pairs:

    >>> f.seek(0)
    0
    >>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
    >>> quotes[0].split(b'\t')[-1][50:130]
    b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
    >>> quotes[1].split(b'\t')[-1][50:130]
    b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm willing to bet that those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost looks as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!

There is no codec shipping with Python that encodes '\u201C\u201D' to b'\x1C\x1D', which makes it all the more likely that the SEC botched its encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
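The theory is easy to demonstrate: encoding the curly quotes as UTF-16-BE and keeping only every second (low) byte yields exactly the bytes found in the file. Of course, this only shows the outputs match, not that this is literally what the SEC's pipeline did:

    >>> '\u201cTCCA\u201d'.encode('utf-16-be')[1::2]
    b'\x1cTCCA\x1d'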

If you assume the encoding is broken, you can attempt to repair it. The following code would read the file and fix the quote issues, assuming the rest of the data does not use characters outside of Latin-1 apart from the quotes:

    _map = {
        # dashes
        0x13: '\u2013', 0x14: '\u2014',
        # single quotes
        0x18: '\u2018', 0x19: '\u2019',
        # double quotes
        0x1c: '\u201c', 0x1d: '\u201d',
    }

    def repair(line, _map=_map):
        """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1."""
        return line.translate(_map)
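For example, applied to one of the broken samples found above:

    >>> repair('(\x1cTCCA\x1d)')
    '(“TCCA”)'

Note that str.translate() accepts a mapping of Unicode ordinals to replacement strings, which is why _map uses integer keys.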

You can then apply this to the lines as you read them:

    with open(filename, 'r', encoding='latin-1') as f:
        repaired = map(repair, f)
        fields = next(repaired).strip().split('\t')
        for line in repaired:
            yield process_tag_record(fields, line)

Separately, addressing your posted code: you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code with known issues, and it is slower than the newer Python 3 I/O layer. Just use open(). Don't use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            fields = next(f).strip().split('\t')
            for line in f:
                yield process_tag_record(fields, line)

If process_tag_record splits on tabs as well, use a csv.reader() object and avoid splitting each line manually:

    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            reader = csv.reader(f, delimiter='\t')
            fields = next(reader)
            for row in reader:
                yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

    def tags(filename):
        """Yield Tag instances from tag.txt."""
        with open(filename, 'r', encoding='utf-8', errors='strict') as f:
            # The first row is used as the keys for the dictionaries,
            # so there is no need to read fields manually.
            reader = csv.DictReader(f, delimiter='\t')
            yield from reader
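And if you need the repair step from above, the two combine naturally, since csv readers accept any iterable of lines. A sketch, assuming the repair() function defined earlier:

    import csv

    def tags(filename):
        """Yield Tag instances from tag.txt, fixing the mangled quote bytes."""
        with open(filename, 'r', encoding='latin-1') as f:
            # map() lazily applies repair() to each line; csv.DictReader
            # accepts any iterator that yields strings.
            reader = csv.DictReader(map(repair, f), delimiter='\t')
            yield from reader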
