You have a damaged data file. If this character is really meant to be U+00AD SOFT HYPHEN, you are missing the 0xC2 byte:
>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen makes the most sense. However, this may indicate a data set that is missing other bytes too; you may just have happened upon the one that matters.
I'd go back to the source of this data set and verify that the file was not corrupted when it was downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
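To illustrate what errors='replace' buys you (using a short hypothetical byte string standing in for the damaged data), the bad byte is swapped for U+FFFD REPLACEMENT CHARACTER instead of raising:

```python
# A lone 0xAD byte is not valid UTF-8: it is a continuation byte
# with no lead byte in front of it.
damaged = b'NON\xadCASH'

# The default errors='strict' raises UnicodeDecodeError:
try:
    damaged.decode('utf8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte

# errors='replace' substitutes U+FFFD for the undecodable byte:
print(damaged.decode('utf8', errors='replace'))  # NON�CASH
```

The replacement character marks where data was lost, so you can still grep for affected rows afterwards.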
Another possibility is that the SEC is actually using a different encoding for the file; for example, in Windows Codepage 1252 and in Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same data set directly (warning, large ZIP file) and open tag.txt, I cannot decode the data as UTF-8:
>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')
There are two such non-ASCII characters in the file:
>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']
Hotel Kranichh\xf6he is Latin-1 for Hotel Kranichhöhe.
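A quick sketch of why Latin-1 succeeds where UTF-8 fails on that fragment: Latin-1 maps every byte 0x00-0xFF directly to the code point of the same value, so it can never raise, while 0xF6 is an invalid sequence start in UTF-8 without the right continuation bytes:

```python
raw = b'Hotel Kranichh\xf6he'

# Latin-1 is a straight byte-to-code-point mapping, so this always works:
print(raw.decode('latin-1'))  # Hotel Kranichhöhe

# The same bytes are not valid UTF-8 (0xF6 would need continuation bytes):
try:
    raw.decode('utf8')
except UnicodeDecodeError:
    print('not valid UTF-8')
```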
The file also contains several 0x1C / 0x1D pairs:
>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'
I am willing to bet that these are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost looks as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!
There is no codec shipped with Python that would encode '\u201C\u201D' to b'\x1C\x1D', which makes it all the more likely that the SEC botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters, which are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes, which are almost certainly single quotes (U+2019). All that is missing to complete the picture are 0x18 bytes to represent U+2018.
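The stripped-high-byte theory is easy to check. As a sketch (not the SEC's actual pipeline), encode curly quotes as little-endian UTF-16 and keep only the low byte of each 16-bit unit; exactly the observed 0x1C/0x1D bytes fall out:

```python
quoted = '\u201cTCCA\u201d'  # “TCCA”

# UTF-16-LE stores each BMP character as (low byte, high byte):
utf16 = quoted.encode('utf-16-le')

# Drop every high byte, keeping only the low bytes:
mangled = bytes(utf16[::2])
print(mangled)  # b'\x1cTCCA\x1d'
```

ASCII characters survive this mangling unchanged because their high byte is zero, which is exactly why the rest of the file looks fine.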
Assuming the encoding is broken in this way, we can attempt to repair it. The following code reads the file and fixes the quote problems, assuming that the rest of the data does not use any characters outside of Latin-1 apart from the quotes:
_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)
then apply this to the lines read:
with open(filename, 'r', encoding='latin-1') as f:
    repaired = map(repair, f)
    fields = next(repaired).strip().split('\t')
    for line in repaired:
        yield process_tag_record(fields, line)
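For instance, applied to the quoted fragment seen in the file earlier (the _map / repair definitions are restated here so the sketch is self-contained):

```python
_map = {
    0x13: '\u2013', 0x14: '\u2014',   # dashes
    0x18: '\u2018', 0x19: '\u2019',   # single quotes
    0x1c: '\u201c', 0x1d: '\u201d',   # double quotes
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1."""
    # str.translate accepts a mapping from code point ordinals to
    # replacement strings, so the dict above works directly.
    return line.translate(_map)

line = b'Act of 2011 (\x1cTCCA\x1d) recognized'.decode('latin-1')
print(repair(line))  # Act of 2011 (“TCCA”) recognized
```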
Separately, addressing the code you posted: you are making Python work harder than necessary. Don't use codecs.open(); that is legacy code with known issues, and it is slower than the new Python 3 I/O layer. Just use open(). Don't use f.readlines() either; you don't need to read the whole file into a list here. Just iterate over the file directly:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)
If process_tag_record also splits the line on tabs, use a csv.reader() object instead of splitting each line manually:
import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)
If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for row in reader:
            yield process_tag_record(row)
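To see what DictReader hands you per row, here is a tiny self-contained sketch; the header names and values are made up for illustration and are not the real tag.txt columns:

```python
import csv
import io

# A stand-in for tag.txt: tab-separated, header row first.
sample = 'tag\tversion\tcustom\nAssets\tus-gaap/2016\t0\n'

reader = csv.DictReader(io.StringIO(sample), delimiter='\t')
for row in reader:
    # Each row is a dict keyed by the header fields, so the
    # fields-to-values zipping is done for you.
    print(row['tag'], row['custom'])  # Assets 0
```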