I can read from an MSSQL database by sending queries from Python via pypyodbc.
Unicode characters are mostly handled correctly, but I've hit a specific character that causes an error.
The field in question is of type nvarchar(50) and begins with a character that renders for me a bit like this:

    -----
    |100|
    |111|
    -----
If this number is hex 0x100111, then it is the Supplementary Private Use Area-B character U+100111. Interestingly, though, if it is binary 0b100111, then it is an apostrophe (U+0027). Could the wrong encoding have been used when the data was loaded? This field stores part of a Chinese postal address.
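As a quick sanity check of the two readings, here is a small Python 3 sketch (not the Python 2.7 from my environment). The variable names are only illustrative. It also shows that, read as hex, the character lies outside the Basic Multilingual Plane and so takes four bytes in UTF-16.

```python
# Two readings of the digits "100111" shown in the placeholder glyph.
as_hex = 0x100111      # read as hexadecimal
as_binary = 0b100111   # read as binary

# Supplementary Private Use Area-B spans U+100000..U+10FFFD.
in_private_use_b = 0x100000 <= as_hex <= 0x10FFFD

# 0b100111 == 39 == 0x27, which is the ASCII apostrophe.
binary_char = chr(as_binary)

# A code point above U+FFFF is encoded as a surrogate pair in UTF-16,
# i.e. two 16-bit units (four bytes) rather than one.
utf16_bytes = chr(as_hex).encode("utf-16-le")

print(in_private_use_b, hex(as_binary), len(utf16_bytes))
```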
The error message includes:

    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full:
    Traceback (most recent call last):
      File "question.py", line 19, in <module>
        results.fetchone()
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
        value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
        uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)
      File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
        return codecs.utf_16_decode(input, errors, True)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
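The traceback shows UCS_dec decoding buffer.raw[i:i + ucs_length] one slice at a time. One possible explanation, which I have not confirmed, is that the field holds a character outside the Basic Multilingual Plane, so a two-byte slice contains only half of a surrogate pair. The Python 3 sketch below imitates that with a simplified stand-in loop. ODBC_DECODING, UCS_LENGTH and decode_in_chunks are my own illustrative names and values, not pypyodbc's actual code, and U+100111 is just the guess from above.

```python
ODBC_DECODING = "utf-16-le"  # assumption: little-endian UTF-16 from the driver
UCS_LENGTH = 2               # assumption: one 16-bit code unit per slice


def decode_in_chunks(raw):
    """Decode UCS_LENGTH bytes at a time (simplified stand-in, not pypyodbc)."""
    return "".join(
        raw[i:i + UCS_LENGTH].decode(ODBC_DECODING)
        for i in range(0, len(raw), UCS_LENGTH)
    )


# Characters inside the BMP fit in one code unit, so chunked decoding works.
bmp_only = "\u5317\u4eac".encode(ODBC_DECODING)
bmp_result = decode_in_chunks(bmp_only)

# A character above U+FFFF is a surrogate pair; the first two-byte slice is a
# lone high surrogate, which the codec rejects.
non_bmp = chr(0x100111).encode(ODBC_DECODING)
try:
    decode_in_chunks(non_bmp)
    chunk_error = None
except UnicodeDecodeError as exc:
    chunk_error = exc
    print(exc.reason)

# Decoding all four bytes together succeeds.
whole = non_bmp.decode(ODBC_DECODING)
```

If this is what is happening, the data itself may be fine and the failure comes from splitting the surrogate pair before decoding.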
Here's some minimal code that reproduces the problem:

    import pypyodbc

    connection_string = (
        "DSN=sqlserverdatasource;"
        "UID=REDACTED;"
        "PWD=REDACTED;"
        "DATABASE=obi_load")

    connection = pypyodbc.connect(connection_string)
    cursor = connection.cursor()

    # T-SQL compares with a single "=", not "=="
    query_sql = (
        "SELECT address_line_1 "
        "FROM address "
        "WHERE address_id = 'REDACTED' ")

    with cursor.execute(query_sql) as results:
        row = results.fetchone()  # raises UnicodeDecodeError
Here is an excerpt from my /etc/freetds/freetds.conf:

    [global]

I also tried client charset = UTF-16, and omitting the client charset line altogether.
Here's the corresponding snippet from my /etc/odbc.ini
    [sqlserverdatasource]
    Driver = FreeTDS
    Description = ODBC connection via FreeTDS
    Trace = No
    Servername = sqlserver
    Database = REDACTED
Here is the corresponding snippet from my /etc/odbcinst.ini
    [FreeTDS]
    Description = TDS Driver (Sybase/MS SQL)
    Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
    Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
    CPTimeout =
    CPReuse =
    UsageCount = 1
I can work around this problem by fetching the results inside a try/except block and discarding any rows that raise a UnicodeDecodeError, but is there a proper solution? Can I discard only the undecodable character, or is there a way to fetch this string without raising the error?
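One idea I am considering, sketched below in Python 3, is to have SQL Server return the raw bytes of the column and decode them myself with an error handler, so that only the bad code unit is replaced or dropped. This is untested against my setup: it assumes pypyodbc with FreeTDS hands back a VARBINARY column as a bytes-like value, and that the nvarchar data is UTF-16 little-endian. QUERY_SQL and decode_nvarchar_bytes are hypothetical names, and the byte strings in the demonstration are made up.

```python
# Hypothetical query: cast to VARBINARY so the driver does not decode the text.
# nvarchar(50) holds up to 50 two-byte code units, hence VARBINARY(100).
QUERY_SQL = (
    "SELECT CAST(address_line_1 AS VARBINARY(100)) "
    "FROM address "
    "WHERE address_id = 'REDACTED'"
)


def decode_nvarchar_bytes(raw, errors="replace"):
    """Decode UTF-16-LE bytes; 'replace' substitutes U+FFFD, 'ignore' drops."""
    return bytes(raw).decode("utf-16-le", errors)


# Stand-alone demonstration with made-up bytes (no database required).
valid = "\u5317\u4eac".encode("utf-16-le")
lone_surrogate = b"\xc0\xdb"                      # high surrogate with no partner
mixed = lone_surrogate + "A".encode("utf-16-le")  # bad unit followed by good text

decoded_valid = decode_nvarchar_bytes(valid)
decoded_replaced = decode_nvarchar_bytes(lone_surrogate)
decoded_ignored = decode_nvarchar_bytes(lone_surrogate, errors="ignore")
decoded_mixed = decode_nvarchar_bytes(mixed)
```

With a real cursor, the call would presumably look like decode_nvarchar_bytes(row[0]) after results.fetchone(), but I have not verified what type the driver returns for that column.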
It is possible that some bad data ended up in the database.
I have searched Google and checked related questions, but had no luck.