What is the reason for this UnicodeDecodeError with nvarchar field using pyodbc and MSSQL?

I can read from an MSSQL database by sending queries from Python via pypyodbc.

Unicode characters are mostly handled correctly, but I have hit one specific character that causes an error.

The field in question is of type nvarchar(50) and starts with this “􀄑” character, which renders in my terminal as a box a bit like this:

 ----- |100| |111| ----- 

If that number is hex, 0x100111 is the Supplementary Private Use Area-B code point U+100111. Interestingly, if it is binary, 0b100111 is 39, an apostrophe, so perhaps the wrong encoding was used when the data was loaded. This field stores part of a Chinese mailing address.
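Both readings are easy to check in Python (a quick sanity check, not part of the original investigation):

```python
# Read as hex, 0x100111 is a code point inside Supplementary
# Private Use Area-B (U+100000..U+10FFFD).
assert 0x100000 <= 0x100111 <= 0x10FFFD

# Read as binary, 0b100111 is 39, the ASCII apostrophe.
assert 0b100111 == 39
assert chr(0b100111) == "'"
```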

The error message includes

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full:

    Traceback (most recent call last):
      File "question.py", line 19, in <module>
        results.fetchone()
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
        value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
        uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)
      File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
        return codecs.utf_16_decode(input, errors, True)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
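For what it is worth, this error is reproducible without a database. If the field really does contain U+100111, UTF-16 stores it as a four-byte surrogate pair, and a decoder that walks the buffer two bytes at a time (as `UCS_dec` in the traceback above appears to) chokes on the lone high surrogate. A purely illustrative sketch, in Python 3 syntax:

```python
# U+100111 encodes in UTF-16-LE as the surrogate pair 0xDBC0 0xDD11.
pair = "\U00100111".encode("utf-16-le")
assert pair == b"\xc0\xdb\x11\xdd"

# All four bytes together decode fine.
assert pair.decode("utf-16-le") == "\U00100111"

# Decoding only the first two bytes (a lone high surrogate) fails with
# the same message seen in the traceback.
try:
    pair[:2].decode("utf-16-le")
except UnicodeDecodeError as exc:
    print(exc.reason)  # unexpected end of data
```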

Here's some minimal reproduction code:

    import pypyodbc

    connection_string = (
        "DSN=sqlserverdatasource;"
        "UID=REDACTED;"
        "PWD=REDACTED;"
        "DATABASE=obi_load")

    connection = pypyodbc.connect(connection_string)
    cursor = connection.cursor()
    query_sql = (
        "SELECT address_line_1 "
        "FROM address "
        "WHERE address_id = 'REDACTED' ")  # '=' here: SQL has no '==' operator
    with cursor.execute(query_sql) as results:
        row = results.fetchone()  # This is the line that raises the error.
        print row

Here is a piece of my /etc/freetds/freetds.conf

    [global]
    ;   tds version = 4.2
    ;   dump file = /tmp/freetds.log
    ;   debug flags = 0xffff
    ;   timeout = 10
    ;   connect timeout = 10
        text size = 64512

    [sqlserver]
        host = REDACTED
        port = 1433
        tds version = 7.0
        client charset = UTF-8

I also tried client charset = UTF-16, and omitting the line altogether.

Here's the corresponding snippet from my /etc/odbc.ini

    [sqlserverdatasource]
    Driver = FreeTDS
    Description = ODBC connection via FreeTDS
    Trace = No
    Servername = sqlserver
    Database = REDACTED

Here is the corresponding snippet from my /etc/odbcinst.ini

    [FreeTDS]
    Description = TDS Driver (Sybase/MS SQL)
    Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
    Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
    CPTimeout =
    CPReuse =
    UsageCount = 1

I can work around this problem by wrapping the fetch in a try/except block and discarding any rows that raise a UnicodeDecodeError, but is there a proper solution? Can I discard just the one undecodable character, or is there some way to retrieve the string without raising the error?
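The try/except workaround mentioned above looks roughly like this. It is an illustrative sketch only: `FakeCursor` is a hypothetical stand-in (not part of any library) that simulates a pypyodbc cursor whose second row fails to decode, and the approach assumes the real cursor still advances past the bad row when decoding fails:

```python
class FakeCursor(object):
    """Hypothetical stand-in for a pypyodbc cursor with one undecodable row."""
    def __init__(self, rows):
        self._rows = iter(rows)

    def fetchone(self):
        row = next(self._rows, None)
        if row == ("BAD",):
            # Simulate pypyodbc failing to decode the nvarchar field.
            raise UnicodeDecodeError(
                "utf16", b"\xc0\xdb", 0, 2, "unexpected end of data")
        return row

def fetch_rows_skipping_undecodable(cursor):
    """Collect rows one at a time, discarding any that fail to decode."""
    rows = []
    while True:
        try:
            row = cursor.fetchone()
        except UnicodeDecodeError:
            continue  # discard this row and keep going
        if row is None:
            break
        rows.append(row)
    return rows

results = FakeCursor([("ok1",), ("BAD",), ("ok2",)])
print(fetch_rows_skipping_undecodable(results))  # [('ok1',), ('ok2',)]
```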

It is possible that some bad data ended up in the database.

I have Googled and checked related questions, but had no luck.

2 answers

This problem was eventually worked around. I suspect the cause was that text in one encoding had been crammed into a field with a different declared encoding by some hacky method when the table was set up.


I fixed the problem myself by adding:

 conn.setencoding('utf-8') 

immediately before creating the cursor.

Where conn is the connection object.

I was pulling tens of millions of rows with fetchall(), in the middle of a transaction that would have been extremely expensive to roll back manually, so I could not afford to simply skip the invalid ones.

Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456


Source: https://habr.com/ru/post/1247360/
