What is the reason for this UnicodeDecodeError with nvarchar field using pyodbc and MSSQL?

I can read from an MSSQL database by sending queries from Python via pypyodbc.

Unicode characters are mostly handled correctly, but I have hit one specific character that causes an error.

The field in question is of type nvarchar(50) and starts with this “􀄑” character, which renders in my terminal as a box a bit like this:

 ----- |100| |111| ----- 

If that number is hex, 0x100111 is the Supplementary Private Use Area-B code point U+100111. Interestingly, if it is binary, 0b100111 is 39, an apostrophe, so perhaps the wrong encoding was used when the data was loaded. This field stores part of a Chinese mailing address.
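Both readings are easy to check in Python (a quick sanity check, not part of the original investigation):

```python
# Read as hex, 0x100111 is a code point inside Supplementary
# Private Use Area-B (U+100000..U+10FFFD).
assert 0x100000 <= 0x100111 <= 0x10FFFD

# Read as binary, 0b100111 is 39, the ASCII apostrophe.
assert 0b100111 == 39
assert chr(0b100111) == "'"
```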

The error message includes

UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data

Here it is in full:

    Traceback (most recent call last):
      File "question.py", line 19, in <module>
        results.fetchone()
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 1869, in fetchone
        value_list.append(buf_cvt_func(from_buffer_u(alloc_buffer)))
      File "/VIRTUAL_ENVIRONMENT_DIR/local/lib/python2.7/site-packages/pypyodbc.py", line 482, in UCS_dec
        uchar = buffer.raw[i:i + ucs_length].decode(odbc_decoding)
      File "/VIRTUAL_ENVIRONMENT_DIR/lib/python2.7/encodings/utf_16.py", line 16, in decode
        return codecs.utf_16_decode(input, errors, True)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
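For what it is worth, this error is reproducible without a database. If the field really does contain U+100111, UTF-16 stores it as a four-byte surrogate pair, and a decoder that walks the buffer two bytes at a time (as `UCS_dec` in the traceback above appears to) chokes on the lone high surrogate. A purely illustrative sketch, in Python 3 syntax:

```python
# U+100111 encodes in UTF-16-LE as the surrogate pair 0xDBC0 0xDD11.
pair = "\U00100111".encode("utf-16-le")
assert pair == b"\xc0\xdb\x11\xdd"

# All four bytes together decode fine.
assert pair.decode("utf-16-le") == "\U00100111"

# Decoding only the first two bytes (a lone high surrogate) fails with
# the same message seen in the traceback.
try:
    pair[:2].decode("utf-16-le")
except UnicodeDecodeError as exc:
    print(exc.reason)  # unexpected end of data
```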

Here's some minimal reproduction code:

    import pypyodbc

    connection_string = (
        "DSN=sqlserverdatasource;"
        "UID=REDACTED;"
        "PWD=REDACTED;"
        "DATABASE=obi_load")

    connection = pypyodbc.connect(connection_string)
    cursor = connection.cursor()
    query_sql = (
        "SELECT address_line_1 "
        "FROM address "
        "WHERE address_id = 'REDACTED' ")  # '=' here: SQL has no '==' operator
    with cursor.execute(query_sql) as results:
        row = results.fetchone()  # This is the line that raises the error.
        print row

Here is a piece of my /etc/freetds/freetds.conf

    [global]
    ;   tds version = 4.2
    ;   dump file = /tmp/freetds.log
    ;   debug flags = 0xffff
    ;   timeout = 10
    ;   connect timeout = 10
        text size = 64512

    [sqlserver]
        host = REDACTED
        port = 1433
        tds version = 7.0
        client charset = UTF-8

I also tried client charset = UTF-16, and omitting the line altogether.

Here's the corresponding snippet from my /etc/odbc.ini

    [sqlserverdatasource]
    Driver = FreeTDS
    Description = ODBC connection via FreeTDS
    Trace = No
    Servername = sqlserver
    Database = REDACTED

Here is the corresponding snippet from my /etc/odbcinst.ini

    [FreeTDS]
    Description = TDS Driver (Sybase/MS SQL)
    Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
    Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
    CPTimeout =
    CPReuse =
    UsageCount = 1

I can work around this problem by wrapping the fetch in a try/except block and discarding any rows that raise a UnicodeDecodeError, but is there a proper solution? Can I discard just the one undecodable character, or is there some way to retrieve the string without raising the error?
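The try/except workaround mentioned above looks roughly like this. It is an illustrative sketch only: `FakeCursor` is a hypothetical stand-in (not part of any library) that simulates a pypyodbc cursor whose second row fails to decode, and the approach assumes the real cursor still advances past the bad row when decoding fails:

```python
class FakeCursor(object):
    """Hypothetical stand-in for a pypyodbc cursor with one undecodable row."""
    def __init__(self, rows):
        self._rows = iter(rows)

    def fetchone(self):
        row = next(self._rows, None)
        if row == ("BAD",):
            # Simulate pypyodbc failing to decode the nvarchar field.
            raise UnicodeDecodeError(
                "utf16", b"\xc0\xdb", 0, 2, "unexpected end of data")
        return row

def fetch_rows_skipping_undecodable(cursor):
    """Collect rows one at a time, discarding any that fail to decode."""
    rows = []
    while True:
        try:
            row = cursor.fetchone()
        except UnicodeDecodeError:
            continue  # discard this row and keep going
        if row is None:
            break
        rows.append(row)
    return rows

results = FakeCursor([("ok1",), ("BAD",), ("ok2",)])
print(fetch_rows_skipping_undecodable(results))  # [('ok1',), ('ok2',)]
```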

It is possible that some bad data ended up in the database.

I have Googled and checked related questions, but had no luck.

2 answers

This problem was eventually worked around. I suspect the cause was that text in one encoding had been crammed into a field with a different declared encoding by some hacky method when the table was set up.


I fixed the problem myself by adding:

 conn.setencoding('utf-8') 

immediately before creating the cursor.

Where conn is the connection object.

I was pulling tens of millions of rows with fetchall(), in the middle of a transaction that would have been extremely expensive to roll back manually, so I could not afford to simply skip the invalid ones.

Source where I found the solution: https://github.com/mkleehammer/pyodbc/issues/112#issuecomment-264734456


Source: https://habr.com/ru/post/1247360/
