String Encoding / Decoding Error - Missing Character from the End

I have a column of type NVARCHAR in my database. I cannot convert the contents of this column to a regular string in my code. (I am using pyodbc to connect to the database).

    # This unicode string is returned by the database
    >>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'

    # prints something in Chinese
    >>> print my_string
    䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭

The closest I have come is encoding it to UTF-16:

    >>> my_string.encode('utf-16')
    '\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
    >>> print my_string.encode('utf-16')
      WAGCenter-04190517953516060503-20160605124857-4190-5

But the actual value I need, as stored in the database, is:

    WAGCenter-04190517953516060503-20160605124857-4190-51

I tried encoding with utf-8, utf-16, ascii, and utf-32, but nothing worked.

Does anyone have an idea what I am missing, and how to get the desired result from my_string?

Edit: When I convert it to utf-16-le, the unnecessary characters at the beginning are gone, but one character is still missing from the end:

    >>> print my_string.encode('utf-16-le')
    WAGCenter-04190517953516060503-20160605124857-4190-5

With some other columns it works. What could be causing this intermittent problem?

2 answers

You have a serious problem in the definition of your database, in how you store values in it, or in how you read values from it. I can only explain what you are seeing. I cannot say why it happens or how to fix it without knowing:

  • the type of database
  • how values are entered into it
  • how values are extracted to get this pseudo-Unicode string
  • the actual content when you use direct (native) access to the database

What you get is an ASCII string whose 8-bit characters have been grouped in pairs to form 16-bit Unicode characters in little-endian order. Since the expected string has an odd number of characters, the last character was irrecoverably lost in translation: the string you received ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 is the code for '5'. Demo:

    def cvt(ustring):
        l = []
        for uc in ustring:
            l.append(chr(ord(uc) & 0xFF))         # low order byte
            l.append(chr((ord(uc) >> 8) & 0xFF))  # high order byte
        return ''.join(l)

    >>> cvt(my_string)
    'WAGCenter-04190517953516060503-20160605124857-4190-5'
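To see the same effect from the other direction, here is a minimal sketch that starts from the expected value and reproduces the garbled string. It is written for Python 3 (the session above is Python 2), and it assumes that the odd trailing byte is simply dropped somewhere between the database and the application. That is a guess made for illustration; the thread does not show what the driver actually does with it.

```python
# Python 3 sketch; the variable names here are illustrative only.
expected = "WAGCenter-04190517953516060503-20160605124857-4190-51"

# The value as plain single-byte ASCII: 53 bytes, an odd number.
raw = expected.encode("ascii")

# Assumption: only complete 2-byte pairs survive, so the last byte ('1') is dropped.
paired = raw[: len(raw) // 2 * 2]

# Misreading the byte pairs as UTF-16 little-endian gives the CJK-looking string.
garbled = paired.decode("utf-16-le")

# Reversing the misreading recovers everything except the dropped byte.
recovered = garbled.encode("utf-16-le").decode("ascii")

print(len(raw), len(garbled))                       # 53 26
print(hex(ord(garbled[0])), hex(ord(garbled[-1])))  # 0x4157 0x352d
print(recovered)
```

The first and last code units match the string in the question (u'\u4157' and u'\u352d'), and the recovered text stops at '-5'. No re-encoding on the client side can bring back the final '1', so the fix has to happen where the data is read.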

The problem was that I had UTF-16 set in my odbcinst.ini file, where I should have used the UTF-8 character encoding.

Previously I was changing it via the OPTION parameter when connecting through pyodbc. Later, changing it in the odbcinst.ini file fixed the problem.


Source: https://habr.com/ru/post/1258030/