Why does T-SQL treat "ｓｏｆｉａ" the same as "sofia"? What string encoding is this?

Question

I came across a situation where SQL Server stores "ｓｏｆｉａ" and "sofia" as two different strings, but when they are compared in T-SQL they are treated as equal no matter which collation is used, even a binary one:

CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'ｓｏｆｉａ')

SELECT * FROM #r WHERE NAME = N'ｓｏｆｉａ'

sofia
ｓｏｆｉａ

(2 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP1_CI_AS 
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

I tried to find out the encoding of "ｓｏｆｉａ" using:

http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp

It said:

            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();

I iterated through all the encodings returned by Encoding.GetEncodings(), but none of them matched.

Looking at the bytes, I found an interesting fact: "ｓｏｆｉａ" is itself encoded as UTF-16, but it can be generated from the UTF-16 bytes of "SOFIA" by setting the second byte of each character, the one next to the ASCII code, to all 1 bits instead of all 0 bits (e.g. for 'S': 83 255 vs. 83 0). It is nevertheless displayed as lower case. In C#:

"ｓｏｆｉａ"

                             [0]         83          byte                                    
                             [1]         255        byte
                             [2]         79          byte
                             [3]         255        byte
                             [4]         70          byte
                             [5]         255        byte
                             [6]         73          byte
                             [7]         255        byte
                             [8]         65          byte
                             [9]         255        byte

"SOFIA"

                             [0]         83          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         73          byte
                             [7]         0        byte
                             [8]         65          byte
                             [9]         0        byte

"sofia"

                             [0]         115          byte                                    
                             [1]         0        byte
                             [2]         111          byte
                             [3]         0        byte
                             [4]         102          byte
                             [5]         0        byte
                             [6]         105          byte
                             [7]         0        byte
                             [8]         97          byte
                             [9]         0        byte
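
The byte dumps above can be reproduced outside C#. The following is a minimal Python sketch (Python is used here only for illustration; the helper name `utf16le_bytes` is hypothetical). It prints the UTF-16 Little Endian bytes of each string and checks how far each full-width code point lies from its ordinary lower-case counterpart.

```python
def utf16le_bytes(text: str) -> list:
    """Return the UTF-16 Little Endian bytes of text as decimal integers."""
    return list(text.encode("utf-16-le"))


# The full-width string from the question, written with escapes so the
# code points are unambiguous: U+FF53 U+FF4F U+FF46 U+FF49 U+FF41.
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"

for label, text in [("fullwidth", fullwidth), ("SOFIA", "SOFIA"), ("sofia", "sofia")]:
    print(f"{label:>9}: {utf16le_bytes(text)}")

# Distance between each full-width code point and the plain lower-case letter.
offsets = {ord(fw) - ord(plain) for fw, plain in zip(fullwidth, "sofia")}
print(sorted(hex(o) for o in offsets))
```

The single offset printed at the end suggests these are not "SOFIA" with a modified high byte, but a separate range of code points that happens to share its low byte with the upper-case ASCII letters.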

One can create two different directories or files with these names: C:\ｓｏｆｉａ\ and C:\sofia\, or ｓｏｆｉａ.txt and sofia.txt.

Why does the SQL engine treat them as the same while storing them as their original, distinct byte streams?

To get exactly the row I want, I had to convert to binary first:

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')

ｓｏｆｉａ

(1 row(s) affected)

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

But that has many side effects, such as losing culture-aware and case-insensitive comparison. How can I make the T-SQL engine treat the two strings as different without paying that price?

Is there an official name for this kind of string encoding?

c# sql-server encoding unicode collation

Hong ao May 22, '15 at 2:33

2 answers

Solomon Rutzky · Answer 1 · 2015-05-22T04:53:36+0000

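There are two issues here.

First: the characters in "ｓｏｆｉａ" are "full-width" characters. They are not a different encoding. They are separate Unicode code points from the ordinary Latin letters, and whether the two forms compare as equal depends on whether the collation is width-sensitive. Width-sensitive collations have names ending in _WS, and most collations are not width-sensitive. Unfortunately, there is no SQL_Latin1_General_CP1_CI_AS_WS collation.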
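
As an illustration of what these characters are, the sketch below asks Python's standard `unicodedata` module for their official Unicode names and width property, then applies NFKC compatibility normalization. This only shows how Unicode itself classifies the characters; it is not how SQL Server performs its comparison.

```python
import unicodedata

# The full-width string from the question: U+FF53 U+FF4F U+FF46 U+FF49 U+FF41.
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"

# Official Unicode name of each character.
for ch in fullwidth:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# East Asian Width property: "F" is Fullwidth, "Na" is Narrow.
print(unicodedata.east_asian_width(fullwidth[0]), unicodedata.east_asian_width("s"))

# NFKC compatibility normalization folds full-width forms to the plain letters.
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded == "sofia")
```

The names come from the Unicode block "Halfwidth and Fullwidth Forms", which is the closest thing to an official name for this kind of text.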

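To list the width-sensitive collations, run SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';. Given the collation you are using now (case-insensitive, accent-sensitive), the closest width-sensitive equivalent is Latin1_General_CI_AS_WS. Alternatively, use a binary collation ending in _BIN2 (not one ending in _BIN, which is the older form, and preferably not one whose name starts with SQL_).

However, running the following: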

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

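Both statements still return "Values are the same". Why?

Second: NVARCHAR¹ string literals must be prefixed with N; without it they are VARCHAR² literals, and characters that are not in the code page get converted (to ?, or, where one exists, to a best-fit equivalent, which here is the ordinary letter).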

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

With the N prefix, both statements return "Values are different".

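¹ XML and the N-prefixed types are stored as UTF-16 Little Endian. By default, handling is limited to UCS-2 / the Base Multilingual Plane (BMP). If the collation name ends in _SC, however, the full UTF-16 range, including supplementary characters, is handled.

² CHAR, VARCHAR and TEXT (which is deprecated, so do not use it) are 8-bit Extended ASCII, interpreted through the code page of the collation.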
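
To see why a literal without the N prefix can end up equal to plain "sofia", it helps to look at what a conversion to an 8-bit code page can do with full-width characters. The Python sketch below is only an approximation under stated assumptions: it assumes code page 1252, the helper names are hypothetical, and NFKC normalization stands in for a best-fit mapping because Python's `cp1252` codec has none. It is not the conversion routine SQL Server uses.

```python
import unicodedata

# The full-width string from the question: U+FF53 U+FF4F U+FF46 U+FF49 U+FF41.
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"


def to_cp1252_replace(text: str) -> bytes:
    """Encode to code page 1252, turning unmappable characters into '?'."""
    return text.encode("cp1252", errors="replace")


def to_cp1252_folded(text: str) -> bytes:
    """Approximate a best-fit conversion: fold with NFKC, then encode.

    Assumption: NFKC is a stand-in for a best-fit table, not the real one.
    """
    return unicodedata.normalize("NFKC", text).encode("cp1252", errors="replace")


print(to_cp1252_replace(fullwidth))  # every character is unmappable
print(to_cp1252_folded(fullwidth))   # folded to the ordinary letters first
print(to_cp1252_folded(fullwidth) == "sofia".encode("cp1252"))
```

Once the full-width letters have been folded to ordinary letters during conversion, no collation can tell the two literals apart, because the comparison never sees the original characters.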

Kazetsukai · Answer 2 · 2015-05-22T02:45:59+0000

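These are full-width characters, not a different encoding. Full-width letters are separate Unicode code points from the ordinary Latin letters, which is why they are stored differently. The collation you are using, SQL_Latin1_General_CP1_CI_AS, is case-insensitive and accent-sensitive but not width-sensitive, so it treats the two forms as equal.

To tell them apart, you need a width-sensitive collation, which is marked by a _WS suffix; in principle that would be SQL_Latin1_General_CP1_CI_AS_WS.

EDIT: As @srutzky points out, that particular collation does not exist; not every collation has a _WS variant, so pick one that is actually available with the _WS suffix.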
