Why does T-SQL treat ｓｏｆｉａ as sofia? What string encoding is this?

I came across a situation where SQL Server stores "ｓｏｆｉａ" and "sofia" as two different rows, yet T-SQL treats them as equal when comparing them, no matter which COLLATE clause is used, even a binary collation:

CREATE TABLE #r (NAME NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #r VALUES (N'ｓｏｆｉａ')
INSERT INTO #r VALUES (N'sofia')

SELECT * FROM #r WHERE NAME = N'sofia'

ｓｏｆｉａ
sofia

(2 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

-------------------
Values are the same

(1 row(s) affected)

I tried to find out the encoding of "ｓｏｆｉａ" using:

http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp

It said:

            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();

I iterated through all the encodings returned by Encoding.GetEncodings(); none of them matched.

Looking at the bytes, I found an interesting fact: "ｓｏｆｉａ" is itself encoded as UTF-16, but it can be produced from the UTF-16 bytes of "SOFIA" by setting the high byte of each character to 255 (0xFF, all bits 1) instead of 0 (for example, for 'S': 83 255 versus 83 0). It is displayed as lower case. In C#:

"ｓｏｆｉａ"

                             [0]         83          byte                                    
                             [1]         255        byte
                             [2]         79          byte
                             [3]         255        byte
                             [4]         70          byte
                             [5]         255        byte
                             [6]         73          byte
                             [7]         255        byte
                             [8]         65          byte
                             [9]         255        byte

"SOFIA"

                             [0]         83          byte                                    
                             [1]         0        byte
                             [2]         79          byte
                             [3]         0        byte
                             [4]         70          byte
                             [5]         0        byte
                             [6]         73          byte
                             [7]         0        byte
                             [8]         65          byte
                             [9]         0        byte

"sofia"

                             [0]         115          byte                                    
                             [1]         0        byte
                             [2]         111          byte
                             [3]         0        byte
                             [4]         102          byte
                             [5]         0        byte
                             [6]         105          byte
                             [7]         0        byte
                             [8]         97          byte
                             [9]         0        byte
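
The byte dumps above can be reproduced outside C#. The following is a minimal Python sketch, not part of the original question; the helper names `utf16le_bytes` and `describe` are made up for illustration. It assumes the unusual string consists of the code points U+FF53, U+FF4F, U+FF46, U+FF49, U+FF41, which is what the 83 255, 79 255, ... byte pairs correspond to in UTF-16 Little Endian.

```python
import unicodedata

# Assumption: the "odd" string is made of these five code points,
# inferred from the byte dump (low byte first, high byte 0xFF).
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"  # ｓｏｆｉａ
plain = "sofia"


def utf16le_bytes(text):
    """Return the UTF-16 Little Endian bytes of text as a list of ints."""
    return list(text.encode("utf-16-le"))


def describe(text):
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(utf16le_bytes(fullwidth))  # [83, 255, 79, 255, 70, 255, 73, 255, 65, 255]
print(utf16le_bytes(plain))      # [115, 0, 111, 0, 102, 0, 105, 0, 97, 0]
for code_point, name in describe(fullwidth):
    print(code_point, name)      # first line: U+FF53 FULLWIDTH LATIN SMALL LETTER S
```

The character names printed by `describe` suggest that this is not a separate encoding: both strings are ordinary UTF-16, and they differ only in which Unicode characters they contain.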

One can create two different directories or files with these names: C:\sofia\ and C:\ｓｏｆｉａ\, or sofia.txt and ｓｏｆｉａ.txt.

Why does the SQL engine consider them equal while storing them with their original bytes?

To get only the exact value I want, I had to convert to binary first:

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')

ｓｏｆｉａ

(1 row(s) affected)

SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')

sofia

(1 row(s) affected)

But this has side effects: the comparison also becomes sensitive to culture and case. How can I make the T-SQL engine treat the two strings as different without much extra cost?

Is there an official name for this kind of string encoding?

2 answers

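There are two issues here.

First: the two strings differ only in character width. One consists of fullwidth Latin letters, the other of ordinary ones. By default, collations ignore width, in the same way that a case-insensitive collation ignores case. To make a comparison width-sensitive you need a collation whose name includes _WS, and not every collation has such a variant. For example, there is no SQL_Latin1_General_CP1_CI_AS_WS.

To see which ones are available, run SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';. Of those, the closest match to the collation used in the question is Latin1_General_CI_AS_WS. Alternatively, use a binary collation ending in _BIN2 (not the older _BIN variants, and preferably not one of the SQL_ collations).

For example: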

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

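Both still return "Values are the same". The reason is the second issue:

Second: a string literal is NVARCHAR (see note 1) only when it is prefixed with N. Without the prefix it is VARCHAR (see note 2), so any character that does not exist in the code page is converted before the comparison takes place (either to ?, or, as happens here, to the closest available character, which is the ordinary letter).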

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

With the N prefix, both comparisons return "Values are different".
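
As a rough model of the difference between a width-insensitive and a width-sensitive (_WS) comparison, here is a small Python sketch. It is an illustration only and not how SQL Server implements collations: it substitutes Unicode NFKC normalization, which folds fullwidth letters to their ordinary forms but also folds other compatibility characters. The function names are hypothetical.

```python
import unicodedata

fullwidth = "\uff53\uff4f\uff46\uff49\uff41"  # ｓｏｆｉａ
plain = "sofia"


def width_insensitive_equal(a, b):
    """Compare after NFKC normalization (stand-in for ignoring width)."""
    fold = lambda s: unicodedata.normalize("NFKC", s)
    return fold(a) == fold(b)


def width_sensitive_equal(a, b):
    """Compare code point by code point (stand-in for _WS or _BIN2)."""
    return a == b


print(width_insensitive_equal(fullwidth, plain))  # True
print(width_sensitive_equal(fullwidth, plain))    # False
```

The first result mirrors what the question observed with SQL_Latin1_General_CP1_CI_AS, and the second mirrors the N-prefixed comparisons under Latin1_General_CI_AS_WS or Latin1_General_BIN2.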


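1 XML and the N-prefixed types are stored as UTF-16 Little Endian. By default, built-in functions only handle the UCS-2 / Base Multilingual Plane (BMP) code points. With a collation whose name ends in _SC, they handle the full UTF-16 range, including supplementary characters.

2 CHAR, VARCHAR and TEXT (which is deprecated, so do not use it) are 8-bit Extended ASCII, with the code page determined by the collation.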
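
The two storage notes can be illustrated with a short Python sketch. It assumes code page 1252 (the one behind the Latin1_General collations) and uses hypothetical helper names. One difference to keep in mind: Python raises an error for characters missing from a code page, whereas SQL Server silently substitutes another character.

```python
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"  # ｓｏｆｉａ
supplementary = "\U0001F600"                  # a code point outside the BMP


def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits_code_page(text, code_page="cp1252"):
    """True if every character of text exists in the 8-bit code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


print(utf16_code_units(fullwidth))      # 5: one code unit per BMP character
print(utf16_code_units(supplementary))  # 2: a surrogate pair
print(fits_code_page("sofia"))          # True
print(fits_code_page(fullwidth))        # False: no fullwidth letters in cp1252
```

The last line is why the unprefixed literals in the question could not keep their fullwidth characters once they became VARCHAR.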

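Those are fullwidth characters, and width is one of the properties a collation can ignore. SQL_Latin1_General_CP1_CI_AS is case-insensitive and accent-sensitive, but it is also width-insensitive, so it treats the two strings as equal.

To tell them apart you need a width-sensitive collation, that is, one with _WS in its name, which here would be SQL_Latin1_General_CP1_CI_AS_WS.

EDIT: as @srutzky points out, not every collation has a _WS variant, so choose one whose name actually includes _WS.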

Source: https://habr.com/ru/post/1589337/

