MD5 value mismatch between SQL server and PostgreSQL

In order to write code to verify the integrity of the data stored both on the SQL server and on PostgreSQL, I plan to calculate MD5 from the table data for both databases and check if they are equal. This works fine as long as the data is plain text (ANSI), as shown below:

sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', 'a')); 0x0cc175b9c0f1b6a831c399e269772661 postgres=# select MD5('a'); 0cc175b9c0f1b6a831c399e269772661 

Now, if I try to use some Hangul characters (Korean), MD5 fails:

 sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', '무')); 0x7827b52f65d9f7777d37071cbbbf7f2d postgres=# select MD5('무'); cb3e9be1a3a28b355eabae1fa1e291b3 

In my opinion, the reason for the discrepancy is that Unicode characters are stored as UCS-2 encoding (fixed 16-bit encoding) in SQL Server and UTF-8 encoding in PostgreSQL. And since MD5 works on character bits, the sequence of character bits will be different in both SQL Server and PostgreSQL.

AS I mainly focused on the character set in Hangul, the workaround I used in PostgreSQL is to convert the encoding from UTF-8 to UHC (the universal character set of Hangul) before computing the hash, as shown below:

 postgres=# select MD5(CONVERT('무'::bytea,'UTF8','UHC')); 7827b52f65d9f7777d37071cbbbf7f2d 

As you can see, the above hash value is the same as for the SQL server.

Everything is fine as long as I deal with Hangul symbols. But some tables contain a combination of Hangul and Chinese characters, in which case the conversion fails:

 postgres=# select MD5(CONVERT('무么'::bytea,'UTF8','UHC')); ERROR: character 0xe4b988 of encoding "UTF8" has no equivalent in "UHC" postgres=# 

The error makes sense since there is no equivalent Chinese characters in the UHC character.

How can I make it work? Basically, I need to find a way to convert UCS-2 to UTF-8 on SQL server or convert UTF-8 to UCS-2 in PostgreSQL before calculating MD5. I want to perform all these operations within the database engine, rather than loading data in an external application to calculate MD5, since some tables have a huge data set.

PostgreSQL version 9.1 SQL Server version 9.1

+6
source share
2 answers

Unfortunately, PostgreSQL does not support UTF-16 / UCS-2.

But you can write a function to convert utf8 text to ucs2 binary data ( bytea ):

 create or replace function text_to_ucs2be(input_in_utf8 text) returns bytea immutable strict language sql as $$ select decode(string_agg(case when code_point < 65536 then lpad(to_hex(code_point), 4, '0') end, ''), 'hex') from regexp_split_to_table(input_in_utf8, '') chr, ascii(chr) code_point $$; create or replace function text_to_ucs2le(input_in_utf8 text) returns bytea immutable strict language sql as $$ select decode(string_agg(case when code_point < 65536 then lpad(to_hex(code_point & 255), 2, '0') || lpad(to_hex(code_point >> 8), 2, '0') end, ''), 'hex') from regexp_split_to_table(input_in_utf8, '') chr, ascii(chr) code_point $$; 

Note. These functions above will cross out any code points without BMP (therefore, they will have the name ucs2).

The following statements should give you the same results:

 -- on PostgreSQL select md5(text_to_ucs2le('무')); -- on SQL server select master.dbo.fn_varbintohexstr(HashBytes('MD5', N'무')); 
+3
source

In SQL Server 2005, the NVARCHAR_TO_UTF8 function must be implemented based on the code provided in this blog post.

SQL FUNCTION FOR OBTAINING NVARCHAR FROM UTF-8 MEMORED IN VARCHAR

You will need to perform the inverse transform.

With the convenient function NVARCHAR_TO_UTF8, it should be possible to calculate the same MD5 hash in SQL Server 2005 as in PostgreSQL 9.1.

Note that native support for UTF-8 is still missing in SQL Server 2014: Suggestions for using UTF-8 in Microsoft Connect

+1
source

Source: https://habr.com/ru/post/983168/


All Articles