To verify the integrity of data stored both in SQL Server and in PostgreSQL, I plan to compute an MD5 hash of the table data in each database and check that the hashes are equal. This works fine as long as the data is plain (ANSI) text, as shown below:
sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', 'a'));
0x0cc175b9c0f1b6a831c399e269772661

postgres=# select MD5('a');
0cc175b9c0f1b6a831c399e269772661
Now, if I try the same with Hangul (Korean) characters, the two hashes no longer match:
sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', '무'));
0x7827b52f65d9f7777d37071cbbbf7f2d

postgres=# select MD5('무');
(a different hash value)
In my opinion, the reason for the discrepancy is that SQL Server stores Unicode characters in UCS-2 (a fixed-width 16-bit encoding), while PostgreSQL stores them in UTF-8. Since MD5 operates on the raw bytes, and the two encodings produce different byte sequences for the same characters, the resulting hashes differ.
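The byte difference is easy to see by dumping the raw bytes on each side (a quick check of my own, not part of the original comparison; it assumes '무' is U+BB34, whose UCS-2 little-endian form is 0x34BB and whose UTF-8 form is 0xEBACB4):

sql-server> SELECT CONVERT(varbinary(10), N'무');  -- nvarchar: UCS-2, little-endian
0x34BB

postgres=# select '무'::bytea;  -- server encoding UTF-8
\xebacb4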
As I was mainly focused on Hangul characters, the workaround I used in PostgreSQL was to convert the encoding from UTF-8 to UHC (Unified Hangul Code) before computing the hash, as shown below:
postgres=# select MD5(CONVERT('무'::bytea, 'UTF8', 'UHC'));
7827b52f65d9f7777d37071cbbbf7f2d
As you can see, this hash value is the same as the one from SQL Server.
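My guess as to why this matches (an inference on my part, not something I have confirmed): the SQL Server literal above has no N prefix, so it is a varchar whose bytes come from the database's default code page, which on a Korean collation is code page 949, i.e. UHC. In that case both sides end up hashing the same UHC byte sequence, which can be checked by comparing the raw bytes:

sql-server> SELECT CONVERT(varbinary(10), '무');  -- varchar bytes in the database code page
postgres=# select CONVERT('무'::bytea, 'UTF8', 'UHC');

On a Korean-collated SQL Server, both should return the same two bytes.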
Everything is fine as long as I deal only with Hangul characters. But some tables contain a mix of Hangul and Chinese characters, in which case the conversion fails:
postgres=# select MD5(CONVERT('무么'::bytea, 'UTF8', 'UHC'));
ERROR:  character 0xe4b988 of encoding "UTF8" has no equivalent in "UHC"
The error makes sense, since the Chinese character 么 (U+4E48) has no equivalent in the UHC character set.
How can I make this work? Basically, I need a way either to convert UCS-2 to UTF-8 in SQL Server, or to convert UTF-8 to UCS-2 in PostgreSQL, before calculating the MD5 hash. I want to perform all of these operations inside the database engines rather than load the data into an external application to calculate the MD5, since some of the tables are huge.
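One direction I can sketch on the PostgreSQL side (untested; the function name to_ucs2le is made up, and it assumes a UTF-8 server encoding and characters within the Basic Multilingual Plane, which covers Hangul and the common Chinese ideographs) is to build the UCS-2 little-endian byte string in PL/pgSQL and hash that:

CREATE OR REPLACE FUNCTION to_ucs2le(t text) RETURNS bytea
LANGUAGE plpgsql IMMUTABLE AS $$
DECLARE
  buf bytea := ''::bytea;
  cp  int;
BEGIN
  FOR i IN 1 .. length(t) LOOP
    -- with a UTF-8 server encoding, ascii() returns the Unicode code point
    cp := ascii(substr(t, i, 1));
    IF cp > 65535 THEN
      -- UCS-2 has no surrogate pairs; refuse anything outside the BMP
      RAISE EXCEPTION 'code point % is outside the BMP', cp;
    END IF;
    -- append the code point as two bytes, low byte first (little-endian),
    -- which is how SQL Server stores nvarchar data
    buf := buf || decode(lpad(to_hex(cp % 256), 2, '0') ||
                         lpad(to_hex(cp / 256), 2, '0'), 'hex');
  END LOOP;
  RETURN buf;
END $$;

postgres=# select MD5(to_ucs2le('무么'));

If my reasoning above is right, this should match SQL Server's HashBytes('MD5', N'무么') (note the N prefix, so that SQL Server hashes the UCS-2 nvarchar bytes rather than the code-page varchar bytes).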
PostgreSQL version: 9.1
SQL Server version: 9.1