Oracle: convert xml objects in varchar2 field to utf-8 characters

Question

Oracle: convert xml objects in varchar2 field to utf-8 characters

I have a field in a table that contains XML objects for special characters, since the table is in Latin-1. For instance. " Hallöle slovenčina " ("ö" is in Latin-1, but the "č" in "slovenčina" had to be converted into an object by some application that stores the values in the database)

Now I need to export the table to a utf-8 encoded file, converting the XML objects to their original characters.

Is there a function in Oracle that can handle this for me, or do I really need to create a huge key / value map for it?

Any help is greatly appreciated.

EDIT: I found the DBMS_XMLGEN.convert function, but it only works on < , > and & . Not on &#NNN; : - (

+4

oracle utf-8 latin1

mawimawi Feb 02 '12 at 2:43

source share

3 answers

You can also use the internationalization package:

UTL_I18N.unescape_reference ('text')

It works well in order to change these html objects to ordinary characters (for example, to clear after moving the database from iso 8859P1 to UTF-8)

+7

Olafur Tryggvason Mar 11 '13 at 0:19

source share

This should probably be done in PL / SQL, which I don't know, but I wanted to see how far I can get it with pure SQL. This only replaces the first appearance of the code, so you have to somehow run it several times.

 select regexp_replace(s, '&#([0-9]+);', u) from (select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from (select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from (select 'Hallöle sloven&#269;ina' s from dual)))

Or less readable, but more convenient:

 SELECT REGEXP_REPLACE(s, '&#([0-9]+);', unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(regexp_replace(s, '.*?&#([0-9]+);.*$', '\1', 1, 1)), 'xxxx'), ' ', '')), 1, 1) FROM (SELECT 'Hallöle sloven&#269;ina &#269; &#278;' s FROM DUAL)

This (updated) version correctly replaces the first occurrence. You need to apply it until all of them are replaced.

+2

MK. Feb 07 '12 at 4:37

source share

Dan A. · Accepted Answer · 2012-02-07T02:28:09+0000

I believe the problem with dbms_xmlgen is that there are technically only five XML objects. In your example, there is a numerical HTML object that corresponds to Unicode:

http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=0100&end=017F

Oracle has a UNISTR function, which is useful here:

 select unistr('sloven\010dina') from dual;

I converted 269 to its hexadecimal equivalent 010d in the above example (in Unicode it is U+010D ). However, you can pass the decimal number and do the following conversion:

 select unistr('sloven\' || replace(to_char(269, 'xxx'), ' ', '0') || 'ina') from dual;

EDIT: PL / SQL solution:

Here is an example that I have compiled for you. This should intersect and replace any occurrences for each row selected from the table.

 create table html_entities ( id NUMBER(3), text_row VARCHAR2(100) ); INSERT INTO html_entities VALUES (1, 'Hallöle sloven&#269;ina &#266; &#250;'); INSERT INTO html_entities VALUES (2, 'I like the letter &#266;'); INSERT INTO html_entities VALUES (3, 'Nothing to change here.'); DECLARE v_replace_str NVARCHAR2(1000); v_fh UTL_FILE.FILE_TYPE; BEGIN --v_fh := utl_file.fopen_nchar(LOCATION IN VARCHAR2, FILENAME IN VARCHAR2, OPEN_MODE IN VARCHAR2, MAX_LINESIZE IN BINARY_INTEGER); FOR v_rec IN (select id, text_row from html_entities) LOOP v_replace_str := v_rec.text_row; WHILE (REGEXP_INSTR(v_replace_str, '&#[0-9]+;') <> 0) LOOP v_replace_str := REGEXP_REPLACE( v_replace_str, '&#([0-9]+);', unistr('\' || replace(to_char(to_number(regexp_replace(v_replace_str, '.*?&#([0-9]+);.*$', '\1')), 'xxx'), ' ', '0')), 1, 1 ); END LOOP; -- utl_file.put_line_nchar(v_fh, v_replace_str); dbms_output.put_line(v_replace_str); END LOOP; --utl_file.fclose(v_fh); END; /

Please note that I made a call when calling the UTL_FILE function to write NVARCHAR (Oracle extended character set) strings to a file on the database server. Dbms_output is great for debugging, it doesn't seem to support extended characters, but that should not be a problem if you use UTL_FILE to write to a file. Here DBMS_OUTPUT:

 Hallöle slovencina C ú I like the letter C Nothing to change here.

Oracle: convert xml objects in varchar2 field to utf-8 characters

More articles: