Why is JSON::XS not generating valid UTF-8?

I get corrupted JSON output, and I have reduced it to this test case.

    use utf8;
    use 5.18.0;
    use Test::More;
    use Test::utf8;
    use JSON::XS;

    BEGIN {
        # damn it
        my $builder = Test::Builder->new;
        foreach (qw/output failure_output todo_output/) {
            binmode $builder->$_, ':encoding(UTF-8)';
        }
    }

    foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
        my $hashref = { value => $string };
        is_sane_utf8 $string, "String: $string";

        my $json = encode_json($hashref);
        is_sane_utf8 $json, "JSON: $json";
        say STDERR $json;
    }

    diag ord('»');

    done_testing;

And this is the result:

    utf8.t ..
    ok 1 - String: Deliver «French Bread»
    not ok 2 - JSON: {"value":"Deliver «French Bread»"}
    #   Failed test 'JSON: {"value":"Deliver «French Bread»"}'
    #   at utf8.t line 17.
    # Found dodgy chars "<c2><ab>" at char 18
    # String not flagged as utf8...was it meant to be?
    # Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
    {"value":"Deliver «French Bread»"}
    ok 3 - String: 日本国
    ok 4 - JSON: {"value":"æ¥æ¬å½"}
    1..4
    {"value":"日本国"}
    # 187

So the string containing guillemets («») is valid UTF-8, but once it has been run through the JSON encoder it apparently isn't. What am I missing? use utf8 correctly flags my source as UTF-8. Also, that trailing 187 comes from the diag: it is the ord of '»'. It's less than 255, so this smells like a variant of Perl's old Unicode bug of treating characters as Latin-1. (And the test output is still mangled; I could never get that right with Test::Builder.)

Switching to JSON::PP gives the same result.

This is Perl 5.18.1 running on OS X Yosemite.

2 answers

is_sane_utf8 does not do what you think it does. You are supposed to pass it strings you have already decoded. I'm not sure what it is actually for, but it is not the right tool here. If you want to check whether a byte string is valid UTF-8, you can use

    ok(
        eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
        '$string is valid UTF-8',
    );
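As a quick sketch of that check in action (the helper name below is mine, not from the answer; Encode is a core module), a well-formed byte sequence passes and a truncated one fails:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper wrapping the check above: returns true if $bytes
# is a valid UTF-8 byte string, false otherwise. FB_CROAK makes the
# decoder die on malformed input; LEAVE_SRC keeps $bytes unmodified.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode_utf8($bytes, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 } ? 1 : 0;
}

print is_valid_utf8("\xC2\xAB") ? "valid\n" : "invalid\n";   # valid   («)
print is_valid_utf8("\xC2")     ? "valid\n" : "invalid\n";   # invalid (truncated)
```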

To show that JSON::XS is correct, decode by hand the byte sequence that is_sane_utf8 complained about.

            +----------------------- Start of two byte sequence
            |    +------------------ Not zero (good)
            |    |      +----------- Continuation byte indicator (good)
            |    |      |
            v    v      v
    C2 AB = [110][00010][10][101011]
                 [00010]    [101011] = 000 1010 1011 = U+00AB = «
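The bit arithmetic above can be replayed mechanically; this is just the same mask-and-shift done by hand, not anything from JSON::XS itself:

```perl
use strict;
use warnings;

my ($lead, $cont) = (0xC2, 0xAB);

# Drop the [110] marker from the lead byte (keep its low 5 bits) and the
# [10] marker from the continuation byte (keep its low 6 bits), then
# concatenate the payloads: 00010 . 101011 = 000 1010 1011 = 0xAB.
my $codepoint = (($lead & 0x1F) << 6) | ($cont & 0x3F);

printf "U+%04X\n", $codepoint;   # U+00AB
```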

The following shows that JSON::XS produces the same output as Encode.pm:

    use utf8;
    use 5.18.0;
    use JSON::XS;
    use Encode;

    foreach my $string ('Deliver «French Bread»', '日本国') {
        my $hashref = { value => $string };
        say(sprintf("Input: U+%v04X", $string));
        say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));

        my $json = encode_json($hashref);
        say(sprintf("JSON: %v02X", $json));
        say("");
    }

Output (with spaces added):

    Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
    UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
    JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D

    Input: U+65E5.672C.56FD
    UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
    JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D

JSON::XS is generating valid UTF-8, but you are using the resulting UTF-8-encoded byte strings in two different contexts that expect character strings.

Problem 1: Test::utf8

Here are the two main situations where is_sane_utf8 will fail:

  • You have a miscoded character string: one that was decoded from UTF-8 bytes as if they were Latin-1, or from double-encoded UTF-8; or the character string is fine but happens to look like a potentially "dodgy" miscoding (using the terminology from its documentation).
  • You have a valid UTF-8 byte string containing encoded code points U+0080 through U+00FF, for example «French Bread».
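The second case is easy to see without Test::utf8 at all: the UTF-8 encoding of a U+0080 through U+00FF character starts with byte C2 or C3, and read naively as Latin-1 those bytes are themselves printable characters, which is exactly the "dodgy" pattern the module hunts for. A minimal sketch using only core Encode:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

# '«' is U+00AB; its UTF-8 encoding is the two bytes C2 AB. Interpreted
# as Latin-1 those bytes read as "Â«", so even a perfectly valid UTF-8
# byte string trips the same heuristic a genuine miscoding would.
my $bytes = encode_utf8('«');

printf "bytes: %v02X\n", $bytes;            # bytes: C2.AB
printf "first byte: %02X\n", ord($bytes);   # first byte: C2
```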

The is_sane_utf8 test is for character strings only and has a documented potential for false negatives.

Problem 2: Output Encoding

All of your non-JSON strings are character strings, while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you use the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are implicitly encoded to UTF-8 with good results, but the byte strings containing JSON get encoded a second time (double encoding). STDERR, however, has no :encoding PerlIO layer set, so the encoded JSON byte strings look fine there: they are already encoded and pass straight through.

Only use the :encoding(UTF-8) PerlIO layer for I/O involving character strings, not for the UTF-8 encoded byte strings that the JSON encoder returns by default.
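One way to follow that advice (a sketch using core JSON::PP, which shares the JSON::XS API, since the question notes both behave the same) is the object-oriented interface, where the utf8 option is off by default: encode() then returns a character string and the PerlIO layer performs the only encoding step:

```perl
use strict;
use warnings;
use utf8;
use JSON::PP;                       # core module; JSON::XS has the same API

binmode STDOUT, ':encoding(UTF-8)';

# Unlike encode_json() (which is shorthand for ->utf8->encode), the OO
# interface defaults to utf8(0), so encode() returns a character string
# that the :encoding(UTF-8) layer encodes exactly once on output.
my $json_text = JSON::PP->new->canonical->encode(
    { value => 'Deliver «French Bread»' }
);

print "$json_text\n";   # {"value":"Deliver «French Bread»"} - encoded once
```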


Source: https://habr.com/ru/post/979235/

