Unicode character compression

Question

I use GZIPOutputStream in my Java program to compress large strings before storing them in a database.

When compressing English text, I achieve a compression ratio of between 1/4 and 1/10 (depending on the string value). So, for example, if my original English text is 100 KB, the compressed text will on average be somewhere around 30 KB.

But when I compress Unicode characters, the compressed string actually takes up more bytes than the original string. For example, if my original Unicode string is 100 KB, the compressed version can come out at up to 200 KB.

Unicode string example: "嗨，这是，短信计数测试持续for.Hi这是短"

Can anyone suggest how I can achieve compression for Unicode text, and explain why the compressed version is actually bigger than the original?

My compression code in Java:

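            // Needs java.io.ByteArrayOutputStream, java.nio.charset.StandardCharsets
            // and java.util.zip.GZIPOutputStream.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                // Encode the string as UTF-8 and feed the bytes to the gzip stream.
                zos.write(text.getBytes(StandardCharsets.UTF_8));
            } // Closing the stream finishes the gzip data and writes the trailer.

            byte[] udpBuffer = baos.toByteArray();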

java unicode gzip compression gzipoutputstream

Arry · Apr 11 '14 at 13:14

JonK · Answer 1 · 2014-04-11T13:56:25+0000

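Java's GZIPOutputStream uses the Deflate compression algorithm. Deflate is a combination of LZ77 and Huffman coding. According to the Unicode Compression FAQ: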

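Q: What's wrong with using standard compression algorithms such as Huffman coding, or patent-free variants of LZW?
A: SCSU bridges the gap between an 8-bit based LZW and 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same), and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that is desired.
To get the same effect with one of the popular general-purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues, see the book "Text Compression" by Bell, Cleary, and Witten (Prentice Hall 1990).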

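I was able to find a set of Java classes for SCSU compression on the Unicode website, which may be useful to you. However, I couldn't find a .jar library that you could easily import into your project, though you can probably package the classes into one if you like.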
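To see how the choice of encoding interacts with Deflate, a small experiment along these lines may help. It is only a sketch: the class name `EncodingSizeDemo` and its helpers `gzip` and `report` are made up for illustration, the sample text is simply the string from the question repeated 1000 times, and it does not use SCSU at all. It encodes the same text as UTF-8 and as UTF-16 and prints the raw and gzipped sizes of each.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class EncodingSizeDemo {

    // Hypothetical helper: gzip a byte array and return the compressed bytes.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
            zos.write(input);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    // Encode the text with the given charset and print raw and gzipped sizes.
    static void report(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        System.out.println(charset + ": raw=" + raw.length
                + " bytes, gzipped=" + gzip(raw).length + " bytes");
    }

    public static void main(String[] args) {
        // Assumption: repeat the question's sample string so there is enough data.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("嗨，这是，短信计数测试持续for.Hi这是短");
        }
        String text = sb.toString();

        System.out.println("chars=" + text.length());
        report(text, StandardCharsets.UTF_8);
        report(text, StandardCharsets.UTF_16BE);
    }
}
```

Note that the character count and the byte count differ: each of the Chinese characters in the sample occupies three bytes in UTF-8 and two in UTF-16, so sizes should always be compared in bytes of the same encoding.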

Aleksandar Stojadinovic · Answer 2 · 2014-04-11T13:51:10+0000

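As far as I know, GZIP compression depends on repeating sequences of text, and those repeating sequences are replaced with "descriptions" (this is a very high-level explanation). This means that if the same word appears in 20 places in a string, the algorithm stores that word once and then notes that it should appear at positions x, y, z... So if there isn't much redundancy in your original string, you can't save much. Instead, you end up with more overhead than savings.

I'm not a compression expert and I don't know the details, but this is the basic principle.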
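The trade-off between repetition and overhead can be illustrated with a rough sketch like the one below. The class `GzipOverheadDemo`, its helpers, and the sample word are hypothetical and chosen only for illustration. It gzips one short string and then the same string repeated many times, and prints the raw and compressed sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipOverheadDemo {

    // Hypothetical helper: number of bytes in the UTF-8 encoding of the text.
    static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: number of bytes gzip produces for the UTF-8 text.
    static int gzippedSize(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
            zos.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray().length;
    }

    public static void main(String[] args) {
        String once = "library ";            // arbitrary sample word
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(once);                 // highly redundant input
        }
        String many = sb.toString();

        System.out.println("short: raw=" + utf8Size(once)
                + " gzipped=" + gzippedSize(once));
        System.out.println("long:  raw=" + utf8Size(many)
                + " gzipped=" + gzippedSize(many));
    }
}
```

The gzip container adds a fixed header and trailer of at least 18 bytes, so for the short input the compressed size should come out larger than the raw size, while the long repetitive input should shrink considerably.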

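P.S. This question may be a duplicate of: Why is the gzip-compressed buffer size greater than the uncompressed buffer?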
