How to compress a string in Java?

I use GZIPOutputStream or ZipOutputStream to compress a string (my string.length() is less than 20), but the compressed result is longer than the original string.

On another site I read that this happens because my source string is too short, and that GZIPOutputStream is only useful for compressing longer strings.

So, can anyone help me compress a string like this?

My method looks like this:

 String compress(String original) throws Exception { } 

Update:

 import java.io.ByteArrayOutputStream;
 import java.io.IOException;
 import java.util.zip.GZIPOutputStream;

 //ZipUtil
 public class ZipUtil {
     public static String compress(String str) throws IOException {
         if (str == null || str.length() == 0) {
             return str;
         }
         ByteArrayOutputStream out = new ByteArrayOutputStream();
         GZIPOutputStream gzip = new GZIPOutputStream(out);
         gzip.write(str.getBytes());
         gzip.close();
         return out.toString("ISO-8859-1");
     }

     public static void main(String[] args) throws IOException {
         String string = "admin";
         System.out.println("after compress:");
         System.out.println(ZipUtil.compress(string));
     }
 }

Result:

[screenshot of the console output: the "compressed" text is longer than the original]

+51
java string compression zip
Sep 06 '10 at 6:40
11 answers

Compression algorithms almost always have some form of space overhead, which means that they are only effective on data large enough that the overhead is smaller than the amount of space saved.

Compressing a string of only 20 characters is not easy, and it is not always possible. If you have repetition, Huffman coding or simple run-length encoding may achieve some compression, but probably not much.
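To see that overhead concretely, here is a small sketch (the class name is invented) comparing the gzip output size for a tiny string and for a long, repetitive one. The fixed gzip header and trailer alone dwarf a 5-byte input:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class OverheadDemo {
    // Returns the gzip-compressed size of a string's UTF-8 bytes.
    static int gzipSize(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        String small = "admin";              // 5 bytes of input
        String large = "admin".repeat(1000); // 5000 bytes, highly repetitive (Java 11+)
        System.out.println(small.length() + " -> " + gzipSize(small)); // overhead dominates
        System.out.println(large.length() + " -> " + gzipSize(large)); // repetition wins
    }
}
```

On the small input the result is larger than the input; on the large repetitive input it is a small fraction of it.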

+37
Sep 06 '10

When you create a String, you can think of it as a list of chars, which means that for every character in your string you need to support every possible char value. From the Sun docs:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

If you only need to support a reduced character set, you can write a simple compression algorithm, analogous to converting between binary → decimal → hexadecimal. You go from 65,536 possible values (or however many characters your target system supports) down to 26 (alphabetic), 36 (alphanumeric), etc.

I have used this trick a few times, for example encoding timestamps as text (target base 36+, source base 10) - just make sure you have plenty of unit tests!
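A sketch of that idea (all names here are invented): packing a lowercase alphanumeric string through BigInteger arithmetic, so each character costs about 5.17 bits (log2 36) instead of the 16 bits of a Java char:

```java
import java.math.BigInteger;

public class Base36Pack {
    private static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";
    private static final BigInteger BASE = BigInteger.valueOf(36);

    // Pack a lowercase alphanumeric string into bytes by treating it
    // as a base-36 number.
    static byte[] pack(String s) {
        BigInteger n = BigInteger.ZERO;
        for (char c : s.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) throw new IllegalArgumentException("not in alphabet: " + c);
            n = n.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        return n.toByteArray();
    }

    // The original length must be known, so leading '0' digits survive.
    static String unpack(byte[] bytes, int length) {
        BigInteger n = new BigInteger(1, bytes);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(n.mod(BASE).intValue()));
            n = n.divide(BASE);
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        String s = "admin";
        byte[] packed = pack(s);
        System.out.println(s.length() + " chars -> " + packed.length + " bytes");
        System.out.println(unpack(packed, s.length()));
    }
}
```

"admin" (10 bytes as UTF-16 chars) packs into 4 bytes this way.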

+9
Sep 06 '10

If the passwords are more or less "random", you are out of luck: you will not be able to significantly reduce their size.

But: why do you need to compress passwords? Maybe you do not need compression but some kind of hash value? If you just need to check whether a name matches a given password, you do not have to save the password itself; you can save its hash instead. To check whether an entered password matches the given name, you compute the hash of the input the same way and compare it with the stored hash. Since a hash (Object.hashCode()) is an int, you could store all 20 password hashes in 80 bytes.
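A sketch of that store-a-hash approach (the class name is invented; this uses MessageDigest with SHA-256 rather than Object.hashCode(), since a 4-byte hashCode collides far more easily):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PasswordCheck {
    // Store a digest instead of the password itself.
    static byte[] hash(String password) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256")
                .digest(password.getBytes(StandardCharsets.UTF_8));
    }

    static boolean matches(String attempt, byte[] storedHash) throws NoSuchAlgorithmException {
        // MessageDigest.isEqual compares in constant time.
        return MessageDigest.isEqual(hash(attempt), storedHash);
    }

    public static void main(String[] args) throws Exception {
        byte[] stored = hash("admin");
        System.out.println(matches("admin", stored));  // true
        System.out.println(matches("Admin", stored));  // false
    }
}
```

Note that for real password storage you would also want a salt and a slow KDF, but the principle of comparing hashes instead of storing passwords is the same.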

+7
Sep 06 '10

Your friend is right. Both gzip and ZIP are based on DEFLATE. This is a general-purpose algorithm, and it is not designed for encoding small strings.

If you need this, a possible solution is a custom encoding and decoding using a HashMap<String, String>. This lets you do a simple one-to-one mapping:

 HashMap<String, String> toCompressed, toUncompressed;

 String compressed = toCompressed.get(uncompressed);
 // ...
 String uncompressed = toUncompressed.get(compressed);

Clearly this requires setup, and it is only practical for a small set of known strings.
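A runnable sketch of that table-based scheme (the class name and the entries are invented here):

```java
import java.util.HashMap;
import java.util.Map;

public class Codebook {
    private final Map<String, String> toCompressed = new HashMap<>();
    private final Map<String, String> toUncompressed = new HashMap<>();

    // Both sides must agree on this table in advance.
    void put(String uncompressed, String compressed) {
        toCompressed.put(uncompressed, compressed);
        toUncompressed.put(compressed, uncompressed);
    }

    // Strings outside the table pass through unchanged.
    String compress(String s)   { return toCompressed.getOrDefault(s, s); }
    String uncompress(String s) { return toUncompressed.getOrDefault(s, s); }

    public static void main(String[] args) {
        Codebook cb = new Codebook();
        cb.put("administrator", "a"); // hypothetical entry
        System.out.println(cb.compress("administrator"));   // a
        System.out.println(cb.uncompress("a"));             // administrator
    }
}
```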

+6
Sep 06 '10 at 6:41

Huffman coding can help, but only if you have many frequently repeated characters in your small String.

+4
Sep 06 '10 at 6:44

The DEFLATE algorithm used by ZIP is a combination of LZ77 and Huffman trees. You can use either of these algorithms separately.

Compression is based on two factors:

  • substring repetition in the source string (LZ77): if there are many repetitions, compression will be effective. This algorithm gives good results for compressing long plain text, since words are often repeated
  • the frequency of each character in the string (Huffman): the more skewed the character distribution, the more efficient the compression

In your case, you should try only the LZ77 algorithm. Basically, the string can then be compressed without adding meta-information, which is probably better for compressing short strings.

For the Huffman algorithm, the encoding tree must be sent along with the compressed text. Thus, for a small text, the result may be larger than the source text because of the tree.
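If you want the standard library to do the work with as little framing as possible, raw DEFLATE via Deflater with nowrap = true drops gzip's 18 bytes of fixed header and trailer (a sketch; the class name is invented):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class RawDeflate {
    // Raw DEFLATE (nowrap = true) omits the 18-byte gzip header/trailer
    // and the 6-byte zlib wrapper, which matters for tiny inputs.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // ample for small inputs
        int len = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, len);
    }

    public static void main(String[] args) {
        byte[] in = "admin".getBytes(StandardCharsets.UTF_8);
        System.out.println(in.length + " -> " + compress(in).length);
    }
}
```

A 5-byte input still grows slightly (DEFLATE's block framing remains), but by a few bytes rather than by twenty.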

+4
Sep 06 '10 at 6:50

Huffman coding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, and then send the data encoded with the tree. If the tree is large relative to the data, there will be no net saving in file size.

However, you can avoid sending the tree: instead, you arrange for the sender and the recipient to already have one. It cannot be built specifically for each string, but you can have one global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you will still get good compression, although not as good as with a custom tree for each input.
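DEFLATE offers exactly this idea in the form of a preset dictionary that both sides agree on out of band; a sketch (the class name and dictionary contents are invented here):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    // Both sides agree on this dictionary ahead of time, so it is never
    // transmitted; likely matches in the input become cheap back-references.
    static final byte[] DICT =
        "username password admin login session".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setDictionary(DICT);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] compressed, int originalLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[originalLength];
        int n = inf.inflate(out);           // returns 0: dictionary needed
        if (n == 0 && inf.needsDictionary()) {
            inf.setDictionary(DICT);
            inf.inflate(out);
        }
        inf.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] in = "admin password".getBytes(StandardCharsets.UTF_8);
        byte[] c = compress(in);
        System.out.println(in.length + " -> " + c.length);
        System.out.println(new String(decompress(c, in.length), StandardCharsets.UTF_8));
    }
}
```

The dictionary should contain the substrings you expect to be common, with the most likely ones placed at the end.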

+4
Sep 06 '10 at 7:24

If you know that your strings are mostly ASCII, you can convert them to UTF-8 bytes.

 byte[] bytes = string.getBytes("UTF-8"); 

This can reduce the memory used by about 50%, since a char takes two bytes. However, you will get a byte array, not a string. If you are writing it to a file, that should not be a problem.

To convert back to string:

 private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
 ...
 String s = new String(bytes, UTF8_CHARSET);
+2
May 8 '17 at 12:03

You will not see any compression for your string, since you need at least a few hundred bytes before GZIPOutputStream or ZipOutputStream achieves real compression. Your string is too small. (I don't understand why you would need compression for it.)

Quoting the conclusion from this article:

The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves performance only when the objects being compressed are more than a couple of hundred bytes. You will not observe any performance gain if the objects being compressed and transferred are simple String objects, for example.

0
Sep 06 '10

Take a look at the Huffman algorithm.

https://codereview.stackexchange.com/questions/44473/huffman-code-implementation

The idea is that each character is replaced by a sequence of bits whose length depends on the character's frequency in the text (the more frequent the character, the shorter the sequence).

You can read the entire text and build a code table, for example:

 Character  Code
 a          0
 s          10
 e          110
 m          111

The algorithm builds a tree of characters based on the input text. The more distinct characters you have, the worse the compression will be.

But depending on your text, this may be effective.
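A minimal sketch of building such a code table with a PriorityQueue (the class name and sample text are invented here):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanCodes {
    static class Node implements Comparable<Node> {
        final int freq; final char ch; final Node left, right;
        Node(char ch, int freq) { this(ch, freq, null, null); }
        Node(char ch, int freq, Node l, Node r) { this.ch = ch; this.freq = freq; left = l; right = r; }
        boolean leaf() { return left == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    // Build the code table for a text: frequent characters get shorter codes.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, f) -> pq.add(new Node(c, f)));
        // Repeatedly merge the two least-frequent subtrees.
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        walk(pq.poll(), "", codes);
        return codes;
    }

    static void walk(Node n, String path, Map<Character, String> codes) {
        if (n == null) return;
        if (n.leaf()) { codes.put(n.ch, path.isEmpty() ? "0" : path); return; }
        walk(n.left, path + "0");
        walk(n.right, path + "1");
    }

    static void walk(Node n, String path) { /* placeholder removed */ }

    public static void main(String[] args) {
        // 'a' appears 4 times, 's' 3, 'e' 2, 'm' 1 -> code lengths 1, 2, 3, 3.
        buildCodes("aaaassseem").forEach((c, code) -> System.out.println(c + " " + code));
    }
}
```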

0
Feb 05 '15 at 16:11

The Compact Strings enhancement is available out of the box since Java 9: https://openjdk.java.net/jeps/254

java.lang.String now has:

 private final byte[] value;

0
Jun 07 '19 at 11:41


