Avoiding memory waste when storing UTF-8 characters (8 bits) in a Java char (16 bits): two in one?

I'm afraid I have a question about the details of a rather well-worn topic. I have searched a lot, but could not find a clear answer to this specific, seemingly obvious and very important problem:

When converting a byte[] to a String using UTF-8, each byte (8 bits) becomes an 8-bit UTF-8-encoded character, but each UTF-8 character is stored as a 16-bit char in Java. Is that right? If so, does this mean that each measly Java char uses only its first 8 bits and consumes twice the memory? Is that right too? I wonder how such wasteful behavior can be acceptable.

Are there any tricks to get a pseudo-string that uses 8 bits per character? Would that lead to less memory consumption? Or maybe there is a way to store *two* 8-bit values in a single 16-bit Java char to avoid the waste?

Thanks for any deconfusing answers...

EDIT: Hi, thanks everyone for the answers. I knew about UTF-8's variable-length encoding. However, since my source is bytes, which are 8 bits, I assumed (apparently wrongly) that it would only ever need 8-bit UTF-8 words. Does UTF-8 conversion actually preserve the weird characters you see when you do "cat somebinary" on the CLI? I thought UTF-8 simply mapped each possible 8-bit byte word to one specific 8-bit UTF-8 word. Wrong? I thought about using Base64, but that's bad because it only uses 7 bits.
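(A quick sketch to check that assumption directly, using only the standard `StandardCharsets` API. It shows that UTF-8 does *not* map every possible byte value to one 8-bit word: any byte with the high bit set becomes two bytes when re-encoded as UTF-8, whereas ISO-8859-1 maps each byte to exactly one char.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByteCheck {
    public static void main(String[] args) {
        // A single byte with the high bit set (0xE9 is 'é' in ISO-8859-1).
        byte[] raw = { (byte) 0xE9 };

        // Decoding as ISO-8859-1 maps each byte to one char, losslessly.
        String latin = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(latin.length());        // 1 char

        // But re-encoding that char as UTF-8 needs TWO bytes, not one:
        byte[] utf8 = latin.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);           // 2 bytes
        System.out.println(Arrays.toString(utf8)); // [-61, -87], i.e. 0xC3 0xA9
    }
}
```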

So the reformulated question is: is there a smarter way to convert a byte[] to a String? Maybe my favorite would be to simply cast byte[] to char[], but then I still have 16-bit words.

Additional Use Information:

I am adapting Jedis (a Java client for the NoSQL store Redis) as a "primitive storage layer" for HyperGraphDB. So Jedis is the database for another "database". My problem is that I have to feed Jedis byte[] data all the time, while internally *Redis* (the actual server) deals only with binary-safe strings. Since Redis is written in C, a char there is 8 bits long, which AFAIK is not ASCII, which is 7 bits. In the Java world of Jedis, however, each char is internally 16 bits long. I don't understand this code (yet), but I suppose Jedis then converts these 16-bit Java strings into the corresponding 8-bit Redis strings (here [3]). It says it extends FilterOutputStream. Does the whole byte[] ↔ String conversion go through this FilterOutputStream...?

Now I'm wondering: if I have to convert between byte[] and String all the time, with data sizes from very small to potentially very large, isn't there a huge waste of memory when every 8-bit character is carried around as 16 bits in Java?

+6
7 answers

Isn't there any trick to have an 8-bit pseudo-string?

Yes, make sure you have an up-to-date version of Java. ;)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

-XX:+UseCompressedStrings — Use a byte[] for Strings that can be represented as pure ASCII. (Introduced in the Java 6 Update 21 Performance Release.)

EDIT: This option does not work in Java 6 update 22 and is not enabled by default in Java 6 update 24. Note: this option seems to slow performance by about 10%.

The following program

    public static void main(String... args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append(i);
        for (int j = 0; j < 10; j++)
            test(sb, j >= 2);
    }

    private static void test(StringBuilder sb, boolean print) {
        List<String> strings = new ArrayList<String>();
        forceGC();
        long free = Runtime.getRuntime().freeMemory();
        long size = 0;
        for (int i = 0; i < 100; i++) {
            final String s = "" + sb + i;
            strings.add(s);
            size += s.length();
        }
        forceGC();
        long used = free - Runtime.getRuntime().freeMemory();
        if (print)
            System.out.println("Bytes per character is " + (double) used / size);
    }

    private static void forceGC() {
        try {
            System.gc();
            Thread.sleep(250);
            System.gc();
            Thread.sleep(250);
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
    }

prints by default

    Bytes per character is 2.0013668655941212
    Bytes per character is 2.0013668655941212
    Bytes per character is 2.0013606946433575
    Bytes per character is 2.0013668655941212

with the option -XX:+UseCompressedStrings

    Bytes per character is 1.0014671435440285
    Bytes per character is 1.0014671435440285
    Bytes per character is 1.0014609725932648
    Bytes per character is 1.0014671435440285
+8

Actually, you have UTF-8 partly wrong: UTF-8 is a variable-length multibyte encoding, so valid characters are 1-4 bytes long (in other words, some UTF-8 characters are 8 bits, some 16 bits, some 24 bits and some 32 bits). Although 1-byte characters occupy 8 bits, there are many multibyte characters. If you had only 1-byte characters, that would allow a total of only 256 different characters (aka "extended ASCII"); that may be enough for 90% of English usage (my naive guesstimate), but it will bite you in the ass as soon as you even think about anything outside that subset (note that naïve is an English word, yet cannot be written with plain ASCII).
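(To make the 1-4 byte range concrete, a small sketch using the standard `String.getBytes` API; the class name is just illustrative.)

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // UTF-8 byte count varies per character: 1, 2, 3, or 4 bytes.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1: plain ASCII
        System.out.println("ï".getBytes(StandardCharsets.UTF_8).length);  // 2: Latin-1 supplement (U+00EF)
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3: Euro sign (U+20AC)
        System.out.println("𝄞".getBytes(StandardCharsets.UTF_8).length);  // 4: musical G clef (U+1D11E)
    }
}
```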

So although UTF-16 (which Java uses) looks wasteful, in fact it is not. In any case, unless you are on a very limited embedded system (in which case, what are you doing there with Java?), trying to trim your strings is a pointless micro-optimization.

For a slightly longer introduction to character encoding, see, for example, this: http://www.joelonsoftware.com/articles/Unicode.html

+5

When converting a byte[] to a String using UTF-8, each byte (8 bits) becomes an 8-bit UTF-8-encoded character

No. When converting a byte[] to a String using UTF-8, each UTF-8 sequence of 1-4 bytes is converted to a UTF-16 sequence of 1-2 16-bit chars.

In the overwhelming majority of cases, worldwide, this UTF-16 sequence contains a single char.

In Western Europe and North America, only 8 bits of this 16-bit char are used for most text. However, as soon as you have a Euro sign (€), you need more than 8 bits.
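(A minimal sketch of this decoding step, again with only standard APIs; it shows the Euro sign collapsing from 3 UTF-8 bytes to one char, and a character outside the Basic Multilingual Plane decoding to two chars, a surrogate pair.)

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUtf16 {
    public static void main(String[] args) {
        // The Euro sign: 3 bytes in UTF-8, but a single 16-bit char in Java.
        byte[] euroUtf8 = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };
        String euro = new String(euroUtf8, StandardCharsets.UTF_8);
        System.out.println(euro.length());  // 1

        // U+1D11E (musical G clef) takes 4 UTF-8 bytes and decodes
        // to TWO chars: a UTF-16 surrogate pair, but one code point.
        byte[] clefUtf8 = { (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E };
        String clef = new String(clefUtf8, StandardCharsets.UTF_8);
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1
    }
}
```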

For more information, see Unicode, or the article by Joel Spolsky.

+2

Java stores all "characters" internally as two-byte values. However, they are not stored the same way as UTF-8. For example, the maximum supported value is '\uFFFF' (hex FFFF, dec 65535), or 11111111 11111111 in binary (two bytes), but that would be a 3-byte UTF-8 character on disk.

The only possible waste is for truly "single-byte" characters in memory (most ASCII characters actually fit in 7 bits). When characters are written to disk, they will be in whatever encoding you specified (so single-byte UTF-8 characters will occupy only one byte).

The only place it matters is the JVM heap. And even there, you would need thousands upon thousands of 8-bit characters before you noticed any real difference in Java heap usage, which would be far outweighed by all the extra (hacky) processing you did to get there.

A million-odd 8-bit characters in RAM only "wastes" about 1 MiB anyway...

+2

Redis (the actual server) deals only with binary safe strings.

I assume this means you can use arbitrary octet sequences for keys/values. If you can use any sequence of C char values regardless of character encoding, then the equivalent in Java is the byte type.

Strings in Java are implicitly UTF-16. I mean, you can stuff arbitrary numbers in there, but the purpose of the class is to represent Unicode character data. The methods that perform byte ↔ char conversions perform transcoding operations from a known encoding to UTF-16.

If Jedis treated keys/values as UTF-8, it would not support every value Redis supports: not every byte sequence is valid UTF-8, so the encoding cannot be used for binary-safe strings.
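(This point can be demonstrated in a few lines: decoding invalid UTF-8 silently substitutes U+FFFD, the replacement character, so the round trip is lossy; ISO-8859-1, by contrast, is lossless for all 256 byte values. A sketch, not Jedis's actual code.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NotBinarySafe {
    public static void main(String[] args) {
        // 0xFF and 0xFE can never appear in valid UTF-8.
        byte[] original = { (byte) 0xFF, (byte) 0xFE };

        // Decoding substitutes U+FFFD for each invalid byte...
        String s = new String(original, StandardCharsets.UTF_8);
        // ...so the round trip does NOT give the original bytes back.
        byte[] roundTrip = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, roundTrip)); // false

        // ISO-8859-1, by contrast, maps all 256 byte values losslessly:
        String latin = new String(original, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original,
                latin.getBytes(StandardCharsets.ISO_8859_1)));  // true
    }
}
```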


Whether UTF-8 or UTF-16 consumes more memory depends on the data: the Euro sign (€), for example, takes three bytes in UTF-8 but only two bytes in UTF-16.

+1

Just for the record, I wrote a small byte[] ↔ String interconverter that works by casting every 2 bytes to 1 char. It is about 30-40% faster and consumes (possibly) half the memory of the standard Java methods, new String(somebytes) and someString.getBytes().

However, it is not compatible with existing String-encoded bytes or byte-encoded Strings. Also, it is not safe to call across different JVMs on shared data.

https://github.com/ib84/castriba

0

Perhaps this is exactly what you want:

    // Store two 8-bit values in one 16-bit char.
    char c1_8bit = 'a';
    char c2_8bit = 'h';
    char two_chars = (char) ((c1_8bit << 8) + c2_8bit);

    // Extract them again (fresh variable names; the originals are still in scope).
    char first  = (char) (two_chars >> 8);
    char second = (char) (two_chars & 0xFF);

Of course, this trick only works for characters in the range [0-255]. Why? Because you want to store your characters like this:
xxxx xxxx yyyy yyyy, where the x bits hold char 1 and the y bits hold char 2. So each char gets only 8 bits. And what is the biggest integer you can make with 8 bits? Answer: 255.

255 = 0000 0000 1111 1111 (8 bits). As soon as you use a char > 255, you get, for example:
256 = 0000 0001 0000 0000 (more than 8 bits), which does not fit into the 8 bits you reserved for 1 char.

Plus: keep in mind that Java is a language designed by smart people. They knew what they were doing. Embrace the Java API.

-1

Source: https://habr.com/ru/post/885724/

