Disadvantages of fragmented arrays for storing dynamic bytes

ByteArrayOutputStream seems like a pretty wasteful implementation, and I was wondering whether there are specific reasons for that. It keeps a single fixed-size array as its backing store. When that array is full, it allocates a new, larger array and copies the old one into it (more memory and more overhead). Then, if you call toByteArray(), it copies the array yet again.
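
For reference, this is roughly the pattern being described; a simplified sketch rather than the actual JDK source, just to make the two copies visible:

    // Simplified sketch of the single-backing-array pattern described above;
    // not the real java.io.ByteArrayOutputStream source, only an illustration.
    import java.util.Arrays;

    class SimpleByteSink {
        private byte[] buf = new byte[32];
        private int count;

        void write(byte[] data, int off, int len) {
            if (count + len > buf.length) {
                // grow by (at least) doubling, which copies every byte already written
                buf = Arrays.copyOf(buf, Math.max(buf.length * 2, count + len));
            }
            System.arraycopy(data, off, buf, count, len);
            count += len;
        }

        byte[] toByteArray() {
            // a second full copy, exactly the overhead the question points at
            return Arrays.copyOf(buf, count);
        }
    }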

ByteBuffers are nice, but they are also fixed in size; they just offer several views over the same array, nothing more.

I was wondering whether it would be interesting to create a class (or, if it already exists, point me to it) that uses one or more backing arrays. Instead of copying the whole array every time it needs to grow, it would simply add another backing array. For reading you could expose an InputStream-like interface, and for writing an OutputStream-like interface.

Any feedback on whether such a thing exists, and if not: why not? Is there a flaw I'm not seeing?

+6
4 answers

This is actually a great idea, especially for big data.

You can quickly run into memory problems when allocating huge arrays on the heap, since they need contiguous free memory. We once had a situation where we frequently allocated byte arrays of 10-50 MB and kept running into OutOfMemoryErrors, not because there was too little free memory (we usually had around 90%, some 900 MB, free), but because heap fragmentation meant there was no single contiguous block of memory large enough for such an array.

As a result, we created a Blob class that internally stores the data as a chain (a List) of smaller arrays. The arrays have a fixed size (which is needed for fast lookup, so that the involved array and the offset within it can be computed quickly for a given index), and we provide an InputStream and an OutputStream over the Blob. Later we extended it so it could be swapped out to disk and back.
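
The index arithmetic that makes the fixed chunk size pay off is just a division and a remainder; a minimal sketch (the class and field names are illustrative, not the actual Blob code):

    // Illustrative only: how a fixed chunk size maps a global index
    // to (chunk, offset) in O(1). Names are hypothetical.
    import java.util.ArrayList;
    import java.util.List;

    class ChunkedBytes {
        private static final int CHUNK_SIZE = 64 * 1024;   // fixed size per backing array
        private final List<byte[]> chunks = new ArrayList<>();

        byte get(long index) {
            int chunk  = (int) (index / CHUNK_SIZE);   // which backing array
            int offset = (int) (index % CHUNK_SIZE);   // position inside it
            return chunks.get(chunk)[offset];
        }
    }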

  • Downside? Nothing but a bit of straightforward programming.
  • Benefits? Efficient in-memory storage of large data, and no heap-fragmentation problems.

I can only encourage you to give it a go!

+2

The C++ standard library has both a vector class (comparable to Java's ArrayList) and a deque class (another list-like container). The latter provides efficient insertion at both the front and the back. The implementation I have seen is backed by a list of fixed-length array blocks, which is essentially the scheme you are interested in. So it is certainly possible.

The disadvantage is significantly increased code complexity. I suppose the implementation in the JRE could be changed to do what you suggest, with the toByteArray method gathering the data from the fragments. But doing so would be a very low priority, because the simple implementation is already blazingly fast. Any code that performs I/O should assume that reads and writes are slow operations that may block. ByteArrayOutputStream, by contrast, is very fast because it performs in-memory operations rather than true external I/O. Copying those byte arrays around is likely to be much faster than the external I/O anyway. The drawback of the current implementation is that it creates large garbage arrays when used for large output streams, but the use cases for the class are small streams; if you want to temporarily store the bytes of a large output stream, you should use a temporary file. So the added complexity of your proposal would probably not help much in practice.
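
As a concrete illustration of the temporary-file suggestion, here is a minimal sketch using the standard java.nio.file API; the file name, chunk size, and data are placeholders for the example:

    // Minimal sketch: buffer a large output stream in a temporary file
    // instead of keeping it all in memory. Names and sizes are illustrative.
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TempFileBufferDemo {
        public static void main(String[] args) throws Exception {
            Path tmp = Files.createTempFile("large-output", ".bin");
            try {
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    // write the large stream here instead of into a ByteArrayOutputStream
                    out.write(new byte[8192]);
                }
                try (InputStream in = Files.newInputStream(tmp)) {
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        // process n bytes when they are needed again
                    }
                }
            } finally {
                Files.deleteIfExists(tmp);      // clean up the temporary file
            }
        }
    }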

0
source

Sounds like you already know the benefits. The disadvantages of a list of buffers compared to a single buffer include:

  • if the buffers are fixed in size, you need O(n) allocations to write n bytes, whereas ByteArrayOutputStream needs only O(log n) because its buffer grows exponentially (see the allocation-count sketch below)
  • the implementation is more complicated: you have to track the active buffer, and you may have to switch buffers in the middle of a write (depending on the design)
  • switching buffers causes cache misses when reading

You can certainly write such a data structure yourself if it makes sense for your application.
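
A back-of-the-envelope comparison of the allocation counts from the first point, assuming fixed 8 KiB chunks versus a buffer that starts at 32 bytes and doubles (both numbers are illustrative assumptions, not measurements):

    // Rough allocation-count comparison for writing n bytes.
    // Assumes fixed 8 KiB chunks vs. a buffer that starts at 32 bytes and doubles.
    public class AllocationCountDemo {
        public static void main(String[] args) {
            long n = 36L * 1024 * 1024;                                // e.g. a 36 MB stream
            int chunkSize = 8 * 1024;

            long fixedChunkAllocs = (n + chunkSize - 1) / chunkSize;   // O(n): one per chunk
            long doublingAllocs = 1;                                   // the initial buffer
            for (long cap = 32; cap < n; cap *= 2) {
                doublingAllocs++;                                      // O(log n) resizes
            }

            System.out.println("fixed chunks:    " + fixedChunkAllocs + " allocations");
            System.out.println("doubling buffer: " + doublingAllocs + " allocations");
        }
    }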

0

Since there doesn't seem to be an existing implementation, I quickly wrote an initial one to test the speed:

    import java.util.ArrayList;
    import java.util.List;

    public class Buffer {
        private int size;
        private int writeIndex, writeOffset, readIndex, readOffset;
        private List<byte[]> backingArrays = new ArrayList<byte[]>();

        public Buffer() {
            this(10240);
        }

        public Buffer(int size) {
            this.size = size;
        }

        public int read(byte[] bytes) {
            return read(bytes, 0, bytes.length);
        }

        public int read(byte[] bytes, int offset, int length) {
            int read = 0;
            while (length > 0) {
                byte[] input = getInput();
                // no more data
                if (input == null) {
                    if (read == 0)
                        return -1;
                    else
                        return read;
                }
                int readLength = Math.min(length, (readIndex == writeIndex ? writeOffset : size) - readOffset);
                System.arraycopy(input, readOffset, bytes, offset, readLength);
                length -= readLength;
                offset += readLength;
                readOffset += readLength;
                read += readLength;
            }
            return read;
        }

        public void write(byte[] bytes) {
            write(bytes, 0, bytes.length);
        }

        public void write(byte[] bytes, int offset, int length) {
            while (length > 0) {
                byte[] output = getOutput();
                int writeLength = Math.min(length, output.length - writeOffset);
                System.arraycopy(bytes, offset, output, writeOffset, writeLength);
                length -= writeLength;
                offset += writeLength;
                writeOffset += writeLength;
            }
        }

        private byte[] getOutput() {
            // if we have filled an array, move to the next one
            if (writeOffset >= size) {
                writeIndex++;
                writeOffset = 0;
            }
            // create it if it doesn't exist yet
            if (backingArrays.size() <= writeIndex)
                backingArrays.add(new byte[size]);
            return backingArrays.get(writeIndex);
        }

        private byte[] getInput() {
            // nothing written yet
            if (backingArrays.size() == 0)
                return null;
            if (readOffset >= size) {
                readIndex++;
                readOffset = 0;
            }
            // can not read past where it is written
            if (readIndex > writeIndex || (readIndex == writeIndex && readOffset >= writeOffset))
                return null;
            else
                return backingArrays.get(readIndex);
        }

        public long size() {
            return (long) size * (long) writeIndex + writeOffset;
        }
    }
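
A possible usage sketch for the class above, copying a file through the Buffer; the file names and chunk size are placeholders, not part of the original test:

    // Hypothetical usage of the Buffer class above: stream a file into the
    // chunked buffer and back out again. File names are placeholders.
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BufferCopyDemo {
        public static void main(String[] args) throws Exception {
            Buffer buffer = new Buffer(64 * 1024);   // 64 KiB backing arrays
            byte[] chunk = new byte[8192];
            int n;

            try (InputStream in = Files.newInputStream(Paths.get("input.bin"))) {
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);       // appends across backing arrays
                }
            }

            try (OutputStream out = Files.newOutputStream(Paths.get("copy.bin"))) {
                while ((n = buffer.read(chunk)) != -1) {
                    out.write(chunk, 0, n);          // drains what was written
                }
            }
        }
    }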

I tested it by copying a 36 megabyte file. A lot depends on how the file I/O behaves, but in general it reads about 40% faster than a ByteArrayInputStream, while writing is a bit slower (it lags behind by roughly 5-20%).

I put it together quickly, so if you catch any errors let me know.

EDIT

Added a feature: by default, arrays that have already been read are released so they can be garbage collected.

0

Source: https://habr.com/ru/post/955231/

