GZIP decompression in C#: OutOfMemoryException

I have many large gzip files (approximately 10 MB - 200 MB) that I downloaded from FTP and need to unpack.

So I searched Google and found a solution for gzip decompression:

    static byte[] Decompress(byte[] gzip)
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            const int size = 4096;
            byte[] buffer = new byte[size];
            using (MemoryStream memory = new MemoryStream())
            {
                int count = 0;
                do
                {
                    count = stream.Read(buffer, 0, size);
                    if (count > 0)
                    {
                        memory.Write(buffer, 0, count);
                    }
                } while (count > 0);
                return memory.ToArray();
            }
        }
    }

It works well for any file below 50 MB, but as soon as the input is larger than 50 MB I get an OutOfMemoryException. The last position and length of the MemoryStream before the exception is 134217728. I do not think it is related to my physical memory; I understand that I cannot have an object larger than 2 GB since I am running the 32-bit version.

I also need to process the data after unpacking the files. I am not sure a MemoryStream is the best choice here, but I would rather not write to a file and then read the file back in again.

My questions

  • Why did I get a System.OutOfMemoryException?
  • What is the best possible solution for unzipping gzip files and subsequent text processing?
4 answers

The memory allocation strategy for MemoryStream is not suitable for a huge amount of data.

Since the contract of MemoryStream requires a contiguous array as the underlying storage, it has to reallocate that array quite often for a large stream (roughly log2(size_of_stream) times). The side effects of such reallocation are:

  • long copy delays during each reallocation
  • the new array must fit into free address space that is already heavily fragmented by the previous allocations
  • the new array ends up on the LOH (Large Object Heap), which has its own quirks (no compaction, collected only with Gen 2 GC).

As a result, processing a large (100 MB+) stream through a MemoryStream will most likely give you an OutOfMemoryException on x86 systems. In addition, the most common pattern for returning the data is to call ToArray, as you do, which additionally requires roughly the same amount of space as the last array buffer used by the MemoryStream.

Approaches to the solution:

  • The cheapest way is to pre-size the MemoryStream to something close to the size you need (preferably slightly larger). You can pre-compute the required size by reading into a fake stream that does not store anything (a waste of CPU, but it lets you measure the length). Consider also returning a stream instead of a byte array (or returning the MemoryStream's internal buffer together with its length).
  • Another option, if you need the whole stream or byte array, is to use a temporary file stream instead of a MemoryStream to store the large amount of data (see the sketch after this list).
  • A more sophisticated approach is to implement a stream that chunks the underlying data into smaller (e.g. 64 KB) blocks, to avoid allocations on the LOH and copying data whenever the stream grows.
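A minimal sketch of the temporary-file option, assuming a 64 KB copy buffer, a temp file from Path.GetTempFileName and a helper name of my own choosing (none of these come from the answer itself):

    using System.IO;
    using System.IO.Compression;

    // Sketch: decompress straight to a temp file, so no large contiguous
    // in-memory buffer is ever needed. Buffer size and method name are
    // illustrative assumptions, not part of the answer above.
    static string DecompressToTempFile(string gzipPath)
    {
        string tempPath = Path.GetTempFileName();
        using (FileStream input = File.OpenRead(gzipPath))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (FileStream output = File.Create(tempPath))
        {
            byte[] buffer = new byte[64 * 1024];
            int count;
            while ((count = gzip.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, count);
            }
        }
        return tempPath; // caller re-reads and processes this file in small pieces
    }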

You can try a test like the following to get an idea of how much you can write to a MemoryStream before you get an OutOfMemoryException:

    const int bufferSize = 4096;
    byte[] buffer = new byte[bufferSize];
    int fileSize = 1000 * 1024 * 1024;
    int total = 0;
    try
    {
        using (MemoryStream memory = new MemoryStream())
        {
            while (total < fileSize)
            {
                memory.Write(buffer, 0, bufferSize);
                total += bufferSize;
            }
        }
        MessageBox.Show("No errors");
    }
    catch (OutOfMemoryException)
    {
        MessageBox.Show("OutOfMemory around size : " + (total / (1024m * 1024.0m)) + "MB");
    }

You may need to unzip to a temporary physical file first, then re-read it in small pieces and process it as you go.

Side point: interestingly, on a Windows XP PC the code above reports "OutOfMemory around size: 256 MB" when it targets .NET 2.0 and "OutOfMemory around size: 512 MB" on .NET 4.


Do you happen to process the files in multiple threads? That would consume a lot of your address space. OutOfMemory errors are usually not related to physical memory, so a MemoryStream can run out much sooner than you would expect. Check out this discussion: http://social.msdn.microsoft.com/Forums/en-AU/csharpgeneral/thread/1af59645-cdef-46a9-9eb1-616661babf90 . If you switch to a 64-bit process, you will probably be more than fine for the file sizes you are dealing with.

In your current situation, however, you could work with memory-mapped files to get around the address-space limit. If you are using .NET 4.0, it provides a built-in wrapper for the Windows functions: http://msdn.microsoft.com/en-us/library/dd267535.aspx .
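A minimal sketch of the .NET 4.0 memory-mapped file API mentioned above; the file name and the 1 MB view size are illustrative assumptions:

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    string path = "unpacked.dat"; // illustrative file name
    long viewLength = Math.Min(1024 * 1024, new FileInfo(path).Length);

    // Map the file and read it through a bounded view: only the view,
    // not the whole file, has to fit into the process address space.
    using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path))
    using (MemoryMappedViewStream view = mmf.CreateViewStream(0, viewLength))
    {
        byte[] block = new byte[4096];
        int read = view.Read(block, 0, block.Length);
        Console.WriteLine("Read {0} bytes from the first view", read);
    }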


I understand that I cannot have an object larger than 2 GB since I use 32-bit

That is not true. You can have as much memory as you need; the 32-bit limit means you only get 4 GB of virtual address space (and the OS takes half of that). Virtual address space is not the same as memory.

Why did I get a System.OutOfMemoryException?

Because the allocator could not find a contiguous chunk of address space for your object, or because the stream grows too quickly and the address space gets clogged. (Most likely the former.)

What is the best possible solution for unzipping gzip files and subsequent text processing?

Write a script that downloads the files, then calls a tool such as gzip or 7zip to unpack them, and then process the output. Depending on the kind of processing, the number of files and their total size, you will have to save them to disk at some point anyway to avoid exactly these memory problems. Save them after unpacking and process them, say, 1 MB at a time.
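A minimal sketch of that last suggestion, assuming the unpacked text is processed line by line; the file name, the 64 KB read buffer and the line counting are illustrative, not prescribed by the answer:

    using System;
    using System.IO;
    using System.Text;

    // Stream the unpacked file through a small buffer instead of loading it whole.
    using (StreamReader reader = new StreamReader("unpacked.txt", Encoding.UTF8, true, 64 * 1024))
    {
        string line;
        long lines = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lines++; // replace with the real text processing
        }
        Console.WriteLine("Processed {0} lines", lines);
    }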


Source: https://habr.com/ru/post/914764/

