Compression Library Using Nvidia CUDA

Does anyone know of a project that implements standard compression methods (e.g. Zip, GZip, BZip2, LZMA, ...) using the NVIDIA CUDA library?

I was wondering whether algorithms that can make use of many parallel tasks (such as compression) wouldn't run much faster on a graphics card than on a dual- or quad-core CPU.

What do you think about the pros and cons of this approach?

+43
compression gpgpu cuda
Jan 19 '09 at 7:54
6 answers

I'm not aware of anyone having done this and made it publicly available. Just IMHO, it doesn't sound very promising.

As Martinus points out, some compression algorithms are highly sequential. Block compression algorithms such as LZW can be parallelized by coding each block independently. Zipping a large tree of files can be parallelized at the file level.
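
To make that concrete, here is a minimal host-side sketch of block-independent compression, one worker thread per block. It uses zlib's compress2 purely as a stand-in codec, and the 1 MiB block size and thread-per-block layout are just for illustration, not a finished library.

    // Build (assumption): g++ -O2 -pthread blocks.cpp -lz
    #include <cstdio>
    #include <thread>
    #include <vector>
    #include <zlib.h>

    // Compress one block into its own self-contained zlib stream, so every
    // block can be compressed (and later decompressed) independently.
    static void compress_block(const unsigned char* in, size_t in_len,
                               std::vector<unsigned char>* out) {
        uLongf out_len = compressBound(in_len);
        out->resize(out_len);
        compress2(out->data(), &out_len, in, in_len, Z_BEST_SPEED);
        out->resize(out_len);
    }

    int main() {
        const size_t block = 1 << 20;                       // 1 MiB blocks
        std::vector<unsigned char> data(8 * block, 'x');    // dummy input
        std::vector<std::vector<unsigned char>> blocks(8);
        std::vector<std::thread> workers;

        for (size_t i = 0; i < blocks.size(); ++i)          // one thread per block
            workers.emplace_back(compress_block, data.data() + i * block,
                                 block, &blocks[i]);
        for (auto& t : workers) t.join();

        size_t total = 0;
        for (const auto& b : blocks) total += b.size();
        std::printf("compressed %zu -> %zu bytes\n", data.size(), total);
        return 0;
    }

The trade-off is that each block carries its own compression state, which costs memory and some compression ratio.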

However, none of this is really SIMD-style parallelism (Single Instruction Multiple Data), and it is not massively parallel.

GPUs are basically vector processors, where you can execute hundreds or thousands of ADD instructions in lock step and run programs that have very few data-dependent branches.
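
A minimal CUDA example of that lock-step style: every thread performs the same ADD on its own element, with no data-dependent branching (error handling omitted for brevity).

    // Build (assumption): nvcc -O2 vec_add.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // bounds check only, no divergent logic
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);   // thousands of lock-step ADDs
        cudaDeviceSynchronize();

        std::printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }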

Compression algorithms in general sound more like an SPMD (Single Program Multiple Data) or MIMD (Multiple Instruction Multiple Data) programming model, which is better suited to multi-core CPUs.

Video compression algorithms can be accelerated by GPGPU processing such as CUDA only to the extent that there is a very large number of pixel blocks being cosine-transformed or convolved (for motion detection) in parallel, and the IDCT or convolution subroutines can be expressed with branchless code.
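
As a sketch of that pattern, here is a device kernel where each thread produces one output pixel of a 3x3 convolution and the border handling is expressed with min/max clamping rather than branches. It would be launched with a 2D grid, using the same host boilerplate as the vector-add example above.

    // One thread per output pixel; the inner loop has no data-dependent branches.
    __global__ void conv3x3(const float* in, float* out,
                            const float* k, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float acc = 0.0f;
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                int sx = min(max(x + dx, 0), w - 1);   // branchless edge clamp
                int sy = min(max(y + dy, 0), h - 1);
                acc += in[sy * w + sx] * k[(dy + 1) * 3 + (dx + 1)];
            }
        }
        out[y * w + x] = acc;
    }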

GPUs also favor algorithms with high numeric intensity (the ratio of math operations to memory accesses). Algorithms with low numeric intensity (such as adding two vectors) can be massively parallel and SIMD, but still run slower on the GPU than on the CPU because they are memory bound.
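
To illustrate the contrast: the vector-add kernel above does one ADD per three memory accesses and is memory bound, while a kernel like the following sketch does many floating-point operations per element loaded and becomes compute bound (launched the same way as vec_add).

    __global__ void high_intensity(const float* in, float* out, int n, int iters) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = in[i];                  // one load ...
        for (int k = 0; k < iters; ++k)   // ... then many math ops on it ...
            x = x * 1.000001f + 0.5f;
        out[i] = x;                       // ... and one store
    }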

+35
Jan 20 '09 at 22:41

We have completed the first phase of research to improve the performance of lossless compression algorithms. Bzip2 was chosen for the prototype; our team optimized only one operation - the Burrows-Wheeler transform - and we got some results: a 2x-4x speedup on well-compressible files. The code runs faster in all of our tests.
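
For context, the Burrows-Wheeler transform sorts all rotations of a block and keeps the last column. The naive CPU reference below is only an illustration of what that stage computes; a real implementation (CPU or GPU) would replace this O(n^2 log n) rotation sort with a suffix-array-style sort.

    // Build: g++ -O2 bwt.cpp
    // Naive reference Burrows-Wheeler transform: sort the rotations of the
    // input, then output the last character of each rotation in sorted order.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <string>
    #include <vector>

    static std::string bwt(const std::string& s) {
        const size_t n = s.size();
        std::vector<size_t> rot(n);
        std::iota(rot.begin(), rot.end(), 0);          // rotation start offsets
        std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {           // compare rotations lazily
                char ca = s[(a + k) % n], cb = s[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return false;                              // identical rotations
        });
        std::string out(n, ' ');
        for (size_t i = 0; i < n; ++i)                 // last column of sorted matrix
            out[i] = s[(rot[i] + n - 1) % n];
        return out;
    }

    int main() {
        std::printf("%s\n", bwt("banana$").c_str());   // prints "annb$aa"
        return 0;
    }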

We are going to complete bzip2 support, then deflate and LZMA, for some real-world tasks such as HTTP traffic and backup compression.

Link to the blog post: http://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx

+41
Apr 22 '11

Typically, compression algorithms cannot make use of parallel tasks; it is not easy to make these algorithms highly parallelizable. In your examples, TAR is not a compression algorithm, and the only algorithm that might be highly parallelizable is BZIP, because it is a block compression algorithm. Each block can be compressed separately, but that would require lots and lots of memory. LZMA does not work in parallel either; when you see 7zip using multiple threads, it is because 7zip splits the data stream into two separate streams, each of which is compressed with LZMA in its own thread, so the compression algorithm itself is not parallel. This splitting only works when the data permits it.
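
To illustrate the splitting idea (this is a sketch of the pattern only, not 7zip's actual implementation), here is a small example using liblzma's one-shot encoder: the input is cut in half and each half becomes its own independent stream on its own thread, while the encoder itself still runs sequentially within each half.

    // Build (assumption): g++ -O2 -pthread split.cpp -llzma
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>
    #include <lzma.h>

    // Encode one half as a self-contained .xz stream; the two halves never
    // share match history, which is why the split costs some ratio.
    static void encode_half(const uint8_t* in, size_t in_size,
                            std::vector<uint8_t>* out) {
        out->resize(lzma_stream_buffer_bound(in_size));
        size_t out_pos = 0;
        lzma_ret rc = lzma_easy_buffer_encode(6, LZMA_CHECK_CRC64, nullptr,
                                              in, in_size,
                                              out->data(), &out_pos, out->size());
        out->resize(rc == LZMA_OK ? out_pos : 0);
    }

    int main() {
        std::vector<uint8_t> data(2 * 1024 * 1024, 'x');   // dummy input
        const size_t half = data.size() / 2;
        std::vector<uint8_t> s1, s2;

        std::thread t1(encode_half, data.data(), half, &s1);
        std::thread t2(encode_half, data.data() + half, data.size() - half, &s2);
        t1.join(); t2.join();

        std::printf("halves: %zu and %zu compressed bytes\n", s1.size(), s2.size());
        return 0;
    }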

+7
Jan 19 '09 at 8:04

Encryption algorithms have had fairly good success in this area, so you may want to look into that. Here is a paper related to CUDA and AES encryption: http://www.manavski.com/downloads/PID505889.pdf

+2
Jan 19 '09 at 8:32

We are trying to port bzip2 to CUDA. :) So far (and only with rough tests) our Burrows-Wheeler transform is 30% faster than the sequential algorithm. http://bzip2.github.com

+2
Dec 23 '10 at 11:39

30% is good, but for applications such as backups, this is not enough.

My experience is that the average data stream in such cases gets 1.2-1.7:1 compression using gzip and ends up limited to an output rate of 30-60 MB/s (this is across a wide range of modern, circa 2010-2012, mid-range CPUs).

The limitation here is usually the speed at which data can be fed into the CPU itself.

Unfortunately, to keep an LTO5 tape drive happy, it needs a raw (uncompressible) data rate of around 160 MB/s. When fed compressible data, it requires even higher data rates.

LTO compression is clearly a lot faster but somewhat inefficient (equivalent to gzip -1, which is good enough for most purposes). LTO4 drives and upward usually have built-in AES-256 encryption engines that can also sustain these speeds.

What this means for my case is that I would need a 400% or better improvement to consider it worthwhile.
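
A quick back-of-the-envelope check of that figure, taking roughly 40 MB/s as a representative gzip rate from the 30-60 MB/s range against the 160 MB/s the drive needs:

    #include <cstdio>

    int main() {
        const double gzip_mb_s = 40.0;    // representative rate from the 30-60 MB/s range
        const double lto5_mb_s = 160.0;   // raw rate needed to keep an LTO5 streaming
        std::printf("required speedup: %.0fx (%.0f%%)\n",
                    lto5_mb_s / gzip_mb_s, 100.0 * lto5_mb_s / gzip_mb_s);
        return 0;                          // prints: required speedup: 4x (400%)
    }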

Similar considerations apply to networks. At 30 MB/s, compression is a bottleneck on Gb-class networks, and the question becomes whether to spend more on networking or on compression... :)

+1
Mar 05


