How can I speed up crc32 calculation?

I am trying to write a crc32 implementation on Linux that is as fast as possible, as an exercise in learning how to optimize C. I tried my best, but I could not find many good resources on the Internet. I'm not even sure the buffer size is reasonable; it was chosen by repeated experiment.

#include <stdio.h>

#define BUFFSIZE 1048567

const unsigned long int lookupbase = 0xEDB88320;

unsigned long int crctable[256] = {
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    /* LONG LIST OF PRECALCULATED VALUES */
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};

int main(int argc, char *argv[]){
    register unsigned long int x;
    int i;
    register unsigned char *c, *endbuff;
    unsigned char buff[BUFFSIZE];
    register FILE *thisfile = NULL;

    for (i = 1; i < argc; i++){
        thisfile = fopen(argv[i], "r");
        if (thisfile == NULL) {
            printf("Unable to open ");
        } else {
            x = 0xFFFFFFFF;
            c = &(buff[0]);
            endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            while (c != endbuff){
                /* process one buffer's worth of data, then refill */
                while (c != endbuff){
                    x = (x >> 8) ^ crctable[(x & 0xFF) ^ *c];
                    c++;
                }
                c = &(buff[0]);
                endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            }
            fclose(thisfile);
            x = x ^ 0xFFFFFFFF;
            printf("%08lX ", x);   /* x is unsigned long, so use %lX */
        }
        printf("%s\n", argv[i]);
    }
    return 0;
}

Thanks in advance for any suggestions or resources I can read.

+4
3 answers

You asked to have three values stored in registers, but standard x86 only has four general-purpose registers: that is an awfully big burden to place on the last remaining one. This is one reason why I expect register really only prevents you from using &foo to take the address of a variable; I don't think any modern compiler even uses it as a hint these days. Feel free to remove all three uses and re-time your application.

Since you're reading the file in huge chunks yourself, you might as well use open(2) and read(2) directly and cut out all the behind-the-scenes stdio buffering. Another common approach is to open(2) and mmap(2) the file into memory and let the OS page it in for you. This may allow future pages to be read from disk optimistically while you perform your computation: it is a common access pattern, and one the OS developers have tried to optimize for. (The simple approach of mapping the whole file at once puts an upper limit on the size of files you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks will let you handle files of arbitrary size even on 32-bit platforms, at the cost of a loop like the one you have now, only for mapping instead of reading.)
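A minimal sketch of the whole-file mmap(2) variant, with error handling trimmed, and assuming a crctable with the same contents as yours (declared here as uint32_t rather than unsigned long):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

extern const uint32_t crctable[256];   /* assumed: your precalculated table */

uint32_t crc32_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);                    /* error checks omitted for brevity */

    /* Map the whole file read-only; the kernel pages it in on demand. */
    const unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const unsigned char *p = base, *end = base + st.st_size;

    uint32_t x = 0xFFFFFFFF;
    while (p != end)
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ *p++];

    munmap((void *)base, st.st_size);
    close(fd);
    return x ^ 0xFFFFFFFF;
}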

As David Gelhar points out, you are using an odd-length buffer, which can complicate the code path for reading the file into memory. If you want to stick with reading from files into a buffer, I suggest using a multiple of 8192 (two pages of memory), since that avoids special cases until the final, partial read.

If you really want to dig for the last bit of speed and don't mind dramatically increasing the size of your precalculated table, you can process the file in 16-bit chunks rather than 8-bit chunks. Memory access on 16-bit alignment is frequently faster than on 8-bit alignment, and you cut the number of iterations through your loop in half, which usually gives a big speed boost. The downside, of course, is increased memory pressure (65 thousand entries of 8 bytes each rather than just 256 entries of 4 bytes each), and the much larger table is far less likely to fit entirely in the CPU cache.
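A sketch of how such a 16-bit table could be derived from the ordinary byte-wise one (4-byte entries here; the names are just for illustration, and the caller is assumed to keep the usual 0xFFFFFFFF pre/post-conditioning and to handle an odd trailing byte with the byte-wise step):

#include <stddef.h>
#include <stdint.h>

static uint32_t t8[256];        /* ordinary byte-at-a-time table     */
static uint32_t t16[65536];     /* one entry per 16-bit input chunk  */

static void build_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i;
        for (int k = 0; k < 8; k++)
            r = (r >> 1) ^ ((r & 1) ? 0xEDB88320 : 0);
        t8[i] = r;
    }
    for (uint32_t w = 0; w < 65536; w++) {
        /* Apply the byte-wise step twice, once per byte of w. */
        uint32_t r = t8[w & 0xFF];
        t16[w] = (r >> 8) ^ t8[((w >> 8) ^ r) & 0xFF];
    }
}

/* Consume two bytes per iteration; len is assumed to be even here. */
static uint32_t crc32_16bit(const unsigned char *buf, size_t len, uint32_t x)
{
    for (size_t i = 0; i < len; i += 2) {
        uint32_t w = buf[i] | ((uint32_t)buf[i + 1] << 8);
        x = (x >> 16) ^ t16[(x ^ w) & 0xFFFF];
    }
    return x;
}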

And the last optimization idea that comes to mind is to fork(2) into 2, 3, or 4 processes (or use threads), each of which computes the crc32 of a portion of the file, and then merge the partial results once all the processes are done. crc32 may not be computationally heavy enough to actually benefit from multiple cores on SMP or multi-core machines, and figuring out how to combine partial crc32 computations may not be feasible (I have not looked into it myself :) ), but it might pay off, and learning how to write multi-process or multi-threaded software is worth the effort in its own right.
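As an aside, combining partial CRC-32s is actually a solved problem: zlib provides crc32_combine() for exactly this merge. A quick single-threaded sanity check of the idea (assuming zlib is installed and you link with -lz; splitting the halves across threads or fork(2)ed processes is left out here):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const unsigned char data[] = "an example buffer split into two halves";
    size_t len = sizeof(data) - 1;
    size_t half = len / 2;

    /* CRC of the whole buffer in one go. */
    uLong whole = crc32(crc32(0L, Z_NULL, 0), data, len);

    /* CRCs of each half, computed independently... */
    uLong a = crc32(crc32(0L, Z_NULL, 0), data, half);
    uLong b = crc32(crc32(0L, Z_NULL, 0), data + half, len - half);

    /* ...then merged; the merge only needs the length of the second part. */
    uLong merged = crc32_combine(a, b, len - half);

    printf("whole  %08lX\nmerged %08lX\n", whole, merged);
    return 0;
}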

+2

On Linux? Forget the register keyword; it's just a suggestion to the compiler and, in my experience with gcc, a waste of space. gcc is more than capable of figuring this out for itself.

I would just make sure you're compiling with the insane optimization level, -O3, and check the result. I've seen code gcc produced at that level which took me hours to understand, it was that sneaky.

As for the buffer size, make it as large as possible. Even with buffering, the cost of calling fread is still there, so the fewer times you do it, the better. You would see a huge improvement going from a 1K buffer to a 1M buffer, much less going from 1M to 2M, but even a small performance increase is still an increase. And 2M is not the upper limit of what you can use; I would set it to a gigabyte or more if possible.

At that point you will want to put the buffer at file level (rather than inside main), since sooner or later the stack will not be able to hold it.
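A sketch of the two usual ways to get a big buffer off the stack (the size here is just an example):

#include <stdlib.h>

#define BUFFSIZE (64 * 1024 * 1024)    /* example: 64 MB, far too big for the stack */

/* Option 1: file scope -- the buffer lives in .bss, not on the stack. */
static unsigned char buff[BUFFSIZE];

int main(void)
{
    /* Option 2: heap allocation -- the size can even be chosen at run time. */
    unsigned char *dynbuff = malloc(BUFFSIZE);
    if (dynbuff == NULL)
        return 1;

    /* ... use buff or dynbuff exactly as before ... */

    free(dynbuff);
    return 0;
}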

As with most optimizations, you can usually trade space for time. Keep in mind that for small files (less than 1M) you will see no improvement, since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if it takes longer for the process to set up the memory.

But since that only affects small files (where performance isn't a problem anyway), it shouldn't really matter. The large files, where performance is a problem, are where you should hopefully find an improvement.

And I know I shouldn't need to say this (since you indicate you're already doing it), but I'll mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimized by guesswork :-)

+3

You cannot speed up the actual CRC arithmetic, so the areas to look at are the overhead of (a) reading the file and (b) looping over the data.

You are using a rather large buffer size, which is good (but why is it an odd number?). Using the read(2) system call (assuming you are on a unix-like system) instead of the standard library function fread(3) would save one copy operation (copying the data from the internal fread buffer into your own buffer).
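A sketch of what the read(2) version of the loop might look like (error handling abbreviated; crctable is assumed to hold the same values as in the question, declared as uint32_t here):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define BUFFSIZE 8192                  /* a multiple of the page size */

extern const uint32_t crctable[256];   /* assumed: same table as in the question */

int crc32_read(const char *path, uint32_t *out)
{
    unsigned char buff[BUFFSIZE];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    uint32_t x = 0xFFFFFFFF;
    ssize_t n;
    /* read(2) fills our buffer directly; no stdio buffer in between. */
    while ((n = read(fd, buff, sizeof buff)) > 0)
        for (ssize_t j = 0; j < n; j++)
            x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[j]];

    close(fd);
    *out = x ^ 0xFFFFFFFF;
    return n < 0 ? -1 : 0;
}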

For the loop overhead, take a look at loop unrolling.
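For instance, unrolling the inner loop four times might look roughly like this (again assuming a uint32_t table with the same contents; the trailing loop mops up the leftover bytes):

#include <stddef.h>
#include <stdint.h>

extern const uint32_t crctable[256];   /* assumed: same table as in the question */

static uint32_t crc32_unrolled(const unsigned char *p, size_t len, uint32_t x)
{
    /* Four table lookups per loop iteration instead of one. */
    while (len >= 4) {
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ p[0]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ p[1]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ p[2]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ p[3]];
        p += 4;
        len -= 4;
    }
    /* Fewer than four bytes remain. */
    while (len--)
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ *p++];
    return x;
}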


There are also a few redundancies in your code that you could eliminate:

  • sizeof (unsigned char) is 1 (as defined in C); no need to explicitly calculate it

  • c = &(buff[0]) is exactly equivalent to c = buff

Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in keeping with the usual C style.

+2
