You asked for three values to be stored in registers, but standard x86 has only four general-purpose registers. That is a lot of burden to place on the one remaining register, which is one reason I expect `register` really only prevents you from using `&foo` to take the address of the variable. I don't think any modern compiler even uses it as a hint these days. Feel free to remove all three uses and re-time your application.
Since you are reading huge chunks of the file yourself, you might as well use open(2) and read(2) directly and drop all the standard I/O handling behind the scenes. Another common approach is to open(2) the file and mmap(2) it into memory, letting the OS page it in as pages are needed. This may allow future pages to be read from disk speculatively while you are doing your computation: sequential access is a common pattern, and one that OS designers have tried to optimize. (The simple mechanism of mapping the entire file at once puts an upper limit on the size of the files you can handle, probably around 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks lets you handle files of arbitrary size even on 32-bit platforms, but at the cost of a loop like the one you have now for reading, only for mapping instead.)
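Here is a minimal sketch of the mmap(2) variant. Your code is not reproduced here, so I am assuming the usual table-driven, reflected CRC-32 (polynomial 0xEDB88320); the names `crc32_init`, `crc32_update` and `crc32_file_mmap` are made up for the example. It maps the whole file at once, so the size limits mentioned above apply.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static uint32_t crc_table[256];

/* Build the 256-entry table for the reflected polynomial 0xEDB88320
 * (assumed here; substitute your own table if it differs). */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
}

/* Fold len bytes into a running CRC. Pass crc = 0 for the first call. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

/* Map the whole file read-only and checksum it in one call.
 * Returns 0 on success, -1 on error. */
static int crc32_file_mmap(const char *path, uint32_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap rejects zero-length mappings */
        close(fd);
        *out = 0;
        return 0;
    }

    size_t len = (size_t)st.st_size; /* may not fit on 32-bit platforms */
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    *out = crc32_update(0, map, len);
    munmap(map, len);
    return 0;
}
```

Call `crc32_init()` once at startup, then `crc32_file_mmap(path, &crc)` per file. There is no buffer and no read loop; the kernel decides when to bring pages in.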
As David Gelhar points out, you are using an odd-length buffer, which might complicate the code path for reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of 8192 bytes (two pages of memory), since it will not hit any special cases until the last iteration of the loop.
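If you stay with buffers, the read loop could look roughly like this. It is only a sketch: `BUF_SIZE`, `update_fn` and `checksum_file_read` are names I made up, and `byte_sum` is a placeholder checksum so the example compiles on its own; you would pass your own crc32 routine instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define BUF_SIZE (8 * 8192)   /* a multiple of 8192 bytes */

/* Hypothetical callback type: folds len bytes into a running checksum. */
typedef uint32_t (*update_fn)(uint32_t state, const unsigned char *p, size_t len);

/* Placeholder checksum (plain byte sum), standing in for the real CRC. */
static uint32_t byte_sum(uint32_t state, const unsigned char *p, size_t len)
{
    while (len--)
        state += *p++;
    return state;
}

/* Read the file in BUF_SIZE chunks with read(2), bypassing stdio, and
 * feed each chunk to update. Returns 0 on success, -1 on an I/O error. */
static int checksum_file_read(const char *path, update_fn update, uint32_t *out)
{
    static unsigned char buf[BUF_SIZE];
    uint32_t state = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            state = update(state, buf, (size_t)n);
        } else if (n == 0) {
            break;                  /* end of file */
        } else if (errno != EINTR) {
            close(fd);
            return -1;              /* real error; EINTR just retries */
        }
    }

    close(fd);
    *out = state;
    return 0;
}
```

For a regular file, every read except the last normally fills the whole buffer, so the inner checksum loop sees the same length each time.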
If you really want to squeeze out the last bit of speed and don't mind drastically increasing the size of your precomputed table, you can process the file in 16-bit chunks rather than 8-bit chunks. Memory access on 16-bit alignment is often faster than on 8-bit alignment, and you cut the number of iterations through your loop in half, which usually gives a big speed boost. The downside, of course, is increased memory pressure (65k entries of 8 bytes each, rather than 256 entries of 4 bytes each), and the much larger table is much less likely to fit entirely in the CPU cache.
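A sketch of the 16-bit table idea, again assuming the standard reflected CRC-32 polynomial 0xEDB88320 and using invented names (`crc_table16`, `crc32_init16`, `crc32_update16`). With `uint32_t` entries this particular table comes to 256 KiB. The byte pair is assembled from two single-byte loads so the code does not depend on buffer alignment or host endianness.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table16[65536];   /* 65536 entries * 4 bytes = 256 KiB */

/* Advance a CRC register by `bits` bits, one bit at a time. */
static uint32_t crc32_shift(uint32_t c, int bits)
{
    while (bits--)
        c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
    return c;
}

/* Precompute the effect of 16 shifts for every possible 16-bit input. */
static void crc32_init16(void)
{
    for (uint32_t i = 0; i < 65536; i++)
        crc_table16[i] = crc32_shift(i, 16);
}

/* Fold len bytes into a running CRC, two bytes per table lookup.
 * Pass crc = 0 for the first call. */
static uint32_t crc32_update16(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len >= 2) {
        /* First byte goes in the low 8 bits, matching byte-at-a-time order. */
        uint32_t pair = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
        crc = crc_table16[(crc ^ pair) & 0xFFFF] ^ (crc >> 16);
        p += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, done bit by bit */
        crc = crc32_shift(crc ^ *p, 8);
    return ~crc;
}
```

Whether this beats the 256-entry table depends on how much of the larger table stays in cache, so time it on your own data before committing to it.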
The last optimization idea that comes to mind is to fork(2) into 2, 3, or 4 processes (or use threads), each of which computes the crc32 of one portion of the file, and then combine the results once all of them have finished. crc32 may not be computationally intensive enough to actually benefit from multiple cores on SMP or multi-core machines, and figuring out how to combine partial crc32 computations may not be feasible (I have not looked into it myself :) ), but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth it regardless.
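On the combining question: partial CRC-32 values can in fact be merged, and zlib ships a `crc32_combine` function for exactly this. The sketch below is my own simplified take on the same idea, not zlib's code: it advances the first CRC past as many zero bytes as the second slice is long, using 32x32 matrices over GF(2), then XORs in the second CRC. It assumes the standard reflected polynomial 0xEDB88320. `crc32_bytes` is bitwise only to keep the example short, and `crc32_in_parts` computes the slices one after another; each slice is independent, so each could go to its own thread or child process.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 of one buffer (swap in a table-driven routine for speed). */
static uint32_t crc32_bytes(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Multiply the 32x32 GF(2) matrix mat by the bit vector vec. */
static uint32_t gf2_times(const uint32_t mat[32], uint32_t vec)
{
    uint32_t sum = 0;
    for (int i = 0; vec; i++, vec >>= 1)
        if (vec & 1)
            sum ^= mat[i];
    return sum;
}

/* dst = src * src */
static void gf2_square(uint32_t dst[32], const uint32_t src[32])
{
    for (int i = 0; i < 32; i++)
        dst[i] = gf2_times(src, src[i]);
}

/* Given crc1 = CRC(A), crc2 = CRC(B) and len2 = length of B in bytes,
 * return CRC(A followed by B). */
static uint32_t crc32_combine(uint32_t crc1, uint32_t crc2, size_t len2)
{
    uint32_t op[32], tmp[32];

    /* Operator that advances a CRC register by one zero bit. */
    op[0] = 0xEDB88320u;
    for (int i = 1; i < 32; i++)
        op[i] = 1u << (i - 1);

    /* Square three times: 1 bit -> 2 -> 4 -> 8 bits, i.e. one zero byte. */
    for (int k = 0; k < 3; k++) {
        gf2_square(tmp, op);
        memcpy(op, tmp, sizeof op);
    }

    /* Apply len2 zero bytes to crc1 by binary exponentiation. */
    for (; len2; len2 >>= 1) {
        if (len2 & 1)
            crc1 = gf2_times(op, crc1);
        gf2_square(tmp, op);
        memcpy(op, tmp, sizeof op);
    }
    return crc1 ^ crc2;
}

/* Checksum buf as nparts (>= 1) independent slices and merge the results. */
static uint32_t crc32_in_parts(const unsigned char *buf, size_t len, size_t nparts)
{
    uint32_t total = 0;
    size_t done = 0;
    for (size_t i = 0; i < nparts; i++) {
        size_t n = (len - done) / (nparts - i);   /* spread bytes evenly */
        uint32_t part = crc32_bytes(buf + done, n);
        total = crc32_combine(total, part, n);
        done += n;
    }
    return total;
}
```

The merge costs a few dozen small matrix operations per slice regardless of slice size, so it is negligible next to the checksum itself. Whether the parallel version wins overall still depends on whether you are limited by CPU or by disk.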