I am writing a program that needs to process many small files, say thousands or even millions. I tested this part on 500 thousand files. The first step is simply to walk a directory tree that contains about 45 thousand directories (including subdirectories, etc.) and 500 thousand small files. Traversing all directories and files, including getting the file sizes and computing the total size, takes about 6 seconds. Now, if I also open every file during the traversal and close it right away, it looks like it never finishes. In fact, it takes far too long (hours...). Since I am doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I did not read or write anything to the files, although in the final implementation I will only need to read. However, none of the attempts showed any noticeable improvement.
I wonder whether there is a more efficient way to open the files with any of the available functions, whether in C, C++ or the Windows API, or whether the only more efficient approach is to read the MFT and the disk blocks directly, which I am trying to avoid.
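For reference, the loop I am timing looks roughly like the sketch below (simplified: the root path is a placeholder, error handling is trimmed, and the real code also accumulates the file sizes):

```c
#include <windows.h>
#include <wchar.h>

/* Recurse through the tree, open each file with CreateFileW, close it immediately. */
static void walk(const wchar_t *dir)
{
    wchar_t pattern[MAX_PATH];
    swprintf(pattern, MAX_PATH, L"%s\\*", dir);

    WIN32_FIND_DATAW fd;
    HANDLE hFind = FindFirstFileW(pattern, &fd);
    if (hFind == INVALID_HANDLE_VALUE)
        return;

    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;

        wchar_t path[MAX_PATH];
        swprintf(path, MAX_PATH, L"%s\\%s", dir, fd.cFileName);

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            walk(path);
        } else {
            /* Open read-only and close right away -- this is the step that
               turns a 6-second traversal into hours. */
            HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                   OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
            if (h != INVALID_HANDLE_VALUE)
                CloseHandle(h);
        }
    } while (FindNextFileW(hFind, &fd));

    FindClose(hFind);
}

int main(void)
{
    walk(L"C:\\test-tree"); /* hypothetical root; the real test uses the 500k-file repository */
    return 0;
}
```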
Update: The application I am working on takes backup snapshots with versioning, so it also does incremental backups. The 500k-file test is run on a huge source code repository in order to version it, something like an SCM. So the files are not all in one directory; there are also about 45k directories (as mentioned above).
Thus, the proposed solution of zipping the files does not help, because a backup has to access all the files anyway. So I would not see any benefit from it, and it would even add some performance overhead.