Opening many small files on NTFS is too slow

I am writing a program that needs to process a large number of small files, thousands or even millions of them. I tested this part on 500,000 files: the first step was simply to traverse a directory tree containing about 45,000 directories (nested subdirectories included) and 500,000 small files. Walking all directories and files, including getting each file's size and computing the total size, takes about 6 seconds. Now, if I additionally open every file during the walk and close it immediately, it looks like it never finishes; in practice it takes hours. Since I am doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I did not read or write anything in the files, although in the final implementation I will only need to read. None of these attempts showed any noticeable improvement.
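For reference, the open/close pass I am timing looks roughly like this (a simplified sketch, not my exact code; the walk uses FindFirstFileW/FindNextFileW):

    // Simplified sketch: walk a directory tree and open/close every file.
    #include <windows.h>
    #include <string>

    static void Walk(const std::wstring &dir)
    {
        WIN32_FIND_DATAW fd;
        HANDLE hFind = FindFirstFileW((dir + L"\\*").c_str(), &fd);
        if (hFind == INVALID_HANDLE_VALUE) return;
        do {
            std::wstring name = fd.cFileName;
            if (name == L"." || name == L"..") continue;
            std::wstring path = dir + L"\\" + name;
            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
                Walk(path);                       // recurse into subdirectory
            } else {
                // This open/close pair is where the hours go.
                HANDLE h = CreateFileW(path.c_str(), GENERIC_READ,
                                       FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                       FILE_ATTRIBUTE_NORMAL, nullptr);
                if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
            }
        } while (FindNextFileW(hFind, &fd));
        FindClose(hFind);
    }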

I wonder whether there is a more efficient way to open the files with any of the available functions, be it C, C++ or the Windows API, or whether the only more efficient approach is to read the MFT and disk blocks directly, which I am trying to avoid.

Update: The application I am working on takes backup snapshots with versioning, so it also supports incremental backups. The 500k-file test is run against a huge source code repository under version control, something like an SCM. So the files are not all in one directory; there are also about 45k directories (as mentioned above).

So the proposed zip-file solution does not help, because taking the backup is exactly when all the files are accessed; I would see no benefit from it, and it would even add some overhead.

+6
5 answers

What you are trying to do is intrinsically difficult for any operating system to do efficiently. 45,000 subdirectories require a lot of disk access no matter how you slice it.

By NTFS standards, any file over about 1,000 bytes is "large". If there were a way to make most of the data files smaller than about 900 bytes, you could gain a lot of efficiency by having the file contents stored resident in the MFT itself. Obtaining the data would then be no more expensive than obtaining the file's timestamps or size.

I doubt there is any combination of program options, process settings or even operating system tuning that will make the application perform well. You are facing a multi-hour operation unless you can restructure it in some radical way.

One strategy would be to distribute the files across several computers, perhaps thousands of them, and have a sub-application on each machine process the local files, feeding any results back to a master application.

Another strategy would be to re-architect all the files into a few large files, such as big .zip files as suggested by @felicepollano, effectively virtualizing your set of files. Random access to a 4000 GB file is inherently far more efficient and sparing of resources than access to 4 million 1 MB files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would also accomplish this, and might bring other benefits, such as easy searching and a simple archival strategy.

+6

An overhead of 5 to 20 ms per file is not abnormal for an NTFS volume with that many files. (On an ordinary spindle drive you cannot expect to do much better than that anyway, because it is on the same order as the head seek time. From here on I will assume we are dealing with enterprise-class hardware, SSDs and/or RAID.)

In my experience you can significantly increase throughput by parallelizing the requests, i.e. using multiple threads and/or processes. Most of the overhead appears to be per-thread; the system can open ten files at once almost as fast as it can open one file by itself. I am not sure why that is. You may need to experiment to find the optimal degree of parallelism.
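A minimal sketch of the idea (assuming C++11 and the Win32 API; the thread count of 8 and the shared atomic index are illustrative, not tuned):

    // Sketch: open a pre-built list of paths from several threads at once.
    #include <windows.h>
    #include <atomic>
    #include <string>
    #include <thread>
    #include <vector>

    void OpenAllParallel(const std::vector<std::wstring> &paths)
    {
        std::atomic<size_t> next{0};
        auto worker = [&]() {
            for (;;) {
                size_t i = next.fetch_add(1);    // claim the next path
                if (i >= paths.size()) break;
                HANDLE h = CreateFileW(paths[i].c_str(), GENERIC_READ,
                                       FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                       FILE_ATTRIBUTE_NORMAL, nullptr);
                if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
            }
        };
        std::vector<std::thread> pool;
        for (int t = 0; t < 8; ++t) pool.emplace_back(worker);
        for (auto &th : pool) th.join();
    }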

The system administrator can also significantly improve performance by copying the contents to a new volume, preferably in roughly the same order in which they will be accessed. I had to do this recently, and it reduced the backup time (for a volume with about 14 million files) from 85 hours to 18 hours.

You can also try OpenFileById(), which may perform better for files in large directories, since it bypasses the need to enumerate the directory tree. However, I have never tried it myself, and it might not have much effect, since the directory is probably cached anyway if you have just enumerated it.
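If you want to experiment with it, the call would look roughly like this (untested sketch; the 64-bit file ID would come from a prior enumeration, e.g. GetFileInformationByHandleEx with FileIdBothDirectoryInfo, and OpenFileById requires Vista or later):

    // Sketch: open a file by its NTFS file ID instead of by path.
    #include <windows.h>

    HANDLE OpenById(LONGLONG fileId)
    {
        // Any handle to a file or directory on the same volume works as a hint;
        // here the volume root is used.
        HANDLE hVol = CreateFileW(L"C:\\", 0,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS,
                                  nullptr);
        if (hVol == INVALID_HANDLE_VALUE) return INVALID_HANDLE_VALUE;

        FILE_ID_DESCRIPTOR id = {};
        id.dwSize = sizeof(id);
        id.Type = FileIdType;
        id.FileId.QuadPart = fileId;

        HANDLE h = OpenFileById(hVol, &id, GENERIC_READ, FILE_SHARE_READ,
                                nullptr, 0);
        CloseHandle(hVol);
        return h;
    }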

You can also enumerate the files on the volume more quickly by reading them from the MFT, although it sounds as though enumeration is not your bottleneck at the moment.
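For completeness, a rough sketch of that approach using FSCTL_ENUM_USN_DATA (requires administrator rights; error handling omitted):

    // Sketch: enumerate MFT records on C: and print the file names returned.
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    void EnumMft(void)
    {
        HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                                  OPEN_EXISTING, 0, NULL);
        if (hVol == INVALID_HANDLE_VALUE) return;

        MFT_ENUM_DATA_V0 med = { 0, 0, MAXLONGLONG };   // whole MFT, any USN
        BYTE buf[64 * 1024];
        DWORD bytes;
        while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                               buf, sizeof(buf), &bytes, NULL))
        {
            // The first 8 bytes of the output are the next start reference.
            PUSN_RECORD rec = (PUSN_RECORD)(buf + sizeof(USN));
            while ((PBYTE)rec < buf + bytes) {
                wprintf(L"%.*s\n",
                        (int)(rec->FileNameLength / sizeof(WCHAR)),
                        rec->FileName);
                rec = (PUSN_RECORD)((PBYTE)rec + rec->RecordLength);
            }
            med.StartFileReferenceNumber = *(DWORDLONG *)buf;
        }
        CloseHandle(hVol);
    }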

+3

You could pack these files into a zip archive with a low compression ratio (or no compression at all) and then use a zip library to read them, which is usually much faster than reading the individual files one at a time. Of course, this has to be done in advance, as a batch step before the actual processing.
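As an illustration only (a single-file library such as miniz is one option; any zip library with a "store" mode would do, and the file names here are placeholders assuming relative paths):

    // Sketch: pack a list of files into one archive with no compression.
    #include "miniz.h"

    int PackFiles(const char **paths, int count)
    {
        mz_zip_archive zip = {0};
        if (!mz_zip_writer_init_file(&zip, "bundle.zip", 0))
            return 0;
        for (int i = 0; i < count; ++i) {
            // Archive entry name == source path, for simplicity.
            if (!mz_zip_writer_add_file(&zip, paths[i], paths[i],
                                        NULL, 0, MZ_NO_COMPRESSION))
                break;
        }
        mz_zip_writer_finalize_archive(&zip);
        mz_zip_writer_end(&zip);
        return 1;
    }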

+1

You could try making one pass to enumerate the files into a data structure, then open and close them in a second pass, to see whether interleaving the two operations is causing contention.

As I wrote in the comments, there are many performance concerns around having a huge number of entries in a single NTFS directory, so if you have any control over how the files are distributed across directories, you may want to take advantage of that.

Also check for anti-malware software on your system. Some of it slows down every file access by scanning the entire file each time you touch it. Sysinternals Procmon can help you identify this kind of problem.

When trying to improve performance, it helps to set a target: how fast is fast enough?

EDIT: This part of the original answer applies only if you are using Windows XP or earlier:

Opening and closing each file will, by default, update the last-access time in the index. You could try an experiment in which you disable that feature via the registry or the command line and see how much difference it makes. I am not sure whether that is realistic to do in your actual product, since it is a global setting.
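If you want to try it, the switch can be inspected and toggled from an elevated command prompt with the standard fsutil tool (global setting; a reboot may be needed for it to take full effect):

    fsutil behavior query disablelastaccess
    fsutil behavior set disablelastaccess 1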

+1

NTFS is slow with large numbers of files, especially when they are all in the same directory; when they are split into separate folders and subfolders, access is faster. I have experience with the many files stored by a video recorder (4 cameras), and it was too slow even to see the number of files and their total size (Properties on the root folder). Interestingly, the same thing on a FAT32 drive is much faster, even though every source says NTFS is faster... It may be faster at reading a single file, but directory operations are slower.

Why do you need so many files? I hope the directory indexing service is enabled.

+1

Source: https://habr.com/ru/post/980717/

