What is the time complexity of reading a file from a Linux file system?

Suppose I have a very large number of directories (say, 100,000) in my file system, and within each directory there is the same number of directories. Each directory can contain any number of files, but usually no more than a few. This structure goes down to a constant depth (10).

My question is whether there is a difference in time complexity (of the read operation) if I read a file from this directory structure, for example /dir-34/dir-215/dir-345/file1 using Paths.get(), compared to reading a file from a file system with a simple structure, for example this:

    /dir1
    /dir2
    /dir3
        file1
    /dir4
        file2

Note: this is just a theoretical question. I just want to know whether the number of directories/files in the directory from which I am trying to open the file affects the speed of the read operation.
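For concreteness, a minimal Java sketch of the two reads I have in mind could look like this (the paths are just the hypothetical examples from above, and both files are assumed to exist; a single timed read mostly measures path resolution plus the open and read, and the kernel's caches will dominate repeated runs):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ReadTiming {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths; both files are assumed to already exist.
            Path deep = Paths.get("/dir-34/dir-215/dir-345/file1"); // deep, wide structure
            Path flat = Paths.get("/dir3/file1");                   // simple structure

            System.out.println("deep: " + timeRead(deep) + " ns");
            System.out.println("flat: " + timeRead(flat) + " ns");
        }

        // Reads the whole file once and returns the elapsed wall-clock time in ns.
        // A single run mostly measures path resolution + open + read; the kernel's
        // dentry/inode/page caches will dominate repeated runs.
        static long timeRead(Path p) throws Exception {
            long start = System.nanoTime();
            Files.readAllBytes(p);
            return System.nanoTime() - start;
        }
    }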

+5
2 answers

Take /path/to/file as an example. (Note: as always, performance and time complexity will largely depend on the on-disk structures and the implementation of the underlying file system, e.g. btrfs is all B-trees, while ext4 and XFS use H-trees.)

To walk the directory structure down to the leaf node (the directory that contains the file), the average-case time complexity should be O(log N), and the worst case O(N), where N is the number of directories in the tree. The worst case is when the N-th directory is created under the (N-1)-th, the (N-1)-th under the (N-2)-th, and so on up to the root directory, so that the tree degenerates into a single branch. Ideally, you do not need to visit every directory in the tree starting from the root if you have the full path.

Then, if your underlying FS supports directory indexes and hashing, each lookup needs only another O(1) to find the file within its directory. So the total is O(log N) + O(1), i.e., ignoring lower-order terms, it should be just O(log N), where N is the number of levels (the depth).
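For illustration, here is a minimal Java sketch (using the hypothetical path from the question) that just enumerates the components the kernel has to resolve one at a time; the point is that the cost grows with the number of components (the depth), not with how many sibling entries each directory holds:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PathComponents {
        public static void main(String[] args) {
            // Hypothetical deep path from the question.
            Path p = Paths.get("/dir-34/dir-215/dir-345/file1");

            // The kernel resolves a path one component at a time: each component
            // is one lookup inside its parent directory. With a hashed/indexed
            // directory that per-component lookup is ~O(1), so the total cost
            // scales with the depth, not with the number of siblings per level.
            System.out.println("components to resolve: " + p.getNameCount());
            for (Path component : p) {
                System.out.println("lookup: " + component);
            }
        }
    }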

+2

Some popular file systems use more efficient data structures than older file systems. ext4 has directory hashing enabled by default (as @ninjalj pointed out), and so does XFS. This means that a lookup within a single directory is expected to take O(1) on average (so constant time, if your path has a fixed maximum number of components). This follows from the hash function itself.

Even if you have millions of files in a directory, accessing a single file is very fast, but only if you have the full path. If you do not have the full path and instead need to scan the directory for a matching pattern, you are facing O(n) in the number of entries in the directory. This is further compounded by the small read size (32 KB) of standard system-level directory read requests.
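To make the contrast concrete, here is a minimal Java sketch using the hypothetical directory /dir-34/dir-215/dir-345 from the question: opening by full path is one indexed lookup per component, while finding a file by pattern forces the directory's entries to be enumerated:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class LookupVsScan {
        public static void main(String[] args) throws IOException {
            Path dir = Paths.get("/dir-34/dir-215/dir-345"); // hypothetical directory

            // Full path known: one indexed lookup per component, independent
            // of how many entries the directory holds.
            boolean exists = Files.exists(dir.resolve("file1"));
            System.out.println("direct lookup: " + exists);

            // Full path unknown: the directory must be enumerated and each
            // entry compared against the pattern, i.e. O(n) in the number of entries.
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "file*")) {
                for (Path match : stream) {
                    System.out.println("scan match: " + match.getFileName());
                }
            }
        }
    }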

(While ext4 directories can hold a huge number of files, they are limited to 64,000 subdirectory entries.)

+1

Source: https://habr.com/ru/post/1209474/

