Why does Linux use getdents() on directories instead of read()?

I was looking through K&R C, and I noticed that they used read() to read entries in directories:

while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf)) == sizeof(dirbuf)) /* code */ 

Here dirbuf is the system-specific directory structure, and dp->fd is a valid file descriptor. On my system, dirbuf would be a struct linux_dirent . Note that a struct linux_dirent has a flexible array member for the entry name, but suppose for simplicity that is not the case. (Handling the flexible array member in this scenario would only require a small amount of extra boilerplate.)

Linux, however, does not support this approach. When read() is used on a directory as described above, it returns -1 and sets errno to EISDIR .
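
For illustration, here is a minimal probe (a hypothetical example, not from K&R; the path "." is just an arbitrary directory) that shows the failure on a modern Linux system:

    /* Minimal sketch: read(2) on a directory fails with EISDIR on Linux. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[512];
        int fd = open(".", O_RDONLY);   /* "." is a directory */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (read(fd, buf, sizeof(buf)) == -1)
            printf("read failed: %s\n", strerror(errno));  /* "Is a directory" */
        close(fd);
        return 0;
    }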

Instead, Linux dedicates a separate system call to reading directories, namely getdents() . However, I noticed that it is used in almost the same way as described above.

 while (syscall(SYS_getdents, fd, &dirbuf, sizeof(dirbuf)) > 0) /* code */ 

What was the rationale for this? It seems to have little or no benefit compared to using read() , as done in K&R.

2 answers

getdents will return struct linux_dirent entries. It will do this for any underlying filesystem type. The "on disk" format may be completely different, known only to that filesystem's driver, so a simple userspace read call cannot work. That is, getdents converts from the native format to populate linux_dirent .
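
For concreteness, here is a minimal sketch of driving getdents from user space, modeled on the example in the getdents(2) man page (glibc provides no wrapper, so the structure is declared by hand; the directory "." and the 4096-byte buffer are arbitrary choices):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Declared by hand; glibc exposes no struct linux_dirent. */
    struct linux_dirent {
        unsigned long  d_ino;     /* inode number */
        unsigned long  d_off;     /* offset to the next entry */
        unsigned short d_reclen;  /* total length of this record */
        char           d_name[];  /* filename, null-terminated */
    };

    int main(void)
    {
        char buf[4096];
        long nread;
        int fd = open(".", O_RDONLY);

        if (fd < 0)
            return 1;

        /* Each call fills buf with as many whole records as will fit. */
        while ((nread = syscall(SYS_getdents, fd, buf, sizeof(buf))) > 0) {
            for (long pos = 0; pos < nread; ) {
                struct linux_dirent *d = (struct linux_dirent *)(buf + pos);
                printf("%8lu  %s\n", d->d_ino, d->d_name);
                pos += d->d_reclen;   /* records are variable-length */
            }
        }
        close(fd);
        return 0;
    }

Note that the kernel hands back whole, already-converted records regardless of what the on-disk format looks like.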

Couldn't you say the same thing about reading bytes from a file using read()? The on-disk format of file data is not necessarily uniform across filesystems, or even contiguous on disk — so reading a few bytes from a file is, again, something I would expect to be delegated to the filesystem driver.

Non-uniform file data is handled by the VFS ["virtual filesystem"] layer, regardless of how a given FS chooses to organize the list of blocks for a file (for example, ext4 uses "inodes" ["index" or "information" nodes] with an ISAM-like ["indexed sequential access method"] organization, while the MS-DOS FS may be organized completely differently).

Each FS driver registers a table of VFS callback functions when it starts up. For a given operation (for example, open/close/read/write/seek ) there is a corresponding entry in the table.
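
Conceptually, the registration looks something like the following simplified sketch. The real structure is the kernel's struct file_operations in <linux/fs.h>; the names and types below are invented for illustration and are not the kernel's actual definitions:

    #include <stddef.h>
    #include <sys/types.h>

    struct file;  /* opaque handle, standing in for the kernel's struct file */

    /* Illustrative per-filesystem callback table (not the real kernel struct). */
    struct fs_operations {
        int     (*open)(struct file *f);
        int     (*release)(struct file *f);
        ssize_t (*read)(struct file *f, char *buf, size_t len, off_t *pos);
        ssize_t (*write)(struct file *f, const char *buf, size_t len, off_t *pos);
        off_t   (*llseek)(struct file *f, off_t off, int whence);
        int     (*iterate)(struct file *f, void *dir_ctx);  /* backs getdents */
    };

    /* A hypothetical driver supplies its own implementations ... */
    static ssize_t myfs_read(struct file *f, char *buf, size_t len, off_t *pos)
    {
        (void)f; (void)buf; (void)len; (void)pos;
        return 0;  /* a real driver would map logical blocks and copy data */
    }

    static int myfs_iterate(struct file *f, void *dir_ctx)
    {
        (void)f; (void)dir_ctx;
        return 0;  /* a real driver would emit one directory entry at a time */
    }

    /* ... and registers the filled-in table; the VFS dispatches through it. */
    static const struct fs_operations myfs_ops = {
        .read    = myfs_read,
        .iterate = myfs_iterate,
        /* entries left NULL cause the VFS to reject that operation */
    };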

The VFS layer (that is, the code behind the userspace system call) will "call down" into the FS driver, and the FS driver will perform the operation, doing whatever it deems necessary to complete the request.

I assume the FS driver knows where the data of a regular file is located on disk — even if the data is fragmented.

Yes. For example, if a read request asks for the first three blocks of a file (say, logical blocks 0, 1, 2), the FS will look up the indexing information for the file and get the list of physical blocks to read (say, 1000000, 200, 37) off the disk surface. All of this is handled transparently inside the FS driver.

A user space program will simply see that its buffer is filled with the correct data, regardless of how complex FS indexing and block fetching are.

It may be [more] correct to refer to this as transferring inode data, since there are inodes for files (that is, an inode holds the indexing information needed to "scatter/gather" the FS blocks of a file). But the FS driver also uses this internally when reading a directory. That is, each directory has an inode of its own that tracks the indexing information for that directory.

So, to the FS driver, a directory is just like a flat file that contains specially formatted information: the directory "entries". That is what getdents returns. It is layered on top of the inode handling.

Directory entries can be of variable length [based on the length of the file name]. So the on-disk format might be (call it "Type A"):

 static part|variable length name
 static part|variable length name
 ...

But ... some FSes are organized differently (call it "Type B"):

 <static1>,<static2>,...
 <variable1>,<variable2>,...

So, while a Type A organization could conceivably be read atomically by a userspace read(2) call, a Type B organization would have trouble. The getdents VFS call handles this.
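
A rough illustration of the two layouts in C (the structure names are invented for this sketch; real filesystems differ in detail):

    /* Type A: each record is a fixed header immediately followed by its
     * variable-length name, so one contiguous read yields whole entries. */
    struct type_a_entry {
        unsigned long  ino;
        unsigned short reclen;   /* total size of this record */
        char           name[];   /* variable-length name stored inline */
    };

    /* Type B: the fixed parts live in one region and the names in another,
     * so no single contiguous read can return complete entries. */
    struct type_b_static {
        unsigned long  ino;
        unsigned long  name_offset;  /* points into a separate name region */
        unsigned short name_len;
    };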

Couldn't the VFS also present a directory as a sequence of linux_dirent entries, just as the VFS presents a "flat view" of a file?

That is exactly what getdents is for.

And again, I assume the FS driver knows the type of each file and thus could return linux_dirent entries, rather than a raw series of bytes, when read() is called on a directory.

getdents did not always exist. Back when directory entries were a fixed size and there was only one FS format, readdir(3) probably just did a read(2) underneath and got back a series of bytes [that is all read(2) provides]. Actually, IIRC, in the beginning there was only the syscall level [readdir(2), and later getdents], and readdir(3) did not exist.

But what do you do when the read(2) comes up "short" (for example, two bytes too small)? How do you communicate that to the application?

My question is more like: since the FS driver can determine whether a file is a directory or a regular file (and I assume it can), and since it intercepts all read() calls anyway, why isn't read() on a directory implemented as reading linux_dirent entries?

A read on a directory is not intercepted and converted into getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.

You do open(2) for both files and dirs [ opendir(3) is a wrapper that does open(2) underneath]. You can read/write/seek on files and seek/getdents on dirs.

But ... a read on a directory does return EISDIR. [Note: I had forgotten this in my original comments.] In the simple "flat data" model that read provides, there is no way to convey or control everything that getdents can and does.

So, instead of allowing a flakier way to get partial or incorrect information, it is easier on both the kernel and the application developer to go through the getdents interface.

In addition, getdents does things atomically. While your program is reading directory entries, other programs may be creating and deleting files in that directory, or renaming them — right in the middle of your getdents sequence.

getdents presents an atomic view: a file either exists or it doesn't; it has been renamed or it hasn't. You do not get a "partially modified" view, no matter how much "churn" is going on around you. When you ask getdents for 20 entries, you get them [or 10, if that is all there are].

Side note: a useful trick is to "overshoot" the count. That is, tell getdents that you want 50,000 entries [you must provide the space]. You usually get back about 100 or so. But now you have an atomic snapshot in time of the complete directory. I sometimes do this instead of looping with a count of 1 — YMMV. You still have to protect against a file disappearing immediately afterwards, but at least you can detect it (i.e. a subsequent open of the file fails).
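
A sketch of that trick, reusing the struct linux_dirent declaration and raw syscall from the earlier example (the 1 MiB buffer is an arbitrary "big enough" guess, not a guarantee):

    #include <stdlib.h>

    /* One oversized getdents call snapshots the whole directory at once. */
    static void snapshot_dir(int fd)
    {
        size_t bufsize = 1024 * 1024;   /* far more than most directories need */
        char *big = malloc(bufsize);
        long nread;

        if (big == NULL)
            return;

        nread = syscall(SYS_getdents, fd, big, bufsize);
        for (long pos = 0; nread > 0 && pos < nread; ) {
            struct linux_dirent *d = (struct linux_dirent *)(big + pos);
            /* d existed at the time of the call; a later open(2) of
             * d->d_name must still be prepared for ENOENT. */
            pos += d->d_reclen;
        }
        free(big);
    }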

So, you always get "whole" records — possibly including a record for a file that has just been deleted. That does not mean the file still exists, only that it existed at the time of the getdents . Another process can erase it an instant later, but not in the middle of the getdents .

If read(2) were allowed, you would have to guess how much data to read and could not know which records came back fully formed and which in a partial state. If the FS had a Type B organization as above, a single read could not atomically get both the static part and the variable part in one step.

It would also be philosophically wrong to slow down read(2) to do what getdents does.

The getdents , unlink , creat , rmdir , and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about" each other.

If pgmA renames "x" to "z" and pgmB renames "y" to "z", they do not collide. One goes first and the other second, but no FS blocks are ever lost or leaked. getdents gets a whole view (whether that is "x y", "y z", "x z", or just "z"), but it will never see "x y z" all at once.


In K&R (in fact, in Unix before SVr2 at least, possibly SVr3), directory entries were 16 bytes, using 2 bytes for the inode number and 14 for the file name.

Using read made sense because the directory entries on disk were all the same size. 16 bytes (a power of 2) also makes sense, since no hardware multiplication is needed to compute offsets. (I recall someone telling me around 1978 that the Unix disk driver used floating point and was slow ... but that is second-hand, although amusing.)
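
Roughly, the old layout was the struct direct of early Unix's <sys/dir.h> (reproduced here from memory, so the details are approximate), and reading it looked much like the K&R loop above:

    #include <unistd.h>

    #define DIRSIZ 14

    /* Approximation of the fixed-size entry of early Unix directories:
     * 2 bytes of inode number plus 14 bytes of name, 16 bytes in all. */
    struct old_direct {
        unsigned short d_ino;           /* 0 marks an unused (deleted) slot */
        char           d_name[DIRSIZ];  /* NUL-padded, not necessarily terminated */
    };

    /* K&R-era scan of an already-opened directory fd. This worked because
     * every entry had the same size; on modern Linux the read() itself
     * fails with EISDIR, as discussed above. */
    static void scan_dir(int fd)
    {
        struct old_direct dirbuf;

        while (read(fd, &dirbuf, sizeof(dirbuf)) == sizeof(dirbuf)) {
            if (dirbuf.d_ino == 0)
                continue;               /* deleted entry, slot kept for reuse */
            /* ... use dirbuf.d_name, at most DIRSIZ characters ... */
        }
    }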

Later, improvements to directories allowed longer names, which meant the entry sizes varied (it makes no sense to make every entry as huge as the largest possible name). A newer interface, readdir , was introduced.

Linux provides a lower-level interface. According to its man page:

These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.

As your example shows, getdents is a system call useful for implementing readdir . The method for implementing readdir is unspecified. There is no particular reason the early readdir (from roughly 30 years ago) could not have been implemented as a library function using read , malloc , and similar functions to manage the long file names read from a directory.

Moving the functionality into the kernel was (probably) done in this case to improve performance. Since getdents reads multiple directory entries at a time (unlike readdir ), it can reduce the overhead of reading all the entries of a small directory (by reducing the number of system calls).
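
To see why, consider a readdir-style wrapper built on top of getdents (a sketch, not glibc's actual implementation; it reuses the struct linux_dirent declaration and raw syscall from the first example): one system call refills the buffer, and subsequent entries are handed out from memory.

    struct my_dir {
        char buf[4096];  /* refilled by one getdents call at a time */
        long fill;       /* bytes currently held in buf */
        long pos;        /* read position within buf */
        int  fd;
    };

    /* Return the next entry, refilling the buffer with one getdents call
     * only when it has been exhausted. */
    static struct linux_dirent *my_readdir(struct my_dir *d)
    {
        if (d->pos >= d->fill) {
            d->fill = syscall(SYS_getdents, d->fd, d->buf, sizeof(d->buf));
            d->pos = 0;
            if (d->fill <= 0)
                return NULL;           /* end of directory, or an error */
        }
        struct linux_dirent *ent = (struct linux_dirent *)(d->buf + d->pos);
        d->pos += ent->d_reclen;
        return ent;
    }

With a 4096-byte buffer, a directory of a few dozen entries typically costs one or two system calls instead of one per entry.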

