Create file header (file metadata) in C

The file header contains all the data about the file - the metadata. I want to create an empty file with metadata, then add another file to this empty file and update the metadata. Is there a library in C for creating a file header? How do I read and write a file header in C?

metadata = { file_name; file_size; file_type; file_name_size; total_files; } 
+4
3 answers

There are probably several libraries that handle specific file formats, such as the variants of tar, but none that will be tailored to your particular header format.

You will need to decide, first, whether your metadata is fixed or variable in size.

If it is a fixed size, it is relatively easy to skip that many bytes at the beginning, write the rest of the file, and then rewind and fill in the metadata. If the sizes of the variable-size parts are all known at the beginning, you can handle it the same way: write a first version of the header, then come back when you are done and write the final version.
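The skip, write, rewind, fill sequence can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name write_with_header, the 16-byte HEADER_SIZE, and the choice to store only the body length as ASCII digits are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

#define HEADER_SIZE 16  /* assumed fixed header size, in bytes */

/* Writes `body` to `path` after a fixed-size header that is filled in last.
 * Returns 0 on success, -1 on failure. */
int write_with_header(const char *path, const char *body)
{
    char header[HEADER_SIZE] = {0};
    size_t body_len = strlen(body);
    FILE *fp = fopen(path, "wb");

    if (fp == NULL)
        return -1;

    /* 1. Skip the header area by writing a placeholder of zero bytes. */
    if (fwrite(header, 1, HEADER_SIZE, fp) != HEADER_SIZE)
        goto fail;

    /* 2. Write the rest of the file. */
    if (fwrite(body, 1, body_len, fp) != body_len)
        goto fail;

    /* 3. Rewind and fill in the metadata: here just the body length,
     *    as 15 zero-padded ASCII digits plus a terminating NUL. */
    snprintf(header, sizeof header, "%015zu", body_len);
    if (fseek(fp, 0L, SEEK_SET) != 0)
        goto fail;
    if (fwrite(header, 1, HEADER_SIZE, fp) != HEADER_SIZE)
        goto fail;

    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

Because the header size never changes, overwriting it in place does not disturb the data that follows it.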

If you do not know the size of the variable-size material until the end, it is harder. You will probably end up writing the main part of the file to a temporary file; then, when everything is ready and you know all the variable-size metadata, you write the metadata header to a new (final) file and copy the temporary file after it.

Note that you must put the size (length) of the file name in front of the actual file name in the data on disk. Then you can read how big the name is and allocate the right space and read the correct amount of data. Placing the length of the file name after the file name itself does not really help.
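A length-prefixed name could be written and read back roughly like this. The function names write_name and read_name, and the choice of a 2-byte big-endian length, are assumptions for the sketch and not part of any existing library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes a 2-byte big-endian length followed by the name bytes (no NUL).
 * Returns 0 on success, -1 on failure or if the name is too long. */
int write_name(FILE *fp, const char *name)
{
    size_t len = strlen(name);
    unsigned char prefix[2];

    if (len > 0xFFFF)
        return -1;
    prefix[0] = (unsigned char)(len >> 8);
    prefix[1] = (unsigned char)(len & 0xFF);
    if (fwrite(prefix, 1, 2, fp) != 2 || fwrite(name, 1, len, fp) != len)
        return -1;
    return 0;
}

/* Reads a name written by write_name. Returns a malloc'd, NUL-terminated
 * string that the caller must free, or NULL on failure. */
char *read_name(FILE *fp)
{
    unsigned char prefix[2];
    size_t len;
    char *name;

    /* The length comes first, so we know how much to allocate and read. */
    if (fread(prefix, 1, 2, fp) != 2)
        return NULL;
    len = ((size_t)prefix[0] << 8) | prefix[1];

    name = malloc(len + 1);
    if (name == NULL)
        return NULL;
    if (fread(name, 1, len, fp) != len) {
        free(name);
        return NULL;
    }
    name[len] = '\0';
    return name;
}
```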

You also need to consider whether your header will be binary data or text. The file name component will be text, but the numbers could be 2-byte or 4-byte binary values, or the equivalent lengths written in ASCII (plain text). Text representations are usually easier to debug, but you will most likely end up with variable-length data if you use text. However, you can always use a fixed size with blank padding. Another advantage of text over binary is that text ports across machine architectures, whereas binary raises questions about big-endian and little-endian machines, and so on.

You should also consider using a "magic number" so that you can tell that the file contains the expected type of data. The "number" may be an ASCII string, such as !<arch>\n, used in some versions of ar headers, or %PDF-1.3\n, used at the beginning of a PDF file. That said, tar pretty much leaves the first bytes without a magic number, but these days that is an unusual design. The file program knows a lot about magic numbers. Its data can sometimes be found on the file system, for example in files under /usr/share/file on Mac OS X.
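Checking a magic number takes only a few lines. In this sketch the 8-byte string "MYARCH1\n" and the function name has_magic are hypothetical, chosen only to illustrate the idea.

```c
#include <stdio.h>
#include <string.h>

#define MAGIC     "MYARCH1\n"  /* hypothetical magic string */
#define MAGIC_LEN 8            /* number of bytes compared, NUL excluded */

/* Returns 1 if the stream starts with MAGIC, 0 otherwise
 * (including when the file is too short or cannot be read). */
int has_magic(FILE *fp)
{
    char buf[MAGIC_LEN];

    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    if (fread(buf, 1, MAGIC_LEN, fp) != MAGIC_LEN)
        return 0;
    return memcmp(buf, MAGIC, MAGIC_LEN) == 0;
}
```

A reader would call has_magic right after opening the file and refuse to go further if it returns 0.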


Can you explain with an example?

One file format that I have in mind is for messages identified by a 32-bit (signed) number, with variable-length messages and, therefore, variable offsets. The file is written in a platform-neutral but binary format. The numbers are written big-endian, MSB first. Message numbers are currently limited to ±99,999 (so the total number of messages available in the system as a whole is less than 200,000).

The file header contains:

  • 2-byte (unsigned) magic number
  • 2-byte (unsigned) counter of the number of messages contained in the file, N

It is followed by N entries, each of which describes a message:

  • 4-byte (signed) message number
  • 2-byte (unsigned) message length
  • 4-byte (unsigned) offset to start of message

The N records are in sorted order of message number, but there is no requirement that message numbers be contiguous. Missing numbers are simply missing.

After the N records come the texts of the actual messages, each consisting of the number of bytes given in its record, followed by an ASCII NUL byte '\0'.

As the file is created, the text of each message is written to an intermediate file in the order it is processed, and the offset of the message within that file is recorded. It does not matter whether the messages are read or written in order; all that matters is that the offset from the end of the header is recorded in the header record. After all messages have been read, the in-memory copy of the records can be sorted into numerical order and the final file written: first the magic number and the number of messages, then the N entries describing the messages, followed by the message texts copied from the intermediate file.

Reading message number M is quite simple. You do a binary search through the N entries to find the entry for M. If it is not there, so be it; that is an error. If it is, you know where to find the message in the file and how long it is.
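The lookup might be sketched like this, assuming the N entries have already been read from the file and decoded into an array of native integers. The struct msg_entry and the function find_message are illustrative names, not part of the original format description.

```c
#include <stddef.h>
#include <stdint.h>

/* One index record, already decoded into native integers. */
struct msg_entry {
    int32_t  number;  /* message number */
    uint16_t length;  /* message length in bytes */
    uint32_t offset;  /* offset to the start of the message text */
};

/* Returns the entry for message `m`, or NULL if it is absent.
 * `entries` must be sorted by ascending message number. */
const struct msg_entry *find_message(const struct msg_entry *entries,
                                     size_t n, int32_t m)
{
    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (entries[mid].number == m)
            return &entries[mid];
        if (entries[mid].number < m)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

Once an entry is found, its offset and length say where to seek in the message text area and how many bytes to read.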

The fact that the data is in a fixed binary format does not really complicate things. You use the same functions on big-endian and little-endian machines to read the numbers into native format. In theory you could optimize for a big-endian machine, but only if the machine has no problems with misaligned data. It is easier to forget that the optimization is possible and just use the same code everywhere.
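Byte-order-independent helpers of the kind meant here can be sketched as follows. The function names and the 10-byte record layout (4-byte number, 2-byte length, 4-byte offset) follow the description above, but the code itself is an illustration, not the original implementation.

```c
#include <stdint.h>

#define RECORD_SIZE 10  /* 4-byte number + 2-byte length + 4-byte offset */

/* Store and load big-endian values one byte at a time, so the same code
 * works on any host byte order and has no alignment requirements. */
void put_be16(unsigned char *buf, uint16_t v)
{
    buf[0] = (unsigned char)(v >> 8);
    buf[1] = (unsigned char)(v & 0xFF);
}

void put_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)((v >> 16) & 0xFF);
    buf[2] = (unsigned char)((v >> 8) & 0xFF);
    buf[3] = (unsigned char)(v & 0xFF);
}

uint16_t get_be16(const unsigned char *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}

uint32_t get_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

/* Encodes one index record into RECORD_SIZE bytes. */
void encode_record(unsigned char out[RECORD_SIZE],
                   int32_t number, uint16_t length, uint32_t offset)
{
    put_be32(out, (uint32_t)number);  /* two's complement bit pattern */
    put_be16(out + 4, length);
    put_be32(out + 6, offset);
}

/* Decodes the signed message number from a record. The conversion back
 * to int32_t assumes a two's complement host, which is the usual case. */
int32_t decode_number(const unsigned char *rec)
{
    return (int32_t)get_be32(rec);
}
```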


If the format described above were converted to a text format, it would probably have 8 bytes (say) reserved for the magic number (which could well be a 7-letter string followed by a newline) and 6 bytes reserved for the number of messages (5 digits plus a newline). Each message record could reserve 6 bytes for the message number (±99,999), plus a space, plus 4 bytes for the length (maximum 8 KiB), plus a space, plus 8 bytes for the offset (7 digits plus a newline).

 MAGICNO
 12345
 -99999 8000 0000000
 -90210   38 0008000
 ...
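The fixed-width text records described above can be produced and parsed with the standard formatted I/O functions. This is a sketch under the stated field widths (6 + 1 + 4 + 1 + 7 + 1 = 20 bytes per record); the function names format_record and parse_record are made up for the example.

```c
#include <stdio.h>

/* Formats one 20-byte record: 6-character message number, space,
 * 4-character length, space, 7-digit zero-padded offset, newline.
 * Returns the number of characters snprintf reports. */
int format_record(char *out, size_t size,
                  long number, unsigned length, unsigned long offset)
{
    return snprintf(out, size, "%6ld %4u %07lu\n", number, length, offset);
}

/* Parses a record produced by format_record.
 * Returns 1 if all three fields were read, 0 otherwise. */
int parse_record(const char *line,
                 long *number, unsigned *length, unsigned long *offset)
{
    return sscanf(line, "%6ld %4u %7lu", number, length, offset) == 3;
}
```

Since every record has the same width, the k-th record still starts at a computable offset, so a binary search over the records remains possible in the text format.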

Again, the advantage of a text file is readability: you can look at the file and easily see what the data means.

You can have endless variations of this theme.

+4

The easiest way is to define the metadata as a structure and then write and read it as such. It gets a little tricky if you support multiple versions, but it is possible; just keep a version tag of some kind at the beginning. The structure looks something like this (the field types are not given in the question, so the ones here are assumptions):

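 struct metadata {
     char     file_name[256];  /* assumed maximum name length */
     uint32_t file_size;       /* uint32_t needs <stdint.h> */
     uint32_t file_type;
     uint32_t file_name_size;
     uint32_t total_files;
 } data;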

You can then save it with the fwrite function and read it back with the fread function, after opening the file first:

 fwrite(&data, sizeof data, 1, FILE_POINTER);  /* pass the address of the struct */
 fread(&data, sizeof data, 1, FILE_POINTER);
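A slightly fuller sketch of the structure approach, with the version tag at the beginning and the return values checked, might look like this. The field types, the 256-byte name buffer, METADATA_VERSION and the function names are all assumptions. Note also that writing a raw struct ties the file to one compiler and architecture, because padding and byte order are not portable.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define METADATA_VERSION 1u  /* hypothetical version tag */

struct metadata {
    uint32_t version;         /* kept first so a reader can check it */
    char     file_name[256];  /* assumed maximum name length */
    uint32_t file_name_size;
    uint32_t file_size;
    uint32_t file_type;
    uint32_t total_files;
};

/* Writes the struct as one raw block. Returns 0 on success, -1 on error. */
int save_metadata(FILE *fp, const struct metadata *md)
{
    return fwrite(md, sizeof *md, 1, fp) == 1 ? 0 : -1;
}

/* Reads the struct back and rejects unknown versions.
 * Returns 0 on success, -1 on a short read or version mismatch. */
int load_metadata(FILE *fp, struct metadata *md)
{
    if (fread(md, sizeof *md, 1, fp) != 1)
        return -1;
    return md->version == METADATA_VERSION ? 0 : -1;
}
```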
+1

int stat(const char *path, struct stat *buf); can provide you with the following file information

 struct stat {
     dev_t     st_dev;     /* ID of device containing file */
     ino_t     st_ino;     /* inode number */
     mode_t    st_mode;    /* protection */
     nlink_t   st_nlink;   /* number of hard links */
     uid_t     st_uid;     /* user ID of owner */
     gid_t     st_gid;     /* group ID of owner */
     dev_t     st_rdev;    /* device ID (if special file) */
     off_t     st_size;    /* total size, in bytes */
     blksize_t st_blksize; /* blocksize for file system I/O */
     blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
     time_t    st_atime;   /* time of last access */
     time_t    st_mtime;   /* time of last modification */
     time_t    st_ctime;   /* time of last status change */
 };

If by file type you mean the Windows file extension, you can get it by reading the file name.
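As a rough sketch, the size can be taken from st_size and the extension from the text after the last dot in the name. The helper names file_size and file_extension are invented for this example, stat is the POSIX call, and only '/' is treated as a path separator here.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns the file size in bytes, or -1 if stat() fails. */
long long file_size(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;
    return (long long)sb.st_size;
}

/* Returns a pointer to the extension (the text after the last '.' in the
 * final path component), or "" if there is none. Only '/' is treated as
 * a directory separator in this sketch. */
const char *file_extension(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    const char *dot = strrchr(base, '.');

    if (dot == NULL || dot == base)  /* no dot, or a dotfile like ".profile" */
        return "";
    return dot + 1;
}
```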

+1

Source: https://habr.com/ru/post/1400521/

