C: create archive file header

I am writing a file archiver / extractor (like tar) in C using the POSIX system calls. I have written part of the archiving side.

I would like to know if someone can help me with some C source code (using the above) to create a file header for the archive (where the header acts as an index) that describes file attributes / metadata (name, date and time, etc.). All I have worked out so far (and I am not sure this is even correct) is that creating the header needs a structure for storing the metadata, and that lseek is needed to find the beginning / end of each file, something like:

FileName=file.txt FileSize=0
FileDir=./L/L
FilePerms=000
\n\n

The archiving part of the program goes through these steps:

  1. Get a list of all files from the command line. (I can do this part)
  2. Create a structure to store metadata about each file: name (255 chars), size (64-bit int), date and time, and permissions.
  3. For each file, get its stat information.
  4. Save the stat information of each file in an array of structures.
  5. Open the archive for writing. (I can do this part)
  6. Write the header structure.
  7. For each file, add its contents to the archive file (at the end / beginning of each file).
  8. Close the archive file. (I can do this part)

I am having difficulty creating the header as a whole. I know what it needs to do, as laid out in the numbered steps above, but I cannot work out how to do steps 2, 3, 4, 6 and 7.

Any help would be appreciated. Thanks.

+4
2 answers

As ijw notes, there are several ways to create an archive file header. If cross-platform portability is going to be an issue at all, or even if you only need to switch between 32-bit and 64-bit builds of the software on the same platform, then you need to make sure that the sizes and layouts of the fields are understood identically on all platforms.

Metadata for each file

One way to do this is to use a fixed-format binary header with types of known sizes and endianness. This is what ijw suggests. However, you will need to handle long file names, so you will need to store a length (probably in a 2-byte unsigned integer) and then follow it with the actual path name.

An alternative, and generally the currently preferred technique, is to use printable fields (often called ASCII format, although that is something of a misnomer). The time is written as a decimal number of seconds since the Epoch, converted to a string, and so on. This is what modern ar archives use; it is what GNU tar does (more or less; there are some historical quirks that make it messier); it is what cpio -c does (and that is usually the default these days). Fields can be separated by nulls or spaces. There is an easy way to detect the end of the header: the header contains information about the file name (not necessarily in the form you would like or expect, but again, that is usually because the format has evolved over the years), and then the actual data follows. One way or another, you know the size of each field and of the file that the header describes, so you can read the data reliably.

Efficiency is a red herring. Converting to / from a text format is so fast compared with accessing the disk in the first place that there is essentially no measurable performance issue. And the guaranteed portability usually far outweighs the (microscopic) benefit of using a binary data format, doubly so when the binary data has to be converted on input or output anyway to turn it into an architecture-neutral format.

Central index versus distributed index

Another issue to consider is whether the index of files in the archive is centralized (at the front or the end) or distributed (the metadata for each file immediately precedes the data for that file). Each format has some advantages. As a rule, systems use the distributed version, because you can write the information for each file without knowing how many files are to be processed in total (for example, because you are recursively archiving the contents of a directory). Having a central index up front means that you can list the files without reading the entire archive; distributed metadata means you must read the entire archive. However, a central index makes creating the archive harder.

Note that even with a distributed index, you usually need a header for the archive as a whole, so you can detect that the file is in the expected format. As a rule, there is some sort of marker ( !<arch>\n for an ar archive, usually; %PDF-1.2\n at the beginning of a PDF file, etc.) to assure you that the file contains what you expect. There may be some overall (archive-level) metadata. Then you will have the first file's metadata, followed by that file's data, repeating until the end of the archive (which may or may not have a formal end marker, which is yet more metadata).
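As a rough illustration of the archive-level marker described above, here is a minimal sketch. The marker string `!<myar>\n` and the function names `write_magic` and `check_magic` are made up for this example; they are not part of any existing format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical 8-byte marker for this sketch; any distinctive string works. */
#define ARCHIVE_MAGIC     "!<myar>\n"
#define ARCHIVE_MAGIC_LEN 8

/* Write the marker at the current position (the start of a new archive).
 * Returns 0 on success, -1 on a short write. */
int write_magic(FILE *fp)
{
    size_t n = fwrite(ARCHIVE_MAGIC, 1, ARCHIVE_MAGIC_LEN, fp);
    return n == ARCHIVE_MAGIC_LEN ? 0 : -1;
}

/* Read the first bytes of an archive and compare them with the marker.
 * Returns 1 if they match, 0 if they do not or the file is too short. */
int check_magic(FILE *fp)
{
    char buf[ARCHIVE_MAGIC_LEN];

    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return 0;
    return memcmp(buf, ARCHIVE_MAGIC, ARCHIVE_MAGIC_LEN) == 0;
}
```

An extractor would call `check_magic` immediately after opening the archive and refuse to continue if it returns 0.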


[H]ow would I start to implement it with the fixed-format binary header that you suggested? I am having problems with which commands / functions are needed.

I recommended that you not go with a fixed-format binary header; you should use a text header format. If you can work out how to do the binary format, be my guest (I have done it many times over the years, which does not mean that I think it is a good idea).

So, some pointers here to the text header format.

For the file metadata, you might decide to include:

  • size
  • mode (permissions, type)
  • owner
  • group
  • modification time
  • name length
  • name

You can reasonably decide that your file sizes are limited to unsigned 64-bit integers, which means 20 decimal digits. The mode can be printed as a 16-bit octal number, requiring 6 octal digits. The owner and group can be printed as UID and GID values (not names), in which case you can use 10 digits for each. Alternatively, you can use names, but then you should allow for names of up to 32 characters. Note that names are generally more portable than numbers. Neither the name nor the number matters much on the receiving machine unless you extract the data as root (but why would you want to do that?). The modification time is classically a 32-bit signed integer representing the number of seconds since the Epoch (1970-01-01 00:00:00Z). You should allow for the Y2038 problem by letting the number of seconds grow beyond what fits in 32 bits; you might decide that 12 digits will get you past the Y10K crisis (4 times over), and that this is good enough; you might also allow for fractional seconds. Together, this suggests that 26 character positions for the time stamp should be overkill. You might decide that each field will be separated from the next by a space (for readability - think "ease of debugging"!). You can reasonably decide that the length of any file name can be written in 4 decimal digits.

You need to know how to format the types portably - #include <inttypes.h> is your friend.

Then you create a format string for printing (writing) file metadata and a parallel string for scanning (reading) file metadata.

Printing:

 "%20" PRIu64 " %06o %-32.32s %-32.32s %26" PRIu64 " %-4d %s\n" 

This prints the name as well, and ends the header with a newline. The total size is 127 bytes plus the length of the file name. This is probably excessive, but you can adjust the numbers to suit yourself.
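To make the printing side concrete, here is a minimal sketch built around that kind of format string. The struct `file_meta` and the function `format_header` are names invented for this example. It also assumes the owner and group fields are padded to exactly 32 characters (`%-32.32s`), so that the part before the name is always 126 bytes, and that names are at most 9999 bytes long.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory metadata record for this sketch. */
struct file_meta {
    uint64_t size;      /* file size in bytes */
    unsigned mode;      /* permission and type bits */
    char owner[33];     /* owner name, at most 32 characters */
    char group[33];     /* group name, at most 32 characters */
    uint64_t mtime;     /* modification time, seconds since the Epoch */
    const char *name;   /* path name, assumed to be at most 9999 bytes */
};

/* Format one text header into buf.
 * Returns the header length in bytes, or -1 if it does not fit. */
int format_header(char *buf, size_t buflen, const struct file_meta *m)
{
    size_t namelen = strlen(m->name);
    int n;

    if (namelen > 9999)     /* the length field is only 4 digits wide */
        return -1;
    n = snprintf(buf, buflen,
                 "%20" PRIu64 " %06o %-32.32s %-32.32s %26" PRIu64 " %-4d %s\n",
                 m->size, m->mode & 0177777u, m->owner, m->group,
                 m->mtime, (int)namelen, m->name);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

The caller would then write the returned number of bytes to the archive, followed by the file's contents.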

Scanning:

 "%" SCNu64 " %o %32s %32s %" SCNu64 " %d" 

This does not scan the name; you need to write the scanner for the name carefully, not least because you need to be able to read spaces in the name. In fact, the code for scanning the user name and group name also assumes that they contain no spaces. If that is unacceptable (that is, the names may contain spaces), you need a more complex scan format, or something other than sscanf(), to process the input.

I have assumed a 64-bit integer for the time field rather than messing with fractional seconds, even though there is enough space in the field to allow them. You could probably save some space there.
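One possible way to handle the name, sketched under the assumption that the header was written with fixed-width, space-padded fields so that the name always starts at byte offset 126: scan the fixed part with sscanf(), then copy exactly "name length" bytes, spaces included. The names `parsed_meta`, `parse_header` and `FIXED_LEN` are invented for this example, and owner and group names are still assumed to contain no spaces.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FIXED_LEN 126   /* bytes before the name, assuming padded fields */

/* Hypothetical result record for this sketch. */
struct parsed_meta {
    uint64_t size;
    unsigned mode;
    char owner[33];
    char group[33];
    uint64_t mtime;
    int namelen;
    char name[10000];   /* up to 9999 bytes plus the terminator */
};

/* Parse one header held in hdr (hdrlen bytes).
 * Returns 0 on success, -1 if the header is malformed or truncated. */
int parse_header(const char *hdr, size_t hdrlen, struct parsed_meta *m)
{
    char fixed[FIXED_LEN + 1];

    if (hdrlen < FIXED_LEN + 1)
        return -1;
    memcpy(fixed, hdr, FIXED_LEN);      /* isolate the fixed-width part */
    fixed[FIXED_LEN] = '\0';
    if (sscanf(fixed, "%" SCNu64 " %o %32s %32s %" SCNu64 " %d",
               &m->size, &m->mode, m->owner, m->group,
               &m->mtime, &m->namelen) != 6)
        return -1;
    if (m->namelen < 0 || m->namelen > 9999)
        return -1;
    if (hdrlen < (size_t)FIXED_LEN + (size_t)m->namelen + 1)
        return -1;
    if (hdr[FIXED_LEN + m->namelen] != '\n')    /* header must end here */
        return -1;
    memcpy(m->name, hdr + FIXED_LEN, (size_t)m->namelen);
    m->name[m->namelen] = '\0';
    return 0;
}
```

Because the name is copied by length rather than scanned with `%s`, embedded spaces in the path survive the round trip.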

+9

You can get the information for each file with the stat() system call.

There are two solutions for writing the header.
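For instance, a minimal sketch of collecting the metadata with stat() might look like the following. The struct `file_info` and the function `get_file_info` are names invented for this example; the field choices follow the question (name up to 255 characters, 64-bit size, time and permissions).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical metadata record for this sketch. */
struct file_info {
    char name[256];     /* path as given, at most 255 characters */
    uint64_t size;      /* size in bytes */
    int64_t mtime;      /* modification time, seconds since the Epoch */
    unsigned mode;      /* st_mode: file type and permission bits */
};

/* Fill *info from stat(path). Returns 0 on success, -1 on failure. */
int get_file_info(const char *path, struct file_info *info)
{
    struct stat sb;

    if (strlen(path) >= sizeof info->name)
        return -1;                      /* name would not fit */
    if (stat(path, &sb) != 0)
        return -1;                      /* errno says why */
    strcpy(info->name, path);
    info->size = (uint64_t)sb.st_size;
    info->mtime = (int64_t)sb.st_mtime;
    info->mode = (unsigned)sb.st_mode;
    return 0;
}
```

Calling this once per command-line argument and storing the results in an array of `struct file_info` would cover the "get its stat information" and "save in an array of structures" steps of the question.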

Trivial but evil:

 struct file_header { ... data you want to put in } fhdr;
 fwrite(&fhdr, sizeof(fhdr), 1, file);

This is evil because the packing (padding) of the structure varies from machine to machine, as do the byte order and the sizes of basic types such as int. A file written by your program may not be readable by your program when it is compiled on another machine, or in some cases even with another compiler on the same machine.

Non-trivial but safe:

 char name[xxx];
 uint32_t length; /* Fixed byte length across architectures */
 ...
 fwrite(name, sizeof(name), 1, file);
 length = htonl(length); /* Or something else that converts the length to a known endianness */
 fwrite(&length, sizeof(length), 1, file);

Personally, I am not a fan of htonl() and friends. I prefer to write something that converts a uint32_t to a uchar[4] using shift operators (which is trivial to write), because C does not even guarantee the in-memory format of an integer. In practice, you will be hard-pressed to find a platform where uint32_t is not stored as 4 bytes of 8 bits, but it is something to bear in mind.

The variables listed above can be members of a struct in your program. Reversing the process to read the data back is left as an exercise for the reader.
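A minimal sketch of the shift-operator approach follows. It assumes big-endian (most significant byte first) order in the file, which is the same order htonl() produces; the function names `put_u32_be`, `get_u32_be` and `write_u32` are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Store v into out[0..3], most significant byte first. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the value from 4 bytes, most significant byte first. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
            (uint32_t)in[3];
}

/* Write v to fp as 4 bytes. Returns 0 on success, -1 on a short write. */
int write_u32(FILE *fp, uint32_t v)
{
    unsigned char b[4];

    put_u32_be(b, v);
    return fwrite(b, 1, sizeof b, fp) == sizeof b ? 0 : -1;
}
```

Since the shifts operate on values rather than on memory, the bytes in the file come out the same regardless of the byte order of the machine that runs the code.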

+4

Source: https://habr.com/ru/post/1333990/

