Parsing a binary file. What is a modern way?

I have a binary file with a layout I know. For example, let the format be like this:

  • 2 bytes (unsigned short) - string length
  • 5 bytes (5 x characters) - string - some id name
  • 4 bytes (unsigned int) - step
  • 24 bytes (6 x float - 2 steps of 3 floats) - floating point data

The file should look like this (I added spaces for readability):

5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5 

Here 5 is 2 bytes: 0x05 0x00. hello - 5 bytes, etc.
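For concreteness, a minimal sketch that produces such a file (the name data.bin is hypothetical; it assumes a little-endian host with 2-byte shorts and 4-byte ints/floats, so that 5 comes out as 0x05 0x00 as above):

 #include <cstdint>
 #include <fstream>

 int main() {
     std::ofstream out("data.bin", std::ios::binary);
     std::uint16_t len = 5;
     out.write(reinterpret_cast<const char*>(&len), sizeof len);
     out.write("hello", 5);
     std::uint32_t step = 3;
     out.write(reinterpret_cast<const char*>(&step), sizeof step);
     const float floats[6] = { 0.0f, 0.1f, 0.2f, -0.3f, -0.4f, -0.5f };
     out.write(reinterpret_cast<const char*>(floats), sizeof floats);
 }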

Now I want to read this file. I am currently doing it like this:

  • load the file into an ifstream
  • read from this stream into char buffer[2]
  • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) }; . Now I have the string length.
  • read from the stream into a vector<char> and create a std::string from this vector. Now I have the id.
  • read the next 4 bytes the same way and cast them to unsigned int. Now I have the step.
  • while not at the end of the file, read floats the same way: create a char bufferFloat[4] and cast *((float*)bufferFloat) for each float (a sketch of this approach follows the list).
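A minimal sketch of what I do now (same assumptions as above: little-endian host, 2-byte shorts, 4-byte ints and floats):

 #include <fstream>
 #include <string>
 #include <vector>

 int main() {
     std::ifstream file("data.bin", std::ios::binary);

     char buffer[2];
     file.read(buffer, 2);
     unsigned short len{ *((unsigned short*)buffer) }; // the old-style cast in question

     std::vector<char> id(len);
     file.read(id.data(), len);
     std::string name(id.begin(), id.end());

     char buffer4[4];
     file.read(buffer4, 4);
     unsigned int step{ *((unsigned int*)buffer4) };

     std::vector<float> data;
     char bufferFloat[4];
     while (file.read(bufferFloat, 4))
         data.push_back(*((float*)bufferFloat));
 }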

It works, but to me it looks ugly. Is it possible to read directly into an unsigned short, float, string, etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I use is old-style)?

PS: while I was writing the question, a clearer formulation came to mind: how do I cast an arbitrary number of bytes at an arbitrary position in a char[x]?

Update: I forgot to say explicitly that the string length and the number of floats are unknown at compile time and variable.

+45
c++ casting binary
Nov 10 '14 at 2:00
10 answers

A C way that would also work fine in C++ is to declare a struct:

 #pragma pack(1)
 struct contents {
     // data members;
 };

note that

  • You need to use a pragma to make the compiler lay the data out in the struct exactly as it looks in this definition;
  • This method only works with POD types.

And then simply cast the read buffer directly into the struct type:

 std::vector<char> buf(sizeof(contents));
 file.read(buf.data(), buf.size());
 contents *stuff = reinterpret_cast<contents *>(buf.data());

Now, if your data has variable size, you can split it into several chunks. To read a single binary object from a buffer, a helper function is convenient:

 template<typename T>
 const char *read_object(const char *buffer, T& target) {
     target = *reinterpret_cast<const T*>(buffer);
     return buffer + sizeof(T);
 }

The main advantage is that such a reader can be specialized for more advanced C++ objects:

 template<typename CT>
 const char *read_object(const char *buffer, std::vector<CT>& target) {
     size_t size = target.size();
     CT const *buf_start = reinterpret_cast<const CT*>(buffer);
     std::copy(buf_start, buf_start + size, target.begin());
     return buffer + size * sizeof(CT);
 }

And now in your main parser:

 int n_floats;
 iter = read_object(iter, n_floats);
 std::vector<float> my_floats(n_floats);
 iter = read_object(iter, my_floats);
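Tying it together for the question's full format, the parse might look like this (a sketch; buffer is assumed to point at the entire file already loaded into memory, with endianness matching the host):

 const char* iter = buffer;

 uint16_t len;
 iter = read_object(iter, len);

 std::vector<char> id(len);          // uses the vector overload above
 iter = read_object(iter, id);
 std::string name(id.begin(), id.end());

 uint32_t step;
 iter = read_object(iter, step);

 std::vector<float> my_floats(6);    // 2 steps of 3 floats in the example
 iter = read_object(iter, my_floats);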

Note: as Tony D observed, even if you can get the alignment right with #pragma and manual padding (if necessary), you can still run into incompatibility with your processor's alignment, in the form of (best case) performance problems or (worst case) trap signals. This method is probably only interesting if you have control over the file format.

+9
Nov 10 '14 at 2:04

If this is not for educational purposes, and if you have the freedom to choose the binary format, you should consider using something like protobuf , which will handle the serialization for you and allow you to interoperate with other platforms and languages.

If you cannot use a third-party API, you can look at QDataStream for inspiration.
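For illustration, a minimal sketch of how the question's record could be read with QDataStream (assuming Qt is available; the file name is hypothetical):

 #include <QFile>
 #include <QDataStream>

 QFile file("data.bin");
 if (file.open(QIODevice::ReadOnly)) {
     QDataStream in(&file);
     in.setByteOrder(QDataStream::LittleEndian);
     in.setFloatingPointPrecision(QDataStream::SinglePrecision);

     quint16 len;
     in >> len;

     QByteArray id(len, '\0');
     in.readRawData(id.data(), len); // raw bytes, not QDataStream's own string format

     quint32 step;
     in >> step;

     float f;
     for (int i = 0; i < 6; ++i)
         in >> f;
 }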

+13
Nov 10 '14 at

At the moment I am doing it like this:

  • load the file into an ifstream

  • read from this stream into char buffer[2]

  • cast it to unsigned short : unsigned short len{ *((unsigned short*)buffer) }; Now I have the string length.

That last one risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values that are aligned to an even address), performance problems (some CPUs will read misaligned values, but more slowly; others, like modern x86s, are fine and fast) and/or endianness issues. I would suggest reading the two characters, then you can say (x[0] << 8) | x[1] or vice versa, using htons if you need to correct the endianness.
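A sketch for a little-endian file (the question shows 5 stored as 0x05 0x00); input_stream is assumed to be the std::ifstream from the question:

 unsigned char x[2];
 if (input_stream.read(reinterpret_cast<char*>(x), 2))
 {
     unsigned short len = static_cast<unsigned short>(x[0]) |
                          (static_cast<unsigned short>(x[1]) << 8);
     // ...
 }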

  • read from the stream into a vector<char> and create a std::string from this vector . Now I have the id.

No need... just read directly into the string:

 std::string s(the_size, ' ');
 if (input_stream.read(&s[0], s.size()) &&
     input_stream.gcount() == s.size())
     ...use s...
  • read the next 4 bytes the same way and cast them to unsigned int . Now I have the step. While not at the end of the file, read floats the same way: create a char bufferFloat[4] and cast *((float*)bufferFloat) for each float .

Better to read the data directly into unsigned ints and floats, as then the compiler will ensure correct alignment.

It works, but to me it looks ugly. Is it possible to read directly into an unsigned short, float, string, etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I use is old-style)?

 struct Data {
     uint32_t x;
     float y[6];
 };
 Data data;
 if (input_stream.read((char*)&data, sizeof data) &&
     input_stream.gcount() == sizeof data)
     ...use x and y...

Note that the code above avoids reading data into potentially unaligned character arrays; it is unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string ) because of alignment issues. Again, you may need some conversion after reading, with htonl, if there is a chance the file content differs in endianness. If there is an unknown number of float s, you will need to calculate and allocate storage with alignment of at least 4 bytes, then aim a Data* at it... it is legal to index past the declared array size of y as long as the memory content at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream. Simpler - but with an extra read, so possibly slower - read the uint32_t first, then new float[n] and do a further read into there...
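A sketch of that simpler two-read variant: read the count first, then read the floats into properly aligned storage. num_floats is a stand-in, since the question leaves the exact relation between "step" and the float count open:

 uint32_t step;
 if (input_stream.read(reinterpret_cast<char*>(&step), sizeof step) &&
     input_stream.gcount() == sizeof step)
 {
     size_t num_floats = 2 * 3;      // hypothetical; derive from step / file size
     std::vector<float> y(num_floats);
     input_stream.read(reinterpret_cast<char*>(y.data()),
                       y.size() * sizeof(float));
 }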

In practice, this kind of approach can work, and a lot of low-level and C code does exactly that. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally...

+9
Nov 10 '14 at

I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's description of the format) just last month, and being modern, I decided to use C++ templates.

A packed struct may work on some specific platforms; however, there are things it does not handle well... such as fields of variable length. With templates, however, there is no such problem: you can get arbitrarily complex structures (and return types).

A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:

 using Buffer = std::pair<unsigned char const*, size_t>;

 template <typename OffsetReader>
 class UInt16LEReader: private OffsetReader {
 public:
     UInt16LEReader() {}
     explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}

     uint16_t read(Buffer const& buffer) const {
         OffsetReader const& reader = *this;
         size_t const offset = reader.read(buffer);

         assert(offset <= buffer.second && "Incorrect offset");
         assert(offset + 2 <= buffer.second && "Too short buffer");

         unsigned char const* begin = buffer.first + offset;

         // http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
         return (uint16_t(begin[0]) << 0)
              + (uint16_t(begin[1]) << 8);
     }
 }; // class UInt16LEReader

 // Similarly for UInt[8|16|32][LE|BE]...

Of course, the basic OffsetReader actually has a constant result:

 template <size_t O>
 class FixedOffsetReader {
 public:
     size_t read(Buffer const&) const { return O; }
 }; // class FixedOffsetReader

and since we are talking about templates, you can switch the types at your leisure (you could implement a proxy reader that delegates all reads to a shared_ptr , which memoizes them).

What is interesting, though, is the end result:

 // http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
 class LocalFileHeader {
 public:
     template <size_t O>
     using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
     template <size_t O>
     using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;

     UInt32< 0> signature;
     UInt16< 4> versionNeededToExtract;
     UInt16< 6> generalPurposeBitFlag;
     UInt16< 8> compressionMethod;
     UInt16<10> fileLastModificationTime;
     UInt16<12> fileLastModificationDate;
     UInt32<14> crc32;
     UInt32<18> compressedSize;
     UInt32<22> uncompressedSize;

     using FileNameLength = UInt16<26>;
     using ExtraFieldLength = UInt16<28>;

     using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
     using ExtraField = StringReader<
         CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
         ExtraFieldLength
     >;

     FileName filename;
     ExtraField extraField;
 }; // class LocalFileHeader
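A hypothetical usage sketch, assuming UInt32LEReader, StringReader and CombinedAdd are implemented along the same lines as UInt16LEReader above:

 Buffer buffer{ rawBytes, rawSize }; // pointer/size of the mapped archive

 LocalFileHeader header;
 uint32_t signature = header.signature.read(buffer);
 uint16_t method    = header.compressionMethod.read(buffer);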

This is quite simplistic, obviously, but incredibly flexible at the same time.

An obvious axis of improvement would be to improve chaining, since here there is a risk of accidental overlaps. My archive-reading code worked the first time I tried it, though, which was evidence enough for me that this code was sufficient for the task at hand.

+5
Nov 12 '14 at 18:19

I had to solve this problem once. The data files were packed FORTRAN output. It was all misaligned. I managed by using preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer into a struct. The idea is to describe the data in an include file:

 BEGIN_STRUCT(foo)
     UNSIGNED_SHORT(length)
     STRING_FIELD(length, label)
     UNSIGNED_INT(stride)
     FLOAT_ARRAY(3 * stride)
 END_STRUCT(foo)

Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef, and define the macros again to generate unpacking functions, followed by another include, etc.
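One possible declaration pass might look like this (a sketch; the expansion strategy, the foo.def file name, and the chosen C++ types are assumptions, not the original code):

 // Declaration pass: expand the description into a struct definition.
 #define BEGIN_STRUCT(name)        struct name {
 #define UNSIGNED_SHORT(field)     unsigned short field;
 #define STRING_FIELD(len, field)  std::string field;
 #define UNSIGNED_INT(field)       unsigned int field;
 #define FLOAT_ARRAY(n)            std::vector<float> floats;
 #define END_STRUCT(name)          };

 #include "foo.def"                // the description shown above

 #undef BEGIN_STRUCT
 #undef UNSIGNED_SHORT
 #undef STRING_FIELD
 #undef UNSIGNED_INT
 #undef FLOAT_ARRAY
 #undef END_STRUCT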

NB: I first saw this technique used in gcc for abstract-syntax-tree-related code generation.

If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).

It is amazing how often it pays to think about generating code rather than writing it by hand, at least in low-level foundational code like this.

+3
Nov 10 '14 at 14:29

You should better declare a struct (with 1-byte padding - how depends on the compiler), write using that struct, and read using the same struct. Put only PODs in the struct, and hence no std::string etc. Use this struct only for file I/O or other inter-process communication - use a normal struct or class to hold it for further use in the C++ program.
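A sketch of that split, using the question's example sizes (this only works while the sizes are fixed; RecordOnDisk and Record are hypothetical names):

 #pragma pack(push, 1)    // 1-byte packing; exact syntax is compiler-specific
 struct RecordOnDisk {    // POD only - used solely for file I/O
     unsigned short len;
     char id[5];
     unsigned int step;
     float data[6];
 };
 #pragma pack(pop)

 struct Record {          // normal type for the rest of the program
     std::string id;
     unsigned int step;
     std::vector<float> data;
 };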

+2
Nov 10 '14 at 14:10

Since all your data is variable, you can read the two blocks separately and still use casting:

 struct id_contents {
     uint16_t len;
     char id[];
 } __attribute__((packed)); // assuming gcc, ymmv

 struct data_contents {
     uint32_t stride;
     float data[];
 } __attribute__((packed)); // assuming gcc, ymmv

 class my_row {
     const id_contents* id_;
     const data_contents* data_;
     size_t size_;

 public:
     my_row(const char* buffer) {
         id_ = reinterpret_cast<const id_contents*>(buffer);
         size_ = sizeof(*id_) + id_->len;
         data_ = reinterpret_cast<const data_contents*>(buffer + size_);
         size_ += sizeof(*data_) +
                  data_->stride * sizeof(float); // or however many, 3*float?
     }

     size_t size() const { return size_; }
 };

This way you can use Mr. kbok's answer to parse properly:

 const char* buffer = getPointerToDataSomehow();

 my_row data1(buffer);
 buffer += data1.size();

 my_row data2(buffer);
 buffer += data2.size();

 // etc.
+2
Nov 10 '14 at 2:30

I personally do this:

 // some code which loads the file in memory

 #pragma pack(push, 1)
 struct someFile { int a, b, c; char d[0xEF]; };
 #pragma pack(pop)

 someFile* f = (someFile*) (file_in_memory);
 int filePropertyA = f->a;

A very efficient way for fixed-sized structures at the beginning of a file.
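The loading step is elided above; a minimal sketch, assuming the whole file fits comfortably in memory:

 #include <fstream>
 #include <iterator>
 #include <vector>

 std::ifstream in("data.bin", std::ios::binary);
 std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
 char* file_in_memory = bytes.data();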

+2
Nov 10 '14 at

Use a serialization library. Here are a few:

+1
Nov 13 '14 at 2:04

I use the ragel tool to generate procedural C source code (no tables) for microcontrollers with 1-2 KB of RAM. It does not use any file I/O or buffering, and it produces both easy-to-debug code and a .dot/.pdf file with a state-machine diagram.

ragel can also output Go, Java, ... parsing code, but I have not used those features.

A key feature of ragel is the ability to parse any byte-oriented data, but you cannot dig into bit fields. Another limitation is that ragel can parse regular structures, but it has no recursion and cannot parse grammars.

0
Mar 28 '19 at 9:15


