How do I analyze the contents of a binary serialization stream?

I'm using binary serialization (BinaryFormatter) as a temporary mechanism to store state information in a file for a relatively complex (game) object structure. The files are coming out much larger than I expect, and my data structure includes recursive references, so I'm wondering whether BinaryFormatter is actually storing multiple copies of the same objects, whether my basic "number of objects and values I should have" arithmetic is way off base, or where else the excessive size is coming from.

Searching on Stack Overflow I was able to find the specification for Microsoft's binary remoting format: http://msdn.microsoft.com/en-us/library/cc236844(PROT.10).aspx

What I can't find is any existing viewer that lets you "peek" into the contents of a BinaryFormatter output file: get object counts and total bytes for the different object types in the file, etc.

I feel like this must be my google-fu failing me (what little of it I have). Can anyone help? This must have been done before, right?


UPDATE: I couldn't find one and got no answers, so I put something together relatively quickly (link to the downloadable project below). I can confirm that BinaryFormatter does not store multiple copies of the same object, but it does write quite a lot of metadata to the stream. If you need efficient storage, build your own serialization methods.

+26
4 answers

Because it may be of interest to someone, I decided to write this post about the question: what does the binary format of serialized .NET objects look like, and how can we interpret it correctly?

I based all my research on the .NET Remoting: Binary Format Data Structure specification.



Example class:

To have a working example, I created a simple class called A that contains 2 properties, one string and one integer value; they are called SomeString and SomeValue.

Class A looks as follows:

    [Serializable()]
    public class A
    {
        public string SomeString { get; set; }
        public int SomeValue { get; set; }
    }

For serialization, I used BinaryFormatter , of course:

    // requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
    BinaryFormatter bf = new BinaryFormatter();
    StreamWriter sw = new StreamWriter("test.txt");
    bf.Serialize(sw.BaseStream, new A() { SomeString = "abc", SomeValue = 123 });
    sw.Close();

As you can see, I passed a new instance of class A containing abc and 123 as values.



Example result data:

If we look at the serialized result in a hex editor, we get something like this:

[Image: example result data]
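The screenshot is not reproduced here, but a hex editor is not strictly needed to follow along. As a minimal sketch, the helper below prints a byte array as rows of hex pairs. The class name HexDump and its Format method are made up for this illustration; the sample bytes are the first 17 bytes of the example stream, as quoted further down in this answer.

```csharp
using System;
using System.Text;

public static class HexDump
{
    // Formats a byte array as rows of space-separated hex pairs,
    // each row prefixed with its offset (illustrative helper, not a library API).
    public static string Format(byte[] data, int bytesPerLine = 16)
    {
        var sb = new StringBuilder();
        for (int offset = 0; offset < data.Length; offset += bytesPerLine)
        {
            int count = Math.Min(bytesPerLine, data.Length - offset);
            sb.Append(offset.ToString("X4")).Append(": ");
            sb.AppendLine(BitConverter.ToString(data, offset, count).Replace("-", " "));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // First 17 bytes of the example stream; for a real file,
        // System.IO.File.ReadAllBytes("test.txt") could be used instead.
        byte[] header =
        {
            0x00, 0x01, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF,
            0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
        };
        Console.Write(Format(header));
    }
}
```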



Let's interpret the data from the example result:

According to the above-mentioned specification (here is a direct link to the PDF: [MS-NRBF].pdf), each record in the stream is identified by a RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:

This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.



SerializationHeaderRecord:

So, if we look back at the data received, we can begin to interpret the first byte:

[Image: SerializationHeaderRecord_RecordTypeEnumeration]

As indicated in 2.1.2.1 RecordTypeEnumeration , a value of 0 identifies the SerializationHeaderRecord specified in 2.6.1 SerializationHeaderRecord :

The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.

It consists of:

  • RecordTypeEnum (1 byte)
  • RootId (4 bytes)
  • HeaderId (4 bytes)
  • MajorVersion (4 bytes)
  • MinorVersion (4 bytes)



With this knowledge, we can interpret a record containing 17 bytes:

[Image: SerializationHeaderRecord_Complete]

00 represents RecordTypeEnumeration , which is SerializationHeaderRecord in our case.

01 00 00 00 represents the RootId.

If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.

So, in our case, this should be the ObjectId with the value 1 (because the data is serialized little-endian), which we will hopefully see again ;-)

FF FF FF FF represents the HeaderId.

01 00 00 00 represents the MajorVersion.

00 00 00 00 represents the MinorVersion.
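The same interpretation can be sketched in code. This is only an illustration under the field layout listed above; the type and method names (HeaderDemo, SerializationHeader, Read) are made up here. BinaryReader.ReadInt32 reads little-endian values, which is what the stream uses.

```csharp
using System;
using System.IO;

public static class HeaderDemo
{
    public struct SerializationHeader
    {
        public byte RecordType;
        public int RootId, HeaderId, MajorVersion, MinorVersion;
    }

    // Reads the 17-byte SerializationHeaderRecord: 1 byte record type + four INT32 fields.
    public static SerializationHeader Read(BinaryReader reader)
    {
        var h = new SerializationHeader();
        h.RecordType = reader.ReadByte();
        if (h.RecordType != 0)
            throw new InvalidDataException("Expected SerializationHeaderRecord (0).");
        h.RootId = reader.ReadInt32();
        h.HeaderId = reader.ReadInt32();
        h.MajorVersion = reader.ReadInt32();
        h.MinorVersion = reader.ReadInt32();
        return h;
    }

    public static void Main()
    {
        // The 17 header bytes from the example stream.
        byte[] bytes =
        {
            0x00, 0x01, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF,
            0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
        };
        SerializationHeader h = Read(new BinaryReader(new MemoryStream(bytes)));
        Console.WriteLine("RootId={0} HeaderId={1} Version={2}.{3}",
            h.RootId, h.HeaderId, h.MajorVersion, h.MinorVersion);
    }
}
```

For the example bytes this yields RootId 1, HeaderId -1 (FF FF FF FF read as a signed INT32) and version 1.0.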

BinaryLibrary:

As stated, each record begins with a RecordTypeEnumeration. Since the previous record is complete, we must assume that a new one begins.

Let's interpret the next byte:

[Image: BinaryLibraryRecord_RecordTypeEnumeration]

As we can see, in our example the SerializationHeaderRecord is followed by a BinaryLibrary record:

The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.

It consists of:

  • RecordTypeEnum (1 byte)
  • LibraryId (4 bytes)
  • LibraryName (variable number of bytes ( LengthPrefixedString ))



As stated in 2.1.1.6 LengthPrefixedString ...

The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
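To make the variable-length prefix concrete, here is a minimal sketch of a decoder. It assumes the usual 7-bit scheme (seven bits of the length per byte, low-order group first, high bit set when another byte follows), which is also the prefix format that BinaryReader.ReadString understands. The class and method names are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefixedString
{
    // Reads the variable-length prefix: 7 bits of the length per byte, low-order
    // group first; the high bit of each byte signals that another byte follows.
    public static int ReadLength(Stream stream)
    {
        int length = 0;
        for (int shift = 0; shift < 35; shift += 7)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            length |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return length;
        }
        throw new InvalidDataException("Length prefix is longer than 5 bytes.");
    }

    // Reads the prefix, then that many bytes, and decodes them as UTF-8.
    public static string Read(Stream stream)
    {
        int length = ReadLength(stream);
        byte[] buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = stream.Read(buffer, read, length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return Encoding.UTF8.GetString(buffer);
    }

    public static void Main()
    {
        // 03 61 62 63: length 3 followed by the UTF-8 bytes of "abc".
        Console.WriteLine(Read(new MemoryStream(new byte[] { 0x03, 0x61, 0x62, 0x63 })));
        // 80 01: a two-byte prefix that encodes the length 128.
        Console.WriteLine(ReadLength(new MemoryStream(new byte[] { 0x80, 0x01 })));
    }
}
```

Lengths below 128 fit into a single prefix byte, which is why every string in the example stream has a 1-byte prefix.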

In our simple example, the length is always encoded using 1 byte . With this knowledge, we can continue to interpret the bytes in the stream:

[Image: BinaryLibraryRecord_RecordTypeEnumeration_LibraryId]

0C represents a RecordTypeEnumeration that identifies a BinaryLibrary record.

02 00 00 00 represents LibraryId , which is 2 in our case.



Now LengthPrefixedString follows:

[Image: BinaryLibraryRecord_RecordTypeEnumeration_LibraryId_LibraryName]

42 represents the length information of the LengthPrefixedString that contains the LibraryName.

In our case, the length information of 42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.

As already stated, the string is UTF-8 encoded, so the result of the bytes above is something like this: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
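As a rough sketch, the whole BinaryLibrary record can be rebuilt and read back with BinaryWriter and BinaryReader, whose string methods use the same length-prefixed UTF-8 layout. The class and method names below are made up for this illustration; the library name is the one shown above.

```csharp
using System;
using System.IO;
using System.Text;

public static class BinaryLibraryDemo
{
    const string Name = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null";

    // Rebuilds the record: 0C, LibraryId = 2, then the length-prefixed name.
    public static byte[] Build()
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream, Encoding.UTF8);
        writer.Write((byte)0x0C);
        writer.Write(2);
        writer.Write(Name);   // BinaryWriter emits the 7-bit-encoded length prefix
        writer.Flush();
        return stream.ToArray();
    }

    public static void Main()
    {
        byte[] record = Build();
        var reader = new BinaryReader(new MemoryStream(record), Encoding.UTF8);
        byte recordType = reader.ReadByte();
        int libraryId = reader.ReadInt32();
        string libraryName = reader.ReadString();

        Console.WriteLine("RecordType:    0x{0:X2}", recordType);
        Console.WriteLine("LibraryId:     {0}", libraryId);
        Console.WriteLine("Length prefix: 0x{0:X2}", record[5]);
        Console.WriteLine("LibraryName:   {0}", libraryName);
    }
}
```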



ClassWithMembersAndTypes:

Again, the record is complete, so we interpret the RecordTypeEnumeration of the next one:

[Image: ClassWithMembersAndTypesRecord_RecordTypeEnumeration]

05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:

The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.

It consists of:

  • RecordTypeEnum (1 byte)
  • ClassInfo (variable number of bytes)
  • MemberTypeInfo (variable number of bytes)
  • LibraryId (4 bytes)



ClassInfo:

As stated in 2.3.1.1 ClassInfo, the structure consists of:

  • ObjectId (4 bytes)
  • Name (variable number of bytes (again, LengthPrefixedString ))
  • MemberCount (4 bytes)
  • MemberNames (which is a sequence of LengthPrefixedString , where the number of elements MUST be equal to the value specified in the MemberCount field.)



Back to the raw data, step by step:

[Image: ClassWithMembersAndTypesRecord_RecordTypeEnumeration_ClassInfo_ObjectId]

01 00 00 00 represents the ObjectId. We have already seen this one: it was specified as the RootId in the SerializationHeaderRecord.

[Image: ClassWithMembersAndTypesRecord_RecordTypeEnumeration_ClassInfo_ObjectId_Name]

0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class, which is represented using a LengthPrefixedString. As mentioned, in our example the length of the string is encoded with 1 byte, so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A (so obviously I used StackOverFlow as the name of the namespace).

[Image: ClassWithMembersAndTypesRecord_RecordTypeEnumeration_ClassInfo_ObjectId_Name_MemberCount]

02 00 00 00 represents the MemberCount. It tells us that 2 member names will follow, both represented as LengthPrefixedString.

First member name:

[Image: ClassWithMembersAndTypesRecord_MemberNameOne]

1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, 27 bytes, which results in something like this: <SomeString>k__BackingField.

Second member name:

1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName. 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
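The ClassInfo fields map quite directly onto BinaryReader calls. The following is a sketch under the layout described above, with hypothetical names (ClassInfoDemo, ClassInfo, Read, BuildSample); the sample bytes are rebuilt from the values decoded in this section.

```csharp
using System;
using System.IO;
using System.Text;

public static class ClassInfoDemo
{
    public sealed class ClassInfo
    {
        public int ObjectId;
        public string Name;
        public string[] MemberNames;
    }

    // ObjectId, Name, MemberCount, then MemberCount length-prefixed member names.
    public static ClassInfo Read(BinaryReader reader)
    {
        var info = new ClassInfo();
        info.ObjectId = reader.ReadInt32();
        info.Name = reader.ReadString();
        int memberCount = reader.ReadInt32();
        info.MemberNames = new string[memberCount];
        for (int i = 0; i < memberCount; i++)
            info.MemberNames[i] = reader.ReadString();
        return info;
    }

    // Rebuilds the ClassInfo bytes of the example from the decoded values.
    public static byte[] BuildSample()
    {
        var stream = new MemoryStream();
        var writer = new BinaryWriter(stream, Encoding.UTF8);
        writer.Write(1);
        writer.Write("StackOverFlow.A");
        writer.Write(2);
        writer.Write("<SomeString>k__BackingField");
        writer.Write("<SomeValue>k__BackingField");
        writer.Flush();
        return stream.ToArray();
    }

    public static void Main()
    {
        var reader = new BinaryReader(new MemoryStream(BuildSample()), Encoding.UTF8);
        ClassInfo info = Read(reader);
        Console.WriteLine("ObjectId {0}, class {1}", info.ObjectId, info.Name);
        foreach (string member in info.MemberNames)
            Console.WriteLine("  member: " + member);
    }
}
```

Note how much of the record is names: the two compiler-generated backing-field names alone take 55 bytes, including their length prefixes.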



MemberTypeInfo:

ClassInfo is followed by MemberTypeInfo.

Section 2.3.1.2 MemberTypeInfo states that the structure contains:

  • BinaryTypeEnums (variable in length): a sequence of BinaryTypeEnumeration values that represents the member types being transferred. The array MUST:
      • have the same number of items as the MemberNames field of the ClassInfo structure;
      • be ordered such that each BinaryTypeEnumeration corresponds to the member name at the same position in the MemberNames field of the ClassInfo structure.

  • AdditionalInfos (variable in length): depending on the BinaryTypeEnum, additional info may or may not be present.

| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |

So, taking that into consideration, we are almost there. We expect 2 BinaryTypeEnumeration values (because MemberNames had 2 members).

Again, back to the raw data of the complete MemberTypeInfo record:

[Image: ClassWithMembersAndTypesRecord_MemberTypeInfo]

01 represents the BinaryTypeEnumeration of the first member. According to 2.1.2.2 BinaryTypeEnumeration, we can expect a String, and it is represented using a LengthPrefixedString.

00 represents the BinaryTypeEnumeration of the second member and, again according to the specification, it is a Primitive. As stated above, a Primitive is followed by additional information, in this case a PrimitiveTypeEnumeration. That is why we need to read the next byte, which is 08, match it against the table in 2.1.2.3 PrimitiveTypeEnumeration, and find that we can expect an Int32, which is represented by 4 bytes, as stated in another document about basic data types.
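The two-pass layout (all type bytes first, then the additional info only for the types that carry one) is easy to get wrong, so here is a small sketch. It only handles the two BinaryTypeEnumeration values that occur in this example (0 = Primitive, 1 = String) and deliberately rejects everything else; the names are made up for this illustration.

```csharp
using System;
using System.IO;

public static class MemberTypeInfoDemo
{
    // Only the two BinaryTypeEnumeration values that occur in this example.
    const byte Primitive = 0;
    const byte StringType = 1;

    public static string[] Read(BinaryReader reader, int memberCount)
    {
        // First all BinaryTypeEnumeration bytes, one per member ...
        byte[] binaryTypes = reader.ReadBytes(memberCount);
        if (binaryTypes.Length != memberCount) throw new EndOfStreamException();

        // ... then the additional info, only for the types that carry one.
        var result = new string[memberCount];
        for (int i = 0; i < memberCount; i++)
        {
            switch (binaryTypes[i])
            {
                case Primitive:
                    byte primitiveType = reader.ReadByte(); // PrimitiveTypeEnumeration
                    result[i] = "Primitive(" + primitiveType + ")";
                    break;
                case StringType:
                    result[i] = "String";
                    break;
                default:
                    throw new NotSupportedException(
                        "BinaryTypeEnumeration " + binaryTypes[i] + " is not handled in this sketch.");
            }
        }
        return result;
    }

    public static void Main()
    {
        // 01 00 08: String, Primitive, and the Primitive's type 08 (Int32).
        byte[] bytes = { 0x01, 0x00, 0x08 };
        string[] types = Read(new BinaryReader(new MemoryStream(bytes)), 2);
        Console.WriteLine(string.Join(", ", types));
    }
}
```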



LibraryId:

MemberTypeInfo is followed by the LibraryId, which is represented by 4 bytes:

[Image: ClassWithMembersAndTypesRecord_LibraryId]

02 00 00 00 represents LibraryId , which is 2.



Values:

As stated in 2.3 Class Records :

The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.

That is why we can now expect the values of the members.

Let's look at the last few bytes:

[Image: BinaryObjectStringRecord_RecordTypeEnumeration]

06 identifies a BinaryObjectString . It represents the value of our SomeString property ( <SomeString>k__BackingField , to be precise).

According to 2.5.7 BinaryObjectString it contains:

  • RecordTypeEnum (1 byte)
  • ObjectId (4 bytes)
  • Value (variable length represented as LengthPrefixedString )



Knowing that, we can clearly identify that

[Image: BinaryObjectStringRecord_RecordTypeEnumeration_ObjectId_MemberOneValue]

03 00 00 00 represents the ObjectId.

03 61 62 63 represents the Value, where 03 is the length of the string itself and 61 62 63 are the content bytes, which translate to abc.

I hope you remember that there was a second member, an Int32. Knowing that an Int32 is represented using 4 bytes, we can conclude that

[Image: BinaryObjectStringRecord_RecordTypeEnumeration_ObjectId_MemberOneValue_MemberTwoValue]

must be the Value of our second member. 7B is the hexadecimal representation of decimal 123 (followed by 00 00 00, since the value is stored little-endian), which matches our example code.

So here is the complete ClassWithMembersAndTypes record:

[Image: ClassWithMembersAndTypesRecord_Complete]



MessageEnd:

[Image: MessageEnd_RecordTypeEnumeration]

Finally, the last byte 0B represents the MessageEnd record.
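Putting the pieces together, the sketch below walks the complete example stream and reports how many bytes each top-level record occupies. It is only an illustration: it rebuilds the stream from the values decoded above, handles just the record and member types that occur in this example, and counts the member values as part of the class record that owns them. All names (StreamWalker, BuildSample, Walk) are made up here.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamWalker
{
    // Rebuilds the example stream from the field values quoted in the walkthrough.
    public static byte[] BuildSample()
    {
        var ms = new MemoryStream();
        var w = new BinaryWriter(ms, Encoding.UTF8);
        w.Write((byte)0x00); w.Write(1); w.Write(-1); w.Write(1); w.Write(0);
        w.Write((byte)0x0C); w.Write(2);
        w.Write("_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null");
        w.Write((byte)0x05); w.Write(1); w.Write("StackOverFlow.A"); w.Write(2);
        w.Write("<SomeString>k__BackingField");
        w.Write("<SomeValue>k__BackingField");
        w.Write(new byte[] { 0x01, 0x00, 0x08 });   // String, Primitive, Int32
        w.Write(2);                                  // LibraryId
        w.Write((byte)0x06); w.Write(3); w.Write("abc");
        w.Write(123);                                // untyped Int32 member value
        w.Write((byte)0x0B);
        w.Flush();
        return ms.ToArray();
    }

    // Reads a class record body plus its member values (String and Int32 only).
    static void ReadClass(BinaryReader r)
    {
        r.ReadInt32();                               // ObjectId
        r.ReadString();                              // class name
        int count = r.ReadInt32();
        for (int i = 0; i < count; i++) r.ReadString();
        byte[] binaryTypes = r.ReadBytes(count);
        byte[] primitiveTypes = new byte[count];
        for (int i = 0; i < count; i++)
            if (binaryTypes[i] == 0) primitiveTypes[i] = r.ReadByte();
        r.ReadInt32();                               // LibraryId
        for (int i = 0; i < count; i++)
        {
            if (binaryTypes[i] == 1 && r.ReadByte() == 0x06) { r.ReadInt32(); r.ReadString(); }
            else if (binaryTypes[i] == 0 && primitiveTypes[i] == 8) r.ReadInt32();
            else throw new NotSupportedException("Member type not handled in this sketch.");
        }
    }

    // Calls onRecord(name, sizeInBytes) for every top-level record in the stream.
    public static void Walk(byte[] data, Action<string, long> onRecord)
    {
        var ms = new MemoryStream(data);
        var r = new BinaryReader(ms, Encoding.UTF8);
        while (ms.Position < ms.Length)
        {
            long start = ms.Position;
            byte recordType = r.ReadByte();
            string name;
            switch (recordType)
            {
                case 0x00: name = "SerializationHeaderRecord"; r.ReadBytes(16); break;
                case 0x0C: name = "BinaryLibrary"; r.ReadInt32(); r.ReadString(); break;
                case 0x05: name = "ClassWithMembersAndTypes"; ReadClass(r); break;
                case 0x0B: name = "MessageEnd"; break;
                default: throw new NotSupportedException("Record type " + recordType + " not handled.");
            }
            onRecord(name, ms.Position - start);
        }
    }

    public static void Main()
    {
        Walk(BuildSample(), (name, size) => Console.WriteLine("{0,-26} {1,4} bytes", name, size));
    }
}
```

Even in this tiny example most of the stream is metadata (library name, class name, member names); the actual payload, abc and 123, is only a handful of bytes.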

+48

Vasily is right that I will ultimately need to implement my own formatter/serialization process, to handle versioning better and to output a much more compact stream (before compression).

I did want to understand what was happening in the stream, however, so I wrote a (relatively) quick class that does what I wanted:

  • it parses its way through the stream, building collections of object names, counts and sizes;
  • once done, it outputs a quick summary of what it found: classes, counts and total sizes in the stream.

It didn't seem worth publishing somewhere prominent like CodeProject, so I just put the project in a zip file on my website: http://www.architectshack.com/BinarySerializationAnalysis.ashx

In my particular case, it turns out that the problem is twofold:

  • BinaryFormatter is VERY verbose (this is known, I just did not understand to what extent)
  • I had problems in my class, it turned out that I stored objects that I did not need.

Hope this helps someone at some point!


Update: Ian Wright contacted me about a problem with the source code: it crashed when the source object(s) contained "decimal" values. This is now fixed, and I took the opportunity to move the code to GitHub and give it a (permissive, BSD) license.

+7

Our application works with massive amounts of data. It can take up to 1-2 GB of RAM, much like your game. We ran into the same "storing multiple copies of the same objects" problem. Binary serialization also stores too much metadata. When it was first implemented, the serialized file took about 1-2 GB. By now I have managed to bring that down to 50-100 MB. Here is what we did.

The short answer: don't use .NET binary serialization; create your own binary serialization mechanism. We have our own BinaryFormatter class and an ISerializable interface (with two methods, Serialize and Deserialize).

The same object should not be serialized more than once. We save its unique identifier and restore the object from the cache.
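The answer's actual code was not posted, so the following is only a guess at the general idea: a table that hands out an ID the first time it sees an object and returns the same ID afterwards, so a custom writer can emit the full payload once and just the ID for every later reference. All type names here (ReferenceComparer, ObjectIdTable) are hypothetical. Keys are compared by reference, so two distinct objects with equal contents still get separate IDs.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares keys by reference identity instead of Equals/GetHashCode overrides.
public sealed class ReferenceComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y) { return ReferenceEquals(x, y); }
    int IEqualityComparer<object>.GetHashCode(object obj) { return RuntimeHelpers.GetHashCode(obj); }
}

public sealed class ObjectIdTable
{
    private readonly Dictionary<object, int> ids =
        new Dictionary<object, int>(new ReferenceComparer());

    // Returns the ID for obj; firstTime is true only if the object was not seen before.
    public int GetId(object obj, out bool firstTime)
    {
        int id;
        firstTime = !ids.TryGetValue(obj, out id);
        if (firstTime)
        {
            id = ids.Count + 1;
            ids.Add(obj, id);
        }
        return id;
    }
}

public static class Program
{
    public static void Main()
    {
        var table = new ObjectIdTable();
        var shared = new object();
        bool first;
        // First sighting: a writer would emit the ID plus the full payload.
        Console.WriteLine(table.GetId(shared, out first) + " " + first);
        // Second sighting: only the ID needs to be written.
        Console.WriteLine(table.GetId(shared, out first) + " " + first);
        // A different instance gets a new ID.
        Console.WriteLine(table.GetId(new object(), out first) + " " + first);
    }
}
```

On the reading side, the counterpart would be a dictionary from ID to the already-restored object, which is presumably the "cache" the answer refers to.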

I can share some code if you ask.

EDIT: It seems you are right. See the following code - this proves that I was wrong.

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    public class Item
    {
        public string Data { get; set; }
    }

    [Serializable]
    public class ItemHolder
    {
        public Item Item1 { get; set; }
        public Item Item2 { get; set; }
    }

    public class Program
    {
        public static void Main(params string[] args)
        {
            {
                // Both properties reference the same Item instance.
                Item item0 = new Item() { Data = "0000000000" };
                ItemHolder holderOneInstance = new ItemHolder() { Item1 = item0, Item2 = item0 };

                var fs0 = File.Create("temp-file0.txt");
                var formatter0 = new BinaryFormatter();
                formatter0.Serialize(fs0, holderOneInstance);
                fs0.Close();
                Console.WriteLine("One instance: " + new FileInfo(fs0.Name).Length); // 335
                //File.Delete(fs0.Name);
            }

            {
                // The properties reference two different Item instances.
                Item item1 = new Item() { Data = "1111111111" };
                Item item2 = new Item() { Data = "2222222222" };
                ItemHolder holderTwoInstances = new ItemHolder() { Item1 = item1, Item2 = item2 };

                var fs1 = File.Create("temp-file1.txt");
                var formatter1 = new BinaryFormatter();
                formatter1.Serialize(fs1, holderTwoInstances);
                fs1.Close();
                Console.WriteLine("Two instances: " + new FileInfo(fs1.Name).Length); // 360
                //File.Delete(fs1.Name);
            }
        }
    }

It looks like BinaryFormatter uses object.Equals to search for the same objects.

Have you ever looked at the generated files? If you open "temp-file0.txt" and "temp-file1.txt" from the sample code, you will see that they contain a lot of metadata. That is why I recommended that you create your own serialization mechanism.

Sorry for being ripe.

+4

Perhaps you could run your program in debug mode and try adding a breakpoint.

If that is not possible because of the size of the game or other dependencies, you could always write a simple, small application that includes the deserialization code and inspect the result from debug mode there.

0

Source: https://habr.com/ru/post/950835/

