Streaming while serializing with Cap'n Proto

Consider a Cap'n Proto schema like the following:

    struct Document {
      header @0 :Header;
      records @1 :List(Record);  # usually a large number of records
      footer @2 :Footer;
    }

    struct Header {
      numberOfRecords @0 :UInt32;
      # some fields
    }

    struct Footer {
      # some fields
    }

    struct Record {
      type @0 :UInt32;
      desc @1 :Text;
      # some more fields, relatively large in total
    }

Now I want to serialize (i.e. build) an instance of Document and send it to a remote destination.

Since the document is usually very large, I do not want to build it completely in memory before sending it. Instead, I am looking for a builder that sends it over the wire struct by struct, so that the extra memory needed for buffering is constant, i.e. O(max(sizeof(Header), sizeof(Record), sizeof(Footer))).

Looking at the tutorial material, I cannot find such a builder. MallocMessageBuilder seems to build everything in memory first (and then you call writeMessageToFd on it).

Does the Cap'n Proto API support this use case?

Or is Cap'n Proto intended more for messages that fit into memory before being sent?

In this example, the Document struct could be dropped, and you could simply send a sequence of messages: one Header, n Records, and one Footer. Since Cap'n Proto messages are self-delimiting, this should work. But you lose your document root, which may sometimes not be an option.

1 answer

The solution you outlined, sending the parts of the document as separate messages, is probably the best one for your use case. Fundamentally, Cap'n Proto is not designed for streaming chunks of a single message, since that would not fit well with its random-access properties (for example, what happens when you try to follow a pointer to a chunk you haven't received yet?). Instead, when you want streaming, you should split the large message into a series of smaller messages.
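As a rough illustration of that pattern, here is a minimal sketch that writes a header, then the records one at a time, then a footer, each as its own self-delimiting frame. It deliberately does not use Cap'n Proto: the 4-byte length prefix is a stand-in framing, not Cap'n Proto's wire format, and writeFrame, readFrame, and streamDocument are hypothetical names. In real code each frame would be a separate message written with capnp::writeMessageToFd() and read back with capnp::StreamFdMessageReader.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Write exactly `size` bytes to `fd`, retrying on short writes.
void writeAll(int fd, const void* buf, size_t size) {
  const char* p = static_cast<const char*>(buf);
  while (size > 0) {
    ssize_t n = write(fd, p, size);
    if (n < 0) throw std::runtime_error("write failed");
    p += n;
    size -= static_cast<size_t>(n);
  }
}

// Read exactly `size` bytes. Returns false on a clean EOF before the
// first byte; throws if the stream ends in the middle of the data.
bool readAll(int fd, void* buf, size_t size) {
  char* p = static_cast<char*>(buf);
  size_t got = 0;
  while (got < size) {
    ssize_t n = read(fd, p + got, size - got);
    if (n < 0) throw std::runtime_error("read failed");
    if (n == 0) {
      if (got == 0) return false;
      throw std::runtime_error("truncated frame");
    }
    got += static_cast<size_t>(n);
  }
  return true;
}

// Stand-in for capnp::writeMessageToFd(): length prefix, then payload.
void writeFrame(int fd, const std::string& payload) {
  assert(payload.size() <= UINT32_MAX);
  uint32_t len = static_cast<uint32_t>(payload.size());
  writeAll(fd, &len, sizeof(len));
  writeAll(fd, payload.data(), payload.size());
}

// Stand-in for capnp::StreamFdMessageReader: reads one frame.
// Returns false when the stream has ended cleanly.
bool readFrame(int fd, std::string& payload) {
  uint32_t len = 0;
  if (!readAll(fd, &len, sizeof(len))) return false;
  payload.resize(len);
  if (len > 0 && !readAll(fd, &payload[0], len)) {
    throw std::runtime_error("truncated frame");
  }
  return true;
}

// Header, then one frame per record, then footer. Only one record is
// held in memory at a time; makeRecord(i) produces record number i.
void streamDocument(int fd, uint32_t numberOfRecords,
                    const std::function<std::string(uint32_t)>& makeRecord) {
  writeFrame(fd, "header:" + std::to_string(numberOfRecords));
  for (uint32_t i = 0; i < numberOfRecords; ++i) {
    writeFrame(fd, makeRecord(i));  // temporary is released after the write
  }
  writeFrame(fd, "footer");
}
```

The receiver calls readFrame() in a loop and treats the first frame as the header and the last as the footer, so the "document root" survives only as a convention between sender and receiver.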

That said, unlike other similar systems (for example, Protobuf), Cap'n Proto does not strictly require messages to fit into memory. In particular, you can do some tricks using mmap(2). If your document data comes from a file on disk, you can mmap() the file into memory and then incorporate it into your message. With mmap(), the operating system does not actually read the data from disk until you access the memory, and it can also evict pages from memory after they have been accessed, since it knows it still has a copy on disk. This often lets you write much simpler code, since you no longer need to think about memory management.

To incorporate an mmap()ed chunk into a Cap'n Proto message, you'll want to use capnp::Orphanage::referenceExternalData(). For example, given:

    struct MyDocument {
      body @0 :Data;
      # (other fields)
    }

You can write:

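    #include <cerrno>
    #include <sys/mman.h>
    #include <capnp/message.h>
    #include <kj/debug.h>
    // (plus the header generated from the schema above)

    // Map file into memory. `fd` is the open file, `size` its length.
    void* ptr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (ptr == MAP_FAILED) {
      KJ_FAIL_SYSCALL("mmap", errno);
    }
    auto data = capnp::Data::Reader(
        static_cast<const kj::byte*>(ptr), size);

    // Incorporate it into a message without copying the bytes.
    capnp::MallocMessageBuilder message;
    auto root = message.getRoot<MyDocument>();
    root.adoptBody(
        message.getOrphanage().referenceExternalData(data));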

Because Cap'n Proto is zero-copy, it ends up writing the mmap()ed memory directly to the socket without ever accessing it itself. It is then up to the OS to read the content from disk and pass it to the socket as appropriate.

Of course, you still have a problem on the receiving end. It will be much harder to design the receiving end to read into mmap()ed memory. One strategy might be to first dump the entire stream directly to a file (without involving the Cap'n Proto library), then mmap() that file and use capnp::FlatArrayMessageReader to read the mmap()ed data in place.

I describe all this because it is a neat thing that is possible with Cap'n Proto but not with most other serialization frameworks (for example, you could not do this with Protobuf). Tricks with mmap() are sometimes really useful; I have used them successfully in several places in Sandstorm, Cap'n Proto's parent project. However, I suspect that for your use case, splitting the document into a series of messages probably makes more sense.


Source: https://habr.com/ru/post/1240747/

