Mmap vs malloc: weird performance

I'm writing code that analyzes log files, with the caveat that the files are compressed and have to be decompressed on the fly. This code is very performance-sensitive, so I'm benchmarking several approaches to find the right one. I essentially have as much RAM as the program needs, no matter how many threads I use.

I found a method that seems to work quite well, and I'm trying to figure out why it offers the best performance.

Both methods have a reader thread that reads from the gzip process through a pipe and writes into a large buffer. The buffer is then lazily parsed whenever the next log line is requested, returning what is essentially a struct of pointers to where the different fields live in the buffer.

The code is in D, but it is very similar to C or C++.

Global variables:

    shared(bool)  _stream_empty = false;
    shared(ulong) upper_bound   = 0;
    shared(ulong) curr_index    = 0;

Parsing code:

    // Lazily parse the buffer
    void construct_next_elem()
    {
        while (1)
        {
            // Spin to stop us from getting ahead of the reader thread
            buffer_empty = curr_index >= upper_bound - 1 && _stream_empty;
            if (curr_index >= upper_bound && !_stream_empty)
            {
                continue;
            }
            // Parsing logic .....
        }
    }

Method 1: malloc a buffer large enough to hold the uncompressed file.

    char[] buffer;                 // Same as vector<char> in C++
    buffer.length = buffer_length; // Same as vector reserve in C++ or malloc

Method 2: Use an anonymous memory map as the buffer.

    import std.mmfile : MmFile;

    MmFile buffer;
    buffer = new MmFile(null,
                        MmFile.Mode.readWrite, // PROT_READ | PROT_WRITE
                        buffer_length,
                        null);                 // MAP_ANON | MAP_PRIVATE

Reader code:

    import std.process : pipeProcess, Redirect;
    import std.parallelism : task;

    ulong buffer_length = get_gzip_length(file_path);

    pipe = pipeProcess(["gunzip", "-c", file_path], Redirect.stdout);
    stream = pipe.stdout;

    static void stream_data()
    {
        while (!stream.eof)
        {
            // The slice is a reference into the buffer, not a copy
            char[] slice = buffer[upper_bound .. upper_bound + READ_SIZE];
            ulong read = stream.rawRead(slice).length;
            upper_bound += read;
        }
        // Clean up
    }

    void start_stream()
    {
        auto t = task!stream_data();
        t.executeInNewThread();
        construct_next_elem();
    }

I get significantly better performance from Method 1, roughly 3x faster in wall-clock time:

    User time (seconds): 112.22
    System time (seconds): 38.56
    Percent of CPU this job got: 151%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:39.40
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3784992
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 5463
    Voluntary context switches: 90707
    Involuntary context switches: 2838
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

vs.

    User time (seconds): 275.92
    System time (seconds): 73.92
    Percent of CPU this job got: 117%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 4:58.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3777336
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 944779
    Voluntary context switches: 89305
    Involuntary context switches: 9836
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Method 2 takes far more minor page faults (944,779 vs. 5,463).

Can someone shed some light on why there is such a dramatic drop in performance when using mmap?

If anyone knows of a better approach to this problem, I would love to hear it.

EDIT -----

Changed Method 2 to use mmap directly:

    import core.sys.posix.sys.mman;

    char* buffer = cast(char*) mmap(cast(void*) null,
                                    buffer_length,
                                    PROT_READ | PROT_WRITE,
                                    MAP_ANON | MAP_PRIVATE,
                                    -1, 0);

and it now gives a 3x performance boost over the plain MmFile version. I'm trying to figure out what could cause such a dramatic difference in performance, since MmFile is essentially just a wrapper around mmap.

Perf numbers for the raw char* mmap versus MmFile; the mmap path takes far fewer page faults:

    User time (seconds): 109.99
    System time (seconds): 36.11
    Percent of CPU this job got: 151%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:36.20
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3777896
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 2771
    Voluntary context switches: 90827
    Involuntary context switches: 2999
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
1 answer

You get page faults and slowdowns because mmap, by default, only brings a page in when you first touch it.

A plain read, on the other hand, knows that you are reading sequentially, so it fetches pages ahead of time, before you ask for them.
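One way to check whether demand faulting really is the cost here: on Linux you can ask mmap to pre-fault the whole anonymous mapping up front with MAP_POPULATE, so the minor faults are paid once at allocation time instead of one page at a time inside the reader loop. A rough sketch, assuming MAP_POPULATE is exposed by core.sys.linux.sys.mman in your druntime; mapPrefaulted is an illustrative helper, not part of the code above:

    import core.sys.linux.sys.mman; // also pulls in core.sys.posix.sys.mman

    // Illustrative helper: allocate the buffer with MAP_POPULATE so the
    // kernel wires up the pages immediately instead of on first touch.
    // Assumes Linux; MAP_POPULATE is a Linux-specific flag.
    char[] mapPrefaulted(size_t buffer_length)
    {
        void* p = mmap(null, buffer_length,
                       PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_PRIVATE | MAP_POPULATE,
                       -1, 0);
        assert(p != MAP_FAILED, "mmap failed");
        return (cast(char*) p)[0 .. buffer_length];
    }

Pre-faulting does not remove the work, it only moves it to startup, so it mainly helps when the per-page fault overhead, rather than the raw memory traffic, is what dominates.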

Look at the madvise call: it is designed to tell the kernel how you are going to access the mmap'ed memory, and you can set different strategies for different parts of the mapping. For example, an index block that you want kept resident (MADV_WILLNEED), data that is accessed randomly and only on demand (MADV_RANDOM), or memory that you loop over in a sequential scan (MADV_SEQUENTIAL).

However, the OS is completely free to ignore the advice you set, so YMMV.
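In D, that advice can be issued through the posix_madvise wrapper in core.sys.posix.sys.mman. A rough sketch only, assuming buffer and buffer_length refer to the mapping from Method 2; adviseSequential is an illustrative name, not an existing function:

    import core.sys.posix.sys.mman;

    // Illustrative helper: tell the kernel the mapping will be filled and
    // scanned front to back. This is advice only; the kernel may ignore it.
    void adviseSequential(void* buffer, size_t buffer_length)
    {
        int rc = posix_madvise(buffer, buffer_length, POSIX_MADV_SEQUENTIAL);
        assert(rc == 0, "posix_madvise(SEQUENTIAL) failed");

        // Optionally ask for the leading pages to be brought in ahead of the
        // reader thread; 1 MB here is an arbitrary illustrative amount.
        posix_madvise(buffer, 1 << 20, POSIX_MADV_WILLNEED);
    }

Note that sequential advice matters most for file-backed mappings, where it drives readahead; for an anonymous mapping like the one above its effect is limited, which fits the caveat that the OS may ignore it.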
