Does a heap-based solution outperform a stack-based one?

I am writing a simulation in C in which, given a sequence of rules to be checked, we break it into "slices" and check each slice. (The basic idea is that order matters and the effective value of a rule is influenced by certain rules above it, so for each rule we can build a "slice" containing only that rule and the rules that overlap with it. We then check the slices, which are usually much smaller than the whole sequence.)

My problem is as follows.

I have a structure (policy) that contains an array of structures (rules) and an int (length). My initial implementation used malloc and realloc liberally:

#include <stdio.h>
#include <stdlib.h>

struct policy { struct rule *rules; int length; };

...

struct policy makePolicy(int length)
{
    struct policy newPolicy;
    newPolicy.rules = malloc(length * sizeof(struct rule));
    newPolicy.length = length;
    return newPolicy;
}

...

struct policy makeSlice(struct policy inPol, int rulePos)
{
    if (rulePos > inPol.length - 1) {
        printf("Slice base outside policy\n");
        exit(1);
    }

    struct policy slice = makePolicy(inPol.length);
    /* ... build the slice; the loop counter ends up in sliceLength ... */
    slice.rules = realloc(slice.rules, sliceLength * sizeof(struct rule));
    slice.length = sliceLength;
    return slice;
}
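(For illustration, the elided slice-building loop inside makeSlice could be shaped roughly like the sketch below; rulesOverlap() is a hypothetical placeholder for whatever overlap test the simulation actually uses, not a function from the real code.)

/* Hypothetical sketch of the elided loop: copy the base rule and every
   earlier rule that overlaps it into the slice. */
int sliceLength = 0;
for (int i = 0; i <= rulePos; i++) {
    if (i == rulePos || rulesOverlap(inPol.rules[i], inPol.rules[rulePos])) {
        slice.rules[sliceLength++] = inPol.rules[i];
    }
}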

Since it uses malloc'ed memory, I assume it lives mostly on the heap. Now I am trying to port the code to an experimental parallel machine that does not have malloc.

So I went ahead and allocated everything with fixed-size arrays instead.
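(For reference, here is a sketch of what the fixed-size-array version presumably looks like, based on the MAXBUFLEN figure mentioned below; the exact names and layout are assumptions, not the actual code.)

#define MAXBUFLEN 10000

/* Assumed shape of the malloc-free rewrite: every policy (and every slice)
   carries room for MAXBUFLEN rules, whether it needs them or not. */
struct policy {
    struct rule rules[MAXBUFLEN];
    int length;
};

struct policy makePolicy(int length)
{
    struct policy newPolicy;   /* roughly MAXBUFLEN * sizeof(struct rule) bytes, even for a tiny slice */
    newPolicy.length = length;
    return newPolicy;          /* the whole struct, array included, is returned by value */
}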

Now here is the shocker.

The new code is slower. Much slower.

(The original code would take minutes at a time once the slice length reached about 200, and maybe over an hour beyond 300... now that happens at slice lengths of 70 or 80, and at around 120 it runs for hours. It has not even reached 200 yet.)

The only change is that the slices are now given the same amount of memory as the full policy (MAXBUFLEN, which is 10000 rules), yet the program does not appear to be anywhere near running out of memory: "top" shows that total memory consumption is quite modest, a few dozen megabytes, just as before. (And of course, since I keep track of the length, I never iterate over the whole array, only the part with the real rules.)

Can anyone help explain why it suddenly got so much slower?

1 answer

It appears that when you make the structure much larger (say, 10,000 rules), your cache locality becomes much worse than in the original version. You can use a profiler (oprofile, or cachegrind in Valgrind) to confirm that the cache is the problem.
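For example, a typical cachegrind run looks like this (the program name is just a placeholder):

valgrind --tool=cachegrind ./simulation
cg_annotate cachegrind.out.<pid>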

In the original program, one cache line can hold up to 8 struct policy objects (a 4-byte pointer plus a 4-byte int is 8 bytes on a 32-bit machine, and cache lines are 64 bytes). In the modified version the struct no longer fits in a cache line at all, since it is now far larger than one.

Moving the length field to the front may also improve performance, since then length and the first few struct rule entries can sit on the same cache line:

struct policy {
    int length;
    struct rule rules[10000];
};

To really solve this problem, you need to write your own custom allocator to preserve cache locality. If you write a parallel version of this program, also remember to keep the memory used by different threads on separate cache lines, to avoid cache-line contention (false sharing).
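A minimal sketch of such an allocator, assuming a single static backing buffer and a 64-byte cache line (both sizes are assumptions, not from the original program):

#include <stddef.h>

#define CACHE_LINE 64
#define POOL_SIZE  (1 << 20)   /* assumed 1 MiB backing store */

/* Bump allocator over a static buffer: no malloc required, and every
   allocation starts on its own cache line, so data handed to different
   threads never shares a line (no false sharing). */
static _Alignas(CACHE_LINE) unsigned char pool[POOL_SIZE];
static size_t pool_used = 0;

static void *pool_alloc(size_t bytes)
{
    size_t rounded = (bytes + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    if (pool_used + rounded > POOL_SIZE)
        return NULL;            /* pool exhausted */
    void *p = pool + pool_used;
    pool_used += rounded;
    return p;
}

A policy could then hold a pointer obtained from pool_alloc(length * sizeof(struct rule)) instead of a fixed 10,000-entry array, keeping each slice as small as its real rule count.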


