C++: optimizing the order of member variables?

I was reading a blog post by a game coder for Introversion, and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions is to

"reorder class member variables to the most used and least used."

I am not familiar with C++, or with how it compiles, but I was wondering if

  • Is this statement true?
  • How/Why?
  • Does this apply to other (compiled/scripted) languages?

I know that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But, on the other hand, in most functions it would be pretty easy to determine which variables will be used most often, and just start coding this way by default.

+41
c++ performance optimization embedded
May 21 '09 at 12:48
11 answers

Two issues here:

  • Whether and when keeping certain fields together is an optimization.
  • How to actually do it.

The reason it might help is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking, the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty much irrelevant. Otherwise, you might sometimes get away with having only part of your object in the cache, and the rest in main memory (or in the L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of that happening.

The general principle is called "locality of reference". The closer together the different memory addresses your program accesses, the better your chances of getting good cache behaviour. It is often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multithreading means you often don't know what is going to be in the cache, and so on. But it is possible to talk about what is likely to happen, most of the time. If you want to know anything for sure, you generally have to measure it.
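
To make the idea concrete, here is a minimal C++ sketch of the kind of layout this suggests. The struct and its field names are invented for illustration, and whether the grouping actually wins anything depends on the cache line size and the access pattern, so measure:

    #include <cstdint>
    #include <string>

    // Hypothetical game entity. The fields touched every frame are grouped
    // at the front so they tend to share a cache line; the rarely touched
    // "cold" data sits after them (or could even live in a separate
    // allocation behind a pointer).
    struct Entity {
        // Hot: read/written every frame.
        float         x, y, z;
        float         vx, vy, vz;
        std::uint32_t flags;

        // Cold: only touched on spawn, save, or in the editor.
        std::string   debug_name;
        std::uint64_t creation_time;
        std::uint8_t  editor_metadata[256];
    };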

Note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised, because they all lock the same memory location even though they are operating on different fields. Had they been operating on different cache lines, they would have worked in parallel and run faster. In fact, as Glen (via Herb Sutter) points out, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share a cache. You can expect it to be, on the grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
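
For illustration, here is a hedged sketch of that false-sharing scenario. The 64-byte line size is an assumption (C++17 offers std::hardware_destructive_interference_size where implemented), and the effect is only meaningful if you time it on your own hardware:

    #include <atomic>
    #include <thread>

    // Two counters hammered concurrently from two threads. In the packed
    // layout they almost certainly share one cache line, so the atomic
    // increments contend with each other; in the padded layout each counter
    // gets its own (assumed 64-byte) line and the cores stay out of each
    // other's way.
    struct PackedCounters {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    struct PaddedCounters {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <class Counters>
    void hammer(Counters& c) {
        std::thread t1([&c] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1); });
        std::thread t2([&c] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1); });
        t1.join();
        t2.join();
    }

    int main() {
        PackedCounters p;
        PaddedCounters q;
        hammer(p);   // typically slower: false sharing on the shared line
        hammer(q);   // typically faster: independent cache lines
    }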

Now, quite aside from distinguishing between commonly used and less used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all round, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between the fields to ensure they are correctly aligned for the architecture. C++ (sometimes) places constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:

  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then most likely it will take 16 bytes in memory. The size and alignment of int are not the same on every platform, by the way, but 4 is very common, and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++: adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
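
As a quick sanity check of the arithmetic above, here is a small sketch (assuming a 4-byte, 4-aligned int; the exact numbers are platform-dependent):

    #include <cstdio>

    // The ordering described above: int, char, int, char. With a 4-byte,
    // 4-aligned int this typically occupies 16 bytes (3 bytes of padding
    // after each char).
    struct Sparse { int a; char b; int c; char d; };

    // Same fields with the chars grouped together: typically 12 bytes.
    struct Dense { int a; int c; char b; char d; };

    int main() {
        std::printf("Sparse: %zu bytes\n", sizeof(Sparse));   // likely 16
        std::printf("Dense:  %zu bytes\n", sizeof(Dense));    // likely 12
    }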

I said "most likely" because whether int is 4-aligned depends on the platform: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it is generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on, down to members with no alignment requirement. For example, if I'm trying to write portable code, I might come up with this:

    #include <stdint.h>

    struct some_stuff {
        double   d;    // I expect double is 64-bit IEEE, it might not be
        uint64_t l;    // 8 bytes, could be 8-aligned or 4-aligned, I don't know
        uint32_t i;    // 4 bytes, usually 4-aligned
        int32_t  j;    // same
        short    s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
        char     c[4]; // array of 4 chars, 4 bytes big but "never" needs 4-alignment
        char     e;    // 1 byte, any alignment
    };

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is that it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
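
If you'd rather have the compiler tell you when a guess is wrong, a minimal sketch (assuming C++11 for static_assert, with the some_stuff definition above in scope) is to write the assumptions down as compile-time checks; they cost nothing at run time and break the build exactly when the ordering deserves another look:

    // Assumptions from the discussion above, spelled out as checks.
    static_assert(alignof(some_stuff) <= 8,
                  "struct assumed to need at most 8-byte alignment");
    static_assert(alignof(uint64_t) == 8 || alignof(uint64_t) == 4,
                  "uint64_t assumed to be 8- or 4-aligned");
    static_assert(sizeof(short) == 2, "short assumed to be 2 bytes");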

Note that game programmers, like your blogger, often know everything about their processor and hardware, and so they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly used fields at the start of the object" and "sort by alignment requirement" are two good rules.

+54
May 21 '09 at 17:13

Depending on the type of program you're running, this advice may result in increased performance, or it may slow things down drastically.

Doing this in a multi-threaded program means you'll increase the chances of "false sharing".

Check out Herb Sutter's related articles here.

I've stated it before and I'll keep stating it: the only real way to get a real performance increase is to measure your code and use tools to identify the real bottleneck, instead of arbitrarily changing stuff in your code base.

+10
May 21 '09 at 13:00

This is one way of optimizing the working-set size. There is a good article by John Robbins on how you can speed up application performance by optimizing its working-set size. Of course, it involves careful selection of the most frequent use cases the end user is likely to perform with the application.

+6
May 21 '09 at 13:04

We have slightly different guidelines for member ordering here (ARM architecture target, mostly 16-bit THUMB codegen, for various reasons):

  • group by alignment requirement (or, for newbies, "group by size" usually does the trick)
  • smallest first

"Alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, and so on.

The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword) and LDR (Load Register) instructions.

5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):

  • 8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
  • 16-bit halfwords if they exist at this+0 through this+62
  • 32-bit machine words if they exist at this+0 through this+124

If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or, worse yet, a load from the literal pool at the end of the function.

If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cache line's worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end on a cache line boundary).

(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cache line boundaries would be one of them.)

(Unrelatedly, one of the things we do to keep literal pool usage down is to keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups, one per global. If you're really clever, you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry - was it .sbss?)
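
As an aside, here is a sketch of that "single global table" idea in C++ (the names are hypothetical): the point is that only one base address ever has to be materialised, rather than one literal-pool entry per global:

    #include <cstdint>

    // All former globals live in one struct, with the hottest fields first
    // so they sit at small offsets from the table's base address.
    struct GlobalTable {
        std::uint32_t frame_counter;
        std::uint32_t rng_state;
        // ...every other former global goes here...
    };

    GlobalTable g_table;   // one address to look up, instead of many

    inline std::uint32_t next_frame() {
        return ++g_table.frame_counter;   // base register + small immediate offset
    }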

+3
May 21 '09 at 17:44

While locality of reference to improve the cache behaviour of data accesses is often a relevant consideration, there are a couple of other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used in many embedded systems do not even have a cache.

- Field alignment in structures

Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.

On most processor architectures, fields in a structure must be accessed at their natural alignment for efficiency. This means that if you mix fields of different sizes, the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure, it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields, to minimize the required padding. If a structure is to be "packed" to prevent padding, accessing unaligned fields comes at a high runtime cost, since the compiler has to access unaligned fields using a series of accesses to smaller parts of the field, along with shifts and masks to assemble the field's value in a register.
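
A hedged illustration of that trade-off (#pragma pack is not standard C++, but GCC, Clang and MSVC all understand it; the field names are invented):

    #include <cstdint>

    // The same two fields laid out naturally and forcibly packed. The packed
    // variant saves the padding bytes, but 'value' is now misaligned, so on
    // many targets the compiler reads it with several narrow loads plus
    // shifts and masks (or the hardware pays an access penalty).
    struct Natural {
        std::uint32_t value;   // 4-aligned: one aligned load
        std::uint8_t  flag;
        // 3 bytes of padding here so the size stays a multiple of 4
    };

    #pragma pack(push, 1)
    struct Packed {
        std::uint8_t  flag;
        std::uint32_t value;   // misaligned: smaller, but dearer to access
    };
    #pragma pack(pop)

    static_assert(sizeof(Packed) < sizeof(Natural), "packing removed the padding");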

- Offset of frequently used fields in the structure

Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.

Some architectures have a limited number of bits available in an instruction to encode an offset for a pointer access, so if you access a field whose offset exceeds that number of bits, the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large struct, an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a struct's layout.
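
A sketch of what that looks like in source (the struct and field names are invented; the 124-byte figure is the Thumb word-load limit mentioned above):

    #include <cstdint>

    // Large structure for a target like ARM Thumb, where a single load
    // instruction can only encode a small immediate offset from the base
    // pointer. The frequently accessed fields are kept near offset 0 so each
    // access is one load/store; the bulky, rarely used data follows.
    struct Connection {
        // Hot: touched on every packet, all well within the offset limit.
        std::uint32_t state;
        std::uint32_t bytes_pending;
        std::uint32_t last_seen_tick;

        // Cold: reaching these may cost an extra address computation,
        // but they are rarely accessed.
        char          peer_name[128];
        std::uint32_t lifetime_stats[64];
    };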

+3
May 23 '09 at 3:26

Well, the first member does not need an offset added to the pointer to access it.

+2
May 21 '09 at 12:52

In C#, the order of the members is determined by the compiler, unless you put the [LayoutKind.Sequential/Explicit] attribute, which forces the compiler to lay out the struct/class the way you tell it to.

As far as I can tell, the compiler seems to minimize packing while aligning the data types on their natural boundaries (i.e. a 4-byte int starts on a 4-byte address).

+1
May 21 '09 at 14:28

In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together, so you have tighter memory packing.

0
May 21 '09 at 12:52

hmmm, that sounds like a very dubious practice, why doesn't the compiler take care of this?

0
May 21 '09 at 12:54

I highly doubt that would have any bearing on CPU improvements - maybe readability. You can optimize the executable code if the commonly executed basic blocks that are executed within a given frame are in the same set of pages. This is the same idea, but I don't know how to create basic blocks within the code. My guess is that the compiler places the functions in the order it sees them, with no optimization here, so you could try to place common functionality together.

Try running a profiler/optimizer. First you compile with some profiling option, then run your program. Once the profiled exe is complete, it will dump some profiling information. Take this dump and run it through the optimizer as input.

I have been away from this line of work for years, but not much has changed in how they work.

0
May 21 '09 at 13:08

I'm focusing on performance, execution speed, not memory usage. The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in the code. Imagine

    unsigned char a;
    unsigned char b;
    long c;

Big mess-up? Without align switches, low-memory ops, etc., we are going to have an unsigned char using a whole 64-bit word on your DDR3 DIMM, another 64-bit word for the other char, and yet the unavoidable one for the long.

So, that's one fetch per variable.

However, packing it, or re-ordering it, will give you one fetch and one AND masking to be able to use the unsigned chars.

So, speed-wise, on a current 64-bit word-memory machine, alignments, re-orderings, etc., are no-nos. I do microcontroller stuff, and there the differences between packed and non-packed are really noticeable (talking about <10 MIPS processors with 8-bit word memories).

On the side, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm tells you to do and what the compiler is able to optimize, often results in burning rubber with no real effect. That, and a write-only piece of syntactically dubious code.

The last step forward in optimization I saw (in uPs, I don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (a much more general view of speed, pointer resolution, memory packing, etc.), and have the linker trash uncalled library functions, methods, etc.

0
May 21 '09 at 1:41