There are two questions here:
- Whether keeping certain fields together is an optimization at all.
- How to actually do it.
The reason this may help is that memory is loaded into the processor cache in chunks called "cache lines". This takes time, and broadly speaking, the more cache lines that have to be loaded for your object, the longer it takes. Also, the more other data gets evicted from the cache to make room, which slows down other code in unpredictable ways.
The size of a cache line is processor-dependent. If it is large compared to the size of your objects, then very few objects will span a cache-line boundary, and the whole optimization is fairly irrelevant. Otherwise, you might sometimes get away with having only part of your object in cache, and the rest in main memory (or in the L2 cache, perhaps). It's good if your most common operations (those that touch the commonly used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of that.
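As a sketch of what "grouping the hot fields" might look like (the record type, its field names, and the 64-byte line size are all illustrative assumptions, not something from the original):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record: the fields touched on every update are grouped at
// the front, so the common path stays within the first cache line
// (assumed to be 64 bytes here; real sizes vary by processor).
struct Particle {
    // --- frequently used together, every frame ---
    float x, y, z;      // position
    float vx, vy, vz;   // velocity
    // --- rarely used ---
    char          name[40];  // debug label, only read when inspecting
    std::uint32_t id;
};

// Six contiguous 4-byte floats put the cold data at offset 24,
// comfortably inside the first (assumed) 64-byte line.
static_assert(offsetof(Particle, name) == 24, "hot fields are packed first");
```

Whether this actually helps depends on access patterns, which is why measuring (as discussed below) matters.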
The general principle is called locality of reference. The closer together the different memory addresses your program accesses are, the better your chances of good cache behaviour. It is often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multithreading means you often don't know what will be in the cache, and so on. But you can talk about what is likely to happen, most of the time. If you want to know anything for sure, you generally have to measure it.
Note that there are some gotchas here. If you use CPU-based atomic operations (which the atomic types in C++0x generally will), you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores operating on different fields at the same time, you will find that all those atomic operations are serialised because they lock the same memory location even though they operate on different fields. Had they been on different cache lines, they would have worked in parallel and run faster. In fact, as Glen points out (via Herb Sutter), on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share a cache. You can expect it to be, on the grounds that cache misses are usually a source of lost speed, but you would be horribly wrong in your particular case.
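A minimal sketch of the fix for this "false sharing" problem (the 64-byte line size and the use of `alignas` here are my assumptions; since C++17, `std::hardware_destructive_interference_size` can supply the value where available):

```cpp
#include <atomic>

// Two counters incremented by different threads. Packed together they
// likely share one cache line, so the cores contend for it even though
// they touch different fields (false sharing).
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};   // probably on the same line as 'a'
};

// Forcing each counter onto its own (assumed 64-byte) line deliberately
// sacrifices locality so the two cores can proceed in parallel.
struct PaddedCounters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

static_assert(alignof(PaddedCounters) == 64,
              "struct inherits the member over-alignment");
static_assert(sizeof(PaddedCounters) == 128,
              "each counter occupies its own 64-byte region");
```

This is the rare case where making the object *bigger* is the optimization.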
Now, quite apart from the distinction between commonly used and less used fields, the smaller the object, the less memory (and hence less cache) it occupies. That is pretty much good news all round, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding that has to be inserted between the fields to ensure they are correctly aligned for the architecture. C++ (sometimes) places constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:
- int (4 bytes, 4-aligned)
- and then char (1 byte, any alignment)
- followed by int (4 bytes, 4-aligned)
- and then char (1 byte, any alignment)
then chances are it will occupy 16 bytes in memory. The size and alignment of int aren't the same on every platform, by the way, but 4 is very common, and this is only an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to align it correctly, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory; that is exactly what an array is in C/C++: adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
Whether int is 4-aligned depends on the platform, as I said: on ARM it absolutely has to be, since unaligned access raises a hardware exception. On x86 you can access ints unaligned, but it's generally slower and, IIRC, non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next smallest, and so on, down to members with no alignment requirement. For example, if I'm trying to write portable code, I might come up with this:
```cpp
struct some_stuff {
    double   d;    // I expect double is 64bit IEEE, it might not be
    uint64_t l;    // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i;    // 4 bytes, usually 4-aligned
    int32_t  j;    // same
    short    s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char     c[4]; // array 4 chars, 4 bytes big but "never" needs 4-alignment
    char     t;    // 1 byte, any alignment
};
```
If you don't know the alignment of a field, or you're writing portable code but want to do as well as you can without major trickery, then you assume that the alignment requirement of the structure is the largest requirement of any fundamental type in it, and that the alignment requirement of a fundamental type is its size. So, if your struct contains a uint64_t, or a long long, then the best guess is that it's 8-aligned. Sometimes you'll be wrong, but a lot of the time you'll be right.
Note that game programmers, such as your blogger, often know everything about their processor and hardware, and so they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support several platforms, they can special-case each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and use profilers to find where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply unless you know what a particular object needs. Provided they don't make the code unclear, "put commonly used fields at the start of the object" and "sort by alignment requirement" are two good rules.