Will the compiler unroll this loop?

I am creating a multidimensional vector (in the mathematical sense) that supports the basic operations +, -, /, *, =. The template takes two parameters: the element type (int, float, etc.) and the size of the vector. I currently implement the operations with a for loop. Given that the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?

 template <typename T, u32 size>
 class Vector {
 public:
     // Various functions for mathematical operations.
     // The functions take in a Vector<T, size>.
     // Example:
     void add(const Vector<T, size>& vec) {
         for (u32 i = 0; i < size; ++i) {
             values[i] += vec[i];
         }
     }

     // (operator[] added here so that vec[i] above compiles)
     const T& operator[](u32 i) const { return values[i]; }

 private:
     T values[size];
 };

Before anyone answers with "profile, then optimize": please note that this is the basis of my 3D graphics engine and it needs to be fast. Secondly, I want to know for the sake of education.

+6
6 answers

You can use the following disassembly trick to find out how a particular piece of code is compiled.

  Vector<int, 16> a, b;
  Vector<int, 65536> c, d;
  asm("xxx"); // marker
  a.add(b);
  asm("yyy"); // marker
  c.add(d);
  asm("zzz"); // marker

Now compile

 gcc -O3 1.cc -S -o 1.s 

And look at the resulting 1.s:

         xxx
 # 0 "" 2
 #NO_APP
         movdqa  524248(%rsp), %xmm0
         leaq    524248(%rsp), %rsi
         paddd   524184(%rsp), %xmm0
         movdqa  %xmm0, 524248(%rsp)
         movdqa  524264(%rsp), %xmm0
         paddd   524200(%rsp), %xmm0
         movdqa  %xmm0, 524264(%rsp)
         movdqa  524280(%rsp), %xmm0
         paddd   524216(%rsp), %xmm0
         movdqa  %xmm0, 524280(%rsp)
         movdqa  524296(%rsp), %xmm0
         paddd   524232(%rsp), %xmm0
         movdqa  %xmm0, 524296(%rsp)
 #APP
 # 36 "1.cc" 1
         yyy
 # 0 "" 2
 #NO_APP
         leaq    262040(%rsp), %rdx
         leaq    -104(%rsp), %rcx
         xorl    %eax, %eax
         .p2align 4,,10
         .p2align 3
 .L2:
         movdqa  (%rcx,%rax), %xmm0
         paddd   (%rdx,%rax), %xmm0
         movdqa  %xmm0, (%rdx,%rax)
         addq    $16, %rax
         cmpq    $262144, %rax
         jne     .L2
 #APP
 # 38 "1.cc" 1
         zzz

As you can see, the first loop was small enough to be unrolled (and vectorized with the SSE `paddd` instruction). The second remains a loop.

+9

First: modern processors are pretty good at branch prediction, so loop unrolling may not help (and may even hurt).

Second: yes, modern compilers know how to unroll loops like this when it is a good idea for your target CPU.

Third: modern compilers can even auto-vectorize the loop, which is better still than unrolling.
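As an illustration of this point (the file name here is hypothetical), you can ask the compiler directly which loops it vectorized:

```shell
# -O3 enables GCC's auto-vectorizer; -fopt-info-vec-optimized makes it
# print a note for each loop it managed to vectorize (GCC 4.9 and later).
g++ -O3 -fopt-info-vec-optimized -c vector.cc
# Clang offers a similar report via -Rpass=loop-vectorize
```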

Bottom line: don't assume you are smarter than your compiler unless you know a lot about the target processor architecture. Write your code in a simple, straightforward way, and don't worry about micro-optimization until your profiler tells you to.

+4

First of all, it is not at all certain that unrolling the loop would be useful.

The only possible answer to your question is "it depends" (on the compiler, its flags, the value of size, etc.).

If you really want to know, ask your compiler: compile to assembly with typical values of size and with the optimization flags you will actually use, and analyze the result.

+1

The only way to find out is to try it with your compiler and your optimization options. Make one test file with the "let the compiler unroll it" code, test.cpp:

 #include "myclass.hpp"

 void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
     a.add( b );
 }

then a hand-unrolled reference version, link.cpp:

 #include "myclass.hpp"

 void doSomething(Vector<double, 3>& a, Vector<double, 3>& b) {
     a[0] += b[0];
     a[1] += b[1];
     a[2] += b[2];
 }

and now use GCC to compile them and spit out only the assembly:

 for x in *.cpp; do g++ -c "$x" -Wall -Wextra -O2 -S -o "out/${x%.cpp}.s"; done

In my experience, GCC will unroll loops of 3 iterations or fewer by default when the trip count is known at compile time; using -funroll-loops will make it unroll longer loops as well.

+1

A loop can be unrolled using recursive template instantiation. It may or may not be faster in your C++ implementation.

I slightly adjusted your example so that it compiles.

 typedef unsigned u32; // or something similar

 template <typename T, u32 size>
 class Vector {
     // Need to use an inner class, because member templates of an
     // unspecialized template cannot be explicitly specialized.
     template<typename Vec, u32 index>
     struct Inner {
         static void add(Vec& a, const Vec& b) {
             a.values[index] += b.values[index];
             // triggers recursive instantiation of Inner
             Inner<Vec, index-1>::add(a, b);
         }
     };

     // This partial specialization terminates the recursion.
     template<typename Vec>
     struct Inner<Vec, 0> {
         static void add(Vec& a, const Vec& b) {
             a.values[0] += b.values[0];
         }
     };

 public:
     // PS! this function should probably take a
     // _const_ Vector, because the argument is not modified.
     // Various functions for mathematical operations.
     // The functions take in a Vector<T, size>.
     // Example:
     void add(Vector<T, size>& vec) {
         Inner<Vector, size-1>::add(*this, vec);
     }

     T values[size];
 };
+1

Many compilers will unroll this loop; whether the particular compiler you are talking about will, I don't know. There is more than one compiler in the world.

If you want to guarantee that it is unrolled, then template metaprogramming (with inlining) can do this. (It is actually one of the more trivial applications of TMP, often used as an introductory example of metaprogramming.)

0

Source: https://habr.com/ru/post/889133/
