For implementations of a C++ Vector3 utility, is an array faster than a struct or a class?

Just out of curiosity, I implemented Vector3 utilities in three ways: as an array (with a typedef), as a class, and as a struct.

This is an array implementation:

    typedef float newVector3[3];

    namespace vec3 {
        void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
        void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
        void dot(const newVector3& first, const newVector3& second, float& out_result);
        void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);

        // implementations, nothing fancy... really
        void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
        {
            out_newVector3[0] = first[0] + second[0];
            out_newVector3[1] = first[1] + second[1];
            out_newVector3[2] = first[2] + second[2];
        }

        void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
        {
            out_newVector3[0] = first[0] - second[0];
            out_newVector3[1] = first[1] - second[1];
            out_newVector3[2] = first[2] - second[2];
        }

        void dot(const newVector3& first, const newVector3& second, float& out_result)
        {
            out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
        }

        // NB: as written this is an element-wise product, not a true cross
        // product; all three implementations compute the same thing, though,
        // so the timing comparison is still apples-to-apples.
        void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
        {
            out_newVector3[0] = first[0] * second[0];
            out_newVector3[1] = first[1] * second[1];
            out_newVector3[2] = first[2] * second[2];
        }
    }

And the implementation of the class:

    class Vector3 {
    private:
        float x;
        float y;
        float z;

    public:
        // constructors
        Vector3(float new_x, float new_y, float new_z) {
            x = new_x;
            y = new_y;
            z = new_z;
        }

        Vector3(const Vector3& other) {
            // (the self-comparison check from the original is unnecessary:
            // a newly constructed object can never be its own source)
            x = other.x;
            y = other.y;
            z = other.z;
        }
    };

Of course, it also contains the other functions that usually appear in a Vector3 class.

And finally, the implementation of the structure:

    struct s_vector3 {
        float x;
        float y;
        float z;

        // constructors
        s_vector3(float new_x, float new_y, float new_z) {
            x = new_x;
            y = new_y;
            z = new_z;
        }

        s_vector3(const s_vector3& other) {
            x = other.x;
            y = other.y;
            z = other.z;
        }
    };

Again, I omitted some of the other common Vector3 functionality. I then had each of the three implementations create 9,000,000 new objects and perform 9,000,000 cross products (after each run I write a huge chunk of data to flush the cache, so that no implementation benefits from a warm cache).

Here is the test code:

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    const int K_OPERATION_TIME = 9000000;
    const size_t bigger_than_cachesize = 20 * 1024 * 1024;

    void cleanCache()
    {
        // flush the cache by writing a buffer much larger than it
        long* p = new long[bigger_than_cachesize]; // 20M longs
        for (size_t i = 0; i < bigger_than_cachesize; i++) {
            p[i] = rand();
        }
        delete[] p; // missing in the original; avoids leaking the buffer on every call
    }

    int main()
    {
        cleanCache();

        // first, the Vector3 struct
        std::clock_t start;
        double duration;

        start = std::clock();
        for (int i = 0; i < K_OPERATION_TIME; ++i) {
            s_vector3 newVector3Struct = s_vector3(i, i, i);
            newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        }
        duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

        cleanCache();

        // second, the Vector3 array implementation
        start = std::clock();
        for (int i = 0; i < K_OPERATION_TIME; ++i) {
            newVector3 newVector3Array = {(float)i, (float)i, (float)i};
            newVector3 opResult;
            vec3::cross(newVector3Array, newVector3Array, opResult);
        }
        duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        printf("The array implementation of Vector3 takes %f seconds.\n", duration);

        cleanCache();

        // third, the Vector3 class implementation
        start = std::clock();
        for (int i = 0; i < K_OPERATION_TIME; ++i) {
            Vector3 newVector3Class = Vector3(i, i, i);
            newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
        }
        duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        printf("The class implementation of Vector3 takes %f seconds.\n", duration);

        return 0;
    }

The results surprised me.

The struct and class implementations complete the task in about 0.23 seconds, while the array implementation takes only 0.08 seconds!

If arrays really have a significant performance advantage like this, then despite the uglier syntax they ought to be used in many cases.

So I really want to make sure: is this supposed to happen? Thanks!

+5
2 answers

Short answer: it depends. As you saw, there is a difference when the code is compiled without optimization.

When I compile (with all functions inlined) with optimization enabled (-O2 or -O3), there is no difference (read on; it is not quite that simple).

    Optimization   Time (struct vs. array)
    -O0            0.27 vs. 0.12
    -O1            0.14 vs. 0.04
    -O2            0.00 vs. 0.00
    -O3            0.00 vs. 0.00

There is no guarantee about what your particular compiler will optimize, so the full answer is "it depends on your compiler". By default I would trust my compiler to do the Right Thing; otherwise I would have to start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.

If compiled with -O2, your code takes exactly 0.0 seconds in both versions, but that is because the optimizer sees that the computed values are never used, so it simply discards the whole loop!
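If you want to keep the original loop shape, one lightweight way to defeat that dead-code elimination (a sketch with my own names, not code from either post) is to accumulate a checksum and store it to a volatile sink, which the compiler is not allowed to discard:

```cpp
struct s_vec3 { float x, y, z; };   // stand-in for the question's s_vector3

// Element-wise product, matching the question's "cross".
static s_vec3 mul3(s_vec3 a, s_vec3 b) {
    return { a.x * b.x, a.y * b.y, a.z * b.z };
}

volatile float sink;   // writes to a volatile cannot be optimized away

float run(int n) {
    float checksum = 0.0f;
    for (int i = 0; i < n; ++i) {
        s_vec3 v{ (float)i, (float)i, (float)i };
        s_vec3 r = mul3(v, v);
        checksum += r.x + r.y + r.z;   // every iteration feeds the checksum
    }
    sink = checksum;   // keeps the loop's result observable
    return checksum;
}
```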

Let's make sure that does not happen:

    #include <ctime>
    #include <cstdio>

    const int K_OPERATION_TIME = 1000000000;

    int main()
    {
        std::clock_t start;
        double duration;
        double checksum = 0.0;

        start = std::clock();
        for (int i = 0; i < K_OPERATION_TIME; ++i) {
            s_vector3 newVector3Struct = s_vector3(i, i, i);
            newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
            // actually using the result of the cross product!
            checksum += newVector3Struct.x + newVector3Struct.y + newVector3Struct.z;
        }
        duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

        // second, the Vector3 array implementation
        start = std::clock();
        for (int i = 0; i < K_OPERATION_TIME; ++i) {
            newVector3 newVector3Array = {(float)i, (float)i, (float)i};
            newVector3 opResult;
            vec3::cross(newVector3Array, newVector3Array, opResult);
            // actually using the result of the cross product!
            checksum += opResult[0] + opResult[1] + opResult[2];
        }
        duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
        printf("The array implementation of Vector3 takes %f seconds.\n", duration);

        printf("Checksum: %f\n", checksum);
    }

You will see the following changes:

  • The cache is no longer involved (there are no cache misses), so I simply removed the code responsible for flushing it.
  • There is no performance difference between class and struct (after compilation there really is no difference; the whole distinction is syntactic sugar around public vs. private), so I look only at the struct.
  • The result of the cross product is actually used and cannot be optimized away.
  • There are now 1e9 iterations, to get meaningful times.

With this change, we see the following timings (Intel compiler):

    Optimization   Time (struct vs. array)
    -O0            33.2 vs. 17.1
    -O1            19.1 vs. 7.8
    -Os            19.2 vs. 7.9
    -O2            0.7 vs. 0.7
    -O3            0.7 vs. 0.7

I'm a little disappointed that -Os performs so poorly, but otherwise you can see that, with optimization enabled, there is no difference between structs and arrays!


Personally, I like -Os a lot because it produces assembly I can actually read, so let's look at why it is so slow.

The most obvious issue, even without looking at the generated assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already-existing object, so unless the optimizer can see that the old object is no longer used, it cannot apply RVO. So replace

    newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    checksum += newVector3Struct.x + newVector3Struct.y + newVector3Struct.z;

with:

    s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
    checksum += r.x + r.y + r.z;

Now the results are 2.14 (struct) vs. 7.9 (array): that's quite an improvement!

My takeaway from this: the optimizer does a very good job, but we can still help it a little when necessary.
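For completeness, here is the rewritten benchmark kernel as a self-contained sketch. The s_vector3 definition and the static cross member are my reconstruction of the code the question omitted (kept element-wise, as in the question), so treat this as illustrative rather than the exact benchmarked code:

```cpp
struct s_vector3 {
    float x, y, z;
    s_vector3(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}

    // Reconstructed member; element-wise, matching the question's version.
    static s_vector3 cross(const s_vector3& a, const s_vector3& b) {
        return s_vector3(a.x * b.x, a.y * b.y, a.z * b.z);
    }
};

// NRVO-friendly kernel: the result lands in a fresh local r, so the
// compiler can construct the return value directly in place.
double run(int n) {
    double checksum = 0.0;
    for (int i = 0; i < n; ++i) {
        s_vector3 v((float)i, (float)i, (float)i);
        s_vector3 r = s_vector3::cross(v, v);
        checksum += r.x + r.y + r.z;
    }
    return checksum;
}
```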

+7

In this case, no. As far as the CPU is concerned, classes, structs and arrays are just memory layouts, and the layouts here are identical. In non-release builds, if the methods are inline, they may still be compiled as actual function calls (mainly to let the debugger step into them), so that can have a slight effect.
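The "identical memory layout" claim is easy to verify at compile time; a quick sketch (the type names are mine):

```cpp
#include <type_traits>

struct Vec3 { float x, y, z; };
typedef float Arr3[3];

// Three contiguous floats either way; the CPU sees identical bytes.
static_assert(sizeof(Vec3) == sizeof(Arr3), "same size, no padding");
static_assert(alignof(Vec3) == alignof(float), "same alignment");
static_assert(std::is_standard_layout<Vec3>::value, "plain, C-compatible layout");
```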

Addition is not a good way to benchmark something like Vec3. A dot and/or cross product is usually a better test.
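For reference, here is what a dot product and a true cross product look like (my sketch; note that the "cross" in the question's code actually multiplies element-wise, which is a different operation):

```cpp
struct Vec3 { float x, y, z; };

float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A genuine cross product, unlike the element-wise version in the question.
Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}
```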

If you really care about performance, you basically want a structure-of-arrays approach (instead of the array-of-structures you have above). This typically lets the compiler apply auto-vectorization.

i.e. instead of this:

    constexpr int N = 100000;

    struct Vec3 {
        float x, y, z;
    };

    inline float dot(Vec3 a, Vec3 b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    void dotLots(float* dps, const Vec3 a[N], const Vec3 b[N]) {
        for (int i = 0; i < N; ++i)
            dps[i] = dot(a[i], b[i]);
    }

You would do this:

    constexpr int N = 100000;

    struct Vec3SOA {
        float x[N], y[N], z[N];
    };

    void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b) {
        for (int i = 0; i < N; ++i) {
            dps[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
        }
    }

If you compile with -mavx2 and -mfma, the latter version optimizes very nicely.
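To experiment with that yourself, here is a scaled-down, self-contained version of the SoA kernel (N shrunk to 8 purely for illustration); compiling it with e.g. -O3 -mavx2 -mfma and inspecting the assembly shows the vectorized loop:

```cpp
constexpr int N = 8; // scaled down from 100000 just for illustration

struct Vec3SOA { float x[N], y[N], z[N]; };

// Each component lives in its own contiguous array, so the compiler can
// load several lanes at once and auto-vectorize the loop.
void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b) {
    for (int i = 0; i < N; ++i) {
        dps[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
}
```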

+3

Source: https://habr.com/ru/post/1272113/

