Cache many callbacks and call them all without paying the v-table cost

C1, C2, ... are callback classes.
They derive from the common interface CBase, which declares CBase::f().
They all override CBase::f() and mark the override final.

I need to register ~50 instances of any class derived from C1, and ~50 instances of any class derived from C2 (see @@ in the code below).

The main goal: calling allF() must call C1::f() / C2::f() on every registered instance.

Here is a simplified version; it works (full demo):

    #include <iostream>
    #include <vector>

    class CBase{
    public:
        virtual void f(){std::cout<<"CBase"<<std::endl;}
    };
    class C1 : public CBase{
    public:
        virtual void f() final{std::cout<<"C1"<<std::endl;}
    };
    class C2 : public CBase{
    public:
        virtual void f() final{std::cout<<"C2"<<std::endl;}
    };

This is the callback registration:

    //-------- begin registering -----
    std::vector<CBase*> cBase;
    void regis(CBase* c){
        cBase.push_back(c);
    }
    void allF(){ //must be super fast
        for(auto ele:cBase){
            ele->f(); //#
        }
    }
    int main(){
        C1 a;
        C1 b;
        C2 c; //@@
        //or ... class C2Extend : public C2{};  C2Extend c;
        regis(&a);
        regis(&b);
        regis(&c);
        allF(); //print C1 C1 C2
    }

Problem

According to the profiling results, if I can avoid the v-table cost at #, I get a significant performance increase.

How can I do this elegantly?

My bad solution

Possible workaround: create a separate array for each CX (full demo):

    //-------- begin registering -----
    std::vector<C1*> c1s;
    std::vector<C2*> c2s;
    void regis(C1* c){
        c1s.push_back(c);
    }
    void regis(C2* c){
        c2s.push_back(c);
    }
    void allF(){ //must be super fast
        for(auto ele:c1s){
            ele->f(); //#
        }
        for(auto ele:c2s){
            ele->f(); //#
        }
    }
    int main(){
        C1 a;
        C1 b;
        C2 c;
        regis(&a);
        regis(&b);
        regis(&c);
        allF(); //print C1 C1 C2
    }

It is very fast, but it does not scale well. After a few development cycles, C3, C4, etc. were born.
I have to create std::vector<C3*>, std::vector<C4*>, ... manually.
My approach becomes a maintenance burden.

Additional Information (Edited)

In the worst case there will be no more than 20 classes (C1 to C20).

In the real case, C1, C2, ... are special kinds of data structures.
All of them require special initialization (f()) at exactly the right time.

Their instances are constructed in different .cpp files.
Thus, caching all of them in a single array, std::vector<CBase*> cBase;, is helpful.
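For illustration (the file and function names here are hypothetical), this is roughly how the single cBase array ends up being filled from separate translation units:

    // module_a.cpp (hypothetical)
    static C1 a1, a2;
    void initModuleA() { regis(&a1); regis(&a2); }

    // module_b.cpp (hypothetical)
    static C2 b1;
    void initModuleB() { regis(&b1); }

    // main.cpp
    int main() {
        initModuleA();
        initModuleB();
        allF(); // calls f() on a1, a2 and b1, no matter where they were defined
    }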

For example, C1 is a 1:1 map, C2 is a 1:N map, C3 is an N:N map.
Together with a custom allocator, I can achieve excellent data locality.

Note: I do not need the callbacks to run in any particular order. (Thanks, Fire Lancer.)

+5
3 answers

Your "bad decision" begins to look much better when you automate it with templates. Our goal: save c1s , c2s , etc. In one vector.

To do this, we need to map the derived types to integers. An easy way is a global counter and a function template that increments and stores it once per instantiation.

    static std::size_t typeIndexCounter = 0;

    template <class>
    std::size_t indexForType() {
        static std::size_t const index = typeIndexCounter++;
        return index;
    }

The first call to indexForType<T>() reserves a new index for T, and subsequent calls return the same number.
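A small usage sketch (the exact indices depend on which type is queried first):

    std::size_t i1 = indexForType<C1>(); // first call for C1: takes a fresh index, e.g. 0
    std::size_t i2 = indexForType<C2>(); // first call for C2: e.g. 1
    std::size_t i3 = indexForType<C1>(); // the function-local static is already set: still 0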

Then we need a way to erase just enough type information about our callback vectors so that we can store them together and still call the correct f.

    struct Group {
        using CbVec = std::vector<void *>;

        void (*call)(CbVec &);
        CbVec callbacks;
    };

    static std::vector<Group> groups;

call will hold a function that iterates over the pointers, downcasts them and calls f. As in your solution, this reduces all calls for a given type to a single indirect call.

CbVec could hold CBase * instead of void *, but I will explain that choice later.

Now we need a function that populates groups when the Group for some type is requested:

    template <class T>
    Group &groupFor() {
        std::size_t const index = indexForType<T>();

        if(index < groups.size())
            // Group already exists, return it
            return groups[index];

        assert(
            index == groups.size() &&
            "Something went wrong... Did someone call detail_callbacks::indexForType?"
        );

        // Register the new group, with its downcasting function
        groups.push_back({
            [](Group::CbVec &callbacks) {
                for(void *p : callbacks)
                    static_cast<T*>(p)->f();
            },
            {}
        });

        // Return the new group
        return groups.back();
    }

Here you can see that we use a lambda expression to generate the downcasting functions. The reason I chose to store void * rather than CBase * is that the downcast from void * is a no-op, whereas a base-to-derived cast may require a pointer adjustment (with further complications in the case of virtual inheritance).
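As a minimal sketch of that pointer-adjustment point (CMulti and Pad are hypothetical names, assuming multiple inheritance):

    struct Pad { int x = 0; virtual ~Pad() = default; };
    class CMulti : public Pad, public CBase {
    public:
        void f() final { std::cout << "CMulti" << std::endl; }
    };

    void pointerAdjustDemo() {
        CMulti obj;
        CBase *asBase = &obj;                     // points at the CBase subobject inside obj
        void  *raw    = static_cast<void*>(&obj); // the address that regis() would have stored

        CMulti *fromBase = static_cast<CMulti*>(asBase); // may need an offset adjustment
        CMulti *fromVoid = static_cast<CMulti*>(raw);    // no adjustment, just a reinterpretation
        (void)fromBase; (void)fromVoid;
    }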

Finally, the public API. Everything above is defined inside namespace detail_callbacks, and we just have to assemble the pieces:

    template <
        class T,
        class = std::enable_if_t<std::is_base_of<CBase, T>::value>
    >
    void regis(T *callback) {
        detail_callbacks::groupFor<T>().callbacks.push_back(static_cast<void*>(callback));
    }

    void allF() {
        for(auto &group : detail_callbacks::groups)
            group.call(group.callbacks);
    }

And there you go! Callbacks of newly added types are now registered automatically.

See it live on Coliru.

+4

You can pull the globals into a class template parameterised on the derived type and, when it is instantiated, make sure it becomes part of the global call.

    #include <set>
    #include <vector>

    typedef void(*Action)(); // function pointer type, receives static call
    std::set<Action> allFs;

    template<typename T>
    struct FRegistry {
        static std::vector<T*> ts;

        static void doF() {
            // loop over the derived type, so no need for virtual dispatch
            for (T *t : ts) { t->f(); }
        }

        static void regis(T *item) {
            allFs.insert(&FRegistry::doF); // ensure the global call includes this instantiation
            ts.push_back(item);            // register the instance
        }
    };

    template<typename T>
    std::vector<T*> FRegistry<T>::ts = {}; // static member initialisation

    template <typename T>
    void regis(T *t) {
        FRegistry<T>::regis(t);
    }

    void allF() {
        for (Action a : allFs) { a(); } // call each doF
    }

Usage does not change:

    int main() {
        C1 a;
        C1 b;
        C2 c;
        regis(&a);
        regis(&b);
        regis(&c);
        allF(); //print C1 C1 C2
    }
+3

A virtual call already has a very simple and fast implementation, so if it is a problem, nothing will be fast enough without structural changes. In particular, I would not expect a plain std::function or hand-rolled function pointers to be a big win. Conceptually, a virtual call looks roughly like this:

    class CBase{
        // Compiler generated
        struct Vtable {
            void (CBase::*f)();
        };
    public:
        virtual void f(){std::cout<<"CBase"<<std::endl;}
        // Compiler added instance field
        Vtable *vtable;
    };
    class C1 : public CBase{
    public:
        virtual void f() final{std::cout<<"C1"<<std::endl;}
        // Compiler generated static data to initialise the vtable member,
        // conceptually: Vtable c1_vtable = { &C1::f };
    };

    CBase *ptr = vector.front();
    ptr->f(); // Effectively compiled as (ptr->*(ptr->vtable->f))();

So at the code level it is a couple of extra memory reads followed by a call through a function pointer. However, it prevents many optimisations. At the compiler level, the function can no longer be inlined. At the CPU level, ptr->vtable needs to be in the cache, and the indirect call is at the mercy of the branch predictor; both can cost far more over a direct function call than a few memory reads would suggest. This matters especially when you have many different derived classes and they are ordered fairly randomly in the container (the CPU will keep guessing wrong about what comes next).

The most optimal solution without changing the design is along the lines of what you showed: get rid of the virtual/indirect call entirely and store the objects in separate containers. This allows the compiler to inline the function call if it considers it worthwhile and makes life easier for the CPU. You can probably use overloads or templates so that, at worst, you only need one call site per type (and with templates, depending on your needs, something even smarter, as sketched after the snippet below).

    class Register {
        std::vector<C1*> c1;
        std::vector<C2*> c2;

        void regis(C1 *c1);
        void regis(C2 *c2);
        //etc.
    };
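One way the template variant could look, as a sketch only (RegisterT is a hypothetical name, assuming C++14): one instantiation owns a vector per listed type, and allF() walks each vector with its static type known, so f() can be inlined.

    #include <tuple>
    #include <vector>

    template <class... Ts>
    class RegisterT {
        std::tuple<std::vector<Ts*>...> lists;

        template <class T>
        void callAll(std::vector<T*> &v) {
            for (T *p : v) p->f(); // T is the exact type, no virtual dispatch
        }

    public:
        template <class T>
        void regis(T *p) { std::get<std::vector<T*>>(lists).push_back(p); }

        void allF() {
            // Expand the pack; the array trick keeps evaluation left to right.
            int expand[] = { 0, (callAll(std::get<std::vector<Ts*>>(lists)), 0)... };
            (void)expand;
        }
    };

    // Usage sketch: RegisterT<C1, C2> reg; reg.regis(&a); reg.regis(&c); reg.allF();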

Bear in mind that this changes the order in which the objects are called: they are now grouped by class type, whereas before they were called in the order they were passed to regis.

Just sorting the single container by class type (using typeid or the like) will most likely help the CPU somewhat, but you still lose the inlining.
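A minimal sketch of what that sorting could look like (assuming C++11 and <typeindex>):

    #include <algorithm>
    #include <typeindex>

    void sortByType(std::vector<CBase*> &v) {
        std::sort(v.begin(), v.end(), [](CBase *a, CBase *b) {
            return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
        });
    }
    // Iterating the sorted vector still makes one virtual call per element, but
    // consecutive elements now share a target, which helps the branch predictor.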

Profile-guided optimisation (PGO; check your compiler, MSVC and GCC for example can do it) may also help, at the cost of some extra build time spent collecting profiles. It lets the compiler optimise based on the code that actually runs. I have not looked in detail at the generated code for real projects, but I understand that MSVC can at least "inline" common virtual calls behind a switch on the type, which allows better optimisation and possibly plays better with modern CPUs.

A larger design change is to avoid small virtual functions altogether: make the virtual call at a higher level (for example, pass the entire container to one virtual function rather than making one call per element).
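A sketch of that coarser-grained design (GroupBase and TypedGroup are hypothetical names): the virtual dispatch happens once per registered group rather than once per element.

    class GroupBase {
    public:
        virtual ~GroupBase() = default;
        virtual void fAll() = 0;
    };

    template <class T>
    class TypedGroup : public GroupBase {
    public:
        std::vector<T*> items;
        void fAll() override {
            for (T *p : items) p->f(); // T is statically known here, so f() can be inlined
        }
    };

    // allF() would then loop over a std::vector<GroupBase*>: roughly 20 virtual calls
    // in total (one per class), instead of one per registered instance.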

+1

Source: https://habr.com/ru/post/1271320/

