TensorFlow new Op CUDA kernel memory management

I implemented a rather complicated new Op in TensorFlow with a CUDA GPU kernel. This Op requires a lot of dynamic memory allocation for variables that are not tensors and are freed after the Op's work is done; more specifically, this is due to the use of a hash table.

Right now I use cudaMalloc() and cudaFree(), but I noticed that TensorFlow has its own type, Eigen::GPUDevice, which provides the ability to allocate and free memory on the GPU.
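For concreteness, here is a rough sketch of what my Compute() currently does (the actual hash-table kernels are omitted; BuildHashTable is just a placeholder name):

    // Inside my OpKernel subclass -- simplified sketch.
    void Compute(OpKernelContext* context) override {
      const Tensor& input = context->input(0);
      const int64 num_keys = input.NumElements();

      // Scratch memory for the hash table, allocated outside TensorFlow:
      int* d_table = nullptr;
      cudaMalloc(&d_table, 2 * num_keys * sizeof(int));

      // ... launch CUDA kernels that build and query the hash table, e.g.
      // BuildHashTable<<<blocks, threads>>>(input.flat<int>().data(), d_table, num_keys);

      cudaFree(d_table);  // freed again before Compute() returns
    }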

My questions:

  • Is it best practice to use Eigen::GPUDevice to manage GPU memory?
  • By using Eigen::GPUDevice instead of the CUDA API, do I "automatically" enable multi-GPU support, since different GPUDevices can be passed to the Op?
  • Should I extend this idea to the CPU kernel as well and check whether there is a CPUDevice type that also manages the memory, instead of plain C++ (i.e. auto var = new int[100]; ... delete[] var)?
+5
2 answers

There is no direct public recommendation on this issue. I usually just let TensorFlow allocate this memory via

    template<typename Device, typename Dtype>
    class MyOp : public OpKernel {
     public:
      explicit MyOp(OpKernelConstruction* context) : OpKernel(context) {
        // ...
      }

      void Compute(OpKernelContext* ctx) override {
        Tensor tmp_var;
        Tensor* output = nullptr;
        TensorShape some_shape, some_shape2;

        // temporarily use this space
        OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
        // allocate memory for the output tensor
        OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
        // ...
      }
    };
  • whatever memory is required should be allocated by the TensorFlow context, not by custom calls to cudaMalloc or new type[num]
  • the context provides the information for the Allocator
  • see below

For simplicity, consider just adding two matrices (full example). TensorFlow operations typically contain the following structure:

The Op description via REGISTER_OP, responsible for shape checking and setting up the output shape (example)
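As an illustration, such a registration might look roughly like this (the op name MyMatrixAdd and the shape function are illustrative, not taken from the linked example):

    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/shape_inference.h"

    // Illustrative registration for an element-wise matrix add.
    REGISTER_OP("MyMatrixAdd")
        .Input("a: T")
        .Input("b: T")
        .Output("sum: T")
        .Attr("T: {float, double}")
        .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
          // The output has the same shape as the first input.
          c->set_output(0, c->input(0));
          return ::tensorflow::Status::OK();
        });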

The OpKernel, responsible for allocating memory, getting pointers to the inputs, and general setup (see above or this)
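Registering the kernel once per device type is also what addresses the multi-GPU/CPU question: TensorFlow instantiates the kernel with the matching Device and Allocator. A sketch, assuming the MyOp template from above and the illustrative "MyMatrixAdd" op name (placed inside, or using, the tensorflow namespace):

    typedef Eigen::ThreadPoolDevice CPUDevice;
    typedef Eigen::GpuDevice GPUDevice;

    // One registration per device type; TensorFlow picks the matching kernel
    // (and hands it the matching Allocator) wherever the op is placed.
    REGISTER_KERNEL_BUILDER(
        Name("MyMatrixAdd").Device(DEVICE_CPU).TypeConstraint<float>("T"),
        MyOp<CPUDevice, float>);

    #ifdef GOOGLE_CUDA
    REGISTER_KERNEL_BUILDER(
        Name("MyMatrixAdd").Device(DEVICE_GPU).TypeConstraint<float>("T"),
        MyOp<GPUDevice, float>);
    #endif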

A functor for the implementation itself, for example:

    const Tensor& inputA = ctx->input(0);
    const Tensor& inputB = ctx->input(1);
    Tensor* output = nullptr;
    Tensor tmp_var;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // the functor does not need to care about memory allocation,
    // as everything is already set up at this point
    ::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, &tmp_var, output);

All that remains is to implement

    // gpu version
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...);
    };

    // cpu version
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...);
    };
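For completeness, a sketch of how the two specializations could be filled in for the matrix-add example. The signature, MyAddKernel, and the launch configuration are illustrative; the GPU specialization would live in a file compiled by nvcc:

    #include "tensorflow/core/framework/op_kernel.h"

    typedef Eigen::ThreadPoolDevice CPUDevice;
    typedef Eigen::GpuDevice GPUDevice;

    // Placeholder: your own CUDA kernel.
    template <typename T>
    __global__ void MyAddKernel(const T* a, const T* b, T* out, long long n);

    namespace tensorflow {
    namespace functor {

    template <typename Device, typename Dtype>
    struct MyFunctor;  // primary template, specialized per device below

    // cpu version: a plain loop over the flat data
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator()(OpKernelContext* ctx, const Tensor& a, const Tensor& b,
                      Tensor* tmp_var, Tensor* output) {
        const Dtype* pa = a.flat<Dtype>().data();
        const Dtype* pb = b.flat<Dtype>().data();
        Dtype* po = output->flat<Dtype>().data();
        const int64 n = output->NumElements();
        for (int64 i = 0; i < n; ++i) po[i] = pa[i] + pb[i];
      }
    };

    // gpu version: launch the CUDA kernel on the device's stream
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator()(OpKernelContext* ctx, const Tensor& a, const Tensor& b,
                      Tensor* tmp_var, Tensor* output) {
        const GPUDevice& d = ctx->eigen_device<GPUDevice>();
        const int64 n = output->NumElements();
        MyAddKernel<Dtype><<<(n + 255) / 256, 256, 0, d.stream()>>>(
            a.flat<Dtype>().data(), b.flat<Dtype>().data(),
            output->flat<Dtype>().data(), n);
      }
    };

    }  // namespace functor
    }  // namespace tensorflow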

Edit:

  • allocate_persistent: use this if you need the data between calls to Compute, e.g. one-time index structures. [example]
  • allocate_temp: just temporary memory that is not retained beyond the lifetime of the Compute method. [example]

But I highly recommend reading the comments in the source code here, and then deciding depending on your use case.

+6

The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
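A sketch of that pattern, assuming the older (1.x-era) PersistentTensor API this answer refers to. Here the allocation happens once in the constructor via the equivalent OpKernelConstruction::allocate_persistent(); the OpKernelContext variant can be used the same way from inside Compute(). MyStatefulOp and kTableSize are illustrative names:

    class MyStatefulOp : public OpKernel {
     public:
      explicit MyStatefulOp(OpKernelConstruction* context) : OpKernel(context) {
        // Allocated with the device's Allocator: GPU memory on a GPU kernel,
        // host memory on a CPU kernel. Survives across Compute() calls.
        Tensor* unused = nullptr;
        OP_REQUIRES_OK(context, context->allocate_persistent(
                                    DT_INT32, TensorShape({kTableSize}),
                                    &table_, &unused));
      }

      void Compute(OpKernelContext* context) override {
        Tensor* table = table_.AccessTensor(context);
        // ... use `table` as the backing store for the hash table ...
      }

     private:
      static constexpr int64 kTableSize = 1 << 20;  // illustrative size
      PersistentTensor table_;
    };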

+2

Source: https://habr.com/ru/post/1275124/

