For anyone who understands this: spending half an hour with the CUDA API reference in one hand and the PyCUDA documentation in the other does wonders. It's much simpler than my initial experiments suggested.
Runtime Information
Incoming lazy code
def meminfo(kernel):
    # Per-kernel resource usage, read from the compiled function's attributes
    shared = kernel.shared_size_bytes
    regs = kernel.num_regs
    local = kernel.local_size_bytes
    const = kernel.const_size_bytes
    mbpt = kernel.max_threads_per_block
    print("""=MEM=\nLocal:%d,\nShared:%d,\nRegisters:%d,\nConst:%d,\nMax Threads/B:%d"""
          % (local, shared, regs, const, mbpt))

...
# mod is a pycuda.compiler.SourceModule compiled earlier
kernel = mod.get_function("foo")
meminfo(kernel)
Output example
=MEM=
Local:24,
Shared:64,
Registers:18,
Const:0,
Max Threads/B:512
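The report string is plain printf-style formatting, so it can be sanity-checked without a GPU; a minimal sketch, with the numbers copied from the example output above (a real run reads them from the kernel's function attributes):

```python
# Hypothetical values matching the example output; on real hardware these
# come from kernel.local_size_bytes, kernel.shared_size_bytes, etc.
local, shared, regs, const, mbpt = 24, 64, 18, 0, 512

report = ("=MEM=\nLocal:%d,\nShared:%d,\nRegisters:%d,\nConst:%d,\nMax Threads/B:%d"
          % (local, shared, regs, const, mbpt))
print(report)
```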
Static Device Information
Incoming lazy code
import pycuda.autoinit
import pycuda.driver as cuda

(free, total) = cuda.mem_get_info()
print("Global memory occupancy:%f%% free" % (free * 100 / total))

for devicenum in range(cuda.Device.count()):
    device = cuda.Device(devicenum)
    attrs = device.get_attributes()

    # Beyond this point is just pretty-printing
    print("\n===Attributes for device %d" % devicenum)
    for (key, value) in attrs.items():
        print("%s:%s" % (str(key), str(value)))
Output example
Global memory occupancy:70.000000% free

===Attributes for device 0
MAX_THREADS_PER_BLOCK:512
MAX_BLOCK_DIM_X:512
MAX_BLOCK_DIM_Y:512
MAX_BLOCK_DIM_Z:64
MAX_GRID_DIM_X:65535
MAX_GRID_DIM_Y:65535
MAX_GRID_DIM_Z:1
MAX_SHARED_MEMORY_PER_BLOCK:16384
TOTAL_CONSTANT_MEMORY:65536
WARP_SIZE:32
MAX_PITCH:2147483647
MAX_REGISTERS_PER_BLOCK:8192
CLOCK_RATE:1500000
TEXTURE_ALIGNMENT:256
GPU_OVERLAP:1
MULTIPROCESSOR_COUNT:14
KERNEL_EXEC_TIMEOUT:1
INTEGRATED:0
CAN_MAP_HOST_MEMORY:1
COMPUTE_MODE:DEFAULT
MAXIMUM_TEXTURE1D_WIDTH:8192
MAXIMUM_TEXTURE2D_WIDTH:65536
MAXIMUM_TEXTURE2D_HEIGHT:32768
MAXIMUM_TEXTURE3D_WIDTH:2048
MAXIMUM_TEXTURE3D_HEIGHT:2048
MAXIMUM_TEXTURE3D_DEPTH:2048
MAXIMUM_TEXTURE2D_ARRAY_WIDTH:8192
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT:8192
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES:512
SURFACE_ALIGNMENT:256
CONCURRENT_KERNELS:0
ECC_ENABLED:0
PCI_BUS_ID:1
PCI_DEVICE_ID:0
TCC_DRIVER:0
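Since `device.get_attributes()` returns an ordinary mapping of attribute names to values, the pretty-printing loop can be exercised without a GPU; a minimal sketch using a stub dict (values copied from the example output above, only a few attributes shown):

```python
# Hypothetical stub standing in for device.get_attributes() on device 0
attrs = {
    "MAX_THREADS_PER_BLOCK": 512,
    "WARP_SIZE": 32,
    "MULTIPROCESSOR_COUNT": 14,
}

# Same loop shape as in the snippet above, collected instead of printed
lines = ["===Attributes for device 0"]
for (key, value) in attrs.items():
    lines.append("%s:%s" % (str(key), str(value)))
print("\n".join(lines))
```

On real hardware the keys are `pycuda.driver.device_attribute` enum members rather than strings, which is why the original loop wraps them in `str()`.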