Recently, a colleague needed to use NVML to query device information, so I downloaded the Tesla Deployment Kit (TDK) 3.304.5 and copied the nvml.h file to /usr/include. As a check, I compiled the sample code in tdk_3.304.5/nvml/example, and it worked fine.
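For reference, building the example amounts to something like this (a sketch; the TDK ships its own Makefile, and the linker simply picks up whichever libnvidia-ml.so it finds first on the library path):

$ gcc example.c -o example -lnvidia-ml
$ ./example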
Over the weekend something changed on the system (I can't determine what, and I'm not the only one with access to the machine), and now any code that uses nvml.h, including the example code, fails with the following error:
Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in TDK package is a stub library that is attached only for
build purposes (eg machine that you build your application doesn't have to
have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
However, I can still run nvidia-smi and read the status of my K20m, and as far as I know nvidia-smi is essentially built on the same NVML calls. The error message is somewhat cryptic, but I believe it is telling me that the libnvidia-ml.so being loaded should be the one installed with the Tesla driver, not the stub shipped in the TDK. To make sure everything matched, I re-downloaded CUDA 5.0 and reinstalled the driver, the CUDA runtime, and the samples. I am sure that the libnvidia-ml.so file matches the driver (both 304.54), so I am pretty confused as to what might be going wrong. I can compile and run the sample code with nvcc, and my own CUDA code also runs fine as long as it doesn't use nvml.h.
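Since nvidia-smi works while the example does not, one comparison that might narrow things down is checking which copy of libnvidia-ml.so each binary resolves to (a sketch; nvidia-smi may be linked statically against NVML in some driver releases, in which case the first command prints nothing):

$ ldd $(which nvidia-smi) | grep nvidia-ml
$ ldd ./example | grep nvidia-ml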
Has anyone come across this error, or have any thoughts on how to fix it?
$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54

$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  304.54  Sat Sep 29 00:05:49 PDT 2012
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

$ whereis nvml.h
nvml: /usr/include/nvml.h

$ ldd example
    linux-vdso.so.1 =>  (0x00007fff2da66000)
    libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
    libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
    /lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
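Since ldd shows the expected /usr/lib64 copy, the next thing I'd look at is which file the dynamic loader actually opens when the program runs; something along these lines should show it (LD_DEBUG is a standard glibc facility, and the grep pattern is just illustrative):

$ echo $LD_LIBRARY_PATH
$ LD_DEBUG=libs ./example 2>&1 | grep libnvidia-ml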
EDIT: The solution was to remove all of the extra copies of libnvidia-ml.so scattered around the system, leaving only the driver-installed libraries in /usr/lib and /usr/lib64. For some reason there were many.
$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1
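After deleting everything except the driver's copies in /usr/lib and /usr/lib64, a verification along these lines confirms the linker cache and the example see only the right library (a sketch of the idea, not the exact commands I ran):

$ sudo ldconfig
$ ldconfig -p | grep libnvidia-ml
$ ./example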