Cannot run CUDA code that queries NVML - error regarding libnvidia-ml.so

Recently a colleague needed to use NVML to query device information, so I downloaded the Tesla Deployment Kit 3.304.5 and copied the nvml.h file to /usr/include. As a check, I compiled the sample code in tdk_3.304.5/nvml/example, and it worked fine.
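For context, the code in question is nothing exotic. A minimal sketch of the kind of program involved (my own simplified version, not the exact TDK sample; the file name example.c and a build line like gcc example.c -o example -lnvidia-ml are just placeholders) would be:

/* example.c - minimal sketch (not the exact TDK sample): initialize NVML,
 * read the name of device 0, and shut down. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t device;
    char name[96];            /* plain buffer; the size is arbitrary */
    nvmlReturn_t result;

    result = nvmlInit();
    if (result != NVML_SUCCESS) {
        /* This is the point where the error below appears; with the stub
         * library loaded, the long WARNING banner shows up instead of a
         * normal error string. */
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
        return 1;
    }

    if (nvmlDeviceGetHandleByIndex(0, &device) == NVML_SUCCESS &&
        nvmlDeviceGetName(device, name, sizeof(name)) == NVML_SUCCESS)
        printf("Device 0: %s\n", name);

    nvmlShutdown();
    return 0;
}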

Over the weekend something changed on the system (I can't determine what, and I'm not the only one with access to the machine), and now any code that uses nvml.h, including the example code, fails with the following error:

Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in TDK package is a stub library that is attached only for
build purposes (eg machine that you build your application doesn't have to
have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

However, I can still run nvidia-smi and read status information from my K20m, and as far as I know nvidia-smi is essentially a collection of NVML calls. The error message is somewhat cryptic, but I take it to mean that libnvidia-ml.so needs to match the Tesla driver installed on the system. To make sure everything was consistent, I re-downloaded CUDA 5.0 and installed the driver, the CUDA runtime, and the samples. I am sure that the libnvidia-ml.so file matches the driver (both are 304.54), so I am pretty confused about what could be going wrong. I can compile and run the sample code with nvcc, and I can also run my own CUDA code as long as it does not include nvml.h.
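For what it's worth, on a working setup the driver version can be read back through NVML itself, which is roughly what nvidia-smi reports. A quick sketch of that kind of check (again my own illustration, not code shipped with the TDK) is below; on this machine any program along these lines now fails at nvmlInit() with the same warning:

/* driver_version.c - illustrative sketch: report the installed driver
 * version through NVML, roughly what nvidia-smi prints in its header. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    char driver[80];
    nvmlReturn_t result;

    result = nvmlInit();
    if (result != NVML_SUCCESS) {
        /* On the broken setup this is where the WARNING banner appears. */
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
        return 1;
    }

    if (nvmlSystemGetDriverVersion(driver, sizeof(driver)) == NVML_SUCCESS)
        printf("Driver version: %s\n", driver);   /* should report 304.54 */

    nvmlShutdown();
    return 0;
}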

Has anyone come across this error, or does anyone have thoughts on how to fix it?

$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54

$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  304.54  Sat Sep 29 00:05:49 PDT 2012
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

$ whereis nvml.h
nvml: /usr/include/nvml.h

$ ldd example
        linux-vdso.so.1 =>  (0x00007fff2da66000)
        libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
        libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
        /lib64/ld-linux-x86-64.so.2 (0x000000300e000000)

EDIT: The solution was to remove all the extra instances of libnvidia-ml.so; for some reason there were quite a few of them.

$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1
3 answers

You get this error because the application that is trying to use NVML is loading the stub library, which is located in:

 ...tdk_install_path/lib64/libnvidia-ml.so 

instead of the one in:

 /usr/lib64/libnvidia-ml.so 

I was able to reproduce your error by adding the stub library path to my LD_LIBRARY_PATH environment variable. So one possible source of the error is that someone added the path to the stub library that ships with the TDK distribution to your LD_LIBRARY_PATH, but that is probably not the only way this can happen. If someone copied the stub library onto some system path in an unusual way, that could also cause the problem.

You will need to figure out why your system is loading the stub library instead of the correct one in /usr/lib64. Alternatively, as a troubleshooting step, you can remove every instance of the stub library anywhere on your system (leaving only the correct libraries in /usr/lib and /usr/lib64) and see whether you then get the correct behavior.
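If it helps to confirm which copy of the library a process actually resolves at run time, a small check along these lines can be used (a rough sketch on my part; the file name which_nvml.c and a build line like gcc which_nvml.c -o which_nvml -ldl are just examples):

/* which_nvml.c - diagnostic sketch: ask the dynamic loader which file
 * libnvidia-ml.so.1 is actually resolved from on this system. */
#define _GNU_SOURCE           /* for dladdr() */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle;
    void *sym;
    Dl_info info;

    /* Open the library by its soname, the same way an NVML program would. */
    handle = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Resolve a known NVML symbol and report which file provided it. */
    sym = dlsym(handle, "nvmlInit");
    if (sym && dladdr(sym, &info))
        printf("nvmlInit resolved from: %s\n", info.dli_fname);
    else
        fprintf(stderr, "could not resolve nvmlInit\n");

    dlclose(handle);
    return 0;
}

If the printed path points at the TDK install tree (or anywhere other than /usr/lib or /usr/lib64), the stub is the copy being picked up.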


I solved the problem this way on a GTX 1070 running Windows 10: go to Device Manager, select the GPU that has the problem, disable the device, and then enable it again.


I had the same (or a similar) problem with the EWBF Cuda Miner for Zcash.

Here is a way to automate Pro7ech's answer (which worked for me) on Windows 10:

Install the WDK for Windows 10 if you don't already have it; it gives you devcon.exe, which lets you manipulate devices from batch scripts: https://docs.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk

You may also need the Windows SDK if you do not have Visual Studio with the "Desktop development with C++" workload: https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk

To make this easier, you can add the devcon.exe installation path to your PATH environment variable: https://www.howtogeek.com/118594/how-to-edit-your-system-path-for-easy-command-line-access/

For me, devcon.exe was installed in:

 C:\Program Files (x86)\Windows Kits\10\Tools\x64 

Now run this (or something similar) at a cmd.exe prompt to get the device ID:

 devcon findall * | find /i "nvidia" 

Here's what mine looks like:

C:\Users\Soenhay>devcon findall * | find /i "nvidia"
HDAUDIO\FUNC_01&VEN_10DE&DEV_0083&SUBSYS_38426674&REV_1001\5&1C277AD4&0&0001: NVIDIA High Definition Audio
SWD\MMDEVAPI\{0.0.0.00000000}.{574980C3-9747-42EF-A78C-4C304E070B81}: SAMSUNG (NVIDIA High Definition Audio)
ROOT\UNNAMED_DEVICE\0000 : NVIDIA Virtual Audio Device (Wave Extensible) (WDM)
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000: NVIDIA GeForce GTX 1070

From this I can see that my graphics device ID is:

 PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000 

So I create a batch file with the following contents to disable and then re-enable the device:

devcon disable "PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
devcon enable "PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"

Now, when I get the NVML error while starting the miner, I just run this batch file and it fixes it. You could also add these two lines to the beginning of your start.bat so this happens on every start, but I found that the error does not occur every time I restart the miner.


References:

superuser post

devcon commands

devcon examples

Note: This stopped working for me for an unknown reason, so I went back to using Pro7ech's solution.

