Disabled ECC support for Tesla C2070 and Ubuntu 12.04

I have a headless workstation with Ubuntu 12.04 server and recently installed a new Tesla C2070 board, but when I run the examples from the CUDA SDK, I get the following error:

  NVIDIA_GPU_Computing_SDK/C/bin/linux/release% ./reduction
  [reduction] starting...
  Using Device 0: Tesla C2070
  Reducing array of type int
  16777216 elements
  256 threads (max)
  64 blocks
  reduction.cpp(473) : cudaSafeCallNoSync() Runtime API error 39 : uncorrectable ECC error encountered.

In fact, this error occurs in every example except "deviceQuery".

I am using kernel 3.2.0, the NVIDIA 295.41 driver and CUDA 4.2.9.

After much searching, I found a suggestion to disable ECC support:

  nvidia-smi -g 0 --ecc-config=0 

which worked. But the question is: how reliable will GPU computing be with ECC support disabled?

Any advice, suggestions or solutions would be highly appreciated.

-Konstantin

+4
4 answers

I wonder whether this could be some kind of compatibility issue rather than a bad card. I have the same problem with a Tesla C2075 on the same version of Ubuntu. We contacted NVIDIA and they told us that double-bit ECC errors (as reported by nvidia-smi -q on Linux) mean that the card is probably broken. We got a replacement, but it has exactly the same problem.

It seems unlikely that both boards I have had are broken in the same way, so we will try the card in another machine if we can find a suitable one.

I will post anything interesting that we learn.
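
For anyone wondering what separates a correctable single-bit error from the uncorrectable double-bit errors mentioned above, here is a toy sketch in Python. It uses an extended Hamming(8,4) code, which is an assumption made purely for illustration: the codes protecting real GPU memory cover much wider words, and the names encode and decode are made up for this example. The principle should carry over, though: one flipped bit can be located and repaired, while two flipped bits can only be detected.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

In this toy model the second result is the analogue of the "uncorrectable ECC error" in the question: the hardware knows the data is wrong but cannot tell which bits to repair.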

+3

I will echo what aland said and add my own experience.

I have worked with several computing clusters equipped with Fermi cards and tested them with ECC both on and off. We turned ECC off to increase the amount of available memory and the speed of calculations, and the difference was noticeable. nvidia-smi never reported ECC errors for these cards while ECC was enabled, and we never encountered run-time errors that pointed to ECC problems.

If your card reports uncorrectable ECC errors, that indicates a hardware fault, and disabling ECC only masks the problem. The runtime is correctly warning you that something has gone badly wrong and that you cannot depend on the results.

By all means, try running your calculations and see what happens, but be prepared for completely insane results for no apparent reason. A single bit flipped here or there can have huge consequences for floating-point math, for example, and can crash your kernel if an instruction is corrupted.
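
To put a rough number on the memory gain, here is a small Python sketch. It assumes the commonly cited figure that ECC reserves 12.5% of the GDDR5 memory on Fermi-class Tesla boards, and that the board carries 6 GB as the C2070 does. The function name is made up for illustration, so please check the numbers against what nvidia-smi reports on your own card.

```python
def usable_memory_gb(total_gb, ecc_enabled, ecc_overhead=0.125):
    """Estimate user-visible memory; ecc_overhead=0.125 is an assumed figure."""
    if ecc_enabled:
        return total_gb * (1 - ecc_overhead)
    return total_gb


print(usable_memory_gb(6, ecc_enabled=True))   # 5.25
print(usable_memory_gb(6, ecc_enabled=False))  # 6
```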
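
To make the floating-point point concrete, here is a minimal Python sketch that inverts one bit in the IEEE 754 single-precision encoding of a number. The helper name flip_bit is made up for illustration, and it only imitates on the CPU what a memory error could do to a value stored on the GPU.

```python
import struct


def flip_bit(value, bit):
    """Return `value` as a float32 with bit `bit` (0 = least significant) inverted."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # lowest mantissa bit: 1.0 + 2**-23, barely noticeable
print(flip_bit(1.0, 30))  # highest exponent bit: inf
print(flip_bit(1.0, 31))  # sign bit: -1.0
```

The same single-bit fault therefore ranges from harmless noise to a result that poisons every later calculation, depending only on where it lands.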

If you can, I would try replacing the card instead of masking the symptoms.

+1

It turned out that my case is the same as carthurs's. I also replaced my card, but the error did not go away. It disappeared only after I set the motherboard's on-board VGA as the primary display adapter in the BIOS. The Tesla installation guide should carry a warning about this!

Thank you all for your help.

+1

Once an uncorrectable ECC error occurs on the GPU, the GPU may be left in an unstable state (for example, data corruption may have occurred not only in memory allocated by the user, but also in memory that the GPU itself needs in order to operate). To recover the GPU, you need to either power-cycle or reboot the system, or try the GPU reset option of nvidia-smi:

  nvidia-smi -h
  ...
      -r    --gpu-reset    Trigger secondary bus reset of the GPU.
                           Can be used to reset GPU HW state in situations
                           that would otherwise require a machine reboot.
                           Typically useful if a double bit ECC error has
                           occurred.
                           --id= switch is mandatory for this switch

Type man nvidia-smi for more help on this topic.
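
If you want to script the reset, something like the following Python sketch might do. It only uses the two switches quoted in the help text above (--gpu-reset and --id=). The function names and the dry_run parameter are made up for illustration, and whether the reset succeeds will depend on your driver version, on having sufficient privileges, and on no other process using the GPU.

```python
import subprocess


def gpu_reset_command(gpu_id):
    """Build the nvidia-smi argument list for resetting one GPU."""
    return ["nvidia-smi", "--gpu-reset", f"--id={gpu_id}"]


def reset_gpu(gpu_id, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = gpu_reset_command(gpu_id)
    if not dry_run:
        # Raises CalledProcessError if nvidia-smi exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


print(" ".join(reset_gpu(0)))  # nvidia-smi --gpu-reset --id=0
```

Keeping dry_run=True as the default means the sketch prints the command rather than touching the hardware until you explicitly ask it to.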

0

Source: https://habr.com/ru/post/1432666/

