JVM error due to NFS file lock after network outage

The following code snippet causes the JVM to die if a network outage occurs while the lock is held and the connection is later restored:

    while (true) {
        // file shared over NFS
        String filename = "/home/amit/mount/lock/aLock.txt";
        RandomAccessFile file = new RandomAccessFile(filename, "rws");
        System.out.println("file opened");

        FileLock fileLock = file.getChannel().tryLock();
        if (fileLock != null) {
            System.out.println("lock acquired");
        } else {
            System.out.println("lock not acquired");
        }

        try {
            // wait for 30 sec
            Thread.sleep(30000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        System.out.println("closing filelock");
        fileLock.close();
        System.out.println("closing file");
        file.close();
    }

Observation: The JVM receives a KILL signal (9) and exits with an exit code of 137 (128 + 9).

Perhaps something goes wrong in the file descriptor tables after the network connection is restored. The same behavior can be reproduced with the flock(2) system call and the flock(1) shell utility.

Any suggestions / workarounds?

PS: using Oracle JDK 1.7.0_25 with NFSv4

EDIT: This lock is meant to determine which process is active in a distributed high-availability cluster. The exit code is 137. What I am looking for is a way to detect the problem, close the file, and try to re-acquire the lock.

+6
3 answers

After the NFS server reboots, all clients that held active file locks start a lock-reclaim procedure, which may last no longer than the so-called grace period (just a constant). If the reclaim does not succeed within the grace period, the NFS client (usually implemented in the kernel) sends SIGUSR1 to the process that could not reclaim its locks. This is the root of your problem.

When a lock is taken on the server side, rpc.lockd on the client system asks the rpc.statd daemon to monitor the NFS server holding the lock. If the server crashes and then recovers, rpc.statd is notified and tries to reclaim all active locks. If the NFS server crashes and recovers and rpc.lockd cannot reclaim a lock, it sends a signal (SIGUSR1) to the process that requested the lock.

http://menehune.opt.wfu.edu/Kokua/More_SGI/007-2478-010/sgi_html/ch07.html

You are probably wondering how to avoid this. Well, there are several ways, but none of them is perfect:

  • Increase the grace period. AFAIR, on Linux it can be changed via /proc/fs/nfsd/nfsv4leasetime.
  • Install a SIGUSR1 handler in your code and do something smart there. For example, in the signal handler you can set a flag indicating that lock recovery has failed. When the flag is set, your program can wait until the NFS server is reachable again (as long as necessary) and then try to re-acquire the locks itself (a minimal sketch follows this list). Not very elegant, though ...
  • Stop using NFS locks altogether. If possible, switch to ZooKeeper as suggested earlier.
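For the second option, here is a minimal sketch (my own illustration, not from this answer) of catching SIGUSR1 from Java via the unsupported sun.misc.Signal API. The class and method names (LockRecoveryFlag, installHandler, needsReacquire) are made up for the example, and the VM may refuse the registration if it already uses SIGUSR1 internally:

    import java.util.concurrent.atomic.AtomicBoolean;
    import sun.misc.Signal;
    import sun.misc.SignalHandler;

    public class LockRecoveryFlag {
        // set from the signal handler when the NFS client reports failed lock recovery
        private static final AtomicBoolean lockRecoveryFailed = new AtomicBoolean(false);

        public static void installHandler() {
            // unsupported API; throws IllegalArgumentException if the VM reserves SIGUSR1
            Signal.handle(new Signal("USR1"), new SignalHandler() {
                @Override
                public void handle(Signal sig) {
                    lockRecoveryFailed.set(true);
                }
            });
        }

        // poll this in the locking loop; when true, close the file and re-acquire the lock
        public static boolean needsReacquire() {
            return lockRecoveryFailed.getAndSet(false);
        }
    }

The loop from the question could then call needsReacquire() after each sleep and, if it returns true, close the file, wait for the server to come back, and call tryLock() again.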
+3

Exit code 138 does NOT hint at SIGKILL - it is signal 10, which can be SIGBUS (on Solaris) or SIGUSR1 (on Linux). Unfortunately, you do not tell us which one you are using.

In theory, NFS should handle all of this transparently - the machine crashes, reboots, and the locks are cleaned up. In practice, I have never seen this work with NFSv3, and NFSv4 (which you are using) makes things even more complicated, since there are no longer separate lockd() and statd().

I would recommend you run truss (Solaris) or strace (Linux) on your Java process and then pull the network plug to find out what is really going on. But to be honest, locking files on NFS file systems is something people have recommended against for as long as I have been using Unix (more than 25 years now), and I strongly suggest you write a small server program that handles the "who does what" part. Let your clients connect to that server, have them send "starting X" and "stopping X" messages to it, and have the server gracefully drop a connection if the client does not respond for more than, say, 5 minutes. I am 99% sure this will take you less time than trying to fix NFS locking.
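For what it's worth, a minimal sketch of such a coordinator (my own illustration; the port number, message format, and 5-minute silence limit are arbitrary assumptions, not part of the answer):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Each client keeps one TCP connection open and sends a line ("START x", "STOP x",
    // or a heartbeat) at least every 5 minutes; otherwise the server drops the
    // connection and treats the client's work as abandoned.
    public class CoordinatorServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(4711)) {      // port is arbitrary
                while (true) {
                    final Socket client = server.accept();
                    new Thread(new Runnable() {
                        @Override public void run() {
                            try (Socket s = client;
                                 BufferedReader in = new BufferedReader(
                                         new InputStreamReader(s.getInputStream()))) {
                                s.setSoTimeout(5 * 60 * 1000);        // 5-minute silence limit
                                String line;
                                while ((line = in.readLine()) != null) {
                                    System.out.println(s.getRemoteSocketAddress() + ": " + line);
                                    // track which client owns which task here
                                }
                            } catch (SocketTimeoutException e) {
                                System.out.println("client timed out, releasing its work");
                            } catch (Exception e) {
                                System.out.println("client disconnected: " + e);
                            }
                        }
                    }).start();
                }
            }
        }
    }

Whatever the server last saw "started" with no matching "stop" on a still-live connection is the active process; a dropped or silent client is simply no longer active.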

+5

This behavior is reproduced using the flock(2) system call and the flock(1) shell utility.

Since you can reproduce it outside of Java, this sounds like an infrastructure issue. You did not give much information about your NFS server or client OS, but one thing I have seen cause strange NFS behavior is a wrong DNS configuration.

Verify that the output of "uname -n" and "hostname" on the client matches your DNS records. Verify that the NFS server resolves DNS correctly as well.
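A quick way to run the same sanity check from Java (just an illustration; "nfs-server.example.com" is a placeholder for your real NFS server name):

    import java.net.InetAddress;

    public class NameCheck {
        public static void main(String[] args) throws Exception {
            // roughly what `hostname` / `uname -n` report for this client
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("local hostname : " + local.getHostName());
            System.out.println("canonical name : " + local.getCanonicalHostName());
            System.out.println("local address  : " + local.getHostAddress());

            // forward-resolve the NFS server, then reverse-resolve the address
            String nfsServer = args.length > 0 ? args[0] : "nfs-server.example.com";
            InetAddress server = InetAddress.getByName(nfsServer);
            System.out.println(nfsServer + " -> " + server.getHostAddress()
                    + " -> " + server.getCanonicalHostName());
        }
    }

All three lines for the client should agree with your DNS records, and the forward and reverse lookups for the server should be consistent.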

Like Guntram, I also do not recommend using NFS for this kind of thing. I would use Hazelcast (no separate server; instances cluster dynamically) or ZooKeeper (you need to set up a server).

With Hazelcast, you can do this to get an exclusive cluster-wide lock:

    import com.hazelcast.core.Hazelcast;
    import java.util.concurrent.locks.Lock;

    Lock lock = Hazelcast.getLock(myLockedObject);
    lock.lock();
    try {
        // do something here
    } finally {
        lock.unlock();
    }

It also supports timeouts:

    if (lock.tryLock(5000, TimeUnit.MILLISECONDS)) {
        try {
            // do some stuff here..
        } finally {
            lock.unlock();
        }
    }
+1

Source: https://habr.com/ru/post/954053/

