256
Logo

Java Bad File Descriptor Close Bug

We have been fighting for a number of weeks with a bad file descriptor bug on a search project at work. Although we initially assumed it was our buggy software to blame, a closer look at the problem seems to point elsewhere. Our current research seems to indicate that a bug in Sun's JDK under Linux (maybe others) is at fault here.

Background

We have been seeing a large number of the 'Bad file descriptor' exceptions in various places around our software. We initially thought that we were someone double closing or otherwise improperly using file handles. However, after a more aggressively review of the code in question as well as the JDK, we came to the conclusion that aside from creating input or output streams using the FileDescriptor constructor (which we were not doing) there should be no way to get these problems. In addition, many of the places which were throwing the exceptions (see below) were in 3rd party libraries including Lucene and Berkeley Database JE.

Our application is a search system which is using ActiveMQ JMS to receive, index, and store documents. We use 10s of JMS queues and topics to coordinate a distributed collection of search nodes. This translates into 10s of sockets which are mostly TCP but some UDP and multicast packets. Local disk descriptors to standard SCSI disks or hardware/software raid configurations. No network filesystems are being used. Mostly EXT3, some XFS, and Some temporary filesystems but otherwise nothing exotic.

Possible Solution

In November 2007, a guy sent me mail mentioning that they had experiences the same problem with their software. After much work with their Sun support contract, the bug has been fixed. Here's the Sun bug database entry. It states that the fix will be in Java 1.5u14 and 1.6u4. To quote:

When a close() operation on an OS file descriptor is succesful, the record of the FD is set to -1, and on subsequent close() operations the actual call to the OS' close() is not performed.

If an error occurs in the OS' close(), we do not set our record of the FD to -1, so subsequent calls will re-attempt the close().

We should not re-attempt the close().

As of February 26, 2007, more testing has revealed that our problems seem to be coming from concurrency issues in Sun's 1.5 JDK (through v.11) combined with Lucene v2.0's calling close() in FSIndexInput.finalize() and FSIndexOutput.finalize() methods -- a real no-no. Looks like v2.1 has 2 bug fixes that would resolve this (669 and 750). We think that the problem is a race condition between a user thread which has caused close and the finalizer which is also calling close. The fact that the FileDescriptor has already been closed is not properly synchronized between the 2 threads so a double close is performed, closing a file that was just opened.

This problem also seems to be fixed in Sun's v1.6 JDK. Looking through the change logs, we see that a number of garbage collection bugs have been fixed. Right now our speculation is that the problems stem from a synchronization bug in 1.5 versions between the GC (maybe just the finalizer parts) and user threads. The finalizer thread does not know that a user thread had closed a file descriptor -- or vice versa. Although we are seeing 30%+ worse performance with 1.6, we have not seen any Bad File Descriptor exceptions. If they fixed some problems by adding synchronization or memory barriers to the GC, this also may account for the performance losses.

JDK Analysis

One of the things that we noticed in looking at the JDK is that if the close() system call fails, the FileDescriptor is not reset to -1 even though the system's file descriptor is invalid (see Linux kernel source). This could mean that if the close syscall fails, it might cause a close of either an invalid file descriptor, or a valid descriptor which happened to call open and get the latest file descriptor that was just closed. This may be the ultimate problem although we are unsure of when, aside from NFS which we are not using, close returns failure. This bug was not in Sun's 1.4 JDK, but was introduced in 1.5 when the error code from close was detected. The bug in is all 1.5 versions through 1.5.11 as well as 1.6. Since the Linux version is using the Solaris code for close, we wonder if the Solaris kernel does something different of it the Solaris version has a bug as well.

We decided initially to copy and override the FileDescriptor class and the classes which open them so we can instrument, and hopefully protect against, the problems. We also wanted to put some synchronized blocks around the closes to make sure that we don't have a concurrency problem. Initial takes at this seemed to have reduced but not removed the problems. As of February 2007 we are working on rev 3 of these classes however these were unable to fix the problem although they gave us more information of the call graphs at fault.

Linux Kernel

Below is the source for the close() syscall from the 2.6.18.3 Linux kernel. As you can see, if the file descriptor is valid, there is no way to leave this routine without it being closed and made available for the next call to open(). If filep_close returns an error, and the JDK does not set the file descriptor to -1, a double close will most likely happen in the future.

asmlinkage long sys_close(unsigned int fd)
{
  struct file * filp;
  struct files_struct *files = current->files;
  struct fdtable *fdt;
  
  spin_lock(&files->file_lock);
  fdt = files_fdtable(files);
  if (fd >= fdt->max_fds)
    goto out_unlock;
  filp = fdt->fd[fd];
  if (!filp)
    goto out_unlock;
  rcu_assign_pointer(fdt->fd[fd], NULL);
  FD_CLR(fd, fdt->close_on_exec);
  __put_unused_fd(files, fd);
  spin_unlock(&files->file_lock);
  return filp_close(filp, files);

 out_unlock:
  spin_unlock(&files->file_lock);
  return -EBADF;
}

Exceptions Seen

We have seen a number of exceptions in our code including failing in FileOutStream.close0(), RandomAccessFile.writeBytes(), RandomAccessFile.seek(), and FileInputStream.readBytes(). We have seen exceptions in Lucene (v2.0.0) calls:

2007-01-29 04:22:02,947 ERROR [IndexMergeAgent] (IndexAdjuster.java:54) -
   IndexAdjuster caught throwable: Bad file descriptor
java.io.IOException: Bad file descriptor
       at java.io.RandomAccessFile.writeBytes(Native Method)
       at java.io.RandomAccessFile.write(RandomAccessFile.java:456)
       at org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:503)
       at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:84)
       at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:98)
       at org.apache.lucene.store.FSIndexOutput.close(FSDirectory.java:506)
       at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:415)
       at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:90)
       at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:709)
       at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686)
       at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543)
       at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:579)

Another:

2007-01-26 09:22:01,723 ERROR [IngestionNode-1-100] (IndexResourceManager.java:393) -
    Could not close Lucene index: Bad file descriptor
java.io.IOException: Bad file descriptor
       at java.io.RandomAccessFile.close0(Native Method)
       at java.io.RandomAccessFile.close(RandomAccessFile.java:532)
       at org.apache.lucene.store.FSIndexInput.close(FSDirectory.java:465)
       at org.apache.lucene.index.CompoundFileReader.close(CompoundFileReader.java:110)
       at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:207)
       at org.apache.lucene.index.IndexReader.close(IndexReader.java:600)
       at org.apache.lucene.search.IndexSearcher.close(IndexSearcher.java:73)

Berkeley JE (3.2.17) calls:

2007-01-26 20:04:03,791 ERROR [AssetIngester-1-300] (IndexAdjuster.java:54) - IndexAdjuster caught throwable:
   (JE 3.2.17) Failed while attempting to commit transaction 45, aborted instead. Original exception = java.io.IOException:
   Bad file descriptor com.sleepycat.je.DatabaseException: (JE 3.2.17) Failed while attempting to commit transaction 45,
   aborted instead. Original exception = java.io.IOException: Bad file descriptor
       at com.sleepycat.je.txn.Txn.commit(Txn.java:544)
       at com.sleepycat.je.txn.Txn.commit(Txn.java:364)
       at com.sleepycat.je.Transaction.commit(Transaction.java:80)

Interesting Links

Free Spam Protection   Android ORM   Simple Java Zip   JMX using HTTP   Great Eggnog Recipe