ONLamp.com

Linux Compatibility on BSD for the PPC Platform: Part 4

Running ktrace(1) against javac_g with full logging enabled, Hendricks was able to discover that the hang was caused by a spurious SIGIO. We then tried a few C programs that reproduced what the JDK was doing, and we ended up with this test program:



/*
 * sigio2.c -- Test asynchronous I/O for pipes
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <fcntl.h>

void io_sighandler (int sig) {
  printf ("pid=%d got sigio\n", getpid ());
  printf ("I GOT SIGIO\n");
  exit (-1);
}

int main (int argc, char** argv) {
  struct sigaction aio;
  int fdsync[2];
  int err;
  char c;
  sigset_t set;

  sigemptyset(&set);
  sigaddset(&set,SIGIO);

  aio.sa_flags = SA_RESTART;
  aio.sa_handler = io_sighandler;
  sigemptyset(&aio.sa_mask);
  if (sigaction(SIGIO, &aio, 0) == -1) {
      printf("Error: Bad return value from sigaction call\n");
      exit(1);
  }

  if (pipe(fdsync) < 0) {
      printf("Error: bad pipe call\n");
      exit(1);
  }

  /* now set the pipe write end to be non-blocking async */
  fcntl(fdsync[1],F_SETFL, O_NONBLOCK | FASYNC);
  fcntl(fdsync[1],F_SETOWN, getpid());

  err = write(fdsync[1], "AAAA", 4);
  if (err < 0) {
      printf("write() got err=%d\n", err);
  }
  printf ("written %d bytes\n", err);

  sleep (1);

  do
      err = read(fdsync[0], &c, 1);
  while (!err);
  printf ("read %d bytes\n", err);

  printf("NO SIGIO\n");
  exit(0);
}

This test program reproduces in one process what it takes the JVM two processes to synchronize: it makes use of asynchronous I/O through a pipe. Running this program natively on Linux and on NetBSD gives different results. When we ran the program as a Linux binary on NetBSD, we got the NetBSD behavior and not the Linux behavior. This is where the problem was: NetBSD delivers a SIGIO on the read() call, whereas Linux does not. That is how the JVM got confused by the unexpected SIGIO.

At a glance, this may appear to be a bug in the way NetBSD handles asynchronous I/O. The writer has written 4 bytes to the pipe, and it is not blocked on the write operation. Thus, there is no reason why it needs to know that the reader has read one byte. Linux seems to implement a better behavior here.

In fact, this is not a real bug, because there seems to be no standard such as POSIX that explains how asynchronous I/O is supposed to work. With no such standard available, there can't be a "standard" behavior, and thus this isn't considered a bug in the way NetBSD handles asynchronous I/O for native programs. Of course, there is still a bug in the way NetBSD was emulating asynchronous I/O for Linux binaries.

The lack of a standard is reflected by the diversity of the behaviors implemented by the different Unix systems. Running the test program on eight different operating systems gives the following results:

  • Do trigger SIGIO on read(): NetBSD, Digital Unix, MacOS X.
  • Do not trigger SIGIO on read(): FreeBSD, OpenBSD, Linux, Solaris, AIX.

The systems triggering SIGIO on read() for pipes are in fact the Unix systems that still use the original pipe implementation from Berkeley's BSD Unix, which is based on a pair of Unix domain sockets. Solaris uses an AT&T Unix System V implementation that does not implement asynchronous I/O on pipes, and Linux has an implementation written from scratch that also ignores asynchronous I/O requests on pipes.

Digital Unix and MacOS X both have strong BSD roots, and it is not surprising that they behave in the same way that NetBSD does. What is surprising is that the two operating systems that are the closest to NetBSD, that is, FreeBSD and OpenBSD, implement a different behavior. This is just because they both use a new optimized pipe implementation, written by John Dyson for the FreeBSD project. This new implementation does not implement asynchronous I/O on read() operations for pipes. FreeBSD pipes are currently being integrated in NetBSD, and using the new pipe implementation leads to the same behavior that Linux and other OSes have.

Now that the problem has been identified, it is time to propose a fix. Let's have a look at the way the SIGIO is issued.

When a process makes a read() system call, the kernel runs the sys_read() function, which is located in sys/kern/sys_generic.c. sys_read() in turn calls dofileread() from the same file.

dofileread() invokes the function pointed at by the fo_read field of the file operation structure. This file operation structure, struct fileops, is defined in struct file, in sys/sys/file.h. For pipes, it is initialized when the pipe() system call is called. pipe() is implemented in the kernel as sys_pipe(), which is located in the sys/kern/uipc_syscalls.c file. sys_pipe() sets the file operation structure for pipes to a static fileops structure called "socketops."

The underlying idea of the fileops structure is to use an object-oriented scheme for file handling. All files are handled the same way by the kernel, through the standard methods available in the struct fileops: read, write, ioctl, etc. The struct fileops is initialized when the file object (that is, the struct file) is created, and the methods in it depend on the file type. A read operation on a pipe, a plain file, or a block device is hence requested the same way, but implemented by different functions.

Although this is not really related to the compatibility subsystem, it is probably worth mentioning that this scheme is widely used in the Unix kernel. The most popular application is probably the Virtual File System (VFS) interface. The VFS uses pointers to functions to provide the same programming interface to access regular files and directories stored on various filesystem types. The operations are pointed to by the v_op field of the struct vnode (defined in sys/sys/vnode.h). Whether the regular file is on an FFS, NFS, or NTFS filesystem, file operations are requested the same way through pointers to filesystem-specific methods, and the operations are implemented by different functions depending on the filesystem.

But let's come back to pipe file operations. socketops contains the file operation functions for all sockets. It is defined in sys/kern/sys_socket.c, and its fo_read field is a pointer to the soo_read() function, located in the same file. soo_read() invokes the function pointed to by the so_receive field of the struct socket (defined in sys/sys/socketvar.h), which is the receive method for the socket on which we want to read (remember that pipes are implemented as Unix domain sockets in NetBSD).

We need to make one more long journey into the kernel sources to find out where so_receive is pointing. In sys_pipe(), we can see that the kernel is creating two Unix sockets to build the pipe. It does this by invoking socreate(), which is located in sys/kern/uipc_socket.c. In this function, there is some black magic to set the so_receive field. Its value is copied from the pr_usrreq field of a struct protosw variable. struct protosw is defined in sys/sys/protosw.h. It defines per-protocol properties for sockets. The struct protosw variable used by socreate() is obtained by a call to pffindproto(). pffindproto() can be found in sys/kern/uipc_domain.c, and its job is to return the struct protosw for a given protocol.

The protosw structures are statically initialized in sys/kern/uipc_proto.c. For a Unix socket, the pr_usrreq field points to uipc_usrreq(), which is located in sys/kern/uipc_usrreq.c. Now we finally know that so_receive is pointing to uipc_usrreq().

uipc_usrreq() is responsible for dispatching various socket operations: receiving, sending, connecting, and so on. On a receive operation (case PRU_RCVD in the function), it ends by calling sowwakeup(), which is a macro defined in sys/sys/socketvar.h, and which calls sowakeup() in sys/kern/uipc_socket2.c. sowakeup()'s job is to wake up the peer process, issue a SIGIO, and make any appropriate upcall.

Modifying uipc_usrreq() is not a good idea: it is complex enough already. Care should be taken to fold in our fix somewhere else. In fact, the easiest way of fixing the problem would just be to ignore asynchronous I/O requests for binaries of operating systems that do not implement it for pipes. This fix would take place in the fcntl() implementation. Let's have a look at the kernel sources.

sys_fcntl() is implemented in sys/kern/kern_descrip.c. It basically calls the function pointed to by the fo_ioctl field of the struct fileops of the underlying object. Here, it is a Unix socket, and we saw that the struct fileops was implemented as the socketops static variable defined in sys/kern/sys_socket.c. Thus, fo_ioctl points to soo_ioctl(), which is also defined in sys/kern/sys_socket.c. To request asynchronous I/O, the calling process calls fcntl() with the FIOASYNC command. In soo_ioctl(), the FIOASYNC command was implemented like this:

case FIOASYNC:
    if (*(int *)data) {
        so->so_state |= SS_ASYNC;
        so->so_rcv.sb_flags |= SB_ASYNC;
        so->so_snd.sb_flags |= SB_ASYNC;
    } else {
        so->so_state &= ~SS_ASYNC;
        so->so_rcv.sb_flags &= ~SB_ASYNC;
        so->so_snd.sb_flags &= ~SB_ASYNC;
    }
    return (0);

We wanted to prevent soo_ioctl() from setting asynchronous flags when the socket was in fact a pipe and when the emulation was not NetBSD or Digital Unix. (There is no MacOS X emulation yet.) To achieve this, we needed to recognize sockets implementing a pipe. This was done by adding an SS_ISAPIPE flag to the so_state field of struct socket. SS_ISAPIPE is defined in sys/sys/socketvar.h:

#define  SS_ISAPIPE     0x800 /* socket is implementing a pipe */

This flag is set in sys_pipe(), in sys/kern/uipc_syscalls.c, so that we will be able to tell that this socket is a pipe:

if ((error = socreate(AF_LOCAL, &rso, SOCK_STREAM, 0)) != 0)
   return (error);
if ((error = socreate(AF_LOCAL, &wso, SOCK_STREAM, 0)) != 0)
   goto free1;
/* remember this socket pair implements a pipe */
wso->so_state |= SS_ISAPIPE;
rso->so_state |= SS_ISAPIPE;

Then we needed to know if a given emulation required the original BSD behavior for pipes or not. This was done by introducing another new flag, this time in the e_flags field of struct emul, which is defined in sys/sys/proc.h:

/*
 * No BSD style async I/O pipes. Async I/O request through
 * fcntl() for pipes will be ignored.
 */
#define  EMUL_NO_BSD_ASYNCIO_PIPE   0x002

This flag is set or left clear in the struct emul definition for each OS. For NetBSD native binaries, the struct emul is called emul_netbsd, and it is initialized in sys/kern/kern_exec.c. For Linux, it is emul_linux, initialized in sys/compat/linux/common/linux_exec.c. The scheme is similar for other emulations.

With these two additional flags, we can now do the job, and we end up with this implementation of FIOASYNC in soo_ioctl():

case FIOASYNC:
    if (
#ifndef __HAVE_MINIMAL_EMUL
        (!(so->so_state & SS_ISAPIPE) ||
        (!(p->p_emul->e_flags & EMUL_NO_BSD_ASYNCIO_PIPE))) &&
#endif
        *(int *)data) {
        so->so_state |= SS_ASYNC;
        so->so_rcv.sb_flags |= SB_ASYNC;
        so->so_snd.sb_flags |= SB_ASYNC;
    } else {
        so->so_state &= ~SS_ASYNC;
        so->so_rcv.sb_flags &= ~SB_ASYNC;
        so->so_snd.sb_flags &= ~SB_ASYNC;
    }
    return (0);

The __HAVE_MINIMAL_EMUL ifdef is here because the e_flags field in struct emul is also in a __HAVE_MINIMAL_EMUL ifdef.

With this implementation, the pipe behavior was fixed for Linux and Solaris binaries, and probably other emulations as well. This fixed our problem with the Jakarta-Ant build, and it greatly improved the usability of Jakarta-Tomcat, because it was then able to work with JDK-1.3.0 and Green Threads. Thanks to Linux emulation, it is now possible to play with servlets and JSP on NetBSD/PowerPC.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.

Previously in this series

Linux Compatibility on BSD for the PPC Platform -- The Linux compatibility layer allows BSD to run Linux binary applications. Emmanuel Dreyfus explains how he implemented this on NetBSD for the PowerPC.

Linux Compatibility on BSD for the PPC Platform: Part 2 -- Emmanuel Dreyfus takes a look at how to prevent dynamic Linux binary compatibility problems on the NetBSD/PowerPC platform.

Linux Compatibility on BSD for the PPC Platform: Part 3 -- Signals are the interactions between the kernel and the user program -- a program can't run without them. Emmanuel Dreyfus explains how to make your signals Linux-compatible.

