ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Linux Compatibility on BSD for the PPC Platform: Part 5

by Emmanuel Dreyfus
08/09/2001

Debugging the debugger

At various stages of Linux emulation debugging, the lack of a strong debugging tool such as gdb is a big issue. Getting native threads working with the Java Virtual Machine (JVM) is so tricky that it really requires a working gdb to understand what is going on. NetBSD's gdb is able to work on Linux processes, but it is not able to work with dynamic Linux programs, because it knows nothing about Linux's ld.so. Thus having Linux's gdb working is highly desirable. In this article, we will take a look at Linux emulation fixes needed to have a fully functional Linux gdb.

Spurious terminal hangup

The first issue we had with gdb was rather rude: gdb loaded successfully, displayed the credit lines, the prompt, and then it exited, taking the whole session down at the same time. When running Linux gdb in a telnet session, I was simply logged off.

Using ktrace(1) on gdb, we were able to discover that the reason was a hangup signal (SIGHUP) issued to gdb, and probably to all the processes in the process group operating on the terminal, because all were killed.

The question was: Where was this spurious signal coming from? The kernel trace showed no kill() system call from gdb, therefore gdb was not requesting the whole session to die. The decision to send the signal therefore had to be made in the kernel.

Getting a spurious SIGHUP is quite unusual. Most of the time, runaway processes get unexpected SIGSEGV or SIGBUS signals because they attempted to access invalid memory locations in their address spaces.

SIGHUP is rare enough that we were able to locate its origin by running grep for SIGHUP over the kernel sources. There are basically four places in the NetBSD kernel where a hangup signal is sent to a process.

By adding a printf() call before each of these locations in the kernel, it was possible to discover that the problem was coming from sys/kern/tty.c:ttioctl().

As its name suggests, ttioctl() is an ioctl method. Looking at the kernel trace just before the SIGHUP is caught by gdb gives a better idea of where the problem was coming from:

  1594 gdb      CALL  ioctl(0,TIOCGWINSZ,0x7fffe138)
  1594 gdb      RET   ioctl 0
  1594 gdb      CALL  ioctl(0,TIOCSWINSZ,0x7fffe138)
  1594 gdb      RET   ioctl 0
  1594 gdb      CALL  ioctl(0,TIOCGETA,0x7fffe028)
  1594 gdb      RET   ioctl 0
  1594 gdb      CALL  ioctl(0,TIOCSETAW,0x7fffe0e8)
  1594 gdb      RET   ioctl 0
  1594 gdb      CALL  rt_sigprocmask(0x2,0x101c29bc,0,0x8)
  1594 gdb      RET   rt_sigprocmask 0
  1594 gdb      PSIG  SIGHUP caught handler=0x1003c9c4 mask=(20)
code=0x0SIGHUP SIG_DFL

Knowing that the problem is related to ioctl() calls, we wanted to take a deeper look at the four ioctl() calls that occur before the hangup signal. We started with the last one, the ioctl() TIOCSETAW command, which happened to be the ioctl() call leading to the spurious hangup signal. But first, let us introduce this ioctl() command.

Previously in this series

Linux Compatibility on BSD for the PPC Platform: Part 4 -- Emmanuel Dreyfus explains difficulties discovered in porting the Linux compatibility layer to run the Java Virtual Machine.

Linux Compatibility on BSD for the PPC Platform: Part 3 -- Signals are the interactions between the kernel and the user program -- a program can't run without them. Emmanuel Dreyfus explains how to make your signals Linux-compatible.

Linux Compatibility on BSD for the PPC Platform: Part 2 -- Emmanuel Dreyfus takes a look at how to prevent dynamic Linux binary compatibility problems on the NetBSD/PowerPC platform.

Linux Compatibility on BSD for the PPC Platform -- The Linux compatibility layer allows BSD to run Linux binary applications. Emmanuel Dreyfus explains how he implemented this on NetBSD for the PowerPC.

As we explained in part three of this series, ioctl() is used to perform various nonstandard operations on files, that is, operations other than read, write, and so on. The Linux TIOCSETAW ioctl() command sets terminal properties, but only after the current I/O operation has finished. With this system call, gdb is merely trying to adjust a terminal setting.

Let us now see how this ioctl() call happens to invoke ttioctl(). The ioctl() system call is implemented as sys/compat/linux/common/linux_termios.c:linux_sys_ioctl() for Linux processes. Like most Linux wrapper functions, the job of linux_sys_ioctl() is to make appropriate translations and then call the native ioctl implementation. The native ioctl() implementation depends on the file on which the ioctl() system call was made. linux_sys_ioctl() loads the appropriate function address in the *bsdioctl function pointer, like this:

bsdioctl = fp->f_ops->fo_ioctl;

Then linux_sys_ioctl() tests the command argument (com) of the ioctl() system call and performs a different Linux-to-BSD translation depending on the command. The Linux ioctl() command we are looking at, TIOCSETAW, is implemented as two NetBSD ioctl() commands: TIOCGETA and TIOCSETAW. Both of these commands are executed using *bsdioctl.

The ioctl() operation is done on file descriptor zero (first argument of the ioctl() system call), which is the standard input. If the standard input is a terminal (as opposed to a regular file or a pipe), its ioctl() method is the ioctl() method for terminals, which happens to be ttioctl(). In the ttioctl() implementation, we can see that SIGHUP is issued when the TIOCSETAW command is executed and the terminal output speed is zero.
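The dispatch through the file's operations table can be pictured with a small userland model. This is only a sketch: every type and function name below (model_file, model_fileops, tty_ioctl_model, and so on) is a hypothetical stand-in, not a NetBSD definition. Only the fp->f_ops->fo_ioctl pattern is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's file and file-operations types. */
struct model_file;

struct model_fileops {
    int (*fo_ioctl)(struct model_file *fp, unsigned long com, void *data);
};

struct model_file {
    const struct model_fileops *f_ops;
};

/* Return codes chosen only so the caller can tell which method ran. */
enum { HANDLED_BY_TTY = 1, HANDLED_BY_VNODE = 2 };

/* Plays the role of ttioctl(), the ioctl method for terminals. */
static int tty_ioctl_model(struct model_file *fp, unsigned long com, void *data)
{
    (void)fp; (void)com; (void)data;
    return HANDLED_BY_TTY;
}

/* Plays the role of the ioctl method for a regular file. */
static int vnode_ioctl_model(struct model_file *fp, unsigned long com, void *data)
{
    (void)fp; (void)com; (void)data;
    return HANDLED_BY_VNODE;
}

const struct model_fileops tty_ops = { tty_ioctl_model };
const struct model_fileops vnode_ops = { vnode_ioctl_model };

/* Same shape as the wrapper in the text: load the method, then call it. */
int model_sys_ioctl(struct model_file *fp, unsigned long com, void *data)
{
    int (*bsdioctl)(struct model_file *, unsigned long, void *);

    assert(fp != NULL && fp->f_ops != NULL);
    bsdioctl = fp->f_ops->fo_ioctl;
    return (*bsdioctl)(fp, com, data);
}
```

A model_file built on tty_ops ends up in tty_ioctl_model(), and one built on vnode_ops ends up in vnode_ioctl_model(). The wrapper itself never needs to know what kind of file it was handed.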

Our problem here is that Linux sometimes has a null output speed for a terminal because it does not need to have a value for a virtual terminal, whereas NetBSD uses this value to detect a terminal hangup.

The fix was to fool the NetBSD kernel into thinking that the terminal output speed was not null, whereas it was apparently set to zero for the Linux process. This was achieved by modifying the linux_termio_to_bsd_termios() and bsd_termios_to_linux_termios() functions from sys/compat/linux/common/linux_termios.c, whose job is to translate between Linux and NetBSD termios structures. The fix is simple: When a Linux process stores a null value in the output speed field c_ospeed, we set the field to -1 so that the NetBSD kernel will not hang up the terminal:

 /*
  * A null c_ospeed causes NetBSD to hangup the terminal. 
  * Linux does not do this, and it sets c_ospeed to zero
  * sometimes. If it is null, we store -1 in the kernel
  */ 
 if (bts->c_ospeed == 0)
         bts->c_ospeed = -1;

And when the Linux process reads a struct termios from the kernel, if c_ospeed is -1 then we translate it back to 0. The Linux process thus has a consistent value for c_ospeed:

/*
 * A null c_ospeed causes NetBSD to hangup the terminal. 
 * Linux does not do this, and it sets c_ospeed to zero
 * sometimes. If it is null, we store -1 in the kernel
 */ 
if (bts->c_ospeed == -1)
    bts->c_ospeed = 0;

The value -1 is arbitrary. It was chosen negative so that it cannot interfere with any valid value for c_ospeed. With this fix, gdb was able to start up without immediately hanging up the whole session. The next step was to actually use it.
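The round trip through the two translations can be checked with a small standalone model. This is a sketch under simplifying assumptions: the two structures are hypothetical and reduced to the output speed alone, and the function names are invented for illustration. Only the 0 to -1 and -1 to 0 substitutions come from the fix described here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced termios models: only the output speed is kept. */
struct model_linux_termios { int c_ospeed; };
struct model_bsd_termios   { int c_ospeed; };

/* Placeholder stored on the native side instead of a zero output speed. */
#define MODEL_OSPEED_ZERO (-1)

/* Linux to NetBSD: never hand a zero speed to the native tty code. */
void model_linux_to_bsd(const struct model_linux_termios *lts,
                        struct model_bsd_termios *bts)
{
    assert(lts != NULL && bts != NULL);
    bts->c_ospeed = lts->c_ospeed;
    if (bts->c_ospeed == 0)
        bts->c_ospeed = MODEL_OSPEED_ZERO;
}

/* NetBSD to Linux: turn the placeholder back into the zero Linux stored. */
void model_bsd_to_linux(const struct model_bsd_termios *bts,
                        struct model_linux_termios *lts)
{
    assert(bts != NULL && lts != NULL);
    lts->c_ospeed = bts->c_ospeed;
    if (lts->c_ospeed == MODEL_OSPEED_ZERO)
        lts->c_ospeed = 0;
}

/* What the native side holds after a Linux process sets `speed`. */
int model_kernel_view(int speed)
{
    struct model_linux_termios lts = { speed };
    struct model_bsd_termios bts;

    model_linux_to_bsd(&lts, &bts);
    return bts.c_ospeed;
}

/* What a Linux process reads back after setting `speed`. */
int model_round_trip(int speed)
{
    struct model_linux_termios in = { speed }, out;
    struct model_bsd_termios bts;

    model_linux_to_bsd(&in, &bts);
    model_bsd_to_linux(&bts, &out);
    return out.c_ospeed;
}
```

In this model the native side never sees a zero speed, while the Linux side reads back exactly what it wrote, whether that was zero or an ordinary speed such as 9600.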

Emulating the U-dot zone

Linux's gdb was able to load, present its prompt to the user, and display the online help, but it was not possible to do anything actually useful with it. It was even impossible to launch a Linux program. This was not surprising because the Linux ptrace() system call emulation was not yet implemented for the PowerPC.

In Unix systems, the ptrace() system call is used almost exclusively by debuggers such as gdb. It provides facilities for reading or writing the CPU register values during the program execution, stepping the program, and reading or writing the memory allocated to the traced program. All these operations are requested through ptrace() commands such as PEEKTEXT, POKETEXT, GETREGS, SETREGS, and so on. See the ptrace(2) man page if you want more information about ptrace() commands.

Linux ptrace() emulation is split into two parts: a machine-independent part, located in sys/compat/linux/common/linux_misc.c, and a machine-dependent part, linux_sys_ptrace_arch(), activated through the LINUX_SYS_PTRACE_ARCH macro in sys/compat/linux/common/linux_trace.h.

The linux_sys_ptrace_arch() function is located in sys/compat/linux/arch/powerpc/linux_ptrace.c for the PowerPC. The machine-independent part can handle some commands, such as reading or writing to the traced process memory, by calling the NetBSD native ptrace implementation: sys_ptrace().

The machine-dependent part of ptrace() emulation, linux_sys_ptrace_arch(), should ideally implement the PEEKUSER, POKEUSER, GETREGS, SETREGS, GETFPREGS, and SETFPREGS commands. The easiest way to write the linux_sys_ptrace_arch() function for the PowerPC was obviously to pick up the i386 version and change what was really machine dependent. This includes all references to CPU registers, and all references to data structures that do not exist on the PowerPC, for instance, the u_debugreg field of struct linux_user.

Operations on registers are quite straightforward to implement. Linux binaries expect reading and writing through a pt_regs structure, defined in Linux's header linux/include/asm-ppc/ptrace.h. The job is to get the registers and rearrange them appropriately.

The two tricky operations are PEEKUSER and POKEUSER, which read and write the user structure. Before explaining how we emulate these two commands, let us first introduce the user structure, also known as the U-dot zone.

When running several processes at once, the Unix kernel needs to maintain some information for each process. This process information is split into kernel-memory-based and user-process-memory-based parts. The kernel part of the information is stored in the struct proc, which is defined in sys/sys/proc.h on NetBSD. This structure contains data that must remain in main memory at all times (kernel memory is never swapped out). Kernel-based process information includes, for instance, the user owning the process. That information must always remain resident in main memory because we do not want a ps -aux to cause some pages of each swapped out process to be reloaded into main memory.

The user-based, or "userland", process information is called the user structure. The information contained in the user structure is only needed when the process is running. On NetBSD, the user structure is defined as struct user, in sys/sys/user.h. On Linux, this is the struct user, defined in linux/include/asm-ppc/user.h. In kernel code, user structures used to be named "u", and therefore accessed in C through the u.<field> syntax. Hence the "U-dot" name.

The NetBSD U-dot zone is rather small, because most of the fields in this structure were moved to other locations, including the kernel stack or struct proc. On the other hand, Linux stores lots of information in the U-dot zone, such as text, data, and stack location and sizes. It also uses the U-dot zone to save user values of CPU registers of the traced process when entering kernel space. Linux's gdb reads the U-dot zone to get and set the register values of the traced processes. For reading, this works because the traced process is stopped when gdb does the operation. gdb reads the latest values of the traced process registers before it was stopped and the CPU entered kernel space, saving the registers in the U-dot zone. For writing, it works because when the kernel runs the traced process again, it will restore the modified registers from the U-dot zone.

Now, let us examine how PEEKUSER and POKEUSER are emulated in NetBSD.

These two ptrace() commands are used with three other arguments: the PID of the traced process, the address of the target field in the U-dot zone relative to the beginning of the U-dot zone, and a data field, used for write operations. As you can imagine, it is not trivial to emulate operations on the U-dot zone, because they involve manipulating fields of the U-dot zone that do not exist in NetBSD's U-dot zone: registers, stack location and size, and so on. We therefore have to check the target address, and return a value from another place in the kernel depending on the target address.

The LUSR_OFF macro helps. It returns the offset of a given field from the beginning of Linux's U-dot zone. Here is the definition of LUSR_OFF, from sys/compat/linux/arch/powerpc/linux_ptrace.c:

#define LUSR_OFF(member) offsetof(struct linux_user, member)

And here is some code that emulates reading the stack size, code location, and stack location from Linux's U-dot zone. As you can see, we grab the relevant information from locations in the struct proc (p is a pointer to the struct proc of the current process):

if (addr == LUSR_OFF(u_ssize))
    *retval = p->p_vmspace->vm_ssize;
else if (addr == LUSR_OFF(start_code))
    *retval = (register_t) p->p_vmspace->vm_taddr;
else if (addr == LUSR_OFF(start_stack))
    *retval = (register_t) p->p_vmspace->vm_minsaddr;

And here is a code snippet that emulates reading traced process registers from the U-dot zone:

    error = process_read_regs(t, regs);
/* (snip) */
    if (addr == LUSR_REG_OFF(lnip))
        *retval = regs->pc;
    else if (addr == LUSR_REG_OFF(lctr))
        *retval = regs->ctr;
    else if (addr == LUSR_REG_OFF(llink))
        *retval = regs->lr;
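The technique in these two snippets, using a field offset purely as a selector, can be tried outside the kernel. The sketch below assumes a hypothetical and much reduced user structure (model_linux_user) and a hypothetical container for the native values (model_vmspace). Neither matches the real Linux or NetBSD layouts. Only the offsetof()-based comparison mirrors the code above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much reduced picture of a Linux-style user structure. */
struct model_linux_user {
    unsigned long u_tsize;
    unsigned long u_dsize;
    unsigned long u_ssize;
    unsigned long start_code;
    unsigned long start_stack;
};

/* Hypothetical stand-in for the native per-process bookkeeping. */
struct model_vmspace {
    unsigned long vm_ssize;
    unsigned long vm_taddr;
    unsigned long vm_minsaddr;
};

#define MODEL_LUSR_OFF(member) offsetof(struct model_linux_user, member)

/*
 * Emulated PEEKUSER: no model_linux_user object exists anywhere.
 * The requested offset only selects which native value to report.
 * Returns 0 on success, -1 for an offset this sketch does not handle.
 */
int model_peekuser(const struct model_vmspace *vm, size_t addr,
                   unsigned long *retval)
{
    assert(vm != NULL && retval != NULL);

    if (addr == MODEL_LUSR_OFF(u_ssize))
        *retval = vm->vm_ssize;
    else if (addr == MODEL_LUSR_OFF(start_code))
        *retval = vm->vm_taddr;
    else if (addr == MODEL_LUSR_OFF(start_stack))
        *retval = vm->vm_minsaddr;
    else
        return -1;
    return 0;
}
```

The caller passes MODEL_LUSR_OFF(start_code), for instance, and gets back whatever the native side recorded as the text address, even though no structure with a start_code field was ever filled in.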

With ptrace() implemented, gdb was able to start the traced program, but there was a remaining bug that made it unable to get a backtrace or to trace the program. We will examine the problem in the next section.

Peek into the traced program

In this section we will explain the bug that prevented Linux's gdb from getting a backtrace on a traced program. In fact, ptrace() emulation was even more broken than this, because gdb was not even able to tell where the program stopped when it received a signal. The output was:

Program received signal SIGIO, I/O possible.
0x0 in ?? ()
gdb>

Here is a kernel trace of what gdb attempted to do in order to display the address and the name of the function where the traced process was stopped.

161 gdb   RET write 1
161 gdb   CALL  write(0x1,0x50374000,0x2d)
161 gdb   GIO   fd 1 wrote 45 bytes
   "Program received signal SIGIO, I/O possible.
   "
161 gdb   RET   write 45/0x2d
161 gdb   CALL  ptrace(PTRACE_PEEKUSER,0xa2,0x4,0x7fffdc3c)
161 gdb   RET   ptrace 2147477168/0x7fffe6b0
161 gdb   CALL  ptrace(PTRACE_PEEKUSER,0xa2,0x90,0x7fffdc0c)
161 gdb   RET   ptrace 268437452/0x100007cc
161 gdb   CALL  ptrace(PTRACE_PEEKTEXT,0xa2,0xfffffffc,0x7fffdc3c)
161 gdb   RET   ptrace 0
161 gdb   CALL  ptrace(PTRACE_PEEKTEXT,0xa2,0,0x7fffdc3c)
161 gdb   RET   ptrace -1 errno 22 Invalid argument
161 gdb   CALL  ptrace(PTRACE_PEEKTEXT,0xa2,0xfffffffc,0x7fffdc3c)
161 gdb   RET   ptrace 0
161 gdb   CALL  ptrace(PTRACE_PEEKTEXT,0xa2,0,0x7fffdc3c)
161 gdb   RET   ptrace -1 errno 22 Invalid argument
161 gdb   CALL  ptrace(PTRACE_PEEKTEXT,0xa2,0xffffbca0,0x7fffdc34)
161 gdb   RET   ptrace 0

The first PEEKUSER operation reads the register GPR1, the stack pointer (address 0x4 in Linux's U-dot zone). The returned value (0x7fffe6b0) is an address in the user stack -- it seems valid. The second PEEKUSER call reads the Link register (address 0x90 in Linux's U-dot zone). Here we get an address (0x100007cc) which is obviously located in the process text segment -- it also seems valid. Then gdb attempts to read the function names from the program text with PEEKTEXT commands, but there is obviously something wrong because the requested address (0xfffffffc) is not located in the user address space (user addresses range from 0x00000000 to 0x7fffffff on NetBSD/PowerPC). The next PEEKTEXT attempts are even more malformed, and they fail with an invalid argument error.
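The two offsets in this trace can be worked out from the register block that opens Linux's U-dot zone. The sketch below assumes a layout modeled on the start of the 32-bit PowerPC struct pt_regs (32 general-purpose registers, then nip, msr, orig_gpr3, ctr, and link), with 4-byte fields. The structure and function names are hypothetical. With that assumed layout, offset 0x4 falls on the second general-purpose slot, gpr[1], and offset 0x90 falls on the link register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout, modeled on the beginning of the 32-bit PowerPC
 * register block. Later fields are left out of this sketch.
 */
struct model_pt_regs {
    uint32_t gpr[32];    /* general-purpose registers r0..r31 */
    uint32_t nip;        /* next instruction pointer */
    uint32_t msr;        /* machine state register */
    uint32_t orig_gpr3;
    uint32_t ctr;        /* count register */
    uint32_t link;       /* link register */
};

/* Offset of general-purpose register n from the start of the block. */
size_t model_gpr_offset(unsigned int n)
{
    assert(n < 32);
    return offsetof(struct model_pt_regs, gpr) + n * sizeof(uint32_t);
}

/* Offset of the link register from the start of the block. */
size_t model_link_offset(void)
{
    return offsetof(struct model_pt_regs, link);
}
```

Under these assumptions model_gpr_offset(1) is 0x4 and model_link_offset() is 0x90 (36 fields of 4 bytes each), which matches the two PEEKUSER addresses in the trace.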

The surprising thing was that the first two PEEKUSER calls seemed correct, and the next PEEKTEXT call using the PEEKUSER results was obviously wrong. Using printf() in the kernel to display the correct values confirmed that the PEEKUSER results were right.

It seems that something is wrong in PEEKTEXT or PEEKUSER requests. Here we try the following sample program to check the PEEKTEXT operation.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <signal.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Signal handlers take the signal number as their argument. */
void handler (int sig) {
    (void) sig;
    printf ("in handler\n");
}

int main (int argc, char** argv) {
    unsigned int spot = 0x88888888;
    long err;
    pid_t pid;
    int status;
    long data;

    pid = fork();
    switch (pid) {
        case -1:
            perror ("fork failed");
            exit (-1);
            break;
        case 0:
            /* Child: ask to be traced, then stop on a signal. */
            spot = 0x77777777;
            signal (SIGUSR1, handler);
            err = ptrace (PTRACE_TRACEME, 0, 0, 0);
            printf ("ptrace PTRACE_TRACEME returned %ld, errno=%d\n",
                err, errno);
            kill (getpid(), SIGUSR1);
            sleep (1);
            printf ("child quitting\n");
            break;
        default:
            /* Parent: wait for the child to stop, then read its copy of spot. */
            spot = 0x99999999;
            wait (&status);
            printf ("parent: PTRACE_PEEK on %d\n", (int) pid);
            errno = 0;
            data = ptrace (PTRACE_PEEKTEXT, pid, &spot, 0);
            if (errno != 0)
                printf ("ptrace returned %ld, errno=%d\n", data, errno);
            printf ("read 0x%lx\n", (unsigned long) data);
            printf ("data=0x%lx\n&data=0x%lx\n",
                (unsigned long) data, (unsigned long) &data);
            break;
    }
    return 0;
}

This sample program was interesting because it exhibited result values for PEEKTEXT operations that were different in the program output and in the kernel trace. On the kernel trace, we had the correct value, and in the program output, the wrong value. The explanation of this kind of phenomenon is that the system-call wrapper in glibc altered the return value.

Looking at glibc sources, the answer was obvious. The system call wrapper for ptrace() is defined in glibc/sysdeps/unix/sysv/linux/ptrace.c. Here are the sources of this wrapper:

if (request > 0 && request < 4)
  data = &ret;

res = INLINE_SYSCALL (ptrace, 4, request, pid, addr, data);
if (res >= 0 && request > 0 && request < 4)
  {
    __set_errno (0);
    return ret;
  }

return res;

The test on the request (between 0 and 4) selects the PEEKUSER, PEEKTEXT, and PEEKDATA operations. For these three operations, glibc replaces the return value with the value stored through the data argument. For other operations, the result is just the return value of the ptrace() system call. It is also interesting to look at the ptrace() implementation in Linux kernel sources, in linux/arch/ppc/kernel/ptrace.c:sys_ptrace(), where we discover the same trick:

/* when I and D space are separate, these will need to be fixed. */
case PTRACE_PEEKTEXT: /* read word at location addr. */
case PTRACE_PEEKDATA: {
  unsigned long tmp;
  int copied;

  copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
  ret = -EIO;
  if (copied != sizeof(tmp))
    break;
  ret = put_user(tmp, (unsigned long *) data);
  break;
}

Here, for PEEKTEXT and PEEKDATA, the value that will be returned to the calling program is copied to the location of the data argument, and the address of this data argument is returned to userland. As we saw, glibc then puts the expected value back into the result returned to the calling program.

The reason why Linux does this is probably that on most platforms, the Linux kernel uses negative return values when there is an error. We already looked at this problem in part three of this series. Hence, on the i386, if ptrace() returned a value such as 0xfffffffe, glibc would see a negative value and would assume it is the opposite of an error code. It would therefore set errno to the opposite of 0xfffffffe, which is 2, and we would see an ENOENT error (ENOENT is errno 2). To avoid the problem, Linux must use this kludge with the data argument.
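The ambiguity is easy to reproduce in a userland model. The sketch below is not the glibc or kernel code: all names are hypothetical, the error test is the simplified "negative means error" rule described above, and a private variable replaces errno. It contrasts a wrapper that receives the peeked word in-band with one that receives it through a data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* errno stand-in, so the sketch does not touch the real errno. */
static int model_errno;

int model_last_errno(void)
{
    return model_errno;
}

/* In-band convention: the raw result is either the value or -errno. */
int32_t model_wrapper_inband(int32_t raw)
{
    if (raw < 0) {                       /* simplified error test */
        model_errno = (int)(-(int64_t)raw);
        return -1;
    }
    model_errno = 0;
    return raw;
}

/* "Kernel" side of the data-pointer convention: store the word, return a status. */
int32_t model_kernel_peek(int32_t word, int32_t *data)
{
    assert(data != NULL);
    *data = word;
    return 0;
}

/* Library side: hand the kernel a private slot, then return what it stored. */
int32_t model_wrapper_outofband(int32_t word_in_tracee)
{
    int32_t ret = 0;
    int32_t res = model_kernel_peek(word_in_tracee, &ret);

    if (res >= 0) {
        model_errno = 0;
        return ret;
    }
    model_errno = (int)(-(int64_t)res);
    return -1;
}
```

With the word -2 (bit pattern 0xfffffffe), model_wrapper_inband() reports a failure with error 2, while model_wrapper_outofband() hands the word back unchanged and leaves the error indicator at zero.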

The bug here was that the Linux emulation of the ptrace() operations PEEKTEXT, PEEKDATA, and PEEKUSER was not emulating this Linux-specific behavior correctly. It was just returning the requested value to userland instead of copying it to the location of the data argument and returning the address of the data argument. This problem needed two fixes: one in machine-independent code, for the PEEKTEXT and PEEKDATA operations, and one in machine-dependent code, for PEEKUSER. Here is the fix for PEEKTEXT/PEEKDATA, in sys/compat/linux/common/ptrace.c:linux_sys_ptrace():

error = sys_ptrace(p, &pta, retval);
if (!error)
  switch (request) {
    case LINUX_PTRACE_PEEKTEXT:
    case LINUX_PTRACE_PEEKDATA:
      /* Copy the value out; sizeof *retval is the size of the value itself. */
      error = copyout (retval,
        (caddr_t)SCARG(&pta, data),
        sizeof *retval);
      /* Return the address of the data argument. */
      *retval = SCARG(&pta, data);
      break;
    default:
      break;
  }
return error;

The fix to the PEEKUSER operation is in sys/compat/linux/arch/powerpc/linux_ptrace.c, in linux_sys_ptrace_arch(), and it is similar.

With this fix done, Linux's gdb was fully functional. It was able to trace Linux programs, and get a backtrace when a signal was caught. This functionality is especially useful because it helps us understand why Opera, or the JVM with native threads, failed with bus errors or segmentation faults.

Conclusion

In this series, we examined all the different problems involved in porting Linux compatibility to NetBSD/PowerPC. Most of the problems described here were completely unexpected when I started to work on this project. My understanding of the problem was quite basic: It was just about remapping system calls. The conclusion may be that it is not mandatory to fully understand a kernel subsystem prior to starting work on it. You just need an idea of how it works so that you know where you are heading. There are a lot of things that can be learned along the way. Actually, there are a number of things about kernels I learned while working on Linux compatibility.

Acknowledgements

I would like to thank Manuel Bouyer for giving me the first clue on Linux compatibility ("it works by remapping system calls"); the NetBSD tech-kern and port-powerpc mailing lists contributors for supporting me when I was integrating the Linux compatibility code for NetBSD/PowerPC; Carl Alexander, for providing me an account on a LinuxPPC machine; Kevin B. Hendricks, for his valuable help on tracking bugs that broke the JVM; Hubert Feyrer, Vincent Guillard, and Thomas Klausner for reviewing this paper; and of course, the NetBSD community, without whom this paper would not even exist.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.

Copyright © 2009 O'Reilly Media, Inc.