BSD DevCenter
oreilly.comSafari Books Online.Conferences.

advertisement


Diving into Gcc: OpenBSD and m88k
Pages: 1, 2

Two Wrongs Don't Make a Right

Even with the function calls fixed, gcc 2.95 was unable to recompile itself: the build would die when running the helper program genrecog, which would segfault and dump core.



The genrecog tool is used to generate insn-recog.c from the machine-dependent backend script, in our case config/m88k/m88k.md. It shares a lot of code with other genfoo tools extracting the various information from the m88k.md machine description file. In fact, I very quickly tracked the breakage to genrecog.c. Compiling every other part of genrecog except genrecog.c with gcc 2.95 and compiling genrecog.c with gcc 2.8 produced a working binary.

While I could continue the build (and did several times!) either by cross-generating insn-recog.c on a different machine or simply by reverting to gcc 2.8 for genrecog.c build only, I had to dive in this problem.

genrecog provides a lot of information (in fact, C source code) on standard output. When compiled with gcc 2.95, it would die very quickly with this output:

$ cd obj
$ ./genrecog /usr/src/gnu/egcs/gcc/config/m88k/m88k.md
/* Generated automatically by the program `genrecog'
from the machine description file `md'.  */

#include "config.h"
#include "system.h"
#include "rtl.h"
#include "insn-config.h"
#include "recog.h"
#include "real.h"
#include "output.h"
#include "flags.h"

extern rtx gen_split_1 ();
extern rtx gen_split_2 ();
Memory fault (core dumped)
$

By searching for this kind of output (such as extern rtx gen_split) in the source, I could quickly trace the segfault to this snippet from main():

while (1)  
{
      c = read_skip_spaces (infile);
      if (c == EOF)
        break;
      ungetc (c, infile);

      desc = read_rtx (infile); 
      if (GET_CODE (desc) == DEFINE_INSN)
        recog_tree = merge_trees (recog_tree,
                                  make_insn_sequence (desc, RECOG));
      else if (GET_CODE (desc) == DEFINE_SPLIT)
        split_tree = merge_trees (split_tree,
                                  make_insn_sequence (desc, SPLIT));
      if (GET_CODE (desc) == DEFINE_PEEPHOLE   
          || GET_CODE (desc) == DEFINE_EXPAND)
        next_insn_code++;
      next_index++;
}

The function responsible for the extern rtx gen_split lines is make_insn_sequence. The program would then die between the second and the third merge_trees invocation.

make_insn_sequence might produce an insn sequence or whatever. To continue diving into genrecog, all we need to know is what kind of return value it provides. A quick glance at the source shows:

static struct decision_head make_insn_sequence PROTO((rtx, enum routine_type));
...
static struct decision_head merge_trees PROTO((struct decision_head,
                                               struct decision_head));

These functions work on decision_head structures, which are defined as:

/* Data structure for a listhead of decision trees.  The alternatives   
   to a node are kept in a doubly-linked list so we can easily add nodes
   to the proper place when merging.  */
   
struct decision_head { struct decision *first, *last; };

The best way to know at which point invalid data would be generated was to add a trace of the decision_head structures.

Adding traces to genrecog is very easy: since it outputs C code, you just have to output your traces as C comments. Adding simple traces to merge_trees proved very quickly that it was given an invalid argument. But then traces in make_insn_sequence would prove that it would produce valid data. The output would end like this:

...
extern rtx gen_split_1 ();
/* make_insn_sequence -> 2e000.2e000 */
/* merge_trees <- 0.0 8.4008 */
/* merge_trees -> 8.4008 */
extern rtx gen_split_2 ();
/* make_insn_sequence -> 2e380.2e380 */
/* merge_trees <- 8.4008 18.4018 */
Memory fault (core dumped)
$

This is proof that the problem lies in the way short structures are passed to and returned from functions. Consider the following example:

$ cat maze.c
#include <stdio.h>

struct maze { const char *item1, *item2; };

struct maze
builder(void)
{
    struct maze m;

    m.item1 = "you are lost";
    m.item2 = "in the maze.";

    printf("builder: m.item1 = %p, item2 = %p\n", m.item1, m.item2);

    return m;
}

void
checker(struct maze m)
{
    printf("checker: m.item1 = %p, item2 = %p\n", m.item1, m.item2);
}

main()
{
    checker(builder());
}
$

When compiled by gcc 2.8, it would output, for example:

$ gcc28 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x10f8, item2 = 0x1108
$

However, gcc 2.95 produces:

$ gcc295 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x0, item2 = 0x0
$

Trial and error showed that this problem only affected structures from 8 to 32 bytes, inclusive.

If you look at the generated code for main(), aside from the prologue and epilogue code, you'll see that gcc 2.8 generates the following correct code:

    or      r12,r0,r31
    bsr     _builder
    bsr     _checker

gcc 2.95 adds two extra instructions:

    ld.d    r24,r0,r31
    or      r12,r0,r31
    bsr     _builder
    st.d    r24,r0,r31
    bsr     _checker

Before we go further, we need to know more about the m88k calling convention. The standard is never to pass the structures in registers, always on the stack. If the function returns a struct itself, the address of the returned struct should be set in r12 by the caller which will have allocated the appropriate space.

In the gcc 2.8 code, the compiler has already optimized the call flow so that the structure is immediately on the stack frame and is thus immediately usable in checker(). After r12 is set to point to the temporary location, builder() and checker() are invoked.

gcc 2.95, on the other hand, adds two extra statements. The first one saves the stack area which is about to be used by builder() in registers r24 and r25. The second statement restores this area immediately before invoking checker(), effectively making it check uninitialized memory.

If we use a temporary variable to store the builder() result, such as:

#if 0
        checker(builder());
#else
        struct maze m = builder();
        checker(m);
#endif

then gcc 2.8 and gcc 2.95 will produce the same, correct, code:

    addu    r12,r30,8
    bsr     _builder
    or      r13,r0,r31
    addu    r11,r30,8
    ld      r12,r11,0
    ld      r11,r11,4
    st      r12,r13,0
    st      r11,r13,4
    bsr     _checker

In this case, the variable m is at r30(8). After builder() returns, the structure is copied to r31(0), which will be the frame for checker(). (Note that the structure could be copied with one ld.d and one st.d instruction, rather than two ld and two st...)

So we have a problem which affects structures only when they are not forced to be in memory. It makes then sense to hunt for a bug in the RETURN_IN_MEMORY macro, which will decide whether a particular object can be returned in a register:

$ cd config/m88k
$ head -1001 m88k.h | tail -8
/* Disable the promotion of some structures and unions to registers. */
#define RETURN_IN_MEMORY(TYPE) \
  (TYPE_MODE (TYPE) == BLKmode \
   || ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
       && !(TYPE_MODE (TYPE) == SImode \
            || (TYPE_MODE (TYPE) == BLKmode \
                && TYPE_ALIGN (TYPE) == BITS_PER_WORD \
                && int_size_in_bytes (TYPE) == UNITS_PER_WORD))))
$

But if we look back at the code for FUNCTION_ARG, the comments say:

   Structures and unions which are not 4 byte, and word
   aligned are passed in memory rather than registers, even if they
   would fit completely in the registers under OCS rules. 

and the code indeed does:

    if (type != 0                 /* undo putting struct in register */
      && (TREE_CODE (type) == RECORD_TYPE || TREE_CODE (type) == UNION_TYPE))
    mode = BLKmode;
...
  if (mode == BLKmode
           && (TYPE_ALIGN (type) != BITS_PER_WORD
               || bytes != UNITS_PER_WORD))
    return (rtx) 0;

which really prevents any structure larger than 4 bytes from being passed in a register.

A more realistic version of RETURN_IN_MEMORY becomes then:

/* Disable the promotion of some structures and unions to registers.
   Note that this matches FUNCTION_ARG behavior. */
#define RETURN_IN_MEMORY(TYPE) \
  (TYPE_MODE (TYPE) == BLKmode \
   || ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
       && (TYPE_ALIGN (TYPE) != BITS_PER_WORD || \
           GET_MODE_SIZE (TYPE_MODE (TYPE)) != UNITS_PER_WORD)))

Unfortunately, this fix, albeit an improvement, did not solve the issue. However, this problem was specific to the m88k compiler and could not be triggered on various other architectures, from Alpha to Vax. Investigating the differences between m88k and the other CPUs was the next logical step.

Depending on the various way function calls work across the different cpus, with various calling conventions and stack layouts, not even counting register windows, the machine-independent part of gcc tries to offer as much flexibility as possible, letting architecture-dependent configuration files define the different behaviors they provide or expect, using a bunch of macros.

Some of these macros must be defined for every architecture, such as LIBCALL_VALUE which defines where a function returns its result (on m88k, in register r2), while other macros are only defined if the architecture requires it. The m88k is unique in defining REG_PARM_STACK_SPACE and OUTGOING_REG_PARM_STACK_SPACE (actually, a few other architectures do this as well), as well as STACK_PARMS_IN_REG_PARM_AREA.

What do these macros tell the compiler? Let's quote from the manuals:

REG_PARM_STACK_SPACE (FNDECL)

Define this macro if functions should assume that stack space has been allocated for arguments even when their values are passed in registers.

The value of this macro is the size, in bytes, of the area reserved for arguments passed in registers for the function represented by FNDECL, which can be zero if GNU CC is calling a library function.

This space can be allocated by the caller, or be a part of the machine-dependent stack frame:

OUTGOING_REG_PARM_STACK_SPACE says which.

OUTGOING_REG_PARM_STACK_SPACE
Define this if it is the responsibility of the caller to allocate the area reserved for arguments passed in registers.

Nothing fancy here. The m88k-specific backend indeed will automagically allocate 32 bytes on the stack during the function prologue. Hey, wait, 32 bytes, exactly like the structure size threshold, where bigger structures don't get clobbered.

A simple experiment is to comment out OUTGOING_REG_PARM_STACK_SPACE. Compiling gcc with these settings will produce a stack-wasting compiler, because every function prologue will now automagically allocate an extra 64 bytes on the stack: 32 from the m88k-specific prologue and 32 from the architecture-independent prologue code, since it has been now instructed to do so. However, when using this compiler, the problem disappears completely, whatever the size of the structure.

Another path worth trying would be to remove this automatic stack allocation, and undefine REG_PARM_STACK_SPACE. However, I am afraid this could break some 88Open rule, or, even worse, some implicit assumption hidden in the m88k-specific backend code.

Now it's time to look at STACK_PARMS_IN_REG_PARM_AREA. Quoting the manuals again:

STACK_PARMS_IN_REG_PARM_AREA

Define this macro if REG_PARM_STACK_SPACE is defined, but the stack parameters don't skip the area specified by it.

Normally, when a parameter is not passed in registers, it is placed on the stack beyond the REG_PARM_STACK_SPACE area.

Defining this macro suppresses this behavior and causes the parameter to be passed on the stack in its natural location.

This is very interesting, and also very obscure (I'll know what to blame for my next headache). No other architecture defines this. If I understand this correctly, it means that the REG_PARM_STACK_SPACE area can be shared by both function call parameters, and local variables.

While gcc 2.8 would not care about this area being shared, and assumes we know what we are doing, gcc 2.95 is more strict, and will explicitly save any variable of this shared area, when it might be clobbered. This is exactly what we witnessed. The area on the stack which has been saved before invoking builder() and restored after it returned is the implicit location for a temporary struct maze.

A real fix would be to teach gcc to handle the shared area as gcc 2.8 did, but then this behaviour is objectionable, and probably more a bug than a feature. My workaround was to stop defining STACK_PARMS_IN_REG_PARM_AREA. The generated code becomes:

    addu    r12,r30,8
    bsr     _builder
    addu    r13,r31,32
    addu    r11,r30,8
    ld      r12,r11,0
    ld      r11,r11,4
    st      r12,r13,0
    st      r11,r13,4
    bsr     _checker

which is the same code as when the result of builder() was stored in an explicit temporary variable, except for the stack location: instead of being at the beginning of the REG_PARM_STACK_SPACE area, it is now past this area. So there is still some stack waste, but as told before I do not want to change REG_PARM_STACK_SPACE at the moment.

With these bugs fixed, I reached a point where -O0 would apparently always produce correct code. It was time to make optimization reliable as well!

References

Miod Vallat makes all sort of strange telephony hardware work reliably for a living.


Return to the BSD DevCenter.




Sponsored by: