ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Getting Familiar with GCC Parameters

by Mulyadi Santosa
04/05/2007

gcc (GNU C Compiler) is actually a collection of frontend tools that does compilation, assembly, and linking. The goal is to produce a ready-to-run executable in a format acceptable to the OS. For Linux, this is ELF (Executable and Linking Format) on x86 (32-bit and 64-bit). But do you know what some of the gcc parameters can do for you? If you're looking for ways to optimize the resulted binary, prepare for a debugging session, or simply observe the steps gcc takes to turn your source code into an executable, getting familiar with these parameters is a must. So, please read on.

Recall that gcc does multiple steps, not just one. Here is a brief explanation of their meaning:

The gcc parameters described here directly and indirectly touch all those four stages, so for clarity, this article is organized as follows:

Before that, let's meet the accompanying tools that will help us to sneak inside the resulting code:

The following explanations can be applied to gcc version 3.x and 4.x, so it's pretty generic. Let's start digging, shall we?

Parameters Related To Code Optimization

gcc offers a "no-brainer" way to do a bunch of optimizations for you: the -O option. It either makes your code run faster or squeezes the final code's size. There are five variants of it:

How much speedup do we really get from them? Well, suppose we have this code:

#include<stdio.h>
   int main(int argc, char *argv[])
   {
   int i,j,k
   unsigned long acc=0; 
   for(i=0;i<10000;i++)
        for(j=0;j<5000;j++)
                for(k=0;k<4;k++)
                        acc+=k;
   printf("acc = %lu\n",acc);
   return 0;
   }

Listing 1. Simple code to be optimized.

Using gcc, four different binaries are created, using each -O variants (except -Os). The time tool records their execution time, e.g.:

$ time ./non-optimized
  No optimization -O1 -O2 -O3
real 0.728 0.1 0.1 0.1
user 0.728 0.097 0.1 0.1
sys 0.000 0.002 0.000 0.000

For simplicity's sake, we use the following conventions:

As you can see, between the non-optimized case and the one compiled using -O1, the elapsed time is seven times shorter. Notice that -O1, -O2, and -O3 don't really give too much additional benefit--in fact, they are pretty much the same. So what's the magic of -O1 here?

After a quick sightseeing trip to the source code, you would expect that such code is a dead-end for optimization. First, let's look at a brief comparison between the disassembled versions of non-optimized and optimizedO1.

$ objdump -D non-optimized
$ objdump -D optimizedO1

(Note: You might get different results, so use this as general guideline.)

Non-optimized OptimizedO1
mov 0xfffffff4(%ebp),%eax add $0x6,%edx
add %eax,0xfffffff8(%ebp) add $0x1,%eax
addl $0x1,0xfffffff4(%ebp) cmp $0x1388,%eax
cmpl $0x3,0xfffffff4(%ebp)  

The above snippets implement the innermost loop (for (k=0;k<4;k++)). Notice the clear difference: the non-optimized code directly loads and stores from a memory address, while optimizedO1 uses CPU registers as the accumulator and loop counter. As you may already be aware, registers can be accessed hundreds or thousands times faster than RAM cells.

Not satisfied to just exploit CPU registers for temporary storage, gcc uses another optimization trick. Let's see the disassembled code of optimizedO1 again, specifically in the main() function:

......
   08048390 <main>:
   ...
   80483a1:       b9 00 00 00 00          mov    $0x0,%ecx
   80483a6:       eb 1f                   jmp    80483c7 <main+0x37>
   80483a8:       81 c1 30 75 00 00       add    $0x7530,%ecx

0x7530 is 30,000 in decimal form, so we can quickly guess the loop is simplified. This code represents the innermost loop and the outermost loop ("for(j=0;j<5000;j++) ... for(k=0;k<4;k++)") because that is literally a request to do 30,000 loops. Note that you just need to loop three times inside. When k=0, acc is still the same, so the first loop can be discarded.

   80483ae:       81 f9 00 a3 e1 11       cmp    $0x11e1a300,%ecx
   80483b4:       74 1a                   je     80483d0 <main+0x40>
   80483b6:       eb 0f                   jmp    80483c7 <main+0x37>

Hmm, now it is compared to 300,000,000 (10,000*5,000*6). This represents all three loops. After we reach that number of loops, we jump straight to the printf() to print the sum (address 0x80483d0 - 0x80483db)

   80483b8:       83 c2 06                add    $0x6,%edx
   80483bb:       83 c0 01                add    $0x1,%eax
   80483be:       3d 88 13 00 00          cmp    $0x1388,%eax
   80483c3:       74 e3                   je     80483a8 <main+0x18>
   80483c5:       eb f1                   jmp    80483b8 <main+0x28>

Six is added in the accumulator in each iteration. %edx will eventually hold the total sum after completion of all the loops. The third and fourth lines shows us that after this is done 5,000 times, it should return to address 0x80483a8 (as mentioned previously).

We can conclude that gcc creates a simplification here. Instead of looping three times in the innermost loop, it simply adds six for every middle loop. This sounds simple, but it makes your program do only 100,000,000 loops instead of 300,000,000. This simplification, technically called loop unrolling, is one of the jobs done by -O1/2/3. Of course, you can do it on your own, but sometimes it is good to know that gcc can detect such things and optimize them.

With -O2 or -O3, gcc also tries to optimize the branch. This is usually achieved via reordering [5] and code transformation. The goal of this procedure is to eliminate as many mispredicted branches as possible, thus improving pipeline utilization. For example, we can compare how non-optimized and optimizedO2 do most outer loops.

 80483d4:       83 45 ec 01             addl   $0x1,0xffffffec(%ebp)
 80483d8:       81 7d ec 0f 27 00 00    cmpl   $0x270f,0xffffffec(%ebp)
 80483df:       7e c4                   jle    80483a5 <main+0x21>

Non-optimized binary use jle to execute the jump. Mathematically, it means that the probability of taking the branch is 50 percent. Other the other hand, the optimizedO2 version uses this:

80483b4:       81 c1 30 75 00 00       add    $0x7530,%ecx
80483ba:       81 f9 00 a3 e1 11       cmp    $0x11e1a300,%ecx
80483c0:       75 e1                   jne    80483a3 <main+0x13>

Instead of jle, now jne is used. Assuming any integer could be compared in the previous cmp, you could easily conclude that it boosts the chance of taking the branch to almost 100 percent. It is a small but useful hint for the processor to indicate which code should be executed. However, for most modern processors, this kind of transformation is not terribly necessary because the branch predictor is smart enough on its own.

To prove how much this transformation can help, OProfile comes to the rescue. Oprofile is executed to record the number of retired branches and retired mispredicted branches. Retired here means it is executed inside the CPU pipeline.

$ opcontrol --event=RETIRED_BRANCHES_MISPREDICTED:1000 --event=RETIRED_BRANCHES:1000;

We run non-optimized and optimizedO2 five times each. Then we take out the maximum and minimum of the samples. We calculate the misprediction rate using this formula (adapted from this):

Misprediction ratio = retired mispredicted branch / retired branch

The average misprediction ratio is now calculated for each binary. Non-optimized gets 0.5117 percent, while optimizedO2 gets 0.4323 percent--a very small gain for our case. The actual gain may vary in a real-world case, because gcc itself can't do much without external hints. Please read about __builtin_expect() in the gcc online documentation for further information.

Options Related To Function Calling

gcc basically offers you several ways to manage how a function is called. Let's take a look at inlining first. By inlining, you reduce the cost of a function call because the body of the function is directly substituted into the caller. Please note that this is not done by default, only when you use -O3 or at least -finline-functions.

How does the finished binary look when gcc does inlining? Observe Listing 2:

#include<stdio.h>

inline test(int a, int b, int c)
{
        int d;
        d=a*b*c;
        printf("%d * %d * %d is %d\n",a,b,c,d);
}



static inline test2(int a, int b, int c)
{
         int d;
         d=a+b+c;
         printf("%d + %d + %d is %d\n",a,b,c,d);
}

int main(int argc, char *argv[])
{
        test(1,2,3);
        test2(4,5,6);
}

Listing 2. Inline in action

Compile Listing 2 using the following parameter:

$ gcc -S -O3 -o <result-file.s> <listing-2.c>

-S makes gcc stop right after compilation stage (we'll cover it later in this article). The results are as follows:

....
test:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
....
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
...
        movl    $6, 16(%esp)
        movl    $3, 12(%esp)
        movl    $2, 8(%esp)
        movl    $1, 4(%esp)
        movl    $.LC0, (%esp)
        call    printf
...
        movl    $15, 16(%esp)
        movl    $6, 12(%esp)
        movl    $5, 8(%esp)
        movl    $4, 4(%esp)
        movl    $.LC1, (%esp)
        call    printf
...

Both test() and test2() are indeed inlined, but you also see test(), which stays outside main(). This is where the static keyword plays a role. By saying a function is static, you tell gcc that this function won't be called by any outside object file, so there is no need to emit the codes on its own. Thus, it is a space saver if you can mark them as static whenever possible. On the other hand, be wise when deciding which function should be inlined. Increasing size for a small speedup isn't always worthwhile.

With certain heuristics, gcc decides whether a function should be inlined or not. One of the considerations is the function size in term of pseudo-instructions. By default, the limit is 600. You can change this limit via -finline-limit. Experiment to find better inline limits for your own case. It is also possible to override the heuristics so gcc always inlines the function. Simply declare your function like this:

__attribute__((always_inline)) static inline test(int a, int b, int c)

Now, on to parameter passing. In x86 architectures, parameters are pushed to the stack and later popped inside the function for further processing. But gcc gives you a chance to change this behavior and instead use registers. Functions with up to three parameters could use this feature by passing -mregparm=<n>, where <n> is the number of registers we want to use. If we apply this parameter (n=3) to Listing 2, take out the inline attribute, and use no optimization, we get this:

...
test:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $56, %esp
        movl    %eax, -20(%ebp)
        movl    %edx, -24(%ebp)
        movl    %ecx, -28(%ebp)
...
main:
...
        movl    $3, %ecx
        movl    $2, %edx
        movl    $1, %eax
        call    test

Instead of stack, it uses EAX, EDX, and ECX to hold the first, second, and third parameter. Because register access time is faster than RAM, it is one way to reduce runtime. However, you must pay attention to these issues:

You probably notice this kind of sequence at the beginning of every function:

push   %ebp
mov    %esp,%ebp
sub    $0x28,%esp

This sequence, also known as the function prologue, is written to set up the frame pointer (EBP). It is useful to help the debugger do a stack trace. The structure below helps you visualize this [6]:

[ebp-01] Last byte of the last local variable
[ebp+00] Old ebp value
[ebp+04] Return address
[ebp+08] First argument

Can we omit it? Yes, with -fomit-frame-pointer, the prologue will be shortened so the function just begins with a stack reservation (if there are local variables):

sub    $0x28,%esp

If the function gets called very frequently, cutting out the prologue saves your program several CPU cycles. But be careful: by doing this, you also make it hard for the debugger to investigate the stack. For example, let's add test(7,7,7) at the end of test2() and recompile with -fomit-frame-pointer and no optimization. Now fire up gdb to inspect the binary:

$ gdb inline
(gdb) break test
(gdb) r
Breakpoint 1, 0x08048384 in test ()
(gdb) cont
Breakpoint 1, 0x08048384 in test ()
(gdb) bt
#0  0x08048384 in test ()
#1  0x08048424 in test2 ()
#2  0x00000007 in ?? ()
#3  0x00000007 in ?? ()
#4  0x00000007 in ?? ()
#5  0x00000006 in ?? ()
#6  0x0000000f in ?? ()
#7  0x00000000 in ?? ()

On the second call of test, the program is stopped and gdb prints the stack trace. Normally, main() should come up in Frame #2, but we only see question marks. Recall what I said about the stack layout: the absence of a frame pointer prevents gdb from finding the location of the saved return address in Frame #2.

Options Related To Debugging

Everybody needs to debug his or her code sometimes. When that time comes, usually you fire up gdb, put breakpoints here and there, analyze backtraces, and so on, to pinpoint the exact location of the offending code(s). And what exactly do you get? Assuming you haven't used any debugging option, you likely just get an address pointed to by an EIP register.

The problem is, you don't really want an address. You want gdb or another debugger to simply show the related lines. But gdb can't do that without some kind of hint. This hint, specifically called Debugging With Attributed Record Formats (DWARF), helps you do source-level debugging.

How do you do it? Use -g when you compile into object code, e.g.:

  gcc -o -g test test.c

What does gcc actually add so that the debugger can correlate an address with source code? You need dwarfdump [7] to find out. This tool is packaged inside the "libdwarf" tarball or RPM, so you won't find it as a standalone package. Compile it on your own, or simply install from your distro's online repository; both should work. In this section, I use the 20060614 RPM version.

Using readelf, you notice that there are 28 sections inside the non-debug version of Listing 1:

 $ readelf -S ./non-optimized

But the debug version has 36 sections. The new sections added are:

You don't need to dig into all of the above sections; taking a look into .debug_line is enough for quick observation. The command you need is:

$ /path/to/dwarfdump -l <object file>

Here is an example of what you'll get:

 .debug_line: line number info for a single cu
   Source lines (from CU-DIE at .debug_info offset 11):
 <source>        [row,column]    <pc>    //<new statement or basic block
 /code/./non-optimized.c:  [  3,-1]        0x8048384       // new statement
 /code/./non-optimized.c:  [  5,-1]        0x8048395       // new statement
 ...............

The interpretation of the above messages is quite straightforward. Take the first entry (below the <source> string) as an example:

  line number 3 in file non-optimized.c is located in address 0x8048384.

gdb itself gives the same information:

 $ gdb non-optimized-debugging
   (gdb) l *0x8048384
   0x8048384 is in main (./non-optimized.c:3).

readelf also provides similar information, by using --debug-info:

 $ readelf --debug-dump=line <object file>
   Line Number Statements:
   Extended opcode 2: set Address to 0x8048384
   Special opcode 7: advance Address by 0 to 0x8048384 and Line by 2 to 3
   Advance PC by constant 17 to 0x8048395
   Special opcode 7: advance Address by 0 to 0x8048395 and Line by 2 to 5
   ....

Both readelf and dwarfdump can analyze debug information, so you're free to choose.

What you should realize is that the source code itself isn't embedded into the object file. In fact, the debugger must check a separate source code file. The entry in the <source> column helps determine where to load the source file. Notice that it contains a full path--meaning if the file is moved somewhere or renamed, gdb can't locate it.

gcc itself has the ability to produce many debugging information. Besides DWARF, there are:

Each of the above formats is described in the footnotes ([8], [9], and [10]), but in x86-compatible architecture, without a doubt, you would use DWARF format. The latest DWARF specification is DWARF version 3, and gdb can produce it via -gdwarf-2. This could be misleading for first-time users, because you might think it creates DWARF 2-based information. In fact, it is DWARF 2 coupled with some DWARF 3 features. Not every debugger supports version 3, so use it with caution.

However, things do not always go smoothly. When you combine -O and -g, it is necessary for line information to relate to the actual code in the mentioned address offset. An example can clarify this--take a file (I use Listing 1 again) and compile it:

 $ gcc -O2 -ggdb -o debug-optimized listing-one.c
   $ readelf --debug-dump=line debug-optimized
   ..
   Special opcode 107: advance Address by 7 to 0x80483aa and Line by 4 to 11
   ...

But what does gdb say?

 $ gdb debug-optimized
   (gdb) l *0x80483aa
   0x80483aa is in main (./non-optimized.c:11).
   ...
   11              printf("acc = %lu\n",acc);
   ...
   (gdb) disassemble main
   ...
   0x080483aa <main+26>:   add    $0x6,%edx
   0x080483ad <main+29>:   cmp    $0x1388,%eax
   ...

There you see a complete disagreement. By inspecting debug information alone, you would expect the related address to contain something like a CALL instruction. But in reality, you get ADD and CMP instructions, more likely a loop construction. This is the side effect of the jobs done by optimization--instruction reordering in this case. So as a rule of thumb, don't mix -g (or its variants) with -O.

Options Controlling Compilation Stages

For learning purposes, sometimes you want to know how your source code is transformed into an executable. Fortunately, gcc provides you options to stop at any processing stage. Recall that gcc has several stages to be accomplished--for example, linking. The options are:

-c is mostly used when you have multiple source files and combine them to create the final executable. So, instead of:

$ gcc -o final-binary test1.c test2.c

it would be better to split them as:

$ gcc -c -o test1.o test1.c
$ gcc -c -o test2.o test2.c

and then:

$ gcc -o final-binary ./test1.o ./test1.o

You probably notice the same sequence if you build the program using a Makefile. The advantage of using -c is clear: you only need to recompile the changed source files. The only phase that has to be redone is linking all the object files, and that greatly saves time, especially in large projects. An obvious example of this is the Linux kernel.

-E is useful if you want to see how your code really looks after macros, definitions, and such are expanded. Take Listing 3 as an example.

#include<stdio.h>

#define A 2
#define B 4
#define calculate(a,b) a*a + b*b

void plain_dummy()
{
    printf("Just a dummy\n");
}

static inline justtest()
{
    printf("Hi!\n");
}

int main(int argc, char *argv[])
{
#ifdef TEST
    justtest();
#endif
    printf("%d\n", calculate(A,B));
    return 0;
}

Listing 3. Code contains #define and #ifdef

We compile it like this:

$ gcc -E -o listing2.e listing2.c

Notice that we don't pass any -D parameters, which means TEST is undefined. So what do we have in the preprocessed file?

void plain_dummy()
{
    printf("Just a dummy\n");
}

static inline justtest()
{
    printf("Hi!\n");
}

int main(int argc, char *argv[])
{
    printf("%d\n", 2*2 + 4*4);
    return 0;
}

Where is the call to justtest() inside main()? Nowhere. TEST is undefined--that's why the code is eliminated. You can also see that the calculate() macro is already expanded into multiplication and addition of constant numbers. In final executable form, this number will be replaced with the operation result. As you see, -E is quite handy to double-check the correctness of directives.

Notice that plain_dummy() is still there even though it is never called. No surprise since no compilation happens here, therefore dead code elimination doesn't happen at this stage. stdio.h is also expanded but it isn't shown in the above listing.

I found an interesting application of -E as an HTML authoring tool [11]. In short, it helps you to adopt common programming practices such as code modularization and macros into the HTML world--something that cannot be done using plain HTML coding.

-S gives you assembly code, much like what you see with objdump -d/-D. However, with -S, you still see directives and symbol names, which makes it easier to study the code. For example, a call like printf("%d\n", 20) could be transformed into:

.section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d\n"
...
        movl    $20, 4(%esp)
        movl    $.LC0, (%esp)
        call    printf

You can see that format string %d is placed in a read-only data section (.rodata). Also, you can confirm that arguments are pushed to the stack from right to left, with the format string at the top of the stack.

Conclusion

gcc gives us many useful options to make our code into whatever we like. By understanding what these options really do, we can make the program faster and slimmer. However, do not depend entirely on them: you should pay more attention to writing efficient and well-structured code.

Acknowledgments

I would like to thank the communities in the OFTC chat room (#kernelnewbies and #gcc) and #osdev (Freenode) for their valuable ideas.

References

  1. Wikipedia's article about Preprocessing
  2. Wikipedia's article about Compilation
  3. Wikipedia's article about Assembler
  4. Wikipedia's article about Linker
  5. An example of code reordering using gcc
  6. Frame pointer omission (FPO) optimization and consequences when debugging, Part 1 and Part 2
  7. Explanation of DWARF
  8. Explanation of stabs
  9. Explanation of COFF
  10. Explanation of XCOFF (a COFF variant)
  11. Using a C preprocessor as an HTML authoring tool
  12. gcc online documentation
  13. AMD Athlon Processor x86 Code Optimization Guide

Mulyadi Santosa is a freelance writer who lives in Indonesia.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.