A Journey into Optimization

At work, I am developing for a Coldfire V1 processor using Freescale’s Codewarrior development environment. Sometimes when digging into the generated assembly, I notice that the compiler seems to have made strange choices about register use and stack management. I mentioned one particularly strange sequence to a friend, who asked about the optimization settings of the compiler. He said that the unoptimized actions of the compiler are very mechanical. To use a variable in a function, it is read from the stack, changed, and written back to the stack. If it’s used again, it will be read back from the stack and written again when it’s done. He mentioned that register coloring is used to improve this behavior by keeping values in registers instead of always syncing them to memory. I didn’t know much about optimization, so I looked into it.

I went back and looked at some code that the compiler should be able to optimize down to a single bit set. The code is here, but you can take my word for it that all this is doing is setting a bit:

#define SCG_BASE                        (0xFFFF8108)

#define SCG_REG                         ((u8 *)SCG_BASE)
#define CLOCK_REG_IDX(num)              ((num) >> 3)
#define CLOCK_BIT_NUM(num)              (1 << ((num) & 0x07))

#define CLOCK_ENABLE(num)               (SCG_REG[CLOCK_REG_IDX(num)] \
                                              |= CLOCK_BIT_NUM(num))

This code allows me to enumerate a list of clocks in a particular order, then turn them on using just their enumerated values. The macros allow this to be resolved at compile time to a known bit set at an absolute address.

When run with no optimiation, the compiler generated this:

;  931:     CLOCK_ENABLE (CLOCK_RTC);
;
0x0000B0F6  0x717C8109               mvs.w    #-32503,d0
0x0000B0FA  0x2D40FF6C               move.l   d0,-148(a6)
0x0000B0FE  0x226EFF6C               movea.l  -148(a6),a1
0x0000B102  0x206EFF6C               movea.l  -148(a6),a0
0x0000B106  0x1010                   move.b   (a0),d0
0x0000B108  0x08C00002               bset     #2,d0
0x0000B10C  0x1280                   move.b   d0,(a1)

A quick coldfire assembly primer: d0-d7 are general purpose data registers, a0-a7 are general purpose address registers, and the parentheses around a register is an indexed operation. (a0) uses the value pointed to by the address in a0, and -148(a6) reads the value located 148 bytes below the address stored in a6. In a move operation, the source is first and the destination second. a6 is a stack frame pointer.

So the absolute address that we want to modify is loaded into d0 and transferred onto the stack. The value is then loaded from the stack into a0 and a1. The current value at this address is loaded into d0, the bit is set, and the value is written back to the address. The compiler doesn't even trust itself to not destroy the address between reading and writing, so it loads it into both a0 and a1 and uses one for the read and the other for the write. This is hardly the efficiency I was hoping for, but the compiler can't really be blamed for this when optimizations are completely off.

Here is the assembly generated with optimization set at 2:

;  931:     CLOCK_ENABLE (CLOCK_RTC);
;  932:     
;
0x0000A046  0x717C8109               mvs.w    #-32503,d0
0x0000A04A  0x2D40FF6C               move.l   d0,-148(a6)
0x0000A04E  0x2240                   movea.l  d0,a1
0x0000A050  0x2040                   movea.l  d0,a0
0x0000A052  0x1010                   move.b   (a0),d0
0x0000A054  0x08C00002               bset     #2,d0
0x0000A058  0x1280                   move.b   d0,(a1)

Hmph. Not exactly a stellar attempt at optimization. I am used to GCC, where an optimiation level of 2 is a collection of optimizations that are applied. It's an easy way to say "I want some optimization, but let's not go crazy." To the Freescale compiler it apparently means "Take the day off."

However, my friend had mentioned register coloring. I had seen a setting mentioning this before and went to find it. Strangely, this is not in the same place in the IDE as the optimizations, but I found it and saw it was turned off. After turning it on, here's what I got:

;  931:     CLOCK_ENABLE (CLOCK_RTC);
;  932:     
;
0x0000839E  0x717C8109               mvs.w    #-32503,d0
0x000083A2  0x2040                   movea.l  d0,a0
0x000083A4  0x1010                   move.b   (a0),d0
0x000083A6  0x08C00002               bset     #2,d0
0x000083AA  0x1080                   move.b   d0,(a0)

OK, better! The spark has sizzled out of our love affair with the stack, and we can keep the values in registers. Also, we know that we're not trashing a0 and can use it again later when we store the value back.

On the same page as the register coloring I also saw peephole optimization, which is an optimization pass that tries to use specific instructions/features of the target processor to improve performance. This sounds like something else I'd be interested in. Here is level 2 optimization with coloring and peepholes:

;  931:     CLOCK_ENABLE (CLOCK_RTC);
;  932:     
;
0x00007EAE  0xA540                   mov3q    #2,d0
0x00007EB0  0x01F88109               bset     d0,0xffff8109

Great! It's almost like we know what we're doing here. We load the number of the bit we want set into a register, then call the bitset operation. The only thing better would be to use an immediate value for the bit to set to avoid loading it first. I tried to write the assembly myself to specify an immediate value to the bset operation:

bset      #2,0xffff8109

But this doesn't assemble. It looks like the hardware doesn't support this addressing mode. So the result from the O2, coloring, peephole settings is the best we could do!

This was a particularly useful find for me, as I had briefly tried optimization on this compiler before and decided it wasn't very useful. I was starting to worry about the total size of my application, so I turned on O2. The results were basically meaningless, which isn't a surprise after seeing what O2 did in the above example. The size of my current complete binary with O0 is about 110K. With O2 but no coloring or peephole, it is about 100K. However, O2 with coloring and peephole is about 81K, which is a great improvement.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>