At work, I am developing for a Coldfire V1 processor using Freescale’s Codewarrior development environment. Sometimes when digging into the generated assembly, I notice that the compiler seems to have made strange choices about register use and stack management. I mentioned one particularly strange sequence to a friend, who asked about the optimization settings of the compiler. He said that the unoptimized actions of the compiler are very mechanical. To use a variable in a function, it is read from the stack, changed, and written back to the stack. If it’s used again, it will be read back from the stack and written again when it’s done. He mentioned that register coloring is used to improve this behavior by keeping values in registers instead of always syncing them to memory. I didn’t know much about optimization, so I looked into it.
I went back and looked at some code that the compiler should be able to optimize down to a single bit set. The code is here, but you can take my word for it that all this is doing is setting a bit:
#define SCG_BASE (0xFFFF8108)
#define SCG_REG ((u8 *)SCG_BASE)
#define CLOCK_REG_IDX(num) ((num) >> 3)
#define CLOCK_BIT_NUM(num) (1 << ((num) & 0x07))
#define CLOCK_ENABLE(num) (SCG_REG[CLOCK_REG_IDX(num)] \
|= CLOCK_BIT_NUM(num))
This code allows me to enumerate a list of clocks in a particular order, then turn them on using just their enumerated values. The macros allow this to be resolved at compile time to a known bit set at an absolute address.
When run with no optimiation, the compiler generated this:
; 931: CLOCK_ENABLE (CLOCK_RTC); ; 0x0000B0F6 0x717C8109 mvs.w #-32503,d0 0x0000B0FA 0x2D40FF6C move.l d0,-148(a6) 0x0000B0FE 0x226EFF6C movea.l -148(a6),a1 0x0000B102 0x206EFF6C movea.l -148(a6),a0 0x0000B106 0x1010 move.b (a0),d0 0x0000B108 0x08C00002 bset #2,d0 0x0000B10C 0x1280 move.b d0,(a1)
A quick coldfire assembly primer: d0-d7 are general purpose data registers, a0-a7 are general purpose address registers, and the parentheses around a register is an indexed operation. (a0) uses the value pointed to by the address in a0, and -148(a6) reads the value located 148 bytes below the address stored in a6. In a move operation, the source is first and the destination second. a6 is a stack frame pointer.
So the absolute address that we want to modify is loaded into d0 and transferred onto the stack. The value is then loaded from the stack into a0 and a1. The current value at this address is loaded into d0, the bit is set, and the value is written back to the address. The compiler doesn't even trust itself to not destroy the address between reading and writing, so it loads it into both a0 and a1 and uses one for the read and the other for the write. This is hardly the efficiency I was hoping for, but the compiler can't really be blamed for this when optimizations are completely off.
Here is the assembly generated with optimization set at 2:
; 931: CLOCK_ENABLE (CLOCK_RTC); ; 932: ; 0x0000A046 0x717C8109 mvs.w #-32503,d0 0x0000A04A 0x2D40FF6C move.l d0,-148(a6) 0x0000A04E 0x2240 movea.l d0,a1 0x0000A050 0x2040 movea.l d0,a0 0x0000A052 0x1010 move.b (a0),d0 0x0000A054 0x08C00002 bset #2,d0 0x0000A058 0x1280 move.b d0,(a1)
Hmph. Not exactly a stellar attempt at optimization. I am used to GCC, where an optimiation level of 2 is a collection of optimizations that are applied. It's an easy way to say "I want some optimization, but let's not go crazy." To the Freescale compiler it apparently means "Take the day off."
However, my friend had mentioned register coloring. I had seen a setting mentioning this before and went to find it. Strangely, this is not in the same place in the IDE as the optimizations, but I found it and saw it was turned off. After turning it on, here's what I got:
; 931: CLOCK_ENABLE (CLOCK_RTC); ; 932: ; 0x0000839E 0x717C8109 mvs.w #-32503,d0 0x000083A2 0x2040 movea.l d0,a0 0x000083A4 0x1010 move.b (a0),d0 0x000083A6 0x08C00002 bset #2,d0 0x000083AA 0x1080 move.b d0,(a0)
OK, better! The spark has sizzled out of our love affair with the stack, and we can keep the values in registers. Also, we know that we're not trashing a0 and can use it again later when we store the value back.
On the same page as the register coloring I also saw peephole optimization, which is an optimization pass that tries to use specific instructions/features of the target processor to improve performance. This sounds like something else I'd be interested in. Here is level 2 optimization with coloring and peepholes:
; 931: CLOCK_ENABLE (CLOCK_RTC); ; 932: ; 0x00007EAE 0xA540 mov3q #2,d0 0x00007EB0 0x01F88109 bset d0,0xffff8109
Great! It's almost like we know what we're doing here. We load the number of the bit we want set into a register, then call the bitset operation. The only thing better would be to use an immediate value for the bit to set to avoid loading it first. I tried to write the assembly myself to specify an immediate value to the bset operation:
bset #2,0xffff8109
But this doesn't assemble. It looks like the hardware doesn't support this addressing mode. So the result from the O2, coloring, peephole settings is the best we could do!
This was a particularly useful find for me, as I had briefly tried optimization on this compiler before and decided it wasn't very useful. I was starting to worry about the total size of my application, so I turned on O2. The results were basically meaningless, which isn't a surprise after seeing what O2 did in the above example. The size of my current complete binary with O0 is about 110K. With O2 but no coloring or peephole, it is about 100K. However, O2 with coloring and peephole is about 81K, which is a great improvement.