Sunday, 21 September 2008

Optimising Assembly Like an 80's Hacker

Forget about fancy algorithms and data structures. If you want respect as an 80's hacker, follow these simple tips.

Never get caught setting a register to zero without using xor:

Z80 Code
ld a,0           ; bad, 2 bytes / 7 cycles

xor a            ; good, 1 byte / 4 cycles

8088 Code
mov ax,0         ; bad, 3 bytes / 4 cycles

xor ax,ax        ; good, 2 bytes / 3 cycles
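
Why the trick works: any value XORed with itself is zero, whatever the register held before. Here is a rough Python sketch of that (the function name `xor_self` and the `bits` parameter are made up for illustration, they are not part of any assembler). One thing the byte counts hide: unlike `ld` and `mov`, `xor` also rewrites the flags, so the swap is only safe when nothing afterwards depends on them.

```python
def xor_self(value, bits=8):
    """Model `xor r,r`: XOR a register with itself, truncated to `bits` wide."""
    mask = (1 << bits) - 1
    return (value ^ value) & mask


if __name__ == "__main__":
    # Every possible 8-bit accumulator value ends up as zero.
    print(all(xor_self(a) == 0 for a in range(256)))
    # The same holds for a 16-bit register such as AX.
    print(xor_self(0xBEEF, bits=16))
```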

Never set two 8-bit registers independently. Code readability is not required:

Z80 Code
ld b,10          ; bad, 4 bytes / 14 cycles
ld c,32

ld bc,10*256+32  ; good, 3 bytes / 10 cycles

8088 Code
mov ch,10        ; bad, 4 bytes / 8 cycles
mov cl,32

mov cx,10*256+32 ; good, 3 bytes / 4 cycles
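
The `10*256+32` in the combined load is just the two bytes glued together: the high register's value moved up by 8 bits, plus the low register's value. A rough Python sketch of that arithmetic, assuming both values fit in 8 bits (the helper names `pack_pair` and `split_pair` are made up for illustration):

```python
def pack_pair(high, low):
    """Combine two 8-bit values into one 16-bit value, high byte first."""
    if not (0 <= high <= 0xFF and 0 <= low <= 0xFF):
        raise ValueError("both values must fit in 8 bits")
    return high * 256 + low


def split_pair(value):
    """Split a 16-bit value back into its (high, low) bytes."""
    return (value >> 8) & 0xFF, value & 0xFF


if __name__ == "__main__":
    bc = pack_pair(10, 32)   # same constant as `ld bc,10*256+32`
    print(hex(bc))           # 0xa20
    print(split_pair(bc))    # (10, 32)
```

So after the single 16-bit load, the high half (B or CH) holds 10 and the low half (C or CL) holds 32, exactly as the two separate loads would leave them.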

Never compare to zero:

Z80 Code
cp 0             ; bad, 2 bytes / 7 cycles

or a             ; good, 1 byte / 4 cycles

8088 Code
cmp ax,0         ; bad, 3 bytes / 4 cycles

test ax,ax       ; good, 2 bytes / 3 cycles
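
The replacement works because ORing (or ANDing, in the case of `test`) a register with itself leaves the value alone and sets the zero flag exactly when the value is zero, which is the same answer a compare against zero gives. A small Python model of just that zero flag, as a sketch (the function names are made up for illustration, and the other flags, some of which do differ between the two instructions, are deliberately not modelled):

```python
def zero_flag_after_cp0(a):
    """Model the Z flag after `cp 0`: compute a - 0 and throw the result away."""
    return ((a - 0) & 0xFF) == 0


def zero_flag_after_or_a(a):
    """Model the Z flag after `or a`: compute a | a, which leaves a unchanged."""
    return ((a | a) & 0xFF) == 0


if __name__ == "__main__":
    # The two instructions agree on the zero flag for every 8-bit value.
    print(all(zero_flag_after_cp0(a) == zero_flag_after_or_a(a)
              for a in range(256)))
```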

Remember, you don't need to worry about code alignment, order of instructions or processor penalties. Follow these simple tips and your super-optimised bubble sort will demand the utmost respect!


  1. Great Post!

    I was sorting my office out the other day and came across my first "useful" program which was written completely in assembler! I'll be putting it on my blog in the near future....! :)


  2. I try and tell the "kids" of today about saving cycles and memory constraints, but they just don't get it.

  3. But don't forget: premature optimization is the root of all evil.

  4. So who said what to anger you into writing this post?

  5. Wow, hacking and phreaking in the 80s was so much fun. Now everything is a federal offense so why bother!


  6. Most of the time such low-level optimization is exaggerated. Better to spend more time on software design.

  7. To the guys saying "Most of the time such low-level optimization is exaggerated" or stuff like that, remember that the compiler _is_ using xor instead of mov and stuff like that on platforms where it matters.

  8. Optimizing Z80 and 8088 assembler was how I spent a great deal of my time in the 1980s. Still very useful when you need to program single-chip systems and microcontrollers.

  9. Another one: never have a JSR followed by an RTS. Use a JMP instead, saving you lots of cycles on a 6502 (and I'd guess the same with the equivalent 8086 instructions).

  10. This kind of optimization doesn't require additional efforts, you just use these operations instead of the others. That's it!
    Try to disassemble any C/Delphi etc. program, you won't EVER see "mov ax,0"!
    You'll just get used to reading "xor ax,ax" as "ax=0".

  11. There are some who belittle this type of optimization, but they are often the same ones who produce huge bloated code that wipes out 1MB L2 caches and slows 2GHz machines to a crawl.

    For a project similar in essence to your ideal, google for "fbui".

  12. @most commentators:

    Didn't you read the last paragraph... or didn't you get the irony in it?

  13. Well I think most of us got it.
    Just, I was replying to the comments, not to the article itself ;)

  14. What WAS the point of the last paragraph? Making fun of programmers in the eighties for not optimizing for processor features they didn't have? Making fun of programmers now for writing assembly code like it was still the eighties? Neither makes one lick of sense.

  15. Anonymous (is that your real name?), I think he was poking fun at the ridiculousness of the article itself. I.e. the article contains some cool tricks, but the gains tend to pale into insignificance beside the larger bottlenecks one tends to find.

  16. Don't we have optimizing compilers for these kinds of things nowadays?

  17. Optimizing Assembly is so passe...

  18. I remember having to do that!

    Now I can usually trust any decent 'C' compiler to do most of the job and the low cost of high performance silicon to do the rest.

    I recently had to code an 8051 emulator, cross platform, but optimised for MIPS. I was surprised at how much performance I gained (~50%) after examining the object code and implementing a few simple tweaks (minimal inline asm, mainly type-forcing).

    It's fast enough now, but I keep looking at the code, knowing that I could get another 200% by hand coding in asm.

    Also - Some may accuse me of heresy, but MSC (7 or 8) produces *much* better 80x86 code than GCC.

  19. The hell kind of 80s hacker doesn't code on a 6502?

  20. 6502 -

    Apple II - too trendy, too expensive

    Commodore 64 - too WalMart, too limited

    Plus the Z80 was a much better processor.

  21. 8031/8051 and 8048, the worst instruction sets of all time!

  22. Another 6502 tip - use "zero page" for frequently accessed variables. One fewer clock cycle for the lookup. (50% savings! w00t!)

    LDA $1001 ;2 cycle fetch
    LDA $01 ;1 cycle fetch


  23. well, then 8052 FTW!!!

  24. Shall we play a game?

  25. only an 8 bit game then

  26. If you're a bit interested in compiler's source optimization:
    Gist: fast code = important; readable code = more important

  27. 6809 rules. Has a direct page instead of a zero page.

  28. Another nice trick was to use shl/shr to multiply/divide by a power of 2.

  29. Sad thing is that I know most of these tricks but I don't have much luck keeping the stack aligned. Anyone have some advice for me?

  30. I guess a good question would be why program in assembler? When I first started programming, computers were 16x16 and $10,000 (Model 80), so it can't be for speed (64x since then with a good C compiler running only 2x behind). I think it's for the thrill of direct control (down to the engine room).

  31. Mikkel Alan Stokkebye Christiansen, 3 December 2009 at 08:54

    LD bc,nn is 10 cycles if I'm not mistaken.
    Your timing for the 8088 counts only the cycles spent executing inside the CPU, not the actual cycles the instruction takes once fetching it is included.
    0      1      2      3      4      5      6      Prefetched
    16:1   12:1   8:1    4:1    4:2    4:3    4:4    MOV AX,0
    11:¾   7:¾    3:¾    3:1¾   3:2¾   3:3¾   3:4¾   XOR AX,AX
    20:1   16:1   12:1   8:1    4:1    4:2    4:3    MOV CH,10!MOV CL,32
    16:1   12:1   8:1    4:1    4:2    4:3    4:4    MOV CX,10*256+32
    16:1   12:1   8:1    4:1    4:2    4:3    4:4    CMP AX,0
    11:¾   7:¾    3:¾    3:1¾   3:2¾   3:3¾   3:4¾   TEST AX,AX
    The number before the colon is the cycle count; the number after it is the number of bytes in the prefetch queue. ¾ means that it will take 1 more cycle to finish reading the next byte.

  32. 6809 was the best designed 8-bit processor ever made!

    When writing ROM code for the Tandy CoCo, a great optimization was:
    PSHS A,B
    ... rest of subroutine ...
    PULS A,B,PC ; register restore + RTS in 1 instruction!

  33. Also, the 6809 is a great Forth processor as it has two stack pointers.

