The January 1984 issue of Nibble Magazine
Had this article HYPERCOUNTER
by Ron Macken and Bill Consoli on page 62:
Hypercounter.s:
ORG $300
SCRN EQU $5B8 ;FIRST LOCATION ON THE SCREEN
HOME EQU $FC58
300:20 58 FC JSR HOME ;CLEAR SCREEN
303:A9 B0 LDA #$B0 ;ASC (IN HEX) OF THE DIGIT 0
305:A2 08 LDX #8
307:9D B8 05 NUMCL STA SCRN,X ;SET XTH DIGIT TO ZERO
30A:CA DEX
30B:D0 FA BNE NUMCL ;GOTO NUMCL IF ALL 8 DIGITS AREN'T 0
30D:A2 08 COUNT LDX #8
30F:A9 B9 CHECK LDA #$B9 ; WILL THIS DIGIT BE GREATER
311:DD B8 05 CMP SCRN,X ;THAN 9 IF WE INCREMENT IT?
314:F0 06 BEQ KICK ;YES? THEN IT NEEDS TO BE KICKED OVER
316:FE B8 05 INC SCRN,X ;ACTUAL COUNTING IS DONE HERE
319:4C 0D 03 JMP COUNT ;START OVER
31C:A9 B0 KICK LDA #$B0 ;KICKS DIGIT OVER FROM 9 TO 0
31E:9D B8 05 STA SCRN,X ;PUTS THE DIGIT ON THE SCREEN
321:CA DEX
222:D0 EB BNE CHECK ;GO BACK AND DO IT AGAIN
324:60 RTS ;WE'VE REACHED 99,999,999 !
Which generates this binary:
300:20 58 FC A9 B0 A2 08 9D
308:B8 05 CA D0 FA A2 08 A9
310:B9 DD B8 05 F0 06 FE B8
318:05 4C 0D 03 A9 B0 9D B8
320:05 CA D0 EB 60
Unfortunately it is S-L-O-W and buggy. :-/
- The Y-Reg is unused. We could cache two constants in registers:
- A = $B9 (
9
) and - Y = $B0 (
0
) instead of constantly reloading A on $30F and $31C
- A = $B9 (
- It uses the slow 7 cycle
INC $abs,X
. This is done 1,000,000 times. Faster is the 4 cycleINC $abs
- It has an off-by-one bug -- it doesn't actually display 100,000,000. The last number two numbers output are:
- 99,999,999
- 00,000,000
Other problems include:
- It includes your typical "No Shit, Sherlock!" crappy verbose commenting.
We are going to change the range above (0 - 99,999,999) to (1 - 1,000,000) since on c.s.a2 this similar problem was asked:
Can anyone provide the quickest native machine language routine equivalent to
FOR I = 1 TO 1E6: PRINT I: NEXT
Now, the fastest way to count from 1 to 1,000,000 is to only change the bytes that actually change from x to x+1.
This means we can remove ALL compares and branching!
We can write a C program to spit out 6502 assembly for us:
for( i = 0; i < end; i++ )
{
int x = NUM-1;
char before[ NUM+1 ];
char after [ NUM+1 ];
sprintf( before, "%0*d", NUM, i+0 );
sprintf( after , "%0*d", NUM, i+1 );
while( x > 0 )
{
int delta = after[x] - before[x];
if( delta == 0 )
; // skip digit
else
if( delta == 1 )
{
if( (before[x] == '0') && (x != 1) )
one( x );
else
if( (before[x] == '1') && (x != 1) )
two( x );
else
inc( x );
}
else
if( delta == -9 )
zer( x );
x--;
}
You'll notice there are 4 variations of x++. Why 4? Consider what happens to each digit:
- 0 -> 1
- 1 -> 2
- 9 -> 0
- remaining 2,3,4,5,6,7,8
Since the 6502 has 3 registers, A, X, and Y we have 3 specializations where we can cache the constant values in registers:
- X = 0
- Y = 1
- A = 2
To also keep things focused on a pure benchmark we also remove that clear screen.
Depending on how much we want to loop-unroll we get this tradeoff:
End | Memory | Size |
---|---|---|
1000 | $1568 | < 1 KB |
10000 | $8A98 | ~ 33 KB |
100000 | $51E75 | >325 KB |
1000000 | $32DCB7 | > 3 MB (3,333,330 bytes) |
We'll unroll the first 10,000 numbers so it fits into the 48KB of RAM.
With a little bit of setup code ...
digit = $401 ; VTAB 1:HTAB 2
ORG $800
Prologue
LDA #'2' + $80 ; +2 = 2
LDX #'0' + $80 ; +2 = 4
LDY #'1' + $80 ; A,BCD,EFG ; +2 = 6
STX digit - 1 ; 0,BCD,EFG ; +4 = 10
STX digit + 0 ; 0,0CD,EFG ; +4 = 14
STX digit + 1 ; 0,00D,EFG ; +4 = 18
STX digit + 2 ; 0,000,EFG ; +4 = 22
STX digit + 3 ; 0,000,0FG ; +4 = 26
STX digit + 4 ; 0,000,00G ; +4 = 30
; STX digit + 5 ; 0,000,000 ; +4 = 34 ; <-- not needed since Count_10_000 sets to '1'
Work
JSR Count_100_000 ; 100000
JSR Count_100_000 ; 200000
JSR Count_100_000 ; 300000
JSR Count_100_000 ; 400000
JSR Count_100_000 ; 500000
JSR Count_100_000 ; 600000
JSR Count_100_000 ; 700000
JSR Count_100_000 ; 800000
JSR Count_100_000 ; 900000
JSR Count_100_000 ; 000000
Epilogue ;==10
STX digit + 0 ; 000000
STY digit - 1 ;1000000
RTS
Count_100_000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 00000
STX digit + 1 ; 000000
INC digit + 0 ; 100000
RTS
Count_10_000
... Bob's your uncle.
Is that the best we can do?
No, we can continue unrolling even more!
; 100000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 0:0000
STX digit + 1 ; 000000
INC digit + 0 ; 100000
; 200000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 1:0000
STX digit + 1 ; 100000
INC digit + 0 ; 200000
; 300000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 2:0000
STX digit + 1 ; 200000
INC digit + 0 ; 300000
; 400000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 3:0000
STX digit + 1 ; 400000
INC digit + 0 ; 400000
; 500000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 4:0000
STX digit + 1 ; 400000
INC digit + 0 ; 500000
; 600000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 5:0000
STX digit + 1 ; 500000
INC digit + 0 ; 600000
; 700000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 6:0000
STX digit + 1 ; 600000
INC digit + 0 ; 700000
; 800000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 7:0000
STX digit + 1 ; 700000
INC digit + 0 ; 800000
; 900000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 8:0000
STX digit + 1 ; 800000
INC digit + 0 ; 900000
;1000000
JSR Count_10_000 ; 010000
JSR Count_10_000 ; 020000
JSR Count_10_000 ; 030000
JSR Count_10_000 ; 040000
JSR Count_10_000 ; 050000
JSR Count_10_000 ; 060000
JSR Count_10_000 ; 070000
JSR Count_10_000 ; 080000
JSR Count_10_000 ; 090000
JSR Count_10_000 ; 9:0000
STX digit + 1 ; 900000 ; +4 = 64
STX digit + 0 ; 000000 ; +4 = 68
Epilogue ;==10
STY digit - 1 ;1000000 ; +4
RTS ; +6
Prologue = +30
Work = +730
Count_10_000 = +6,000,600
Epilogue = +10
==================
6,001,370 cycles
So what is the theoritical fastest case if we display every number?
Loop unrolling all 1,000,000 we have this timing:
A) We would have 30 cycles of prologue code:
digit = $401 ; VTAB 1:HTAB 2
ORG $0
Prologue
LDA #'2' + $80 ; +2 = 2
LDX #'0' + $80 ; +2 = 4
LDY #'1' + $80 ; A,BCD,EFG ; +2 = 6
STX digit - 1 ; 0,BCD,EFG ; +4 = 10
STX digit + 0 ; 0,0CD,EFG ; +4 = 14
STX digit + 1 ; 0,00D,EFG ; +4 = 18
STX digit + 2 ; 0,000,EFG ; +4 = 22
STX digit + 3 ; 0,000,0FG ; +4 = 26
STX digit + 4 ; 0,000,00G ; +4 = 30
B) 4 cycles epilogue:
STY digit - 1 ;1000000
C) And 5,999,996 of counting from 1 to 1,000,000 with these stats:
- SIZE = $32DCB7 (3,333,330 bytes)
- CYCLES = 5,999,996
For a best case of:
= 30 + 4 + 5,999,996 = 6,000,030 cycles.
You can verify yourself with unrolled.1000000.s
How well did we do?
= 6,000,030 / 6,001,370 * 100 = 99.97% of peak performance.
Not bad!
Who knew counting could be so complicated! :-)
QED.