This answer ended up much longer than I planned, like a tutorial on writing effective asm. that is, how to make a difficult task difficult.
In case anyone is interested in reviewing the code of the attempted implementation and the version with a lot of asm tricks:
There are so many small ways it could be better, for example. keep 5 in bh and 3 in bl . You do not always need to use div bl . AMD64 has 20 single-byte registers. (al / ah, bl / bh, cl / ch, dl / dh (without REX) and sil, dil, ... r15b (REX required)).
Using a 16-bit counter is at least a loss of bytes (operand-size prefixes) and can slow down. Using mov reg,0 bad . If possible, put a conditional branch at the bottom of the loop.
mov rax, 1 is the rejection of command bytes compared to mov eax, 1 , and it is labeled yasm , which does not optimize this for you during build. ( nasm ). Configuring 64-bit registers and then using int 0x80 32-bit ABI compatibility is even dumber.
Storing a 16-bit counter in memory is stupid in the first place, but storing it to the address where you reserved only one byte poses problems.
Apart from the little things, FizzBuzz(3,5) is small enough to expand and completely exclude the div part. With assembly macros, you can easily create a fully expanded loop with LCM outputs (fizz, buzz) for each loop (i.e. 15 in this case); enough for the pattern to repeat, so you don't need the conventions.
You can avoid the div without expanding with down-counters to detect count%5==0 and count%3==0 . @anatolyg 16-bit DOS golf code FizzBuzz does this . This is a very common technique that does something every N times when something else happens. for example, performance counter events work this way.
Here is my attempt at efficient FizzBuzz (for AMD64 Linux), without using libraries. only write(2) and exit_group(2)
There is no compiler, so if you need good code, you have to write it yourself. You cannot just hope that the compiler does something good with i%3 in a loop (this is not the case for most compilers ).
This code has changed a lot when I wrote it. As usual, starting to implement one of the methods gives you better ideas when you see that your first idea requires more or slower instructions than you hoped for.
I deployed 3 (Fizz) to remove all counter%3 checks. I processed counter%5 checks with a countdown of 5 instead of division. This still takes quite a bit of logic, which will disappear with full sweep to the point where the pattern repeats (LCM (3,5)). Integer ASCII code number can be in a function or embedded in a deployed loop for very bloated code.
I store everything in registers (including the constants fizz\n and buzz\n ). There are no downloads and is only stored in the buffer. Many registers are set once outside the loop, rather than with mov -immediate before use. It requires good comments to keep track of what you put there.
I add characters to the buffer that we write(2) after each line of fizzbuzz\n . This is the longest loop that naturally occurs in program logic, and means that we only need syscall code in one place.
In a real program that can write a file or pipe, it would be better to use the C stdio strategy to use a much larger buffer in this case. (Many entries of ~ 100 bytes are much worse than writing less than 4096B.) However, I thought it was an interesting choice between traditional printf at each iteration or accumulating the entire line into one large buffer. I used a static buffer instead of reserving stack space because I am writing an entire program, not a function, which should avoid memory loss after returning. In addition, it allows me to use the 32-bit operand size to increment the pointer in order to preserve code bytes (REX prefixes).
It would be fairly easy to assemble a few blocks until you reach the point where the next group can go past the end of the buffer. those. compare the current position with buffer_end - BUZZMOD*FIZZMOD*9 . Optimizing system I / O calls is obviously a wide topic, and this version is enough to demonstrate the accumulation of a line in the buffer.
Here's how it works:
$ strace ./fizzbuzz > /dev/null execve("./fizzbuzz", ["./fizzbuzz"], [/* 69 vars */]) = 0 write(1, "1\n2\nfizz\n4\nbuzz\nfizz\n7\n8\nfizz\nbu"..., 58) = 58 write(1, "16\n17\nfizz\n19\nbuzz\nfizz\n22\n23\nfi"..., 63) = 63 write(1, "31\n32\nfizz\n34\nbuzz\nfizz\n37\n38\nfi"..., 63) = 63 write(1, "46\n47\nfizz\n49\nbuzz\nfizz\n52\n53\nfi"..., 63) = 63 write(1, "61\n62\nfizz\n64\nbuzz\nfizz\n67\n68\nfi"..., 63) = 63 write(1, "76\n77\nfizz\n79\nbuzz\nfizz\n82\n83\nfi"..., 63) = 63 write(1, "91\n92\nfizz\n94\nbuzz\nfizz\n97\n98\nfi"..., 40) = 40 exit_group(0) = ?
Validation :
./fizzbuzz | diff - <(perl -E'say((fizz)[$_%3].(buzz)[$_%5]or$_)for+1..100') # no output = no difference
Deploying through Buzz (5) and using a counter for Fizz is likely to be worse. My version has a 64-bit store fizz\n\0\0\0 , and then a branch to decide if buzz\n\0\0\0 should overlap to create fizzbuzz\n . Another way is to have a branch to decide whether to store fizz (no new line required, so it can be 32-bit storage). Then it will unconditionally save buzz\n\0\0\0 . However, if FIZZMOD is less than BUZZMOD, this means more frequent resetting the counter and more checks to see if we need to print a number, not a string, this iteration. Having every third line, known as fizz\n or fizzbuzz\n , means that simpler code for this is executed more often.
If overlapping repositories are a problem, this whole algorithm is screwed up, and this is just one of many. Alternatively, we could just fork out to store fizz\n and add 5. Then in the case of fizzbuzz\n we make two stores and add 9. This also separates dec / jcc from cmp / jcc at the bottom of main_loop , so they can hope that both macro fuses on pre-Haswell. IIRC, some processors have industry predictors that really dislike several branches that are very close to each other.
Further improvements left as an exercise for the reader:
Inline buzz_or_number and possibly turn it into a loop (iterates FIZZMOD-1)
In addition, he probably could have disabled less and other minor improvements. This is kind of version 1.1: working, tested, with some comments and observations added when writing this answer, but not really improving the code, due to the fact that I initially decided it was good enough to make sure that it worked .
/ li>Make it more flexible by writing a cleanup loop (or assembler macros) for the last lines of LASTCOUNT % FIZZMOD , rather than assuming it's 1 line. The cleanup code is a flaw to deploy.
I used a div at 10 to convert the counter to a string. A better implementation would use the multiplicative inverse, as compilers generate for small constant dividers (implemented in this case with LEA) .
Another strategy worth considering is simplification to increase the sequence of ASCII digits (stored in a register). This technique will be more difficult to extend to numbers with more numbers. Keeping them in print order (the most significant digit in the low byte) makes a transfer between numbers that work against us, and not for us. (for example, if they were in the natural order, you could add eax, 256-10 correct the low digit and increase the digit using hyphenation). It might be worth saving it that way, but BSWAP is for storage. Saving \n is built-in to the register, so it takes only one store, maybe not worth it. Detecting and processing a 1-digit number that becomes a 2-digit number is bad enough.
In 32-bit mode, we could use the AAA instruction to perform decimal portability after incrementing. However, despite the mnemonics, it works on BCD ( 0-9 ), and not on ASCII ( '0'-'9' ), and it seems that it does not allow you to easily transfer the transfer to the 3rd digit. No wonder AMD removed it for AMD64. It checks the AF flag to detect low 4 bit execution, but it is only useful for DAA , where you have two BCD digits packed in one byte, and when you add unknown values, not increase. In this case, you just check al >= 10 .
My first version of this almost worked for the first time (after fixing a couple of syntax errors to get it together, and a stupid crash that took a couple of minutes to debug IIRC): she printed fizz\nbuzz\n in case of fizzbuzz\n , and he changed the numbers . I keep forgetting that digital strings need to be stored with the most significant digit first, and not as bytes in binary binary bits.
Alternative ways of doing things
I decided not to use a non-redistributable version of the code with 1 digit and 2-digit ASCII conversion code , as it took a lot of instructions. In addition, the branch should predict very well.
;; Untested buzz_or_number: ... .no_buzz: ... div r10b DECIMAL_TO_ASCII_NEWLINE_2DIGIT equ 0x0a3030 ; add '0' to two unpacked decimal digits, and a newline DECIMAL_TO_ASCII_NEWLINE_1DIGIT equ 0x000a30 ;; hoist this out of the loop: mov r15d, DECIMAL_TO_ASCII_NEWLINE_2DIGIT - DECIMAL_TO_ASCII_NEWLINE_1DIGIT xor ecx,ecx cmp ah, 1 ; set CF if ah=0 (1 digit number), otherwise clear it. This allows sbb for a conditional add, instead of setcc cmovae ecx, r15d ; 0 or the difference from 1digit to 2digit lea eax, [rax+rcx + DECIMAL_TO_ASCII_NEWLINE_1DIGIT] ; rax+=0x0a3030 or 0x000a30, without clobbering flags mov [rdx], eax sbb edx, -3 ; add 2 (-(-3) - 1) or 3. ret
In 32-bit (and 16-bit) mode, there is a div command that takes an immediate operand and uses AL as a dividend, not AX . It is called AAM , and was removed for AMD64 along with other BCD / ASCII instructions. This is convenient for checking divisibility by 5 without register binding for the divisor or losing a command inside the loop. It is slightly faster than the div r/m8 , and sets the flags according to the remainder (in AL : its reversed outputs compared to the div ).
Anatolyg golfed FizzBuzz uses AAM in a loop with shr ax, 8 to generate one digit at a time in the reverse order, save and reduce the pointer.
This version is more complicated because it checks the number% 5 with AAM and then processes it at count% 10 instead of doing a separate separation to get the ASCII digits.
;; Untested buzz_or_number_div: mov eax, ebx aam 5 ; al = al%5 ah = al/5. (opposite locations from div), and sets flags according to the remainder. jz print_buzz ; tailcall ; fall through into print_counter ;print_counter: ; maybe use the result of div by 5 to get division by 10? ; shifting the low bit of the quotient into bit 4 of the remainder should be faster than dividing again. ;; after AAM: ah = 5bit quotient (qqqqQ), al = 3bit remainder(RRR) ;; starting point: ; AX = [ 000qqqqQ 00000RRR ] ;; desired = byte swapped as well: [ 0000QRRR 0000qqqq ] shl al, 5 ; AX = [ 000qqqqQ RRR00000 ] shr ax, 1 ; AX = [ 0000qqqq QRRR0000 ] ror ax, 8 ; AX = [ QRRR0000 0000qqqq ] ; simple byte-swap shr ah, 4 ; AX = [ 0000QRRR 0000qqqq ] add eax, ...; convert to ascii ... ret ; those instructions are all single-uop 1c latency on SnB-family, but pre-Haswell will insert extra merging uops. (And stall while doing so, on pre-SnB). ; and there another partial-reg stall when we read eax ; It might be possible to do this bit manipulation with fewer operations, or maybe different ones. (maybe copy ax to cx, so we can move from cl or ch to al or ah?) ; shr ah, 1 ; AX = [ 0000qqqq 00000RRR ] CF=Q ; then what? setc/shift/or? rcl is slow, too. ; rorx eax, eax, 32-4 ; AX = [ qqqq0000 0RRR0000 ] CF=Q ; nope, seems a dead end ; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ] ; ror ax, 7 ; AX = [ 0000RRRq qqqQ0000 ] ; shr al, 4 ; AX = [ 0000RRRq 0000qqqQ ] ; oops, no, shifts the wrong way. ; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ] ; or ah, al ; AX = [ qqqqQRRR 00000RRR ] ; xor al,al ; AX = [ qqqqQRRR 00000000 ] ; rol ax, 4 ; AX = [ QRRR0000 0000qqqq ] ; shr ah, 4 ; AX = [ QRRR0000 qqqq0000 ] ; only 3 shifts, but still partial-reg heavy. Interesting on Haswell ; ror ax, 9 ; AX = [ Q00000RR R000qqqq ] CF=Q