FizzBuzz Assembly - Segmentation Error

I try to write FizzBuzz in Assembly, and I see a segmentation error all the time. So far, I have decided that these are not my printing procedures (because I deleted their contents and the problem persists), and the error is hidden somewhere in the main function.

I got this output when starting the program:

fizzSegmentation fault 

I believe that this is not a problem with the separation and search for residues. But I could be wrong, I did not do the Assembly two years later ...

 SECTION .data global _start fizz: db "fizz", 4 buzz: db "buzz", 4 SECTION .bss counter: resb 1 SECTION .text _start: mov ax,0 mov [counter],ax main_loop: cmp ax,100 ;from 0 to 100 je exit ; mov bl,3 ;divisor mov ah,0 ;here will be a remainder div bl ;divide cmp ah,0 ;compare the remainder with 0 je print_fizz ;print fizz if they equal mov bl,5 ;new divisor mov ah,0 ;do I have to do it every time? div bl ;divide cmp ah,0 ;compare the remainder with 0 je print_buzz ;print buzz if they equal jmp print_ax ;print contents of ax if not inc ax ;increment ax jmp main_loop ;jump to label print_ax: ret print_fizz: ret print_buzz: ret exit: mov rax,1 mov rbx,0 int 80h ret 

I compile using:

 yasm -f elf64 -o fizzbuzz.o fizzbuzz.asm ld -d -o fizzbuzz fizzbuzz.o 
-3
source share
3 answers

This causes a segmentation error:

 ... je print_fizz ;print fizz if they equal ... je print_buzz ;print buzz if they equal jmp print_ax ;print contents of ax if not ... print_ax: ret print_fizz: ret print_buzz: ret ... 

As you switch to functions, ret does not get a return address and will return anywhere. Change it to call/ret -pair:

 ... ; je print_fizz ;print fizz if they equal jne .1 ;skip if not equal call print_fizz .1: ... ; je print_buzz ;print buzz if they equal jne .2 ;skip if not equal call print_buzz .2: ; jmp print_ax ;print contents of ax if not call print_ax ... 

This will cause an infinite loop:

 mov ax,0 mov [counter],ax main_loop: cmp ax,100 ;from 0 to 100 je exit ... mov ah,0 ;here will be a remainder div bl ;divide ... mov ah,0 ;do I have to do it every time? div bl ;divide ... inc ax ;increment ax jmp main_loop ;jump to label 

AX changes its values ​​and is not suitable for storing a loop counter. I suggest:

 ... main_loop: ; cmp ax,100 ;from 0 to 100 cmp byte [counter], 100 ... ; inc ax ;increment ax inc byte [counter] jmp main_loop ;jump to label ... 
+2
source

This answer ended up much longer than I planned, like a tutorial on writing effective asm. that is, how to make a difficult task difficult.


In case anyone is interested in reviewing the code of the attempted implementation and the version with a lot of asm tricks:

There are so many small ways it could be better, for example. keep 5 in bh and 3 in bl . You do not always need to use div bl . AMD64 has 20 single-byte registers. (al / ah, bl / bh, cl / ch, dl / dh (without REX) and sil, dil, ... r15b (REX required)).

Using a 16-bit counter is at least a loss of bytes (operand-size prefixes) and can slow down. Using mov reg,0 bad . If possible, put a conditional branch at the bottom of the loop.

mov rax, 1 is the rejection of command bytes compared to mov eax, 1 , and it is labeled , which does not optimize this for you during build. ( nasm ). Configuring 64-bit registers and then using int 0x80 32-bit ABI compatibility is even dumber.

Storing a 16-bit counter in memory is stupid in the first place, but storing it to the address where you reserved only one byte poses problems.


Apart from the little things, FizzBuzz(3,5) is small enough to expand and completely exclude the div part. With assembly macros, you can easily create a fully expanded loop with LCM outputs (fizz, buzz) for each loop (i.e. 15 in this case); enough for the pattern to repeat, so you don't need the conventions.

You can avoid the div without expanding with down-counters to detect count%5==0 and count%3==0 . @anatolyg 16-bit DOS golf code FizzBuzz does this . This is a very common technique that does something every N times when something else happens. for example, performance counter events work this way.


Here is my attempt at efficient FizzBuzz (for AMD64 Linux), without using libraries. only write(2) and exit_group(2)

There is no compiler, so if you need good code, you have to write it yourself. You cannot just hope that the compiler does something good with i%3 in a loop (this is not the case for most compilers ).

This code has changed a lot when I wrote it. As usual, starting to implement one of the methods gives you better ideas when you see that your first idea requires more or slower instructions than you hoped for.

I deployed 3 (Fizz) to remove all counter%3 checks. I processed counter%5 checks with a countdown of 5 instead of division. This still takes quite a bit of logic, which will disappear with full sweep to the point where the pattern repeats (LCM (3,5)). Integer ASCII code number can be in a function or embedded in a deployed loop for very bloated code.

I store everything in registers (including the constants fizz\n and buzz\n ). There are no downloads and is only stored in the buffer. Many registers are set once outside the loop, rather than with mov -immediate before use. It requires good comments to keep track of what you put there.

I add characters to the buffer that we write(2) after each line of fizzbuzz\n . This is the longest loop that naturally occurs in program logic, and means that we only need syscall code in one place.

In a real program that can write a file or pipe, it would be better to use the C stdio strategy to use a much larger buffer in this case. (Many entries of ~ 100 bytes are much worse than writing less than 4096B.) However, I thought it was an interesting choice between traditional printf at each iteration or accumulating the entire line into one large buffer. I used a static buffer instead of reserving stack space because I am writing an entire program, not a function, which should avoid memory loss after returning. In addition, it allows me to use the 32-bit operand size to increment the pointer in order to preserve code bytes (REX prefixes).

It would be fairly easy to assemble a few blocks until you reach the point where the next group can go past the end of the buffer. those. compare the current position with buffer_end - BUZZMOD*FIZZMOD*9 . Optimizing system I / O calls is obviously a wide topic, and this version is enough to demonstrate the accumulation of a line in the buffer.

 ; for (count=1..100): ; if(count%3 == 0) { print_fizz(); } ; if(count%5 == 0) { print_buzz(); } else { ; if(count%3 && count%5) print(count); ;; } ; print(newline) ; We don't need pointers to these strings at all; The strings are immediate data for a couple mov instructions ;SECTION .rodata ; put constants in .rodata. ; fizz: db "fizz" ; No idea what the trailing 4 was for ; buzz: db "buzz" FIZZMOD equ 3 ; only 3 works, but it would be easy to use a loop BUZZMOD equ 5 ; any value works LASTCOUNT equ 100 ; max 100: we only handle two decimal digits. ; TODO: cleanup that can handle LASTCOUNT%FIZZMOD != 1 and LASTCOUNT%BUZZMOD != 0 SECTION .bss ;;; generate a string in this buffer. (flush it with write(2) on "fizzbuzz" lines) ; buf: resb 4096 buf: resb FIZZMOD * BUZZMOD * 9 ; (worst case: every line is "fizzbuzz\n") SECTION .text global _start _start: ; args for write(2). (syscall clobbers rcx/r11, and rax with the return value) mov edi, 1 ; STDOUT_FILENO. also happens to be __NR_write in the AMD64 Linux ABI mov esi, buf ; static data lives in the low 2G of address space, so we don't need a 64bit mov ;; edx = count. ; calculated each iteration ;; mov eax, edi ; also needed every time. saves 3B vs mov eax, imm32 ; 'fizz' is only used once, so we could just store with an immediate there. That wouldn't micro-fuse, and we'd have to do the newline separately mov r10b, 10 ; base 10 ;;mov r14d, BUZZMOD ; not needed, we don't div for this mov r12, 'fizz' | 10<<32 ; `fizz\n`, but YASM doesn't support NASM backquotes for \-escapes mov r13, 'buzz' | 10<<32 ; `buzz\n`. When buzz appears, it always the end of a line ;;;;;;;; Set up for first iteration mov ebp, BUZZMOD ; detect count%BUZZMOD == 0 with a down-counter instead of dividing mov ebx, 1 ; counter starts at 1 mov edx, esi ; current output position = front of buf ALIGN 16 main_loop: ;; TODO: loop FIZZMOD-1 times inside buzz_or_number, or here ;; It doesn't make much sense to unroll this loop but not inline buzz_or_number :/ call buzz_or_number inc ebx call buzz_or_number add ebx, 2 ; counter is never printed on Fizz iterations, so just set up for next main_loop ;; Fizz, and maybe also Buzz mov qword [rdx], r12 ; Fizz with a newline add edx, 5 ; TODO: move this after the branch; adjust the offsets in .fizzbuzz dec ebp jz .fizzbuzz ;;.done_buzz: ; .fizzbuzz duplicates the main_loop branch instead of jumping back here cmp ebx, LASTCOUNT-FIZZMOD jbe main_loop ;;;;;;;;;; END OF main_loop .cleanup: ;;;;;;;;;;;;;;;;;;;;; Cleanup after the loop ; hard-code the fact that 100 % FIZZMOD = 1 more line to print, ; and that 100 % BUZZMOD = 0, so the line is "buzz\n" mov eax, edi ; __NR_write mov [rdx], r13 ; the final "buzz\n". sub edx, buf - 5 ; write_count = current_pos+5 - buf. syscall ; write(1, buf, p - buf). ;; if buf isn't static, then use add edx, 5 / sub edx, esi xor edi, edi mov eax, 231 ; exit_group(0). same as eax=60: exit() for a single-threaded program syscall ;;;;; The fizzbuzz case from the loop .fizzbuzz: ;; count%BUZZMOD == 0: rdx points after the \n at the end of fizz\n, which we need to overwrite ;; this is a macro so we can use it in buzz_or_number, too, where we don't need to back up and overwrite a \n %macro BUZZ_HIT 1 mov [rdx - %1], r13 ; buzz\n. Next line will overwrite the last 3 bytes of the 64b store. add edx, 5 - %1 mov ebp, BUZZMOD ; reset the count%BUZZMOD down-counter %endmacro BUZZ_HIT 1 ; arg=1 to back up and overwrite the \n from "fizz\n" sub edx, esi ; write_count = current_pos - buf mov eax, edi ; __NR_write syscall ; write(1, buf, p - buf). clobbers only rax (return value), and rcx,r11 mov edx, esi ; restart at the front of the buffer ;;; tail-duplication of the main loop, instead of jmp back to the cmp/jbe ;;; could just be a jmp main_loop, if we check at assemble time that LASTCOUNT % FIZZMOD != 0 || LASTCOUNT % BUZZMOD != 0 cmp ebx, LASTCOUNT-FIZZMOD jbe main_loop jmp .cleanup ;;;;;;;;;;;;;;;;;;;;;;; buzz_or_number: called for non-fizz cases ; special calling convention: uses (without clobbering) the same regs as the loop ;; modifies: BUZZMOD down-counter, output position pointer ;; clobbers: rax, rcx ALIGN 32 buzz_or_number: dec ebp jnz .no_buzz ; could make this part of the macro, but flow-control inside macros is probably worse than duplication ;; count%BUZZMOD == 0: append "buzz\n" to the buffer and reset the down-counter BUZZ_HIT 0 ; back up 0 bytes before appending ret .no_buzz: ;; get count as a 1 or 2-digit ASCII number ;; assert(ebx < 10); We don't handle 3-digit numbers mov eax, ebx div r10b ; al = count/10 (first (high) decimal digit), ah = count%10 (second (low) decimal digit). ;; x86 is little-endian, so this is in printing-order already for storing eax ;movzx eax, ax ; avoid partial-reg stalls on pre-Haswell ;; convert integer digits to ASCII by adding '0' to al and ah at the same time, and set the 3rd byte to `\n`. cmp ebx, 9 ; compare against the original counter instead of the div result, for more ILP and earlier detection of branch misprediction jbe .1digit ; most numbers from 1..100 are 2-digit, so make this the not-taken case add eax, 0x0a3030 ;; `00\n`: converts 2 integer digits -> ASCII ;; eax now holds the number + newline as a 3-byte ASCII string mov [rdx], eax add edx, 3 ret .1digit: ;; Could use a 16bit operand-size here to avoid partial-reg stalls, but an imm16 would LCP-stall on Intel. shr eax, 8 ; Shift out the leading 0 add eax, 0x000a30 ;; 1-digit numbers ;; eax now holds the number + newline as a 2-byte ASCII string mov [rdx], ax add edx, 2 ret 

Here's how it works:

 $ strace ./fizzbuzz > /dev/null execve("./fizzbuzz", ["./fizzbuzz"], [/* 69 vars */]) = 0 write(1, "1\n2\nfizz\n4\nbuzz\nfizz\n7\n8\nfizz\nbu"..., 58) = 58 write(1, "16\n17\nfizz\n19\nbuzz\nfizz\n22\n23\nfi"..., 63) = 63 write(1, "31\n32\nfizz\n34\nbuzz\nfizz\n37\n38\nfi"..., 63) = 63 write(1, "46\n47\nfizz\n49\nbuzz\nfizz\n52\n53\nfi"..., 63) = 63 write(1, "61\n62\nfizz\n64\nbuzz\nfizz\n67\n68\nfi"..., 63) = 63 write(1, "76\n77\nfizz\n79\nbuzz\nfizz\n82\n83\nfi"..., 63) = 63 write(1, "91\n92\nfizz\n94\nbuzz\nfizz\n97\n98\nfi"..., 40) = 40 exit_group(0) = ? 

Validation :

 ./fizzbuzz | diff - <(perl -E'say((fizz)[$_%3].(buzz)[$_%5]or$_)for+1..100') # no output = no difference 

Deploying through Buzz (5) and using a counter for Fizz is likely to be worse. My version has a 64-bit store fizz\n\0\0\0 , and then a branch to decide if buzz\n\0\0\0 should overlap to create fizzbuzz\n . Another way is to have a branch to decide whether to store fizz (no new line required, so it can be 32-bit storage). Then it will unconditionally save buzz\n\0\0\0 . However, if FIZZMOD is less than BUZZMOD, this means more frequent resetting the counter and more checks to see if we need to print a number, not a string, this iteration. Having every third line, known as fizz\n or fizzbuzz\n , means that simpler code for this is executed more often.

If overlapping repositories are a problem, this whole algorithm is screwed up, and this is just one of many. Alternatively, we could just fork out to store fizz\n and add 5. Then in the case of fizzbuzz\n we make two stores and add 9. This also separates dec / jcc from cmp / jcc at the bottom of main_loop , so they can hope that both macro fuses on pre-Haswell. IIRC, some processors have industry predictors that really dislike several branches that are very close to each other.


Further improvements left as an exercise for the reader:

  • Inline buzz_or_number and possibly turn it into a loop (iterates FIZZMOD-1)

  • In addition, he probably could have disabled less and other minor improvements. This is kind of version 1.1: working, tested, with some comments and observations added when writing this answer, but not really improving the code, due to the fact that I initially decided it was good enough to make sure that it worked .

    / li>
  • Make it more flexible by writing a cleanup loop (or assembler macros) for the last lines of LASTCOUNT % FIZZMOD , rather than assuming it's 1 line. The cleanup code is a flaw to deploy.

  • I used a div at 10 to convert the counter to a string. A better implementation would use the multiplicative inverse, as compilers generate for small constant dividers (implemented in this case with LEA) .

    Another strategy worth considering is simplification to increase the sequence of ASCII digits (stored in a register). This technique will be more difficult to extend to numbers with more numbers. Keeping them in print order (the most significant digit in the low byte) makes a transfer between numbers that work against us, and not for us. (for example, if they were in the natural order, you could add eax, 256-10 correct the low digit and increase the digit using hyphenation). It might be worth saving it that way, but BSWAP is for storage. Saving \n is built-in to the register, so it takes only one store, maybe not worth it. Detecting and processing a 1-digit number that becomes a 2-digit number is bad enough.

    In 32-bit mode, we could use the AAA instruction to perform decimal portability after incrementing. However, despite the mnemonics, it works on BCD ( 0-9 ), and not on ASCII ( '0'-'9' ), and it seems that it does not allow you to easily transfer the transfer to the 3rd digit. No wonder AMD removed it for AMD64. It checks the AF flag to detect low 4 bit execution, but it is only useful for DAA , where you have two BCD digits packed in one byte, and when you add unknown values, not increase. In this case, you just check al >= 10 .


My first version of this almost worked for the first time (after fixing a couple of syntax errors to get it together, and a stupid crash that took a couple of minutes to debug IIRC): she printed fizz\nbuzz\n in case of fizzbuzz\n , and he changed the numbers . I keep forgetting that digital strings need to be stored with the most significant digit first, and not as bytes in binary binary bits.


Alternative ways of doing things

I decided not to use a non-redistributable version of the code with 1 digit and 2-digit ASCII conversion code , as it took a lot of instructions. In addition, the branch should predict very well.

  ;; Untested buzz_or_number: ... .no_buzz: ... div r10b DECIMAL_TO_ASCII_NEWLINE_2DIGIT equ 0x0a3030 ; add '0' to two unpacked decimal digits, and a newline DECIMAL_TO_ASCII_NEWLINE_1DIGIT equ 0x000a30 ;; hoist this out of the loop: mov r15d, DECIMAL_TO_ASCII_NEWLINE_2DIGIT - DECIMAL_TO_ASCII_NEWLINE_1DIGIT xor ecx,ecx cmp ah, 1 ; set CF if ah=0 (1 digit number), otherwise clear it. This allows sbb for a conditional add, instead of setcc cmovae ecx, r15d ; 0 or the difference from 1digit to 2digit lea eax, [rax+rcx + DECIMAL_TO_ASCII_NEWLINE_1DIGIT] ; rax+=0x0a3030 or 0x000a30, without clobbering flags mov [rdx], eax sbb edx, -3 ; add 2 (-(-3) - 1) or 3. ret 

In 32-bit (and 16-bit) mode, there is a div command that takes an immediate operand and uses AL as a dividend, not AX . It is called AAM , and was removed for AMD64 along with other BCD / ASCII instructions. This is convenient for checking divisibility by 5 without register binding for the divisor or losing a command inside the loop. It is slightly faster than the div r/m8 , and sets the flags according to the remainder (in AL : its reversed outputs compared to the div ).

Anatolyg golfed FizzBuzz uses AAM in a loop with shr ax, 8 to generate one digit at a time in the reverse order, save and reduce the pointer.

This version is more complicated because it checks the number% 5 with AAM and then processes it at count% 10 instead of doing a separate separation to get the ASCII digits.

  ;; Untested buzz_or_number_div: mov eax, ebx aam 5 ; al = al%5 ah = al/5. (opposite locations from div), and sets flags according to the remainder. jz print_buzz ; tailcall ; fall through into print_counter ;print_counter: ; maybe use the result of div by 5 to get division by 10? ; shifting the low bit of the quotient into bit 4 of the remainder should be faster than dividing again. ;; after AAM: ah = 5bit quotient (qqqqQ), al = 3bit remainder(RRR) ;; starting point: ; AX = [ 000qqqqQ 00000RRR ] ;; desired = byte swapped as well: [ 0000QRRR 0000qqqq ] shl al, 5 ; AX = [ 000qqqqQ RRR00000 ] shr ax, 1 ; AX = [ 0000qqqq QRRR0000 ] ror ax, 8 ; AX = [ QRRR0000 0000qqqq ] ; simple byte-swap shr ah, 4 ; AX = [ 0000QRRR 0000qqqq ] add eax, ...; convert to ascii ... ret ; those instructions are all single-uop 1c latency on SnB-family, but pre-Haswell will insert extra merging uops. (And stall while doing so, on pre-SnB). ; and there another partial-reg stall when we read eax ; It might be possible to do this bit manipulation with fewer operations, or maybe different ones. (maybe copy ax to cx, so we can move from cl or ch to al or ah?) ; shr ah, 1 ; AX = [ 0000qqqq 00000RRR ] CF=Q ; then what? setc/shift/or? rcl is slow, too. ; rorx eax, eax, 32-4 ; AX = [ qqqq0000 0RRR0000 ] CF=Q ; nope, seems a dead end ; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ] ; ror ax, 7 ; AX = [ 0000RRRq qqqQ0000 ] ; shr al, 4 ; AX = [ 0000RRRq 0000qqqQ ] ; oops, no, shifts the wrong way. ; shl ah, 3 ; AX = [ qqqqQ000 00000RRR ] ; or ah, al ; AX = [ qqqqQRRR 00000RRR ] ; xor al,al ; AX = [ qqqqQRRR 00000000 ] ; rol ax, 4 ; AX = [ QRRR0000 0000qqqq ] ; shr ah, 4 ; AX = [ QRRR0000 qqqq0000 ] ; only 3 shifts, but still partial-reg heavy. Interesting on Haswell ; ror ax, 9 ; AX = [ Q00000RR R000qqqq ] CF=Q 
+2
source

Use the debugger to take one step of your code and see where it goes wrong.

At first glance, it’s already obvious that you are destroying ax (maybe you don’t know that ax consists of ah and al ?). You also go to functions instead of calling them, which is probably the cause of the errors.

0
source

Source: https://habr.com/ru/post/1436200/


All Articles