Naive Fibonacci in C vs Haskell

How can I make the function g (fib) completely strict? (I know this exponential solution is not optimal; I would just like to know whether this recursion can be made completely strict, if that is possible.)

Haskell

    g :: Int -> Int
    g 0 = 0
    g 1 = 1
    g x = g (x-1) + g (x-2)

    main = print $ g 42

The goal is for it to run about as fast as the naive C solution:

C

    #include <stdio.h>

    long f(int x)
    {
        if (x == 0) return 0;
        if (x == 1) return 1;
        return f(x-1) + f(x-2);
    }

    int main(void)
    {
        printf("%ld\n", f(42));
        return 0;
    }

Note: this fib recursion is used only as a simple example. I am fully aware that there are dozens of better algorithms. But there are certainly recursive algorithms that do NOT have such simple and more efficient alternatives.

+4

4 answers

Answer: GHC already makes the evaluation completely strict on its own (when you give it the chance, i.e. when compiling with optimizations). Your source code produces the Core

    Rec {
    Main.$wg [Occ=LoopBreaker]
      :: GHC.Prim.Int# -> GHC.Prim.Int#
    [GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
    Main.$wg =
      \ (ww_s1JE :: GHC.Prim.Int#) ->
        case ww_s1JE of ds_XsI {
          __DEFAULT ->
            case Main.$wg (GHC.Prim.-# ds_XsI 1) of ww1_s1JI { __DEFAULT ->
            case Main.$wg (GHC.Prim.-# ds_XsI 2) of ww2_X1K4 { __DEFAULT ->
            GHC.Prim.+# ww1_s1JI ww2_X1K4
            }
            };
          0 -> 0;
          1 -> 1
        }
    end Rec }

which, as you can see if you can read GHC Core, is completely strict and uses unboxed raw machine integers.

(Unfortunately, the native code gcc generates from the C source simply runs faster.)

The GHC strictness analyzer is quite good, and in simple cases like this one, where there is no polymorphism and the function is not too complicated, you can count on it finding that it can unbox all the values, producing a worker that operates on unboxed Int#s.
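As a minimal sketch of where this can go wrong (my own addition, not from the original answer): polymorphism is one case where the unboxing worker/wrapper transformation can be lost, because the recursion then goes through overloaded operations. A SPECIALIZE pragma (or simply a monomorphic type signature) gives the strictness analyzer its chance back:

```haskell
-- Monomorphic version: with -O2, GHC's strictness analysis produces a
-- worker on unboxed Int#s, as in the Core dump above.
g :: Int -> Int
g 0 = 0
g 1 = 1
g x = g (x - 1) + g (x - 2)

-- Polymorphic version: the overloaded (+), (-) and literals can get in
-- the way of unboxing. Specializing to Int recovers the fast worker.
gPoly :: (Eq a, Num a) => a -> a
gPoly 0 = 0
gPoly 1 = 1
gPoly x = gPoly (x - 1) + gPoly (x - 2)
{-# SPECIALIZE gPoly :: Int -> Int #-}

main :: IO ()
main = print (g 25, gPoly (25 :: Int))
```

Whether the pragma is needed at all depends on the module structure; within a single module GHC can often specialize on its own, so treat this as an illustration of the mechanism rather than a required incantation.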

However, there is more to producing fast code than just operating on machine types. The assembly produced both by the native code generator and by the LLVM backend is basically a direct translation of the code: check whether the argument is 0 or 1, and if not, call the function twice and add the results. Both produce entry and exit code that I do not fully understand, and on my box the native code generator produces the faster code.

For the C code, clang -O3 produces a simple assembly with less cruft and heavier use of registers,

    .Ltmp8:
        .cfi_offset %r14, -24
        movl    %edi, %ebx
        xorl    %eax, %eax
        testl   %ebx, %ebx
        je      .LBB0_4
    # BB#1:
        cmpl    $1, %ebx
        jne     .LBB0_3
    # BB#2:
        movl    $1, %eax
        jmp     .LBB0_4
    .LBB0_3:
        leal    -1(%rbx), %edi
        callq   recfib
        movq    %rax, %r14
        addl    $-2, %ebx
        movl    %ebx, %edi
        callq   recfib
        addq    %r14, %rax
    .LBB0_4:
        popq    %rbx
        popq    %r14
        popq    %rbp
        ret

(which, for some reason, runs much better on my system today than it did yesterday). The big performance difference between the code obtained from the Haskell source and from the C source comes from the register usage in the latter, where the former uses indirect addressing; the core of the algorithm is the same in both.

gcc, without any optimizations, does essentially the same thing, with some indirect addressing, but less of it than GHC produces via the NCG or LLVM. With -O1, the same, but with even less indirect addressing. With -O2 you get transformations such that the assembly no longer maps easily back to the source, and with -O3 gcc produces the rather amazing

    .LFB0:
        .cfi_startproc
        pushq   %r15
        .cfi_def_cfa_offset 16
        .cfi_offset 15, -16
        pushq   %r14
        .cfi_def_cfa_offset 24
        .cfi_offset 14, -24
        pushq   %r13
        .cfi_def_cfa_offset 32
        .cfi_offset 13, -32
        pushq   %r12
        .cfi_def_cfa_offset 40
        .cfi_offset 12, -40
        pushq   %rbp
        .cfi_def_cfa_offset 48
        .cfi_offset 6, -48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        .cfi_offset 3, -56
        subq    $120, %rsp
        .cfi_def_cfa_offset 176
        testl   %edi, %edi
        movl    %edi, 64(%rsp)
        movq    $0, 16(%rsp)
        je      .L2
        cmpl    $1, %edi
        movq    $1, 16(%rsp)
        je      .L2
        movl    %edi, %eax
        movq    $0, 16(%rsp)
        subl    $1, %eax
        movl    %eax, 108(%rsp)
    .L3:
        movl    108(%rsp), %eax
        movq    $0, 32(%rsp)
        testl   %eax, %eax
        movl    %eax, 72(%rsp)
        je      .L4
        cmpl    $1, %eax
        movq    $1, 32(%rsp)
        je      .L4
        movl    64(%rsp), %eax
        movq    $0, 32(%rsp)
        subl    $2, %eax
        movl    %eax, 104(%rsp)
    .L5:
        movl    104(%rsp), %eax
        movq    $0, 24(%rsp)
        testl   %eax, %eax
        movl    %eax, 76(%rsp)
        je      .L6
        cmpl    $1, %eax
        movq    $1, 24(%rsp)
        je      .L6
        movl    72(%rsp), %eax
        movq    $0, 24(%rsp)
        subl    $2, %eax
        movl    %eax, 92(%rsp)
    .L7:
        movl    92(%rsp), %eax
        movq    $0, 40(%rsp)
        testl   %eax, %eax
        movl    %eax, 84(%rsp)
        je      .L8
        cmpl    $1, %eax
        movq    $1, 40(%rsp)
        je      .L8
        movl    76(%rsp), %eax
        movq    $0, 40(%rsp)
        subl    $2, %eax
        movl    %eax, 68(%rsp)
    .L9:
        movl    68(%rsp), %eax
        movq    $0, 48(%rsp)
        testl   %eax, %eax
        movl    %eax, 88(%rsp)
        je      .L10
        cmpl    $1, %eax
        movq    $1, 48(%rsp)
        je      .L10
        movl    84(%rsp), %eax
        movq    $0, 48(%rsp)
        subl    $2, %eax
        movl    %eax, 100(%rsp)
    .L11:
        movl    100(%rsp), %eax
        movq    $0, 56(%rsp)
        testl   %eax, %eax
        movl    %eax, 96(%rsp)
        je      .L12
        cmpl    $1, %eax
        movq    $1, 56(%rsp)
        je      .L12
        movl    88(%rsp), %eax
        movq    $0, 56(%rsp)
        subl    $2, %eax
        movl    %eax, 80(%rsp)
    .L13:
        movl    80(%rsp), %eax
        movq    $0, 8(%rsp)
        testl   %eax, %eax
        movl    %eax, 4(%rsp)
        je      .L14
        cmpl    $1, %eax
        movq    $1, 8(%rsp)
        je      .L14
        movl    96(%rsp), %r15d
        movq    $0, 8(%rsp)
        subl    $2, %r15d
    .L15:
        xorl    %r14d, %r14d
        testl   %r15d, %r15d
        movl    %r15d, %r13d
        je      .L16
        cmpl    $1, %r15d
        movb    $1, %r14b
        je      .L16
        movl    4(%rsp), %r12d
        xorb    %r14b, %r14b
        subl    $2, %r12d
        .p2align 4,,10
        .p2align 3
    .L17:
        xorl    %ebp, %ebp
        testl   %r12d, %r12d
        movl    %r12d, %ebx
        je      .L18
        cmpl    $1, %r12d
        movb    $1, %bpl
        je      .L18
        xorb    %bpl, %bpl
        jmp     .L20
        .p2align 4,,10
        .p2align 3
    .L21:
        cmpl    $1, %ebx
        je      .L58
    .L20:
        leal    -1(%rbx), %edi
        call    recfib
        addq    %rax, %rbp
        subl    $2, %ebx
        jne     .L21
    .L18:
        addq    %rbp, %r14
        subl    $2, %r13d
        je      .L16
        subl    $2, %r12d
        cmpl    $1, %r13d
        jne     .L17
        addq    $1, %r14
    .L16:
        addq    %r14, 8(%rsp)
        subl    $2, 4(%rsp)
        je      .L14
        subl    $2, %r15d
        cmpl    $1, 4(%rsp)
        jne     .L15
        addq    $1, 8(%rsp)
    .L14:
        movq    8(%rsp), %rax
        addq    %rax, 56(%rsp)
        subl    $2, 96(%rsp)
        je      .L12
        subl    $2, 80(%rsp)
        cmpl    $1, 96(%rsp)
        jne     .L13
        addq    $1, 56(%rsp)
    .L12:
        movq    56(%rsp), %rax
        addq    %rax, 48(%rsp)
        subl    $2, 88(%rsp)
        je      .L10
        subl    $2, 100(%rsp)
        cmpl    $1, 88(%rsp)
        jne     .L11
        addq    $1, 48(%rsp)
    .L10:
        movq    48(%rsp), %rax
        addq    %rax, 40(%rsp)
        subl    $2, 84(%rsp)
        je      .L8
        subl    $2, 68(%rsp)
        cmpl    $1, 84(%rsp)
        jne     .L9
        addq    $1, 40(%rsp)
    .L8:
        movq    40(%rsp), %rax
        addq    %rax, 24(%rsp)
        subl    $2, 76(%rsp)
        je      .L6
        subl    $2, 92(%rsp)
        cmpl    $1, 76(%rsp)
        jne     .L7
        addq    $1, 24(%rsp)
    .L6:
        movq    24(%rsp), %rax
        addq    %rax, 32(%rsp)
        subl    $2, 72(%rsp)
        je      .L4
        subl    $2, 104(%rsp)
        cmpl    $1, 72(%rsp)
        jne     .L5
        addq    $1, 32(%rsp)
    .L4:
        movq    32(%rsp), %rax
        addq    %rax, 16(%rsp)
        subl    $2, 64(%rsp)
        je      .L2
        subl    $2, 108(%rsp)
        cmpl    $1, 64(%rsp)
        jne     .L3
        addq    $1, 16(%rsp)
    .L2:
        movq    16(%rsp), %rax
        addq    $120, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 56
        popq    %rbx
        .cfi_def_cfa_offset 48
        popq    %rbp
        .cfi_def_cfa_offset 40
        popq    %r12
        .cfi_def_cfa_offset 32
        popq    %r13
        .cfi_def_cfa_offset 24
        popq    %r14
        .cfi_def_cfa_offset 16
        popq    %r15
        .cfi_def_cfa_offset 8
        ret
        .p2align 4,,10
        .p2align 3
    .L58:
        .cfi_restore_state
        addq    $1, %rbp
        jmp     .L18
        .cfi_endproc

which is much faster than anything else. gcc unrolled the recursion to a remarkable depth that neither GHC nor LLVM attempted, and that makes a big difference here.

+16

Start by using a better algorithm!

    fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
    fib n = fibs !! n

fib 42 will give you an answer much faster.

Using a better algorithm is much more important than making small speed changes.

You can happily and quickly calculate fib 123456 (a number 25801 digits long) in ghci with this definition, i.e. interpreted, without even compiling. You could get your C code to compute it faster, but writing that code would take you quite a while. This took me hardly any effort at all; I spent far more time writing this post!
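And if the zipWith definition is still not fast enough, here is a sketch (my own addition, not part of the original answer) of the classic fast-doubling algorithm, which needs only O(log n) arithmetic steps instead of O(n):

```haskell
-- Fast-doubling Fibonacci. fibPair n returns (fib n, fib (n+1)),
-- halving n at each step via the identities
--   fib (2k)   = fib k * (2 * fib (k+1) - fib k)
--   fib (2k+1) = fib k ^ 2 + fib (k+1) ^ 2
fibPair :: Integer -> (Integer, Integer)
fibPair 0 = (0, 1)
fibPair n =
  let (a, b) = fibPair (n `div` 2)  -- (fib k, fib (k+1)) for k = n `div` 2
      c = a * (2 * b - a)           -- fib (2k)
      d = a * a + b * b             -- fib (2k+1)
  in if even n then (c, d) else (d, c + d)

fib :: Integer -> Integer
fib = fst . fibPair

main :: IO ()
main = print (fib 42)
```

The arbitrary-precision Integer arithmetic dominates for huge inputs, but the call depth is only about log2 n, so fib 123456 is essentially instant.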

Moral:

  • Use the right algorithm!
  • Haskell lets you write clean code by just writing down what the answer is.
  • Sometimes it is easier to define an infinite list of answers and grab the one you want than to write some loop that updates values.
  • Haskell is awesome.
+4

This is absolutely strict.

    g :: Int -> Int
    g 0 = 0
    g 1 = 1
    g x = a `seq` b `seq` a + b
      where
        a = g $! x-1
        b = g $! x-2

    main = print $! g 42

$! is just like $ (low-precedence function application), except that it evaluates the function's argument strictly.
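A minimal sketch of the difference (the names here are my own, not from the answer):

```haskell
f :: Int -> Int
f _ = 1   -- ignores its argument entirely

main :: IO ()
main = do
  -- Plain application never forces the argument, so this prints 1:
  print (f (error "boom"))
  -- print (f $! error "boom")  -- would throw "boom": $! forces the
                                -- argument to WHNF before calling f
```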

You will also want to compile with -O2, though I do wonder why you don't want to use a better algorithm.

+3

The function is already completely strict.

The usual definition of a strict function is one that is undefined whenever its input is undefined. I gather from the context that you have a different notion of strictness in mind, namely that a function is strict if it evaluates its arguments before computing its result. But since, in general, the only way to check whether a value is undefined is to evaluate it, the two notions usually coincide.

By the first definition, g is certainly strict, since it has to check whether the argument is zero before it knows which branch of the definition to use, so if the argument is undefined, g will choke when it tries to inspect it.

By the more operational definition, well, what could g possibly do wrong? The first two equations are obviously fine, and they mean that by the time we reach the third equation, the argument must already have been evaluated. In the third equation we have two function calls and an addition; that is, we have the following tasks:

  • subtract 1 from the argument
  • subtract 2 from the argument
  • call g with the result of 1.
  • call g with the result of 2.
  • add the results of 3. and 4. together.

Laziness can shuffle the order of these operations a bit, but since both + and g need the values of their arguments before they can do anything, nothing here can be delayed by any significant amount. The compiler is free to perform these operations in strict order, provided it can show that + is strict (it is built in, so that should not be too hard) and that g is strict (which is obvious, as above). So any reasonable optimizing compiler will have no trouble with this, and even a completely naive non-optimizing compiler will not incur significant overhead (this is certainly nothing like a foldl (+) 0 [1 .. 1000000] situation).
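That foldl case is where laziness really does hurt; a minimal sketch of the contrast (my own addition, not from the answer):

```haskell
import Data.List (foldl')

-- foldl builds a chain of a million suspended (+) applications before
-- anything is forced, so without optimization it can blow the stack;
-- foldl' forces the accumulator at every step and runs in constant space.
-- Unlike g, the accumulator here is never pattern-matched along the way.
lazySum, strictSum :: Integer
lazySum   = foldl  (+) 0 [1 .. 1000000]
strictSum = foldl' (+) 0 [1 .. 1000000]

main :: IO ()
main = print strictSum
```

(With -O2, GHC's strictness analysis usually rescues the foldl version too, which is exactly the point of the answer: the analyzer fixes the easy cases, but foldl' states the intent directly.)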

The lesson is that when a function immediately pattern-matches on its argument, that function is already strict, and any decent compiler will be able to use this fact to eliminate the usual overhead of laziness. That does not mean it can eliminate other overheads, such as the time taken to start the runtime system, which tend to make Haskell programs a little slower than C programs. If you are just looking at performance numbers, there is a lot more going on than whether your program is strict or lazy.

+1

Source: https://habr.com/ru/post/1444006/

