Answer: GHC makes it completely strict all by itself (when you give it the chance, i.e. when you compile with optimisations). The source code produces the Core
Rec {
Main.$wg [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.$wg =
  \ (ww_s1JE :: GHC.Prim.Int#) ->
    case ww_s1JE of ds_XsI {
      __DEFAULT ->
        case Main.$wg (GHC.Prim.-# ds_XsI 1) of ww1_s1JI { __DEFAULT ->
        case Main.$wg (GHC.Prim.-# ds_XsI 2) of ww2_X1K4 { __DEFAULT ->
        GHC.Prim.+# ww1_s1JI ww2_X1K4
        } };
      0 -> 0;
      1 -> 1
    }
end Rec }
which, as you can see if you know GHC Core, is completely strict and works on unboxed raw machine integers.
(Unfortunately, the native code gcc generates from the C source is simply faster.)
GHC's strictness analyser is quite good, and in simple cases like this one, where no polymorphism is involved and the function isn't too complicated, you can rely on it finding out that it can unbox all the values, producing a worker that operates on unboxed Int#s (you can inspect the Core yourself with ghc -O2 -ddump-simpl).
However, in such cases there is more to fast code than just working on machine types. The assembly produced by the native code generator, as well as that obtained via the LLVM backend, is basically a straightforward translation of the code to assembly: check whether the argument is 0 or 1, and if not, call the function twice and add the results. Both backends also emit some entry and exit code whose purpose I don't understand, and on my box the native code generator produces the faster code.
For the C code, clang -O3 produces a straightforward assembly with less clutter and more work done in registers,
.Ltmp8:
    .cfi_offset %r14, -24
    movl    %edi, %ebx
    xorl    %eax, %eax
    testl   %ebx, %ebx
    je      .LBB0_4
# BB#1:
    cmpl    $1, %ebx
    jne     .LBB0_3
# BB#2:
    movl    $1, %eax
    jmp     .LBB0_4
.LBB0_3:
    leal    -1(%rbx), %edi
    callq   recfib
    movq    %rax, %r14
    addl    $-2, %ebx
    movl    %ebx, %edi
    callq   recfib
    addq    %r14, %rax
.LBB0_4:
    popq    %rbx
    popq    %r14
    popq    %rbp
    ret
(which, for some reason, performs much better on my system today than it did yesterday). The big performance difference between the code obtained from the Haskell source and from the C source comes from the latter's use of registers, where the former falls back to indirect (stack) addressing; the core of the algorithm is the same in both.
gcc without any optimisations does essentially the same thing, only with some indirect addressing, though less of it than in the output GHC produces via the NCG or LLVM. With -O1 , the same, but with even less indirect addressing. With -O2 you get a transformed shape that no longer maps easily back to the source code, and with -O3 gcc produces the quite amazing
.LFB0:
    .cfi_startproc
    pushq   %r15
    .cfi_def_cfa_offset 16
    .cfi_offset 15, -16
    pushq   %r14
    .cfi_def_cfa_offset 24
    .cfi_offset 14, -24
    pushq   %r13
    .cfi_def_cfa_offset 32
    .cfi_offset 13, -32
    pushq   %r12
    .cfi_def_cfa_offset 40
    .cfi_offset 12, -40
    pushq   %rbp
    .cfi_def_cfa_offset 48
    .cfi_offset 6, -48
    pushq   %rbx
    .cfi_def_cfa_offset 56
    .cfi_offset 3, -56
    subq    $120, %rsp
    .cfi_def_cfa_offset 176
    testl   %edi, %edi
    movl    %edi, 64(%rsp)
    movq    $0, 16(%rsp)
    je      .L2
    cmpl    $1, %edi
    movq    $1, 16(%rsp)
    je      .L2
    movl    %edi, %eax
    movq    $0, 16(%rsp)
    subl    $1, %eax
    movl    %eax, 108(%rsp)
.L3:
    movl    108(%rsp), %eax
    movq    $0, 32(%rsp)
    testl   %eax, %eax
    movl    %eax, 72(%rsp)
    je      .L4
    cmpl    $1, %eax
    movq    $1, 32(%rsp)
    je      .L4
    movl    64(%rsp), %eax
    movq    $0, 32(%rsp)
    subl    $2, %eax
    movl    %eax, 104(%rsp)
.L5:
    movl    104(%rsp), %eax
    movq    $0, 24(%rsp)
    testl   %eax, %eax
    movl    %eax, 76(%rsp)
    je      .L6
    cmpl    $1, %eax
    movq    $1, 24(%rsp)
    je      .L6
    movl    72(%rsp), %eax
    movq    $0, 24(%rsp)
    subl    $2, %eax
    movl    %eax, 92(%rsp)
.L7:
    movl    92(%rsp), %eax
    movq    $0, 40(%rsp)
    testl   %eax, %eax
    movl    %eax, 84(%rsp)
    je      .L8
    cmpl    $1, %eax
    movq    $1, 40(%rsp)
    je      .L8
    movl    76(%rsp), %eax
    movq    $0, 40(%rsp)
    subl    $2, %eax
    movl    %eax, 68(%rsp)
.L9:
    movl    68(%rsp), %eax
    movq    $0, 48(%rsp)
    testl   %eax, %eax
    movl    %eax, 88(%rsp)
    je      .L10
    cmpl    $1, %eax
    movq    $1, 48(%rsp)
    je      .L10
    movl    84(%rsp), %eax
    movq    $0, 48(%rsp)
    subl    $2, %eax
    movl    %eax, 100(%rsp)
.L11:
    movl    100(%rsp), %eax
    movq    $0, 56(%rsp)
    testl   %eax, %eax
    movl    %eax, 96(%rsp)
    je      .L12
    cmpl    $1, %eax
    movq    $1, 56(%rsp)
    je      .L12
    movl    88(%rsp), %eax
    movq    $0, 56(%rsp)
    subl    $2, %eax
    movl    %eax, 80(%rsp)
.L13:
    movl    80(%rsp), %eax
    movq    $0, 8(%rsp)
    testl   %eax, %eax
    movl    %eax, 4(%rsp)
    je      .L14
    cmpl    $1, %eax
    movq    $1, 8(%rsp)
    je      .L14
    movl    96(%rsp), %r15d
    movq    $0, 8(%rsp)
    subl    $2, %r15d
.L15:
    xorl    %r14d, %r14d
    testl   %r15d, %r15d
    movl    %r15d, %r13d
    je      .L16
    cmpl    $1, %r15d
    movb    $1, %r14b
    je      .L16
    movl    4(%rsp), %r12d
    xorb    %r14b, %r14b
    subl    $2, %r12d
    .p2align 4,,10
    .p2align 3
.L17:
    xorl    %ebp, %ebp
    testl   %r12d, %r12d
    movl    %r12d, %ebx
    je      .L18
    cmpl    $1, %r12d
    movb    $1, %bpl
    je      .L18
    xorb    %bpl, %bpl
    jmp     .L20
    .p2align 4,,10
    .p2align 3
.L21:
    cmpl    $1, %ebx
    je      .L58
.L20:
    leal    -1(%rbx), %edi
    call    recfib
    addq    %rax, %rbp
    subl    $2, %ebx
    jne     .L21
.L18:
    addq    %rbp, %r14
    subl    $2, %r13d
    je      .L16
    subl    $2, %r12d
    cmpl    $1, %r13d
    jne     .L17
    addq    $1, %r14
.L16:
    addq    %r14, 8(%rsp)
    subl    $2, 4(%rsp)
    je      .L14
    subl    $2, %r15d
    cmpl    $1, 4(%rsp)
    jne     .L15
    addq    $1, 8(%rsp)
.L14:
    movq    8(%rsp), %rax
    addq    %rax, 56(%rsp)
    subl    $2, 96(%rsp)
    je      .L12
    subl    $2, 80(%rsp)
    cmpl    $1, 96(%rsp)
    jne     .L13
    addq    $1, 56(%rsp)
.L12:
    movq    56(%rsp), %rax
    addq    %rax, 48(%rsp)
    subl    $2, 88(%rsp)
    je      .L10
    subl    $2, 100(%rsp)
    cmpl    $1, 88(%rsp)
    jne     .L11
    addq    $1, 48(%rsp)
.L10:
    movq    48(%rsp), %rax
    addq    %rax, 40(%rsp)
    subl    $2, 84(%rsp)
    je      .L8
    subl    $2, 68(%rsp)
    cmpl    $1, 84(%rsp)
    jne     .L9
    addq    $1, 40(%rsp)
.L8:
    movq    40(%rsp), %rax
    addq    %rax, 24(%rsp)
    subl    $2, 76(%rsp)
    je      .L6
    subl    $2, 92(%rsp)
    cmpl    $1, 76(%rsp)
    jne     .L7
    addq    $1, 24(%rsp)
.L6:
    movq    24(%rsp), %rax
    addq    %rax, 32(%rsp)
    subl    $2, 72(%rsp)
    je      .L4
    subl    $2, 104(%rsp)
    cmpl    $1, 72(%rsp)
    jne     .L5
    addq    $1, 32(%rsp)
.L4:
    movq    32(%rsp), %rax
    addq    %rax, 16(%rsp)
    subl    $2, 64(%rsp)
    je      .L2
    subl    $2, 108(%rsp)
    cmpl    $1, 64(%rsp)
    jne     .L3
    addq    $1, 16(%rsp)
.L2:
    movq    16(%rsp), %rax
    addq    $120, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 56
    popq    %rbx
    .cfi_def_cfa_offset 48
    popq    %rbp
    .cfi_def_cfa_offset 40
    popq    %r12
    .cfi_def_cfa_offset 32
    popq    %r13
    .cfi_def_cfa_offset 24
    popq    %r14
    .cfi_def_cfa_offset 16
    popq    %r15
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L58:
    .cfi_restore_state
    addq    $1, %rbp
    jmp     .L18
    .cfi_endproc
which is much faster than anything else. gcc has unrolled the recursion to a remarkable depth, which neither GHC nor LLVM did, and that makes a big difference here.