Efficient Bit Fiddling in an LFSR Implementation

Although I have a good implementation of an LFSR in C, I thought I would try the same in Haskell, just to see how it goes. What I came up with is currently two orders of magnitude slower than the C implementation, which raises the question: how can the performance be improved? Clearly the bit operations are the bottleneck, and the profiler confirms this.

Here's the baseline Haskell code using lists and Data.Bits:

import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)

tap :: [[Int]]
tap =
  [ [],               [],               [],               [3, 2]
  , [4, 3],           [5, 3],           [6, 5],           [7, 6]
  , [8, 6, 5, 4],     [9, 5],           [10, 7],          [11, 9]
  , [12, 6, 4, 1],    [13, 4, 3, 1],    [14, 5, 3, 1],    [15, 14]
  , [16, 15, 13, 4],  [17, 14],         [18, 11],         [19, 6, 2, 1]
  , [20, 17],         [21, 19],         [22, 21],         [23, 18]
  , [24, 23, 22, 17], [25, 22],         [26, 6, 2, 1],    [27, 5, 2, 1]
  , [28, 25],         [29, 27],         [30, 6, 4, 1],    [31, 28]
  , [32, 22, 2, 1],   [33, 20],         [34, 27, 2, 1],   [35, 33]
  , [36, 25],         [37, 5, 4, 3, 2, 1], [38, 6, 5, 1], [39, 35]
  , [40, 38, 21, 19], [41, 38],         [42, 41, 20, 19], [43, 42, 38, 37]
  , [44, 43, 18, 17], [45, 44, 42, 41], [46, 45, 26, 25], [47, 42]
  , [48, 47, 21, 20], [49, 40],         [50, 49, 24, 23], [51, 50, 36, 35]
  , [52, 49],         [53, 52, 38, 37], [54, 53, 18, 17], [55, 31]
  , [56, 55, 35, 34], [57, 50],         [58, 39],         [59, 58, 38, 37]
  , [60, 59],         [61, 60, 46, 45], [62, 61, 6, 5],   [63, 62] ]

xor' :: [Bool] -> Bool
xor' = foldr xor False

mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
  | d0        = shifted
  | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. mask len
    d0      = xor' $ map (testBit lfsr) tap'
    tap'    = map (subtract 1) tap

main :: IO ()
main = do
  args <- getArgs
  when (null args) $ fail "Usage: lsfr <number-of-bits>"
  let len = read $ head args
  when (len < 8) $ fail "No need for LFSR"
  let out = last $ take (shift 1 len) $ iterate (advance len (tap !! len)) 0
  if out == 0
    then do putStr "OK\n"
            exitSuccess
    else do putStr "FAIL\n"
            exitFailure

It basically checks whether the LFSR defined by tap :: [[Int]] is maximal-length for any given number of bits. (More precisely, it simply checks whether the LFSR returns to its initial state (zero) after 2^n iterations.)
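For intuition, here is a standalone sketch of that check for a single small register: a 4-bit LFSR with taps [4, 3], using the same inverted-feedback convention as the code above (the names step and states are my own, not from the program):

```haskell
import Data.Bits (shift, testBit, xor, (.&.), (.|.))
import Data.List (nub)

-- One step of a 4-bit LFSR with taps [4, 3]; as in the full program,
-- the new low bit is 1 exactly when the XOR of the tapped bits is False.
step :: Int -> Int
step lfsr = if d0 then shifted else shifted .|. 1
  where
    shifted = shift lfsr 1 .&. 15
    d0      = testBit lfsr 3 `xor` testBit lfsr 2

main :: IO ()
main = do
  let states = take 16 $ iterate step 0
  print (length (nub states))  -- 15 distinct states: maximal length
  print (last states == 0)     -- back at the initial state, like out == 0
```

With the inverted feedback, the all-zero state is part of the cycle and the all-ones state is the one excluded, so a maximal-length register walks through 2^4 - 1 = 15 states before returning to zero.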

According to the profiler, the most expensive line is the feedback-bit computation d0 = xor' $ map (testBit lfsr) tap'.

What I have tried so far:

  • use Data.Array : attempt abandoned, since it has no foldl / foldr
  • use Data.Vector : slightly faster than the baseline

The compiler options I use are -O2, on LTS Haskell 8.12 (ghc-8.0.2).

A link to the C++ reference program can be found at gist.github.com.

Haskell code can't be expected to run as fast as C code, but two orders of magnitude is too much; there has to be a better way of doing this bit fiddling.

Update: results of applying optimizations proposed in answers

  • The C++ reference program with input 28, compiled with LLVM 8.0.0, runs in 0.67 s on my machine (with clang 3.7 it is slightly slower, 0.68 s)
  • The original Haskell code runs about 100 times slower (and, due to its space inefficiency, don't try it with inputs greater than 25)
  • With @thomas-m-dubuisson's rewrite, still using the default GHC backend, the runtime comes down to 5.2 s
  • With @thomas-m-dubuisson's rewrite, now using the LLVM backend (ghc option -O2 -fllvm), the runtime comes down to 1.7 s
    • Using the ghc options -O2 -fllvm -optlc -mcpu=native brings this down to 0.73 s
  • Replacing iterate with @Cirdec's iterate' makes no difference when Thomas's code is used (both with the default "native" codegen and with LLVM). It does make a difference, however, with the original code.

So we went from 100x to 8x to 1.09x, i.e. only 9% slower than C!

Note: the LLVM backend for GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing that version with brew and then symlinking the opt and llc binaries. See 7.10. GHC Backends.

+5
3 answers

Up-front matters

First off, I'm using GHC 8.0.1 on an Intel i5, ~2.5 GHz, Linux x86-64.

First draft: oh my, the slowness!

Your starting code, run with parameter 25:

% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main             ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25  7.25s user 0.50s system 99% cpu 7.748 total

So the time to beat is 77 ms, two orders of magnitude better than this Haskell code. Let's dive in.

Problem 1: Shifty Code

I found a couple of oddities in the code. The first is the use of shift in performance-critical code. shift supports both left and right shifts, and as a result it requires a branch. Let's kill it in favor of the more readable powers of two and friends (shift 1 x ~> 2^x and shift x 1 ~> 2*x):

% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main             ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total

(As you noted in the comments: yes, this bears investigating. It may be that some oddity of the original code prevented a rewrite rule from firing, resulting in significantly worse code.)
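The two rewrites themselves are easy to sanity-check in isolation (a quick sketch, not part of the original answer):

```haskell
import Data.Bits (shift)

-- shift 1 x is 2^x and shift x 1 is 2*x for non-negative shifts;
-- the power/multiplication forms avoid shift's direction branch.
main :: IO ()
main = do
  print $ all (\x -> shift 1 x == 2 ^ x) [0 .. 30 :: Int]
  print $ all (\x -> shift x 1 == 2 * x) [0 .. 1000 :: Int]
```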

Problem 2: Lists of bits? Int operations to the rescue!

One change, one order of magnitude. Hooray. What else? Well, there's that awkward list of tap locations you walk over, which just seems to be asking for inefficiency and/or to be relying on fragile optimizations. At this point I'll note that hard-coding any single tap list from that table results in really good performance (for example, testBit lsfr 24 `xor` testBit lsfr 21), but we want a more general fast solution.

I propose computing a mask of all the tap locations up front and then doing a one-instruction pop count. To do that we only need a single Int passed into advance, not the whole list. The popcount instruction requires good assembly generation, which requires llvm and probably -optlc-mcpu=native or another non-pessimistic choice of instruction set.

This step gives us pc below. I've also folded in the guard removal in advance that was mentioned in the comments:

let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1

    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr

    out :: Int
    out = last $ take (2^len) $ iterate advance' 0
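To see why pc computes the same feedback bit as the original xor'-over-testBit, note that the XOR of the tapped bits is False exactly when an even number of them are set. A quick equivalence check (a sketch hard-coding len = 4 and taps [4, 3]; the names advanceList and advancePC are mine):

```haskell
import Data.Bits (popCount, shift, testBit, xor, (.&.), (.|.))

-- Original style: XOR the tapped bits via a list traversal.
advanceList :: Int -> Int
advanceList lfsr
  | d0        = shifted
  | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. 15
    d0      = foldr xor False (map (testBit lfsr) [3, 2])

-- New style: mask out the taps, then take the parity of the popcount.
advancePC :: Int -> Int
advancePC lfsr = (2 * lfsr .&. 15) .|. fromEnum (even (popCount (lfsr .&. tp)))
  where
    tp = sum $ map ((2 ^) . subtract 1) [4, 3]  -- the tap mask, 0b1100

main :: IO ()
main = print $ all (\s -> advanceList s == advancePC s) [0 .. 15]
```

The check passes for every 4-bit state, which is exactly the correspondence the rewrite relies on.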

Our final performance:

% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
OK
./so 25  0.06s user 0.00s system 96% cpu 0.067 total

That's two orders of magnitude of improvement from start to finish, so hopefully it matches your C. Finally, in deployed code it's actually very common to have Haskell packages with C bindings, but this is often a learning exercise, so I hope you had fun.

Edit: The now-available C++ code takes 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds on my system, so it seems we've beaten our mark by a fair bit.

+8

I suspect the following line builds up a large, list-like thunk in memory before evaluating it:

let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0

Let's find out whether I'm right and, if so, fix it. The first step in debugging is to get an idea of the memory used by the program. To do that we'll compile with the option -rtsopts in addition to -O2. This allows running the program with RTS options, including +RTS -s, which prints a small memory summary.

Initial performance

Running the program as lfsr 25 +RTS -s, I get the following output:

OK
   5,420,148,768 bytes allocated in the heap
   6,705,977,216 bytes copied during GC
   1,567,511,384 bytes maximum residency (20 sample(s))
     357,862,432 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.453s   2.522s     0.0002s    0.0009s
  Gen  1        20 colls,     0 par    2.281s   3.065s     0.1533s    0.7128s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.438s  (  1.162s elapsed)
  GC      time    4.734s  (  5.587s elapsed)
  EXIT    time    0.016s  (  0.218s elapsed)
  Total   time    6.188s  (  6.967s elapsed)

  %GC     time      76.5%  (80.2% elapsed)

  Alloc rate    3,770,538,273 bytes per MUT second

  Productivity  23.5% of total user, 19.8% of total elapsed

That's a lot of memory in use at once. It's very likely there's a big thunk building up somewhere in there.

Trying to reduce the thunk size

I hypothesized that the thunk was being built up in iterate (advance ...). If so, we can try to reduce its size by making advance stricter in its lfsr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine is evaluated.

BangPatterns is an easy way to make a function strict in an argument. f !x = ... is shorthand for f x = seq x $ ...
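As a minimal illustration of the difference a bang makes (a toy accumulator loop of my own, not part of the LFSR code):

```haskell
{-# LANGUAGE BangPatterns #-}

-- The bang forces acc at each step, so the loop runs in constant space;
-- without it, acc would accumulate a chain of unevaluated (+) thunks.
sumTo :: Int -> Int
sumTo = go 0
  where
    go !acc 0 = acc
    go !acc n = go (acc + n) (n - 1)

main :: IO ()
main = print (sumTo 1000000)  -- 500000500000
```

The same bang applied inside advance's loop gives: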

{-# LANGUAGE BangPatterns #-}

advance :: Int -> [Int] -> Int -> Int
advance len tap = go
  where
    go !lfsr
      | d0        = shifted
      | otherwise = shifted .|. 1
      where
        shifted = shift lfsr 1 .&. mask len
        d0      = xor' $ map (testBit lfsr) tap'
        tap'    = map (subtract 1) tap

Let's see if that makes a difference...

>lfsr 25 +RTS -s
OK
   5,420,149,072 bytes allocated in the heap
   6,705,979,368 bytes copied during GC
   1,567,511,448 bytes maximum residency (20 sample(s))
     357,862,448 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.688s   2.711s     0.0003s    0.0059s
  Gen  1        20 colls,     0 par    2.438s   3.252s     0.1626s    0.8013s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.328s  (  1.146s elapsed)
  GC      time    5.125s  (  5.963s elapsed)
  EXIT    time    0.000s  (  0.226s elapsed)
  Total   time    6.484s  (  7.335s elapsed)

  %GC     time      79.0%  (81.3% elapsed)

  Alloc rate    4,081,053,418 bytes per MUT second

  Productivity  21.0% of total user, 18.7% of total elapsed

Nope, nothing noticeable.

Fixing the spine

My next guess is that it's the spine of iterate (advance ...) that is being built up. After all, for the command I'm running it would have 2^25, or just over 33 million, elements. The list itself is probably removed by list fusion, but the thunk for the last element of the list is still over 33 million applications of advance deep.

To solve this problem we need a strict version of iterate, so that each value is forced to an Int before the advance function is applied again. This should keep memory down to a single lfsr value at a time, plus the currently-being-computed application of advance.
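Setting list fusion aside for a moment, a minimal strict iterate is only a couple of lines (iterateStrict is a name of my choosing):

```haskell
-- Forces each state with seq before consing it on, so no chain of
-- unevaluated applications can build up along the spine.
iterateStrict :: (a -> a) -> a -> [a]
iterateStrict f = go
  where go x = x `seq` (x : go (f x))

main :: IO ()
main = print $ iterateStrict (+ 1) (0 :: Int) !! 1000000  -- 1000000
```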

Unfortunately, there is no strict iterate in Data.List. Here's one that also doesn't give up the list fusion that's providing other important (I believe) performance optimizations for this problem.

{-# LANGUAGE BangPatterns #-}

import GHC.Base (build)

{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
  where go !x = x : go (f x)

{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
  where go !x = x `c` go (f x)

{-# RULES
"iterate'"   [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1]  iterateFB' (:) = iterate'
  #-}

It's just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated value.

Equipped with the strict iterate', we can change the offending line to

let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0

I expect this to perform much better. Let's see...

>lfsr 25 +RTS -s
OK
   3,758,156,184 bytes allocated in the heap
         297,976 bytes copied during GC
          43,800 bytes maximum residency (1 sample(s))
          21,736 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7281 colls,     0 par    0.047s   0.008s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.750s  (  0.783s elapsed)
  GC      time    0.047s  (  0.008s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.797s  (  0.792s elapsed)

  %GC     time       5.9%  (1.0% elapsed)

  Alloc rate    5,010,874,912 bytes per MUT second

  Productivity  94.1% of total user, 99.0% of total elapsed

It used 0.00002 times the maximum residency and ran almost 10 times faster.

I don't know whether this would improve on Thomas DuBuisson's answer, which improves advance but still leaves iterate advance' lazy. It would be easy to check: add the iterate' code from this answer and use iterate' in place of iterate in his answer.

+6
  • Does the compiler lift tap !! len out of the loop? I suspect it does, but pulling it out yourself guarantees it can't hurt:

     let tap1 = tap !! len
     let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
  • In the comments you say "2^len is needed exactly once", but this is wrong: you compute it on every call to advance (via mask len). So you could try

     advance len tap mask lfsr
       | d0        = shifted
       | otherwise = shifted .|. 1
       where
         shifted = shift lfsr 1 .&. mask
         d0      = xor' $ map (testBit lfsr) tap'
         tap'    = map (subtract 1) tap

     -- in main
     let tap1 = tap !! len
     let numIterations = 2^len
     let mask = numIterations - 1
     let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)

    (The compiler can't optimize last $ take n ... into ... !! (n - 1) in general, because the two differ on finite lists, but iterate always returns an infinite one.)

  • You compared foldr with foldl, but foldl is almost never what you want; since xor always needs both arguments and is associative, foldl' is very likely the right choice (the compiler can optimize a foldl into it, but if there's a real difference between your foldl and foldr timings rather than just random variation, perhaps that didn't happen here).
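As a sketch of that last point (xorL' is my name; both folds compute the same parity, but the strict left fold evaluates as it goes instead of building a thunk chain):

```haskell
import Data.Bits (xor)
import Data.List (foldl')

xorR, xorL' :: [Bool] -> Bool
xorR  = foldr xor False   -- as in the question
xorL' = foldl' xor False  -- strict left fold

main :: IO ()
main = do
  let bits = map odd [1 .. 1001 :: Int]  -- 501 True values: odd parity
  print (xorR bits)                -- True
  print (xorR bits == xorL' bits)  -- True: same result either way
```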

+2

Source: https://habr.com/ru/post/1267114/
