Large Vector Manipulation Optimization

This is a follow-up to my previous question about processing a Vector representation of a 5.1M-edge directed graph. I am trying to implement Kosaraju's SCC algorithm, so I need to rearrange my Vector in the order of the finishing times of a depth-first search (DFS) on the reversed edges. I have code that works on small data sets, but it has not returned after 10 minutes on the full data set. (I cannot rule out that it is looping on the large graph, but there is no sign of that on my test data.)

The DFS needs to avoid revisiting nodes, so I need some kind of "state" for the search (currently a tuple; should I be using the State monad?). Ultimately the first search should return a reordered Vector, but for now I am keeping things simple by returning a list of the reordered Node indices, so that I can process the Vector in one pass afterwards.
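To illustrate the State-monad option mentioned above, here is a minimal sketch of my own (not the question's code): the "have I been here before?" check is threaded through `State` instead of being passed as an explicit tuple parameter.

```haskell
import Control.Monad.State
import qualified Data.IntSet as IS

-- Sketch: track visited nodes in a State monad rather than an explicit tuple.
-- Returns True the first time a node is seen, False on any revisit.
visit :: Int -> State IS.IntSet Bool
visit i = do
  vs <- get
  if IS.member i vs
    then return False                      -- already explored, skip
    else put (IS.insert i vs) >> return True

main :: IO ()
main = print $ evalState (mapM visit [1, 2, 1, 3]) IS.empty
-- prints [True,True,False,True]
```

The payoff is that recursive calls compose with `mapM_`/`mapM` instead of manually plumbing the updated tuple through every call site.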

I assume the problem lies in dfsInner . The code below "remembers" the visited nodes by updating the Explored field of each Node (the third guard). Although I tried to make it tail recursive, the code still seems to grow in memory use quickly. Do I need to force some strictness, and if so, how? (I have another version, used for a single search, that checks for previous visits by looking at the start nodes of unexplored edges on the stack and at a list of completed nodes. It does not grow as fast, but it never returns for any well-connected node.)
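On the strictness question: a generic illustration (not the question's code) of how a lazy accumulator in a tail-recursive loop builds a chain of thunks, and how a bang pattern forces it at each step.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Sketch: without the bang, `acc` accumulates a million-deep thunk
-- (0 + 1 + 2 + ...) that is only forced at the end; `!acc` evaluates
-- it on every iteration, keeping the loop in constant space.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)   -- prints 500000500000
```

The same idea applies to any accumulator threaded through a tail-recursive search: tail recursion alone does not prevent thunk build-up in Haskell; strictness does.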

However, it could also be foldr' , but how would I detect that?

This is notionally Coursera homework, but I'm not at the point where I could tick the honor-code box! Learning matters more, though, so I really don't want a copy/paste answer. What I have is not very elegant; it has an imperative feel to it, driven by the problem of carrying around a kind of state (see the third guard). I would welcome comments on design patterns.

    type NodeName = Int
    type Edges = [NodeName]
    type Explored = Bool
    type Stack = [(Int, Int)]

    data Node = Node NodeName Explored Edges Edges deriving (Eq, Show)
    type Graph = Vector Node

    main = do
      edges <- V.fromList `fmap` getEdges "SCC.txt"
      let maxIndex = fst $ V.last edges
          gr = createGraph maxIndex edges
          res = dfsOuter gr
      --return gr
      putStrLn $ show res

    dfsOuter gr =
      let tmp = V.foldr' callInner (gr, []) gr
      in snd tmp

    callInner :: Node -> (Graph, Stack) -> (Graph, Stack)
    callInner (Node idx _ fwd bwd) (gr, acc) =
      let (Node _ explored _ _) = gr V.! idx
      in case explored of
           True  -> (gr, acc)
           False ->
             let initialStack = map (\l -> (idx, l)) bwd
                 gr' = gr V.// [(idx, Node idx True fwd bwd)]
                 (gr'', newScc) = dfsInner idx initialStack (length acc) (gr', [])
             in (gr'', newScc ++ acc)

    dfsInner :: NodeName -> Stack -> Int -> (Graph, [(Int, Int)]) -> (Graph, [(Int, Int)])
    dfsInner start [] finishCounter (gr, acc) = (gr, (start, finishCounter) : acc)
    dfsInner start stack finishCounter (gr, acc)
      | nextStart /= start =
          -- no more places to go from this node
          dfsInner nextStart stack (finishCounter + 1) (gr, (start, finishCounter) : acc)
      | nextExplored =
          -- nextExplored || any (\(y,_) -> y == stack0Head) stack || any (\(x,_) -> x == stack0Head) acc =
          dfsInner start (tail stack) finishCounter (gr, acc)
      | otherwise =
          dfsInner nextEnd (add2Stack ++ stack) finishCounter
                   (gr V.// [(nextEnd, Node idx True nextLHS nextRHS)], acc)
          -- dfsInner gr stack0Head (add2Stack++stack) finishCounter acc
      where
        (nextStart, nextEnd) = head stack
        (Node idx nextExplored nextLHS nextRHS) = gr V.! nextEnd
        add2Stack = map (\l -> (nextEnd, l)) nextRHS
2 answers

Based on @Andras's gist, I rewrote my code as below. I did not use the Arrow functions, as I am unfamiliar with them, and my second depth-first search stylistically matches the first (instead of @Andras's filterM approach). The end result is that it completes in 20% of the time of Andras's code (21 s versus 114 s).

    import qualified Data.Vector as V
    import qualified Data.IntSet as IS
    import qualified Data.ByteString.Char8 as BS
    import Data.List
    import Control.Monad
    import Control.Monad.State
    --import Criterion.Main

    --getEdges :: String -> IO [(Int, Int)]
    getEdges file = do
      lines <- (map BS.words . BS.lines) `fmap` BS.readFile file
      let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
          pairs' = [(a, b) | [a, b] <- pairs]   -- adds 9 seconds
          maxIndex = fst $ last pairs'
          graph = createGraph maxIndex pairs'
      return graph

    main = do
      graph <- getEdges "SCC.txt"
      --let
      --maxIndex = fst $ V.last edges
      let fts = bwdLoop graph
          leaders = fst $ execState (fwdLoop graph fts) ([], IS.empty)
      print $ length leaders

    type Connections = [Int]
    data Node = Node { fwd, bwd :: Connections } deriving (Show)
    type Graph = V.Vector Node
    type Visited = IS.IntSet
    type FinishTime = Int
    type FinishTimes = [FinishTime]
    type Leaders = [Int]

    createGraph :: Int -> [(Int, Int)] -> Graph
    createGraph maxIndex pairs =
      let graph  = V.replicate (maxIndex + 1) (Node [] [])
          graph' = V.accum (\(Node f b) x -> Node (x:f) b) graph pairs
      in V.accum (\(Node f b) x -> Node f (x:b)) graph' $ map (\(a, b) -> (b, a)) pairs

    bwdLoop :: Graph -> FinishTimes
    bwdLoop g = fst $ execState (mapM_ go $ reverse [0 .. V.length g - 1]) ([], IS.empty)
      where
        go :: Int -> State (FinishTimes, Visited) ()
        go i = do
          (fTimes, vs) <- get
          let visited = IS.member i vs
          if not visited
            then do
              put (fTimes, IS.insert i vs)
              mapM_ go $ bwd $ g V.! i
              -- get state again after changes from mapM_
              (fTimes', vs') <- get
              put (i : fTimes', vs')
            else return ()

    fwdLoop :: Graph -> FinishTimes -> State (Leaders, Visited) ()
    fwdLoop _ [] = return ()
    fwdLoop g (i:fts) = do
      (ls, vs) <- get
      let visited = IS.member i vs
      if not visited
        then do
          put (i:ls, IS.insert i vs)
          mapM_ go $ fwd $ g V.! i
        else return ()
      fwdLoop g fts
      where
        go :: Int -> State (Leaders, Visited) ()
        go i = do
          (ls, vs) <- get
          let visited = IS.member i vs
          if not visited
            then do
              put (ls, IS.insert i vs)
              mapM_ go $ fwd $ g V.! i
            else return ()

In a nutshell:

Know your time complexities.

There are many subtle points to optimization, most of which don't matter much in everyday programming. But if you don't know the asymptotic complexities, programs often simply do not work at all.

Haskell libraries usually document their complexities, especially when an operation is not obviously efficient (linear or worse). In particular, all the complexities relevant to this question can be found in the documentation of Data.List and Data.Vector .

Your efficiency is killed by V.// . Vectors, boxed or unboxed, are immutable contiguous arrays in memory, so modifying one requires copying the entire vector. Since we do O(n) such modifications, the whole algorithm becomes O(n^2), which means copying roughly 2 terabytes with n = 500000. So marking visited nodes inside the vector is of little use. Instead, build up an IntSet of the indices as needed.
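A small sketch of that advice (my own example; it uses an IntMap adjacency map rather than the question's Vector, but the visited-tracking idea is the same): each `IS.insert` allocates only a handful of new tree nodes, where `V.//` would copy the whole array.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- Graph as adjacency lists; visited nodes live in an IntSet,
-- so marking a node is cheap instead of an O(n) vector copy.
type Adj = IM.IntMap [Int]

-- Iterative DFS with an explicit worklist; emits nodes in visit order.
dfs :: Adj -> Int -> [Int]
dfs g root = go [root] IS.empty
  where
    go []       _  = []
    go (i : is) vs
      | IS.member i vs = go is vs
      | otherwise      = i : go (IM.findWithDefault [] i g ++ is) (IS.insert i vs)

main :: IO ()
main = print $ dfs (IM.fromList [(0, [1, 2]), (1, [2]), (2, [0])]) 0
-- prints [0,1,2]
```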

initialStack (length acc) also looks very bad. It is almost never a good idea to use length on large lists, because it is also O(n). It is probably not as pernicious as // in your code, since it sits in a relatively rarely taken branch, but it would still cripple performance after we had fixed the vector issue.
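The usual fix is to carry the length alongside the accumulator instead of recomputing it. A sketch with a hypothetical `push` helper: the count is updated in O(1) at each step, so reading it never costs a list traversal.

```haskell
-- Sketch: pair the accumulator with its running length,
-- so the "finish counter" is an O(1) read instead of O(n) `length acc`.
push :: a -> ([a], Int) -> ([a], Int)
push x (acc, n) = (x : acc, n + 1)

main :: IO ()
main = do
  let (acc, n) = foldr push ([], 0) [10, 20, 30 :: Int]
  print (acc, n)            -- the cached count tracks the list as it grows
```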

Also, the search implementation strikes me as rather convoluted. A good start would be a faithful translation of the pseudocode on the Wikipedia page. Additionally, there is no need to store the indices in the nodes, since they are determined by the vector positions and the adjacency lists.
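For concreteness, here is one such translation of the textbook two-pass scheme into plain Haskell (a sketch of my own, using IntMap/IntSet rather than the question's Vector representation; all names are mine): pass 1 computes a decreasing-finish-time order on the transposed graph, pass 2 peels off one SCC per unvisited root in that order.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

type Adj = IM.IntMap [Int]

-- Reverse every edge; keep a (possibly empty) entry for every vertex.
transpose' :: Adj -> Adj
transpose' g = IM.fromListWith (++)
  ([(v, []) | v <- IM.keys g] ++ [(w, [v]) | (v, ws) <- IM.toList g, w <- ws])

-- DFS post-order over all vertices; later-finished vertices come first.
finishOrder :: Adj -> [Int]
finishOrder g = snd $ foldl visit (IS.empty, []) (IM.keys g)
  where
    visit (vs, order) v
      | IS.member v vs = (vs, order)
      | otherwise =
          let (vs', order') = foldl visit (IS.insert v vs, order)
                                          (IM.findWithDefault [] v g)
          in (vs', v : order')

-- Kosaraju: order from the transposed graph, components from the original.
sccs :: Adj -> [[Int]]
sccs g = go IS.empty (finishOrder (transpose' g))
  where
    go _  []         = []
    go vs (v : rest)
      | IS.member v vs = go vs rest
      | otherwise      = let (vs', comp) = collect (vs, []) v
                         in reverse comp : go vs' rest
    collect (vs, comp) v
      | IS.member v vs = (vs, comp)
      | otherwise      = foldl collect (IS.insert v vs, v : comp)
                               (IM.findWithDefault [] v g)

main :: IO ()
main = print $ sccs (IM.fromList [(0, [1]), (1, [2]), (2, [0, 3]), (3, [])])
```

Note this version recurses on the Haskell stack, so for a 5M-edge graph you would still want the explicit worklist or State-based formulation, but it shows how short the literal algorithm is once visited-tracking is an IntSet.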


Source: https://habr.com/ru/post/971149/
