Parsing a large XML file in Haskell with low memory usage

I have been experimenting with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the I/O chapter of Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that if I ran the following code, the input would be garbage collected as I went through it.

However, when I run it on a large file, memory usage keeps growing for as long as it runs.

runghc parse.hs bigfile.xml 

What am I doing wrong? Is my assumption wrong? Does the map/filter force everything to be evaluated?

    import qualified Data.ByteString.Lazy as BSL
    import qualified Data.ByteString.Lazy.UTF8 as U
    import Prelude hiding (readFile)
    import Text.XML.Expat.SAX
    import System.Environment (getArgs)

    main :: IO ()
    main = do
        args <- getArgs
        contents <- BSL.readFile (head args)
        -- putStrLn $ U.toString contents
        let events = parse defaultParseOptions contents
        mapM_ print $ map getTMSId $ filter isEvent events

    isEvent :: SAXEvent String String -> Bool
    isEvent (StartElement "event" as) = True
    isEvent _ = False

    getTMSId :: SAXEvent String String -> Maybe String
    getTMSId (StartElement _ as) = lookup "TMSId" as

My ultimate goal is to parse a huge XML file through a simple SAX-like interface. I do not want to have to know the whole document structure just to be notified that I have found an "event".
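For reference, the same filter can be written as a single lazy list comprehension. This is only a sketch of the question's program, not a fix for the memory growth: `eventIds` is a hypothetical helper name, and unlike the original it skips "event" elements that have no TMSId attribute and prints the bare string instead of a `Maybe`. It assumes the hexpat `Text.XML.Expat.SAX` API as used in the question.

```haskell
import qualified Data.ByteString.Lazy as BSL
import System.Environment (getArgs)
import Text.XML.Expat.SAX (SAXEvent (..), defaultParseOptions, parse)

-- | Lazily collect the TMSId attribute of every <event> start tag.
-- (Hypothetical helper; events without a TMSId attribute are skipped.)
eventIds :: [SAXEvent String String] -> [String]
eventIds events =
  [ tmsId
  | StartElement "event" attrs <- events   -- non-matching events are dropped
  , Just tmsId <- [lookup "TMSId" attrs]   -- missing attribute is dropped
  ]

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      contents <- BSL.readFile path        -- lazy read, consumed in chunks
      mapM_ putStrLn (eventIds (parse defaultParseOptions contents))
    _ -> putStrLn "usage: parse FILE.xml"
```

Because the pattern match happens inside the comprehension, there is no partial function like `getTMSId` that would fail on an event other than `StartElement`.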

2 answers

I am the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for bringing it to my attention.

The bug is new on ghc-7.2.1. It comes from an interaction I did not expect between a where clause that binds a triple and unsafePerformIO, which I need in order to make the interaction with the C code appear pure in Haskell.


This appears to be a problem with hexpat. Even when the program is compiled with optimization and does nothing more than a simple task such as length, memory use is linear in the size of the input.

Looking at the hexpat source, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking whether this is the expected behavior. Either way it should have been mentioned in the haddocks, but resource consumption is too often ignored in library documentation.
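The answer does not show its test program, so the following is only a guess at what such a check could look like: it counts the SAX events and does nothing else. With a parser that streams properly, each event could be discarded as soon as it is counted.

```haskell
import qualified Data.ByteString.Lazy as BSL
import System.Environment (getArgs)
import Text.XML.Expat.SAX (SAXEvent, defaultParseOptions, parse)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      contents <- BSL.readFile path
      -- The type annotation fixes the tag and text types to String,
      -- matching the question's code.
      let events = parse defaultParseOptions contents :: [SAXEvent String String]
      -- length only walks the list spine; nothing here retains the events.
      print (length events)
    _ -> putStrLn "usage: count FILE.xml"
```

Assuming the file is saved as count.hs, building it with `ghc -O2 -rtsopts count.hs` and running `./count bigfile.xml +RTS -s` prints GHC's runtime statistics, including maximum residency, which shows whether memory grows with the size of the input.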


Source: https://habr.com/ru/post/901123/

