Quickly parse a large UTF-8 text file in Haskell

I have a 300 MB file (link) containing UTF-8 text. I want to write a Haskell program equivalent to:

cat bigfile.txt | grep "^en " | wc -l 

This runs in about 2.6 seconds on my system.

Right now, I am reading the file as a regular String (readFile) and have this:

 main = do
     contents <- readFile "bigfile.txt"
     putStrLn $ show $ length $ lines contents

After a couple of seconds I get this error:

 Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence) 

I assume I need to read the file in some UTF-8-aware way? How can I do this both quickly and with UTF-8 support? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it does not support UTF-8.
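One thing I have been wondering about (not sure if it is the right approach) is forcing the handle encoding to UTF-8 with hSetEncoding from System.IO, something like:

 import System.IO

 main :: IO ()
 main = do
     h <- openFile "bigfile.txt" ReadMode
     hSetEncoding h utf8            -- decode as UTF-8 instead of the locale encoding
     contents <- hGetContents h
     print $ length $ lines contents

But I don't know whether that is the idiomatic way, or whether it is fast enough.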

1 answer

The utf8-string package provides support for reading and writing UTF-8 strings. It builds on ByteString, so the interface is likely to be very similar.
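For example, a minimal sketch of the grep "^en " | wc -l count on top of utf8-string's lazy interface (this assumes Data.ByteString.Lazy.UTF8 provides fromString and lines, and uses isPrefixOf from Data.ByteString.Lazy) might look like:

 import qualified Data.ByteString.Lazy as L
 import qualified Data.ByteString.Lazy.UTF8 as U

 main :: IO ()
 main = do
     contents <- L.readFile "bigfile.txt"     -- read the raw bytes lazily
     let en = U.fromString "en "              -- the prefix, UTF-8 encoded
     print $ length $ filter (en `L.isPrefixOf`) (U.lines contents)

Since '\n' and the ASCII prefix "en " are single bytes in UTF-8, this should give the same count as the shell pipeline.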

Another Unicode string project, likely related to the above and also inspired by ByteStrings, is discussed in this master's thesis.
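If that project is the one that later became the text package (my assumption), a comparable count with its lazy API could look roughly like this, decoding the bytes explicitly with decodeUtf8 so the locale encoding is never involved:

 import qualified Data.ByteString.Lazy as L
 import qualified Data.Text.Lazy as T
 import qualified Data.Text.Lazy.Encoding as TE

 main :: IO ()
 main = do
     bytes <- L.readFile "bigfile.txt"            -- raw bytes
     let ls = T.lines (TE.decodeUtf8 bytes)       -- decode as UTF-8, then split into lines
     print $ length $ filter (T.isPrefixOf (T.pack "en ")) ls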


Source: https://habr.com/ru/post/1381781/

