Quickly parse a large UTF-8 text file in Haskell

I have a 300 MB file (link) containing UTF-8 text. I want to write a Haskell program equivalent to:

cat bigfile.txt | grep "^en " | wc -l 

This runs in about 2.6 seconds on my system.

Right now, I am reading the file as a regular String (readFile) and have this:

 main = do
     contents <- readFile "bigfile.txt"
     putStrLn $ show $ length $ lines contents

After a couple of seconds I get this error:

 Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence) 

I assume I need to read the file in some UTF-8-aware way? How can I do this both quickly and with UTF-8 support? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it does not support UTF-8.
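One thing I have been wondering about (not sure if it is the right approach) is forcing the handle encoding to UTF-8 with hSetEncoding from System.IO, something like:

 import System.IO

 main :: IO ()
 main = do
     h <- openFile "bigfile.txt" ReadMode
     hSetEncoding h utf8            -- decode as UTF-8 instead of the locale encoding
     contents <- hGetContents h
     print $ length $ lines contents

But I don't know whether that is the idiomatic way, or whether it is fast enough.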

1 answer

The utf8-string package provides support for reading and writing UTF-8 strings. It builds on ByteString, so the interface is likely to be very similar.
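For example, a minimal sketch of the grep "^en " | wc -l count on top of utf8-string's lazy interface (this assumes Data.ByteString.Lazy.UTF8 provides fromString and lines, and uses isPrefixOf from Data.ByteString.Lazy) might look like:

 import qualified Data.ByteString.Lazy as L
 import qualified Data.ByteString.Lazy.UTF8 as U

 main :: IO ()
 main = do
     contents <- L.readFile "bigfile.txt"     -- read the raw bytes lazily
     let en = U.fromString "en "              -- the prefix, UTF-8 encoded
     print $ length $ filter (en `L.isPrefixOf`) (U.lines contents)

Since '\n' and the ASCII prefix "en " are single bytes in UTF-8, this should give the same count as the shell pipeline.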

Another Unicode string project, likely related to the above and also inspired by ByteStrings, is discussed in this master's thesis.
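If that project is the one that later became the text package (my assumption), a comparable count with its lazy API could look roughly like this, decoding the bytes explicitly with decodeUtf8 so the locale encoding is never involved:

 import qualified Data.ByteString.Lazy as L
 import qualified Data.Text.Lazy as T
 import qualified Data.Text.Lazy.Encoding as TE

 main :: IO ()
 main = do
     bytes <- L.readFile "bigfile.txt"            -- raw bytes
     let ls = T.lines (TE.decodeUtf8 bytes)       -- decode as UTF-8, then split into lines
     print $ length $ filter (T.isPrefixOf (T.pack "en ")) ls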


Source: https://habr.com/ru/post/1381781/

