"Better" is subjective. What makes something better? Less code? Lower memory consumption? Less time?
That said, your algorithm is computationally inefficient, since snoc is O(n), where n is the length of the (intermediate) vector. You have therefore created an O(n * n) algorithm (n snocs on a vector of growing length, so 1 + 2 + 3 + ... + (n - 1) = n * (n - 1) / 2 element copies).
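To make that quadratic blow-up concrete, here is a small sketch counting how many element copies the snoc-based construction performs (copiesFor is a made-up helper for illustration, not part of any library):

```haskell
-- Building a vector one V.snoc at a time copies the whole intermediate
-- vector on every step (immutable vectors cannot be extended in place),
-- so a vector of length n costs 0 + 1 + ... + (n - 1) element copies.
copiesFor :: Int -> Int
copiesFor n = sum [0 .. n - 1]   -- equals n * (n - 1) `div` 2

main :: IO ()
main = do
  print (copiesFor 12)     -- 66 copies for "Hello, World"
  print (copiesFor 10000)  -- 49995000 copies for the 10,000-char input
```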
Depending on the constants hidden in the Big-O notation, you would expect your function to be slow on large inputs. But before we benchmark it, let's think about alternatives:
- we can go through an intermediate data structure that both types provide conversions for, e.g. [Char] via V.fromList and T.unpack
- we can unfold the vector with V.unfoldr, peeling the individual characters off the T.Text with T.uncons.
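Written out as definitions, the two alternatives look like this (viaList and viaUnfoldr are names invented here for illustration):

```haskell
import qualified Data.Text as T
import qualified Data.Vector as V

-- Alternative 1: go through a plain list as the intermediate structure.
viaList :: T.Text -> V.Vector Char
viaList = V.fromList . T.unpack

-- Alternative 2: unfold the vector, peeling one Char off the Text
-- at a time with T.uncons until it returns Nothing.
viaUnfoldr :: T.Text -> V.Vector Char
viaUnfoldr = V.unfoldr T.uncons

main :: IO ()
main = do
  print (V.toList (viaList (T.pack "abc")))     -- "abc"
  print (V.toList (viaUnfoldr (T.pack "abc")))  -- "abc"
```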
So let's use Criterion to write some benchmarks:
```haskell
module Main where

import Criterion
import Criterion.Main
import Control.Monad (replicateM)
import qualified Data.Text as T
import qualified Data.Vector as V
import System.Random (randomRIO)

setupEnv n = fmap T.pack $ replicateM n (randomRIO (' ', '~'))

textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t
  where append c vec = V.snoc vec c

example :: T.Text
example = T.pack "Hello, World"

benchOn :: T.Text -> [Benchmark]
benchOn ~e = map (\(s, f) -> bench s $ whnf f e)
  [ ("textToVec",                     textToVec)
  , ("T.foldr (flip V.snoc) V.empty", T.foldr (flip V.snoc) V.empty)
  , ("V.fromList . T.unpack",         V.fromList . T.unpack)
  , ("V.unfoldr T.uncons",            V.unfoldr T.uncons)
  ]

main = defaultMain
  [ bgroup "example" $ benchOn example
  , env (setupEnv 1000)  $ \ ~small -> bgroup "1000"  $ benchOn small
  , env (setupEnv 10000) $ \ ~large -> bgroup "10000" $ benchOn large
  ]
```
I used lts-5 to compile it:
```
stack exec --package random \
           --package vector \
           --package text \
           --resolver=lts-5 \
           -- ghc -O2 Benchmark.hs
```
Benchmarking results
First we run all the variants on "Hello, World", then on a randomly generated string of length 1000, and then on one of length 10000.
"Hello World"
```
benchmarking example/textToVec
time                 1.106 us   (1.098 us .. 1.117 us)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.102 us   (1.099 us .. 1.107 us)
std dev              12.67 ns   (6.277 ns .. 21.00 ns)

benchmarking example/T.foldr (flip V.snoc) V.empty
time                 1.225 us   (1.222 us .. 1.229 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.229 us   (1.226 us .. 1.232 us)
std dev              10.48 ns   (7.634 ns .. 16.00 ns)

benchmarking example/V.fromList . T.unpack
time                 643.6 ns   (640.9 ns .. 646.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 643.6 ns   (641.7 ns .. 645.7 ns)
std dev              6.525 ns   (5.573 ns .. 7.860 ns)

benchmarking example/V.unfoldr T.uncons
time                 518.1 ns   (516.1 ns .. 520.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 518.0 ns   (516.1 ns .. 520.1 ns)
std dev              6.541 ns   (4.943 ns .. 8.479 ns)
variance introduced by outliers: 11% (moderately inflated)
```
As you can see, unfolding the vector or going through a list takes about half the time of your function: both the unfoldr approach and V.fromList . T.unpack finish in roughly half a microsecond.
Random text with 1000 characters
```
benchmarking 1000/textToVec
time                 1.311 ms   (1.302 ms .. 1.320 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.342 ms   (1.331 ms .. 1.366 ms)
std dev              55.31 us   (27.75 us .. 96.80 us)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 1000/T.foldr (flip V.snoc) V.empty
time                 6.013 ms   (5.953 ms .. 6.081 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 6.054 ms   (6.028 ms .. 6.097 ms)
std dev              100.8 us   (62.45 us .. 180.7 us)

benchmarking 1000/V.fromList . T.unpack
time                 26.83 us   (26.77 us .. 26.92 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 26.90 us   (26.83 us .. 27.00 us)
std dev              264.1 ns   (209.7 ns .. 348.2 ns)

benchmarking 1000/V.unfoldr T.uncons
time                 15.05 us   (14.93 us .. 15.19 us)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.02 us   (14.95 us .. 15.15 us)
std dev              303.0 ns   (206.6 ns .. 428.3 ns)
variance introduced by outliers: 19% (moderately inflated)
```
Here it gets interesting. Your function does better than its point-free variant; one would have to look at GHC's Core to see where the bottleneck is.
That being said, for a string about 100 times as long as our original (length "Hello, World" == 12), our two contenders V.unfoldr T.uncons and V.fromList . T.unpack take roughly 30 and 45 times as long as in the first example.
Your function, on the other hand, now needs 1 ms where it previously needed 1 us: 100 times the data, 1000 times the time. Uh oh...
Random text with 10,000 characters
```
benchmarking 10000/textToVec
time                 190.9 ms   (188.7 ms .. 192.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 192.5 ms   (191.8 ms .. 193.6 ms)
std dev              1.096 ms   (684.4 us .. 1.426 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 10000/T.foldr (flip V.snoc) V.empty
time                 859.3 ms   (856.5 ms .. 860.9 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 861.9 ms   (860.7 ms .. 862.6 ms)
std dev              1.042 ms   (0.0 s .. 1.173 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 10000/V.fromList . T.unpack
time                 567.8 us   (565.5 us .. 570.6 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 571.7 us   (569.8 us .. 574.0 us)
std dev              7.132 us   (5.674 us .. 9.545 us)

benchmarking 10000/V.unfoldr T.uncons
time                 200.6 us   (199.4 us .. 201.8 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 200.2 us   (199.4 us .. 201.2 us)
std dev              2.948 us   (2.240 us .. 3.837 us)
```
We have increased the data size tenfold. Your function takes about 150 times longer than in the previous run, while unfoldr and fromList take only about 13 and 21 times longer, respectively.
Winner
According to Criterion, V.unfoldr T.uncons takes the cake. So how would you, as a newcomer to Haskell, have found that function?
Well, at first it is not that obvious. The "obvious" approach is usually to go via a list, since lists are everywhere in Haskell, and frankly, V.fromList . T.unpack performs quite well. Moreover, that trick works for every data structure that provides fromList (as target) or toList (as source), so it works for any Foldable instance (which Text, notably, is not).
You might have found it yourself if you had looked at the documentation and seen the types of V.unfoldr and T.uncons side by side:
```haskell
V.unfoldr :: (b -> Maybe (a, b)) -> b -> V.Vector a
T.uncons  :: T.Text -> Maybe (Char, T.Text)

V.unfoldr T.uncons :: T.Text -> V.Vector Char
```
But now you know that the unfoldr / uncons pattern exists, so you know how to use it in the future.
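As a final illustration, the same unfold/uncons shape is not specific to Vector and Text: Data.List.unfoldr from base composes with T.uncons just as well, and the seed does not have to be a container being consumed (countdown below is a made-up example):

```haskell
import Data.List (unfoldr)
import qualified Data.Text as T

-- Same shape as V.unfoldr T.uncons, but producing a plain list:
textToString :: T.Text -> String
textToString = unfoldr T.uncons

-- unfoldr works with any seed; here the seed is a plain counter
-- that we decrement until it hits zero:
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n <= 0 then Nothing else Just (n, n - 1))

main :: IO ()
main = do
  putStrLn (textToString (T.pack "Hello"))  -- Hello
  print (countdown 5)                       -- [5,4,3,2,1]
```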