Haskell Thrift library is 300 times slower than C++ in a performance test

I am creating an application that consists of two components - a server written in Haskell and a client written in Qt (C++). I use Thrift for the communication between them, and I wonder why it is so slow.

I did a performance test and here is the result on my machine

results

    C++ server and C++ client:
    Sending 100 pings - 13.37 ms
    Transferring 1000000 size vector - 433.58 ms
    Received: 3906.25 kB
    Transferring 100000 items from server - 1090.19 ms
    Transferring 100000 items to server - 631.98 ms

    Haskell server and C++ client:
    Sending 100 pings - 3959.97 ms
    Transferring 1000000 size vector - 12481.40 ms
    Received: 3906.25 kB
    Transferring 100000 items from server - 26066.80 ms
    Transferring 100000 items to server - 1805.44 ms

Why is Haskell so slow in this test? How can I improve its performance?

Here are the files:

Files

performance.thrift

    namespace hs test
    namespace cpp test

    struct Item {
        1: optional string name
        2: optional list<i32> coordinates
    }

    struct ItemPack {
        1: optional list<Item> items
        2: optional map<i32, Item> mappers
    }

    service ItemStore {
        void ping()
        ItemPack getItems(1: string name, 2: i32 count)
        bool setItems(1: ItemPack items)
        list<i32> getVector(1: i32 count)
    }

Main.hs

    {-# LANGUAGE ScopedTypeVariables #-}
    module Main where

    import Data.Int
    import Data.Maybe (fromJust)
    import qualified Data.Vector as Vector
    import qualified Data.HashMap.Strict as HashMap
    import Network

    -- Thrift libraries
    import Thrift.Server

    -- Generated Thrift modules
    import Performance_Types
    import ItemStore_Iface
    import ItemStore

    i32toi :: Int32 -> Int
    i32toi = fromIntegral

    itoi32 :: Int -> Int32
    itoi32 = fromIntegral

    port :: PortNumber
    port = 9090

    data ItemHandler = ItemHandler

    instance ItemStore_Iface ItemHandler where
        ping _ = return () --putStrLn "ping"

        getItems _ mtname mtsize = do
            let size = i32toi $ fromJust mtsize
                item i = Item mtname (Just $ Vector.fromList $ map itoi32 [i..100])
                items = map item [0..(size-1)]
                itemsv = Vector.fromList items
                mappers = zip (map itoi32 [0..(size-1)]) items
                mappersh = HashMap.fromList mappers
                itemPack = ItemPack (Just itemsv) (Just mappersh)
            putStrLn "getItems"
            return itemPack

        setItems _ _ = do
            putStrLn "setItems"
            return True

        getVector _ mtsize = do
            putStrLn "getVector"
            let size = i32toi $ fromJust mtsize
            return $ Vector.generate size itoi32

    main :: IO ()
    main = do
        _ <- runBasicServer ItemHandler process port
        putStrLn "Server stopped"

ItemStore_client.cpp

    #include <iostream>
    #include <chrono>

    #include "gen-cpp/ItemStore.h"

    #include <transport/TSocket.h>
    #include <transport/TBufferTransports.h>
    #include <protocol/TBinaryProtocol.h>

    using namespace apache::thrift;
    using namespace apache::thrift::protocol;
    using namespace apache::thrift::transport;
    using namespace test;
    using namespace std;

    #define TIME_INIT std::chrono::steady_clock::time_point start, stop; \
        std::chrono::duration<long long int, std::ratio<1ll, 1000000000ll> > duration;
    #define TIME_START start = std::chrono::steady_clock::now();
    #define TIME_END duration = std::chrono::steady_clock::now() - start; \
        std::cout << chrono::duration<double, std::milli>(duration).count() << " ms" << std::endl;

    int main(int argc, char **argv) {

        boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
        boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
        boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
        ItemStoreClient server(protocol);
        transport->open();

        TIME_INIT

        long pings = 100;
        cout << "Sending " << pings << " pings" << endl;
        TIME_START
        for (auto i = 0; i < pings; ++i)
            server.ping();
        TIME_END

        long vectorSize = 1000000;
        cout << "Transferring " << vectorSize << " size vector" << endl;
        std::vector<int> v;
        TIME_START
        server.getVector(v, vectorSize);
        TIME_END
        cout << "Received: " << v.size() * sizeof(int) / 1024.0 << " kB" << endl;

        long itemsSize = 100000;
        cout << "Transferring " << itemsSize << " items from server" << endl;
        ItemPack items;
        TIME_START
        server.getItems(items, "test", itemsSize);
        TIME_END

        cout << "Transferring " << itemsSize << " items to server" << endl;
        TIME_START
        server.setItems(items);
        TIME_END

        transport->close();

        return 0;
    }

ItemStore_server.cpp

    #include "gen-cpp/ItemStore.h"

    #include <thrift/protocol/TBinaryProtocol.h>
    #include <thrift/server/TSimpleServer.h>
    #include <thrift/transport/TServerSocket.h>
    #include <thrift/transport/TBufferTransports.h>

    #include <map>
    #include <vector>

    using namespace ::apache::thrift;
    using namespace ::apache::thrift::protocol;
    using namespace ::apache::thrift::transport;
    using namespace ::apache::thrift::server;
    using namespace test;
    using boost::shared_ptr;

    class ItemStoreHandler : virtual public ItemStoreIf {
      public:
        ItemStoreHandler() {}

        void ping() {
            // printf("ping\n");
        }

        void getItems(ItemPack& _return, const std::string& name, const int32_t count) {
            std::vector<Item> items;
            std::map<int, Item> mappers;

            for (auto i = 0; i < count; ++i) {
                std::vector<int> coordinates;
                for (auto c = i; c < 100; ++c)
                    coordinates.push_back(c);

                Item item;
                item.__set_name(name);
                item.__set_coordinates(coordinates);
                items.push_back(item);
                mappers[i] = item;
            }
            _return.__set_items(items);
            _return.__set_mappers(mappers);
            printf("getItems\n");
        }

        bool setItems(const ItemPack& items) {
            printf("setItems\n");
            return true;
        }

        void getVector(std::vector<int32_t>& _return, const int32_t count) {
            for (auto i = 0; i < count; ++i)
                _return.push_back(i);
            printf("getVector\n");
        }
    };

    int main(int argc, char **argv) {
        int port = 9090;
        shared_ptr<ItemStoreHandler> handler(new ItemStoreHandler());
        shared_ptr<TProcessor> processor(new ItemStoreProcessor(handler));
        shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
        shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
        shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());
        TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
        server.serve();
        return 0;
    }

Makefile

    GEN_SRC := gen-cpp/ItemStore.cpp gen-cpp/performance_constants.cpp gen-cpp/performance_types.cpp
    GEN_OBJ := $(patsubst %.cpp,%.o, $(GEN_SRC))

    THRIFT_DIR := /usr/local/include/thrift
    BOOST_DIR := /usr/local/include

    INC := -I$(THRIFT_DIR) -I$(BOOST_DIR)

    .PHONY: all clean

    all: ItemStore_server ItemStore_client

    %.o: %.cpp
    	$(CXX) --std=c++11 -Wall -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H $(INC) -c $< -o $@

    ItemStore_server: ItemStore_server.o $(GEN_OBJ)
    	$(CXX) $^ -o $@ -L/usr/local/lib -lthrift -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H

    ItemStore_client: ItemStore_client.o $(GEN_OBJ)
    	$(CXX) $^ -o $@ -L/usr/local/lib -lthrift -DHAVE_INTTYPES_H -DHAVE_NETINET_IN_H

    clean:
    	$(RM) *.o ItemStore_server ItemStore_client

Compile and run

I generated the files (using Thrift 0.9, available here) with:

    $ thrift --gen cpp performance.thrift
    $ thrift --gen hs performance.thrift

Compile with

    $ make
    $ ghc Main.hs gen-hs/ItemStore_Client.hs gen-hs/ItemStore.hs gen-hs/ItemStore_Iface.hs gen-hs/Performance_Consts.hs gen-hs/Performance_Types.hs -Wall -O2

Run the Haskell test:

    $ ./Main &
    $ ./ItemStore_client

Run the C++ test:

    $ ./ItemStore_server &
    $ ./ItemStore_client

Remember to kill the server after each test.

Update

I changed getVector to use Vector.generate instead of Vector.fromList, but it had no effect.

Update 2

Following @MdxBhmt's suggestion, I tested the getItems function as follows:

    getItems _ mtname mtsize = do
        let size = i32toi $! fromJust mtsize
            item i = Item mtname (Just $! Vector.enumFromN (i :: Int32) (100 - fromIntegral i))
            itemsv = Vector.map item $ Vector.enumFromN 0 (size-1)
            itemPack = ItemPack (Just itemsv) Nothing
        putStrLn "getItems"
        return itemPack

which is strict and improves the vector generation, compared with this alternative based on my initial implementation:

    getItems _ mtname mtsize = do
        let size = i32toi $ fromJust mtsize
            item i = Item mtname (Just $ Vector.fromList $ map itoi32 [i..100])
            items = map item [0..(size-1)]
            itemsv = Vector.fromList items
            itemPack = ItemPack (Just itemsv) Nothing
        putStrLn "getItems"
        return itemPack

Note that no HashMap is sent. The first version takes 12338.2 ms, the second 11698.7 ms; no speedup :(

Update 3

I reported a Thrift Jira issue

Update 4 (following abhinav's suggestion)

This is completely unscientific, but using GHC 7.8.3 with Thrift 0.9.2 and @MdxBhmt's version of getItems, the discrepancy is greatly reduced.

    C++ server and C++ client:
    Sending 100 pings: 8.56 ms
    Transferring 1000000 size vector: 137.97 ms
    Received: 3906.25 kB
    Transferring 100000 items from server: 467.78 ms
    Transferring 100000 items to server: 207.59 ms

    Haskell server and C++ client:
    Sending 100 pings: 24.95 ms
    Received: 3906.25 kB
    Transferring 1000000 size vector: 378.60 ms
    Transferring 100000 items from server: 233.74 ms
    Transferring 100000 items to server: 913.07 ms

Several executions were performed, each time restarting the server. The results are reproducible.

Note that the source code of the original question (with @MdxBhmt's getItems implementation) will not compile as is. The following changes have to be made:

    getItems _ mtname mtsize = do
        let size = i32toi $! fromJust mtsize
            item i = Item mtname (Just $! Vector.enumFromN (i :: Int32) (100 - fromIntegral i))
            itemsv = Vector.map item $ Vector.enumFromN 0 (size-1)
            itemPack = ItemPack (Just itemsv) Nothing
        putStrLn "getItems"
        return itemPack

    getVector _ mtsize = do
        putStrLn "getVector"
        let size = i32toi $ fromJust mtsize
        return $ Vector.generate size itoi32
+42
c++ performance haskell networking thrift
Oct 22 '13 at 8:38
5 answers

Everyone points out that the culprit is the Thrift library, but I'll focus on your code (and on where I can help you gain some speed).

Using a simplified version of your code that computes itemsv:

    testfunc mtsize = itemsv
      where size = i32toi $ fromJust mtsize
            item i = Item (Just $ Vector.fromList $ map itoi32 [i..100])
            items = map item [0..(size-1)]
            itemsv = Vector.fromList items

First, you have some intermediate data created in item i. Because of laziness, those small, fast-to-compute vectors become bulky delayed thunks, when we could have them right away.

Adding two carefully placed $!, which force strict evaluation:

  item i = Item (Just $! Vector.fromList $! map itoi32 [i..100]) 

gives you a 25% reduction in run time (for sizes 1e5 and 1e6).
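To see the effect of ($!) in isolation, here is a minimal self-contained sketch (mkLazy and mkStrict are hypothetical names, not from the question's code); both build the same vector, but the strict version materialises it at construction time instead of leaving a thunk behind:

```haskell
import qualified Data.Vector as Vector
import Data.Int (Int32)

-- Hypothetical helpers illustrating the ($!) placement above.
-- Without ($!), 'Just' captures an unevaluated thunk that keeps the
-- [i..100] list computation alive; with ($!), the vector is built as
-- soon as the value is constructed.
mkLazy :: Int32 -> Maybe (Vector.Vector Int32)
mkLazy i = Just (Vector.fromList [i .. 100])

mkStrict :: Int32 -> Maybe (Vector.Vector Int32)
mkStrict i = Just $! Vector.fromList [i .. 100]
```

Both produce the same value; the difference is only in when the work happens, and in how much memory the delayed thunks pin down in the meantime.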

But there is a more problematic pattern here: you create a list only to convert it to a vector, instead of creating the vector directly.

Look at the last two lines: you create a list → map a function over it → convert it to a vector.

Well, vectors are very similar to lists; you can do the same with them! So: create a vector → Vector.map over it, done. You no longer need to convert a list into a vector, and mapping over a vector is usually faster than mapping over a list!
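As a standalone sketch (toy names and a toy (* 2) transformation, assumed for illustration), the two shapes side by side:

```haskell
import qualified Data.Vector as Vector

-- List-based: build an intermediate list, then copy it into a vector.
viaList :: Int -> Vector.Vector Int
viaList n = Vector.fromList (map (* 2) [0 .. n - 1])

-- Vector-based: generate and map with no intermediate list at all.
direct :: Int -> Vector.Vector Int
direct n = Vector.map (* 2) (Vector.enumFromN 0 n)
```

Both compute the same vector; the second avoids allocating and traversing the throwaway list.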

So you can get rid of items and rewrite itemsv as:

  itemsv = Vector.map item $ Vector.enumFromN 0 (size-1) 

Applying the same logic to item i, we eliminate all the lists:

    testfunc3 mtsize = itemsv
      where size = i32toi $! fromJust mtsize
            item i = Item (Just $! Vector.enumFromN (i :: Int32) (100 - fromIntegral i))
            itemsv = Vector.map item $ Vector.enumFromN 0 (size-1)

This is a 50% reduction over the initial run time.

+26
Oct 23 '13 at 0:12

You should take a look at Haskell profiling techniques to find out what resources your program is using/allocating, and where.

The profiling chapter of Real World Haskell is a good starting point.

+12
Oct 22 '13 at 16:29

This is consistent with what user 13232 says: the Haskell implementation of Thrift performs a large number of small reads.

E.g. in Thrift.Protocol.Binary:

    readI32 p = do
        bs <- tReadAll (getTransport p) 4
        return $ Data.Binary.decode bs

Let's ignore the other odd bits and just focus on that. It says: "to read a 32-bit int: read 4 bytes from the transport, then decode this lazy bytestring."

The transport method reads exactly 4 bytes using lazy bytestring's hGet. hGet will do the following: allocate a buffer of 4 bytes, then use hGetBuf to fill that buffer. hGetBuf may use an internal buffer, depending on how the Handle was initialized.

So there may be some buffering. Even so, this means Haskell Thrift performs a read/decode cycle for each integer individually, allocating a small memory buffer each time. Ouch!
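The shape of the problem can be sketched with the binary package alone (this mirrors the read/decode cycle described above, not Thrift's actual code): decoding each 32-bit word from its own 4-byte slice, versus a single pass over the whole buffer:

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32be)
import Data.Binary.Put (runPut, putWord32be)
import Data.Word (Word32)

-- Encode a list of 32-bit words big-endian, as Thrift's binary protocol does.
encode :: [Word32] -> BL.ByteString
encode = runPut . mapM_ putWord32be

-- One read/decode cycle per value: split off 4 bytes, decode, repeat.
-- This is the per-integer pattern described above.
oneByOne :: Int -> BL.ByteString -> [Word32]
oneByOne 0 _  = []
oneByOne n bs = let (h, t) = BL.splitAt 4 bs
                in runGet getWord32be h : oneByOne (n - 1) t

-- Batched: decode all values in a single pass over the buffer.
batched :: Int -> BL.ByteString -> [Word32]
batched n = runGet (mapM (const getWord32be) [1 .. n])
```

Both return the same values, but the batched version touches the input once instead of allocating a fresh slice per integer.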

I really see no way to fix this short of modifying the Thrift library to handle bigger bytestring reads.

Then there are other oddities in the Thrift implementation: using type classes where a plain record of functions would do. Although they look similar and can act like a record of functions (and are even implemented as one under the hood), they should not be treated as such. See "existential type quantification":

One odd part of the test implementation:

  • generating a list of Ints, only to immediately convert them to Int32, only to immediately pack them into a Vector Int32. Generating the vector directly would be sufficient and fast.

Though I suspect this is not the main source of the performance problem.

+12
Oct 22 '13 at 23:16

I don't see any mention of buffering on the Haskell server. In C++, if you don't buffer, you incur one system call per vector/list element. I suspect the same is happening on the Haskell server.
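For reference, Haskell I/O buffering is controlled per Handle via hSetBuffering; a minimal sketch (the file name demo_out.txt is a hypothetical scratch file) of forcing block buffering so that many small writes coalesce into few system calls:

```haskell
import System.IO

-- Push many small writes through a block-buffered handle; the runtime
-- coalesces them instead of issuing one write per hPutStr.
-- "demo_out.txt" is a hypothetical scratch file.
writeBuffered :: IO ()
writeBuffered = do
  h <- openFile "demo_out.txt" WriteMode
  hSetBuffering h (BlockBuffering (Just 4096))  -- 4 kB buffer
  mapM_ (hPutStr h . show) [0 .. 9 :: Int]
  hClose h
```

The analogous question for the Haskell server is what buffering mode, if any, Thrift sets on its connection handles.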

I don't see a buffered transport for Haskell directly. As an experiment, you could change both the client and the server to use a framed transport. Haskell does have a framed transport, and it is buffered. Note that this changes the wire layout.

As a separate experiment, you might turn off buffering for C++ and see whether the performance numbers become comparable.

+10
Oct 22 '13 at 22:01

The Haskell implementation of the basic Thrift server uses threads internally, but you did not compile it to use multiple cores.

To repeat the test using multiple cores, add -rtsopts and -threaded to the command line that compiles the Haskell program, then run the resulting binary like ./Main +RTS -N4 -RTS &, where 4 is the number of cores to use.

+6
Oct 22 '13 at 11:43


