I come from an HPC background, and I'm just starting to learn machine learning in general and TensorFlow in particular. I was initially surprised to learn that distributed TensorFlow communicates over TCP/IP by default, although in retrospect it makes sense given the kind of hardware Google runs most often. (A minimal sketch of what I mean by the default setup is below.)
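For reference, here is roughly what I understand the default setup to look like, using the TF 1.x `tf.train.Server` API over gRPC; the hostnames and ports are made up:

```python
import tensorflow as tf

# Hypothetical two-node cluster; gRPC over TCP/IP is the default transport.
cluster = tf.train.ClusterSpec({"worker": ["node0:2222", "node1:2222"]})

# Each process starts one server for its own task
# (task_index differs per node).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A tensor pinned to the other task must cross the network when fetched here.
with tf.device("/job:worker/task:1"):
    a = tf.constant(1.0)

with tf.Session(server.target) as sess:
    print(sess.run(a + 1.0))
```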
I am interested in experimenting with running TensorFlow with MPI on a cluster. From my point of view, this should be beneficial: latency should be much lower because MPI can use remote direct memory access (RDMA) between machines that do not share memory.
So my question is: why doesn't this approach seem more common, given the growing popularity of TensorFlow and machine learning? Is latency not actually the bottleneck? Is there something about the typical problems being solved that makes this approach unsuitable? Is there a meaningful difference between driving TensorFlow from MPI at the application level (a sketch of what I mean follows below) and implementing the MPI calls inside the TensorFlow library itself?
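To make that last question concrete, here is a rough sketch of the application-level variant I have in mind: each rank computes gradients on its local batch, and mpi4py averages them outside the TensorFlow graph. The toy model, shapes, and names are just placeholders, and I am assuming TF 1.x and mpi4py:

```python
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Toy linear model, purely for illustration.
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

opt = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars = opt.compute_gradients(loss)

# Feed the averaged gradient back in through a placeholder.
grad_in = tf.placeholder(tf.float32, shape=[10, 1])
apply_op = opt.apply_gradients([(grad_in, w)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Each rank would load its own shard of data; random here.
    xb = np.random.rand(32, 10).astype(np.float32)
    yb = np.random.rand(32, 1).astype(np.float32)

    # Pull the local gradient out of TF as a NumPy array.
    local_grad = sess.run(grads_and_vars[0][0], feed_dict={x: xb, y: yb})

    # Sum gradients across ranks with MPI, then average.
    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
    avg_grad /= comm.Get_size()

    sess.run(apply_op, feed_dict={grad_in: avg_grad})
```

This would be launched the usual MPI way, e.g. `mpirun -np 4 python train.py`. I realize the `Allreduce` here still round-trips through host memory, so a scheme like this may not even capture the RDMA benefit I am asking about, which is partly why I wonder whether the MPI calls need to live inside the library instead.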
Thanks!