I come from an HPC background, and I'm just starting to learn machine learning in general and TensorFlow in particular. I was initially surprised to learn that distributed TensorFlow communicates over TCP/IP by default, although in retrospect it makes sense given the kind of hardware Google runs most often. (A minimal sketch of what I mean by the default setup is below.)
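For reference, here is roughly what I understand the default setup to look like, using the TF 1.x `tf.train.Server` API over gRPC; the hostnames and ports are made up:

```python
import tensorflow as tf

# Hypothetical two-node cluster; gRPC over TCP/IP is the default transport.
cluster = tf.train.ClusterSpec({"worker": ["node0:2222", "node1:2222"]})

# Each process starts one server for its own task
# (task_index differs per node).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A tensor pinned to the other task must cross the network when fetched here.
with tf.device("/job:worker/task:1"):
    a = tf.constant(1.0)

with tf.Session(server.target) as sess:
    print(sess.run(a + 1.0))
```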
I am interested in experimenting with running TensorFlow with MPI on a cluster. From my point of view, this should be beneficial: latency should be much lower because MPI can use remote direct memory access (RDMA) between machines that do not share memory.
So my question is: why doesn't this approach seem more common, given the growing popularity of TensorFlow and machine learning? Is latency not actually the bottleneck? Is there something about the typical problems being solved that makes this approach unsuitable? Is there a meaningful difference between driving TensorFlow from MPI at the application level (a sketch of what I mean follows below) and implementing the MPI calls inside the TensorFlow library itself?
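To make that last question concrete, here is a rough sketch of the application-level variant I have in mind: each rank computes gradients on its local batch, and mpi4py averages them outside the TensorFlow graph. The toy model, shapes, and names are just placeholders, and I am assuming TF 1.x and mpi4py:

```python
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Toy linear model, purely for illustration.
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

opt = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars = opt.compute_gradients(loss)

# Feed the averaged gradient back in through a placeholder.
grad_in = tf.placeholder(tf.float32, shape=[10, 1])
apply_op = opt.apply_gradients([(grad_in, w)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Each rank would load its own shard of data; random here.
    xb = np.random.rand(32, 10).astype(np.float32)
    yb = np.random.rand(32, 1).astype(np.float32)

    # Pull the local gradient out of TF as a NumPy array.
    local_grad = sess.run(grads_and_vars[0][0], feed_dict={x: xb, y: yb})

    # Sum gradients across ranks with MPI, then average.
    avg_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
    avg_grad /= comm.Get_size()

    sess.run(apply_op, feed_dict={grad_in: avg_grad})
```

This would be launched the usual MPI way, e.g. `mpirun -np 4 python train.py`. I realize the `Allreduce` here still round-trips through host memory, so a scheme like this may not even capture the RDMA benefit I am asking about, which is partly why I wonder whether the MPI calls need to live inside the library instead.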
Thanks!