Distributed training breaks when moving from TF v1.3 to v1.4: "UnavailableError: Trying to connect an http1.x server"

Question

When creating a managed session for distributed training with this line:

with sv.managed_session(server.target, config=config) as sess, sess.as_default():

I get this error (full stack trace below) on the chief worker:

tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

Everything still seems fine on the parameter server, which reports:

E1106 11:26:32.844686639    5543 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:26:32.851773: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12222}   
2017-11-06 11:26:32.851863: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223}
2017-11-06 11:26:32.856802: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12222

I only get this error with the new TensorFlow v1.4 built from source (I see the same problem when installing from pip). Everything works fine in v1.3. Does anyone know whether there was a breaking change, presumably in how TensorFlow works with gRPC?

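My guess is that this is an http2 vs http1 issue: gRPC, which carries the protobuf messages, needs http2, but for some reason it only finds http1, and this changed between v1.3 and v1.4.

In short, what causes

UnavailableError: Trying to connect an http1.x server

and how can it be fixed?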

Environment: RedHat Linux ...

Full stack trace:

E1106 11:28:24.383745692    5787 ev_epoll1_linux.c:1051]     grpc epoll fd: 8
2017-11-06 11:28:24.391084: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-06 11:28:24.391185: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-06 11:28:24.392285: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
2017-11-06 11:28:37.875632: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server
Traceback (most recent call last):
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
    self._extend_graph()
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1599, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1026, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "worker.py", line 426, in <module>
    main()
  File "worker.py", line 418, in main
    run(args, server)
  File "worker.py", line 174, in run
    sess.run(trainer.sync)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/app/sbtt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Trying to connect an http1.x server

tensorflow grpc

Angus393 · Nov 06 '17 at 18:02

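I'm not sure what is going on here. To get more detail out of TensorFlow, which uses gRPC, you can turn on gRPC's own logging with these environment variables: GRPC_TRACE=all GRPC_VERBOSITY=DEBUG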
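The same variables can also be set from inside the Python script, as long as that happens before TensorFlow (and with it the bundled gRPC library) is imported. Below is a minimal sketch of that idea; the helper name enable_grpc_debug_logging and its parameters are hypothetical and are not part of TensorFlow or gRPC.

```python
import os


def enable_grpc_debug_logging(env=None, trace=None):
    """Set gRPC's logging variables in `env` (defaults to os.environ).

    Hypothetical helper: call it before `import tensorflow` so the
    gRPC library sees the variables when it initializes.
    """
    env = os.environ if env is None else env
    env["GRPC_VERBOSITY"] = "DEBUG"
    if trace is not None:
        # For example trace="all"; leave unset to keep the output smaller.
        env["GRPC_TRACE"] = trace
    return env


if __name__ == "__main__":
    enable_grpc_debug_logging()
    # import tensorflow as tf  # import only after the variables are set
```

Setting the variables in the shell with export has the same effect; doing it in the script is only a convenience when the worker is started from an IDE or a debugger.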

Noah Eisen · Nov 08 '17 at 1:44

Angus393 · Accepted Answer · 2017-11-08T22:52:48+0000

@NoahEisen, thanks. After setting

export GRPC_VERBOSITY="DEBUG"

I get a more detailed log:

E1108 17:37:57.085195825   17711 ev_epoll1_linux.c:1051]     grpc epoll fd: 5
D1108 17:37:57.085309439   17711 ev_posix.c:111]             Using polling engine: epoll1
D1108 17:37:57.085380147   17711 dns_resolver.c:301]         Using native dns resolver
I1108 17:37:57.085819333   17711 socket_utils_common_posix.c:223] Disabling AF_INET6 sockets because ::1 is not available.
I1108 17:37:57.086001584   17711 tcp_server_posix.c:322]     Failed to add :: listener, the environment may not support IPv6: {"created":"@1510180677.085876868","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":256,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12223"}
2017-11-08 17:37:57.092525: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
2017-11-08 17:37:57.092648: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12223}
2017-11-08 17:37:57.093435: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12223
D1108 17:38:02.607109518   17830 http_proxy.c:70]            userinfo found in proxy URI
I1108 17:38:02.611335569   17807 http_connect_handshaker.c:304] Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
2017-11-08 17:38:02.617814: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: Trying to connect an http1.x server

The extra output helps. The interesting part is how the connection to IP 127.0.0.1 / localhost is made, i.e. this line:

Connecting to server 127.0.0.1:12222 via HTTP proxy ipv4:xx.xx.xx.xx:xxxx
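A line like this is easy to miss in a long debug log. The sketch below scans captured log text for such lines and reports which targets were reached through a proxy. The regular expression is only modeled on the single line quoted above, and the function name find_proxied_connections is hypothetical; other gRPC versions may word the message differently.

```python
import re

# Pattern modeled on the log line quoted above (an assumption, not a
# documented gRPC log format).
PROXY_LINE = re.compile(
    r"Connecting to server (?P<target>\S+) via HTTP proxy (?P<proxy>\S+)"
)


def find_proxied_connections(log_text):
    """Return (target, proxy) pairs for connections routed via a proxy."""
    return [
        (m.group("target"), m.group("proxy"))
        for m in PROXY_LINE.finditer(log_text)
    ]
```

If a loopback target such as 127.0.0.1:12222 shows up in the result, the worker is not talking to the parameter server directly.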

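It appears that the proxy settings are picked up by the process running python. Whether ps is given as "localhost" or as the IP 127.0.0.1, it seems that in TF1.4 the connection to localhost is routed through the proxy (which speaks HTTP1.x, hence the error).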
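One way to keep local traffic away from the proxy is to adjust the proxy-related environment variables before TensorFlow is imported. The sketch below is an illustration under assumptions, not a confirmed fix: it assumes the bundled gRPC reads http_proxy and honors no_proxy, and the helper name bypass_proxy_for_local is hypothetical. If no_proxy turns out not to be honored, unset_proxy=True removes the proxy variables for this process instead.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")
LOCAL_HOSTS = ("localhost", "127.0.0.1")


def bypass_proxy_for_local(env=None, unset_proxy=False):
    """Keep loopback connections off the HTTP proxy (hypothetical helper).

    With unset_proxy=False, localhost and 127.0.0.1 are appended to
    no_proxy. With unset_proxy=True, the proxy variables are removed.
    """
    env = os.environ if env is None else env
    if unset_proxy:
        for name in PROXY_VARS:
            env.pop(name, None)
        return env
    # Preserve any hosts that are already excluded from the proxy.
    hosts = [h.strip() for h in env.get("no_proxy", "").split(",") if h.strip()]
    for host in LOCAL_HOSTS:
        if host not in hosts:
            hosts.append(host)
    env["no_proxy"] = ",".join(hosts)
    return env


if __name__ == "__main__":
    bypass_proxy_for_local()
    # import tensorflow as tf  # import only after the environment is adjusted
```

Changing the variables only for the training process leaves the proxy configuration of the rest of the machine untouched.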

@PeteWaren - did something change in grpc? Is localhost = 127.0.0.1 handled differently now? Something differs here between TF1.3 and TF1.4.
