Update: I am adding this excellent resource to this answer: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ Socket programming is complicated, so check out the links in that post too.
None of the answers here seem accurate or useful. The OP is not looking for BSD socket programming information. He is trying to figure out how to robustly handle failures of accept()ed client sockets on a ZMQ REP socket, so that the server doesn't hang or crash.
As already noted, the problem is complicated by the fact that ZMQ pretends the listen() server socket and the accept()ed sockets are one and the same, and nowhere does the documentation describe how to set basic timeouts on such sockets.
My answer:
After much digging through the code, the only relevant socket options passed down to accept()ed sockets appear to be the keep-alive settings inherited from the parent listen()er. So the solution is to set the following options on the listening socket before calling send or recv:
void zmq_setup(zmq::context_t** context, zmq::socket_t** socket, const char* endpoint)
{
    // Free old references.
    if (*socket != NULL)
    {
        (**socket).close();
        delete *socket;
    }

    if (*context != NULL)
    {
        // Shutdown all previous server client-sockets.
        // (The context_t destructor terminates the underlying context.)
        delete *context;
    }

    *context = new zmq::context_t(1);
    *socket = new zmq::socket_t(**context, ZMQ_REP);

    // Enable TCP keep alive.
    int is_tcp_keep_alive = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE, &is_tcp_keep_alive, sizeof(is_tcp_keep_alive));

    // Only send 2 probes to check if the client is still alive.
    int tcp_probe_no = 2;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_CNT, &tcp_probe_no, sizeof(tcp_probe_no));

    // How long a connection needs to be "idle" for, in seconds.
    int tcp_idle_timeout = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_IDLE, &tcp_idle_timeout, sizeof(tcp_idle_timeout));

    // Time in seconds between individual keep alive probes.
    int tcp_probe_interval = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_INTVL, &tcp_probe_interval, sizeof(tcp_probe_interval));

    // Discard pending buffered messages on close.
    int is_linger = 0;
    (**socket).setsockopt(ZMQ_LINGER, &is_linger, sizeof(is_linger));

    // TCP user timeout on unacknowledged send buffer.
    int is_user_timeout = 2;
    (**socket).setsockopt(ZMQ_TCP_MAXRT, &is_user_timeout, sizeof(is_user_timeout));

    // Start internal enclave event server.
    printf("Host: Starting enclave event server\n");
    (**socket).bind(endpoint);
}
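For completeness, calling this helper could look something like the sketch below. The endpoint string is just an example; note that both pointers must start out NULL, since the function inspects them to decide whether old references need freeing:

zmq::context_t* context = NULL;
zmq::socket_t* socket = NULL;

// Bind the REP socket with the aggressive keep-alive settings above.
zmq_setup(&context, &socket, "tcp://*:5555");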
This makes the operating system aggressively check the client socket for timeouts and quickly reap dead connections when the client fails to answer the keep-alive probes in time. As a result, the operating system sends SIGPIPE back to your program and socket errors bubble up out of send/recv, which unsticks a hung server. You then need to do two more things:
1. Handle SIGPIPE errors so that the program does not crash
#include <signal.h>

// Somewhere early in main(): ignore SIGPIPE so a broken socket
// surfaces as an error from send/recv instead of killing the process.
signal(SIGPIPE, SIG_IGN);
2. Check for -1 returned by send or recv, and catch ZMQ errors:
// E.g. skip broken accept()ed sockets (pseudo-code).
while (1)
{
    zmq::message_t request;
    try
    {
        if (!(*socket).recv(&request))
            throw -1;
    }
    catch (...)
    {
        // Prevent any endless error loop from killing the CPU.
        sleep(1);

        // Reset the ZMQ state machine.
        try
        {
            zmq::message_t blank_reply = zmq::message_t();
            (*socket).send(blank_reply);
        }
        catch (...) {}

        continue;
    }

    // ... process the request and send a real reply here ...
}
Notice the strange code that tries to send a reply when a socket fails? In ZMQ, the server's REP "socket" is an endpoint to another program creating a REQ socket to this server. The result is that if you go through a recv on the REP socket with a hung client, the server will get stuck in a broken receive loop where it waits forever to receive a valid reply.
Attempting to send a reply anyway forces an update of the state machine for that peer: ZMQ detects that the socket is broken and removes it from its queue. The server socket becomes unstuck, and the next recv call returns a new client from the queue.
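You could additionally cap how long the blocking calls themselves wait with ZMQ_RCVTIMEO and ZMQ_SNDTIMEO. These are not set in my zmq_setup above; this is just a sketch of options that could sit alongside the others, assuming a 3 second budget:

// Bound blocking send/recv to 3000 ms, so recv returns false
// (EAGAIN) instead of hanging forever on a dead peer.
int timeout_ms = 3000;
(**socket).setsockopt(ZMQ_RCVTIMEO, &timeout_ms, sizeof(timeout_ms));
(**socket).setsockopt(ZMQ_SNDTIMEO, &timeout_ms, sizeof(timeout_ms));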
To enable timeouts on an asynchronous client (in Python 3), the code would look something like this:
import asyncio
import zmq
import zmq.asyncio

ctx = zmq.asyncio.Context()

@asyncio.coroutine
def req(endpoint):
    ms = 2000  # Timeout budget in milliseconds.
    sock = ctx.socket(zmq.REQ)

    # Give up on blocked sends/receives after `ms` milliseconds.
    sock.setsockopt(zmq.SNDTIMEO, ms)
    sock.setsockopt(zmq.RCVTIMEO, ms)

    # Allow the strict REQ state machine to recover from timeouts.
    sock.setsockopt(zmq.REQ_CORRELATE, 1)
    sock.setsockopt(zmq.REQ_RELAXED, 1)

    # Discard pending buffered messages on close().
    sock.setsockopt(zmq.LINGER, ms)

    # Connections don't strictly happen here:
    # ZMQ connects lazily when the socket is first used.
    sock.connect(endpoint)

    # Send some bytes and wait for the reply.
    yield from sock.send(b"some bytes")
    msg = yield from sock.recv()
    return msg.decode("utf-8")
Now you have multiple failure scenarios covered for when things go wrong.
By the way, if anyone is interested: the default TCP idle timeout on Linux (net.ipv4.tcp_keepalive_time) is 7200 seconds, or 2 hours. So without these settings you would be waiting a very long time for a hung server to do anything!
Disclaimer:
I tested this code and it seems to work, but ZMQ does complicate testing this because the client reconnects on failure. If someone wants to use this solution in production, I recommend writing some basic unit tests first.
The server code could also be greatly improved with threading or polling so that multiple clients can be handled concurrently. As it stands, a malicious client can temporarily tie up server resources (the roughly 3 second keep-alive timeout), which is not ideal.
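For instance, a poll-based receive loop might look something like the sketch below. Here handle_request is a hypothetical helper that processes the message and sends the REP reply (with REP sockets, every recv must be answered by a send before the next recv):

// Wait at most 1000 ms for input instead of blocking forever in recv.
zmq::pollitem_t items[] = {
    { (void*)(*socket), 0, ZMQ_POLLIN, 0 }
};

while (1)
{
    zmq::poll(items, 1, 1000);

    if (items[0].revents & ZMQ_POLLIN)
    {
        zmq::message_t request;
        if ((*socket).recv(&request))
            handle_request(*socket, request); // Hypothetical: must send() a reply.
    }

    // Poll timed out: the loop stays responsive and can do
    // housekeeping or service other sockets here.
}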