Docker containers stop talking to each other after a while

I have 6 containers running together in Docker: Kafka + Zookeeper, MongoDB, A, B, C and an interface. The interface is the main access point from the outside world: only this container publishes a port (5683). The interface container connects to A, B and C during startup. I use a docker-compose file with docker stack deploy; each service has a name that the interface uses as a hostname. Everything starts successfully and works great. After some time (20 minutes, an hour, ...) I can no longer make requests to the interface. The interface receives my requests, but the application has lost its connections to services A, B, C, or all of them. If I restart the interface, it can connect to A, B and C again.

At first I thought it was an application problem, so I opened 2 extra ports for each service (interface, A, B, C) and connected to them with a profiler and a debugger. The application works fine: no leaks, no blocked threads, just running normally and waiting for connections. The debugger showed me that when I make a request to the interface and the interface tries to call service A, a "connection reset by peer" exception is thrown.
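For illustration, this is roughly where such a reset surfaces in a Netty client pipeline (a minimal hypothetical sketch; the handler name and the logging are mine, not from the original code base):

```java
import java.io.IOException;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class ConnectionResetLogger extends ChannelInboundHandlerAdapter {

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        if (cause instanceof IOException) {
            // "Connection reset by peer" lands here: the remote end
            // (or, as it turned out later, something in between)
            // answered the write with an RST.
            System.err.println("Lost connection to " + ctx.channel().remoteAddress()
                    + ": " + cause.getMessage());
        }
        ctx.close();
    }
}
```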

During this debugging I learned some interesting things. I attached the debugger to the interface when the services started, and after a while the debugger got disconnected. On top of that, I could not reconnect it until I made a request to the container → application. The problem: the handshake was not completed.

Another interesting thing I found was a situation where I could not make any request to the interface at all. So I used Wireshark to find out what was going on: the SYN, ACK exchange was fine. Then the application sends some data and the interface answers with FIN, ACK. I assume the same happens when the interface tries to call service A, and that connection also ends with a FIN. The interface, A, B and C share the same codebase with respect to the Netty server.

Finally, I do not think this is an application problem. Why? I tried deploying the containers not as swarm services: I started each container separately, published the ports of each one, and set the service endpoints to localhost (no overlay network). And it works; the containers run without problems. I should also add what I did not say at first: the Java applications (interface, A, B, C) work without problems when they run as standalone applications rather than in Docker.

Could you help me figure out what the problem might be? Why does Docker close sockets when the overlay network is used?

I am using the latest Docker; I have tried older versions too.

1 answer

Finally, I was able to solve the problem.

To recap what happens: the interface opens persistent TCP connections to A, B and C. When services A, B, C run as standalone Java applications, everything works. When we package them into containers and start the swarm, it works for only a few minutes. The strange part was that the connection between the interface and another service turned out to be broken exactly at the moment a client made a request to the interface.

After many unsuccessful tests and a lot of debugging of each container, I tried launching each Docker container separately, with its ports published, and specified localhost as the endpoint (each container publishes its ports and the interface connects to localhost). Funny thing: it works. Containers started this way use a different network driver, the bridge driver. If you run them in a swarm, the overlay network driver is used.

So it had to be something in the Docker network, not in the application itself. The next step was to run tcpdump in each container a couple of minutes after the point where it should stop working. The result was very interesting:

  • Client → Interface (OK, the request is accepted)
  • Interface → A (the request is forwarded because it belongs to A)
    • Interface → A [POST]
    • A → Interface [RST]

A reset an open TCP connection after a couple of minutes without traffic. Why?

Docker uses IP Virtual Server (IPVS), and IPVS maintains its own connection table. The default timeout for CLOSE_WAIT connections in the IPVS table is 60 seconds. So when the server sends something after 60 seconds, the IPVS connection entry is no longer there, the packet looks invalid for a new TCP session, and it gets an RST. On the client side the connection stays in FIN_WAIT2 forever, because the application still holds the socket open; the kernel's fin_wait timer only applies to orphaned TCP sockets.
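As a side note (this is not what the author did), another common way to stay under such an idle timeout without touching the application protocol is TCP keep-alive with a short idle time. A minimal sketch, assuming Netty 4.1.24+ with NIO channels and JDK 11+ on Linux; the 30- and 10-second values are arbitrary choices below the suspected 60-second limit:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioChannelOption;
import io.netty.channel.socket.nio.NioSocketChannel;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveClient {

    // Configures a client Bootstrap whose connections send TCP keep-alive
    // probes after 30 s of idle time, well before the idle entry would expire.
    public static Bootstrap configure(EventLoopGroup group) {
        return new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .option(ChannelOption.SO_KEEPALIVE, true)
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 30)
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 10)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // application handlers for the interface -> A/B/C connection go here
                    }
                });
    }
}
```

The keep-alive idle time can also be lowered host-wide via the net.ipv4.tcp_keepalive_time sysctl, but that only affects sockets that have SO_KEEPALIVE enabled in the first place.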

This is what I read about it and how I understand it. I am not sure that my explanation of the problem is correct, but based on these assumptions I implemented a ping-pong between the interface and services A, B, C whenever there is no other traffic, with an interval of less than 60 seconds. And it works.
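For illustration, a minimal sketch of such a ping-pong using Netty's IdleStateHandler. The handler, the 45-second interval and the PING payload are my own assumptions for the sketch, not the author's actual code; the real protocol between the interface and A, B, C would define its own ping frame.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import io.netty.util.CharsetUtil;

public class KeepAlivePingHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.WRITER_IDLE) {
            // Nothing was written for 45 s: send a small ping so the connection
            // never sits idle long enough for its IPVS entry to expire.
            // The payload is a placeholder for the protocol's real ping frame.
            ctx.writeAndFlush(Unpooled.copiedBuffer("PING", CharsetUtil.UTF_8));
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }

    // Installed on the client side of the interface -> A/B/C connection,
    // e.g. inside the ChannelInitializer.
    public static void install(ChannelPipeline pipeline) {
        pipeline.addLast(new IdleStateHandler(0, 45, 0)); // fire WRITER_IDLE after 45 s
        pipeline.addLast(new KeepAlivePingHandler());
    }
}
```

The services on the other side would answer with a pong (or simply discard the ping), which is enough to keep traffic flowing inside the 60-second window mentioned above.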


Source: https://habr.com/ru/post/1266525/

