I'm running a Rails 3.2.21 application and deploying it to an Ubuntu 12.04.5 box using Capistrano (with nginx and Unicorn).
I have the app set up for zero-downtime deployment (or so I thought), with my configuration files looking more or less like these.
Here's the problem: when the deployment is almost complete and Unicorn is restarted, I can see in my unicorn.log that it starts the new workers and reaps the old ones... but then my application just freezes for 2-3 minutes. Any request made during that window hits the timeout (which I set to 40 seconds) and returns a 500 application error page.
Here is the first part of the unicorn.log output when Unicorn restarts (I have 5 Unicorn workers):
    I, [2015-04-21T23:06:57.022492 #14347] INFO -- : master process ready
    I, [2015-04-21T23:06:57.844273 #15378] INFO -- : worker=0 ready
    I, [2015-04-21T23:06:57.944080 #15381] INFO -- : worker=1 ready
    I, [2015-04-21T23:06:58.089655 #15390] INFO -- : worker=2 ready
    I, [2015-04-21T23:06:58.230554 #14541] INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
    I, [2015-04-21T23:06:58.231455 #14541] INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
    I, [2015-04-21T23:06:58.249110 #15393] INFO -- : worker=3 ready
    I, [2015-04-21T23:06:58.650007 #15396] INFO -- : worker=4 ready
    I, [2015-04-21T23:07:01.246981 #14541] INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
    I, [2015-04-21T23:07:01.561786 #14541] INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
    I, [2015-04-21T23:07:06.657913 #14541] INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
    I, [2015-04-21T23:07:06.658325 #14541] INFO -- : master complete
Then, while the application hangs for those 2-3 minutes, this is what happens:
    E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
    E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
    E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
    E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
    I, [2015-04-21T23:07:40.727755 #16002] INFO -- : worker=0 ready
    I, [2015-04-21T23:07:43.212395 #16022] INFO -- : worker=3 ready
    E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
    E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
    I, [2015-04-21T23:08:28.010429 #16234] INFO -- : worker=3 ready
Finally, after 2-3 minutes, the application starts responding again, though sluggishly at first. You can see it very clearly in New Relic (the line marks the deployment, and the light blue area is time spent in Ruby):

I have an identical staging server, and I can't reproduce the problem on staging... granted, there's no load there (I'm the only one making requests to the pages).
Here is my config/unicorn.rb file:
root = "/home/deployer/apps/myawesomeapp/current" working_directory root pid "#{root}/tmp/pids/unicorn.pid" stderr_path "#{root}/log/unicorn.log" stdout_path "#{root}/log/unicorn.log" shared_path = "/home/deployer/apps/myawesomeapp/shared" listen "/tmp/unicorn.myawesomeapp.sock" worker_processes 5 timeout 40 preload_app true before_exec do |server| ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile" end before_fork do |server, worker| if defined?(ActiveRecord::Base) ActiveRecord::Base.connection.disconnect! end old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin" if File.exists?(old_pid) && server.pid != old_pid begin Process.kill("QUIT", File.read(old_pid).to_i) rescue Errno::ENOENT, Errno::ESRCH end end end after_fork do |server, worker| if defined?(ActiveRecord::Base) ActiveRecord::Base.establish_connection end end
And just to paint the whole picture, here is the Unicorn restart task in my Capistrano deploy.rb:
    namespace :deploy do
      task :restart, roles: :app, except: { no_release: true } do
        run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
      end
    end
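For clarity, that task just sends USR2 to the running Unicorn master, which (as I understand it) makes it re-exec itself as a new master and spawn fresh workers. In Ruby terms it boils down to something like this (pid path hard-coded from unicorn.rb above, purely for illustration):

    # Equivalent of the capistrano task, run on the server itself
    pid = File.read("/home/deployer/apps/myawesomeapp/current/tmp/pids/unicorn.pid").to_i
    Process.kill("USR2", pid)  # ask the old master to re-exec a new one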
Any ideas why the Unicorn workers time out right after the deployment? I thought the whole point of zero downtime was to keep the old workers around until the new ones are up and ready to serve requests?
Thanks!
UPDATE
I did another deployment and this time watched production.log to see what was going on there. The only suspicious thing was the following lines, mixed in with the regular requests:
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
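This makes me wonder whether the memcached connection also needs to be re-established around the fork, the same way ActiveRecord is handled. Something like this in after_fork is what I have in mind (just a sketch, assuming Rails.cache is the Dalli-backed store, which exposes a reset method):

    after_fork do |server, worker|
      if defined?(ActiveRecord::Base)
        ActiveRecord::Base.establish_connection
      end

      # Sketch: drop the memcached sockets inherited from the master so each
      # worker opens (and authenticates) its own Dalli connection.
      Rails.cache.reset if Rails.cache.respond_to?(:reset)
    end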
UPDATE # 2
As suggested in some of the answers below, I changed the before_fork block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU, so that the old workers are killed off one by one instead of all at once.
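For reference, the whole before_fork block now looks like this (same as above, with the gradual kill added):

    before_fork do |server, worker|
      if defined?(ActiveRecord::Base)
        ActiveRecord::Base.connection.disconnect!
      end

      old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
      if File.exists?(old_pid) && server.pid != old_pid
        begin
          # TTOU asks the old master to drop one worker; once the last new
          # worker boots, QUIT shuts the old master down completely.
          sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
          Process.kill(sig, File.read(old_pid).to_i)
        rescue Errno::ENOENT, Errno::ESRCH
        end
      end
    end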
Same result: a terribly slow deployment, with the same spike as in the graph above. Just for context, out of my 5 workers, the first 4 sent a TTOU signal and the fifth sent QUIT. Still, it doesn't seem to have made any difference.