I'm running a Rails 3.2.21 application and deploying it to an Ubuntu 12.04.5 box using Capistrano (with nginx and Unicorn).
I have the app set up for zero-downtime deployment (or so I thought), with my configuration files looking more or less like these.
Here's the problem: when the deployment is almost complete and Unicorn is restarted, I can see in my unicorn.log that it starts the new workers and reaps the old ones... but then my application just freezes for 2-3 minutes. Any request made during that window hits the timeout (which I set to 40 seconds) and returns a 500 application error page.
Here is the first part of the unicorn.log output when Unicorn restarts (I have 5 Unicorn workers):
    I, [2015-04-21T23:06:57.022492 #14347] INFO -- : master process ready
    I, [2015-04-21T23:06:57.844273 #15378] INFO -- : worker=0 ready
    I, [2015-04-21T23:06:57.944080 #15381] INFO -- : worker=1 ready
    I, [2015-04-21T23:06:58.089655 #15390] INFO -- : worker=2 ready
    I, [2015-04-21T23:06:58.230554 #14541] INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
    I, [2015-04-21T23:06:58.231455 #14541] INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
    I, [2015-04-21T23:06:58.249110 #15393] INFO -- : worker=3 ready
    I, [2015-04-21T23:06:58.650007 #15396] INFO -- : worker=4 ready
    I, [2015-04-21T23:07:01.246981 #14541] INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
    I, [2015-04-21T23:07:01.561786 #14541] INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
    I, [2015-04-21T23:07:06.657913 #14541] INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
    I, [2015-04-21T23:07:06.658325 #14541] INFO -- : master complete
Then, while the application hangs for those 2-3 minutes, this is what happens:
    E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
    E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
    E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
    E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
    I, [2015-04-21T23:07:40.727755 #16002] INFO -- : worker=0 ready
    I, [2015-04-21T23:07:43.212395 #16022] INFO -- : worker=3 ready
    E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
    E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
    I, [2015-04-21T23:08:28.010429 #16234] INFO -- : worker=3 ready
Finally, after 2-3 minutes, the application starts responding again, though sluggishly at first. You can see it very clearly in New Relic (the line marks the deployment, and the light blue area is time spent in Ruby):

I have an identical staging server, and I can't reproduce the problem on staging... granted, there's no load there (I'm the only one making requests to the pages).
Here is my config/unicorn.rb file:
root = "/home/deployer/apps/myawesomeapp/current" working_directory root pid "#{root}/tmp/pids/unicorn.pid" stderr_path "#{root}/log/unicorn.log" stdout_path "#{root}/log/unicorn.log" shared_path = "/home/deployer/apps/myawesomeapp/shared" listen "/tmp/unicorn.myawesomeapp.sock" worker_processes 5 timeout 40 preload_app true before_exec do |server| ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile" end before_fork do |server, worker| if defined?(ActiveRecord::Base) ActiveRecord::Base.connection.disconnect! end old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin" if File.exists?(old_pid) && server.pid != old_pid begin Process.kill("QUIT", File.read(old_pid).to_i) rescue Errno::ENOENT, Errno::ESRCH end end end after_fork do |server, worker| if defined?(ActiveRecord::Base) ActiveRecord::Base.establish_connection end end
And just to paint the whole picture, here is the Unicorn restart task in my Capistrano deploy.rb:
    namespace :deploy do
      task :restart, roles: :app, except: { no_release: true } do
        run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
      end
    end
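For clarity, that task just sends USR2 to the running Unicorn master, which (as I understand it) makes it re-exec itself as a new master and spawn fresh workers. In Ruby terms it boils down to something like this (pid path hard-coded from unicorn.rb above, purely for illustration):

    # Equivalent of the capistrano task, run on the server itself
    pid = File.read("/home/deployer/apps/myawesomeapp/current/tmp/pids/unicorn.pid").to_i
    Process.kill("USR2", pid)  # ask the old master to re-exec a new one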
Any ideas why the Unicorn workers time out right after the deployment? I thought the whole point of zero downtime was to keep the old workers around until the new ones are up and ready to serve requests?
Thanks!
UPDATE
I did another deployment and this time watched production.log to see what was going on there. The only suspicious thing was the following lines, mixed in with the regular requests:
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
    Dalli/SASL authenticating as 7510de
    Dalli/SASL: 7510de
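This makes me wonder whether the memcached connection also needs to be re-established around the fork, the same way ActiveRecord is handled. Something like this in after_fork is what I have in mind (just a sketch, assuming Rails.cache is the Dalli-backed store, which exposes a reset method):

    after_fork do |server, worker|
      if defined?(ActiveRecord::Base)
        ActiveRecord::Base.establish_connection
      end

      # Sketch: drop the memcached sockets inherited from the master so each
      # worker opens (and authenticates) its own Dalli connection.
      Rails.cache.reset if Rails.cache.respond_to?(:reset)
    end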
UPDATE # 2
As suggested in some of the answers below, I changed the before_fork block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU, so that the old workers are killed off one by one instead of all at once.
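For reference, the whole before_fork block now looks like this (same as above, with the gradual kill added):

    before_fork do |server, worker|
      if defined?(ActiveRecord::Base)
        ActiveRecord::Base.connection.disconnect!
      end

      old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
      if File.exists?(old_pid) && server.pid != old_pid
        begin
          # TTOU asks the old master to drop one worker; once the last new
          # worker boots, QUIT shuts the old master down completely.
          sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
          Process.kill(sig, File.read(old_pid).to_i)
        rescue Errno::ENOENT, Errno::ESRCH
        end
      end
    end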
Same result: a terribly slow deployment, with the same spike as in the graph above. Just for context, out of my 5 workers, the first 4 sent a TTOU signal and the fifth sent QUIT. Still, it doesn't seem to have made any difference.