Question:
How can we determine whether this problem was caused by our application or by the Google platform?
Steps taken:
- Checked the platform logs: no signs of a VM migration.
- Checked the Google Cloud Status Dashboard: no signs of outages.
Detailed description of the problem:
We lost network and disk I/O at approximately 08:37:40 UTC on Sunday, October 16th. Here is a summary (see the logs below for more details):
- [08:37:40]: Our application starts experiencing DNS errors.
- [08:37:43]: "sd 0:0:1:0: [sda] abort" is written to syslog.
- [08:37:43]: For more than 5 minutes, the kernel reports tasks blocked for more than 120 seconds.
- [08:43:10]: "sd 0:0:1:0: device reset" is written to syslog.
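- [08:43:10]: The Google daemons (google-ip-forwarding, google-accounts, google-clock-skew) log metadata request timeouts.
- [08:43:11]: Our application resumes logging.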
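To put a number on the stall, here is a minimal sketch that computes the time between the first "[sda] abort" and the "device reset" using the wall-clock timestamps from the summary. The variable names are only illustrative, and the year is taken from the log excerpts.

```python
from datetime import datetime

# Wall-clock timestamps (UTC) copied from the summary above.
FMT = "%Y-%m-%d %H:%M:%S"
first_abort = datetime.strptime("2016-10-16 08:37:43", FMT)
device_reset = datetime.strptime("2016-10-16 08:43:10", FMT)

# Length of the window between the first abort and the device reset.
stall = device_reset - first_abort
stall_seconds = int(stall.total_seconds())
minutes, seconds = divmod(stall_seconds, 60)
print(f"disk stall: {stall_seconds} s ({minutes} min {seconds} s)")
```

This gives 327 seconds (5 min 27 s), which matches the "more than 5 minutes" of blocked tasks.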
[Application log]
I|2016-10-16|08:37:09.271|ALM Finished processing alarms
W|2016-10-16|08:37:40.165|RC Exception: DNS error: Temporary DNS error while resolving: www.googleapis.com
W|2016-10-16|08:37:47.218|BP Exception: DNS error: Temporary DNS error while resolving: www.some-domain.com
I|2016-10-16|08:43:11.138|DB line 1127: HWMDatabase::virtual void HWMDatabase::run() - Elapsed: 357999
I|2016-10-16|08:43:11.149|PE.CON onTimeoutNotification 185.3.54.28:9161
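As a cross-check from the application side, the following sketch finds the longest silence between consecutive application log lines. It assumes the field layout is level|date|time|message, which is only inferred from the five lines above; the helper names are hypothetical.

```python
from datetime import datetime

# Lines copied from the application log above; the field layout
# (level|date|time|message) is inferred from these samples.
LOG = """\
I|2016-10-16|08:37:09.271|ALM Finished processing alarms
W|2016-10-16|08:37:40.165|RC Exception: DNS error: Temporary DNS error while resolving: www.googleapis.com
W|2016-10-16|08:37:47.218|BP Exception: DNS error: Temporary DNS error while resolving: www.some-domain.com
I|2016-10-16|08:43:11.138|DB line 1127: HWMDatabase::virtual void HWMDatabase::run() - Elapsed: 357999
I|2016-10-16|08:43:11.149|PE.CON onTimeoutNotification 185.3.54.28:9161
"""


def parse_time(line):
    """Return the timestamp of one log line as a datetime."""
    _level, date, time, _message = line.split("|", 3)
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap_seconds, start, end) for the longest silence between consecutive lines."""
    times = [parse_time(line) for line in lines]
    return max(
        ((later - earlier).total_seconds(), earlier, later)
        for earlier, later in zip(times, times[1:])
    )


gap, start, end = largest_gap(LOG.splitlines())
print(f"longest silence: {gap:.3f} s, from {start.time()} to {end.time()}")
```

For these lines the longest silence is about 323.9 seconds, from 08:37:47 to 08:43:11, so the application was stalled for roughly the same window as the disk.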
[Linux syslog]
Oct 16 08:37:43 hwm-node-1 kernel: [151118.601288] sd 0:0:1:0: [sda] abort
Oct 16 08:41:07 hwm-node-1 kernel: [151321.937381] INFO: task kworker/u4:1:29 blocked for more than 120 seconds.
...
Oct 16 08:41:07 hwm-node-1 kernel: [151322.089136] INFO: task jbd2/sda1-8:104 blocked for more than 120 seconds.
...
Oct 16 08:41:07 hwm-node-1 kernel: [151322.245617] INFO: task rs:main Q:Reg:414 blocked for more than 120 seconds.
...
Oct 16 08:41:07 hwm-node-1 kernel: [151322.481381] INFO: task hwm_master:7791 blocked for more than 120 seconds.
...
Oct 16 08:41:07 hwm-node-1 kernel: [151322.616600] INFO: task hwm_master:7802 blocked for more than 120 seconds.
...
Oct 16 08:41:07 hwm-node-1 kernel: [151322.861420] INFO: task cron:18904 blocked for more than 120 seconds.
...
Oct 16 08:41:08 hwm-node-1 kernel: [151323.051763] INFO: task cron:18905 blocked for more than 120 seconds.
...
Oct 16 08:42:53 hwm-node-1 kernel: [151428.634159] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.638435] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.642497] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.646611] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.650844] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.655165] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.659332] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.663459] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.667794] sd 0:0:1:0: [sda] abort
Oct 16 08:42:53 hwm-node-1 kernel: [151428.671939] sd 0:0:1:0: [sda] abort
Oct 16 08:43:08 hwm-node-1 kernel: [151443.169478] INFO: task jbd2/sda1-8:104 blocked for more than 120 seconds.
...
Oct 16 08:43:08 hwm-node-1 kernel: [151443.328262] INFO: task ntpd:393 blocked for more than 120 seconds.
...
Oct 16 08:43:08 hwm-node-1 kernel: [151443.527233] INFO: task rs:main Q:Reg:414 blocked for more than 120 seconds.
...
Oct 16 08:43:10 hwm-node-1 kernel: [151445.559469] sd 0:0:1:0: device reset
Oct 16 08:43:10 hwm-node-1 rsyslogd-2007: action 'action 18' suspended, next retry is Sun Oct 16 08:43:40 2016 [try http:
...
Oct 16 08:43:10 hwm-node-1 google-ip-forwarding: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012 metadata_key=metadata_key, recursive=recursive, wait=wait)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012 response = self._GetMetadataRequest(metadata_url, params=params)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012 response = func(*args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012 return request_opener.open(request, timeout=self.timeout*1.1)#012 File "/usr/lib/python2.7/urllib2.py", line 431, in open#012 response = self._open(req, data)#012 File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012 '_open', req)#012 File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012 result = func(*args)#012 File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012 return self.do_open(httplib.HTTPConnection, req)#012 File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012 r = h.getresponse(buffering=True)#012 File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012 response.begin()#012 File "/usr/lib/python2.7/httplib.py", line 444, in begin#012 version, status, reason = self._read_status()#012 File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012 line = self.fp.readline(_MAXLINE + 1)#012 File "/usr/lib/python2.7/socket.py", line 476, in readline#012 data = self._sock.recv(self._rbufsize)#012timeout: timed out
Oct 16 08:43:10 hwm-node-1 google-accounts: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012 metadata_key=metadata_key, recursive=recursive, wait=wait)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012 response = self._GetMetadataRequest(metadata_url, params=params)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012 response = func(*args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012 return request_opener.open(request, timeout=self.timeout*1.1)#012 File "/usr/lib/python2.7/urllib2.py", line 431, in open#012 response = self._open(req, data)#012 File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012 '_open', req)#012 File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012 result = func(*args)#012 File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012 return self.do_open(httplib.HTTPConnection, req)#012 File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012 r = h.getresponse(buffering=True)#012 File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012 response.begin()#012 File "/usr/lib/python2.7/httplib.py", line 444, in begin#012 version, status, reason = self._read_status()#012 File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012 line = self.fp.readline(_MAXLINE + 1)#012 File "/usr/lib/python2.7/socket.py", line 476, in readline#012 data = self._sock.recv(self._rbufsize)#012timeout: timed out
Oct 16 08:43:10 hwm-node-1 google-clock-skew: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012 metadata_key=metadata_key, recursive=recursive, wait=wait)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012 response = self._GetMetadataRequest(metadata_url, params=params)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012 response = func(*args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012 return request_opener.open(request, timeout=self.timeout*1.1)#012 File "/usr/lib/python2.7/urllib2.py", line 431, in open#012 response = self._open(req, data)#012 File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012 '_open', req)#012 File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012 result = func(*args)#012 File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012 return self.do_open(httplib.HTTPConnection, req)#012 File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012 r = h.getresponse(buffering=True)#012 File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012 response.begin()#012 File "/usr/lib/python2.7/httplib.py", line 444, in begin#012 version, status, reason = self._read_status()#012 File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012 line = self.fp.readline(_MAXLINE + 1)#012 File "/usr/lib/python2.7/socket.py", line 476, in readline#012 data = self._sock.recv(self._rbufsize)#012timeout: timed out
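The "#012" sequences in the three entries above appear to be the syslog daemon's escaping of newline characters (octal 012), which makes the tracebacks hard to read. Here is a small sketch for turning them back into multi-line text; the function name is hypothetical, and the sample string is a shortened stand-in for the real entries.

```python
import re


def decode_syslog_escapes(line):
    """Replace '#ooo' octal escapes (e.g. '#012' for newline) with the real character."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)


# Shortened sample in the same shape as the google-* daemon entries above.
sample = (
    "ERROR GET request error retrieving metadata."
    "#012Traceback (most recent call last):"
    "#012timeout: timed out"
)
decoded = decode_syslog_escapes(sample)
print(decoded)
```

Note that a message containing a literal "#" followed by three octal digits would also be rewritten, so this is only suitable for reading logs, not for lossless conversion.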