Why does copy_to_user take a few hundred milliseconds?

I have Linux servers running Apache Traffic Server, and I found that about 1 in 20 requests takes a few hundred milliseconds, much longer than expected (it should be 1-2 ms).

After tracing with SystemTap, I found that the time is spent in the kernel's copy_to_user() call (accept -> inet_csk_accept -> move_addr_to_user -> copy_to_user).

Server load is low (<100 qps), and copy_to_user() only copies 16 bytes (a struct sockaddr), yet it takes hundreds of milliseconds.

I am not familiar enough with SystemTap or kernel internals to investigate further. I have checked CPU usage and swap usage (see the free and top output below); both look fine.

Thanks for any advice.

Hardware:

  • CPU: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
  • memory: 64 GB
  • disk: 11 × 6 TB HDD

Software:

  • CentOS 6, kernel 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 GNU/Linux
  • Apache Traffic Server 5.3.2, configured with 10 GB of RAM
  • nginx 1.10.2, proxying user requests to the file server

    // systemtap script
    global ts

    probe kernel.function("move_addr_to_user") {
        ts["move_addr_to_user", pid()] = gettimeofday_ms()
    }

    probe kernel.function("move_addr_to_user").return {
        printf("[%d]move_addr_to_user done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(),
               gettimeofday_ms() - ts["move_addr_to_user", pid()])
    }

    probe kernel.function("copy_to_user") {
        ts["copy_to_user", pid()] = gettimeofday_ms()
    }

    probe kernel.function("copy_to_user").return {
        printf("[%d]copy_to_user done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(),
               gettimeofday_ms() - ts["copy_to_user", pid()])
    }

    // output:
    [1495630190767] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630191164]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630191164] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192172]copy_to_user done:24145 [ACCEPT 0:8080] 861
    [1495630192172] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 861
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495630192173]copy_to_user done:24145 [ACCEPT 0:8080] 0

    $ free -g
                 total       used       free     shared    buffers     cached
    Mem:            62         55          6          0          0         32
    -/+ buffers/cache:         23         39
    Swap:           15          0         15

    $ top
    top - 20:57:39 up 24 days, 19:26,  2 users,  load average: 7.70, 9.43, 9.62
    Tasks: 643 total,   1 running, 642 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.1%us,  1.0%sy,  0.0%ni, 97.5%id,  1.1%wa,  0.0%hi,  0.3%si,  0.0%st
    Mem:  65560992k total, 58525192k used,  7035800k free,   365084k buffers
    Swap: 16777212k total,        0k used, 16777212k free, 33957572k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    24145 traffics  20   0 21.7g  12g 6212 S 24.7 19.3 212:42.85 [ET_NET 0]
    22173 root      20   0  677m 325m 1180 S  3.6  0.5   0:41.10 nginx
    22161 root      20   0  677m 325m 1184 S  2.6  0.5   0:47.50 nginx
    22168 root      20   0  677m 326m 2076 S  2.6  0.5   0:28.31 nginx
    22150 root      20   0  677m 325m 1208 S  1.6  0.5   0:42.75 nginx
    22165 root      20   0  677m 325m 1200 S  1.6  0.5   0:31.77 nginx

Update:

@employee of the month, thank you for the advice. After adding some probes around __do_page_fault, I found the time is spent in __do_page_fault -> down_read(&mm->mmap_sem).
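
The added probes follow the same pattern as the script above, roughly like this (a sketch; it reuses the same global ts and, like the other probes, needs the kernel debuginfo), which produced the "do page fault done" lines in the output below:

    probe kernel.function("__do_page_fault") {
        ts["page_fault", pid()] = gettimeofday_ms()
    }

    probe kernel.function("__do_page_fault").return {
        printf("[%d]do page fault done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(),
               gettimeofday_ms() - ts["page_fault", pid()])
    }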

    [1495677639202]copy_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639202] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 1
    [1495677639314]copy_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677639314] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 0
    [1495677641329]do page fault done:24145 [ACCEPT 0:8080] 622
    [1495677641329]copy_to_user done:24145 [ACCEPT 0:8080] 622
    [1495677641329] 16 16 move_addr_to_user done:24145 [ACCEPT 0:8080] 622

@Ricardo Biehl Pasquali, thanks for the advice. Apache Traffic Server does its HDD reads/writes in separate threads: 1 thread accepts connections, and 88 threads (8 per HDD) do the blocking reads/writes (cache content). But I don't understand why blocking reads/writes in other threads can lead to such high __do_page_fault() latency. (iostat output below, followed by a probe sketch for the lock itself.)

    Device: rrqm/s wrqm/s   r/s  w/s   rsec/s  wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sda       0.00  16.00  0.00 4.00     0.00  160.00    40.00     0.05 13.50    0.00   13.50  7.75  3.10
    sdi       0.00   0.00  0.00 0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
    sde       0.00   0.00  0.00 0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
    sdd       0.00   0.00  0.00 0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
    sdj       0.00   0.00  6.00 0.00  4326.00    0.00   721.00     0.06 10.50   10.50    0.00  6.83  4.10
    sdc       0.00   0.00  2.00 0.00  1472.00    0.00   736.00     0.04 18.50   18.50    0.00  9.50  1.90
    sdh       0.00   0.00  0.00 0.00     0.00    0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00
    sdg       0.00   0.00 14.00 0.00 10464.00    0.00   747.43     0.04  2.71    2.71    0.00  1.79  2.50
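
One way to see what the faulting thread is actually sleeping on would be to probe the rwsem slow path, something like this (a sketch, not from the original post; it assumes rwsem_down_read_failed() is not inlined on this kernel and that debuginfo is available):

    // rwsem_down_read_failed() is only entered when down_read() has to
    // sleep, so every hit here is a contended semaphore (e.g. mmap_sem);
    // print_backtrace() shows the kernel path of the blocked thread.
    probe kernel.function("rwsem_down_read_failed") {
        printf("[%d] %s(%d) blocked in down_read\n",
               gettimeofday_ms(), execname(), tid())
        print_backtrace()
    }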

Update:

The problem is solved! The root cause is a periodic system() call, made from an ancient Traffic Server plugin we wrote, to back up a file. system() forks, and while fork() duplicates the process's address space the kernel holds mm->mmap_sem for write, so every page fault in the other threads (the down_read() above) stalls until the copy finishes. With a process this large, that takes hundreds of milliseconds.
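
In hindsight, a probe on the fork path would have pointed at the culprit directly. A sketch in the same style as the scripts above (it reuses the same global ts; dup_mmap() is the fork-time address-space copy that runs under the write-held mmap_sem, assuming it is visible in debuginfo and not inlined):

    probe kernel.function("dup_mmap") {
        ts["dup_mmap", pid()] = gettimeofday_ms()
    }

    probe kernel.function("dup_mmap").return {
        // long times here should line up with the stalled page faults:
        // the whole copy runs under down_write(&mm->mmap_sem)
        printf("[%d]dup_mmap done:%d %s %d\n", gettimeofday_ms(),
               pid(), execname(),
               gettimeofday_ms() - ts["dup_mmap", pid()])
    }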

1 answer

Well, one plausible explanation is that you are taking a page fault. One possible cause would be that the server forks/exits children as the load changes, and a freshly forked child takes page faults on first access; but as minor faults those should not take long unless you are short on memory.

The other possibility is memory pressure, i.e. you are actually swapping.

In any case, given that you have come this far, I see no reason not to dig deeper. You want to put a probe on __do_page_fault and take it from there, for example as sketched below.
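
A possible starting point (an untested sketch: it keys timestamps by tid() so concurrent faults in different threads don't collide, and reports only faults slower than 100 ms):

    global t

    probe kernel.function("__do_page_fault") {
        t[tid()] = gettimeofday_ms()
    }

    probe kernel.function("__do_page_fault").return {
        if (!(tid() in t)) next        # return without a matching entry
        d = gettimeofday_ms() - t[tid()]
        delete t[tid()]
        # report only page faults that stalled for more than 100 ms
        if (d > 100)
            printf("%s(%d) slow page fault: %d ms\n", execname(), tid(), d)
    }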

