Ceph OSD processes on Ubuntu suddenly enter state D


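Recently, all OSD processes on one server in our Ceph cluster went into state D (uninterruptible sleep), and the cluster reports every OSD on that node as down.

After logging in to the node over SSH, I found that ceph commands, and also ps, stop responding as soon as they are run and go into state D themselves.

However, the cloud instances (VMs) on this node are working normally; qemu and nova are both fine.

The syslog shows the following: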


<3>Jul 17 05:53:59 node-57 kernel: [1363616.943059] INFO: task kworker/28:1:1153 blocked for more than 120 seconds.
<3>Jul 17 05:53:59 node-57 kernel: [1363617.028596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:53:59 node-57 kernel: [1363617.124822] INFO: task kworker/28:2:3516 blocked for more than 120 seconds.
<3>Jul 17 05:53:59 node-57 kernel: [1363617.210440] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:53:59 node-57 kernel: [1363617.306594] INFO: task kworker/6:1:19327 blocked for more than 120 seconds.
<3>Jul 17 05:53:59 node-57 kernel: [1363617.392103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:54:00 node-57 kernel: [1363617.488554] INFO: task kworker/6:0:15106 blocked for more than 120 seconds.
<3>Jul 17 05:54:00 node-57 kernel: [1363617.574091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:54:00 node-57 kernel: [1363617.670239] INFO: task kworker/6:2:3208 blocked for more than 120 seconds.
<3>Jul 17 05:54:00 node-57 kernel: [1363617.754676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:56:00 node-57 kernel: [1363737.980036] INFO: task ceph-osd:12626 blocked for more than 120 seconds.
<3>Jul 17 05:56:00 node-57 kernel: [1363738.062468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:56:00 node-57 kernel: [1363738.158625] INFO: task ceph-osd:12637 blocked for more than 120 seconds.
<3>Jul 17 05:56:00 node-57 kernel: [1363738.240975] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:56:00 node-57 kernel: [1363738.337147] INFO: task ceph-osd:12646 blocked for more than 120 seconds.
<3>Jul 17 05:56:00 node-57 kernel: [1363738.419612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:56:01 node-57 kernel: [1363738.515859] INFO: task ceph-osd:13825 blocked for more than 120 seconds.
<3>Jul 17 05:56:01 node-57 kernel: [1363738.598301] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<3>Jul 17 05:56:01 node-57 kernel: [1363738.694489] INFO: task ceph-osd:13827 blocked for more than 120 seconds.
<3>Jul 17 05:56:01 node-57 kernel: [1363738.776982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
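
To get an overview of which tasks the kernel reported as hung, the syslog lines above can be summarized with a short script. This is only a sketch: the regular expression is written against the line format shown in this excerpt, and the function names `parse_hung_tasks` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches the hung-task report format seen in the syslog excerpt above, e.g.
# "INFO: task kworker/28:1:1153 blocked for more than 120 seconds."
# The greedy name group keeps colons inside names such as "kworker/28:1".
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_tasks(lines):
    """Return a list of (task_name, pid) for every hung-task report found."""
    found = []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            found.append((m.group("name"), int(m.group("pid"))))
    return found


def summarize(lines):
    """Count reports per base task name (text before the first '/')."""
    return Counter(name.split("/")[0] for name, _ in parse_hung_tasks(lines))


# Two lines copied from the excerpt, plus one non-matching line.
SAMPLE = [
    "<3>Jul 17 05:53:59 node-57 kernel: [1363616.943059] INFO: task kworker/28:1:1153 blocked for more than 120 seconds.",
    '<3>Jul 17 05:53:59 node-57 kernel: [1363617.028596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.',
    "<3>Jul 17 05:56:00 node-57 kernel: [1363737.980036] INFO: task ceph-osd:12626 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(name, count)
```

Run against the full log file instead of `SAMPLE`, this would show at a glance whether only kworker and ceph-osd tasks are affected and in what order they started to block.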

Looking further at dmesg, I found the following:


[1363738.694489] INFO: task ceph-osd:13827 blocked for more than 120 seconds.
[1363738.776982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1363738.872942] ceph-osd        D ffff883f225a8360     0 13827      1 0x00000000
[1363738.872948]  ffff883de61d3d30 0000000000000046 ffff883de61d3fd8 ffff883f7f0d4580
[1363738.872952]  ffff883de61d3fd8 ffff883de61d3fd8 ffff883de61d3fd8 0000000000014580
[1363738.872955]  ffff883f26018000 ffff883de61cc650 ffff883de61d3dc8 ffff883de61cc650
[1363738.872959] Call Trace:
[1363738.872972]  [<ffffffff81749499>] schedule+0x29/0x70
[1363738.872975]  [<ffffffff8174a20d>] rwsem_down_read_failed+0xad/0x100
[1363738.872984]  [<ffffffff810c8848>] ? futex_wait+0x108/0x210
[1363738.872990]  [<ffffffff81383534>] call_rwsem_down_read_failed+0x14/0x30
[1363738.872993]  [<ffffffff81747e84>] ? down_read+0x24/0x2b
[1363738.872997]  [<ffffffff8174ed92>] __do_page_fault+0x202/0x560
[1363738.873000]  [<ffffffff8174a8be>] ? _raw_spin_lock+0xe/0x20
[1363738.873003]  [<ffffffff810c8ab3>] ? futex_wake+0x113/0x130
[1363738.873005]  [<ffffffff810ca6f8>] ? do_futex+0xd8/0x1b0
[1363738.873037]  [<ffffffffa04f8784>] ? kvm_on_user_return+0x74/0x80 [kvm]
[1363738.873039]  [<ffffffff8174f127>] do_page_fault+0x37/0x70
[1363738.873043]  [<ffffffff8174b1d8>] page_fault+0x28/0x30

How should I go about analyzing this? Any pointers or suggestions would be appreciated.
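
Since ps hangs on this node, one possible way to list the tasks in state D is to read `/proc/<pid>/stat` directly. This is a minimal sketch with two assumptions: that reading `stat` does not block the way ps does here (not verified on this kernel), and that looking at process-level entries is enough (it does not walk individual threads under `/proc/<pid>/task`). The helper names `parse_stat_line` and `list_d_state` are hypothetical.

```python
import os


def parse_stat_line(text):
    """Parse the leading fields of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar].strip())
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def list_d_state(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes whose state is 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or unreadable entry
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in list_d_state():
        print(pid, comm)
```

Comparing this list with the tasks named in the hung-task reports would show whether anything besides ceph-osd and the kworker threads is blocked.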

Linux linux-kernel

青柠的味道 9 years, 3 months ago
