Symptoms
If the hardware node is unresponsive, it is worth collecting some debug data and trying to restart it gracefully using the Alt+SysRq keys.
Note: Collecting debug data makes sense when a serial console (see Additional Information at the bottom of the article) or netconsole is attached to the server.
Resolution
- Check which sequences are allowed for your current kernel:
- Press Alt+SysRq+H to view available commands:
SysRq HELP : loglevel0-8 vsced_stAte reBoot Crashdump tErm Full debuG kIll thaw-filesystems(J) saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount shoWcpus
Note: These key sequences may differ slightly between kernel versions and Linux distributions, so it is strongly recommended to check the output of Alt+SysRq+H to see the available commands.
- In this output:
- Alt+SysRq+0 through Alt+SysRq+8 - Set the kernel log level, where 0 is the lowest and 8 is the highest logging verbosity. It is recommended to set the log level to 8 before executing other commands.
- Alt+SysRq+A - Show Vsched.
- Alt+SysRq+B - Reboot immediately without syncing or unmounting your disks. This may lead to file system corruption.
- Alt+SysRq+C - Trigger a crash.
- Alt+SysRq+E - Send the TERM signal to all running processes except init, asking them to exit.
- Alt+SysRq+F - Invoke the OOM (out-of-memory) killer manually.
- Alt+SysRq+G - Debug mode.
- Alt+SysRq+H - Show help message.
- Alt+SysRq+I - Send the KILL signal to all running processes except init, terminating them immediately.
- Alt+SysRq+J - Emergency Thaw of all frozen filesystems.
- Alt+SysRq+K - Kill all processes (including X) which are running on the currently active virtual console. This key combination is known as the "Secure Attention Key" (SAK).
- Alt+SysRq+M - Show Memory.
- Alt+SysRq+N - Nice All RT Tasks.
- Alt+SysRq+O - Shut the system off. Without a preceding sync and unmount, this may lead to file system corruption.
- Alt+SysRq+P - Dump the current registers and flags.
- Alt+SysRq+Q - Quit from debugging mode.
- Alt+SysRq+R - Turn off keyboard raw mode.
- Alt+SysRq+S - Run an emergency sync (cache write) on all mounted filesystems. This can prevent data loss.
- Alt+SysRq+T - Dump a list of current tasks and their information.
- Alt+SysRq+U - Remount all mounted filesystems as read-only.
- Alt+SysRq+W - Show CPUs.
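Which of the commands listed above the keyboard may invoke is also governed by the value of /proc/sys/kernel/sysrq. The following is a minimal sketch that decodes that value, assuming the bitmask meanings described in the upstream Linux kernel documentation; the kernel shipped on your node may differ. The function name decode_sysrq_mask is hypothetical and used for illustration only.

```shell
# Sketch: decode a kernel.sysrq value into the function groups it enables.
# Bit meanings are assumed from the upstream kernel documentation.
# Usage: decode_sysrq_mask VALUE
decode_sysrq_mask() {
    mask=$1
    # Reject empty or non-numeric input.
    case $mask in
        ''|*[!0-9]*) echo "usage: decode_sysrq_mask <number>" >&2; return 2 ;;
    esac
    if [ "$mask" -eq 0 ]; then echo "all SysRq functions disabled"; return 0; fi
    if [ "$mask" -eq 1 ]; then echo "all SysRq functions enabled"; return 0; fi
    # Any other value is treated as a bitmask; print each enabled group.
    while read -r bit name; do
        if [ $((mask & bit)) -ne 0 ]; then echo "$name"; fi
    done <<'EOF'
2 console log level control
4 keyboard control (SAK, unraw)
8 debugging dumps of processes
16 emergency sync
32 remount read-only
64 signalling of processes (term, kill, oom-kill)
128 reboot and poweroff
256 nicing of all RT tasks
EOF
}

# Example (run on the node itself):
# decode_sysrq_mask "$(cat /proc/sys/kernel/sysrq)"
```

A value of 0 or 1 is reported as "all disabled" or "all enabled"; anything else is printed one enabled group per line.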
- Try to collect all debug data:
- Dump CPU information by pressing Alt+SysRq+W. You should get output similar to the following on the serial console:
Mar 8 10:07:57 pvcfl46 SysRq: Show CPUs
Mar 8 10:07:57 pvcfl46 requested on CPU0:
Mar 8 10:07:57 pvcfl46 CPU0:
Mar 8 10:07:57 pvcfl46 ffff810037d69dd0 0000000000000000 ffffffff80344648 0000000000000000
Mar 8 10:07:57 pvcfl46 ffffffff801aef3f 0000000000000000 ffffffff801aef2d ffffffff801aef97
Mar 8 10:07:57 pvcfl46 ffffffff801aef2d ffffffff8009463e ffffffff80344640 ffff81007ffabdc0
Mar 8 10:07:57 pvcfl46 Call Trace:
Mar 8 10:07:57 pvcfl46 [] showacpu+0x0/0x65
Mar 8 10:07:57 pvcfl46 [] sysrq_showregs_othercpus+0x0/0x12
Mar 8 10:07:57 pvcfl46 [] showacpu+0x58/0x65
Mar 8 10:07:57 pvcfl46 [] sysrq_showregs_othercpus+0x0/0x12
Mar 8 10:07:57 pvcfl46 [] on_each_cpu+0x19/0x22
Mar 8 10:07:57 pvcfl46 [] run_workqueue+0x94/0xe5
Mar 8 10:07:57 pvcfl46 [] worker_thread+0x0/0x122
Mar 8 10:07:57 pvcfl46 [] worker_thread+0xf0/0x122
Mar 8 10:07:57 pvcfl46 [] default_wake_function+0x0/0xe
Mar 8 10:07:57 pvcfl46 [] kthread+0xfe/0x132
Mar 8 10:07:57 pvcfl46 [] child_rip+0xa/0x11
Mar 8 10:07:57 pvcfl46 [] kthread+0x0/0x132
Mar 8 10:07:57 pvcfl46 [] child_rip+0x0/0x11
- Dump memory information by pressing Alt+SysRq+M. It is worth doing this 2-3 times.
- Dump the state of all registers by pressing Alt+SysRq+P. It is worth doing this 2-3 times.
NOTE: This operation can bring PSBM virtual machines down, so it makes sense to use it ONLY when a serial console is attached.
- Dump information about all tasks by pressing Alt+SysRq+T. It is worth doing this 2-3 times.
NOTE: This operation can bring PSBM virtual machines down, so it makes sense to use it ONLY when a serial console is attached.
- Dump Vsched information by pressing Alt+SysRq+A.
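If a root shell still responds, the dumps described above can be requested without the keyboard by writing the same characters to /proc/sysrq-trigger. The following is a minimal sketch under that assumption; the function name collect_debug_dumps and its parameters are hypothetical, and the Vsched dump is left out because it is specific to this kernel.

```shell
# Sketch: trigger the debug dumps described above through procfs.
# Usage: collect_debug_dumps [REPEAT] [DELAY_SECONDS] [TRIGGER_FILE]
# Must be run as root when TRIGGER_FILE is the real /proc/sysrq-trigger.
collect_debug_dumps() {
    repeat=${1:-3}
    delay=${2:-1}
    trigger=${3:-/proc/sysrq-trigger}

    # 8 = most verbose log level, w = CPU information (each sent once).
    for key in 8 w; do
        printf '%s' "$key" > "$trigger" || return 1
        echo "sent SysRq '$key'"
    done

    # m = memory, p = registers, t = tasks (each sent REPEAT times).
    # WARNING: as noted above, p and t can bring PSBM virtual machines down.
    for key in m p t; do
        i=0
        while [ "$i" -lt "$repeat" ]; do
            printf '%s' "$key" > "$trigger" || return 1
            echo "sent SysRq '$key'"
            sleep "$delay"
            i=$((i + 1))
        done
    done
}
```

The function only reports which characters were sent; the dumps themselves appear in the kernel log, so they are useful only when the serial console or netconsole is capturing it.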
- Now try to reboot the server safely:
- Sync all filesystems by pressing Alt+SysRq+S. It should produce output similar to the following:
Mar 8 10:31:58 pvcfl46 SysRq: Emergency Sync
Mar 8 10:31:58 pvcfl46 Emergency Sync complete
- Try to remount all filesystems read-only by pressing Alt+SysRq+U. It should produce output similar to the following:
Mar 8 10:33:30 pvcfl46 SysRq: Emergency Remount R/O
Mar 8 10:33:30 pvcfl46 Emergency Remount complete
- Try to kill all processes by pressing Alt+SysRq+I. It should produce output similar to the following:
Mar 8 10:36:13 pvcfl46 SysRq: Kill All Tasks
Mar 8 10:36:13 pvcfl46 CT: 1: stopped
- Try to remount all filesystems read-only once again by pressing Alt+SysRq+U. It should produce output similar to the following:
Mar 8 10:33:30 pvcfl46 SysRq: Emergency Remount R/O
Mar 8 10:33:30 pvcfl46 Emergency Remount complete
- Try to reboot the server by pressing Alt+SysRq+B.
- If the system appears to be hung, it is also worth trying to trigger a crash dump by pressing Alt+SysRq+C. However, this is useful only if kernel crash dumps are configured properly (see Additional Information below).
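When a root shell is still available, the safe reboot steps above can be approximated by writing the same characters to /proc/sysrq-trigger. The sketch below assumes that approach and deliberately omits the kill-all step (i), because that command would also kill the shell running the sequence. The function name safe_reboot_sequence and its parameters are hypothetical.

```shell
# Sketch: sync, remount read-only, then reboot, via procfs.
# Usage: safe_reboot_sequence [TRIGGER_FILE] [DELAY_SECONDS]
# Must be run as root when TRIGGER_FILE is the real /proc/sysrq-trigger.
safe_reboot_sequence() {
    trigger=${1:-/proc/sysrq-trigger}
    delay=${2:-5}

    # s = emergency sync, u = remount read-only, b = immediate reboot.
    # The pause gives the kernel time to finish each step.
    for key in s u b; do
        printf '%s' "$key" > "$trigger" || return 1
        echo "sent SysRq '$key'"
        sleep "$delay"
    done
}
```

On a real node nothing runs after the final b, since the kernel reboots immediately. Pointing the first argument at an ordinary file gives a harmless dry run.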
- Instead of pressing an Alt+SysRq key combination, it is possible to trigger the same command by writing the corresponding character to /proc/sysrq-trigger:
~# echo 1 > /proc/sys/kernel/sysrq
~# echo h > /proc/sysrq-trigger
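The two commands can be wrapped in a small helper that rejects anything other than a single command character before touching the kernel interfaces. This is only a sketch: the name sysrq_trigger is hypothetical, and the optional path arguments exist purely so the helper can be tried against ordinary files first.

```shell
# Sketch: enable SysRq and send one command character via procfs.
# Usage: sysrq_trigger KEY [ENABLE_FILE] [TRIGGER_FILE]
# Must be run as root when the default /proc paths are used.
sysrq_trigger() {
    key=$1
    enable_file=${2:-/proc/sys/kernel/sysrq}
    trigger_file=${3:-/proc/sysrq-trigger}

    # Accept a single lowercase letter or digit, as used in this article.
    case $key in
        [a-z0-9]) ;;
        *) echo "usage: sysrq_trigger <single letter or digit>" >&2; return 2 ;;
    esac

    # Same two steps as the commands above.
    echo 1 > "$enable_file" || return 1
    echo "$key" > "$trigger_file"
}

# Example: print the SysRq help line to the kernel log.
# sysrq_trigger h
```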
Additional information
For more information about attaching a serial console to the server, please refer to this article:
Article #10041 [Legacy] How to set up a serial console to a Linux server
For more information about kernel crash dump configuration, please refer to this article:
Article #10044 How to configure kernel crash dumps on a Linux server.