Symptoms
Cluster nodes with a 2nd generation AMD EPYC (Rome) CPU and a Mellanox NIC reboot unexpectedly.
The kernel log (viewable with the dmesg command) contains messages such as:
AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x002d address=0x00000000be6ca980 flags=0x0020]
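To check a node for this symptom, the kernel log can be filtered for the AMD-Vi page-fault events. The following is a minimal sketch; the function name find_iommu_faults is hypothetical, and the pattern assumes the message format shown above.

```shell
# Print AMD-Vi IO_PAGE_FAULT events from a kernel log read on stdin.
# The pattern is based on the sample message above; other kernel versions
# may format the event differently.
find_iommu_faults() {
    grep -E 'AMD-Vi: Event logged \[IO_PAGE_FAULT'
}

# Usage on a live node (reading dmesg may require root):
#   dmesg | find_iommu_faults
# The device= field is a PCI address; "lspci -s 41:00.0" shows which
# device it belongs to (41:00.0 is taken from the sample message).
```

If the filter prints nothing, the node has logged no such events since boot.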
Cause
Known issue specific to this configuration. See the Troubleshooting section of this Mellanox community article.
Solution
- Enable SR-IOV in BIOS.
- Set iommu=pt on the Linux kernel command line through GRUB:
- In /etc/default/grub, add the kernel parameter iommu=pt to the GRUB_CMDLINE_LINUX string:
GRUB_CMDLINE_LINUX="<YOUR_PARAMS> iommu=pt"
For example:
Before: GRUB_CMDLINE_LINUX="crashkernel=auto tcache.enabled=0 rd.md.uuid=93606373:d5569557:322f4641:13d6fab3 rd.md.uuid=c0b44f6a:1efde5fe:51aace30:4627c299 rd.md.uuid=d8db1339:2fb46769:61385b6b:ba385aa7 quiet"
After: GRUB_CMDLINE_LINUX="crashkernel=auto tcache.enabled=0 rd.md.uuid=93606373:d5569557:322f4641:13d6fab3 rd.md.uuid=c0b44f6a:1efde5fe:51aace30:4627c299 rd.md.uuid=d8db1339:2fb46769:61385b6b:ba385aa7 quiet iommu=pt"
- Run:
grub2-mkconfig -o /boot/grub2/grub.cfg
This is the default location of grub.cfg; the path is different on EFI systems or if the user has changed it.
- If the alert appears again (for example, after a cluster update) and the steps above have already been completed, it can be safely ignored: there is no way to check whether the settings above were applied in the BIOS, so in this case the alert is only a reminder.
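
The GRUB edit described in the solution can be sketched as a small shell helper, together with a check of the running kernel's command line after reboot. This is an illustration under stated assumptions, not a vendor-provided tool: the function names add_iommu_pt and cmdline_has_iommu_pt are hypothetical, and the helper assumes GRUB_CMDLINE_LINUX is on a single line, enclosed in double quotes, with no escaped quotes inside the value.

```shell
# Append iommu=pt to GRUB_CMDLINE_LINUX in the given file unless it is
# already present. Defaults to /etc/default/grub; a backup of the
# original is written to <file>.bak before any change.
add_iommu_pt() {
    grub_file="${1:-/etc/default/grub}"
    if grep -Eq '^GRUB_CMDLINE_LINUX=.*[" ]iommu=pt[" ]' "$grub_file"; then
        return 0    # parameter already set, nothing to change
    fi
    cp -p "$grub_file" "$grub_file.bak" || return 1
    tmp=$(mktemp) || return 1
    # Insert " iommu=pt" just before the closing quote of the value.
    sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 iommu=pt"/' "$grub_file" > "$tmp" \
        && cat "$tmp" > "$grub_file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Succeed if the given kernel command line string contains iommu=pt
# as a separate, space-delimited parameter.
cmdline_has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a node (as root):
#   add_iommu_pt /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg    # adjust the path on EFI systems
#   # after the next reboot:
#   cmdline_has_iommu_pt "$(cat /proc/cmdline)" && echo "iommu=pt is active"
```

The check against /proc/cmdline only confirms the kernel parameter; it says nothing about the SR-IOV setting in the BIOS.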