Symptoms:
The nvidia-vgpud and nvidia-vgpu-mgr systemd services do not work correctly after upgrading VHI to version 6.0.
The following error messages are observed in their statuses:
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: error: failed to allocate client: 59
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: error: failed to read pGPU information: 9
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: error: failed to send vGPU configuration info to RM: 9
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: PID file unlocked.
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: PID file closed.
Oct 19 16:40:20 gpu01.vstoragedomain nvidia-vgpud[1615]: Shutdown (1615)
Oct 19 16:40:20 gpu01.vstoragedomain systemd[1]: nvidia-vgpud.service: Main process exited, code=exited, status=9/n/a
Oct 19 16:40:20 gpu01.vstoragedomain systemd[1]: nvidia-vgpud.service: Failed with result 'exit-code'.

Oct 19 16:40:19 gpu01.vstoragedomain nvidia-vgpu-mgr[1612]: error: vmiop_env_log: Failed to initialize RM client: 0x59
Oct 19 16:40:20 gpu01.vstoragedomain systemd[1]: nvidia-vgpu-mgr.service: Deactivated successfully.
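The failure can also be confirmed on an affected node with the standard systemd tooling, for example:

[root@gpu01 ~]# systemctl status nvidia-vgpud nvidia-vgpu-mgr
[root@gpu01 ~]# journalctl -b -u nvidia-vgpud -u nvidia-vgpu-mgr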
Cause:
These services cannot work unless the Nvidia kernel modules are loaded.
After the upgrade to VHI 6.0 and a node reboot, the dkms service rebuilds the Nvidia modules on boot.
Because the Nvidia services have no dependency on dkms, they can start before the rebuild finishes and fail, as the corresponding module is not loaded yet.
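To check whether this is the case on a node, compare the dkms build state with the loaded modules: if dkms status reports the Nvidia module as installed but lsmod shows nothing, the services started too early. For example:

[root@gpu01 ~]# dkms status
[root@gpu01 ~]# lsmod | grep -i nvidia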
Resolution:
Create the following systemd drop-in files on all nodes where the Nvidia Virtual GPU drivers are installed:
[root@gpu02 ~]# cat /etc/systemd/system/nvidia-vgpu-mgr.service.d/dkms.conf
[Unit]
After=dkms.service
Wants=dkms.service

[root@gpu02 ~]# cat /etc/systemd/system/nvidia-vgpud.service.d/dkms.conf
[Unit]
After=dkms.service
Wants=dkms.service
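One possible way to create both drop-ins in a single step (the drop-in directories may not exist yet, so they are created first):

[root@gpu02 ~]# for unit in nvidia-vgpud nvidia-vgpu-mgr; do
>     mkdir -p /etc/systemd/system/${unit}.service.d
>     printf '[Unit]\nAfter=dkms.service\nWants=dkms.service\n' > /etc/systemd/system/${unit}.service.d/dkms.conf
> done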
Then reload the systemd configuration and restart the nvidia-vgpud and nvidia-vgpu-mgr services so that the drop-ins take effect.
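For example:

[root@gpu02 ~]# systemctl daemon-reload
[root@gpu02 ~]# systemctl restart nvidia-vgpud nvidia-vgpu-mgr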
The resolution can be applied to the cluster either proactively, before the upgrade, or reactively, after the upgrade has already been performed.
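To verify that the new ordering dependency is in place, the effective unit properties can be inspected; dkms.service should appear in both After= and Wants=:

[root@gpu02 ~]# systemctl show nvidia-vgpud -p After -p Wants
[root@gpu02 ~]# systemctl show nvidia-vgpu-mgr -p After -p Wants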