Hey all,
We’ve observed issues with our CC-Ubuntu24.04 image crashing/hanging on nodes with AMD MI100 GPUs, and this appears to be an upstream issue with the amdgpu driver.
See: build variant of ubuntu24.04 with hwe kernel by msherman64 · Pull Request #14 · ChameleonCloud/CC-Images · GitHub
It does appear that the 6.11 kernel shipped for Ubuntu as linux-image-hwe-24.04 fixes the issue, and we’re considering making this 6.11 variant the “default” kernel on CC-Ubuntu24.04.
Any thoughts/comments/concerns?
The image would look/behave like the one from this PR:
ChameleonCloud:main ← ChameleonCloud:ubuntu/hwe_611 (opened 11:27PM - 19 Feb 25 UTC)
We discovered that the 6.8 kernels that ship with ubuntu24.04 (at least 6.8.0-47 … - 6.8.0-53) panic in the amdgpu kernel module when an AMD MI100 GPU is present on the system. The panic shows up in the dmesg logs at boot, and it also causes an infinite hang on shutdown/reboot, because udev fails as well.
This bug appears to be fixed in 6.10, see:
https://lore.kernel.org/dri-devel/20240413213708.3427038-1-alexander.deucher@amd.com/
and we can get 6.11 by installing ubuntu's hwe kernel.
TODO: this fix causes both linux-image-generic AND linux-image-hwe-24.04 to be installed, due to the inclusion of linux-image-generic under package-installs in our "ancestor" element, "ubuntu". The only real penalty is an extra 300 MB of disk space, but we should fix this before releasing to users.
It can currently be tried (as beta!) via image UUID 1672070f-5b6f-480d-a67b-19a377fa5466 on CHI@TACC.
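
If it helps anyone check which side of this a node is on, here is a rough Python sketch that classifies a kernel release string against the versions mentioned above. The function names are made up for illustration, and the version bounds are only the ones quoted in the PR (panic observed on 6.8.0-47 through 6.8.0-53, fix reported upstream in 6.10). It is a heuristic on the version string alone, so it says nothing about kernels outside those bounds or about distro backports.

```python
import re

# Assumptions taken from the PR description, not from any official advisory:
# the panic was observed on 6.8.0-47 through 6.8.0-53, and the fix appears
# to have landed upstream in 6.10.
OBSERVED_BAD_MIN = (6, 8, 0, 47)
OBSERVED_BAD_MAX = (6, 8, 0, 53)
FIXED_IN = (6, 10)


def parse_release(release):
    """Parse e.g. '6.8.0-53-generic' into (6, 8, 0, 53)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch, abi = m.groups()
    # A missing ABI number (e.g. a plain '6.10.0') is treated as 0.
    return int(major), int(minor), int(patch), int(abi or 0)


def has_upstream_fix(release):
    """True if the kernel is 6.10 or newer, where the fix is reported."""
    major, minor, _, _ = parse_release(release)
    return (major, minor) >= FIXED_IN


def in_observed_bad_range(release):
    """True if the kernel is one of the builds where we saw the panic."""
    return OBSERVED_BAD_MIN <= parse_release(release) <= OBSERVED_BAD_MAX


def describe(release):
    """Return a short, hedged verdict for a kernel release string."""
    if has_upstream_fix(release):
        return "likely fixed"
    if in_observed_bad_range(release):
        return "panic observed"
    return "unknown"
```

On a running node you could pass it `platform.release()` from the standard library, e.g. `describe(platform.release())`. Anything that comes back "unknown" would still need checking against dmesg on real MI100 hardware.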