Kernel AMD GPU disaster fixed

May 17, 2024 — BarryK

I posted about getting a black screen when the 'amdgpu' kernel module loads, for the 5.15.150 and later kernels:

https://bkhome.org/news/202405/kernel-515150-disaster-for-amd-gpu.html

There are three guys who did amdgpu commits between 5.15.149 and 5.15.150, so I sent an email to them explaining the problem. One of those guys (Armin Wolf) responded, giving me basic instructions on how to use "git bisect" to identify which commit caused the problem. I did that; here is a summary of the steps:

# git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
# cd linux-stable
# git tag -l | grep '5\.15\.150'
v5.15.150
# git checkout -b my5.15.150 v5.15.150
Updating files: 100% (65776/65776), done.
Switched to a new branch 'my5.15.150'
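
Before bisecting, it can help to see which commits between the two releases touch the driver directory at all. A minimal sketch, assuming the clone above; the function name `commits_touching` is hypothetical, while `git log` with a revision range and a path filter is standard git:

```shell
#!/bin/sh
# Hypothetical helper: list commits in the range good..bad that touch a path,
# one per line as "<short hash> <author> <subject>".
commits_touching() {
    good="$1"; bad="$2"; path="$3"
    git log --format='%h %an %s' "$good..$bad" -- "$path"
}

# Intended use inside the linux-stable clone (not run here):
#   commits_touching v5.15.149 v5.15.150 drivers/gpu/drm/amd
```

The same path filter is what limits the bisect to commits under that directory.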

Now for the bisecting. Using my '.config' file...

# make menuconfig
# git bisect start -- drivers/gpu/drm/amd
# git bisect bad
# git bisect good v5.15.149
Bisecting: 1 revision left to test after this (roughly 1 step)
[b9a61ee2bb2704e42516e3da962f99dfa98f3b20] drm/amdgpu: reset gpu for s3 suspend abort case
# make
# rm -rf /boot2
# mkdir -p /boot2/lib/modules
# make INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=/boot2 modules_install
# cp arch/x86/boot/bzImage /boot2/vmlinuz

I copied the kernel and modules onto a QV usb-stick and booted it on the laptop; works!
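
The build-and-stage commands above repeat on every bisect round, so they could be wrapped in a small function. This is only a sketch: the name `stage_kernel` and the `DRY_RUN` switch are made up for illustration, and the commands themselves are the ones shown above. With `DRY_RUN=1` the commands are printed instead of executed, which is a safe way to check them first.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the kernel at the current bisect step and
# stage it under a directory (default /boot2, as used above).
# With DRY_RUN=1 the commands are only printed, not executed.
stage_kernel() {
    dest="${1:-/boot2}"
    # Refuse a root destination before any rm -rf.
    case "$dest" in
        "/") echo "refusing to use '$dest'" >&2; return 1 ;;
    esac
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    run make &&
    run rm -rf "$dest" &&
    run mkdir -p "$dest/lib/modules" &&
    run make INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH="$dest" modules_install &&
    run cp arch/x86/boot/bzImage "$dest/vmlinuz"
}

# Preview the commands without touching the system.
DRY_RUN=1
stage_kernel /boot2
```

Setting `DRY_RUN=0` and calling `stage_kernel` from the top of the kernel tree would run the real build.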

# git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[56b522f4668167096a50c39446d6263c96219f5f] drm/amdgpu: init iommu after amdkfd device init
# make
# rm -rf /boot2
# mkdir -p /boot2/lib/modules
# make INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=/boot2 modules_install
# cp arch/x86/boot/bzImage /boot2/vmlinuz

Same thing, tested on QV usb-stick; black screen!

# git bisect bad
56b522f4668167096a50c39446d6263c96219f5f is the first bad commit
commit 56b522f4668167096a50c39446d6263c96219f5f
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date: Tue Sep 28 15:42:35 2021 +0800

drm/amdgpu: init iommu after amdkfd device init

[ Upstream commit 286826d7d976e7646b09149d9bc2899d74ff962b ]

This patch is to fix clinfo failure in Raven/Picasso:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.2 AMD-APP (3364.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback

Platform Name: AMD Accelerated Parallel Processing Number of devices: 0

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

That's it, that's the bad commit.
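
For anyone who wants to try the bisect mechanics without a kernel tree, here is a toy sketch on a throwaway repository. Everything in it (the file `driver.txt`, the commit messages, the "BROKEN" marker) is invented for the demonstration. It uses `git bisect run` with a scripted test, which was not possible in my case because each step had to be booted on the laptop by hand.

```shell
#!/bin/sh
# Toy demonstration of git bisect on a throwaway repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Bisect Demo"
git config user.email "demo@example.invalid"

# Five commits; commit 4 introduces the "bug" (the word BROKEN).
for i in 1 2 3 4 5; do
    if [ "$i" -ge 4 ]; then
        echo "BROKEN $i" > driver.txt
    else
        echo "ok $i" > driver.txt
    fi
    git add driver.txt
    git commit -q -m "commit $i"
done

# Same sequence as above: start (limited to a path), mark bad, mark good.
git bisect start -- driver.txt
git bisect bad
git bisect good HEAD~4

# The test command exits 0 for a good revision and 1 for a bad one.
git bisect run sh -c '! grep -q BROKEN driver.txt'

# refs/bisect/bad now points at the first bad commit.
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad_subject"

# Show the recorded steps, then return to the original branch.
git bisect log
git bisect reset >/dev/null 2>&1
```

In a real session, `git bisect log` gives a record that can be attached to a bug report, and `git bisect reset` puts the tree back where it started.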

On May 9, I sent this result to those three guys, then waited until yesterday, 7 days, with no reply.

So, yesterday I created a patch that reverts Yifan Zhang's commit, and compiled the 5.15.158 kernel. Success on my laptop, confirming that this commit is the culprit. This morning I sent that patch to those guys.

Here is my reverting patch, quite small:

diff -Naur linux-5.15ORIG/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c linux-5.15/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
--- linux-5.15ORIG/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2024-05-17 03:14:28.813312020 +0800
+++ linux-5.15/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2024-05-17 03:21:24.446637456 +0800
@@ -2487,6 +2487,10 @@
 	if (r)
 		goto init_failed;

+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		goto init_failed;
+
 	r = amdgpu_device_ip_hw_init_phase1(adev);
 	if (r)
 		goto init_failed;
@@ -2525,10 +2529,6 @@
 	if (!adev->gmc.xgmi.pending_reset)
 		amdgpu_amdkfd_device_init(adev);

-	r = amdgpu_amdkfd_resume_iommu(adev);
-	if (r)
-		goto init_failed;
-
 	amdgpu_fru_get_product_info(adev);

 init_failed:
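
A patch in this form is produced by running `diff -Naur` over an untouched copy of the tree and the edited one, and it is applied from the top of a source tree with `patch -p1`. The following is a toy sketch of that round trip; the directory names and the file `src/init.c` are placeholders, not kernel files.

```shell
#!/bin/sh
# Toy demonstration: make a patch with diff -Naur, apply it with patch -p1.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p treeORIG/src tree/src
printf 'step_a\nstep_b\nstep_c\n' > treeORIG/src/init.c
printf 'step_b\nstep_a\nstep_c\n' > tree/src/init.c

# diff exits with status 1 when the trees differ, which is expected here.
diff -Naur treeORIG tree > revert.patch || true

# Apply to a fresh copy of the original tree. -p1 strips the leading
# "treeORIG/" and "tree/" path components from the patch headers.
cp -r treeORIG target
(
    cd target || exit 1
    patch -p1 --dry-run < ../revert.patch &&   # check it applies cleanly
    patch -p1 < ../revert.patch                # then apply for real
)
cat target/src/init.c
```

The dry run first is a cheap way to find out whether the patch still applies to a newer kernel version before changing any files.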

So, hopefully they will take it on board and fix it. Here is the offending commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.15.y&id=56b522f4668167096a50c39446d6263c96219f5f

EDIT 2024-05-24:
Thanks to kernel developer Armin Wolf, who took this on board. He has submitted a request that the offending commit be reverted, which will now presumably happen. See the "dri-devel" mail list:

https://lore.kernel.org/dri-devel/20240523173031.4212-1-W_Armin@gmx.de/T/#u

Tags: easy