summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2023-11-10Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Daniel Vetter: "Dave's VPN to the big machine died, so it's on me to do fixes pr this and next week while everyone else is at plumbers. - big pile of amd fixes, but mostly for hw support newly added in 6.7 - i915 fixes, mostly minor things - qxl memory leak fix - vc4 uaf fix in mock helpers - syncobj fix for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE" * tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm: (78 commits) drm/amdgpu: fix error handling in amdgpu_vm_init drm/amdgpu: Fix possible null pointer dereference drm/amdgpu: move UVD and VCE sched entity init after sched init drm/amdgpu: move kfd_resume before the ip late init drm/amd: Explicitly check for GFXOFF to be enabled for s0ix drm/amdgpu: Change WREG32_RLC to WREG32_SOC15_RLC where inst != 0 (v2) drm/amdgpu: Use correct KIQ MEC engine for gfx9.4.3 (v5) drm/amdgpu: add smu v13.0.6 pcs xgmi ras error query support drm/amdgpu: fix software pci_unplug on some chips drm/amd/display: remove duplicated argument drm/amdgpu: correct mca debugfs dump reg list drm/amdgpu: correct acclerator check architecutre dump drm/amdgpu: add pcs xgmi v6.4.0 ras support drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3 drm/amdgpu: disable smu v13.0.6 mca debug mode by default drm/amdgpu: Support multiple error query modes drm/amdgpu: refine smu v13.0.6 mca dump driver drm/amdgpu: Do not program PF-only regs in hdp_v4_0.c under SRIOV (v2) drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOV drm: amd: Resolve Sphinx unexpected indentation warning ...
2023-11-10Merge tag 'amd-drm-next-6.7-2023-11-10' of ↵Daniel Vetter
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.7-2023-11-10: amdgpu: - SR-IOV fixes - DMCUB fixes - DCN3.5 fixes - DP2 fixes - SubVP fixes - SMU14 fixes - SDMA4.x fixes - Suspend/resume fixes - AGP regression fix - UAF fixes for some error cases - SMU 13.0.6 fixes - Documentation fixes - RAS fixes - Hotplug fixes - Scheduling entity ordering fix - GPUVM fixes Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231110190703.4741-1-alexander.deucher@amd.com
2023-11-10drm/amdgpu: fix error handling in amdgpu_vm_initChristian König
When clearing the root PD fails we need to properly release it again. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-10drm/amdgpu: Fix possible null pointer dereferenceFelix Kuehling
mem = bo->tbo.resource may be NULL in amdgpu_vm_bo_update. Fixes: 180253782038 ("drm/ttm: stop allocating dummy resources during BO creation") Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-10drm/amdgpu: move UVD and VCE sched entity init after sched initAlex Deucher
We need kernel scheduling entities to deal with handle clean up if apps are not cleaned up properly. With commit 56e449603f0ac5 ("drm/sched: Convert the GPU scheduler to variable number of run-queues") the scheduler entities have to be created after scheduler init, so change the ordering to fix this. v2: Leave logic in UVD and VCE code Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number of run-queues") Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: ltuikov89@gmail.com
2023-11-10drm/amdgpu: move kfd_resume before the ip late initTim Huang
The kfd_resume needs to touch GC registers to enable the interrupts, it needs to be done before GFXOFF is enabled to ensure that the GFX is not off and GC registers can be touched. So move kfd_resume before the amdgpu_device_ip_late_init which enables the CGPG/GFXOFF. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-10drm/amd: Explicitly check for GFXOFF to be enabled for s0ixMario Limonciello
If a user has disabled GFXOFF this may cause problems for the suspend sequence. Ensure that it is enabled in amdgpu_acpi_is_s0ix_active(). The system won't reach the deepest state but it also won't hang. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-10Merge tag 'drm-misc-fixes-2023-11-08' of ↵Daniel Vetter
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-fixes for v6.7-rc1: qxl: - qxl memory leak fix. syncobj: - Fix waiting for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE vc4: - Fix UAF in mock helpers Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> [sima: Stitch together both changelogs from Maarten. Also because of branch history this contains a few more bugfixes which are already in v6.6, but I didn't feel like this justifies some backmerge since there wasn't any real conflict.] Link: https://patchwork.freedesktop.org/patch/msgid/bc8598ee-d427-4616-8ebd-64107ab9a2d8@linux.intel.com
2023-11-09drm/amdgpu: Change WREG32_RLC to WREG32_SOC15_RLC where inst != 0 (v2)Victor Lu
W/RREG32_RLC is hardedcoded to use instance 0. W/RREG32_SOC15_RLC should be used instead when inst != 0. v2: rebase Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Use correct KIQ MEC engine for gfx9.4.3 (v5)Victor Lu
amdgpu_kiq_wreg/rreg is hardcoded to use MEC engine 0. Add an xcc_id parameter to amdgpu_kiq_wreg/rreg, define W/RREG32_XCC and amdgpu_device_xcc_wreg/rreg to use the new xcc_id parameter. Using amdgpu_sriov_runtime to determine whether to access via kiq or RLC is sufficient for now. v5: add condition in amdgpu_device_xcc_w/rreg, remove trace func call v4: avoid using amdgpu_sriov_w/rreg v3: use W/RREG32_XCC to handle non-kiq case v2: define amdgpu_device_xcc_wreg/rreg instead of changing parameters of amdgpu_device_wreg/rreg Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: add smu v13.0.6 pcs xgmi ras error query supportYang Wang
add pcs xgmi ras error query support for smu v13.0.6. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: fix software pci_unplug on some chipsVitaly Prosyak
When software 'pci unplug' using IGT is executed we got a sysfs directory entry is NULL for differant ras blocks like hdp, umc, etc. Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group' check that 'sd' is not NULL. [ +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90 [ +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 [ +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246 [ +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000 [ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000 [ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000 [ +0.000001] FS: 00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000 [ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0 [ +0.000001] Call Trace: [ +0.000001] <TASK> [ +0.000001] ? show_regs+0x72/0x90 [ +0.000002] ? sysfs_remove_group+0x83/0x90 [ +0.000002] ? __warn+0x8d/0x160 [ +0.000001] ? sysfs_remove_group+0x83/0x90 [ +0.000001] ? report_bug+0x1bb/0x1d0 [ +0.000003] ? handle_bug+0x46/0x90 [ +0.000001] ? exc_invalid_op+0x19/0x80 [ +0.000002] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000003] ? sysfs_remove_group+0x83/0x90 [ +0.000001] dpm_sysfs_remove+0x61/0x70 [ +0.000002] device_del+0xa3/0x3d0 [ +0.000002] ? ktime_get_mono_fast_ns+0x46/0xb0 [ +0.000002] device_unregister+0x18/0x70 [ +0.000001] i2c_del_adapter+0x26d/0x330 [ +0.000002] arcturus_i2c_control_fini+0x25/0x50 [amdgpu] [ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu] [ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu] [ +0.000186] ? mutex_lock+0x13/0x50 [ +0.000003] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ +0.000192] drm_minor_release+0x4f/0x80 [drm] [ +0.000025] drm_release+0xfe/0x150 [drm] [ +0.000027] __fput+0x9f/0x290 [ +0.000002] ____fput+0xe/0x20 [ +0.000002] task_work_run+0x61/0xa0 [ +0.000002] exit_to_user_mode_prepare+0x150/0x170 [ +0.000002] syscall_exit_to_user_mode+0x2a/0x50 Cc: Hawking Zhang <hawking.zhang@amd.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: correct mca debugfs dump reg listYang Wang
avoid driver to touch invalid mca reg. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: correct acclerator check architecutre dumpHawking Zhang
So driver doesn't touch invalid aca entries. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: add pcs xgmi v6.4.0 ras supportYang Wang
add pcs xgmi v6.4.0 ras support Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3David Yat Sin
Change local memory type to MTYPE_UC on revision id 0 Signed-off-by: David Yat Sin <David.YatSin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Support multiple error query modesHawking Zhang
Direct error query mode and firmware error query mode are supported for now. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: refine smu v13.0.6 mca dump driverYang Wang
refine smu mca driver to support query ras error from pmfw path. - correct gfx smu bank hwid (from mp5 to smu bank) - retire unused callback function in amdgpu_mca_smu_funcs{} - add new mca_bank_set{} structure to collect mca bank - move enum mca_reg_idx into amdgpu_mca.h header - add mca status register field decode macro Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Do not program PF-only regs in hdp_v4_0.c under SRIOV (v2)Victor Lu
The following regs can only be programmed by the PF: HDP_MISC_CNTL HDP_NONSURFACE_BASE HDP_NONSURFACE_BASE_HI v2: update commit message Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Samir Dhume <samir.dhume@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOVVictor Lu
PCTL0_MMHUB_DEEPSLEEP_IB is blocked for VF access Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Samir Dhume <samir.dhume@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: correct smu v13.0.6 umc ras error checkYang Wang
correct smu v13.0.0 umc ras error check Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Add xcc param to SRIOV kiq write and WREG32_SOC15_IP_NO_KIQ (v4)Victor Lu
WREG32/RREG32_SOC15_IP_NO_KIQ and amdgpu_virt_kiq_reg_write_reg_wait are not using the correct rlcg interface or mec engine, respectively. Add xcc instance parameter to them. v4: Use GET_INST and squash commit with: "drm/amdgpu: Add xcc_inst param to amdgpu_virt_kiq_reg_write_reg_wait" v3: xcc not needed for MMMHUB v2: rebase Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Add flag to enable indirect RLCG access for gfx v9.4.3Victor Lu
The "rlcg_reg_access_supported" flag is missing. Add it back in. Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: correct amdgpu ip block rev infoYang Wang
correct following amdgpu ip block version information: - gfx_v9_4_3 - sdma_v4_4_2 Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: Don't warn for unsupported set_xgmi_plpd_modeTao Zhou
set_xgmi_plpd_mode may be unsupported and this isn't error, no need to print warning for it. v2: add ret2 to save the status of psp_ras_trigger_error. Suggested-by: lijo.lazar@amd.com Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09drm/amdgpu: lower CS errors to debug severityChristian König
Otherwise userspace can spam the logs by using incorrect input values. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-09drm/amdgpu: fix error handling in amdgpu_bo_list_get()Christian König
We should not leak the pointer where we couldn't grab the reference on to the caller because it can be that the error handling still tries to put the reference then. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-09drm/amdgpu: fix AGP init orderAlex Deucher
The default AGP settings were overwriting the IP selected ones since the default was getting set after the IP ones were selected. Fixes: de59b69932e6 ("drm/amdgpu/gmc: set a default disable value for AGP") Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/100966.html Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
2023-11-07Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull more drm updates from Dave Airlie: "Geert pointed out I missed the renesas reworks in my main pull, so this pull contains the renesas next work for atomic conversion and DT support. It also contains a bunch of amdgpu and some small ssd13xx fixes. renesas: - atomic conversion - DT support ssd13xx: - dt binding fix for ssd132x - Initialize ssd130x crtc_state to NULL. amdgpu: - Fix RAS support check - RAS fixes - MES fixes - SMU13 fixes - Contiguous memory allocation fix - BACO fixes - GPU reset fixes - Min power limit fixes - GFX11 fixes - USB4/TB hotplug fixes - ARM regression fix - GFX9.4.3 fixes - KASAN/KCSAN stack size check fixes - SR-IOV fixes - SMU14 fixes - PSP13 fixes - Display blend fixes - Flexible array size fixes amdkfd: - GPUVM fix radeon: - Flexible array size fixes" * tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm: (83 commits) drm/amd/display: Enable fast update on blendTF change drm/amd/display: Fix blend LUT programming drm/amd/display: Program plane color setting correctly drm/amdgpu: Query and report boot status drm/amdgpu: Add psp v13 function to query boot status drm/amd/swsmu: remove fw version check in sw_init. drm/amd/swsmu: update smu v14_0_0 driver if and metrics table drm/amdgpu: Add C2PMSG_109/126 reg field shift/masks drm/amdgpu: Optimize the asic type fix code drm/amdgpu: fix GRBM read timeout when do mes_self_test drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count drm/amd/pm: only check sriov vf flag once when creating hwmon sysfs drm/amdgpu: Attach eviction fence on alloc drm/amdkfd: Improve amdgpu_vm_handle_moved drm/amd/display: Increase frame warning limit with KASAN or KCSAN in dml2 drm/amd/display: Avoid NULL dereference of timing generator drm/amdkfd: Update cache info for GFX 9.4.3 drm/amdkfd: Populate cache info for GFX 9.4.3 drm/amdgpu: don't put MQDs in VRAM on ARM | ARM64 drm/amdgpu/smu13: drop compute workload workaround ...
2023-11-07drm/amdgpu: add RAS reset/query operations for XGMI v6_4Tao Zhou
Reset/query RAS error status and count. v2: use XGMI IP version instead of WAFL version. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07drm/amdgpu: handle extra UE register entries for gfx v9_4_3Tao Zhou
The UE registe list is larger than CE list. Reported-by: yipeng.chai@amd.com Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07drm/amdgpu: Fix sdma 4.4.2 doorbell rptr/wptr initLijo Lazar
Doorbell rptr/wptr can be set through multiple ways including direct register initialization. Disable doorbell during hw_fini once the ring is disabled so that during next module reload direct initialization takes effect. Also, move the direct initialization after minor update is set to 1 since rptr/wptr are reinitialized back to 0 which could be lower than the previous doorbell value (ex: cases like module reload). Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Tested-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07drm/amdgpu/soc21: add mode2 asic reset for SMU IP v14.0.0Jiadong Zhu
Set the default reset method to mode2 for SMU IP v14.0.0 Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-07drm/amd: Disable XNACK on SRIOV environmentSurbhi Kakarya
The purpose of this patch is to disable XNACK or set XNACK OFF mode on SRIOV platform which doesn't support it. This will prevent user-space application to fail or result into unexpected behaviour whenever the application need to run test-case in XNACK ON mode. Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com> Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: Query and report boot statusHawking Zhang
Query boot status and report boot errors. A follow up change is needed to stop GPU initialization if boot fails. v2: only invoke the call for dGPU (Le/Lijo) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: Add psp v13 function to query boot statusHawking Zhang
Add psp v13 function to query boot status. v2: limit the use case to dGPU only (Lijo) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: Optimize the asic type fix codeMa Jun
Use a new struct array to define the asic information which asic type needs to be fixed. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: fix GRBM read timeout when do mes_self_testTim Huang
Use a proper MEID to make sure the CP_HQD_* and CP_GFX_HQD_* registers can be touched when initialize the compute and gfx mqd in mes_self_test. Otherwise, we expect no response from CP and an GRBM eventual timeout. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_countTao Zhou
Handle xgmi hive case. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: Attach eviction fence on allocFelix Kuehling
Instead of attaching the eviction fence when a KFD BO is first mapped, attach it when it is allocated or imported. This in preparation to allow KFD BOs to be mapped using the render node API. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdkfd: Improve amdgpu_vm_handle_movedFelix Kuehling
Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by the caller. This will be useful for handling extra BO VA mappings in KFD VMs that are managed through the render node API. v2: rebase against drm_exec changes (Alex) Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: don't put MQDs in VRAM on ARM | ARM64Alex Deucher
Issues were reported with commit 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") on an ADLINK Ampere Altra Developer Platform (AVA developer platform). Various ARM systems seem to have problems related to PCIe and MMIO access. In this case, I'm not sure if this is specific to the ADLINK platform or ARM in general. Seems to be some coherency issue with VRAM. For now, just don't put MQDs in VRAM on ARM. Link: https://lists.freedesktop.org/archives/amd-gfx/2023-October/100453.html Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: alexey.klimov@linaro.org
2023-11-03drm/amdgpu: add a retry for IP discovery initAlex Deucher
AMD dGPUs have integrated FW that runs as soon as the device gets power and initializes the board (determines the amount of memory, provides configuration details to the driver, etc.). For direct PCIe attached cards this happens as soon as power is applied and normally completes well before the OS has even started loading. However, with hotpluggable ports like USB4, the driver needs to wait for this to complete before initializing the device. This normally takes 60-100ms, but could take longer on some older boards periodically due to memory training. Retry for up to a second. In the non-hotplug case, there should be no change in behavior and this should complete on the first try. v2: adjust test criteria v3: adjust checks for the masks, only enable on removable devices v4: skip bif_fb_en check Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: ungate power gating when system suspendPerry Yuan
[Why] During suspend, if GFX DPM is enabled and GFXOFF feature is enabled the system may get hung. So, it is suggested to disable GFXOFF feature during suspend and enable it after resume. [How] Update the code to disable GFXOFF feature during suspend and enable it after resume. [ 311.396526] amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000 [ 311.396530] amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features! [ 311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62 Acked-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Perry Yuan <perry.yuan@amd.com> Signed-off-by: Kun Liu <kun.liu2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: don't use pci_is_thunderbolt_attached()Alex Deucher
It's only valid on Intel systems with the Intel VSEC. Use dev_is_removable() instead. This should do the right thing regardless of the platform. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: don't use ATRM for external devicesAlex Deucher
The ATRM ACPI method is for fetching the dGPU vbios rom image on laptops and all-in-one systems. It should not be used for external add in cards. If the dGPU is thunderbolt connected, don't try ATRM. v2: pci_is_thunderbolt_attached only works for Intel. Use pdev->external_facing instead. v3: dev_is_removable() seems to be what we want Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu/gfx10,11: use memcpy_to/fromio for MQDsAlex Deucher
Since they were moved to VRAM, we need to use the IO variants of memcpy. Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: use mode-2 reset for RAS poison consumptionTao Zhou
Switch from mode-1 reset to mode-2 for poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu doorbell range should be set when gpu recoveryLin.Cao
GFX doorbell range should be set after flr otherwise the gfx doorbell range will be overlap with MEC. v2: remove "amdgpu_sriov_vf" and "amdgpu_in_reset" check, and add grbm select for the case of 2 gfx rings. Signed-off-by: Lin.Cao <lincao12@amd.com> Acked-by: ZhenGuo Yin <zhenguo.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-02Merge tag 'pci-v6.7-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Use acpi_evaluate_dsm_typed() instead of open-coding _DSM evaluation to learn device characteristics (Andy Shevchenko) - Tidy multi-function header checks using new PCI_HEADER_TYPE_MASK definition (Ilpo Järvinen) - Simplify config access error checking in various drivers (Ilpo Järvinen) - Use pcie_capability_clear_word() (not pcie_capability_clear_and_set_word()) when only clearing (Ilpo Järvinen) - Add pci_get_base_class() to simplify finding devices using base class only (ignoring subclass and programming interface) (Sui Jingfeng) - Add pci_is_vga(), which includes ancient PCI_CLASS_NOT_DEFINED_VGA devices from before the Class Code was added to PCI (Sui Jingfeng) - Use pci_is_vga() for vgaarb, sysfs "boot_vga", virtio, qxl to include ancient VGA devices (Sui Jingfeng) Resource management: - Make pci_assign_unassigned_resources() non-init because sparc uses it after init (Randy Dunlap) Driver binding: - Retain .remove() and .probe() callbacks (previously __init) because sysfs may cause them to be called later (Uwe Kleine-König) - Prevent xHCI driver from claiming AMD VanGogh USB3 DRD device, so it can be claimed by dwc3 instead (Vicki Pfau) PCI device hotplug: - Add Ampere Altra Attention Indicator extension driver for acpiphp (D Scott Phillips) Power management: - Quirk VideoPropulsion Torrent QN16e with longer delay after reset (Lukas Wunner) - Prevent users from overriding drivers that say we shouldn't use D3cold (Lukas Wunner) - Avoid PME from D3hot/D3cold for AMD Rembrandt and Phoenix USB4 because wakeup interrupts from those states don't work if amd-pmc has put the platform in a hardware sleep state (Mario Limonciello) IOMMU: - Disable ATS for Intel IPU E2000 devices with invalidation message endianness erratum (Bartosz Pawlowski) Error handling: - Factor out interrupt enable/disable into helpers (Kai-Heng Feng) Peer-to-peer DMA: - Fix flexible-array usage in struct pci_p2pdma_pagemap in case we ever use pagemaps with multiple entries (Gustavo A. R. Silva) ASPM: - Revert a change that broke when drivers disabled L1 and users later enabled an L1.x substate via sysfs, and fix a similar issue when users disabled L1 via sysfs (Heiner Kallweit) Endpoint framework: - Fix double free in __pci_epc_create() (Dan Carpenter) - Use IS_ERR_OR_NULL() to simplify endpoint core (Ruan Jinjie) Cadence PCIe controller driver: - Drop unused "is_rc" member (Li Chen) Freescale Layerscape PCIe controller driver: - Enable 64-bit addressing in endpoint mode (Guanhua Gao) Intel VMD host bridge driver: - Fix multi-function header check (Ilpo Järvinen) Microsoft Hyper-V host bridge driver: - Annotate struct hv_dr_state with __counted_by (Kees Cook) NVIDIA Tegra194 PCIe controller driver: - Drop setting of LNKCAP_MLW (max link width) since dw_pcie_setup() already does this via dw_pcie_link_set_max_link_width() (Yoshihiro Shimoda) Qualcomm PCIe controller driver: - Use PCIE_SPEED2MBS_ENC() to simplify encoding of link speed (Manivannan Sadhasivam) - Add a .write_dbi2() callback so DBI2 register writes, e.g., for setting the BAR size, work correctly (Manivannan Sadhasivam) - Enable ASPM for platforms that use 1.9.0 ops, because the PCI core doesn't enable ASPM states that haven't been enabled by the firmware (Manivannan Sadhasivam) Renesas R-Car Gen4 PCIe controller driver: - Add DesignWare core support (set max link width, EDMA_UNROLL flag, .pre_init(), .deinit(), etc) for use by R-Car Gen4 driver (Yoshihiro Shimoda) - Add driver and DT schema for DesignWare-based Renesas R-Car Gen4 controller in both host and endpoint mode (Yoshihiro Shimoda) Xilinx NWL PCIe controller driver: - Update ECAM size to support 256 buses (Thippeswamy Havalige) - Stop setting bridge primary/secondary/subordinate bus numbers, since PCI core does this (Thippeswamy Havalige) Xilinx XDMA controller driver: - Add driver and DT schema for Zynq UltraScale+ MPSoCs devices with Xilinx XDMA Soft IP (Thippeswamy Havalige) Miscellaneous: - Use FIELD_GET()/FIELD_PREP() to simplify and reduce use of _SHIFT macros (Ilpo Järvinen, Bjorn Helgaas) - Remove logic_outb(), _outw(), outl() duplicate declarations (John Sanpe) - Replace unnecessary UTF-8 in Kconfig help text because menuconfig doesn't render it correctly (Liu Song)" * tag 'pci-v6.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (102 commits) PCI: qcom-ep: Add dedicated callback for writing to DBI2 registers PCI: Simplify pcie_capability_clear_and_set_word() to ..._clear_word() PCI: endpoint: Fix double free in __pci_epc_create() PCI: xilinx-xdma: Add Xilinx XDMA Root Port driver dt-bindings: PCI: xilinx-xdma: Add schemas for Xilinx XDMA PCIe Root Port Bridge PCI: xilinx-cpm: Move IRQ definitions to a common header PCI: xilinx-nwl: Modify ECAM size to enable support for 256 buses PCI: xilinx-nwl: Rename the NWL_ECAM_VALUE_DEFAULT macro dt-bindings: PCI: xilinx-nwl: Modify ECAM size in the DT example PCI: xilinx-nwl: Remove redundant code that sets Type 1 header fields PCI: hotplug: Add Ampere Altra Attention Indicator extension driver PCI/AER: Factor out interrupt toggling into helpers PCI: acpiphp: Allow built-in drivers for Attention Indicators PCI/portdrv: Use FIELD_GET() PCI/VC: Use FIELD_GET() PCI/PTM: Use FIELD_GET() PCI/PME: Use FIELD_GET() PCI/ATS: Use FIELD_GET() PCI/ATS: Show PASID Capability register width in bitmasks PCI/ASPM: Fix L1 substate handling in aspm_attr_store_common() ...