summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2022-11-09drm/amdgpu: Remove redundant I2C EEPROM addressLuben Tuikov
Remove redundant EEPROM_I2C_MADDR_54H address, since we already have it represented (ARCTURUS), and since we don't include the I2C device type identifier in EEPROM memory addresses, i.e. that high up in the device abstraction--we only use EEPROM memory addresses, as memory is continuously represented by EEPROM device(s) on the I2C bus. Add a comment describing what these memory addresses are, how they come about and how they're usually extracted from the device address byte. Cc: Candice Li <candice.li@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: c9bdc6c3cf39df ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amd/pm: enable mode1 reset on smu_v13_0_10Kenneth Feng
enable mode1 reset and prioritize debug port on smu_v13_0_10 as a more reliable message processing v2 - move mode1 reset callback to smu_v13_0_0_ppt.c Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdgpu: Drop eviction lock when allocating PT BOPhilip Yang
Re-take the eviction lock immediately again after the allocation is completed, to fix circular locking warning with drm_buddy allocator. Move amdgpu_vm_eviction_lock/unlock/trylock to amdgpu_vm.h as they are called from multiple files. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdgpu: Unlock bo_list_mutex after error handlingPhilip Yang
Get below kernel WARNING backtrace when pressing ctrl-C to kill kfdtest application. If amdgpu_cs_parser_bos returns error after taking bo_list_mutex, as caller amdgpu_cs_ioctl will not unlock bo_list_mutex, this generates the kernel WARNING. Add unlock bo_list_mutex after amdgpu_cs_parser_bos error handling to cleanup bo_list userptr bo. WARNING: kfdtest/2930 still has locks held! 1 lock held by kfdtest/2930: (&list->bo_list_mutex){+.+.}-{3:3}, at: amdgpu_cs_ioctl+0xce5/0x1f10 [amdgpu] stack backtrace: dump_stack_lvl+0x44/0x57 get_signal+0x79f/0xd00 arch_do_signal_or_restart+0x36/0x7b0 exit_to_user_mode_prepare+0xfd/0x1b0 syscall_exit_to_user_mode+0x19/0x40 do_syscall_64+0x40/0x80 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdgpu: Fix the kerneldoc descriptionRajneesh Bhardwaj
amdgpu_ttm_tt_set_userptr() is also called by the KFD as part of initializing the user pages for userptr BOs and also while initializing the GPUVM for a KFD process so update the function description. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09drm/amdgpu: workaround for TLB seq raceChristian König
It can happen that we query the sequence value before the callback had a chance to run. Workaround that by grabbing the fence lock and releasing it again. Should be replaced by hw handling soon. Signed-off-by: Christian König <christian.koenig@amd.com> CC: stable@vger.kernel.org # 5.19+ Fixes: 5255e146c99a6 ("drm/amdgpu: rework TLB flushing") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2113 Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Stefan Springer <stefanspr94@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-08Merge tag 'amd-drm-next-6.2-2022-11-04' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.2-2022-11-04: amdgpu: - Add TMZ support for GC 11.0.1 - More IP version check conversions - Mode2 reset fixes for sienna cichlid - SMU 13.x fixes - RAS enablement on MP 13.x - Replace kmap with kmap_local_page() - Misc Clang warning fixes - SR-IOV fixes for GC 11.x - PCI AER fix - DCN 3.2.x commit sequence rework - SDMA 4.x doorbell fix - Expose additional new GC 11.x firmware versions - Misc code cleanups - S0i3 fixes - More DC FPU cleanup - Add more DC kerneldoc - Misc spelling and grammer fixes - DCN 3.1.x fixes - Plane modifier fix - MCA RAS enablement - Secure display locking fix - RAS TA rework - RAS EEPROM fixes - Fail suspend if eviction fails - Drop AMD specific DSC workarounds in favor of drm EDID quirks - SR-IOV suspend/resume fixes - Enable DCN support for ARM - Enable secure display on DCN 2.1 amdkfd: - Cache size fixes for GC 10.3.x - kfd_dev struct cleanup - GC11.x CWSR trap handler fix - Userptr fixes - Warning fixes radeon: - Replace kmap with kmap_local_page() UAPI: - Expose additional new GC 11.x firmware versions via the existing INFO query drm: - Add some new EDID DSC quirks Signed-off-by: Dave Airlie <airlied@redhat.com> # Conflicts: # drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221104205827.6008-1-alexander.deucher@amd.com
2022-11-05drm/fb-helper: Remove unnecessary include statementsThomas Zimmermann
Remove include statements for <drm/drm_fb_helper.h> where it is not required (i.e., most of them). In a few places include other header files that are required by the source code. v3: * fix amdgpu include statements * fix rockchip include statements Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-23-tzimmermann@suse.de
2022-11-05drm/fb-helper: Move generic fbdev emulation into separate source fileThomas Zimmermann
Move the generic fbdev implementation into its own source and header file. Adapt drivers. No functional changes, but some of the internal helpers have been renamed to fit into the drm_fbdev_ naming scheme. v3: * rename drm_fbdev.{c,h} to drm_fbdev_generic.{c,h} * rebase onto vmwgfx changes * rebase onto xlnx changes * fix include statements in amdgpu Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-22-tzimmermann@suse.de
2022-11-05drm/amdgpu: Don't set struct drm_driver.output_poll_changedThomas Zimmermann
Don't set struct drm_driver.output_poll_changed. It's used to restore the fbdev console. But as amdgpu uses generic fbdev emulation, the console is being restored by the DRM client helpers already. See the functions drm_kms_helper_hotplug_event() and drm_kms_helper_connector_hotplug_event() in drm_probe_helper.c. v2: * fix commit description (Christian) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-5-tzimmermann@suse.de
2022-11-05Merge drm/drm-next into drm-misc-nextThomas Zimmermann
Backmerging drm/drm-next to get the latest changes in the xlnx driver. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2022-11-04drm/amdkfd: Remove skiping userptr buffer mapping when mmu notifier marks it ↵Xiaogang Chen
as invalid mmu notifier does not always hold mm->sem during call back. That causes a race condition between kfd userprt buffer mapping and mmu notifier which leds to gpu shadder or SDMA access userptr buffer before it has been mapped to gpu VM. Always map userptr buffer to avoid that though it may make some userprt buffers mapped two times. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: fix for suspend/resume sequence under sriovVictor Zhao
- clear kiq ring after suspend/resume under sriov to aviod kiq ring test failure - update irq after resume to fix kiq interrput loss Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amd/amdgpu: temporary workaround to skip ras error for gc_v11_0_3Kenneth Feng
temporary workaround to skip ras error for gc_v11_0_3 until IFWI release later Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: switch to select_se_sh wrapper for gfx v9_0Hawking Zhang
To allow invoking ip specific callbacks Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Fix type of second parameter in trans_msg() callbackNathan Chancellor
With clang's kernel control flow integrity (kCFI, CONFIG_CFI_CLANG), indirect call targets are validated against the expected function pointer prototype to make sure the call target is valid to help mitigate ROP attacks. If they are not identical, there is a failure at run time, which manifests as either a kernel panic or thread getting killed. A proposed warning in clang aims to catch these at compile time, which reveals: drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:412:15: error: incompatible function pointer types initializing 'void (*)(struct amdgpu_device *, u32, u32, u32, u32)' (aka 'void (*)(struct amdgpu_device *, unsigned int, unsigned int, unsigned int, unsigned int)') with an expression of type 'void (struct amdgpu_device *, enum idh_request, u32, u32, u32)' (aka 'void (struct amdgpu_device *, enum idh_request, unsigned int, unsigned int, unsigned int)') [-Werror,-Wincompatible-function-pointer-types-strict] .trans_msg = xgpu_ai_mailbox_trans_msg, ^~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c:435:15: error: incompatible function pointer types initializing 'void (*)(struct amdgpu_device *, u32, u32, u32, u32)' (aka 'void (*)(struct amdgpu_device *, unsigned int, unsigned int, unsigned int, unsigned int)') with an expression of type 'void (struct amdgpu_device *, enum idh_request, u32, u32, u32)' (aka 'void (struct amdgpu_device *, enum idh_request, unsigned int, unsigned int, unsigned int)') [-Werror,-Wincompatible-function-pointer-types-strict] .trans_msg = xgpu_nv_mailbox_trans_msg, ^~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. The type of the second parameter in the prototype should be 'enum idh_request' instead of 'u32'. Update it to clear up the warnings. Link: https://github.com/ClangBuiltLinux/linux/issues/1750 Reported-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Replace one-element array with flexible-array memberPaulo Miguel Almeida
One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element array with flexible-array member in struct _ATOM_FAKE_EDID_PATCH_RECORD and refactor the rest of the code accordingly. Important to mention is that doing a build before/after this patch results in no binary output differences. This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines on memcpy() and help us make progress towards globally enabling -fstrict-flex-arrays=3 [1]. Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/238 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101836 [1] Signed-off-by: Paulo Miguel Almeida <paulo.miguel.almeida.rodenas@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu/gfx11: set gfx.funcs in early initAlex Deucher
So the callbacks are set early in case we need them. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu/gfx10: set gfx.funcs in early initAlex Deucher
So the callbacks are set early in case we need them. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu/gfx9: set gfx.funcs in early initAlex Deucher
So the callbacks are set before we use them. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Remove unnecessary register program in SRIOVPeng Ju Zhou
Remove unnecessary register program in SRIOV Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Disable MCBP from soc21 for SRIOVYiqing Yao
[why] Start from soc21, CP does not support MCBP, so disable it. [how] Used amgpu_mcbp flag alone instead of checking if is in SRIOV to enable/disable MCBP. Only set flag to enable on asic_type prior to soc21 in SRIOV. Signed-off-by: Yiqing Yao <yiqing.yao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Clean up soc21 early init for SRIOVYiqing Yao
Use virt_init_setting instead of per ip version setting. Signed-off-by: Yiqing Yao <yiqing.yao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: extend halt_if_hws_hang to MESGraham Sider
Hang on MES timeout if halt_if_hws_hang is set to 1. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04Merge tag 'drm-misc-next-2022-11-03' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 6.2: UAPI Changes: Cross-subsystem Changes: - dma-buf: locking improvements - firmware: New API in the RaspberryPi firmware driver used by vc4 Core Changes: - client: Null pointer dereference fix in drm_client_buffer_delete() - mm/buddy: Add back random seed log - ttm: Convert ttm_resource to use size_t for its size, fix for an undefined behaviour Driver Changes: - bridge: - adv7511: use dev_err_probe - it6505: Fix return value check of pm_runtime_get_sync - panel: - sitronix: Fixes and clean-ups - lcdif: Increase DMA burst size - rockchip: runtime_pm improvements - vc4: Fix for a regression preventing the use of 4k @ 60Hz, and further HDMI rate constraints check. - vmwgfx: Cursor improvements Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20221103083437.ksrh3hcdvxaof62l@houat
2022-11-03drm/scheduler: rename dependency callback into prepare_jobChristian König
This now matches much better what this is doing. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-14-christian.koenig@amd.com
2022-11-03drm/amdgpu: use scheduler dependencies for CSChristian König
Entirely remove the sync obj in the job. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-11-christian.koenig@amd.com
2022-11-03drm/amdgpu: use scheduler dependencies for UVD msgsChristian König
Instead of putting that into the job sync object. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-10-christian.koenig@amd.com
2022-11-03drm/amdgpu: use scheduler dependencies for VM updatesChristian König
Instead of putting that into the job sync object. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-9-christian.koenig@amd.com
2022-11-03drm/amdgpu: move explicit sync check into the CSChristian König
This moves the memory allocation out of the critical code path. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-8-christian.koenig@amd.com
2022-11-03drm/amdgpu: cleanup scheduler job initialization v2Christian König
Init the DRM scheduler base class while allocating the job. This makes the whole handling much more cleaner. v2: fix coding style Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-7-christian.koenig@amd.com
2022-11-03drm/amdgpu: drop amdgpu_sync from amdgpu_vmid_grab v2Christian König
Instead return the fence directly. Avoids memory allocation to store the fence. v2: cleanup coding style as well Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-6-christian.koenig@amd.com
2022-11-03drm/amdgpu: drop the fence argument from amdgpu_vmid_grabChristian König
This is always the job anyway. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-5-christian.koenig@amd.com
2022-11-03drm/amdgpu: use drm_sched_job_add_resv_dependencies for movesChristian König
Use the new common scheduler functions to figure out what to wait for. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-4-christian.koenig@amd.com
2022-11-01drm/amdgpu: Skip program gfxhub_v3_0_3 system aperture registers under SRIOVYifan Zha
[Why] gfxhub_v3_0_3 system aperture registers are removed from RLCG register access range. [How] Skip access gfxhub_v3_0_3 system aperture registers under SRIOV VF. These registers will be programmed on host side. Signed-off-by: Yifan Zha <Yifan.Zha@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-01drm/amdgpu: Skip access SDMA0_F32_CNTL in sdma_v6_0_enable under SRIOVYifan Zha
[Why] SDMA0_F32_CNTL is a PF_only regitser which will be blocked by L1. RLCG will not program the register as well. [How] Skip to program SDMA0_F32_CNTL under SRIOV VF. Signed-off-by: Yifan Zha <Yifan.Zha@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-01drm/amdgpu: Skip access GRBM_CNTL under SRIOV on gfx_v11Yifan Zha
[Why] GRBM_CNTL is a PF_only register on gfx_v11. RLCG interface will return "out of range" under SRIOV VF. [How] Skip access GRBM_CNTL under gfx_v11 SRIOV VF. Signed-off-by: Yifan Zha <Yifan.Zha@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-01drm/amdgpu: Disable GPU reset on SRIOV before remove pci.Gavin Wan
The recent change brought a bug on SRIOV envrionment. It caused unloading amdgpu failed on Guest VM. The reason is that the VF FLR was requested while unloading amdgpu driver, but the VF FLR of SRIOV sequence is wrong while removing PCI device. For SRIOV, the guest driver should not trigger the whole XGMI hive to do the reset. Host driver control how the device been reset. Fixes: f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu v13_0_2") Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Gavin Wan <Gavin.Wan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-01drm/amdgpu: Enable GFX RAS feature for gfx v11_0_3Candice Li
v1: Support gfx ras feature enablement for gfx v11_0_3. v2: Update function name and error message. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdkfd: Cleanup kfd_dev structMukul Joshi
Cleanup kfd_dev struct by removing ddev and pdev as both drm_device and pci_dev can be fetched from amdgpu_device. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: disable GFXOFF during compute for GFX11Graham Sider
Temporary workaround to fix issues observed in some compute applications when GFXOFF is enabled on GFX11. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amd: Fail the suspend if resources can't be evictedMario Limonciello
If a system does not have swap and memory is under 100% usage, amdgpu will fail to evict resources. Currently the suspend carries on proceeding to reset the GPU: ``` [drm] evicting device resources failed [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12 [drm] free PSP TMR buffer [TTM] Failed allocating page table [drm] evicting device resources failed amdgpu 0000:03:00.0: amdgpu: MODE1 reset amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset ``` At this point if the suspend actually succeeded I think that amdgpu would have recovered because the GPU would have power cut off and restored. However the kernel fails to continue the suspend from the memory pressure and amdgpu fails to run the "resume" from the aborted suspend. ``` ACPI: PM: Preparing to enter system sleep state S3 SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO) cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0 node 0: slabs: 22, objs: 1122, free: 0 ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651) [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed! [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62 amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62 amdgpu 0000:03:00.0: PM: failed to resume async: error -62 ``` To avoid this series of unfortunate events, fail amdgpu's suspend when the memory eviction fails. This will let the system gracefully recover and the user can try suspend again when the memory pressure is relieved. Reported-by: post@davidak.de Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: Add EEPROM I2C address support for ip discoveryCandice Li
1. Update EEPROM_I2C_MADDR_SMU_13_0_0 to EEPROM_I2C_MADDR_54H 2. Add EEPROM I2C address support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10Candice Li
Enable RAS EEPROM support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: Optimize TA load/unload/invoke debugfs interfacesCandice Li
1. Add a function pointer structure ta_funcs to psp context 2. Make the interfaces generic to all TAs 3. Leverage exisitng TA context and remove unused functions 4. Fix return code bugs v2: Add comments for ta funcs macros and correct typo Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: Optimize RAS TA initialization and TA unload funcsCandice Li
1. Save TA unload psp response status 2. Add RAS TA loading status check for initializaiton 3. Drop RAS context teardown to allow RAS TA to be reloaded Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Candice Li <candice.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: remove deprecated MES version varsGraham Sider
MES scheduler and kiq versions are stored in mes.sched_version and mes.kiq_version, respectively, which are read from a register after their queues are initialized. Remove mes.ucode_fw_version and mes.data_fw_version which tried to read this versioning info from the firmware headers (which don't contain this information). Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Jack Xiao <Jack.Xiao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: correct MES debugfs versionsGraham Sider
Use mes.sched_version, mes.kiq_version for debugfs as mes.ucode_fw_version does not contain correct versioning information. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Jack Xiao <Jack.Xiao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: Move the mutex_lock to protect the return status of ↵Alan Liu
securedisplay command buffer [Why] Before we call psp_securedisplay_invoke(), we call psp_prep_securedisplay_cmd_buf() to prepare and initialize the command buffer. However, we didn't use the mutex_lock to protect the status of command buffer. So when multiple threads are using the command buffer, after thread A return from psp_securedisplay_invoke() and the command buffer status is set to SUCCESS, another thread B may call psp_prep_securedisplay_cmd_buf() and initialize the status to FAILURE again, and cause Thread A to get a failure return status. [How] Move the mutex_lock out of psp_securedisplay_invoke() to its caller to cover psp_prep_securedisplay_cmd_buf() and the code checking the return status of command buffer. Signed-off-by: Alan Liu <HaoPing.Liu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amdgpu: remove ras_error_status parameter for UMC poison handlerTao Zhou
Make the code simpler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>