summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2023-11-03drm/amdgpu: fix GRBM read timeout when do mes_self_testTim Huang
Use a proper MEID to make sure the CP_HQD_* and CP_GFX_HQD_* registers can be touched when initialize the compute and gfx mqd in mes_self_test. Otherwise, we expect no response from CP and an GRBM eventual timeout. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_countTao Zhou
Handle xgmi hive case. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: Attach eviction fence on allocFelix Kuehling
Instead of attaching the eviction fence when a KFD BO is first mapped, attach it when it is allocated or imported. This in preparation to allow KFD BOs to be mapped using the render node API. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdkfd: Improve amdgpu_vm_handle_movedFelix Kuehling
Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by the caller. This will be useful for handling extra BO VA mappings in KFD VMs that are managed through the render node API. v2: rebase against drm_exec changes (Alex) Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: don't put MQDs in VRAM on ARM | ARM64Alex Deucher
Issues were reported with commit 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") on an ADLINK Ampere Altra Developer Platform (AVA developer platform). Various ARM systems seem to have problems related to PCIe and MMIO access. In this case, I'm not sure if this is specific to the ADLINK platform or ARM in general. Seems to be some coherency issue with VRAM. For now, just don't put MQDs in VRAM on ARM. Link: https://lists.freedesktop.org/archives/amd-gfx/2023-October/100453.html Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: alexey.klimov@linaro.org
2023-11-03drm/amdgpu: add a retry for IP discovery initAlex Deucher
AMD dGPUs have integrated FW that runs as soon as the device gets power and initializes the board (determines the amount of memory, provides configuration details to the driver, etc.). For direct PCIe attached cards this happens as soon as power is applied and normally completes well before the OS has even started loading. However, with hotpluggable ports like USB4, the driver needs to wait for this to complete before initializing the device. This normally takes 60-100ms, but could take longer on some older boards periodically due to memory training. Retry for up to a second. In the non-hotplug case, there should be no change in behavior and this should complete on the first try. v2: adjust test criteria v3: adjust checks for the masks, only enable on removable devices v4: skip bif_fb_en check Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: ungate power gating when system suspendPerry Yuan
[Why] During suspend, if GFX DPM is enabled and GFXOFF feature is enabled the system may get hung. So, it is suggested to disable GFXOFF feature during suspend and enable it after resume. [How] Update the code to disable GFXOFF feature during suspend and enable it after resume. [ 311.396526] amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000 [ 311.396530] amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features! [ 311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62 Acked-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Perry Yuan <perry.yuan@amd.com> Signed-off-by: Kun Liu <kun.liu2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: don't use pci_is_thunderbolt_attached()Alex Deucher
It's only valid on Intel systems with the Intel VSEC. Use dev_is_removable() instead. This should do the right thing regardless of the platform. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu: don't use ATRM for external devicesAlex Deucher
The ATRM ACPI method is for fetching the dGPU vbios rom image on laptops and all-in-one systems. It should not be used for external add in cards. If the dGPU is thunderbolt connected, don't try ATRM. v2: pci_is_thunderbolt_attached only works for Intel. Use pdev->external_facing instead. v3: dev_is_removable() seems to be what we want Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-11-03drm/amdgpu/gfx10,11: use memcpy_to/fromio for MQDsAlex Deucher
Since they were moved to VRAM, we need to use the IO variants of memcpy. Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu: use mode-2 reset for RAS poison consumptionTao Zhou
Switch from mode-1 reset to mode-2 for poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03drm/amdgpu doorbell range should be set when gpu recoveryLin.Cao
GFX doorbell range should be set after flr otherwise the gfx doorbell range will be overlap with MEC. v2: remove "amdgpu_sriov_vf" and "amdgpu_in_reset" check, and add grbm select for the case of 2 gfx rings. Signed-off-by: Lin.Cao <lincao12@amd.com> Acked-by: ZhenGuo Yin <zhenguo.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: remove unused macro HW_REVYang Wang
remove unused macro HW_REV Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: Fix the vram base start addressArunpravin Paneer Selvam
If the size returned by drm buddy allocator is higher than the required size, we take the higher size to calculate the buffer start address. This is required if we couldn't trim the buffer to the requested size. This will fix the display corruption issue on APU's which has limited VRAM size. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2859 Fixes: 0a1844bf0b53 ("drm/buddy: Improve contiguous memory allocation") Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: set XGMI IP version manually for v6_4Tao Zhou
The version can't be queried from discovery table. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: add unmap latency when gfx11 set kiq resourcesTong Liu01
[why] If driver does not set unmap latency for KIQ, the default value of KIQ unmap latency is zero. When do unmap queue, KIQ will return that almost immediately after receiving unmap command. So, the queue status will be saved to MQD incorrectly or lost in some chance. [how] Set unmap latency when do kiq set resources. The unmap latency is set to be 1 second that is synchronized with Windows driver. Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amd/pm: fix the high voltage and temperature issueKenneth Feng
fix the high voltage and temperature issue after the driver is unloaded on smu 13.0.0, smu 13.0.7 and smu 13.0.10 v2 - fix the code format and make sure it is used on the unload case only. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: remove amdgpu_mes_self_test in gpu recoverYifan Zhang
gpu tlb flush is skipped if reset sem is held, it makes mes_self_test fail since it involves add_hw_queue/remove_hw_queue which needs tlb flush functional. Remove mes_self_test in gpu recover sequence. This patch is to fix the recover failure in gfx11. [ 1831.768292] [drm] ring sdma_32769.3.3 was added [ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass [ 1831.768337] [drm] ring compute_32769.2.2 ib test pass [ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process pid 0 thread pid 0) [ 1831.768434] amdgpu 0000:c2:00.0: amdgpu: in page starting at address 0x0000aec200000000 from client 10 [ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30 [ 1831.768473] amdgpu 0000:c2:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5) [ 1831.768489] amdgpu 0000:c2:00.0: amdgpu: MORE_FAULTS: 0x0 [ 1831.768501] amdgpu 0000:c2:00.0: amdgpu: WALKER_ERROR: 0x0 [ 1831.768513] amdgpu 0000:c2:00.0: amdgpu: PERMISSION_FAULTS: 0x3 [ 1831.768521] amdgpu 0000:c2:00.0: amdgpu: MAPPING_ERROR: 0x0 [ 1831.768529] amdgpu 0000:c2:00.0: amdgpu: RW: 0x0 [ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110) [ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3 [ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3 Fixes: e2e3788850b9 ("drm/amdgpu: rework lock handling for flush_tlb v2") Reported-by: Li Ma <li.ma@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: Drop deferred error in uncorrectable error checkCandice Li
Drop checking deferred error which can be handled by poison consumption. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31drm/amdgpu: check RAS supported first in ras_reset_error_countTao Zhou
Not all platforms support RAS. Fixes: 73582be11ac8 ("drm/amdgpu: bypass RAS error reset in some conditions") Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31Merge tag 'amd-drm-next-6.7-2023-10-27' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.7-2023-10-27: amdgpu: - RAS fixes - Seamless boot fixes - NBIO 7.7 fix - SMU 14.0 fixes - GC 11.5 fixes - DML2 fixes - ASPM fixes - VPE fixes - Misc code cleanups - SRIOV fixes - Add some missing copyright notices - DCN 3.5 fixes - FAMS fixes - Backlight fix - S/G display fix - fdinfo cleanups - EXT_COHERENT fixes for APU and NUMA systems amdkfd: - Misc fixes - Misc code cleanups - SVM fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231027200343.57132-1-alexander.deucher@amd.com
2023-10-31Merge tag 'drm-misc-next-2023-10-27' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v6.7-rc1: drm-misc-next-2023-10-19 + following: UAPI Changes: Cross-subsystem Changes: - Convert fbdev drivers to use fbdev i/o mem helpers. Core Changes: - Use cross-references for macros in docs. - Make drm_client_buffer_addb use addfb2. - Add NV20 and NV30 YUV formats. - Documentation updates for create_dumb ioctl. - CI fixes. - Allow variable number of run-queues in scheduler. Driver Changes: - Rename drm/ast constants. - Make ili9882t its own driver. - Assorted fixes in ivpu, vc4, bridge/synopsis, amdgpu. - Add planar formats to rockchip. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/3d92fae8-9b1b-4165-9ca8-5fda11ee146b@linux.intel.com
2023-10-27drm/amdgpu: Remove unused variables from amdgpu_show_fdinfoUmio Yasuno
Remove unused variables from amdgpu_show_fdinfo Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Umio Yasuno <coelacanth_dream@protonmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amdgpu: Remove duplicate fdinfo fieldsRob Clark
Some of the fields that are handled by drm_show_fdinfo() crept back in when rebasing the patch. Remove them again. Fixes: 376c25f8ca47 ("drm/amdgpu: Switch to fdinfo helper") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: <alexander.deucher@amd.com> Co-developed-by: Umio Yasuno <coelacanth_dream@protonmail.com> Signed-off-by: Umio Yasuno <coelacanth_dream@protonmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloadedKenneth Feng
avoid to disable gfxhub interrupt when driver is unloaded on gmc 11 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systemsDavid Francis
On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and MTYPE_UC for nonlocal memory. On NUMA systems, local memory gets the local mtype, set by an override callback. If EXT_COHERENT is set, memory will be set as MTYPE_UC by default, with local memory MTYPE_CC. Add an option in the override function for this case, and add a check to ensure it is not used on UNCACHED memory. V2: Combined APU and NUMA code into one patch V3: Fixed a potential nullptr in amdgpu_vm_bo_update Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo tableCandice Li
Retrieve correctable error count from ce_count_lo_chip instead of mca_umc_status. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amdgpu: Identify data parity error corrected in replay modeCandice Li
Use ErrorCodeExt field to identify data parity error in replay mode. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27drm/amdgpu: Fix typo in IP discovery parsingMukul Joshi
Fix a typo in parsing of the GC info table header when reading the IP discovery table. Fixes: 0e64c9aad031 ("drm/amdgpu: add type conversion for gc info") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: Use pcie domain of xcc acpi objectsLijo Lazar
PCI domain/segment information of xccs is available through ACPI DSM methods. Consider that also while looking for devices. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: Add a read to GFX v9.4.3 ring testLijo Lazar
Issue a read to confirm the register write before ringing doorbell. With multiple XCCs there is chance for race condition. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: get RAS poison status from DF v4_6_2Tao Zhou
Add DF block and RAS poison mode query for DF v4_6_2. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Acked-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: Use discovery table's subrevisionLijo Lazar
Use subrevision of IP version in discovery table to identify SOC revision id for NBIO v7.9 SOCs. Only newer bootloaders update subrevision field. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amd: Explicitly disable ASPM when dynamic switching disabledMario Limonciello
Currently there are separate but related checks: * amdgpu_device_should_use_aspm() * amdgpu_device_aspm_support_quirk() * amdgpu_device_pcie_dynamic_switching_supported() Simplify into checking whether DPM was enabled or not in the auto case. This works because amdgpu_device_pcie_dynamic_switching_supported() populates that value. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amd: Move AMD_IS_APU check for ASPM into top level functionMario Limonciello
There is no need for every ASIC driver to perform the same check. Move the duplicated code into amdgpu_device_should_use_aspm(). Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amd: Disable PP_PCIE_DPM_MASK when dynamic speed switching not supportedMario Limonciello
Rather than individual ASICs checking for the quirk, set the quirk at the driver level. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu remove restriction of sriov max_pfn on Vega10Lin.Cao
Remove restriction of sriov max_pfn so that TBA and TMA can move to high 47 bits address. Regression test: change range alloc flag of libdrm as AMDGPU_VA_RANGE_HIGH and there is no flr occur when testing amdgpu_test of drm. Signed-off-by: Lin.Cao <lincao12@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULLQu Huang
In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file could result in an abnormal null pointer access when the smc_rreg pointer is NULL. Below are the steps to reproduce this issue and the corresponding exception log: 1. Navigate to the directory: /sys/kernel/debug/dri/0 2. Execute command: cat amdgpu_regs_smc 3. Exception Log:: [4005007.702554] BUG: kernel NULL pointer dereference, address: 0000000000000000 [4005007.702562] #PF: supervisor instruction fetch in kernel mode [4005007.702567] #PF: error_code(0x0010) - not-present page [4005007.702570] PGD 0 P4D 0 [4005007.702576] Oops: 0010 [#1] SMP NOPTI [4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G OE 5.15.0-43-generic #46-Ubunt u [4005007.702590] RIP: 0010:0x0 [4005007.702598] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [4005007.702600] RSP: 0018:ffffa82b46d27da0 EFLAGS: 00010206 [4005007.702605] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa82b46d27e68 [4005007.702609] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9940656e0000 [4005007.702612] RBP: ffffa82b46d27dd8 R08: 0000000000000000 R09: ffff994060c07980 [4005007.702615] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f5e06753000 [4005007.702618] R13: ffff9940656e0000 R14: ffffa82b46d27e68 R15: 00007f5e06753000 [4005007.702622] FS: 00007f5e0755b740(0000) GS:ffff99479d300000(0000) knlGS:0000000000000000 [4005007.702626] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [4005007.702629] CR2: ffffffffffffffd6 CR3: 00000003253fc000 CR4: 00000000003506e0 [4005007.702633] Call Trace: [4005007.702636] <TASK> [4005007.702640] amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu] [4005007.703002] full_proxy_read+0x5c/0x80 [4005007.703011] vfs_read+0x9f/0x1a0 [4005007.703019] ksys_read+0x67/0xe0 [4005007.703023] __x64_sys_read+0x19/0x20 [4005007.703028] do_syscall_64+0x5c/0xc0 [4005007.703034] ? do_user_addr_fault+0x1e3/0x670 [4005007.703040] ? exit_to_user_mode_prepare+0x37/0xb0 [4005007.703047] ? irqentry_exit_to_user_mode+0x9/0x20 [4005007.703052] ? irqentry_exit+0x19/0x30 [4005007.703057] ? exc_page_fault+0x89/0x160 [4005007.703062] ? asm_exc_page_fault+0x8/0x30 [4005007.703068] entry_SYSCALL_64_after_hwframe+0x44/0xae [4005007.703075] RIP: 0033:0x7f5e07672992 [4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 e c 28 48 89 54 24 [4005007.703083] RSP: 002b:00007ffe03097898 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [4005007.703088] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f5e07672992 [4005007.703091] RDX: 0000000000020000 RSI: 00007f5e06753000 RDI: 0000000000000003 [4005007.703094] RBP: 00007f5e06753000 R08: 00007f5e06752010 R09: 00007f5e06752010 [4005007.703096] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000 [4005007.703099] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [4005007.703105] </TASK> [4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash af_alg binfmt_misc nls_ iso8859_1 ipmi_ssif ast intel_rapl_msr intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t tm edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl sch_fq_codel ipmi_devintf ipm i_msghandler msr parport_pc ppdev lp parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo mon ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) iommu_v 2 amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm igb ahci xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca [4005007.703184] CR2: 0000000000000000 [4005007.703188] ---[ end trace ac65a538d240da39 ]--- [4005007.800865] RIP: 0010:0x0 [4005007.800871] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [4005007.800874] RSP: 0018:ffffa82b46d27da0 EFLAGS: 00010206 [4005007.800878] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa82b46d27e68 [4005007.800881] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9940656e0000 [4005007.800883] RBP: ffffa82b46d27dd8 R08: 0000000000000000 R09: ffff994060c07980 [4005007.800886] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f5e06753000 [4005007.800888] R13: ffff9940656e0000 R14: ffffa82b46d27e68 R15: 00007f5e06753000 [4005007.800891] FS: 00007f5e0755b740(0000) GS:ffff99479d300000(0000) knlGS:0000000000000000 [4005007.800895] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [4005007.800898] CR2: ffffffffffffffd6 CR3: 00000003253fc000 CR4: 00000000003506e0 Signed-off-by: Qu Huang <qu.huang@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: bypass RAS error reset in some conditionsTao Zhou
PMFW is responsible for RAS error reset in some conditions, driver can skip the operation. v2: add check for ras->in_recovery, it's set earlier than amdgpu_in_reset. v3: fix error in gpu reset check. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: enable RAS poison mode for APUTao Zhou
Enable it by default on APU platform. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu/vpe: correct queue stop programingLang Yu
Otherwise IB test would fail during GPU reset. Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amd: Disable ASPM for VI w/ all Intel systemsMario Limonciello
Originally we were quirking ASPM disabled specifically for VI when used with Alder Lake, but it appears to have problems with Rocket Lake as well. Like we've done in the case of dpm for newer platforms, disable ASPM for all Intel systems. Cc: stable@vger.kernel.org # 5.15+ Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Reported-and-tested-by: Paolo Gentili <paolo.gentili@canonical.com> Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: Add API to get full IP versionLijo Lazar
Fetch the full version of IP including variant and subrevision. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: add tmz support for GC IP v11.5.0Jiadong Zhu
Add tmz support for GC 11.5.0. Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: modify if condition in nbio_v7_7.cLi Ma
remove unnecessary "enable" in if condition. Signed-off-by: Li Ma <li.ma@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: refine ras error kernel log printYang Wang
refine ras error kernel log to avoid user-ridden ambiguity. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: fix find ras error node errorYang Wang
the origin function might return the wrong node. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26drm/amdgpu: move buffer funcs setting up a levelAlex Deucher
Rather than doing this in the IP code for the SDMA paging engine, move it up to the core device level init level. This should fix the scheduler init ordering. v2: drop extra parens v3: drop SDMA helpers v4: Added a Fixes tag because amdgpu dereferences an uninitialized scheduler without this patch, and this patch fixes this. (Luben) Tested-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20231025171928.3318505-1-alexander.deucher@amd.com Acked-by: Christian König <christian.koenig@amd.com> Fixes: 56e449603f0ac5 ("drm/sched: Convert the GPU scheduler to variable number of run-queues") Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
2023-10-26drm/sched: Convert the GPU scheduler to variable number of run-queuesLuben Tuikov
The GPU scheduler has now a variable number of run-queues, which are set up at drm_sched_init() time. This way, each driver announces how many run-queues it requires (supports) per each GPU scheduler it creates. Note, that run-queues correspond to scheduler "priorities", thus if the number of run-queues is set to 1 at drm_sched_init(), then that scheduler supports a single run-queue, i.e. single "priority". If a driver further sets a single entity per run-queue, then this creates a 1-to-1 correspondence between a scheduler and a scheduled entity. Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Russell King <linux+etnaviv@armlinux.org.uk> Cc: Qiang Yu <yuq825@gmail.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Emma Anholt <emma@anholt.net> Cc: etnaviv@lists.freedesktop.org Cc: lima@lists.freedesktop.org Cc: linux-arm-msm@vger.kernel.org Cc: freedreno@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20231023032251.164775-1-luben.tuikov@amd.com
2023-10-25Merge tag 'amd-drm-next-6.7-2023-10-20' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.7-2023-10-20: amdgpu: - SMU 13 updates - UMSCH updates - DC MPO fixes - RAS updates - MES 11 fixes - Fix possible memory leaks in error pathes - GC 11.5 fixes - Kernel doc updates - PSP updates - APU IMU fixes - Misc code cleanups - SMU 11 fixes - OD fix - Frame size warning fixes - SR-IOV fixes - NBIO 7.11 updates - NBIO 7.7 updates - XGMI fixes - devcoredump updates amdkfd: - Misc code cleanups - SVM fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231020195043.4937-1-alexander.deucher@amd.com