summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2021-06-08drm/amdgpu: Fix incorrect register offsets for Sienna CichlidRohit Khaire
RLC_CP_SCHEDULERS and RLC_SPARE_INT0 have different offsets for Sienna Cichlid Signed-off-by: Rohit Khaire <rohit.khaire@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-08drm/amdgpu: Use drm_dbg_kms for reporting failure to get a GEM FBMichel Dänzer
drm_err meant broken user space could spam dmesg. Fixes: f258907fdd835e "drm/amdgpu: Verify bo size can fit framebuffer size on init." Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-08drm/amdgpu: switch kzalloc to kvzalloc in amdgpu_bo_createChangfeng
It will cause error when alloc memory larger than 128KB in amdgpu_bo_create->kzalloc. So it needs to switch kzalloc to kvzalloc. Call Trace: alloc_pages_current+0x6a/0xe0 kmalloc_order+0x32/0xb0 kmalloc_order_trace+0x1e/0x80 __kmalloc+0x249/0x2d0 amdgpu_bo_create+0x102/0x500 [amdgpu] ? xas_create+0x264/0x3e0 amdgpu_bo_create_vm+0x32/0x60 [amdgpu] amdgpu_vm_pt_create+0xf5/0x260 [amdgpu] amdgpu_vm_init+0x1fd/0x4d0 [amdgpu] Signed-off-by: Changfeng <Changfeng.Zhu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-02drm/amdgpu: make sure we unpin the UVD BONirmoy Das
Releasing pinned BOs is illegal now. UVD 6 was missing from: commit 2f40801dc553 ("drm/amdgpu: make sure we unpin the UVD BO") Fixes: 2f40801dc553 ("drm/amdgpu: make sure we unpin the UVD BO") Cc: stable@vger.kernel.org Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-02drm/amd/amdgpu:save psp ring wptr to avoid attackVictor Zhao
[Why] When some tools performing psp mailbox attack, the readback value of register can be a random value which may break psp. [How] Use a psp wptr cache machanism to aovid the change made by attack. v2: unify change and add detailed reason Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-02drm/amdgpu: Don't query CE and UE errorsLuben Tuikov
On QUERY2 IOCTL don't query counts of correctable and uncorrectable errors, since when RAS is enabled and supported on Vega20 server boards, this takes insurmountably long time, in O(n^3), which slows the system down to the point of it being unusable when we have GUI up. Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2") Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-02drm/amdgpu: refine amdgpu_fru_get_product_infoJiansong Chen
1. eliminate potential array index out of bounds. 2. return meaningful value for failure. Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-02drm/amdgpu: add judgement for dc supportAsher Song
Drop DC initialization when DCN is harvested in VBIOS. The way doesn't affect virtual display ip initialization. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Asher Song <Asher.Song@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-20drm/amdgpu/jpeg3: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/jpeg2.5: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/jpeg2.0: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/vcn3: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/vcn2.5: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/vcn2.0: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdgpu/vcn1: add cancel_delayed_work_sync before power gateJames Zhu
Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-20drm/amdkfd: correct sienna_cichlid SDMA RLC register offset errorKevin Wang
1.correct KFD SDMA RLC queue register offset error. (all sdma rlc register offset is base on SDMA0.RLC0_RLC0_RB_CNTL) 2.HQD_N_REGS (19+6+7+12) 12: the 2 more resgisters than navi1x (SDMAx_RLCy_MIDCMD_DATA{9,10}) the patch also can be fixed NULL pointer issue when read /sys/kernel/debug/kfd/hqds on sienna_cichlid chip. Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-19drm/amdgpu: stop touching sched.ready in the backendChristian König
This unfortunately comes up in regular intervals and breaks GPU reset for the engine in question. The sched.ready flag controls if an engine can't get working during hw_init, but should never be set to false during hw_fini. v2: squash in unused variable fix (Alex) Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amd/amdgpu: fix a potential deadlock in gpu resetLang Yu
When amdgpu_ib_ring_tests failed, the reset logic called amdgpu_device_ip_suspend twice, then deadlock occurred. Deadlock log: [ 805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110). [ 806.290952] [drm] free PSP TMR buffer [ 806.319406] ============================================ [ 806.320315] WARNING: possible recursive locking detected [ 806.321225] 5.11.0-custom #1 Tainted: G W OEL [ 806.322135] -------------------------------------------- [ 806.323043] cat/2593 is trying to acquire lock: [ 806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.325668] but task is already holding lock: [ 806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.328430] other info that might help us debug this: [ 806.329539] Possible unsafe locking scenario: [ 806.330549] CPU0 [ 806.330983] ---- [ 806.331416] lock(&adev->dm.dc_lock); [ 806.332086] lock(&adev->dm.dc_lock); [ 806.332738] *** DEADLOCK *** [ 806.333747] May be due to missing lock nesting notation [ 806.334899] 3 locks held by cat/2593: [ 806.335537] #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_read+0x4e/0x110 [ 806.337009] #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu] [ 806.339018] #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.340869] stack backtrace: [ 806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G W OEL 5.11.0-custom #1 [ 806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS WLD0C23N_Weekly_20_12_2 12/23/2020 [ 806.344413] Call Trace: [ 806.344849] dump_stack+0x93/0xbd [ 806.345435] __lock_acquire.cold+0x18a/0x2cf [ 806.346179] lock_acquire+0xca/0x390 [ 806.346807] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.347813] __mutex_lock+0x9b/0x930 [ 806.348454] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.349434] ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu] [ 806.350581] ? _raw_spin_unlock_irqrestore+0x47/0x50 [ 806.351437] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.352437] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.353252] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.354064] mutex_lock_nested+0x1b/0x20 [ 806.354747] ? mutex_lock_nested+0x1b/0x20 [ 806.355457] dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.356427] ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu] [ 806.357736] amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu] [ 806.360394] amdgpu_device_ip_suspend+0x21/0x70 [amdgpu] [ 806.362926] amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu] [ 806.365560] amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu] Signed-off-by: Lang Yu <Lang.Yu@amd.com> Acked-by: Christian KÃnig <christian.koenig@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amdgpu: update sdma golden setting for Navi12Guchun Chen
Current golden setting is out of date. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-19drm/amdgpu: update gc golden setting for Navi12Guchun Chen
Current golden setting is out of date. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-19drm/amdgpu: Fix a use-after-freexinhui pan
looks like we forget to set ttm->sg to NULL. Hit panic below [ 1235.844104] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b7b4b: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI [ 1235.989074] Call Trace: [ 1235.991751] sg_free_table+0x17/0x20 [ 1235.995667] amdgpu_ttm_backend_unbind.cold+0x4d/0xf7 [amdgpu] [ 1236.002288] amdgpu_ttm_backend_destroy+0x29/0x130 [amdgpu] [ 1236.008464] ttm_tt_destroy+0x1e/0x30 [ttm] [ 1236.013066] ttm_bo_cleanup_memtype_use+0x51/0xa0 [ttm] [ 1236.018783] ttm_bo_release+0x262/0xa50 [ttm] [ 1236.023547] ttm_bo_put+0x82/0xd0 [ttm] [ 1236.027766] amdgpu_bo_unref+0x26/0x50 [amdgpu] [ 1236.032809] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7aa/0xd90 [amdgpu] [ 1236.040400] kfd_ioctl_alloc_memory_of_gpu+0xe2/0x330 [amdgpu] [ 1236.046912] kfd_ioctl+0x463/0x690 [amdgpu] Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amdgpu: add video_codecs query support for aldebaranJames Zhu
Add video_codecs query support for aldebaran. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amd/amdgpu: fix refcount leakJingwen Chen
[Why] the gem object rfb->base.obj[0] is get according to num_planes in amdgpufb_create, but is not put according to num_planes [How] put rfb->base.obj[0] in amdgpu_fbdev_destroy according to num_planes Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19drm/amdgpu: disable 3DCGCG on picasso/raven1 to avoid compute hangChangfeng
There is problem with 3DCGCG firmware and it will cause compute test hang on picasso/raven1. It needs to disable 3DCGCG in driver to avoid compute hang. Signed-off-by: Changfeng <Changfeng.Zhu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-19drm/amdgpu: Fix GPU TLB update error when PAGE_SIZE > AMDGPU_PAGE_SIZEYi Li
When PAGE_SIZE is larger than AMDGPU_PAGE_SIZE, the number of GPU TLB entries which need to update in amdgpu_map_buffer() should be multiplied by AMDGPU_GPU_PAGES_IN_CPU_PAGE (PAGE_SIZE / AMDGPU_PAGE_SIZE). Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Yi Li <liyi@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-13drm/amdgpu: update vcn1.0 Non-DPG suspend sequenceSathishkumar S
update suspend register settings in Non-DPG mode. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-13drm/amdgpu: set vcn mgcg flag for picassoSathishkumar S
enable vcn mgcg flag for picasso. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-13drm/amdgpu: update the method for harvest IP for specific SKULikun Gao
Update the method of disabling VCN IP for specific SKU for navi1x ASIC, it will judge whether should add the related IP at the function of amdgpu_device_ip_block_add(). Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-13drm/amdgpu: add judgement when add ip blocks (v2)Likun GAO
Judgement whether to add an sw ip according to the harvest info. v2: fix indentation (Alex) Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-07Merge tag 'amd-drm-fixes-5.13-2021-05-05' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-fixes-5.13-2021-05-05: amdgpu: - MPO hang workaround - Fix for concurrent VM flushes on vega/navi - dcefclk is not adjustable on navi1x and newer - MST HPD debugfs fix - Suspend/resumes fixes - Register VGA clients late in case driver fails to load - Fix GEM leak in user framebuffer create - Add support for polaris12 with 32 bit memory interface - Fix duplicate cursor issue when using overlay - Fix corruption with tiled surfaces on VCN3 - Add BO size and stride check to fix BO size verification radeon: - Fix off-by-one in power state parsing - Fix possible memory leak in power state parsing Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210506033929.3875-1-alexander.deucher@amd.com
2021-05-05drm/amdgpu: Use device specific BO size & stride check.Bas Nieuwenhuizen
The builtin size check isn't really the right thing for AMD modifiers due to a couple of reasons: 1) In the format structs we don't do set any of the tilesize / blocks etc. to avoid having format arrays per modifier/GPU 2) The pitch on the main plane is pixel_pitch * bytes_per_pixel even for tiled ... 3) The pitch for the DCC planes is really the pixel pitch of the main surface that would be covered by it ... Note that we only handle GFX9+ case but we do this after converting the implicit modifier to an explicit modifier, so on GFX9+ all framebuffers should be checked here. There is a TODO about DCC alignment, but it isn't worse than before and I'd need to dig a bunch into the specifics. Getting this out in a reasonable timeframe to make sure it gets the appropriate testing seemed more important. Finally as I've found that debugging addfb2 failures is a pita I was generous adding explicit error messages to every failure case. Fixes: f258907fdd83 ("drm/amdgpu: Verify bo size can fit framebuffer size on init.") Tested-by: Simon Ser <contact@emersion.fr> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-05drm/amdgpu: Init GFX10_ADDR_CONFIG for VCN v3 in DPG mode.Bas Nieuwenhuizen
Otherwise tiling modes that require the values form this field (In particular _*_X) would be corrupted upon video decode. Copied from the VCN v2 code. Fixes: 99541f392b4d ("drm/amdgpu: add mc resume DPG mode for VCN3.0") Reviewed-and-Tested by: Leo Liu <leo.liu@amd.com> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-05-04drm/amdgpu: add new MC firmware for Polaris12 32bit ASICEvan Quan
Polaris12 32bit ASIC needs a special MC firmware. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-04-29amdgpu: fix GEM obj leak in amdgpu_display_user_framebuffer_createSimon Ser
This error code-path is missing a drm_gem_object_put call. Other error code-paths are fine. Signed-off-by: Simon Ser <contact@emersion.fr> Fixes: 1769152ac64b ("drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Harry Wentland <hwentlan@amd.com> Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-04-29drm/amdgpu: Register VGA clients after init can no longer failKai-Heng Feng
When an amdgpu device fails to init, it makes another VGA device cause kernel splat: kernel: amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed kernel: amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init kernel: amdgpu: probe of 0000:08:00.0 failed with error -110 ... kernel: amdgpu 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 0 P4D 0 kernel: Oops: 0000 [#1] SMP NOPTI kernel: CPU: 6 PID: 1080 Comm: Xorg Tainted: G W 5.12.0-rc8+ #12 kernel: Hardware name: HP HP EliteDesk 805 G6/872B, BIOS S09 Ver. 02.02.00 12/30/2020 kernel: RIP: 0010:amdgpu_device_vga_set_decode+0x13/0x30 [amdgpu] kernel: Code: 06 31 c0 c3 b8 ea ff ff ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 8b 87 90 06 00 00 48 89 e5 53 89 f3 <48> 8b 40 18 40 0f b6 f6 e8 40 58 39 fd 80 fb 01 5b 5d 19 c0 83 e0 kernel: RSP: 0018:ffffae3c0246bd68 EFLAGS: 00010002 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 kernel: RDX: ffff8dd1af5a8560 RSI: 0000000000000000 RDI: ffff8dce8c160000 kernel: RBP: ffffae3c0246bd70 R08: ffff8dd1af5985c0 R09: ffffae3c0246ba38 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000246 kernel: R13: 0000000000000000 R14: 0000000000000003 R15: ffff8dce81490000 kernel: FS: 00007f9303d8fa40(0000) GS:ffff8dd1af580000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000018 CR3: 0000000103cfa000 CR4: 0000000000350ee0 kernel: Call Trace: kernel: vga_arbiter_notify_clients.part.0+0x4a/0x80 kernel: vga_get+0x17f/0x1c0 kernel: vga_arb_write+0x121/0x6a0 kernel: ? apparmor_file_permission+0x1c/0x20 kernel: ? security_file_permission+0x30/0x180 kernel: vfs_write+0xca/0x280 kernel: ksys_write+0x67/0xe0 kernel: __x64_sys_write+0x1a/0x20 kernel: do_syscall_64+0x38/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f93041e02f7 kernel: Code: 75 05 48 83 c4 58 c3 e8 f7 33 ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 kernel: RSP: 002b:00007fff60e49b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f93041e02f7 kernel: RDX: 000000000000000b RSI: 00007fff60e49b40 RDI: 000000000000000f kernel: RBP: 00007fff60e49b40 R08: 00000000ffffffff R09: 00007fff60e499d0 kernel: R10: 00007f93049350b5 R11: 0000000000000246 R12: 000056111d45e808 kernel: R13: 0000000000000000 R14: 000056111d45e7f8 R15: 000056111d46c980 kernel: Modules linked in: nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq input_leds snd_seq_device snd_timer snd soundcore joydev kvm_amd serio_raw k10temp mac_hid hp_wmi ccp kvm sparse_keymap wmi_bmof ucsi_acpi efi_pstore typec_ucsi rapl typec video wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel cec rc_core aesni_intel crypto_simd psmouse cryptd r8169 i2c_piix4 drm ahci xhci_pci realtek libahci xhci_pci_renesas gpio_amdpt gpio_generic kernel: CR2: 0000000000000018 kernel: ---[ end trace 76d04313d4214c51 ]--- Commit 4192f7b57689 ("drm/amdgpu: unmap register bar on device init failure") makes amdgpu_driver_unload_kms() skips amdgpu_device_fini(), so the VGA clients remain registered. So when vga_arbiter_notify_clients() iterates over registered clients, it causes NULL pointer dereference. Since there's no reason to register VGA clients that early, so solve the issue by putting them after all the goto cleanups. v2: - Remove redundant vga_switcheroo cleanup in failed: label. Fixes: 4192f7b57689 ("drm/amdgpu: unmap register bar on device init failure") Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-29drm/amdgpu: Handling of amdgpu_device_resume return value for graceful teardownPavan Kumar Ramayanam
The runtime resume PM op disregards the return value from amdgpu_device_resume(), masking errors for failed resumes at the PM layer. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Pavan Kumar Ramayanam <pavan.ramayanam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-29drm/amdgpu: fix r initial valuesVictor Zhao
Sriov gets suspend of IP block <dce_virtual> failed as return value was not initialized. v2: return 0 directly to align original code semantic before this was broken out into a separate helper function instead of setting initial values Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-04-28drm/amdgpu: fix concurrent VM flushes on Vega/Navi v2Christian König
Starting with Vega the hardware supports concurrent flushes of VMID which can be used to implement per process VMID allocation. But concurrent flushes are mutual exclusive with back to back VMID allocations, fix this to avoid a VMID used in two ways at the same time. v2: don't set ring to NULL Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Tested-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-04-20drm/amdgpu/gmc9: remove dummy read workaround for newer chipsAlex Deucher
Aldebaran has a hw fix so no longer requires the workaround. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: Add mem sync flag for IB allocated by SAJinzhou Su
The buffer of SA bo will be used by many cases. So it's better to invalidate the cache of indirect buffer allocated by SA before commit the IB. Signed-off-by: Jinzhou Su <Jinzhou.Su@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: Fix SDMA RAS error reporting on AldebaranMukul Joshi
Fix the following issues with SDMA RAS error reporting: 1. Read the EDC_COUNTER2 register also to fetch error counts for all sub-blocks in SDMA. 2. SDMA RAS on Aldebaran suports single-bit uncorrectable errors only. So, report error count in UE count instead of CE count. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-By: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: Reset RAS error count and status regsMukul Joshi
Reset the RAS error count and error status registers after reading to prevent over reporting error counts on Aldebaran. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-By: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20Revert "drm/amdgpu: workaround the TMR MC address issue (v2)"Oak Zeng
This reverts commit 2f055097daef498da57552f422f49de50a1573e6. 2f055097daef498da57552f422f49de50a1573e6 was a driver workaround when PSP firmware was not ready. Now the PSP fw is ready so we revert this driver workaround. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: fix GCR_GENERAL_CNTL offset for dimgrey_cavefishJiansong Chen
dimgrey_cavefish has similar gc_10_3 ip with sienna_cichlid, so follow its registers offset setting. Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: resolve erroneous gfx_v9_4_2 printsJohn Clements
resolve bug on aldebaran where gfx error counts will print on driver load when there are no errors present Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: fix a error injection failed issueDennis Li
because "sscanf(str, "retire_page")" always return 0, if application use the raw data for error injection, it always wrongly falls into "op == 3". Change to use strstr instead. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: only harvest gcea/mmea error status in aldebaranHawking Zhang
In aldebaran, driver only needs to harvest SDP RdRspStatus, WrRspStatus and first parity error on RdRsp data. Check error type before harvest error information. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: only harvest gcea/mmea error status in arcturusHawking Zhang
SDP RdRspStatus/WrRspStatus or first parity error on RdRsp data can cause system fatal error in arcturus. GPU will be freezed in such case. Driver needs to harvest these error information before reset the GPU. Check error type to avoid harvest normal gcea/mmea information. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: enable tmz on renoir asicsHuang Rui
The tmz functions are verified on renoir chips as well. So enable it by default. Signed-off-by: Huang Rui <ray.huang@amd.com> Tested-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20drm/amdgpu: correct default gfx wdt timeout settingHawking Zhang
When gfx wdt was configured to fatal_disable, the timeout period should be configured to 0x0 (timeout disabled) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>