pm24.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2024-08-23	drm/amdkfd: Change kfd/svm page fault drain handling	Xiaogang Chen
	When app unmap vm ranges(munmap) kfd/svm starts drain pending page fault and not handle any incoming pages fault of this process until a deferred work item got executed by default system wq. The time period of "not handle page fault" can be long and is unpredicable. That is advese to kfd performance on page faults recovery. This patch uses time stamp of incoming page fault to decide to drop or recover page fault. When app unmap vm ranges kfd records each gpu device's ih ring current time stamp. These time stamps are used at kfd page fault recovery routine. Any page fault happened on unmapped ranges after unmap events is application bug that accesses vm range after unmap. It is not driver work to cover that. By using time stamp of page fault do not need drain page faults at deferred work. So, the time period that kfd does not handle page faults is reduced and can be controlled. Signed-off-by: Xiaogang.Chen <Xiaogang.Chen@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14	drm/amdgpu: add additional VM bits	Alex Deucher
	Add additional VM PTE bits. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05	drm/amdgpu: Update the impelmentation of AMDGPU_PTE_MTYPE_VG10	Shane Xiao
	This patch changes the implementation of AMDGPU_PTE_MTYPE_VG10, clear the bits before setting the new one. Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: longlyao <Longlong.Yao@amd.com> Signed-off-by: Shane Xiao <shane.xiao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05	drm/amdgpu: Update the impelmentation of AMDGPU_PTE_MTYPE_NV10	Shane Xiao
	This patch changes the implementation of AMDGPU_PTE_MTYPE_NV10, clear the bits before setting the new one. Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: longlyao <Longlong.Yao@amd.com> Signed-off-by: Shane Xiao <shane.xiao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05	drm/amdgpu: Update the impelmentation of AMDGPU_PTE_MTYPE_GFX12	Shane Xiao
	This patch changes the implementation of AMDGPU_PTE_MTYPE_GFX12, clear the bits before setting the new one. This fixed the potential issue that GFX12 setting memory to NC. v2: Clear mtype field before setting the new one (Alex) v3: Fix typo (Felix) Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: longlyao <Longlong.Yao@amd.com> Signed-off-by: Shane Xiao <shane.xiao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17	drm/amdgpu: Add amdgpu_bo_is_vm_bo helper	Tvrtko Ursulin
	Help code readability by replacing a bunch of: bo->tbo.base.resv == vm->root.bo->tbo.base.resv With: amdgpu_vm_is_bo_always_valid(vm, bo) No functional changes. v2: * Rename helper and move to amdgpu_vm. (Christian) v3: * Use Christian's kerneldoc. v4: * Fixed logic inversion in amdgpu_vm_bo_get_memory. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30	drm/amdgpu: support gfx v12 specific pte/pde fields	Hawking Zhang
	Add gfx v12 pte/pde support to gmc common helper. v2: squash in fixes (Alex) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30	drm/amdgpu: Add gfx v12 pte/pde format change	Hawking Zhang
	Add gfx v12 pte/pde format change. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22	drm/amdgpu: sync page table freeing with tlb flush	Shashank Sharma
	The idea behind this patch is to delay the freeing of PT entry objects until the TLB flush is done. This patch: - Adds a tlb_flush_waitlist in amdgpu_vm_update_params which will keep the objects that need to be freed after tlb_flush. - Adds PT entries in this list in amdgpu_vm_ptes_update after finding the PT entry. - Changes functionality of amdgpu_vm_pt_free_dfs from (df_search + free) to simply freeing of the BOs, also renames it to amdgpu_vm_pt_free_list to reflect this same. - Exports function amdgpu_vm_pt_free_list to be called directly. - Calls amdgpu_vm_pt_free_list directly from amdgpu_vm_update_range. V2: rebase V4: Addressed review comments from Christian - add only locked PTEs entries in TLB flush waitlist. - do not create a separate function for list flush. - do not create a new lock for TLB flush. - there is no need to wait on tlb_flush_fence exclusively. V5: Addressed review comments from Christian - change the amdgpu_vm_pt_free_dfs's functionality to simple freeing of the objects and rename it. - add all the PTE objects in params->tlb_flush_waitlist - let amdgpu_vm_pt_free_root handle the freeing of BOs independently - call amdgpu_vm_pt_free_list directly V6: Rebase V7: Rebase V8: Added a NULL check to fix this backtrace issue: [ 415.351447] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 415.359245] #PF: supervisor write access in kernel mode [ 415.365081] #PF: error_code(0x0002) - not-present page [ 415.370817] PGD 101259067 P4D 101259067 PUD 10125a067 PMD 0 [ 415.377140] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 415.382004] CPU: 0 PID: 25481 Comm: test_with_MPI.e Tainted: G OE 5.18.2-mi300-build-140423-ubuntu-22.04+ #24 [ 415.394437] Hardware name: AMD Corporation Sh51p/Sh51p, BIOS RMO1001AS 02/21/2024 [ 415.402797] RIP: 0010:amdgpu_vm_ptes_update+0x6fd/0xa10 [amdgpu] [ 415.409648] Code: 4c 89 ff 4d 8d 66 30 e8 f1 ed ff ff 48 85 db 74 42 48 39 5d a0 74 40 48 8b 53 20 48 8b 4b 18 48 8d 43 18 48 8d 75 b0 4c 89 ff <48 > 89 51 08 48 89 0a 49 8b 56 30 48 89 42 08 48 89 53 18 4c 89 63 [ 415.430621] RSP: 0018:ffffc9000401f990 EFLAGS: 00010287 [ 415.436456] RAX: ffff888147bb82f0 RBX: ffff888147bb82d8 RCX: 0000000000000000 [ 415.444426] RDX: 0000000000000000 RSI: ffffc9000401fa30 RDI: ffff888161f80000 [ 415.452397] RBP: ffffc9000401fa80 R08: 0000000000000000 R09: ffffc9000401fa00 [ 415.460368] R10: 00000007f0cc0000 R11: 00000007f0c85000 R12: ffffc9000401fb20 [ 415.468340] R13: 00000007f0d00000 R14: ffffc9000401faf0 R15: ffff888161f80000 [ 415.476312] FS: 00007f132ff89840(0000) GS:ffff889f87c00000(0000) knlGS:0000000000000000 [ 415.485350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 415.491767] CR2: 0000000000000008 CR3: 0000000161d46003 CR4: 0000000000770ef0 [ 415.499738] PKRU: 55555554 [ 415.502750] Call Trace: [ 415.505482] <TASK> [ 415.507825] amdgpu_vm_update_range+0x32a/0x880 [amdgpu] [ 415.513869] amdgpu_vm_clear_freed+0x117/0x250 [amdgpu] [ 415.519814] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x18c/0x250 [amdgpu] [ 415.527729] kfd_ioctl_unmap_memory_from_gpu+0xed/0x340 [amdgpu] [ 415.534551] kfd_ioctl+0x3b6/0x510 [amdgpu] V9: Addressed review comments from Christian - No NULL check reqd for root PT freeing - Free PT list regardless of needs_flush - Move adding BOs in list in a separate function V10: Added Christian's RB V11: squash in list fix Cc: Christian König <Christian.Koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Reviewed-by: Christian König <Christian.Koenig@amd.com> Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20	drm/amdgpu: implement TLB flush fence	Christian Koenig
	The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out by invalidating the TLB before the PT can be freed. Solve this by moving the TLB flush into a DMA-fence object which can be used to delay the freeing of the PT BOs until it is signaled. V2: (Shashank) - rebase - set dma_fence_error only in case of error - add tlb_flush fence only when PT/PD BO is locked (Felix) - use vm->pasid when f is NULL (Mukul) V4: - add a wait for (f->dependency) in tlb_fence_work (Christian) - move the misplaced fence_create call to the end (Philip) V5: - free the f->dependency properly V6: (Shashank) - light code movement, moved all the clean-up in previous patch - introduce params.needs_flush and its usage in this patch - rebase without TLB HW sequence patch V7: - Keep the vm->last_update_fence and tlb_cb code until we can fix the HW sequencing (Christian) - Move all the tlb_fence related code in a separate function so that its easier to read and review V9: Addressed review comments from Christian - start PT update only when we have callback memory allocated V10: - handle device unlock in OOM case (Christian, Mukul) - added Christian's R-B Cc: Christian Koenig <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Reviewed-by: Shashank Sharma <shashank.sharma@amd.com> Reviewed-by: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20	drm/amdgpu: add recent pagefault info in vm_manager	Sunil Khatri
	Currently page fault information is stored per vm and which could be freed or stale during reset. Add it pagefault information in the vm_manager which is a global space for vm's and remains valid across. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-06	drm/amdgpu: remove unused code	Jesse Zhang
	Remove the unused function - amdgpu_vm_pt_is_root_clean and remove the impossible condition v1: entries == 0 is not possible any more, so this condition could probably be removed (Felix) Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Suggested-by：Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04	drm/amdgpu: change vm->task_info handling	Shashank Sharma
	This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm. This patch also does logistical changes required for existing usage of vm->task_info. V2: Do not block all the prints when task_info not found (Felix) V3: Fixed review comments from Felix - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini() V4: Fixed review comments from Felix - fix double reference increment in create_task_info - change amdgpu_vm_get_task_info_pasid - additional changes in amdgpu_gem.c while porting Cc: Christian Koenig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04	Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for ↵	Jesse Zhang
	Raven fix the issue: "amdgpu: Failed to create process VM object". [Why]when amdgpu initialized, seq64 do mampping and update bo mapping in vm page table. But when clifo run. It also initializes a vm for a process device through the function kfd_process_device_init_vm and ensure the root PD is clean through the function amdgpu_vm_pt_is_root_clean. So they have a conflict, and clinfo always failed. v1: - remove all the pte_supports_ats stuff from the amdgpu_vm code (Felix) Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-16	drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole	Felix Kuehling
	The TBA and TMA, along with an unused IB allocation, reside at low addresses in the VM address space. A stray VM fault which hits these pages must be serviced by making their page table entries invalid. The scheduler depends upon these pages being resident and fails, preventing a debugger from inspecting the failure state. By relocating these pages above 47 bits in the VM address space they can only be reached when bits [63:48] are set to 1. This makes it much less likely for a misbehaving program to generate accesses to them. The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL access with a small offset. v2: - Move it to the reserved space to avoid concflicts with Mesa - Add macros to make reserved space management easier v3: - Move VM max PFN calculation into AMDGPU_VA_RESERVED macros Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-14	drm/amdgpu: Reduce VA_RESERVED_BOTTOM to 64KB	Felix Kuehling
	The reservation is there to catch NULL pointer dereferences from the GPU. Reduce the size to 64KB to make sure that shared virtual address programming models can map all CPU-accessible virtual addresses for GPU access. This is also the default for CPU virtual address mappings as seen in /proc/sys/vm/mmap_min_addr. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22	drm/amdgpu: Enable seq64 manager and fix bugs	Arunpravin Paneer Selvam
	- Enable the seq64 mapping sequence. - Fix wflinfo va conflict and other bugs. v1: - The seq64 area needs to be included in the AMDGPU_VA_RESERVED_SIZE otherwise the areas will conflict with user space allocations (Alex) - It needs to be mapped read only in the user VM (Alex) v2: - Instead of just one define for TOP/BOTTOM reserved space separate them into two (Christian) - Fix the CPU and VA calculations and while at it also cleanup error handling and kerneldoc (Christian) Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
2024-01-15	drm/amdgpu: Auto-validate DMABuf imports in compute VMs	Felix Kuehling
	DMABuf imports in compute VMs are not wrapped in a kgd_mem object on the process_info->kfd_bo_list. There is no explicit KFD API call to validate them or add eviction fences to them. This patch automatically validates and fences dymanic DMABuf imports when they are added to a compute VM. Revalidation after evictions is handled in the VM code. v2: * Renamed amdgpu_vm_validate_evicted_bos to amdgpu_vm_validate * Eliminated evicted_user state, use evicted state for VM BOs and user BOs * Fixed and simplified amdgpu_vm_fence_imports, depends on reserved BOs * Moved dma_resv_reserve_fences for amdgpu_vm_fence_imports into amdgpu_vm_validate, outside the vm->status_lock * Added dummy version of amdgpu_amdkfd_bo_validate_and_fence for builds without KFD v4: Eliminate amdgpu_vm_fence_imports. It's not needed because the reservation with its fences is shared with the export, as long as all imports are from KFD, with the exports already reserved, validated and fenced by the KFD restore worker. v5: Reintroduced separate evicted_user state to simplify the state machine and CS error handling when amdgpu_vm_validate is called without a ticket. Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09	drm/amdgpu: make a correction on comment	James Zhu
	Use a generic comment for AMDGPU_VM_RESERVED_VRAM size. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17	drm/amdkfd: Move TLB flushing logic into amdgpu	Felix Kuehling
	This will make it possible for amdgpu GEM ioctls to flush TLBs on compute VMs. This removes VMID-based TLB flushing and always uses PASID-based flushing. This still works because it scans the VMID-PASID mapping registers to find the right VMID. It's only slightly less efficient. This is not a production use case. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03	drm/amdkfd: Improve amdgpu_vm_handle_moved	Felix Kuehling
	Let amdgpu_vm_handle_moved update all BO VA mappings of BOs reserved by the caller. This will be useful for handling extra BO VA mappings in KFD VMs that are managed through the render node API. v2: rebase against drm_exec changes (Alex) Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27	drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems	David Francis
	On gfx943 APU, EXT_COHERENT should give MTYPE_CC for local and MTYPE_UC for nonlocal memory. On NUMA systems, local memory gets the local mtype, set by an override callback. If EXT_COHERENT is set, memory will be set as MTYPE_UC by default, with local memory MTYPE_CC. Add an option in the override function for this case, and add a check to ensure it is not used on UNCACHED memory. V2: Combined APU and NUMA code into one patch V3: Fixed a potential nullptr in amdgpu_vm_bo_update Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-05	drm/amdgpu: add new INFO ioctl query for the last GPU page fault	Alex Deucher
	Add a interface to query the last GPU page fault for the process. Useful for debugging context lost errors. v2: split vmhub representation between kernel and userspace v3: add locking when fetching fault info in INFO IOCTL Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23238 libdrm MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23238 Cc: samuel.pitoiset@gmail.com Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-04	drm/amdgpu: add cached GPU fault structure to vm struct	Alex Deucher
	When we get a GPU page fault, cache the fault for later analysis. Cc: samuel.pitoiset@gmail.com Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-04	Merge tag 'amd-drm-next-6.6-2023-07-28' of ↵	Daniel Vetter
	https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.6-2023-07-28: amdgpu: - Lots of checkpatch cleanups - GFX 9.4.3 updates - Add USB PD and IFWI flashing documentation - GPUVM updates - RAS fixes - DRR fixes - FAMS fixes - Virtual display fixes - Soft IH fixes - SMU13 fixes - Rework PSP firmware loading for other IPs - Kernel doc fixes - DCN 3.0.1 fixes - LTTPR fixes - DP MST fixes - DCN 3.1.6 fixes - SubVP fixes - Display bandwidth calculation fixes - VCN4 secure submission fixes - Allow building DC on RISC-V - Add visible FB info to bo_print_info - HBR3 fixes - Add PSP 14.0 support - GFX9 MCBP fix - GMC10 vmhub index fix - GMC11 vmhub index fix - Create a new doorbell manager - SR-IOV fixes amdkfd: - Cleanup CRIU dma-buf handling - Use KIQ to unmap HIQ - GFX 9.4.3 debugger updates - GFX 9.4.2 debugger fixes - Enable cooperative groups fof gfx11 - SVM fixes radeon: - Lots of checkpatch cleanups Merge conflicts: - drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c The switch to drm eu helpers in 8a206685d36f ("drm/amdgpu: use drm_exec for GEM and CSA handling v2") clashed with the cosmetic cleanups from 30953c4d000b ("drm/amdgpu: Fix style issues in amdgpu_gem.c"). I kept the former since the cleanup up code is gone. - drivers/gpu/drm/amd/amdgpu/atom.c. adf64e214280 ("drm/amd: Avoid reading the VBIOS part number twice") removed code that 992b8fe106ab ("drm/radeon: Replace all non-returning strlcpy with strscpy") polished. From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230728214228.8102-1-alexander.deucher@amd.com [sima: some merge conflict wrangling as noted] Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
2023-07-18	drm/amdgpu: fix slab-out-of-bounds issue in amdgpu_vm_pt_create	Guchun Chen
	Recent code set xcp_id stored from file private data when opening device to amdgpu bo for accounting memory usage etc, but not all VMs are attached to this fpriv structure like the vm cases in amdgpu_mes_self_test, otherwise, KASAN will complain below out of bound access. And more importantly, VM code should not touch fpriv structure, so drop fpriv code handling from amdgpu_vm_pt. [ 77.292314] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.293845] Read of size 4 at addr ffff888102c48a48 by task modprobe/1069 [ 77.294146] Call Trace: [ 77.294178] <TASK> [ 77.294208] dump_stack_lvl+0x49/0x63 [ 77.294260] print_report+0x16f/0x4a6 [ 77.294307] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.295979] ? kasan_complete_mode_report_info+0x3c/0x200 [ 77.296057] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.297556] kasan_report+0xb4/0x130 [ 77.297609] ? amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.299202] __asan_load4+0x6f/0x90 [ 77.299272] amdgpu_vm_pt_create+0x17e/0x4b0 [amdgpu] [ 77.300796] ? amdgpu_init+0x6e/0x1000 [amdgpu] [ 77.302222] ? amdgpu_vm_pt_clear+0x750/0x750 [amdgpu] [ 77.303721] ? preempt_count_sub+0x18/0xc0 [ 77.303786] amdgpu_vm_init+0x39e/0x870 [amdgpu] [ 77.305186] ? amdgpu_vm_wait_idle+0x90/0x90 [amdgpu] [ 77.306683] ? kasan_set_track+0x25/0x30 [ 77.306737] ? kasan_save_alloc_info+0x1b/0x30 [ 77.306795] ? __kasan_kmalloc+0x87/0xa0 [ 77.306852] amdgpu_mes_self_test+0x169/0x620 [amdgpu] v2: without specifying xcp partition for PD/PT bo, the xcp id is -1. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2686 Fixes: 3ebfd221c1a8 ("drm/amdkfd: Store xcp partition id to amdgpu bo") Signed-off-by: Guchun Chen <guchun.chen@amd.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-12	drm/amdgpu: use the new drm_exec object for CS v3	Christian König
	Use the new component here as well and remove the old handling. v2: drop dupplicate handling v3: fix memory leak pointed out by Tatsuyuki Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711133122.3710-7-christian.koenig@amd.com
2023-07-12	drm/amdkfd: switch over to using drm_exec v3	Christian König
	Avoids quite a bit of logic and kmalloc overhead. v2: fix multiple problems pointed out by Felix v3: two more nit picks from Felix fixed Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711133122.3710-4-christian.koenig@amd.com
2023-07-07	drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm	Xiaogang Chen
	When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and kmap them to kernel virtual address space before kfd updates the vm that was created by gfx. Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-07	drm/amdgpu: Update invalid PTE flag setting	Mukul Joshi
	Update the invalid PTE flag setting with TF enabled. This is to ensure, in addition to transitioning the retry fault to a no-retry fault, it also causes the wavefront to enter the trap handler. With the current setting, the fault only transitions to a no-retry fault. Additionally, have 2 sets of invalid PTE settings, one for TF enabled, the other for TF disabled. The setting with TF disabled, doesn't work with TF enabled. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15	drm/amdgpu: add VM generation token	Christian König
	Instead of using the VRAM lost counter add a 64bit token which indicates if a context or job is still valid to use. Should the VRAM be lost or the page tables need re-creation the token will change indicating that userspace needs to act and re-create the contexts and re-submit the work. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09	drm/amdgpu: Add memory partition id to amdgpu_vm	Philip Yang
	If xcp_mgr is initialized, add mem_id to amdgpu_vm structure to store memory partition number when creating amdgpu_vm for the xcp. The xcp number is decided when opening the render device, for example /dev/dri/renderD129 is xcp_id 0, /dev/dri/renderD130 is xcp_id 1. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09	drm/amdkfd: Update interrupt handling for GFX9.4.3	Mukul Joshi
	Update interrupt handling in CPX mode for GFX9.4.3 by using the VMID space instead of SDMA client id to determine if an interrupt should be processed by a KFD node. This is especially needed for handling retry faults from MMHUB. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09	drm/amdkfd: pass kfd_node ref to svm migration api	Alex Sierra
	This work is required for GC 9.4.3, previous to support memory partitions per node at SVM. When multiple partition is configured, every BO should be allocated inside one specific partition which corresponds to the current amdgpu_device and kfd_node. v2: squash in compilation fix (Alex) v3: squash in fix for pre-gfx 9.4.3 (Alex) v4: squash in best_loc fix (Alex) Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09	drm/amdgpu: introduce vmhub definition for multi-partition cases (v3)	Hawking Zhang
	v1: Each partition has its own gfxhub or mmhub. adjust the num of MAX_VMHUBS and the GFXHUB/MMHUB layout (Le) v2: re-design the AMDGPU_GFXHUB/AMDGPU_MMHUB layout (Le) v3: apply the gfxhub/mmhub layout to new IPs (Hawking) v4: fix up gmc11 (Alex) v5: rebase (Alex) Signed-off-by: Le Ma <le.ma@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-13	drm/amdgpu: expose more memory stats in fdinfo	Marek Olšák
	This will be used for performance investigations. Signed-off-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-04	Merge tag 'drm-misc-next-2023-01-03' of ↵	Daniel Vetter
	git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v6.3: UAPI Changes: * connector: Support analog-TV mode property * media: Add MEDIA_BUS_FMT_RGB565_1X24_CPADHI, MEDIA_BUS_FMT_RGB666_1X18 and MEDIA_BUS_FMT_RGB666_1X24_CPADHI Cross-subsystem Changes: * dma-buf: Documentation fixes * i2c: Introduce i2c_client_get_device_id() helper Core Changes: * Improve support for analog TV output * bridge: Remove unused drm_bridge_chain functions * debugfs: Add per-device helpers and convert various DRM drivers * dp-mst: Various fixes * fbdev emulation: Always pick 32 bpp as default * KUnit: Add tests for managed helpers; Various cleanups * panel-orientation: Add quirks for Lenovo Yoga Tab 3 X90F and DynaBook K50 * TTM: Open-code ttm_bo_wait() and remove the helper Driver Changes: * Fix preferred depth and bpp values throughout DRM drivers * Remove #CONFIG_PM guards throughout DRM drivers * ast: Various fixes * bridge: Implement i2c's probe_new in various drivers; Fixes; ite-it6505: Locking fixes, Cache EDID data; ite-it66121: Support IT6610 chip, Cleanups; lontium-tl9611: Fix HDMI on DragonBoard 845c; parade-ps8640: Use atomic bridge functions * gud: Convert to DRM shadow-plane helpers; Perform flushing synchronously during atomic update * ili9486: Support 16-bit pixel data * imx: Split off IPUv3 driver; Various fixes * mipi-dbi: Convert to DRM shadow-plane helpers plus rsp driver changes;i Support separate I/O-voltage supply * mxsfb: Depend on ARCH_MXS or ARCH_MXC * omapdrm: Various fixes * panel: Use ktime_get_boottime() to measure power-down delay in various drivers; Fix auto-suspend delay in various drivers; orisetech-ota5601a: Add support * sprd: Cleanups * sun4i: Convert to new TV-mode property * tidss: Various fixes * v3d: Various fixes * vc4: Convert to new TV-mode property; Support Kunit tests; Cleanups; dpi: Support RGB565 and RGB666 formats; dsi: Convert DSI driver to bridge * virtio: Improve tracing * vkms: Support small cursors in IGT tests; Various fixes Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/Y7QIwlfElAYWxRcR@linux-uq9g
2022-12-14	drm/amdgpu: rework reserved VMID handling	Christian König
	Instead of reserving a VMID for a single process allow that many processes use the reserved ID. This allows for proper isolation between the processes. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-06	drm/ttm: merge ttm_bo_api.h and ttm_bo_driver.h v2	Christian König
	Merge and cleanup the two headers into a single description of the object API. Also move all the documentation to the implementation and drop unnecessary includes from the header. No functional change. v2: minimal checkpatch.pl cleanup Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-4-christian.koenig@amd.com
2022-11-09	drm/amdgpu: Drop eviction lock when allocating PT BO	Philip Yang
	Re-take the eviction lock immediately again after the allocation is completed, to fix circular locking warning with drm_buddy allocator. Move amdgpu_vm_eviction_lock/unlock/trylock to amdgpu_vm.h as they are called from multiple files. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-09	drm/amdgpu: workaround for TLB seq race	Christian König
	It can happen that we query the sequence value before the callback had a chance to run. Workaround that by grabbing the fence lock and releasing it again. Should be replaced by hw handling soon. Signed-off-by: Christian König <christian.koenig@amd.com> CC: stable@vger.kernel.org # 5.19+ Fixes: 5255e146c99a6 ("drm/amdgpu: rework TLB flushing") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2113 Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Stefan Springer <stefanspr94@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-21	drm/amdgpu: Fix amdgpu_vm_pt_free warning	Philip Yang
	Free page table BO from vm resv unlocked context generate below warnings. Add a pt_free_work in vm to free page table BO from vm->pt_freed list. pass vm resv unlock status from page table update caller, and add vm_bo entry to vm->pt_freed list and schedule the pt_free_work if calling with vm resv unlocked. WARNING: CPU: 12 PID: 3238 at drivers/gpu/drm/ttm/ttm_bo.c:106 ttm_bo_set_bulk_move+0xa1/0xc0 Call Trace: amdgpu_vm_pt_free+0x42/0xd0 [amdgpu] amdgpu_vm_pt_free_dfs+0xb3/0xf0 [amdgpu] amdgpu_vm_ptes_update+0x52d/0x850 [amdgpu] amdgpu_vm_update_range+0x2a6/0x640 [amdgpu] svm_range_unmap_from_gpus+0x110/0x300 [amdgpu] svm_range_cpu_invalidate_pagetables+0x535/0x600 [amdgpu] __mmu_notifier_invalidate_range_start+0x1cd/0x230 unmap_vmas+0x9d/0x140 unmap_region+0xa8/0x110 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-21	drm/amdgpu: Rename vm invalidate lock to status_lock	Philip Yang
	The vm status_lock will be used to protect all vm status lists. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28	Merge tag 'amd-drm-next-5.19-2022-04-15' of ↵	Dave Airlie
	https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-5.19-2022-04-15: amdgpu: - USB-C updates - GPUVM updates - TMZ fixes for RV - DCN 3.1 pstate fixes - Display z state fixes - RAS fixes - Misc code cleanups and spelling fixes - More DC FP rework - GPUVM TLB handling rework - Power management sysfs code cleanup - Add RAS support for VCN - Backlight fix - Add unique id support for more asics - Misc display updates - SR-IOV fixes - Extend CG and PG flags to 64 bits - Enable VCN clk sysfs nodes for navi12 amdkfd: - Fix IO link cleanup during device removal - RAS fixes - Retry fault fixes - Asynchronously free events - SVM fixes radeon: - Drop some dead code - Misc code cleanups From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220415135144.5700-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-04-14	drm/amdgpu: Fix one use-after-free of VM	xinhui pan
	VM might already be freed when amdgpu_vm_tlb_seq_cb() is called. We see the calltrace below. Fix it by keeping the last flush fence around and wait for it to signal BUG kmalloc-4k (Not tainted): Poison overwritten 0xffff9c88630414e8-0xffff9c88630414e8 @offset=5352. First byte 0x6c instead of 0x6b Allocated in amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] age=44 cpu=0 pid=2343 __slab_alloc.isra.0+0x4f/0x90 kmem_cache_alloc_trace+0x6b8/0x7a0 amdgpu_driver_open_kms+0x9d/0x360 [amdgpu] drm_file_alloc+0x222/0x3e0 [drm] drm_open+0x11d/0x410 [drm] Freed in amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] age=22 cpu=1 pid=2485 kfree+0x4a2/0x580 amdgpu_driver_postclose_kms+0x3e9/0x550 [amdgpu] drm_file_free+0x24e/0x3c0 [drm] drm_close_helper.isra.0+0x90/0xb0 [drm] drm_release+0x97/0x1a0 [drm] __fput+0xb6/0x280 ____fput+0xe/0x10 task_work_run+0x64/0xb0 Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-05	drm/amdgpu: fix TLB flushing during eviction	Christian König
	Testing the valid bit is not enough to figure out if we need to invalidate the TLB or not. During eviction it is quite likely that we move a BO from VRAM to GTT and update the page tables immediately to the new GTT address. Rework the whole function to get all the necessary parameters directly as value. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-05	Merge drm/drm-next into drm-misc-next	Maxime Ripard
	Let's start the 5.19 development cycle. Signed-off-by: Maxime Ripard <maxime@cerno.tech>
2022-03-29	drm/ttm: rework bulk move handling v5	Christian König
	Instead of providing the bulk move structure for each LRU update set this as property of the BO. This should avoid costly bulk move rebuilds with some games under RADV. v2: some name polishing, add a few more kerneldoc words. v3: add some lockdep v4: fix bugs, handle pin/unpin as well v5: improve kerneldoc Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20220321132601.2161-5-christian.koenig@amd.com
2022-03-25	drm/amdgpu: remove table_freed param from the VM code	Christian König
	Better to leave the decision when to flush the VM changes in the TLB to the VM code. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25	drm/amdgpu: rework TLB flushing	Christian König
	Instead of tracking the VM updates through the dependencies just use a sequence counter for page table updates which indicates the need to flush the TLB. This reduces the need to flush the TLB drastically. v2: squash in NULL check fix (Christian) Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>