Age | Commit message (Collapse) | Author |
|
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
For xgmi sriov, the reset is handled by host driver and hive->reset_domain
is not initialized so need to check if it exists before doing a put.
Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Adjust removal control flow for smu v13_0_2:
During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.
V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
a global variable.
V3:
1. Update commit comments.
2. Split a patch into multiple patches.
3. The current patch does:
a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
the existing gpu recover path, which make all devices
in hive list only have HW reset but no resume (except
the base IP).
b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
in amdgpu_pci_remove when removing the first device in
hive list.
c. When removing the first device, the IP blocks keyword
function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
^ |
|-<----------<---------<----|
The first three sequences are because of a call to
amdgpu_device_gpu_recover. The three sequences will be
executed in a loop until all devices in the hive list
are iterated.
The sequences starting from .hw_fini only apply to the
first device. Since .suspend has been called before,
except the resumed phase1 basic ip blocks, all other ip
blocks .hw_fini of current device will do nothing.
d. When removing other devices, the calling sequences is the
same as legacy:
.hw_fini -> .early_fini -> .sw_fini.
Since .suspend has been called when removing the first device,
except the resumed phase1 basic ip blocks, all of other ip
blocks .hw_fini of current device will do nothing.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed
v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.
Fixes: 9e08564727fc (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Will be read by executors of async reset like debugfs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Seems I forgot to add this to the relevant commit
when submitting.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220210031724.440943-1-andrey.grodzovsky@amd.com
|
|
This functions needs to be split into 2 parts where
one is called only once for locking single instance of
reset_domain's sem and reset flag and the other part
which handles MP1 states should still be called for
each device in XGMI hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
|
|
We should have a single instance per entrire reset domain.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
|
|
We want single instance of reset sem across all
reset clients because in case of XGMI we should stop
access cross device MMIO because any of them could be
in a reset in the moment.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
|
|
The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be binded to XGMI
hive life cycle.
Adress this by making reset domain refcounted and pointed
by each member of the hive and the hive itself.
v4:
Fix crash on boot witrh XGMI hive by adding type to reset_domain.
XGMI will only create a new reset_domain if prevoius was of single
device type meaning it's first boot. Otherwsie it will take a
refocunt to exsiting reset_domain from the amdgou device.
Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
|
|
Fix header guard and make internal functions static. Fixes the below warnings:
drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu_reset.h:24:9: warning: '__AMDUGPU_RESET_H__' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
drivers/gpu/drm/amd/amdgpu/aldebaran.c:110:6: warning: no previous prototype for function 'aldebaran_async_reset' [-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/aldebaran_ppt.c:1435:5: warning: no previous prototype for function 'aldebaran_mode2_reset' [-Wmissing-prototypes]
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
v1: Add generic amdgpu_reset_control to handle different types of resets. It
may be added at device, hive or ip level. Each reset control has a list
of handlers associated with it to handle different types of reset. Reset
control is responsible for choosing the right handler given a particular
reset context.
Handler objects may implement a set of functions on how to handle a
particular type of reset.
prepare_env = Prepare environment/software context (not used currently).
prepare_hwcontext = Prepare hardware context for the reset.
perform_reset = Perform the type of reset.
restore_hwcontext = Restore the hw context after reset.
restore_env = Restore the environment after reset (not used currently).
Reset context carries the context of reset, as of now this is based on
the parameters used for current set of resets.
v2: Fix coding style
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|