summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
AgeCommit message (Collapse)Author
2021-03-23drm/amdgpu: Reset the devices in the XGMI hive duirng probeshaoyunl
In passthrough configuration, hypervisior will trigger the SBR(Secondary bus reset) to the devices without sync to each other. This could cause device hang since for XGMI configuration, all the devices within the hive need to be reset at a limit time slot. This serial of patches try to solve this issue by co-operate with new SMU which will only do minimum house keeping to response the SBR request but don't do the real reset job and leave it to driver. Driver need to do the whole sw init and minimum HW init to bring up the SMU and trigger the reset(possibly BACO) on all the ASICs at the same time Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Acked-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26drm/amdgpu: remove unnecessary reading for epprom headerDennis Li
If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-05drm/amdgpu: enable ras eeprom support for sienna cichlidJohn Clements
added I2C address and asic support flag Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-02drm/amdgpu/amdgpu: improve code indentation and alignmentDeepak R Varma
General code indentation and alignment changes such as replace spaces by tabs or align function arguments as per the coding style guidelines. The patch corrects issues for various amdgpu_*.c files for this driver. Issue reported by checkpatch script. Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30drm/amdgpu: remove redundant GPU resetDennis Li
Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-21drm/amdgpu: protect eeprom update from GPU resetDennis Li
because i2c is unstable in GPU reset, driver need protect eeprom update from GPU reset, to not miss any bad page record. Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: added RAS EEPROM device support checkJohn Clements
updated RAS EEPROM init/threshold sequences to check for device support Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: update eeprom once specifying one bigger threshold(v3)Guchun Chen
During driver's probe, when it hits bad gpu tag in eeprom i2c init calling(the tag was set when reported bad page reaches bad page threshold in last driver's working loop), there are some strategys to deal with the cases: 1. when the module parameter amdgpu_bad_page_threshold = 0, that means page retirement feature is disabled, so just resetting the eeprom is fine. 2. When amdgpu_bad_page_threshold is not 0, and moreover, user sets one bigger valid data in order to make current boot up succeeds, correct eeprom header tag and do not break booting. 3. For other cases, driver's probe will be broken. v2: Just update eeprom header tag instead of resetting the whole table header when user sets one bigger threshold data. v3: Use dev_info/dev_err to print PCI device information, which helps in mGPU case. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: break GPU recovery once it's in bad state(v4)Guchun Chen
When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: schedule ras recovery when reaching bad page threshold(v2)Guchun Chen
Once the bad page saved to eeprom reaches the configured threshold, ras recovery will be issued to notify user. v2: Fix spelling typo. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: break driver init process when it's bad GPU(v5)Guchun Chen
When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check. v2: Fix spelling typo, correct the condition to detect bad gpu tag and refine error message. v3: Refine function argument name. v4: Fix missing check of returning value of i2c initialization error case. v5: Use dev_err to print PCI information in dmesg instead of DRM_ERROR. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: add bad gpu tag definitionGuchun Chen
This tag will be hired for bad gpu detection in eeprom's access. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: validate bad page threshold in ras(v3)Guchun Chen
Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-16drm/amdgpu: Move EEPROM I2C adapter to amdgpu_deviceAndrey Grodzovsky
Puts the i2c adapter in common place for sharing by RAS and upcoming data read from FRU EEPROM feature. v2: Move i2c adapter to amdgpu_pm and rename it. v3: Move i2c adapter init to ASIC specific code and get rid of the switch case in amdgpu_device Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdgpu: Add Arcturus D342 page retire supportJohn Clements
Check Arcturus SKU type to select I2C address of page retirement EEPROM Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-26drm/amdgpu: Resolved offchip EEPROM I/O issueJohn Clements
Updated target I2C address Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-25drm/amdgpu: Use ARCTURUS in RAS EEPROM.Andrey Grodzovsky
Add Arcturus EEPROM/I2C support in generic EEPROM code. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03drm/amdgpu: remove redundant variable definitionGuchun Chen
No need to define the same variables in each loop of the function. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu:Fix EEPROM checksum calculation.Andrey Grodzovsky
Fix checksum calculation after manually resetting the table. Unify reset and empty EEPROM init flow. Protect the table reset with lock. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Add amdgpu_ras_eeprom_reset_tableAndrey Grodzovsky
This will allow to reset the table on the fly. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/madgpu: Fix EEPROM Checksum calculation.Andrey Grodzovsky
Fix typo which messed up the calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-29drm/amdgpu: fix spelling mistake "jumpimng" -> "jumping"Colin Ian King
There is a spelling mistake in a DRM_DEBUG_DRIVER debug message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27drm/amdgpu: Vega20 SMU I2C HW engine controller.Andrey Grodzovsky
Implement HW I2C enigne controller to be used by the RAS EEPROM table manager. This is based on code from ATITOOLs. v2: Rename the file and all function prefixes to smu_v11_0_i2c By Luben's observation always fill the TX fifo to full so we don't have garbadge interpreted by the slave as valid data. v3: Remove preemption disable as the HW I2C controller will not stop the clock on empty TX fifo and so it's not critical to keep not empty queue. Switch to fast mode 400 khz SCL clock for faster read and write. v5: Restore clock gating before releasing I2C bus and fix some style comments. v6: squash in warning fix, fix includes (Alex) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27drm/amdgpu: Add RAS EEPROM table.Andrey Grodzovsky
Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>