summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
AgeCommit message (Collapse)Author
2023-06-09drm/amdgpu: Set EEPROM ras infoStanley.Yang
Set EEPROM ras info: rma status, health percent and bad page threshold. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Add support EEPROM table v2.1Stanley.Yang
Add ras info to EEPROM table, app can analyse device ECC status without GPU driver through EEPROM table ras info. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Add RAS table v2.1 macro definitionStanley.Yang
Add RAS EEPROM table version 2.1 macro definition. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Rename ras table versionStanley.Yang
Rename RAS_TABLE_VER to RAS_TABLE_VER_V1, move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15drm/amdgpu: message smu to update bad channel infoStanley.Yang
It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-16drm/amdgpu: Drop inline from amdgpu_ras_eeprom_max_record_countMichel Dänzer
This was unusual; normally, inline functions are declared static as well, and defined in a header file if used by multiple compilation units. The latter would be more involved in this case, so just drop the inline declaration for now. Fixes compile failure building for ppc64le on RHEL 8: In file included from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:32, from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:33: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_recovery_init’: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h:90:17: error: inlining failed in call to ‘always_inline’ ‘amdgpu_ras_eeprom_max_record_count’: function body not available 90 | inline uint32_t amdgpu_ras_eeprom_max_record_count(void); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1985:34: note: called from here 1985 | max_eeprom_records_len = amdgpu_ras_eeprom_max_record_count(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: c84d46707ebb "drm/amdgpu: validate bad page threshold in ras(v3)" Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: RAS EEPROM table is now in debugfsLuben Tuikov
Add "ras_eeprom_size" file in debugfs, which reports the maximum size allocated to the RAS table in EEROM, as the number of bytes and the number of records it could store. For instance, $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size 262144 bytes or 10921 records $_ Add "ras_eeprom_table" file in debugfs, which dumps the RAS table stored EEPROM, in a formatted way. For instance, $cat ras_eeprom_table Signature Version FirstOffs Size Checksum 0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA Index Offset ErrType Bank/CU TimeStamp Offs/Addr MemChl MCUMCID RetiredPage 0 0x00014 ue 0x00 0x00000000607608DC 0x000000000000 0x00 0x00 0x000000000000 1 0x0002C ue 0x00 0x00000000607608DC 0x000000001000 0x00 0x00 0x000000000001 2 0x00044 ue 0x00 0x00000000607608DC 0x000000002000 0x00 0x00 0x000000000002 3 0x0005C ue 0x00 0x00000000607608DC 0x000000003000 0x00 0x00 0x000000000003 4 0x00074 ue 0x00 0x00000000607608DC 0x000000004000 0x00 0x00 0x000000000004 5 0x0008C ue 0x00 0x00000000607608DC 0x000000005000 0x00 0x00 0x000000000005 6 0x000A4 ue 0x00 0x00000000607608DC 0x000000006000 0x00 0x00 0x000000000006 7 0x000BC ue 0x00 0x00000000607608DC 0x000000007000 0x00 0x00 0x000000000007 8 0x000D4 ue 0x00 0x00000000607608DD 0x000000008000 0x00 0x00 0x000000000008 $_ Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Optimize EEPROM RAS table I/OLuben Tuikov
Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Get rid of test functionLuben Tuikov
The code is now tested from userspace. Remove already macroed out test function. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Some renamesLuben Tuikov
Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Use explicit cardinality for clarityLuben Tuikov
RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <guchun.chen@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Simplify RAS EEPROM checksum calculationsLuben Tuikov
Rename update_table_header() to write_table_header() as this function is actually writing it to EEPROM. Use kernel types; use u8 to carry around the checksum, in order to take advantage of arithmetic modulo 8-bits (256). Tidy up to 80 columns. When updating the checksum, just recalculate the whole thing. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: RAS xfer to read/writeLuben Tuikov
Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: Rename misspelled functionLuben Tuikov
Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01drm/amdgpu: RAS and FRU now use 19-bit I2C addressLuben Tuikov
Convert RAS and FRU code to use the 19-bit I2C memory address and remove all "slave_addr", as this is now absolved into the 19-bit address. Cc: Jean Delvare <jdelvare@suse.de> Cc: John Clements <john.clements@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26drm/amdgpu: remove unnecessary reading for epprom headerDennis Li
If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: break GPU recovery once it's in bad state(v4)Guchun Chen
When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: break driver init process when it's bad GPU(v5)Guchun Chen
When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check. v2: Fix spelling typo, correct the condition to detect bad gpu tag and refine error message. v3: Refine function argument name. v4: Fix missing check of returning value of i2c initialization error case. v5: Use dev_err to print PCI information in dmesg instead of DRM_ERROR. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04drm/amdgpu: validate bad page threshold in ras(v3)Guchun Chen
Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-21drm/amdgpu: move i2c bus lock out of ras structureAlex Deucher
It's not really ras related. It's just a lock for the bus in general. This removes the ras dependency from the smu i2c bus. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-16drm/amdgpu: Move EEPROM I2C adapter to amdgpu_deviceAndrey Grodzovsky
Puts the i2c adapter in common place for sharing by RAS and upcoming data read from FRU EEPROM feature. v2: Move i2c adapter to amdgpu_pm and rename it. v3: Move i2c adapter init to ASIC specific code and get rid of the switch case in amdgpu_device Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-26drm/amdgpu: Resolved offchip EEPROM I/O issueJohn Clements
Updated target I2C address Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Add amdgpu_ras_eeprom_reset_tableAndrey Grodzovsky
This will allow to reset the table on the fly. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-27drm/amdgpu: Add RAS EEPROM table.Andrey Grodzovsky
Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Luben Tuikov <Luben.Tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>