main.rst - OpenGrok cross reference for /linux-6.14.4/Documentation/admin-guide/RAS/main.rst

Lines Matching +full:error +full:- +full:correction
1 .. SPDX-License-Identifier: GPL-2.0
37 -------------
48 * Memory – add error correction logic (ECC) to detect and correct errors;
51   Self-Monitoring, Analysis and Reporting Technology (SMART).
53 By monitoring the number of occurrences of error detections, it is possible
59 ---------------
62 Codes that allow error correction when the number of errors on a bit packet
64 can indicate with a high degree of confidence that an error happened, but
67 Also, sometimes an error occurs on a component that is not used. For
72 * **Correctable Error (CE)** - the error detection mechanism detected and
73   corrected the error. Such errors are usually not fatal, although some
76 * **Uncorrected Error (UE)** - the number of errors exceeded the error
77   correction threshold, and the system was unable to auto-correct.
79 * **Fatal Error** - when a UE happens on a critical component of the
83 * **Non-fatal Error** - when a UE happens on an unused component,
88   When an error happens in a userspace process, it is also possible to
91 The mechanism for handling non-fatal errors is usually complex and may
96 ------------------------------------
102 So, it requires not only error logging facilities, but also mechanisms that
103 will translate the error message to the silkscreen or component label for
117 		Locator: ChannelA-DIMM0
125 In the above example, a DDR4 SO-DIMM memory module is located at the
128 *data width*. This means that the memory module has no error
129 detection/correction mechanisms.
136 		Error Information Handle: Not Provided
153 it has 8 extra bits to be used by error detection and correction mechanisms.
154 This kind of memory is called error-correcting code memory (ECC memory).
161 ----------
164 used for error correction. In the above example, a memory module has
166 bits which are used for the error detection and correction mechanisms
171 using Hamming code, or some other error correction code, like SECDED+,
180 there was an error, and if the ECC code was able to fix that error.
181 If the error was corrected, a Corrected Error (CE) happened. If not, an
182 Uncorrected Error (UE) happened.
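The CE/UE distinction can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below is an illustration only: it uses an extended Hamming (8,4) code over 4 data bits, whereas real memory controllers protect much wider words and may use different codes. The function names ``encode`` and ``decode`` are hypothetical and do not come from the kernel.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (1, 2 and 4 are parity, 3, 5, 6 and 7 are data).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, nibble), where status is "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "OK"
    elif overall_bad:
        # Exactly one bit flipped; syndrome 0 means the overall parity bit.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return "UE", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping one bit of a codeword yields a CE with the original data recovered; flipping two bits yields a UE, because the code can tell that something is wrong but not where.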
191   mode called "Lock-Step", where it groups two memory modules together,
192   doing 128-bit reads/writes. That gives 16 bits for error correction, which
193   significantly improves the error correction mechanism, at the expense
194   that, when an error happens, there's no way to know which memory module is
200   identical data. In such a configuration, when an error happens, there's no
202   memory modules (or 4 memory modules, if the system is also on Lock-step
208 EDAC - Error Detection And Correction
214    was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
222 -------
228 ------
244 -----------------------
249 This new device type allows for non-memory types of ECC hardware detectors
261 ----------------
267 There are several add-in adapters that do **not** follow the PCI specification
274 the EDAC PCI scanning code. If that attribute is set, PCI parity/error
284 ----------
297 -------
302 hardware-specific modules and have the dependencies load the necessary
314 ---------------
329 ----------------------------
332 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
335 .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
337   packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
339   Platform Error Record (CPER) section to be an SMBIOS Memory Device
350 for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
353 	+------------+-----------------------+
355 	+------------+-----------+-----------+
359 	+------------+-----------+-----------+
361 	+------------+-----------+-----------+
363 	+------------+-----------+-----------+
365 	+------------+-----------+-----------+
367 	+------------+-----------+-----------+
369 	+------------+-----------+-----------+
374 	+---------+---------+
376 	+---------+---------+
378 	+---------+---------+
380 Labels for these slots are usually silk-screened on the motherboard.
404 		   |->mc0
405 		   |->mc1
406 		   |->mc2
414 		|->csrow0
415 		|->csrow2
416 		|->csrow3
421 order to have dual-channel mode be operational. Since both csrow2 and
429 -------------------
436 	Documentation/ABI/testing/sysfs-devices-edac
440 ----------------------------------
498 - ``size`` - Total memory managed by this csrow attribute file
503 - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
510 - ``dimm_ce_count`` - Correctable Errors count attribute file
516 	monitored for non-zero values and report such information
519 - ``dimm_dev_type``  - Device type attribute file
525 		- x1
526 		- x2
527 		- x4
528 		- x8
530 - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
532 	This attribute file will display what type of Error detection
533 	and correction is being utilized.
535 - ``dimm_label`` - memory module label control file
549 - ``dimm_location`` - location of the memory module
556 		- *csrow* and *channel* - used when the memory controller
557 		  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
558 		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
560 		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
562 - ``dimm_mem_type`` - Memory Type attribute file
568 		- Registered-DDR
569 		- Unbuffered-DDR
581 ----------------------
584 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
592 - ``ue_count`` - Total Uncorrectable Errors count attribute file
600 - ``ce_count`` - Total Correctable Errors count attribute file
606 	monitored for non-zero values and report such information
610 - ``size_mb`` - Total memory managed by this csrow attribute file
616 - ``mem_type`` - Memory Type attribute file
622 		- Registered-DDR
623 		- Unbuffered-DDR
626 - ``edac_mode`` - EDAC Mode of operation attribute file
628 	This attribute file will display what type of Error detection
629 	and correction is being utilized.
632 - ``dev_type`` - Device type attribute file
638 		- x1
639 		- x2
640 		- x4
641 		- x8
644 - ``ch0_ce_count`` - Channel 0 CE Count attribute file
650 - ``ch0_ue_count`` - Channel 0 UE Count attribute file
656 - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
672 - ``ch1_ce_count`` - Channel 1 CE Count attribute file
679 - ``ch1_ue_count`` - Channel 1 UE Count attribute file
686 - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
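The ``ce_count`` and ``ue_count`` attribute files listed above are plain text files holding a number, so monitoring them amounts to reading and parsing each file. The following is a minimal sketch, assuming a memory controller directory (for example an ``mc0`` directory under the EDAC sysfs tree) that contains ``csrowX`` subdirectories; the function name ``read_csrow_counts`` is hypothetical, and the directory to scan is passed in rather than hard-coded.

```python
from pathlib import Path


def read_csrow_counts(mc_dir):
    """Collect ce_count/ue_count for every csrowX directory under mc_dir.

    Returns a dict such as {"csrow0": {"ce_count": 3, "ue_count": 0}}.
    Attribute files that are missing are simply left out of the result.
    """
    counts = {}
    for csrow in sorted(Path(mc_dir).glob("csrow*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            attr = csrow / name
            if attr.is_file():
                entry[name] = int(attr.read_text().strip())
        counts[csrow.name] = entry
    return counts
```

A monitoring script could call this periodically and report any csrow whose counters are non-zero or have increased since the previous reading.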
702 --------------
713 	+---------------------------------------+-------------+
717 	+---------------------------------------+-------------+
718 	| Error type                            | CE          |
719 	+---------------------------------------+-------------+
721 	+---------------------------------------+-------------+
723 	+---------------------------------------+-------------+
725 	| or resolution of the error            |             |
726 	+---------------------------------------+-------------+
727 	| The error syndrome                    | 0xb741      |
728 	+---------------------------------------+-------------+
730 	+---------------------------------------+-------------+
732 	+---------------------------------------+-------------+
734 	+---------------------------------------+-------------+
735 	| And then an optional, driver-specific |             |
738 	+---------------------------------------+-------------+
740 Both UEs and CEs with no info will lack all but memory controller, error
741 type, a notice of "no info" and then an optional, driver-specific error
746 ------------------------
749 parity error regardless of whether parity is enabled on the device or
756 -------------------
762 - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
777 - ``pci_parity_count`` - Parity Count
784 -----------------
786 - ``edac_mc_panic_on_ue`` - Panic on UE control file
788 	An uncorrectable error will cause a machine panic.  This is usually
789 	desirable.  It is a bad idea to continue when an uncorrectable error
790 	occurs - it is indeterminate what was uncorrected and the operating
804 - ``edac_mc_log_ue`` - Log UE control file
820 - ``edac_mc_log_ce`` - Log CE control file
836 - ``edac_mc_poll_msec`` - Polling period control file
839 	The time period, in milliseconds, for polling for error information.
842 	locating the error.  1000 milliseconds (once each second) is the current
855 - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
859 	error has been detected.
877 ----------------
891 	/sys/devices/system/edac/test-instance
913 			One out-of-tree driver uses controls here to allow
914 			for ERROR INJECTION operations to hardware
921 ---------
926 	+----------------+
927 	| test-instance0 |
928 	+----------------+
940 ------
945 	+-------------+
946 	| test-block0 |
947 	+-------------+
962 	test-block-bits-0	for every POLL cycle this counter
964 	test-block-bits-1	every 10 cycles, this counter is bumped once,
965 				and test-block-bits-0 is set to 0
966 	test-block-bits-2	every 100 cycles, this counter is bumped once,
967 				and test-block-bits-1 is set to 0
968 	test-block-bits-3	every 1000 cycles, this counter is bumped once,
969 				and test-block-bits-2 is set to 0
974 	reset-counters		writing ANY thing to this control will
987 --------------------------------------------------
1038    implement this functionality via some error injection nodes:
1040    For injecting a memory error, there are some sysfs nodes, under
1043    - ``inject_addrmatch/*``:
1044       Controls the error injection mask register. It is possible to specify
1045       several characteristics of the address to match an error code::
1049          channel = the channel that will generate an error;
1058       For example, to generate an error at rank 1 of dimm 2, for any channel,
1069    - ``inject_eccmask``:
1072    - ``inject_section``:
1073        specifies what ECC cache section will get the error::
1079    - ``inject_type``:
1080        specifies the type of error, being a combination of the following bits::
1082 		bit 0 - repeat
1083 		bit 1 - ecc
1084 		bit 2 - parity
1086    - ``inject_enable``:
1087        starts the error generation when a value other than 0 is written.
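Since ``inject_type`` is described as a combination of bits, the value to write can be built by OR-ing the individual bit positions. The sketch below only computes that number; the helper name ``inject_type_value`` and the flag names used as dictionary keys are illustrative assumptions, and the code does not touch any sysfs node.

```python
# Bit positions as listed for inject_type: bit 0 repeat, bit 1 ecc, bit 2 parity.
INJECT_TYPE_BITS = {"repeat": 0, "ecc": 1, "parity": 2}


def inject_type_value(*flags):
    """Combine the named flags into the integer bitmask for inject_type."""
    value = 0
    for flag in flags:
        value |= 1 << INJECT_TYPE_BITS[flag]
    return value
```

For instance, ``inject_type_value("repeat", "ecc")`` sets bits 0 and 1 and therefore evaluates to 3.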
1091    The datasheet states that the error will only be generated after a write on an
1093    also produce an error.
1095    For example, the following code will generate an error for any write access
1108    The generated error message will look like::
1110 …-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome…
1112 3) Corrected Error memory register counters
1142    The hardware will increment udimm0 for an error at the first dimm at either
1145    The hardware will increment udimm1 for an error at the second dimm at either
1148    The hardware will increment udimm2 for an error at the third dimm at either
1151 4) Standard error counters
1153    The standard error counters are generated when an mcelog error is received
1159 ------------------------------------------
1162 (available from http://support.amd.com/en-us/search/tech-docs):
1185 	  Models 30h-3Fh Processors
1189    :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1192 	  Models 60h-6Fh Processors
1196    :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1199 	  Models 00h-0Fh Processors
1210   - 7 Dec 2005
1211   - 17 Jul 2007	Updated
1215   - 05 Aug 2009	Nehalem interface
1216   - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1220   - Doug Thompson, Dave Jiang, Dave Peterson et al,
1221   - Mauro Carvalho Chehab
1222   - Borislav Petkov
1223   - original author: Thayne Harbaugh