Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables the
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. These errors are reported as
Synchronous External Aborts (SEAs), Asynchronous External Aborts (signalled as
SErrors), or Fault Handling and Error Recovery interrupts. The |EHF| document
mentions various :ref:`error handling use-cases <delegation-use-cases>`.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
``RAS_TRAP_LOWER_EL_ERR_ACCESS`` controls access to the RAS error record
registers from lower ELs.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

-  A handler to probe error records for errors;
-  When the probing identifies an error, a handler to handle it;
-  For a memory-mapped error record, its base address and size in KB; for a
   system register-accessed record, the start index of the record and the
   number of contiguous records from that index;
-  Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of these mechanisms, the RAS framework can iterate through the error
records, probe them for errors, and invoke the appropriate handler.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

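As an illustration of the probe contract described above, the following is a
minimal, self-contained sketch rather than |TF-A| code. The structure is a
simplified, hypothetical stand-in for ``struct err_record_info``, and
``demo_probe``, ``DEMO_STATUS_VALID``, and the use of auxiliary data to hold
status words are assumptions made only for this example. The probe returns a
non-zero value and stores the record index in ``probe_data`` when it finds an
error, and returns 0 otherwise.

.. code:: c

    #include <stdint.h>

    /* Hypothetical, simplified stand-in for TF-A's struct err_record_info. */
    struct err_record_info {
        unsigned int idx_start;  /* first record index covered by this entry */
        unsigned int num_idx;    /* number of consecutive records */
        void *aux_data;          /* node-specific data; here: status words */
    };

    /* Assumed for illustration: bit 0 of a status word means "error valid". */
    #define DEMO_STATUS_VALID  UINT64_C(1)

    /*
     * Probe handler with the err_record_probe_t shape: returns non-zero and
     * stores the index of the first record reporting an error in *probe_data.
     */
    int demo_probe(const struct err_record_info *info, int *probe_data)
    {
        const uint64_t *status = info->aux_data;

        for (unsigned int i = 0; i < info->num_idx; i++) {
            if ((status[i] & DEMO_STATUS_VALID) != 0U) {
                *probe_data = (int)(info->idx_start + i);
                return 1;
            }
        }

        return 0;  /* no error found; *probe_data is left untouched */
    }
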
The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

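The enumerate, probe, and handle flow described in this section can be sketched
in a self-contained form as follows. This is not the framework's
implementation: the structure is a simplified stand-in for
``struct err_record_info``, ``DEMO_ERR_RECORD_SYSREG`` only mimics the argument
order of ``ERR_RECORD_SYSREG_V1()``, and ``demo_dispatch`` and the other
``demo_``/``probe_`` names are hypothetical. A real platform would use the
framework macros and ``REGISTER_ERR_RECORD_INFO()`` instead.

.. code:: c

    #include <stddef.h>

    struct err_record_info;
    struct err_handler_data;  /* opaque in this sketch */

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

    /* Simplified stand-in for TF-A's struct err_record_info. */
    struct err_record_info {
        unsigned int idx_start;
        unsigned int num_idx;
        err_record_probe_t probe;
        err_record_handler_t handler;
        void *aux_data;
    };

    /* Stand-in macro: same argument order as ERR_RECORD_SYSREG_V1(). */
    #define DEMO_ERR_RECORD_SYSREG(_idx, _num, _probe, _handler, _aux) \
        { .idx_start = (_idx), .num_idx = (_num), .probe = (_probe), \
          .handler = (_handler), .aux_data = (_aux) }

    int demo_handled_idx = -1;  /* records what the handler was asked to fix */

    static int probe_clean(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        (void)probe_data;
        return 0;  /* nothing to report */
    }

    static int probe_faulty(const struct err_record_info *info, int *probe_data)
    {
        *probe_data = (int)info->idx_start;  /* pretend the first record failed */
        return 1;
    }

    static int demo_handler(const struct err_record_info *info, int probe_data,
               const struct err_handler_data *const data)
    {
        (void)info;
        (void)data;
        demo_handled_idx = probe_data;
        return 0;
    }

    const struct err_record_info demo_err_records[] = {
        DEMO_ERR_RECORD_SYSREG(0, 2, probe_clean, demo_handler, NULL),
        DEMO_ERR_RECORD_SYSREG(2, 1, probe_faulty, demo_handler, NULL),
    };

    /*
     * Probe each entry in order and run the handler of the first entry that
     * reports an error. Returns 1 if an error was handled, 0 otherwise.
     */
    int demo_dispatch(const struct err_record_info *records, size_t count,
               const struct err_handler_data *data)
    {
        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) != 0) {
                (void)records[i].handler(&records[i], probe_data, data);
                return 1;
            }
        }

        return 0;
    }
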
Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

-  Return a non-zero value when an error is detected in a Standard Error Record;
-  Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

-  Interrupt number;
-  The associated error record information (pointer to the corresponding
   ``struct err_record_info``);
-  Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

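To show why the sorted order matters, here is a minimal binary search over an
array sorted by interrupt number. It is a sketch under stated assumptions, not
the framework's code: ``struct demo_ras_interrupt``, its field names, and
``demo_find_interrupt`` are hypothetical stand-ins. The lookup only works if
the array is sorted in increasing order; on an unsorted array it may miss
entries.

.. code:: c

    #include <stddef.h>

    /* Hypothetical, simplified stand-in for TF-A's struct ras_interrupt. */
    struct demo_ras_interrupt {
        unsigned int intr_number;  /* interrupt number (the sort key) */
        void *cookie;              /* optional platform data */
    };

    /*
     * Binary search over a table sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the interrupt is not listed.
     */
    const struct demo_ras_interrupt *demo_find_interrupt(
            const struct demo_ras_interrupt *table, size_t count,
            unsigned int intr)
    {
        size_t lo = 0;
        size_t hi = count;  /* the search window is [lo, hi) */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr) {
                return &table[mid];
            }

            if (table[mid].intr_number < intr) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        return NULL;
    }
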
Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions that are part of Armv8.4 introduced new architectural
features to deal with Double Fault conditions, specifically the ``NMEA`` and
``EASE`` bits in the ``SCR_EL3`` register. These were introduced to assist EL3
software that runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A| is
thus able to detect Double Fault conditions in software, without needing the
intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

-  ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

-  ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
   `Interaction with Exception Handling Framework`_;

-  ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
   EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

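The shape of such an override can be sketched as below. This is an illustrative
stand-alone example, not |TF-A| source: ``demo_ras_ea_handler`` is a stub
standing in for ``ras_ea_handler()``, the parameter list shown is an
assumption, and ``demo_plat_ea_handler`` and ``demo_unhandled_count`` are
hypothetical names. The point is only the ordering: the RAS framework gets the
first chance to handle the abort, and the platform fallback runs only if the
framework reports that it did not handle it.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Stub standing in for the framework's ras_ea_handler(); the parameter
     * list is assumed. It reports "handled" (non-zero) only for ea_reason 1,
     * an arbitrary value chosen for this demo.
     */
    int demo_ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
            void *cookie, void *handle, uint64_t flags)
    {
        (void)syndrome;
        (void)cookie;
        (void)handle;
        (void)flags;
        return (ea_reason == 1U) ? 1 : 0;
    }

    int demo_unhandled_count;  /* counts aborts left to the platform fallback */

    /* Platform override: call into the RAS handler first, then fall back. */
    void demo_plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
            void *cookie, void *handle, uint64_t flags)
    {
        if (demo_ras_ea_handler(ea_reason, syndrome, cookie, handle,
                flags) != 0) {
            return;  /* the RAS handler dealt with the error */
        }

        /* Platform-specific fallback; a real platform might report and stop. */
        demo_unhandled_count++;
    }
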
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the
platform-supplied sorted array of interrupts to look up the error record
information associated with the interrupt number. The error handler for that
record is then invoked to handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, priority management is implicit; for
non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*