Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extension enables the
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. Said errors are Synchronous External
Abort (SEA), Asynchronous External Abort (signalled as SErrors), Fault Handling
and Error Recovery interrupts. The |EHF| document mentions various
:ref:`error handling use-cases <delegation-use-cases>`.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts.
The RAS framework also provides `helpers`__ for accessing Standard Error Records
as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
``RAS_TRAP_LOWER_EL_ERR_ACCESS`` controls the access to the RAS error record
registers from lower ELs.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format.
The RAS architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of the notification mechanisms, the RAS framework can iterate through and
probe error records for errors, and invoke the appropriate handler to handle
them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their
arguments, which are later passed to probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records.
Note that the macro must be used in the same file where the array is defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

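
As a rough illustration of how an interrupt number can be tied to an error
record and dispatched, the following standalone sketch models a table sorted by
interrupt number and a binary-search lookup. It does not use the |TF-A| headers:
every ``demo_*`` name, the interrupt numbers, and the return values are
hypothetical, and the probe step before the handler is an assumption made for
this example rather than a description of the framework's implementation.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for an error record; not the TF-A definition. */
    struct demo_err_record {
        const char *name;
        int (*probe)(const struct demo_err_record *rec, int *probe_data);
        int (*handler)(const struct demo_err_record *rec, int probe_data);
    };

    /* Hypothetical stand-in for a RAS interrupt description. */
    struct demo_ras_interrupt {
        unsigned int intr_number;
        const struct demo_err_record *err_record;
        void *cookie;
    };

    /* Binary search over a table sorted by increasing intr_number. */
    static const struct demo_ras_interrupt *demo_find_interrupt(
            const struct demo_ras_interrupt *table, size_t count,
            unsigned int intr)
    {
        size_t lo = 0, hi = count;

        assert(table != NULL || count == 0);
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr)
                return &table[mid];
            if (table[mid].intr_number < intr)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

    /*
     * Returns -1 for an unknown interrupt, 0 when the probe finds no error,
     * and otherwise whatever the record's handler returns.
     */
    static int demo_dispatch(const struct demo_ras_interrupt *table,
            size_t count, unsigned int intr)
    {
        const struct demo_ras_interrupt *entry =
            demo_find_interrupt(table, count, intr);
        int probe_data = 0;

        if (entry == NULL)
            return -1;
        if (entry->err_record->probe(entry->err_record, &probe_data) == 0)
            return 0;
        return entry->err_record->handler(entry->err_record, probe_data);
    }

    /* Example probe that always reports an error in record index 3. */
    static int demo_probe_fault(const struct demo_err_record *rec,
            int *probe_data)
    {
        (void)rec;
        *probe_data = 3;
        return 1;
    }

    /* Example probe that never finds an error. */
    static int demo_probe_clean(const struct demo_err_record *rec,
            int *probe_data)
    {
        (void)rec;
        (void)probe_data;
        return 0;
    }

    /* Example handler; encodes the probed index in its return value. */
    static int demo_handler(const struct demo_err_record *rec, int probe_data)
    {
        (void)rec;
        return 100 + probe_data;
    }

    static const struct demo_err_record demo_records[] = {
        { "node-a", demo_probe_fault, demo_handler },
        { "node-b", demo_probe_clean, demo_handler },
    };

    /* Must stay sorted by interrupt number for the binary search to work. */
    static const struct demo_ras_interrupt demo_interrupts[] = {
        { 35U, &demo_records[0], NULL },
        { 72U, &demo_records[1], NULL },
    };

In this model, ``demo_dispatch(demo_interrupts, 2, 35U)`` reaches the handler
of ``node-a``, while an interrupt number that is absent from the table is
rejected without touching any record.
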

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register.
These were introduced to assist EL3 software which runs part of its entry/exit
routines with exceptions momentarily masked. In such systems, External
Aborts/SErrors are not handled immediately when they occur, but only after the
exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the entirety of EL3 with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect Double Fault conditions in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which doesn't return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3.
When ``RAS_EXTENSION`` is set to ``1``, it first calls the ``ras_ea_handler()``
function, which is the top-level RAS exception handler. ``ras_ea_handler`` is
responsible for iterating through platform-supplied error records, probing
them, and, when an error is identified, looking up and invoking the
corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within it.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit;
but for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*