# Cr50 And Chrome OS Verified Boot Troubleshooting

H1 is a Google security chip installed on most Chrome OS devices. Cr50 is the
firmware running on the H1. A high level overview of the hardware and firmware
can be found in [this
presentation](https://2018.osfc.io/uploads/talk/paper/7/gsc_copy.pdf).

This write-up explains how Cr50 participates in the Chrome OS device boot
process, and lists possible reasons for the dreaded "Chrome OS Missing Or
Damaged" screen showing up when a Chrome OS device reboots.

## Basic overview

The H1 controls the reset lines of the EC (embedded controller) and the AP
(application processor, or SOC). During normal Chromebook operation the H1 is
always powered up as long as the battery retains even a minimal amount of
charge. In Chromeboxes the H1 powers on with the rest of the system.

One of the important functions of the H1 in the system is providing a subset
of TPM (Trusted Platform Module) functionality. The TPM stores verified boot
information, which is why any **problems communicating with the TPM during
the boot up process** result in the Chrome OS device falling into **recovery
mode**.

Another important function of the H1 in the system is CCD ([closed case
debugging](https://chromium.googlesource.com/chromiumos/platform/ec/+/fe6ca90e/docs/case_closed_debugging_gsc.md#)).

## H1 power states and CCD

During periods of inactivity the H1 can enter a *sleep* or *deep sleep* state.
In the *sleep* state most of the clocks are turned off and power consumption
is minimized, but SRAM contents and the CPU state are maintained. In the *deep
sleep* state the H1 is practically shut down.

The H1 never enters the *deep sleep* state during the Chrome OS boot process,
but it can enter the *sleep* state if the boot process is delayed for whatever
reason, and **only when CCD is not active**. This can explain boot failures
that occur when CCD is not connected but go away when CCD is on (the debug
cable is plugged in).

To make sure the H1 exits the *sleep* state the AP triggers a wake up event,
the details of which are described below.

## H1 communications with the AP

The H1 can be connected to the AP over either the I2C or the SPI bus. The same
Cr50 firmware is used in both cases; which of the two interfaces to use is
decided based on resistor straps the Cr50 reads at startup.

Neither the I2C nor the SPI interface fully complies with its respective bus
standard: the I2C controller does not support clock stretching, and the SPI
controller cannot be clocked faster than 2 MHz.

Look for a text line like the following in the Cr50 console output right after
power up

> [0.005657 Valid strap: 0xa properties: 0x41]

to confirm that the straps were read properly.
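
When working through a captured console log, this check can be scripted. The
following is a minimal sketch, assuming the message always has the shape of
the single sample line quoted above; other firmware versions may print it
differently. The function name `parse_strap_line` is made up for this
illustration.

```python
import re

# Pattern modeled on the one sample line quoted in this document
# (assumption: the message format does not vary).
STRAP_RE = re.compile(
    r"Valid strap:\s*(0x[0-9a-fA-F]+)\s+properties:\s*(0x[0-9a-fA-F]+)"
)


def parse_strap_line(line):
    """Return (strap, properties) as integers, or None if there is no match."""
    match = STRAP_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


sample = "[0.005657 Valid strap: 0xa properties: 0x41]"
print(parse_strap_line(sample))  # (10, 65), i.e. strap 0xa, properties 0x41
```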

A Cr50 console command shows which interface is used to communicate with the
AP:

> \> brdprop<br>
> properties = 0x1141

If the least significant bit of the value is set, the H1 is using the SPI
interface; if the bit is cleared, the H1 is using the I2C interface.

Using the H1 imposes additional requirements on the AP interface: the H1 might
have to be woken up from sleep, and it flow controls the AP using an
additional `AP_INT_L` signal. Both are described in more detail below.

## TPM reset

The H1 stays up until power is removed, unless it falls into deep sleep. The
TPM is just one of the components of the Cr50 firmware, and the TPM must be
reset when the AP resets.

There are differences between ARM and X86 reset circuit architectures. ARM
SOCs have a bidirectional reset signal called `SYS_RST_L`. They (or, rather,
most of them, but let's not worry about the outliers) generate a pulse on this
line when the SOC reboots. An external device can toggle this line to reset
the SOC asynchronously, which is what the Cr50 does to reset ARM SOCs.

The X86 SOCs have two separate signals: one output, `PLT_RST_L`, which is held
low while the AP is in reset or in low power mode, and one input,
`SYS_RST_ODL`, which Cr50 toggles to reset the SOC.

In case of X86, when `PLT_RST_L` is held low longer than a second, the Cr50
considers this an indication of the AP going into a low power mode (S5 or
lower). This means that the AP will start from the reset vector when it wakes
up, so Cr50 can take the H1 into *deep sleep* mode as well.
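
The one second rule can be written down as a small predicate. This is only an
illustrative model of the decision described above, not the Cr50
implementation; the function name and the way timestamps are passed in are
assumptions.

```python
# "Longer than a second", as stated in the text above.
LOW_POWER_THRESHOLD_S = 1.0


def ap_in_low_power(plt_rst_low_since, now):
    """Return True if PLT_RST_L has been low for longer than the threshold.

    plt_rst_low_since: time in seconds at which PLT_RST_L went low, or None
    if the signal is currently high (AP out of reset).
    now: current time in seconds, on the same clock.
    """
    if plt_rst_low_since is None:
        return False
    return (now - plt_rst_low_since) > LOW_POWER_THRESHOLD_S


print(ap_in_low_power(10.0, 10.5))  # short reset pulse: False
print(ap_in_low_power(10.0, 11.5))  # held low past the threshold: True
```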

On top of that, ARM based Chrome OS devices have some additional logic which
forces `SYS_RST_L` to behave similarly to `PLT_RST_L`: it stays low while the
SOC is in a low power mode from which it will resume operation at the reset
vector. This allows the H1 to enter deep sleep on ARM devices as well.

Resistor bootstraps tell the Cr50 which kind of reset architecture to expect.
The SOC reset indication is used both to reset the TPM component and to enter
the *deep sleep* mode as appropriate.

In the `brdprop` command output, bit D5 when set signifies the `SYS_RST_L`
('regular' ARM devices) type of reset, and bit D6 the `PLT_RST_L` (X86 and
modified ARM) type.
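
The property bits mentioned in this document can be decoded with a few lines
of code. The sketch below covers only bit D0 (interface) and bits D5 and D6
(reset type); the constant names are hypothetical, and all other bits of the
value are ignored because this document does not describe them.

```python
# Bit positions as described in this document; names are made up here.
PROP_SPI = 1 << 0        # set: SPI interface, clear: I2C interface
PROP_SYS_RST_L = 1 << 5  # 'regular' ARM reset signalling
PROP_PLT_RST_L = 1 << 6  # X86 and modified ARM reset signalling


def decode_properties(value):
    """Decode the interface and reset type from a brdprop properties value."""
    interface = "SPI" if value & PROP_SPI else "I2C"
    has_sys = bool(value & PROP_SYS_RST_L)
    has_plt = bool(value & PROP_PLT_RST_L)
    if has_sys == has_plt:
        # Neither or both bits set: not a combination described in the text.
        reset = "unexpected"
    else:
        reset = "PLT_RST_L" if has_plt else "SYS_RST_L"
    return {"interface": interface, "reset": reset}


print(decode_properties(0x1141))
```

For the sample value 0x1141 bits D0 and D6 are set and bit D5 is clear, so the
sketch reports the SPI interface and the `PLT_RST_L` reset type.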

Boot problems can arise when the AP reboots without Cr50 seeing a pulse on
the `SYS_RST_L` or `PLT_RST_L` signal: in this case the very first TPM_Startup
command sent by coreboot returns an error, and the Chrome OS device falls into
recovery mode.

## Cr50 operations synchronization

The H1 microcontroller is very slow (clocked at 24 MHz) and the AP is usually
hundreds of times faster, so the AP needs to be slowed down when it talks to
the TPM during the boot up process. The issue is complicated by the inability
of the I2C controller to stretch the clock.

In both I2C and SPI modes the `AP_INT_L` H1 output signal is used to indicate
to the AP that the H1 is ready for the next I2C or SPI transaction. By default
this signal is a 4+ us long low pulse. Some X86 platforms require a pulse of
100+ us; this pulse extension mode can be configured by setting a bit in a TPM
register (I2C register address 0x1c or SPI register address 0xfe0).

In any case it is important that the AP firmware properly configures the pin
where the `AP_INT_L` signal is connected as an edge sensitive GPIO, which
latches on either the falling or the rising edge of the signal.
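
A toy timing model shows why edge latching matters for such a short pulse. The
sketch below is purely illustrative: it compares firmware that samples the pin
level at a fixed period with a latch that records the edge. The polling period
and all function names are assumptions and do not describe any particular AP.

```python
def level_poll_sees_pulse(pulse_start_us, pulse_width_us, poll_period_us,
                          window_us):
    """Sample the pin level periodically; True if a sample lands in the pulse."""
    for t in range(0, window_us + 1, poll_period_us):
        if pulse_start_us <= t < pulse_start_us + pulse_width_us:
            return True
    return False


def edge_latch_sees_pulse(pulse_start_us, window_us):
    """An edge latch records the transition no matter when it is read back."""
    return 0 <= pulse_start_us <= window_us


# A 4 us pulse starting at t = 13 us, pin level sampled every 10 us.
print(level_poll_sees_pulse(13, 4, 10, 100))  # False: the pulse is missed
print(edge_latch_sees_pulse(13, 100))         # True: the edge was latched
```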

AP firmware missing these synchronization pulses results in the boot process
taking a very long time and the AP firmware log including messages

> Timeout wait for TPM IRQ!

in case of SPI or

> Cr50 i2c TPM IRQ timeout!

in case of I2C.
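
A saved AP firmware log can be scanned for these two messages. The following
sketch assumes the messages appear verbatim as quoted above; the helper name
and the sample log text are made up for the example.

```python
# Message text taken from this document, mapped to the bus it indicates.
IRQ_TIMEOUT_MESSAGES = {
    "Timeout wait for TPM IRQ!": "SPI",
    "Cr50 i2c TPM IRQ timeout!": "I2C",
}


def count_irq_timeouts(log_text):
    """Count TPM IRQ timeout messages in a firmware log, grouped by bus."""
    counts = {bus: 0 for bus in IRQ_TIMEOUT_MESSAGES.values()}
    for line in log_text.splitlines():
        for message, bus in IRQ_TIMEOUT_MESSAGES.items():
            if message in line:
                counts[bus] += 1
    return counts


# Hypothetical log excerpt; only the timeout lines come from this document.
sample_log = """starting boot
Timeout wait for TPM IRQ!
Timeout wait for TPM IRQ!
done
"""
print(count_irq_timeouts(sample_log))  # {'SPI': 2, 'I2C': 0}
```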

## Waking H1 up from sleep

The I2C Start sequence is sufficient for the H1 to resume operation; the AP
does not have to do anything special. In case of SPI, matters are more
complicated.

Technically speaking, the assertion of the CS SPI bus signal is enough to wake
up the H1, but it takes time for the H1 to become fully operational, and the
AP could already be transmitting the message by the time the H1 SPI controller
is ready. This is why, if the previous SPI transaction was a second or more
ago, the SPI driver is required to first issue a CS pulse without transferring
any data, just to wake up the H1, then wait for 100 us to let the H1 wake up,
and then continue with a regular SPI transaction.
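
The wake up rule can be sketched as a thin wrapper around a SPI transfer
function. This is a model of the protocol described above, not code from any
real driver: the class name, the `pulse_cs` and `transfer` callbacks, and the
choice to also send a wake pulse before the very first transaction are all
assumptions made for the illustration.

```python
import time

IDLE_THRESHOLD_S = 1.0  # previous transaction "a second or more ago"
WAKE_DELAY_S = 100e-6   # 100 us for the H1 to wake up


class SpiTpmLink:
    """Issues a data-less CS pulse before a transfer if the bus was idle."""

    def __init__(self, pulse_cs, transfer,
                 clock=time.monotonic, sleep=time.sleep):
        self._pulse_cs = pulse_cs   # hypothetical: toggles CS, sends no data
        self._transfer = transfer   # hypothetical: runs one SPI transaction
        self._clock = clock
        self._sleep = sleep
        self._last_xfer = None      # None: assume the H1 may be asleep

    def transact(self, payload):
        now = self._clock()
        idle = (self._last_xfer is None
                or now - self._last_xfer >= IDLE_THRESHOLD_S)
        if idle:
            self._pulse_cs()
            self._sleep(WAKE_DELAY_S)
        result = self._transfer(payload)
        self._last_xfer = self._clock()
        return result
```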

If the AP does not follow this protocol and starts transmitting before the H1
is ready, communications failures are likely, resulting in the Chrome OS
device falling into recovery. This often happens when the device takes a long
time to find the kernel to boot: the AP then tries to lock the TPM state
before starting up the kernel, but fails because the H1 was asleep by this
time and was not properly woken up.

## SPI Message Synchronization

The SPI interface is synchronous, and each read or write access happens within
a single transaction. The Trusted Computing Group (TCG) came up with a
hardware protocol on top of the SPI specification to allow the slow device to
flow control the fast host controller.

The basic idea is that each time the AP wants to read or write a TPM register,
it sends a SPI packet, which consists of the header and data fields.

The header field is always present. It is 4 bytes in size and includes the
operation type (read or write), data length and register address.
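
As an illustration, a header of this kind could be packed as shown below. The
exact bit layout is not given in this document; the sketch assumes the layout
commonly described for the TCG SPI protocol (bit 7 of the first byte set for a
read, the low six bits holding the length minus one, followed by a 24 bit
address, most significant byte first). Treat that layout as an assumption and
check it against the specification before relying on it.

```python
def build_tpm_spi_header(is_read, length, address):
    """Pack a 4 byte header: operation type, data length, register address.

    Assumed layout (not taken from this document):
      byte 0: bit 7 = 1 for read, 0 for write; bits 5..0 = length - 1
      bytes 1..3: 24 bit register address, most significant byte first
    """
    if not 1 <= length <= 64:
        raise ValueError("length must be between 1 and 64 bytes")
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("address must fit in 24 bits")
    first = (0x80 if is_read else 0x00) | (length - 1)
    return bytes([first]) + address.to_bytes(3, "big")


# Read of four bytes from register address 0xd40f00.
print(build_tpm_spi_header(True, 4, 0xD40F00).hex())  # 83d40f00
```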

The header is sent out as soon as the SPI transaction starts. The AP then
starts monitoring the MISO line, one byte at a time, paying attention to bit
D0. The Cr50 keeps sending zeros on that bit until it is ready to proceed with
the operation requested in the transaction header. Once the Cr50 is ready, it
responds with a byte with bit D0 set to one. At this point the AP knows that,
starting with the next byte, the actual data of the transaction can flow, so
it either sends the data in case of a write or reads it from the TPM in case
of a read.
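
The flow control step can be modeled on a captured byte stream. The sketch
below works on the bytes seen from the TPM after the header has been sent; it
is a trace decoder written for illustration, not a driver, and the function
name, the `max_polls` limit and the sample data bytes are invented.

```python
def read_after_flow_control(miso_bytes, length, max_polls=1000):
    """Skip stall bytes until bit D0 is set, then return the data bytes.

    miso_bytes: bytes received from the TPM after the 4 byte header.
    Returns (number_of_stall_bytes, data).
    """
    for index, value in enumerate(miso_bytes[:max_polls]):
        if value & 0x01:
            data = miso_bytes[index + 1:index + 1 + length]
            if len(data) != length:
                raise ValueError("trace ends before all data bytes arrived")
            return index, bytes(data)
    raise TimeoutError("TPM never set bit D0 within the polling limit")


# Three stall bytes, one ready byte, then four made-up data bytes.
trace = bytes([0x00, 0x00, 0x00, 0x01, 0x11, 0x22, 0x33, 0x44])
print(read_after_flow_control(trace, 4))
```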

This is described in detail in [TCG PC Client Platform TPM Profile (PTP)
Specification Family "2.0" Level 00 Revision
00.43](https://drive.google.com/file/d/16r1vDhf1fnggI4BkOBuTXPqOQt4LaFvk/view?usp=sharing)
in section "6.4 Spi Hardware Protocol".

The AP ignoring this flow control mechanism is yet another common problem
causing failures to boot, because the driver starts sending or receiving data
before the TPM is ready. This failure is more likely to happen when developing
new SPI drivers.

## Boot up process examples

A trace of a typical Chrome OS device boot process was collected using the
[Saleae](https://www.saleae.com/) Logic Pro 16 logic analyzer.

The [full trace](./images/bobba_boot.sal) can be examined in detail using the
Saleae application in the trace analysis mode.

A few detailed snapshots of this trace are shown below (click to expand):

### Full boot sequence

[![Full boot sequence](./images/typical_boot.png)][1] shows communications
between the AP and the H1 during a typical Chrome OS boot: first a flurry of
communications between coreboot and the H1, then some time spent verifying and
loading various firmware stages, then a block of communications between
Depthcharge and the H1.

### Typical read sequence

[![Typical read sequence](./images/typical_read.png)][2] shows the 4 byte
header where the read of four bytes from register address 0xd40f00 is
requested. The TPM is not ready and sends all zeros on the MISO line for three
cycles, then sends a byte of 01, and then the AP reads four bytes of the
actual register value (0xe01a2800). Then, once the H1 is ready to accept the
next SPI transaction, it generates a pulse on `AP_INT_L`.

### Read with wake pulse sequence

[![Read with wake pulse](./images/read_with_wake_pulse.png)][3] is an example
of a case where the AP toggles the CS line first, without sending any data,
and then 100 us later starts the actual SPI transaction, which completes with
the `AP_INT_L` pulse.

[1]:https://drive.google.com/file/d/16Z_Nw1e6z5akUnyLZyI8ivfT5frxKPQh/view
[2]:https://drive.google.com/file/d/1weBd6kBiXoQ0I3TGmbpiHZm0dimByYnI/view
[3]:https://drive.google.com/file/d/13ZSP3up4leG0Etqo4A_gkFK1MeptGDCw/view