.. _module-pw_snapshot-setup:

==============================
Setting up a Snapshot Pipeline
==============================

-------------------
Crash Handler Setup
-------------------
The Snapshot proto was designed first and foremost as a crash reporting format.
This section covers how to set up a crash handler to capture Snapshots.

.. image:: images/generic_crash_flow.svg
  :width: 600
  :alt: Generic crash handler flow

A typical crash handler has two entry points:

1. A software entry path through developer-written ASSERT() or CHECK() calls
   that indicate a device should go down for a crash if a condition is not met.
2. A hardware-triggered exception handler path that is initiated when a CPU
   encounters a fault signal (invalid memory access, bad instruction, etc.).

Before deferring to a common crash handler, these entry paths should disable
interrupts to force the system into a single-threaded execution mode. This
prevents other threads from operating on potentially bad data or clobbering
system state that could be useful for debugging.

The first step in a crash handler should always be a check for nested crashes to
prevent infinitely recursive crashes. Once it is deemed safe to continue, the
crash handler can re-initialize logging, initialize storage for crash report
capture, and then build a snapshot to later be retrieved from the device. Once
the crash report collection process is complete, some post-crash callbacks can
be run on a best-effort basis to clean up the system before rebooting. For
devices with debug port access, it's helpful to optionally hold the device in
an infinite loop rather than resetting to allow developers to access the device
via a hardware debugger.
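
The ordering described above can be sketched as a single function that runs
each stage in turn. This is only an illustration under stated assumptions:
``CrashHooks``, ``FinalAction``, and ``RunCrashFlow()`` are hypothetical names
that are not part of Pigweed, and the sketch returns the final action instead
of resetting or spinning so that the ordering is easy to inspect.

.. code-block:: cpp

   // Hypothetical sketch; none of these names are Pigweed APIs.
   enum class FinalAction { kReset, kHoldForDebugger };

   // Project-provided hooks for each stage of the crash flow.
   struct CrashHooks {
     bool (*is_nested_crash)();
     void (*init_logging)();
     void (*init_storage)();
     void (*capture_snapshot)();
     void (*run_cleanup_callbacks)();
   };

   // Assumes the entry path (assert or exception handler) has already
   // disabled interrupts.
   FinalAction RunCrashFlow(const CrashHooks& hooks, bool hold_for_debugger) {
     const FinalAction final_action = hold_for_debugger
                                          ? FinalAction::kHoldForDebugger
                                          : FinalAction::kReset;
     // Always check for nested crashes first; skip all other work if nested.
     if (hooks.is_nested_crash()) {
       return final_action;
     }
     hooks.init_logging();
     hooks.init_storage();
     hooks.capture_snapshot();
     // Best-effort cleanup runs only after the snapshot has been captured.
     hooks.run_cleanup_callbacks();
     return final_action;
   }

A real handler would act on the returned value by resetting the device or by
spinning in a loop until a debugger attaches.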

Assert Handler Setup
====================
:ref:`pw_assert <module-pw_assert>` is Pigweed's entry point for software
crashes. Route any existing assert functions through pw_assert to centralize the
software crash path. You’ll need to create a :ref:`pw_assert backend
<module-pw_assert-backend_api>` or a custom :ref:`pw_assert_basic handler
<module-pw_assert_basic-custom_handler>` to pass collected information to a more
sophisticated crash handler. One way to do this is to collect the data into a
statically allocated struct that is passed to a common crash handler. It’s
important to immediately disable interrupts to prevent the system from doing
other things while in an impacted state.

.. code-block:: cpp

   // This can be directly accessed by a crash handler.
   static CrashData crash_data;
   extern "C" void pw_assert_basic_HandleFailure(const char* file_name,
                                                 int line_number,
                                                 const char* format,
                                                 ...) {
     // Always disable interrupts first! How this is done depends
     // on your platform.
     __disable_irq();

     // `args` lives on this stack frame. Storing its address is only safe
     // because HandleCrash() never returns.
     va_list args;
     va_start(args, format);
     crash_data.file_name = file_name;
     crash_data.line_number = line_number;
     crash_data.reason_fmt = format;
     crash_data.reason_args = &args;
     crash_data.cpu_state = nullptr;

     HandleCrash(crash_data);
     PW_UNREACHABLE;
   }

Exception Handler Setup
=======================
:ref:`pw_cpu_exception <module-pw_cpu_exception>` is Pigweed's recommended entry
point for CPU-triggered faults (divide by zero, invalid memory access, etc.).
You will need to provide a definition for pw_cpu_exception_DefaultHandler() that
passes the exception state produced by pw_cpu_exception to your common crash
handler.

.. code-block:: cpp

   static CrashData crash_data;
   // This helper turns a format string to a va_list that can be used by the
   // common crash handling path.
   void HandleExceptionWithString(pw_cpu_exception_State& state,
                                  const char* fmt,
                                  ...) {
     va_list args;
     va_start(args, fmt);
     crash_data.cpu_state = &state;
     crash_data.file_name = nullptr;
     crash_data.line_number = 0;
     crash_data.reason_fmt = fmt;
     crash_data.reason_args = &args;

     HandleCrash(crash_data);
     PW_UNREACHABLE;
   }

   extern "C" void pw_cpu_exception_DefaultHandler(
       pw_cpu_exception_State* state) {
     // Always disable interrupts first! How this is done depends
     // on your platform.
     __disable_irq();

     // The CFSR is an extremely useful register for understanding ARMv7-M and
     // ARMv8-M CPU faults. Other architectures should put something else here.
     // PRIx32 comes from <cinttypes>.
     HandleExceptionWithString(*state,
                               "Exception encountered, cfsr=0x%08" PRIx32,
                               state->extended.cfsr);
   }

Common Crash Handler Setup
==========================
To minimize duplication of crash handling logic, it's good practice to route the
pw_assert and pw_cpu_exception handlers to a common crash handling codepath.
Ensure you can pass both pw_cpu_exception's CPU state and pw_assert's assert
information to the shared handler.

.. code-block:: cpp

   struct CrashData {
     pw_cpu_exception_State *cpu_state;
     const char *reason_fmt;
     const va_list *reason_args;
     const char *file_name;
     int line_number;
   };

   // This function assumes interrupts are properly disabled BEFORE it is called.
   [[noreturn]] void HandleCrash(CrashData& crash_info) {
     // Handle crash
   }

In the crash handler your project can re-initialize a minimal subset of the
system needed to safely capture a snapshot before rebooting the device. The
remainder of this section focuses on ways you can improve the reliability and
usability of your project's crash handler.

Check for Nested Crashes
------------------------
It’s important to include crash handler checks that prevent infinite recursive
nesting of crashes. Maintain a static variable that tracks the crash nesting
depth. After one or two nested crashes, abort crash handling entirely and reset
the device or sit in an infinite loop to wait for a hardware debugger to attach.
It’s simpler to put this logic at the beginning of the shared crash handler, but
if your assert/exception handlers are complex it might be safer to inject the
checks earlier in both codepaths.

.. code-block:: cpp

   [[noreturn]] void HandleCrash(CrashData &crash_info) {
     static size_t crash_depth = 0;
     if (crash_depth > kMaxCrashDepth) {
       Abort(/*run_callbacks=*/false);
     }
     crash_depth++;
     ...
   }

Re-initialize Logging (Optional)
--------------------------------
Logging can be helpful for debugging your crash handler, but depending on your
device/system design it may be challenging to support safely at crash time. To
re-initialize logging, you’ll need to re-construct C++ objects and re-initialize
any systems/hardware in the logging codepath. You may even need an entirely
separate logging pipeline that is single-threaded and interrupt-safe.
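
One possible shape for a separate, single-threaded crash logging path is
sketched below. It formats into a static buffer and pushes bytes synchronously
through a blocking output routine, so it needs no locks, heap, or interrupts.
``CrashLogInit()``, ``CrashLog()``, and ``PutCharFn`` are hypothetical names
used for illustration and are not Pigweed APIs. The output routine is assumed
to be something like a polled UART transmit function.

.. code-block:: cpp

   #include <cstdarg>
   #include <cstddef>
   #include <cstdio>

   // Hypothetical blocking byte sink, e.g. a polled UART transmit routine.
   using PutCharFn = void (*)(char);

   namespace {
   PutCharFn g_put_char = nullptr;  // Constant-initialized, no constructor.
   char g_line_buffer[128];
   }  // namespace

   // Points the crash logger at an output routine that is safe at crash time.
   void CrashLogInit(PutCharFn put_char) { g_put_char = put_char; }

   // Formats one line and writes it synchronously, followed by a newline.
   // Returns the number of bytes emitted, or 0 if logging is unavailable.
   std::size_t CrashLog(const char* format, ...) {
     if (g_put_char == nullptr) {
       return 0;
     }
     va_list args;
     va_start(args, format);
     const int result =
         std::vsnprintf(g_line_buffer, sizeof(g_line_buffer), format, args);
     va_end(args);
     if (result < 0) {
       return 0;
     }
     std::size_t length = static_cast<std::size_t>(result);
     if (length >= sizeof(g_line_buffer)) {
       length = sizeof(g_line_buffer) - 1;  // Output was truncated.
     }
     for (std::size_t i = 0; i < length; ++i) {
       g_put_char(g_line_buffer[i]);
     }
     g_put_char('\n');
     return length + 1;
   }

Because the logger does nothing until ``CrashLogInit()`` is called, a project
can leave it uninitialized when crash-time logging is not safe to support.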

Reinitialize Dependencies
-------------------------
It's good practice to design a crash handler that can run before C++ static
constructors have run. This means any initialization (whether manual or through
constructors) that your crash handler depends on should be manually invoked at
crash time. If an initialization step might not be safe, evaluate if it's
possible to omit the dependency.
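
A minimal sketch of this pattern follows, assuming a dependency whose state is
a plain aggregate with no constructor. ``FlashDriver``, ``FlashInit()``, and
``FlashReinitForCrash()`` are hypothetical names, not Pigweed APIs. The object
is zero-initialized along with other static storage, so it is usable even if
dynamic initialization never ran, and the crash path forces a fresh
initialization instead of trusting earlier state.

.. code-block:: cpp

   #include <cstdint>

   // Hypothetical dependency with no constructor. Zero-initialization gives
   // it a well-defined "not initialized" state.
   struct FlashDriver {
     bool initialized;
     std::uint32_t init_count;
   };

   FlashDriver g_flash;  // Static storage: zero-initialized.

   // Idempotent initialization used on the normal boot path.
   void FlashInit(FlashDriver& driver) {
     if (driver.initialized) {
       return;
     }
     // Hardware configuration would go here.
     driver.init_count++;
     driver.initialized = true;
   }

   // Crash path: discard whatever state existed before the crash and
   // initialize again from scratch.
   void FlashReinitForCrash(FlashDriver& driver) {
     driver.initialized = false;
     FlashInit(driver);
   }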

System Cleanup
--------------
After collecting a snapshot, some parts of your system may benefit from
cleanup before explicitly resetting a device. This might include flushing
buffers or safely shutting down attached hardware. The order of shutdown should
be deterministic, keeping in mind that any of these steps may cause a nested
crash that skips the remainder of the handlers and forces the device to
immediately reset.
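
One way to keep the shutdown order deterministic is to run cleanup steps from
a fixed table, as in the sketch below. ``CleanupFn`` and ``RunCleanupSteps()``
are hypothetical names, not Pigweed APIs. The ``run_callbacks`` flag is assumed
to be false when crash handling is aborted after a nested crash, so cleanup is
skipped entirely on that path.

.. code-block:: cpp

   #include <cstddef>

   // Hypothetical sketch; none of these names are Pigweed APIs.
   using CleanupFn = void (*)();

   // Runs cleanup steps strictly in table order and returns how many ran.
   // Pass run_callbacks=false on the nested-crash abort path so that cleanup
   // is skipped and the device can reset immediately.
   std::size_t RunCleanupSteps(const CleanupFn* steps,
                               std::size_t step_count,
                               bool run_callbacks) {
     if (!run_callbacks || steps == nullptr) {
       return 0;
     }
     std::size_t steps_run = 0;
     for (std::size_t i = 0; i < step_count; ++i) {
       if (steps[i] != nullptr) {
         steps[i]();
         ++steps_run;
       }
     }
     return steps_run;
   }

Keeping the table in one place makes the order reviewable: steps that matter
most, or that are least likely to fault, can be listed first.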

----------------------
Snapshot Storage Setup
----------------------
Use a storage class with a ``pw::stream::Writer`` interface to simplify
capturing a pw_snapshot proto. This can be a :ref:`pw::BlobStore
<module-pw_blob_store>`, an in-memory buffer that is flushed to flash, or a
:ref:`pw::PersistentBuffer <module-pw_persistent_ram-persistent_buffer>` that
lives in persistent memory. It's good practice to use lazy initialization for
storage objects used by your Snapshot capture codepath.

.. code-block:: cpp

   // Persistent RAM objects are highly available. They don't rely on
   // their constructor being run, and require no initialization.
   PW_PLACE_IN_SECTION(".noinit")
   pw::persistent_ram::PersistentBuffer<2048> persistent_snapshot;

   void CaptureSnapshot(CrashData& crash_info) {
     ...
     persistent_snapshot.clear();
     auto writer = persistent_snapshot.GetWriter();
     ...
   }

----------------------
Snapshot Capture Setup
----------------------

.. note::

  These instructions do not yet use the ``pw::protobuf::StreamEncoder``.

Capturing a snapshot is as simple as encoding any other proto message. Some
modules provide helper functions that will populate parts of a Snapshot, which
eases the burden of custom work that must be set up uniquely for each project.

Capture Reason
==============
A snapshot's "reason" should be considered the single most important field in a
captured snapshot. If a snapshot capture was triggered by a crash, this should
be the assert string. Other entry paths should describe here why the snapshot
was captured ("Host communication buffer full!", "Exception encountered at
0x00000004", etc.).

.. code-block:: cpp

   Status CaptureSnapshot(CrashData& crash_info) {
     // Temporary buffer for encoding "reason" to.
     static std::byte temp_buffer[500];
     // Temporary buffer to encode serialized proto to before dumping to the
     // final ``pw::stream::Writer``.
     static std::byte proto_encode_buffer[512];
     ...
     pw::protobuf::NestedEncoder<kMaxDepth> proto_encoder(proto_encode_buffer);
     pw::snapshot::Snapshot::Encoder snapshot_encoder(&proto_encoder);
     // The reason arrives as a format string plus a va_list, so format it with
     // vsnprintf() and clamp the length in case the output was truncated.
     int written = vsnprintf(reinterpret_cast<char*>(temp_buffer),
                             sizeof(temp_buffer),
                             crash_info.reason_fmt,
                             *crash_info.reason_args);
     size_t length = written < 0 ? 0
                                 : std::min(static_cast<size_t>(written),
                                            sizeof(temp_buffer) - 1);
     snapshot_encoder.WriteReason(temp_buffer, length);

     // Final encode and write.
     Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
     PW_TRY(encoded_proto.status());
     PW_TRY(writer.Write(encoded_proto.value()));
     ...
   }
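
The reason-formatting step can be exercised on its own, independent of the
proto encoder. The sketch below is a standalone illustration, not Pigweed code:
``FormatReason()`` and ``FormatReasonVariadic()`` are hypothetical helpers. It
assumes the crash data carries a non-const ``va_list*`` (unlike the ``const
va_list*`` shown earlier) so that ``va_copy()`` can be used portably, and it
clamps the returned length because ``vsnprintf()`` reports the length the
output would have had without truncation.

.. code-block:: cpp

   #include <cstdarg>
   #include <cstddef>
   #include <cstdio>

   // Formats a crash reason into `buffer` and returns the number of characters
   // stored, excluding the null terminator. Returns 0 on invalid input or on a
   // formatting error.
   std::size_t FormatReason(char* buffer,
                            std::size_t buffer_size,
                            const char* format,
                            va_list* args) {
     if (buffer == nullptr || buffer_size == 0 || format == nullptr ||
         args == nullptr) {
       return 0;
     }
     // Work on a copy so the caller's va_list is left untouched.
     va_list args_copy;
     va_copy(args_copy, *args);
     const int result = std::vsnprintf(buffer, buffer_size, format, args_copy);
     va_end(args_copy);
     if (result < 0) {
       return 0;
     }
     const std::size_t wanted = static_cast<std::size_t>(result);
     // vsnprintf() returns the untruncated length; clamp to what fits.
     return wanted < buffer_size ? wanted : buffer_size - 1;
   }

   // Hypothetical variadic entry point, mirroring how an assert handler would
   // hand its arguments to the shared crash path.
   std::size_t FormatReasonVariadic(char* buffer,
                                    std::size_t buffer_size,
                                    const char* format,
                                    ...) {
     va_list args;
     va_start(args, format);
     const std::size_t length = FormatReason(buffer, buffer_size, format, &args);
     va_end(args);
     return length;
   }

The clamped length is what should be passed on to the encoder, so a reason
that is longer than the temporary buffer is truncated rather than read past
the end of the buffer.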

Capture CPU State
=================
When using pw_cpu_exception, exceptions will automatically collect CPU state
that can be directly dumped into a snapshot. As it's not always easy to describe
a CPU exception in a single "reason" string, the captured state provides the
information needed to automatically generate a more detailed reason at analysis
time once the snapshot is retrieved from the device.

.. code-block:: cpp

   Status CaptureSnapshot(CrashData& crash_info) {
     ...

     proto_encoder.clear();

     // Write CPU state.
     if (crash_info.cpu_state) {
       PW_TRY(DumpCpuStateProto(snapshot_encoder.GetArmv7mCpuStateEncoder(),
                                *crash_info.cpu_state));

       // Final encode and write.
       Result<ConstByteSpan> encoded_proto = proto_encoder.Encode();
       PW_TRY(encoded_proto.status());
       PW_TRY(writer.Write(encoded_proto.value()));
     }
     return OkStatus();
   }

-----------------------
Snapshot Transfer Setup
-----------------------
Pigweed’s pw_rpc system is well suited for retrieving a snapshot from a device.
Pigweed does not yet provide a generalized transfer service for moving files
to/from a device. When this feature is added to Pigweed, this section will be
updated to include guidance for connecting a storage system to a transfer
service.

----------------------
Snapshot Tooling Setup
----------------------
When using the upstream ``Snapshot`` proto, you can directly use
``pw_snapshot.process`` to process snapshots into human-readable dumps. If
you've opted to extend Pigweed's snapshot proto, you'll likely want to extend
the processing tooling to handle custom project data as well. This can be done
by creating a light wrapper around
``pw_snapshot.processor.process_snapshots()``.

.. code-block:: python

   import pw_snapshot.processor

   # wheel_state_pb2 is the project's own generated proto module; import it
   # from wherever your build places it.


   def _process_hw_failures(serialized_snapshot: bytes) -> str:
       """Custom handler that checks wheel state."""
       wheel_state = wheel_state_pb2.WheelStateSnapshot()
       output = []
       wheel_state.ParseFromString(serialized_snapshot)

       if len(wheel_state.wheels) != 2:
           output.append(f'Expected 2 wheels, found {len(wheel_state.wheels)}')

       if len(wheel_state.wheels) < 2:
           output.append('Wheels fell off!')

       # And more...

       return '\n'.join(output)


   def process_my_snapshots(serialized_snapshot: bytes) -> str:
       """Runs the snapshot processor with a custom callback."""
       return pw_snapshot.processor.process_snapshots(
           serialized_snapshot, user_processing_callback=_process_hw_failures)
327