.. _seed-0105:

===============================================
0105: Nested Tokens and Tokenized Log Arguments
===============================================

.. seed::
   :number: 105
   :name: Nested Tokens and Tokenized Log Arguments
   :status: Accepted
   :proposal_date: 2023-07-10
   :cl: 154190
   :authors: Gwyneth Chen
   :facilitator: Wyatt Hepler

-------
Summary
-------
This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_
and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to
improve support for nesting tokens and add facilities for tokenizing arguments
to logs such as strings and enums. This SEED primarily addresses C/C++
tokenization and Python/C++ detokenization.

----------
Motivation
----------
Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited
memory to store long log format strings as hashed 32-bit tokens. When logs are
moved off-device, host tooling can recover the full logs using token databases
that were created when building the device image. However, logs may still have
runtime string arguments that are stored and transferred 1:1 without additional
encoding. This SEED aims to extend tokenization to these arguments to further
reduce the weight of logging for embedded applications.

The proposed changes affect both the tokenization module itself and the logging
facilities built on top of tokenization.

--------
Proposal
--------
Logging enums such as ``pw::Status`` is one common special case where
tokenization is particularly appropriate: enum values are conceptually
already tokens mapping to their names, assuming no duplicate values. Logging
enums frequently entails creating functions and string names that occupy space
exclusively for logging purposes, which this proposal seeks to mitigate.
Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
the several transformations that strings undergo during tokenization and
detokenization, further complicated in the proposed design by nested tokens.

.. list-table:: Enum Tokenization/Detokenization Phases
   :widths: 20 45

   * - (1) Source code
     - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
   * - (2) Token database entries (token, string, domain)
     - | ``16170adf, "Status: ${pw::Status}#%08x", ""``
       | ``5       , "PW_STATUS_NOT_FOUND"       , "pw::Status"``
   * - (3) Wire format
     - ``df 0a 17 16 0a`` (5 bytes)
   * - (4) Top-level detokenized and formatted
     - ``"Status: ${pw::Status}#00000005"``
   * - (5) Fully detokenized
     - ``"Status: PW_STATUS_NOT_FOUND"``

Compared to log tokenization without nesting, string literals in token
database entries may not be identical to what is typed in source code due
to the use of macros and preprocessor string concatenation. The
detokenizer also takes an additional step to recursively detokenize any
nested tokens. In exchange for this added complexity, nested enum tokenization
allows us to gain the readability of logging value names with zero additional
runtime space or performance cost compared to logging the integral values
directly with ``pw_log_tokenized``.
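
The 5-byte wire format in row (3) can be reproduced with a short sketch. This
assumes the encoding ``pw_tokenizer`` documents for tokenized messages: the
32-bit token in little-endian byte order, followed by each integer argument as
a ZigZag-encoded varint (other argument types are omitted here):

```python
import struct

def zigzag(n: int) -> int:
    """ZigZag-map a signed 32-bit integer so small magnitudes stay small."""
    return (n << 1) ^ (n >> 31)

def varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def encode_tokenized_log(token: int, *int_args: int) -> bytes:
    payload = struct.pack('<I', token)  # 4-byte little-endian token
    for arg in int_args:
        payload += varint(zigzag(arg))
    return payload

print(encode_tokenized_log(0x16170ADF, 5).hex(' '))  # df 0a 17 16 0a
```

Here ``0x16170adf`` is the format-string token and ``5`` is ``status.code()``;
ZigZag maps 5 to 10 (``0x0a``), so the whole log costs five bytes regardless of
the length of the detokenized message.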

.. note::
  Without nested enum token support, users can select either readability or
  reduced binary and transmission size, but not easily both:

  .. list-table::
    :widths: 15 20 20
    :header-rows: 1

    * -
      - Raw integers
      - String names
    * - (1) Source code
      - ``PW_LOG("Status: %x", status.code())``
      - ``PW_LOG("Status: %s", pw_StatusString(status))``
    * - (2) Token database entries (token, string, domain)
      - ``03a83461, "Status: %x", ""``
      - ``069c3ef0, "Status: %s", ""``
    * - (3) Wire format
      - ``61 34 a8 03 0a`` (5 bytes)
      - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
    * - (4) Top-level detokenized and formatted
      - ``"Status: 5"``
      - ``"Status: PW_STATUS_NOT_FOUND"``
    * - (5) Fully detokenized
      - ``"Status: 5"``
      - ``"Status: PW_STATUS_NOT_FOUND"``

Tokenization (C/C++)
====================
The ``pw_log_tokenized`` module exposes a set of macros for creating and
formatting nested tokens. Within format strings in the source code, tokens
are specified using function-like PRI-style macros. These can be used to
encode static information like the token domain or a numeric base encoding,
and are macro-expanded to string literals that are concatenated with the
rest of the format string during preprocessing. Since ``pw_log`` generally
uses printf syntax, only bases 8, 10, and 16 are supported for integer token
arguments via ``%[odiuxX]``.

The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching between ``pw_log`` backends like
``pw_log_basic``. These macros for basic usage are as follows:

* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
  referenced, e.g. as a ``ToString`` function return type.

.. code-block:: cpp

   #include "pw_log/log.h"
   #include "pw_log/tokenized_args.h"

   // token with default options base-16 and empty domain
   // token database literal: "The sun will come out $#%08x!"
   PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!",
          PW_LOG_TOKEN_EXPR("tomorrow"))
   // after detokenization: "The sun will come out tomorrow!"
131
132Additional macros are also provided specifically for enum handling. The
133``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
134value with the specified token domain to prevent token collision between
135multiple tokenized enums. This macro is kept separate from the enum
136definition to allow things like tokenizing a preexisting enum defined in an
137external dependency.
138
139.. code-block:: cpp
140
141   // enums
142   namespace foo {
143
144     enum class Color { kRed, kGreen, kBlue };
145
146     // syntax TBD
147     TOKENIZE_ENUM(
148       foo::Color,
149       kRed,
150       kGreen,
151       kBlue
152     )
153
154   } // namespace foo
155
156   void LogColor(foo::Color color) {
157     // token database literal:
158     // "Color: [${foo::Color}10#%010d]"
159     PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color)
160     // after detokenization:
161     // e.g. "Color: kRed"
162   }

.. admonition:: Nested Base64 tokens

  ``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
  which case the argument should be a pre-encoded Base64 string
  (e.g. ``QAzF39==``). However, this should be avoided when possible to
  maximize space savings. Fully formatted Base64, including the token prefix,
  may also be logged with ``%s`` as before.

Detokenization (Python)
=======================
``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
will automatically and recursively detokenize tokens of all known formats
rather than requiring a separate call to ``detokenize_base64`` or similar.

To support detokenizing domain-specific tokens, token databases support multiple
domains, and ``database.py create`` will build a database with tokens from all
domains by default. Specifying a domain during database creation will cause
that domain to be treated as the default.

When detokenization fails, tokens appear as-is in logs. If the detokenizer has
the ``show_errors`` option set to ``True``, error messages may be printed
inline following the raw token.

Tokens
======
Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro, so
users should typically not be manually formatting tokens. However, if
detokenization fails for any reason, tokens will appear with the following
format in the final logs and should be easily recognizable.

Nested tokens have the following structure in partially detokenized logs
(transformation stage 4):

.. code-block::

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

.. list-table:: Options
   :widths: 10 30

   * - ``{DOMAIN}``
     - Specifies the token domain. If this option is omitted, the default
       (empty) domain is assumed.
   * - ``BASE#``
     - Defines the numeric base encoding of the token. Accepted values are 8,
       10, 16, and 64. If the hash symbol ``#`` is used without specifying a
       number, the base is assumed to be 16. If the base option is omitted
       entirely, the base defaults to 64 for backward compatibility. All
       encodings except Base64 are case-insensitive.

       This option may be expanded to support other bases in the future.
   * - ``TOKEN`` (required)
     - The numeric representation of the token in the given base encoding. All
       encodings except Base64 are left-padded with zeroes to the maximum width
       of a 32-bit integer in the given base. Base64 data may additionally encode
       string arguments for the detokenized token, and therefore does not have a
       maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for
       supported bases.
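
``PW_LOG_TOKEN_FMT`` handles this padding automatically, but the fixed-width
forms can be sketched directly. The widths below are the digits needed to
render ``0xFFFFFFFF`` in each base; Base64 is omitted since it is not fixed
width. The ``format_token`` helper is illustrative, not part of the API:

```python
WIDTHS = {8: 11, 10: 10, 16: 8}  # digits covering a full 32-bit value
CONV = {8: 'o', 10: 'd', 16: 'X'}  # printf-style conversions per base

def format_token(token: int, base: int = 16, domain: str = '') -> str:
    """Render a nested token as it appears in partially detokenized logs."""
    domain_part = '{%s}' % domain if domain else ''
    base_part = '#' if base == 16 else f'{base}#'  # bare '#' implies base 16
    digits = format(token, CONV[base]).zfill(WIDTHS[base])
    return f'${domain_part}{base_part}{digits}'

print(format_token(86025943, base=10))                   # $10#0086025943
print(format_token(0x1A, domain='foo_namespace::MyEnum'))
# ${foo_namespace::MyEnum}#0000001A
```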

When used in conjunction with ``pw_log_tokenized``, the token prefix (including
any domain and base specifications) is tokenized as part of the log format
string and therefore incurs zero additional memory or transmission cost over
that of the original format string. Over the wire, tokens in bases 8, 10, and
16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
tokens continue to be encoded as strings.

.. warning::
  Tokens do not have a terminating character in general, which is why we
  require them to be formatted with fixed width. Otherwise, immediately
  following a token with alphanumeric characters valid in its base encoding
  would cause detokenization errors.
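
A small regex experiment illustrates the problem (the patterns are
illustrative, not the ones ``pw_tokenizer`` uses):

```python
import re

FIXED = re.compile(r'\$#([0-9A-Fa-f]{8})')   # exactly 8 hex digits
GREEDY = re.compile(r'\$#([0-9A-Fa-f]+)')    # unpadded: no way to stop

# A fixed-width token followed by valid hex characters still parses cleanly:
assert FIXED.search('$#0000001AFF').group(1) == '0000001A'
# An unpadded token would absorb the trailing characters into the token:
assert GREEDY.search('$#1AFF').group(1) == '1AFF'
```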

.. admonition:: Recognizing raw nested tokens in strings

  When a string is fully detokenized, there should no longer be any indication
  of tokenization in the final result, e.g. detokenized logs should read the
  same as plain string logs. However, if nested tokens cannot be detokenized for
  any reason, they will appear in their raw form as below:

  .. code-block::

     // Base64 token with no arguments and empty domain
     $QA19pfEQ

     // Base-10 token
     $10#0086025943

     // Base-16 token with specified domain
     ${foo_namespace::MyEnum}#0000001A

     // Base64 token with specified domain
     ${bar_namespace::MyEnum}QAQQQQ==
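
All four raw forms can be recognized with a single pattern along these lines
(an illustrative sketch, not the exact regex used by the detokenizer):

```python
import re

# $ prefix, optional {DOMAIN}, optional BASE# (bare '#' means base 16),
# then fixed-width digits or Base64 data with up to two '=' padding chars.
NESTED_TOKEN = re.compile(
    r'\$'
    r'(?:\{(?P<domain>[^}]*)\})?'
    r'(?:(?P<base>8|10|16)?#)?'
    r'(?P<token>[A-Za-z0-9+/]+={0,2})'
)

samples = [
    '$QA19pfEQ',
    '$10#0086025943',
    '${foo_namespace::MyEnum}#0000001A',
    '${bar_namespace::MyEnum}QAQQQQ==',
]
for sample in samples:
    assert NESTED_TOKEN.fullmatch(sample), sample
```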


---------------------
Problem investigation
---------------------
Complex embedded device projects are perpetually seeking more RAM. Even a
handful of longer descriptive string arguments can take up hundreds of bytes
that exist exclusively for logging purposes, without any impact on device
function.

One of the most common potential use cases is logging enum values.
Inspection of one project revealed that enums accounted for some 90% of the
string log arguments. We have encountered instances where, to save space,
developers have avoided logging descriptive names in favor of raw enum values,
forcing readers of logs to look up or memorize the meaning of each number. As
with log format strings, we do know the set of possible string values that
might be emitted in the final logs, so they should be able to be extracted
into a token database at compile time.

Another major challenge overall is maintaining a user interface
that is easy to understand and use. The current primary interface through
``pw_log`` provides printf-style formatting, which is familiar and succinct
for basic applications.

We also have to contend with the interchangeable backends of ``pw_log``. The
``pw_log`` facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging would break this abstraction barrier. Either
this additional syntax would be ignored by other backends, or it might simply
be incompatible (e.g. logging raw integer tokens instead of strings).

Pigweed already supports one form of nested tokens via Base64 encoding. Base64
tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
with one or two trailing ``'='`` symbols. The Python
``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
running a regex replacement on the formatted results of each iteration. Base64
is not merely a token format, however; it can encode any binary data in a text
format at the cost of reduced efficiency. Therefore, Base64 tokens may include
not only a database token that may detokenize to a format string but also
binary-encoded arguments. Other token types are not expected to include this
additional argument data.
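
A minimal sketch of this existing Base64 form, assuming the layout described
above (a 4-byte little-endian database token followed by any binary-encoded
argument data):

```python
import base64
import struct

def encode_base64_token(token: int, arg_data: bytes = b'') -> str:
    """Wrap a 4-byte token plus argument bytes in the $-prefixed Base64 form."""
    return '$' + base64.b64encode(struct.pack('<I', token) + arg_data).decode()

def decode_base64_token(text: str):
    """Recover the (token, argument bytes) pair from a Base64 nested token."""
    raw = base64.b64decode(text[1:])  # strip the '$' prefix
    token, = struct.unpack('<I', raw[:4])
    return token, raw[4:]

encoded = encode_base64_token(0x16170ADF, b'\x0a')
assert decode_base64_token(encoded) == (0x16170ADF, b'\x0a')
```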

---------------
Detailed design
---------------

Tokenization
============
``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary
functionality to support tokenized arguments. The proposed API is fully
backward-compatible with non-nested tokenized logging.

Token arguments are indicated in log format strings via PRI-style macros that
are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
supplies the ``$`` token prefix, brackets around the domain, the base specifier,
and the printf-style specifier including padding and width, i.e. ``%011o`` for
base-8, ``%010u`` for base-10, and ``%08X`` for base-16.

For free-standing string arguments such as those where the literals are defined
in the log statements themselves, tokenization is performed with macros from
``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to
basic string formatting for other backends. This is achieved by placing an
empty header file in the local ``public_overrides`` directory of
``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
the ``__has_include`` directive.

For variable string arguments, the API is split across locations. The string
literals are tokenized wherever they are defined, and the string format macros
appear in the log format strings corresponding to those string arguments.

When tokens use non-default domains, additional work may be required to create
the domain name and store associated tokens in the ELF.

Enum Tokenization
-----------------
We use existing ``pw_tokenizer`` utilities to record the raw enum values as
tokens corresponding to their string names in the ELF. No change is required
to the backend implementation; we simply skip the token calculation step,
since we already have a value to use. Specifying a token domain is generally
required to isolate multiple enums from token collisions.

For ease of use, we can also provide a macro that wraps the enum value list
and encapsulates the recording of each token value-string pair in the ELF.

When actually logging the values, users pass the enum type name as the domain
to the format specifier macro ``PW_LOG_TOKEN()``, and the enum values can be
passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped enums).
Since integers are varint-encoded over the wire, this will only require a
single byte for most enums.
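
A quick check of that claim, assuming the ZigZag varint integer encoding used
on the wire:

```python
def zigzag(n: int) -> int:
    """ZigZag-map a signed 32-bit integer so small magnitudes stay small."""
    return (n << 1) ^ (n >> 31)

def varint_len(n: int) -> int:
    """Number of bytes a base-128 varint needs for a non-negative integer."""
    length = 1
    while n > 0x7F:
        n >>= 7
        length += 1
    return length

# Enum values in -64..63 fit in a single varint byte on the wire.
assert all(varint_len(zigzag(v)) == 1 for v in range(-64, 64))
assert varint_len(zigzag(64)) == 2  # larger values start needing more bytes
```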

.. admonition:: Logging pw::status

  Note that while this immediately reduces transmission size, the code
  space occupied by the string names in ``pw::Status::str()`` cannot be
  recovered unless an entire project is converted to log ``pw::Status``
  as tokens.

  .. code-block:: cpp

     #include "pw_log/log.h"
     #include "pw_log/tokenized_args.h"
     #include "pw_status/status.h"

     pw::Status status = pw::Status::NotFound();

     // "pw::Status: ${pw::Status}#%08d"
     PW_LOG("pw::Status: " PW_LOG_TOKEN(pw::Status), status.code())
     // "pw::Status: NOT_FOUND"

Since the token mapping entries in the ELF are optimized out of the final
binary, the enum domains are tokenized away as part of the log format strings,
and we don't need to store separate tokens for each enum value, this addition
to the API would provide enum value names in logs with zero additional
RAM cost. Compared to logging strings with ``ToString``-style functions, we
save space on the string names as well as the functions themselves.

Token Database
==============
Token databases will be expanded to include a column for domains, so that
multiple domains can be encompassed in a single database rather than requiring
separate databases for each domain. This is important because domains are being
used to categorize tokens within a single project, rather than merely keeping
separate projects distinct from each other. When creating a database
from an ELF, a domain may be specified as the default domain instead of the
empty domain. A list of domains, or a path to a file with a list of domains,
may also separately be specified to define which domains are to be included in
the database; all domains are included by default.

When accessing a token database, both a domain and a token value may be
specified to access specific values. If a domain is not specified, the default
domain will be assumed, retaining the same behavior as before.
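
As a sketch, a domain-aware database behaves like a two-level mapping; the
``db`` shape and ``lookup`` helper below are hypothetical illustrations, not
the actual ``pw_tokenizer`` API:

```python
# Hypothetical shape for a domain-aware database: {domain: {token: string}}.
db = {
    '': {0x16170ADF: 'Status: ${pw::Status}#%08x'},
    'pw::Status': {5: 'PW_STATUS_NOT_FOUND'},
}

def lookup(db, token, domain=None, default_domain=''):
    """Look up a token, falling back to the default domain when none is given."""
    effective = default_domain if domain is None else domain
    return db.get(effective, {}).get(token)

assert lookup(db, 5, domain='pw::Status') == 'PW_STATUS_NOT_FOUND'
assert lookup(db, 0x16170ADF) == 'Status: ${pw::Status}#%08x'  # default domain
```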

Detokenization
==============
Detokenization is relatively straightforward. When the detokenizer is called,
it first detokenizes and formats the top-level token and its binary argument
data. The detokenizer then finds and replaces nested tokens in the resulting
formatted string, rescanning the result for more nested tokens up to a fixed
number of rescans.

For each token type or format, ``pw_tokenizer`` defines a regular expression to
match the expected formatted output token and a helper function to convert a
token from a particular format to its mapped value. The regular expressions for
each token type are combined into a single regex that matches any one of the
formats. At each recursive step, every detokenization format is attempted for
each match, stopping at the first successful token type and then recursively
replacing all nested tokens in the result. Only tokens in full data encodings
like Base64 also require string/argument formatting as part of the recursive
step.
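
The recursive replacement loop can be sketched as follows. The database, the
regex (base-16 tokens only), and the rescan bound are all illustrative
simplifications of the design described above:

```python
import re

MAX_RESCANS = 10  # fixed bound on recursive passes (value illustrative)

# Toy database keyed by (domain, token value).
DB = {
    ('pw::Status', 0x5): 'PW_STATUS_NOT_FOUND',
}

# Matches $[{DOMAIN}]#XXXXXXXX, the fixed-width base-16 form.
TOKEN_RE = re.compile(r'\$(?:\{(?P<domain>[^}]*)\})?#(?P<hex>[0-9A-Fa-f]{8})')

def _replace(match: re.Match) -> str:
    key = (match.group('domain') or '', int(match.group('hex'), 16))
    result = DB.get(key)
    return result if result is not None else match.group(0)  # leave as-is

def detokenize_nested(text: str) -> str:
    """Replace nested tokens, rescanning until stable or the bound is hit."""
    for _ in range(MAX_RESCANS):
        new_text = TOKEN_RE.sub(_replace, text)
        if new_text == text:  # nothing left to replace
            return new_text
        text = new_text
    return text

print(detokenize_nested('Status: ${pw::Status}#00000005'))
# Status: PW_STATUS_NOT_FOUND
```

Tokens that fail to resolve are returned unchanged, which is what allows them
to appear in their raw form in the final logs.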

For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
determines its set of permissible alphanumeric characters and the
maximum token width for regex matching.

If nested detokenization fails for any reason, the formatted token will be
printed as-is in the output logs. If ``show_errors`` is true for the
detokenizer, errors will appear in parentheses immediately following the
token. Supported errors include:

* ``(token collision)``
* ``(missing database)``
* ``(token not found)``

------------
Alternatives
------------

Protobuf-based Tokenization
===========================
Tokenization may be expanded to operate on structured data via protobufs.
This could make logging more flexible, as all manner of compile-time
metadata could be freely attached to log arguments at effectively no cost.
It would most likely involve a separate build process to generate and tokenize
partially-populated protos and would significantly change the user API. It
would also be a large break from the existing implementation, as the current
system relies only on existing C preprocessor and C++ ``constexpr`` tricks to
function.

In this model, the token domain would likely be a fully qualified namespace
for, or path to, the proto definition.

Implementing this approach also requires a method of passing ordered arguments
to a partially-filled detokenized protobuf in a manner similar to printf-style
string formatting, so that argument data can be efficiently encoded and
transmitted alongside the protobuf's token, and the arguments to a particular
proto can be disambiguated from arguments to the rest of a log statement.

This approach would also most likely preclude plain string logging as is
currently supported by ``pw_log``, as the implementations diverge dramatically.
However, if pursued, this would likely be made the default logging schema
across all platforms, including host devices.

Custom Detokenization
=====================
Theoretically, individual projects could implement their own regex replacement
schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
complex relationships between logged tokens via custom log format string
syntax. However, Pigweed should provide utilities for nested tokenization in
common cases such as logging enums.

The changes proposed do not preclude additional custom detokenization schemas
if absolutely necessary, and such practices do not appear to have been popular
thus far in any case.

--------------
Open questions
--------------
Missing API definitions:

* Updated APIs for creating and accessing token databases with multiple domains
* Python nested tokenization
* C++ nested detokenization