.. _seed-0105:

===============================================
0105: Nested Tokens and Tokenized Log Arguments
===============================================

.. seed::
   :number: 105
   :name: Nested Tokens and Tokenized Log Arguments
   :status: Accepted
   :proposal_date: 2023-07-10
   :cl: 154190
   :authors: Gwyneth Chen
   :facilitator: Wyatt Hepler

-------
Summary
-------
This SEED describes a number of extensions to the
`pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_ and
`pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to improve
support for nesting tokens and to add facilities for tokenizing log arguments
such as strings and enums. This SEED primarily addresses C/C++ tokenization
and Python/C++ detokenization.

----------
Motivation
----------
Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with
limited memory to store long log format strings as hashed 32-bit tokens. When
logs are moved off-device, host tooling can recover the full logs using token
databases that were created when building the device image. However, logs may
still have runtime string arguments that are stored and transferred 1:1
without additional encoding. This SEED aims to extend tokenization to these
arguments to further reduce the weight of logging for embedded applications.

The proposed changes affect both the tokenization module itself and the
logging facilities built on top of tokenization.

--------
Proposal
--------
Logging enums such as ``pw::Status`` is one common special case where
tokenization is particularly appropriate: enum values are conceptually
already tokens mapping to their names, assuming no duplicate values. Logging
enums frequently entails creating functions and string names that occupy space
exclusively for logging purposes, which this proposal seeks to mitigate.
Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
the several transformations that strings undergo during tokenization and
detokenization, further complicated in the proposed design by nested tokens.

.. list-table:: Enum Tokenization/Detokenization Phases
   :widths: 20 45

   * - (1) Source code
     - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
   * - (2) Token database entries (token, string, domain)
     - | ``16170adf, "Status: ${pw::Status}#%08x", ""``
       | ``5       , "PW_STATUS_NOT_FOUND"       , "pw::Status"``
   * - (3) Wire format
     - ``df 0a 17 16 0a`` (5 bytes)
   * - (4) Top-level detokenized and formatted
     - ``"Status: ${pw::Status}#00000005"``
   * - (5) Fully detokenized
     - ``"Status: PW_STATUS_NOT_FOUND"``

Compared to log tokenization without nesting, string literals in token
database entries may not be identical to what is typed in source code, due to
the use of macros and preprocessor string concatenation. The detokenizer also
takes an additional step to recursively detokenize any nested tokens. In
exchange for this added complexity, nested enum tokenization gains the
readability of logging value names at zero additional runtime space or
performance cost compared to logging the integral values directly with
``pw_log_tokenized``.

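The jump from stage (3) to stage (4) begins by splitting the payload into the
top-level token and its arguments. Below is a minimal sketch of that split,
assuming ``pw_tokenizer``'s usual encoding of a 4-byte little-endian token
followed by zigzag varint integer arguments; ``decode_payload`` is an
illustrative helper, not an existing API.

.. code-block:: python

   import struct

   def decode_payload(payload: bytes) -> tuple[int, list[int]]:
       """Splits a tokenized log payload into its token and integer args."""
       token = struct.unpack('<I', payload[:4])[0]  # 32-bit little-endian token
       args, value, shift = [], 0, 0
       for byte in payload[4:]:  # remaining bytes are varint-encoded arguments
           value |= (byte & 0x7F) << shift
           shift += 7
           if not byte & 0x80:  # a clear high bit terminates one varint
               args.append((value >> 1) ^ -(value & 1))  # zigzag -> signed
               value, shift = 0, 0
       return token, args

   # Stage (3) wire format from the table above: df 0a 17 16 0a.
   token, args = decode_payload(bytes.fromhex('df0a17160a'))
   assert token == 0x16170ADF and args == [5]
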
.. note::
   Without nested enum token support, users can select either readability or
   reduced binary and transmission size, but not easily both:

   .. list-table::
      :widths: 15 20 20
      :header-rows: 1

      * -
        - Raw integers
        - String names
      * - (1) Source code
        - ``PW_LOG("Status: %x", status.code())``
        - ``PW_LOG("Status: %s", pw_StatusString(status))``
      * - (2) Token database entries (token, string, domain)
        - ``03a83461, "Status: %x", ""``
        - ``069c3ef0, "Status: %s", ""``
      * - (3) Wire format
        - ``61 34 a8 03 0a`` (5 bytes)
        - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
      * - (4) Top-level detokenized and formatted
        - ``"Status: 5"``
        - ``"Status: PW_STATUS_NOT_FOUND"``
      * - (5) Fully detokenized
        - ``"Status: 5"``
        - ``"Status: PW_STATUS_NOT_FOUND"``

Tokenization (C/C++)
====================
The ``pw_log_tokenized`` module exposes a set of macros for creating and
formatting nested tokens. Within format strings in the source code, tokens
are specified using function-like PRI-style macros. These can be used to
encode static information like the token domain or a numeric base encoding,
and they are macro-expanded to string literals that are concatenated with the
rest of the format string during preprocessing. Since ``pw_log`` generally
uses printf syntax, only bases 8, 10, and 16 are supported for integer token
arguments via ``%[odiuxX]``.

The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching between other ``pw_log`` backends like
``pw_log_basic``. The macros for basic usage are as follows:

* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
  referenced, e.g. as a ``ToString`` function return type.

.. code-block:: cpp

   #include "pw_log/log.h"
   #include "pw_log/tokenized_args.h"

   // Token with default options: base-16 and empty domain.
   // Token database literal: "The sun will come out $#%08x!"
   PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!",
          PW_LOG_TOKEN_EXPR("tomorrow"));
   // After detokenization: "The sun will come out tomorrow!"

Additional macros are also provided specifically for enum handling. The
``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
value with the specified token domain to prevent token collisions between
multiple tokenized enums. This macro is kept separate from the enum
definition to allow things like tokenizing a preexisting enum defined in an
external dependency.

.. code-block:: cpp

   namespace foo {

   enum class Color { kRed, kGreen, kBlue };

   // Syntax TBD.
   TOKENIZE_ENUM(
     foo::Color,
     kRed,
     kGreen,
     kBlue
   )

   }  // namespace foo

   void LogColor(foo::Color color) {
     // Token database literal:
     // "Color: [${foo::Color}10#%010d]"
     PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color);
     // After detokenization:
     // e.g. "Color: kRed"
   }

.. admonition:: Nested Base64 tokens

   ``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
   which case the argument should be a pre-encoded Base64 string
   (e.g. ``QAzF39==``). However, this should be avoided when possible to
   maximize space savings. Fully formatted Base64 tokens, including the token
   prefix, may also be logged with ``%s`` as before.

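For reference, a nested Base64 token is simply the common prefix followed by
the Base64 encoding of the binary token and any trailing argument bytes. A
minimal sketch of that construction, again assuming the 4-byte little-endian
token layout; ``encode_base64_token`` is an illustrative helper, not an
existing API.

.. code-block:: python

   import base64
   import struct

   def encode_base64_token(token: int, args: bytes = b'') -> str:
       """Builds a nested Base64 token: '$' + Base64(token bytes + args)."""
       return '$' + base64.b64encode(struct.pack('<I', token) + args).decode()

   # A bare token with no arguments and the empty domain:
   print(encode_base64_token(0x16170ADF))  # $3woXFg==
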
Detokenization (Python)
=======================
``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
will automatically and recursively detokenize tokens of all known formats,
rather than requiring a separate call to ``detokenize_base64`` or similar.

To support detokenizing domain-specific tokens, token databases support
multiple domains, and ``database.py create`` will build a database with tokens
from all domains by default. Specifying a domain during database creation will
cause that domain to be treated as the default.

When detokenization fails, tokens appear as-is in logs. If the detokenizer has
the ``show_errors`` option set to ``True``, error messages may be printed
inline following the raw token.

Tokens
======
Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro,
so users should typically not need to format tokens manually. However, if
detokenization fails for any reason, tokens will appear with the following
format in the final logs and should be easily recognizable.

Nested tokens have the following structure in partially detokenized logs
(transformation stage 4):

.. code-block::

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

.. list-table:: Options
   :widths: 10 30

   * - ``{DOMAIN}``
     - Specifies the token domain. If this option is omitted, the default
       (empty) domain is assumed.
   * - ``BASE#``
     - Defines the numeric base encoding of the token. Accepted values are 8,
       10, 16, and 64. If the hash symbol ``#`` is used without specifying a
       number, the base is assumed to be 16. If the base option is omitted
       entirely, the base defaults to 64 for backward compatibility. All
       encodings except Base64 are case insensitive.

       This option may be expanded to support other bases in the future.
   * - ``TOKEN`` (required)
     - The numeric representation of the token in the given base encoding. All
       encodings except Base64 are left-padded with zeroes to the maximum
       width of a 32-bit integer in the given base. Base64 data may
       additionally encode string arguments for the detokenized token, and
       therefore does not have a maximum width. This is handled automatically
       by ``PW_LOG_TOKEN_FMT`` for supported bases.

When used in conjunction with ``pw_log_tokenized``, the token prefix
(including any domain and base specifications) is tokenized as part of the log
format string and therefore incurs zero additional memory or transmission cost
over that of the original format string. Over the wire, tokens in bases 8, 10,
and 16 are transmitted as varint-encoded integers up to 5 bytes in size.
Base64 tokens continue to be encoded as strings.

.. warning::
   Tokens do not generally have a terminating character, which is why we
   require them to be formatted with fixed width. Otherwise, immediately
   following a token with alphanumeric characters that are valid in its base
   encoding would cause detokenization errors.

.. admonition:: Recognizing raw nested tokens in strings

   When a string is fully detokenized, there should no longer be any
   indication of tokenization in the final result, e.g. detokenized logs
   should read the same as plain string logs. However, if nested tokens
   cannot be detokenized for any reason, they will appear in their raw form
   as below:

   .. code-block::

      // Base64 token with no arguments and empty domain
      $QA19pfEQ

      // Base-10 token
      $10#0086025943

      // Base-16 token with specified domain
      ${foo_namespace::MyEnum}#0000001A

      // Base64 token with specified domain
      ${bar_namespace::MyEnum}QAQQQQ==

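Because these raw forms follow the fixed grammar above, they can be recognized
mechanically. A simplified sketch of such a matcher is shown below; the
grouping here is illustrative only and is not the actual combined pattern
described under Detailed design.

.. code-block:: python

   import re

   # Simplified to one pattern; the real implementation uses a pattern per
   # base so that each base constrains its own character set and width.
   NESTED_TOKEN = re.compile(
       r'\$(?:\{(?P<domain>[^}]*)\})?'         # optional {DOMAIN}
       r'(?:(?P<base>8|10|16)?#'               # BASE#; a bare '#' means base 16
       r'(?P<value>[0-9A-Fa-f]{8,11})'         # fixed-width token value
       r'|(?P<base64>[A-Za-z0-9+/]+={0,2}))'   # no BASE# defaults to Base64
   )

   for raw in ('$QA19pfEQ',
               '$10#0086025943',
               '${foo_namespace::MyEnum}#0000001A'):
       assert NESTED_TOKEN.fullmatch(raw)
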
---------------------
Problem investigation
---------------------
Complex embedded device projects are perpetually seeking more RAM. Even a
handful of longer descriptive string arguments can occupy hundreds of bytes
that frequently serve logging purposes exclusively, without any impact on
device function.

One of the most common potential use cases is logging enum values. Inspection
of one project revealed that enums accounted for some 90% of the string log
arguments. We have encountered instances where, to save space, developers
avoided logging descriptive names in favor of raw enum values, forcing readers
of logs to look up or memorize the meaning of each number. As with log format
strings, the set of possible string values that might be emitted in the final
logs is known in advance, so those strings should be extractable into a token
database at compile time.

Another major challenge overall is maintaining a user interface that is easy
to understand and use. The current primary interface through ``pw_log``
provides printf-style formatting, which is familiar and succinct for basic
applications.

We also have to contend with the interchangeable backends of ``pw_log``. The
``pw_log`` facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging would break this abstraction barrier.
Either the additional syntax would be ignored by other backends, or it might
simply be incompatible (e.g. logging raw integer tokens instead of strings).

Pigweed already supports one form of nested tokens via Base64 encoding. Base64
tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
with one or two trailing ``'='`` symbols. The Python
``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
running a regex replacement on the formatted results of each iteration. Base64
is not merely a token format, however; it can encode any binary data in a text
format at the cost of reduced efficiency. Therefore, Base64 tokens may include
not only a database token that detokenizes to a format string, but also
binary-encoded arguments. Other token types are not expected to include this
additional argument data.

---------------
Detailed design
---------------

Tokenization
============
``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the
necessary functionality to support tokenized arguments. The proposed API is
fully backward-compatible with non-nested tokenized logging.

Token arguments are indicated in log format strings via PRI-style macros that
are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
supplies the ``$`` token prefix, the brackets around the domain, the base
specifier, and the printf-style conversion specifier including padding and
width, i.e. ``%011o`` for base-8, ``%010u`` for base-10, and ``%08X`` for
base-16.

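Conceptually, the macro just concatenates these pieces into a string literal.
A hypothetical Python model of that expansion follows; ``log_token_fmt`` and
``_SPECIFIERS`` are illustrative names only, and the real macro assembles the
same pieces during C preprocessing.

.. code-block:: python

   # What PW_LOG_TOKEN_FMT conceptually produces for each supported base.
   _SPECIFIERS = {8: '8#%011o', 10: '10#%010u', 16: '#%08x'}

   def log_token_fmt(domain: str = '', base: int = 16) -> str:
       """Models the string literal the format specifier macro expands to."""
       domain_part = '{' + domain + '}' if domain else ''
       return '$' + domain_part + _SPECIFIERS[base]

   assert log_token_fmt() == '$#%08x'
   assert log_token_fmt('foo::Color', base=10) == '${foo::Color}10#%010u'
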
For free-standing string arguments, such as those whose literals are defined
in the log statements themselves, tokenization is performed with macros from
``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
simply alias the corresponding ``PW_TOKENIZE`` macros, but they revert to
basic string formatting for other backends. This is achieved by placing an
empty header file in the local ``public_overrides`` directory of
``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
the ``__has_include`` directive.

For variable string arguments, the API is split across locations: the string
literals are tokenized wherever they are defined, and the string format macros
appear in the log format strings corresponding to those string arguments.

When tokens use non-default domains, additional work may be required to create
the domain name and store the associated tokens in the ELF.

Enum Tokenization
-----------------
We use existing ``pw_tokenizer`` utilities to record the raw enum values as
tokens corresponding to their string names in the ELF. No change to the
backend implementation is required; we simply skip the token calculation step,
since we already have a value to use. Specifying a token domain is generally
required to isolate multiple enums from token collisions.

For ease of use, we can also provide a macro that wraps the enum value list
and encapsulates the recording of each token value-string pair in the ELF.

When actually logging the values, users pass the enum type name as the domain
to the format specifier macro ``PW_LOG_ENUM_FMT()``, and the enum values can
be passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped
enums). Since integers are varint-encoded over the wire, this will only
require a single byte for most enums.

.. admonition:: Logging pw::Status

   Note that while this immediately reduces transmission size, the code
   space occupied by the string names in ``pw::Status::str()`` cannot be
   recovered unless the entire project is converted to log ``pw::Status``
   as tokens.

   .. code-block:: cpp

      #include "pw_log/log.h"
      #include "pw_log/tokenized_args.h"
      #include "pw_status/status.h"

      pw::Status status = pw::Status::NotFound();

      // "pw::Status: ${pw::Status}#%08d"
      PW_LOG("pw::Status: " PW_LOG_ENUM_FMT(pw::Status), status.code());
      // "pw::Status: NOT_FOUND"

Since the token mapping entries in the ELF are optimized out of the final
binary, the enum domains are tokenized away as part of the log format strings,
and no separate tokens need to be stored for each enum value, this addition to
the API would provide enum value names in logs at zero additional RAM cost.
Compared to logging strings with ``ToString``-style functions, we save space
on the string names as well as the functions themselves.

Token Database
==============
Token databases will be expanded to include a column for domains, so that
multiple domains can be encompassed in a single database rather than requiring
separate databases for each domain. This is important because domains are
being used to categorize tokens within a single project, rather than merely
keeping separate projects distinct from each other. When creating a database
from an ELF, a domain may be specified as the default domain instead of the
empty domain. A list of domains, or a path to a file listing domains, may also
be specified separately to define which domains are included in the database;
all domains are now included by default.

When accessing a token database, both a domain and a token value may be
specified to access specific values. If a domain is not specified, the default
domain is assumed, retaining the same behavior as before.

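The concrete multi-domain database APIs are deliberately left to the Open
questions section, but the intended lookup semantics can be sketched. All
names below are hypothetical; entries are keyed by (domain, token) pairs.

.. code-block:: python

   from typing import Optional

   class MultiDomainDatabase:
       """Hypothetical model of (domain, token) -> string lookup semantics."""

       def __init__(self, default_domain: str = ''):
           self._default = default_domain
           self._entries: dict[tuple[str, int], str] = {}

       def add(self, token: int, string: str, domain: str = '') -> None:
           self._entries[(domain, token)] = string

       def lookup(self, token: int,
                  domain: Optional[str] = None) -> Optional[str]:
           # Omitting the domain falls back to the default, as before.
           domain = self._default if domain is None else domain
           return self._entries.get((domain, token))

   db = MultiDomainDatabase()
   db.add(5, 'PW_STATUS_NOT_FOUND', domain='pw::Status')
   assert db.lookup(5, domain='pw::Status') == 'PW_STATUS_NOT_FOUND'
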
Detokenization
==============
Detokenization is relatively straightforward. When the detokenizer is called,
it first detokenizes and formats the top-level token and its binary argument
data. The detokenizer then finds and replaces nested tokens in the resulting
formatted string, and rescans the result for more nested tokens, up to a fixed
number of rescans. A condensed sketch of this loop appears at the end of this
section.

For each token type or format, ``pw_tokenizer`` defines a regular expression
to match the expected formatted output token and a helper function to convert
a token from a particular format to its mapped value. The regular expressions
for each token type are combined into a single regex that matches any one of
the formats. For every match at each recursive step, each detokenization
format is attempted, stopping at the first successful token type and then
recursively replacing all nested tokens in the result. Only full
data-encoding token types like Base64 also require string and argument
formatting as part of the recursive step.

For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
determines its set of permissible alphanumeric characters and the maximum
token width for regex matching.

If nested detokenization fails for any reason, the formatted token is printed
as-is in the output logs. If ``show_errors`` is true for the detokenizer,
errors will appear in parentheses immediately following the token. Supported
errors include:

* ``(token collision)``
* ``(missing database)``
* ``(token not found)``

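The rescan loop described above can be condensed as follows. The names here
are hypothetical: ``replace_match`` stands in for the per-format helper
functions, and ``MAX_RESCANS`` is an assumed limit, not a defined constant.

.. code-block:: python

   import re
   from typing import Callable

   MAX_RESCANS = 10  # hypothetical bound on recursive passes

   def detokenize_nested(formatted: str, pattern: re.Pattern,
                         replace_match: Callable[[re.Match], str]) -> str:
       """Reapplies the combined token regex until the output stops changing."""
       for _ in range(MAX_RESCANS):
           result = pattern.sub(replace_match, formatted)
           if result == formatted:  # no nested tokens remained
               break
           formatted = result
       return formatted

Here ``replace_match`` would try each format's helper in turn, returning the
match unchanged (optionally with an inline error) when detokenization fails.
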
------------
Alternatives
------------

Protobuf-based Tokenization
===========================
Tokenization may be expanded to operate on structured data via protobufs.
This could make logging more flexible, as all manner of compile-time metadata
could be freely attached to log arguments at effectively no cost. It would
most likely involve a separate build process to generate and tokenize
partially populated protos, and it would significantly change the user API.
It would also be a large break from the existing implementation, as the
current system relies only on existing C preprocessor and C++ ``constexpr``
tricks to function.

In this model, the token domain would likely be a fully qualified namespace
for, or a path to, the proto definition.

Implementing this approach also requires a method of passing ordered arguments
to a partially filled detokenized protobuf in a manner similar to printf-style
string formatting, so that argument data can be efficiently encoded and
transmitted alongside the protobuf's token, and so that the arguments to a
particular proto can be disambiguated from the arguments to the rest of a log
statement.

This approach would also most likely preclude plain string logging as
currently supported by ``pw_log``, since the implementations diverge
dramatically. However, if pursued, it would likely become the default logging
schema across all platforms, including host devices.

Custom Detokenization
=====================
Theoretically, individual projects could implement their own regex replacement
schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
complex relationships between logged tokens via custom log format string
syntax. However, Pigweed should provide utilities for nested tokenization in
common cases such as logging enums.

The proposed changes do not preclude additional custom detokenization schemas
if absolutely necessary, and such practices do not appear to have been popular
thus far in any case.

--------------
Open questions
--------------
Missing API definitions:

* Updated APIs for creating and accessing token databases with multiple
  domains
* Python nested tokenization
* C++ nested detokenization