.. _seed-0105:

===============================================
0105: Nested Tokens and Tokenized Log Arguments
===============================================

.. seed::
   :number: 105
   :name: Nested Tokens and Tokenized Log Arguments
   :status: Accepted
   :proposal_date: 2023-07-10
   :cl: 154190
   :authors: Gwyneth Chen
   :facilitator: Wyatt Hepler

-------
Summary
-------
This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_
and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to
improve support for nesting tokens and add facilities for tokenizing arguments
to logs such as strings and enums. This SEED primarily addresses C/C++
tokenization and Python/C++ detokenization.

----------
Motivation
----------
Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited
memory to store long log format strings as hashed 32-bit tokens. When logs are
moved off-device, host tooling can recover the full logs using token databases
that were created when building the device image. However, logs may still have
runtime string arguments that are stored and transferred 1:1 without additional
encoding. This SEED aims to extend tokenization to these arguments to further
reduce the weight of logging for embedded applications.

The proposed changes affect both the tokenization module itself and the logging
facilities built on top of tokenization.

--------
Proposal
--------
Logging enums such as ``pw::Status`` is one common special case where
tokenization is particularly appropriate: enum values are conceptually
already tokens mapping to their names, assuming no duplicate values. Logging
enums frequently entails creating functions and string names that occupy space
exclusively for logging purposes, which this proposal seeks to mitigate.
Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
the several transformations that strings undergo during tokenization and
detokenization, further complicated in the proposed design by nested tokens.

.. list-table:: Enum Tokenization/Detokenization Phases
   :widths: 20 45

   * - (1) Source code
     - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
   * - (2) Token database entries (token, string, domain)
     - | ``16170adf, "Status: ${pw::Status}#%08x", ""``
       | ``5       , "PW_STATUS_NOT_FOUND"       , "pw::Status"``
   * - (3) Wire format
     - ``df 0a 17 16 0a`` (5 bytes)
   * - (4) Top-level detokenized and formatted
     - ``"Status: ${pw::Status}#00000005"``
   * - (5) Fully detokenized
     - ``"Status: PW_STATUS_NOT_FOUND"``

Compared to log tokenization without nesting, string literals in token
database entries may not be identical to what is typed in source code due
to the use of macros and preprocessor string concatenation. The
detokenizer also takes an additional step to recursively detokenize any
nested tokens. In exchange for this added complexity, nested enum tokenization
allows us to gain the readability of logging value names with zero additional
runtime space or performance cost compared to logging the integral values
directly with ``pw_log_tokenized``.
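
The 5-byte wire format in row (3) can be reproduced with a short sketch. This
assumes the encoding ``pw_tokenizer`` documents for tokenized messages: the
32-bit token in little-endian byte order, followed by each integer argument as
a ZigZag-encoded varint (other argument types are omitted here):

```python
import struct

def zigzag(n: int) -> int:
    """ZigZag-map a signed 32-bit integer so small magnitudes stay small."""
    return (n << 1) ^ (n >> 31)

def varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def encode_tokenized_log(token: int, *int_args: int) -> bytes:
    payload = struct.pack('<I', token)  # 4-byte little-endian token
    for arg in int_args:
        payload += varint(zigzag(arg))
    return payload

print(encode_tokenized_log(0x16170ADF, 5).hex(' '))  # df 0a 17 16 0a
```

Here ``0x16170adf`` is the format-string token and ``5`` is ``status.code()``;
ZigZag maps 5 to 10 (``0x0a``), so the whole log costs five bytes regardless of
the length of the detokenized message.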

.. note::
  Without nested enum token support, users can select either readability or
  reduced binary and transmission size, but not easily both:

  .. list-table::
    :widths: 15 20 20
    :header-rows: 1

    * -
      - Raw integers
      - String names
    * - (1) Source code
      - ``PW_LOG("Status: %x", status.code())``
      - ``PW_LOG("Status: %s", pw_StatusString(status))``
    * - (2) Token database entries (token, string, domain)
      - ``03a83461, "Status: %x", ""``
      - ``069c3ef0, "Status: %s", ""``
    * - (3) Wire format
      - ``61 34 a8 03 0a`` (5 bytes)
      - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
    * - (4) Top-level detokenized and formatted
      - ``"Status: 5"``
      - ``"Status: PW_STATUS_NOT_FOUND"``
    * - (5) Fully detokenized
      - ``"Status: 5"``
      - ``"Status: PW_STATUS_NOT_FOUND"``

Tokenization (C/C++)
====================
The ``pw_log_tokenized`` module exposes a set of macros for creating and
formatting nested tokens. Within format strings in the source code, tokens
are specified using function-like PRI-style macros. These can be used to
encode static information like the token domain or a numeric base encoding,
and are macro-expanded to string literals that are concatenated with the
rest of the format string during preprocessing. Since ``pw_log`` generally
uses printf syntax, only bases 8, 10, and 16 are supported for integer token
arguments via ``%[odiuxX]``.

The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching between ``pw_log`` backends like
``pw_log_basic``. These macros for basic usage are as follows:

* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
  referenced, e.g. as a ``ToString`` function return type.

.. code-block:: cpp

   #include "pw_log/log.h"
   #include "pw_log/tokenized_args.h"

   // token with default options base-16 and empty domain
   // token database literal: "The sun will come out $#%08x!"
   PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!",
          PW_LOG_TOKEN_EXPR("tomorrow"))
   // after detokenization: "The sun will come out tomorrow!"
131
132Additional macros are also provided specifically for enum handling. The
133``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
134value with the specified token domain to prevent token collision between
135multiple tokenized enums. This macro is kept separate from the enum
136definition to allow things like tokenizing a preexisting enum defined in an
137external dependency.
138
139.. code-block:: cpp
140
141   // enums
142   namespace foo {
143
144     enum class Color { kRed, kGreen, kBlue };
145
146     // syntax TBD
147     TOKENIZE_ENUM(
148       foo::Color,
149       kRed,
150       kGreen,
151       kBlue
152     )
153
154   } // namespace foo
155
156   void LogColor(foo::Color color) {
157     // token database literal:
158     // "Color: [${foo::Color}10#%010d]"
159     PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color)
160     // after detokenization:
161     // e.g. "Color: kRed"
162   }

.. admonition:: Nested Base64 tokens

  ``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
  which case the argument should be a pre-encoded Base64 string
  (e.g. ``QAzF39==``). However, this should be avoided when possible to
  maximize space savings. Fully formatted Base64, including the token prefix,
  may also be logged with ``%s`` as before.

Detokenization (Python)
=======================
``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
will automatically and recursively detokenize tokens of all known formats
rather than requiring a separate call to ``detokenize_base64`` or similar.

To support detokenizing domain-specific tokens, token databases support multiple
domains, and ``database.py create`` will build a database with tokens from all
domains by default. Specifying a domain during database creation will cause
that domain to be treated as the default.

When detokenization fails, tokens appear as-is in logs. If the detokenizer has
the ``show_errors`` option set to ``True``, error messages may be printed
inline following the raw token.

Tokens
======
Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro, so
users should typically not be manually formatting tokens. However, if
detokenization fails for any reason, tokens will appear with the following
format in the final logs and should be easily recognizable.

Nested tokens have the following structure in partially detokenized logs
(transformation stage 4):

.. code-block::

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

.. list-table:: Options
   :widths: 10 30

   * - ``{DOMAIN}``
     - Specifies the token domain. If this option is omitted, the default
       (empty) domain is assumed.
   * - ``BASE#``
     - Defines the numeric base encoding of the token. Accepted values are 8,
       10, 16, and 64. If the hash symbol ``#`` is used without specifying a
       number, the base is assumed to be 16. If the base option is omitted
       entirely, the base defaults to 64 for backward compatibility. All
       encodings except Base64 are case-insensitive.

       This option may be expanded to support other bases in the future.
   * - ``TOKEN`` (required)
     - The numeric representation of the token in the given base encoding. All
       encodings except Base64 are left-padded with zeroes to the maximum width
       of a 32-bit integer in the given base. Base64 data may additionally encode
       string arguments for the detokenized token, and therefore does not have a
       maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for
       supported bases.
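
``PW_LOG_TOKEN_FMT`` handles this padding automatically, but the fixed-width
forms can be sketched directly. The widths below are the digits needed to
render ``0xFFFFFFFF`` in each base; Base64 is omitted since it is not fixed
width. The ``format_token`` helper is illustrative, not part of the API:

```python
WIDTHS = {8: 11, 10: 10, 16: 8}  # digits covering a full 32-bit value
CONV = {8: 'o', 10: 'd', 16: 'X'}  # printf-style conversions per base

def format_token(token: int, base: int = 16, domain: str = '') -> str:
    """Render a nested token as it appears in partially detokenized logs."""
    domain_part = '{%s}' % domain if domain else ''
    base_part = '#' if base == 16 else f'{base}#'  # bare '#' implies base 16
    digits = format(token, CONV[base]).zfill(WIDTHS[base])
    return f'${domain_part}{base_part}{digits}'

print(format_token(86025943, base=10))                   # $10#0086025943
print(format_token(0x1A, domain='foo_namespace::MyEnum'))
# ${foo_namespace::MyEnum}#0000001A
```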

When used in conjunction with ``pw_log_tokenized``, the token prefix (including
any domain and base specifications) is tokenized as part of the log format
string and therefore incurs zero additional memory or transmission cost over
that of the original format string. Over the wire, tokens in bases 8, 10, and
16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
tokens continue to be encoded as strings.

.. warning::
  Tokens do not have a terminating character in general, which is why we
  require them to be formatted with fixed width. Otherwise, immediately
  following a token with alphanumeric characters valid in its base encoding
  would cause detokenization errors.
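
A small regex experiment illustrates the problem (the patterns are
illustrative, not the ones ``pw_tokenizer`` uses):

```python
import re

FIXED = re.compile(r'\$#([0-9A-Fa-f]{8})')   # exactly 8 hex digits
GREEDY = re.compile(r'\$#([0-9A-Fa-f]+)')    # unpadded: no way to stop

# A fixed-width token followed by valid hex characters still parses cleanly:
assert FIXED.search('$#0000001AFF').group(1) == '0000001A'
# An unpadded token would absorb the trailing characters into the token:
assert GREEDY.search('$#1AFF').group(1) == '1AFF'
```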

.. admonition:: Recognizing raw nested tokens in strings

  When a string is fully detokenized, there should no longer be any indication
  of tokenization in the final result, e.g. detokenized logs should read the
  same as plain string logs. However, if nested tokens cannot be detokenized for
  any reason, they will appear in their raw form as below:

  .. code-block::

     // Base64 token with no arguments and empty domain
     $QA19pfEQ

     // Base-10 token
     $10#0086025943

     // Base-16 token with specified domain
     ${foo_namespace::MyEnum}#0000001A

     // Base64 token with specified domain
     ${bar_namespace::MyEnum}QAQQQQ==
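
All four raw forms can be recognized with a single pattern along these lines
(an illustrative sketch, not the exact regex used by the detokenizer):

```python
import re

# $ prefix, optional {DOMAIN}, optional BASE# (bare '#' means base 16),
# then fixed-width digits or Base64 data with up to two '=' padding chars.
NESTED_TOKEN = re.compile(
    r'\$'
    r'(?:\{(?P<domain>[^}]*)\})?'
    r'(?:(?P<base>8|10|16)?#)?'
    r'(?P<token>[A-Za-z0-9+/]+={0,2})'
)

samples = [
    '$QA19pfEQ',
    '$10#0086025943',
    '${foo_namespace::MyEnum}#0000001A',
    '${bar_namespace::MyEnum}QAQQQQ==',
]
for sample in samples:
    assert NESTED_TOKEN.fullmatch(sample), sample
```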


---------------------
Problem investigation
---------------------
Complex embedded device projects are perpetually seeking more RAM. Even a
handful of longer descriptive string arguments can take up hundreds of bytes
that exist exclusively for logging purposes, without any impact on device
function.

One of the most common potential use cases is logging enum values.
Inspection of one project revealed that enums accounted for some 90% of the
string log arguments. We have encountered instances where, to save space,
developers have avoided logging descriptive names in favor of raw enum values,
forcing readers of logs to look up or memorize the meaning of each number. As
with log format strings, we do know the set of possible string values that
might be emitted in the final logs, so they should be able to be extracted
into a token database at compile time.

Another major challenge overall is maintaining a user interface
that is easy to understand and use. The current primary interface through
``pw_log`` provides printf-style formatting, which is familiar and succinct
for basic applications.

We also have to contend with the interchangeable backends of ``pw_log``. The
``pw_log`` facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging would break this abstraction barrier. Either
this additional syntax would be ignored by other backends, or it might simply
be incompatible (e.g. logging raw integer tokens instead of strings).

Pigweed already supports one form of nested tokens via Base64 encoding. Base64
tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
with one or two trailing ``'='`` symbols. The Python
``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
running a regex replacement on the formatted results of each iteration. Base64
is not merely a token format, however; it can encode any binary data in a text
format at the cost of reduced efficiency. Therefore, Base64 tokens may include
not only a database token that may detokenize to a format string but also
binary-encoded arguments. Other token types are not expected to include this
additional argument data.
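
A minimal sketch of this existing Base64 form, assuming the layout described
above (a 4-byte little-endian database token followed by any binary-encoded
argument data):

```python
import base64
import struct

def encode_base64_token(token: int, arg_data: bytes = b'') -> str:
    """Wrap a 4-byte token plus argument bytes in the $-prefixed Base64 form."""
    return '$' + base64.b64encode(struct.pack('<I', token) + arg_data).decode()

def decode_base64_token(text: str):
    """Recover the (token, argument bytes) pair from a Base64 nested token."""
    raw = base64.b64decode(text[1:])  # strip the '$' prefix
    token, = struct.unpack('<I', raw[:4])
    return token, raw[4:]

encoded = encode_base64_token(0x16170ADF, b'\x0a')
assert decode_base64_token(encoded) == (0x16170ADF, b'\x0a')
```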

---------------
Detailed design
---------------

Tokenization
============
``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary
functionality to support tokenized arguments. The proposed API is fully
backward-compatible with non-nested tokenized logging.

Token arguments are indicated in log format strings via PRI-style macros that
are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
supplies the ``$`` token prefix, brackets around the domain, the base specifier,
and the printf-style specifier including padding and width, i.e. ``%011o`` for
base-8, ``%010u`` for base-10, and ``%08X`` for base-16.

For free-standing string arguments such as those where the literals are defined
in the log statements themselves, tokenization is performed with macros from
``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to
basic string formatting for other backends. This is achieved by placing an
empty header file in the local ``public_overrides`` directory of
``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
the ``__has_include`` directive.

For variable string arguments, the API is split across locations. The string
literals are tokenized wherever they are defined, and the string format macros
appear in the log format strings corresponding to those string arguments.

When tokens use non-default domains, additional work may be required to create
the domain name and store associated tokens in the ELF.

Enum Tokenization
-----------------
We use existing ``pw_tokenizer`` utilities to record the raw enum values as
tokens corresponding to their string names in the ELF. No change is required
to the backend implementation; we simply skip the token calculation step,
since we already have a value to use. Specifying a token domain is generally
required to isolate multiple enums from token collisions.

For ease of use, we can also provide a macro that wraps the enum value list
and encapsulates the recording of each token value-string pair in the ELF.

When actually logging the values, users pass the enum type name as the domain
to the format specifier macro ``PW_LOG_TOKEN()``, and the enum values can be
passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped enums).
Since integers are varint-encoded over the wire, this will only require a
single byte for most enums.
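
A quick check of that claim, assuming the ZigZag varint integer encoding used
on the wire:

```python
def zigzag(n: int) -> int:
    """ZigZag-map a signed 32-bit integer so small magnitudes stay small."""
    return (n << 1) ^ (n >> 31)

def varint_len(n: int) -> int:
    """Number of bytes a base-128 varint needs for a non-negative integer."""
    length = 1
    while n > 0x7F:
        n >>= 7
        length += 1
    return length

# Enum values in -64..63 fit in a single varint byte on the wire.
assert all(varint_len(zigzag(v)) == 1 for v in range(-64, 64))
assert varint_len(zigzag(64)) == 2  # larger values start needing more bytes
```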

.. admonition:: Logging pw::status

  Note that while this immediately reduces transmission size, the code
  space occupied by the string names in ``pw::Status::str()`` cannot be
  recovered unless an entire project is converted to log ``pw::Status``
  as tokens.

  .. code-block:: cpp

     #include "pw_log/log.h"
     #include "pw_log/tokenized_args.h"
     #include "pw_status/status.h"

     pw::Status status = pw::Status::NotFound();

     // "pw::Status: ${pw::Status}#%08d"
     PW_LOG("pw::Status: " PW_LOG_TOKEN(pw::Status), status.code())
     // "pw::Status: NOT_FOUND"

Since the token mapping entries in the ELF are optimized out of the final
binary, the enum domains are tokenized away as part of the log format strings,
and we don't need to store separate tokens for each enum value, this addition
to the API would provide enum value names in logs with zero additional
RAM cost. Compared to logging strings with ``ToString``-style functions, we
save space on the string names as well as the functions themselves.

Token Database
==============
Token databases will be expanded to include a column for domains, so that
multiple domains can be encompassed in a single database rather than requiring
separate databases for each domain. This is important because domains are being
used to categorize tokens within a single project, rather than merely keeping
separate projects distinct from each other. When creating a database
from an ELF, a domain may be specified as the default domain instead of the
empty domain. A list of domains, or a path to a file with a list of domains,
may also separately be specified to define which domains are to be included in
the database; all domains are included by default.

When accessing a token database, both a domain and a token value may be
specified to access specific values. If a domain is not specified, the default
domain will be assumed, retaining the same behavior as before.
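
As a sketch, a domain-aware database behaves like a two-level mapping; the
``db`` shape and ``lookup`` helper below are hypothetical illustrations, not
the actual ``pw_tokenizer`` API:

```python
# Hypothetical shape for a domain-aware database: {domain: {token: string}}.
db = {
    '': {0x16170ADF: 'Status: ${pw::Status}#%08x'},
    'pw::Status': {5: 'PW_STATUS_NOT_FOUND'},
}

def lookup(db, token, domain=None, default_domain=''):
    """Look up a token, falling back to the default domain when none is given."""
    effective = default_domain if domain is None else domain
    return db.get(effective, {}).get(token)

assert lookup(db, 5, domain='pw::Status') == 'PW_STATUS_NOT_FOUND'
assert lookup(db, 0x16170ADF) == 'Status: ${pw::Status}#%08x'  # default domain
```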

Detokenization
==============
Detokenization is relatively straightforward. When the detokenizer is called,
it first detokenizes and formats the top-level token and its binary argument
data. The detokenizer then finds and replaces nested tokens in the resulting
formatted string, rescanning the result for more nested tokens up to a fixed
number of rescans.

For each token type or format, ``pw_tokenizer`` defines a regular expression to
match the expected formatted output token and a helper function to convert a
token from a particular format to its mapped value. The regular expressions for
each token type are combined into a single regex that matches any one of the
formats. At each recursive step, every detokenization format is attempted for
each match, stopping at the first successful token type and then recursively
replacing all nested tokens in the result. Only tokens in full data encodings
like Base64 also require string/argument formatting as part of the recursive
step.
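
The recursive replacement loop can be sketched as follows. The database, the
regex (base-16 tokens only), and the rescan bound are all illustrative
simplifications of the design described above:

```python
import re

MAX_RESCANS = 10  # fixed bound on recursive passes (value illustrative)

# Toy database keyed by (domain, token value).
DB = {
    ('pw::Status', 0x5): 'PW_STATUS_NOT_FOUND',
}

# Matches $[{DOMAIN}]#XXXXXXXX, the fixed-width base-16 form.
TOKEN_RE = re.compile(r'\$(?:\{(?P<domain>[^}]*)\})?#(?P<hex>[0-9A-Fa-f]{8})')

def _replace(match: re.Match) -> str:
    key = (match.group('domain') or '', int(match.group('hex'), 16))
    result = DB.get(key)
    return result if result is not None else match.group(0)  # leave as-is

def detokenize_nested(text: str) -> str:
    """Replace nested tokens, rescanning until stable or the bound is hit."""
    for _ in range(MAX_RESCANS):
        new_text = TOKEN_RE.sub(_replace, text)
        if new_text == text:  # nothing left to replace
            return new_text
        text = new_text
    return text

print(detokenize_nested('Status: ${pw::Status}#00000005'))
# Status: PW_STATUS_NOT_FOUND
```

Tokens that fail to resolve are returned unchanged, which is what allows them
to appear in their raw form in the final logs.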

For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
determines its set of permissible alphanumeric characters and the
maximum token width for regex matching.

If nested detokenization fails for any reason, the formatted token will be
printed as-is in the output logs. If ``show_errors`` is true for the
detokenizer, errors will appear in parentheses immediately following the
token. Supported errors include:

* ``(token collision)``
* ``(missing database)``
* ``(token not found)``

------------
Alternatives
------------

Protobuf-based Tokenization
===========================
Tokenization may be expanded to operate on structured data via protobufs.
This could make logging more flexible, as all manner of compile-time
metadata could be freely attached to log arguments at effectively no cost.
It would most likely involve a separate build process to generate and tokenize
partially-populated protos and would significantly change the user API. It
would also be a large break from the existing implementation, as the current
system relies only on existing C preprocessor and C++ ``constexpr`` tricks to
function.

In this model, the token domain would likely be a fully qualified namespace
for, or path to, the proto definition.

Implementing this approach also requires a method of passing ordered arguments
to a partially-filled detokenized protobuf in a manner similar to printf-style
string formatting, so that argument data can be efficiently encoded and
transmitted alongside the protobuf's token, and the arguments to a particular
proto can be disambiguated from arguments to the rest of a log statement.

This approach would also most likely preclude plain string logging as is
currently supported by ``pw_log``, as the implementations diverge dramatically.
However, if pursued, this would likely be made the default logging schema
across all platforms, including host devices.

Custom Detokenization
=====================
Theoretically, individual projects could implement their own regex replacement
schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
complex relationships between logged tokens via custom log format string
syntax. However, Pigweed should provide utilities for nested tokenization in
common cases such as logging enums.

The changes proposed do not preclude additional custom detokenization schemas
if absolutely necessary, and such practices do not appear to have been popular
thus far in any case.

--------------
Open questions
--------------
Missing API definitions:

* Updated APIs for creating and accessing token databases with multiple domains
* Python nested tokenization
* C++ nested detokenization