:tocdepth: 3

.. _module-pw_tokenizer-tokenization:

============
Tokenization
============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off-device or stored in place of a full string.

--------
Concepts
--------
See :ref:`module-pw_tokenizer-get-started-overview` for a high-level
explanation of how ``pw_tokenizer`` works.

Token generation: fixed-length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, tokenization uses a constexpr function instead of a macro. This
function works with any length of string and has a lower compilation time
impact than the C macros. For consistency, C++ tokenization uses the same hash
algorithm, but the calculated values will differ between C and C++ for strings
longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

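The algorithm is simple enough to sketch in ordinary C++: seed the hash with
the string length, then accumulate each character scaled by successive powers
of 65599, with normal unsigned wraparound. The sketch below is an illustration
of the documented scheme, not the library's implementation; the name
``Hash65599`` and the optional ``max_chars`` parameter (which imitates the C
hash length limit) are this example's inventions.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <string_view>

   // Illustrative sketch of the pw_tokenizer 65599 hash: seed with the string
   // length, then add each character scaled by successive powers of 65599.
   // Characters past max_chars are ignored (as with the C hash length limit),
   // but the seed still uses the full string length.
   constexpr uint32_t Hash65599(std::string_view str,
                                size_t max_chars = SIZE_MAX) {
     uint32_t hash = static_cast<uint32_t>(str.size());
     uint32_t coefficient = 65599u;
     size_t count = str.size() < max_chars ? str.size() : max_chars;
     for (size_t i = 0; i < count; ++i) {
       hash += coefficient * static_cast<uint8_t>(str[i]);
       coefficient *= 65599u;  // Unsigned overflow wraps, as intended.
     }
     return hash;
   }

If the sketch is faithful, hashing the example string from the next section,
``You can go about your business.``, yields the token 0xdac9a244 quoted there.
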
Token encoding
==============
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-ints>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single-precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.

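As a self-contained illustration of this layout, the sketch below encodes a
message with one integer argument: the little-endian token followed by the
ZigZag/varint encoding of the value. The helper names (``ZigZagEncode``,
``AppendVarint``, ``EncodeMessage``) are this example's own, not library APIs.
With the format string ``This is an example: %d!`` (token 0x4b016e66, used in
the Base64 example later on this page) and the argument -1, it reproduces the
5-byte message ``66 6e 01 4b 01``.

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   // ZigZag-map a signed value to unsigned so small magnitudes stay small,
   // as in Protocol Buffers: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   constexpr uint32_t ZigZagEncode(int32_t value) {
     return (static_cast<uint32_t>(value) << 1) ^
            static_cast<uint32_t>(value >> 31);
   }

   // Append a varint: 7 bits per byte, high bit set on all but the last byte.
   void AppendVarint(uint32_t value, std::vector<uint8_t>& out) {
     do {
       uint8_t byte = value & 0x7f;
       value >>= 7;
       if (value != 0) {
         byte |= 0x80;
       }
       out.push_back(byte);
     } while (value != 0);
   }

   // Encode a tokenized message: little-endian token, then each argument.
   std::vector<uint8_t> EncodeMessage(uint32_t token, int32_t int_arg) {
     std::vector<uint8_t> message;
     for (int shift = 0; shift < 32; shift += 8) {
       message.push_back(static_cast<uint8_t>(token >> shift));
     }
     AppendVarint(ZigZagEncode(int_arg), message);
     return message;
   }
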
.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as an
   integer instead of a string). See also
   :ref:`module-pw_tokenizer-nested-arguments`.

.. _module-pw_tokenizer-proto:

Tokenized fields in protocol buffers
====================================
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   import "pw_tokenizer_proto/options.proto";

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

-----------------------
Tokenization in C++ / C
-----------------------
To tokenize a string, include ``pw_tokenizer/tokenize.h`` and invoke one of the
``PW_TOKENIZE_*`` macros.

Tokenize string literals outside of expressions
===============================================
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments:

* :c:macro:`PW_TOKENIZE_STRING`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN`
* :c:macro:`PW_TOKENIZE_STRING_MASK`

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
  :class: checkmark

  .. code-block:: cpp

     constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

     void Function() {
       constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
     }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
  :class: error

  .. code-block:: cpp

     void BadExample() {
       ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
     }

  Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

Tokenize inside expressions
===========================
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to ``constexpr`` variables or be used with
special function variables like ``__func__``.

* :c:macro:`PW_TOKENIZE_STRING_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`

.. admonition:: When to use these macros

  Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
  literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
  :class: checkmark

  .. code-block:: cpp

     void GoodExample() {
       ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
     }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
  :class: error

  .. code-block:: cpp

     constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr`` variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
  :class: error

  .. code-block:: cpp

     void BadExample() {
       // This compiles, but __func__ will not be the outer function's name, and
       // there may be compiler warnings.
       constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
     }

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or similar macros.

Tokenize a message with arguments to a buffer
=============================================
* :c:macro:`PW_TOKENIZE_TO_BUFFER`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_DOMAIN`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.

.. _module-pw_tokenizer-nested-arguments:

Tokenize nested arguments
=========================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. Tokens can therefore be used to replace
string arguments to tokenized format strings.

* :c:macro:`PW_TOKEN_FMT`

.. admonition:: Logging nested tokens

  Users will typically interact with nested token arguments during logging.
  In this case there is a slightly different interface described by
  :ref:`module-pw_log-tokenized-args` that does not generally invoke
  ``PW_TOKEN_FMT`` directly.

The format specifier for a token is given by the PRI-style macro
``PW_TOKEN_FMT()``, which is concatenated to the rest of the format string by
the C preprocessor.

.. code-block:: cpp

   PW_TOKENIZE_FORMAT_STRING("margarine_domain",
                             UINT32_MAX,
                             "I can't believe it's not " PW_TOKEN_FMT() "!",
                             PW_TOKENIZE_STRING_EXPR("butter"));

This feature is currently only supported by the Python detokenizer.

Nested token format
-------------------
Nested tokens have the following format within strings:

.. code-block::

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

The optional ``DOMAIN`` specifies the token domain. If this option is omitted,
the default (empty) domain is assumed.

The optional ``BASE`` defines the numeric base encoding of the token. Accepted
values are 8, 10, 16, and 64. If the hash symbol ``#`` is used without
specifying a number, the base is assumed to be 16. If the base option is
omitted entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case-insensitive. This may be expanded to support
other bases in the future.

Non-Base64 tokens are encoded strictly as 32-bit integers with padding.
Base64 data may additionally encode string arguments for the detokenized token,
and therefore does not have a maximum width.

The meaning of ``TOKEN`` depends on the current phase of transformation for the
current tokenized format string. Within the format string's entry in the token
database, when the actual value of the token argument is not known, ``TOKEN`` is
a printf argument specifier (e.g. ``%08x`` for a base-16 token with correct
padding). The actual tokens that will be used as arguments have separate
entries in the token database.

After the top-level format string has been detokenized and formatted, ``TOKEN``
should be the value of the token argument in the specified base, with any
necessary padding. This is the final format of a nested token if it cannot be
detokenized.

.. list-table:: Example tokens
   :widths: 10 25 25

   * - Base
     - | Token database
       | (within format string entry)
     - Partially detokenized
   * - 10
     - ``$10#%010d``
     - ``$10#0086025943``
   * - 16
     - ``$#%08x``
     - ``$#0000001A``
   * - 64
     - ``%s``
     - ``$QA19pfEQ``

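The base-16 row above can be produced with ordinary printf-style formatting.
This hypothetical helper (``NestedBase16Token`` is this example's name, not a
library API) renders a token value as a ``$``-prefixed base-16 nested token
with the required zero padding:

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <string>

   // Render a 32-bit token as a base-16 nested token: "$#" followed by the
   // value padded to 8 hex digits. Non-Base64 bases are case-insensitive.
   std::string NestedBase16Token(uint32_t token) {
     char buffer[16];
     std::snprintf(buffer, sizeof(buffer), "$#%08X", token);
     return std::string(buffer);
   }

For example, ``NestedBase16Token(0x1A)`` yields ``$#0000001A``, matching the
partially detokenized column in the table.
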
.. _module-pw_tokenizer-custom-macro:

Tokenizing enums
================
Logging enums is one common special case where tokenization is particularly
appropriate: enum values are conceptually already tokens mapping to their
names, assuming no duplicate values.

:c:macro:`PW_TOKENIZE_ENUM` takes a fully qualified enum name along with all
of the associated enum values. This macro creates database entries that
include the domain name (the fully qualified enum name), the enum value, and a
tokenized form of the enum value.

The macro also supports returning the string version of the enum value in the
case that there is a non-tokenizing backend, using
:cpp:func:`pw::tokenizer::EnumToString`.

All enum values in the enum declaration must be present in the macro, and the
macro must be in the same namespace as the enum to be able to use the
:cpp:func:`pw::tokenizer::EnumToString` function and avoid compiler errors.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum]
   :end-before: [pw_tokenizer-examples-enum]

:c:macro:`PW_TOKENIZE_ENUM_CUSTOM` is an alternative to
:c:macro:`PW_TOKENIZE_ENUM` that tokenizes a custom string instead of a
stringified form of the enum value's name. It takes a fully qualified enum
name along with all the associated enum values and a custom string for each
value. This macro creates database entries that include the domain name (the
fully qualified enum name), the enum value, and a tokenized form of the custom
string for the enum value.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum-custom]
   :end-before: [pw_tokenizer-examples-enum-custom]

Tokenize a message with arguments in a custom macro
===================================================
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass tokenized
data to a global handler function. A project's custom tokenization macro can
handle tokenized data in a function of its choosing. The function may accept
any arguments, but its final arguments must be:

* The 32-bit token (:cpp:type:`pw_tokenizer_Token`)
* The argument types (:cpp:type:`pw_tokenizer_ArgTypes`)
* Variadic arguments, if any

``pw_tokenizer`` provides two low-level macros to help projects create custom
tokenization macros:

* :c:macro:`PW_TOKENIZE_FORMAT_STRING`
* :c:macro:`PW_TOKENIZER_REPLACE_FORMAT_STRING`

.. caution::

   Note the spelling difference! The first macro begins with ``PW_TOKENIZE_``
   (no ``R``) whereas the second begins with ``PW_TOKENIZER_``.

Use these macros to invoke an encoding function with the token, argument types,
and variadic arguments. The function can then encode the tokenized message to a
buffer using helpers in ``pw_tokenizer/encode_args.h``:

.. Note: pw_tokenizer_EncodeArgs is a C function so you would expect to
.. reference it as :c:func:`pw_tokenizer_EncodeArgs`. That doesn't work because
.. it's defined in a header file that mixes C and C++.

* :cpp:func:`pw::tokenizer::EncodeArgs`
* :cpp:class:`pw::tokenizer::EncodedMessage`
* :cpp:func:`pw_tokenizer_EncodeArgs`

Example
-------
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)          \
     do {                                                                  \
       PW_TOKENIZE_FORMAT_STRING("logs", UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(                                             \
           metadata, PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__));     \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include <cstdarg>

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<kLogBufferSize> encoded_message(
         token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: Why use a custom macro

   - Optimal code size. Invoking a free function with the tokenized data results
     in the smallest possible call site.
   - Pass additional arguments, such as metadata, with the tokenized message.
   - Integrate ``pw_tokenizer`` with other systems.

Tokenizing function names
=========================
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Calculate minimum required buffer size
======================================
See :cpp:func:`pw::tokenizer::MinEncodingBufferSizeBytes`.

.. _module-pw_tokenizer-base64-format:

Encoding Base64
===============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

    Plain text: This is an example: -1! [23 bytes]

        Binary: 66 6e 01 4b 01          [ 5 bytes]

        Base64: $Zm4BSwE=               [ 9 bytes]

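The transform is plain Base64 with a ``$`` prefix. As a standalone sketch
(assuming the standard Base64 alphabet with ``=`` padding; ``PrefixedBase64``
is this example's name, and real code should use the library functions shown
below), the following reproduces the example above:

.. code-block:: cpp

   #include <cstdint>
   #include <string>
   #include <vector>

   // Prefix with '$', then standard Base64 with '=' padding.
   std::string PrefixedBase64(const std::vector<uint8_t>& data) {
     static constexpr char kAlphabet[] =
         "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
     std::string out = "$";
     size_t i = 0;
     for (; i + 3 <= data.size(); i += 3) {  // Whole 3-byte groups.
       uint32_t n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
       out += kAlphabet[(n >> 18) & 63];
       out += kAlphabet[(n >> 12) & 63];
       out += kAlphabet[(n >> 6) & 63];
       out += kAlphabet[n & 63];
     }
     if (i < data.size()) {  // 1 or 2 trailing bytes.
       uint32_t n = data[i] << 16;
       if (i + 1 < data.size()) {
         n |= data[i + 1] << 8;
       }
       out += kAlphabet[(n >> 18) & 63];
       out += kAlphabet[(n >> 12) & 63];
       out += (i + 1 < data.size()) ? kAlphabet[(n >> 6) & 63] : '=';
       out += '=';
     }
     return out;
   }

Encoding the binary message ``66 6e 01 4b 01`` with this sketch produces
``$Zm4BSwE=``, as in the example above.
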
To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }

.. _module-pw_tokenizer-masks:

Reduce token size with masking
==============================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros:

* :c:macro:`PW_TOKENIZE_STRING_MASK`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
:ref:`module-pw_tokenizer-collisions`.

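The effect of token width on collision risk can be estimated with the standard
birthday-problem approximation (an illustration added here, not something the
library computes): with ``n`` strings and ``b``-bit tokens, the chance of at
least one collision is roughly ``1 - exp(-n(n-1) / 2^(b+1))``.

.. code-block:: cpp

   #include <cmath>

   // Birthday-bound approximation of the probability that n random b-bit
   // tokens contain at least one collision: 1 - exp(-n(n-1) / 2^(b+1)).
   double ApproxCollisionProbability(double n, int bits) {
     return 1.0 - std::exp(-n * (n - 1.0) / std::ldexp(1.0, bits + 1));
   }

For example, with 1000 strings, 16-bit tokens are nearly certain to collide,
while 32-bit tokens collide with probability on the order of one in ten
thousand.
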
Masked tokens without arguments may be encoded in fewer bytes. For example, the
16-bit token ``0x1234`` may be encoded as two little-endian bytes (``34 12``)
rather than four (``34 12 00 00``). The detokenizer tools zero-pad data smaller
than four bytes. Tokens with arguments must always be encoded as four bytes.
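
The zero-padding step is simple to sketch. This hypothetical helper (the real
logic lives in the detokenizer tools; ``ZeroPadToken`` is this example's name)
rebuilds a 32-bit token from a little-endian encoding of up to four bytes,
treating missing high-order bytes as zero:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   // Reassemble a little-endian token that may have been truncated to fewer
   // than four bytes; missing high-order bytes are treated as zero.
   uint32_t ZeroPadToken(const uint8_t* bytes, size_t size) {
     uint32_t token = 0;
     for (size_t i = 0; i < size && i < 4; ++i) {
       token |= static_cast<uint32_t>(bytes[i]) << (8 * i);
     }
     return token;
   }

For instance, the two bytes ``34 12`` decode to the 16-bit token ``0x1234``,
the same value produced by decoding ``34 12 00 00``.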

.. _module-pw_tokenizer-domains:

Keep tokens from different sources separate with domains
========================================================
``pw_tokenizer`` supports having multiple tokenization domains. A domain is a
string label associated with each tokenized string. Domains allow projects to
keep tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

When a domain is specified, any whitespace in the domain name is ignored and
removed from the database.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to loading tokens
from all domains. The domain may be specified for ELF files by appending
``#DOMAIN_NAME_REGEX`` to the file path. Use ``#`` to only read from the default
domain. For example, the following reads strings in ``some_domain`` from
``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv "path/to/my_image.elf#some_domain"

See :ref:`module-pw_tokenizer-managing-token-databases` for information about
the ``database.py`` command line tool.

Limitations, bugs, and future work
==================================

.. _module-pw_tokenizer-gcc-template-bug:

GCC bug: tokenization in template functions
-------------------------------------------
GCC releases prior to 14 incorrectly ignore the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. The bug causes tokenized
strings in template functions to be emitted into ``.rodata`` instead of the
tokenized string section, so they cannot be extracted for detokenization.

Fortunately, this is simple to work around in the linker script.
``pw_tokenizer_linker_sections.ld`` includes a statement that pulls tokenized
string entries from ``.rodata`` into the tokenized string section. See
`b/321306079 <https://issues.pigweed.dev/issues/321306079>`_ for details.

If tokenization is working, but strings in templates are not appearing in token
databases, check the following:

- The full contents of the latest version of ``pw_tokenizer_linker_sections.ld``
  are included with the linker script. The linker script was updated in
  `pwrev.dev/188424 <http://pwrev.dev/188424>`_.
- The ``-fdata-sections`` compilation option is in use. This places each
  variable in its own section, which is necessary for pulling tokenized string
  entries from ``.rodata`` into the proper section.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

----------------------
Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the :func:`pw_tokenizer.encode.encode_token_and_args`
function. This function requires that a string's token already be calculated.
Typically these tokens are provided by a database, but they can be manually
created using the tokenizer hash.

:func:`pw_tokenizer.tokens.pw_tokenizer_65599_hash` is particularly useful
for offline token database generation in cases where tokenized strings in a
binary cannot be embedded as parsable pw_tokenizer entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both, and merge the databases.

.. _module-pw_tokenizer-cli-encoding:

-----------------
Encoding CLI tool
-----------------
The ``pw_tokenizer.encode`` command line tool can be used to encode
format strings and optional arguments.

.. code-block:: bash

   python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...]

Example:

.. code-block:: text

   $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them
         Raw input: "There's... %d many of %s!" % (2, 'them')
   Formatted input: There's... 2 many of them!
             Token: 0xb6ef8b2d
           Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes]
   Prefixed Base64: $LYvvtgQEdGhlbQ==

See ``--help`` for full usage details.

--------
Appendix
--------

Case study
==========
.. note:: This section discusses the implementation, results, and lessons
   learned from a real-world deployment of ``pw_tokenizer``.

The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
* Log contents shrunk by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function were
  replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function the log messages were encoded in the $-prefixed
  :ref:`module-pw_tokenizer-base64-format`, then dispatched as normal log messages.
* Asserts were tokenized using a callback-based API that has been removed (a
  :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a better
  alternative).

.. attention::
  Do not encode line numbers in tokenized strings. This results in a huge
  number of lines being added to the database, since every time code moves,
  new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
  numbers are encoded in the log metadata. Line numbers may also be included
  by adding ``"%d"`` to the format string and passing ``__LINE__``.


.. _module-pw_tokenizer-database-management:

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their code
  changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV database.
   Use the :ref:`module-pw_tokenizer-directory-database-format` instead.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in APKs
     or binaries.
755