:tocdepth: 3

.. _module-pw_tokenizer-tokenization:

============
Tokenization
============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off device or stored in place of a full string.

--------
Concepts
--------
See :ref:`module-pw_tokenizer-get-started-overview` for a high-level
explanation of how ``pw_tokenizer`` works.

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used
by the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility
with macros, the hash must be limited to a fixed maximum number of characters.
This value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

C++ code uses a constexpr function instead of a macro. This function works
with any length of string and has a lower compilation time impact than the C
macros. For consistency, C++ tokenization uses the same hash algorithm, but
the calculated values will differ between C and C++ for strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

Token encoding
==============
The token is a 32-bit hash calculated during compilation. The string is
encoded little-endian with the token followed by arguments, if any. For
example, the 31-byte string ``You can go about your business.`` hashes to
0xdac9a244. This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-ints>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.

.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as
   an integer instead of a string). See also
   :ref:`module-pw_tokenizer-nested-arguments`.
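For illustration, the integer argument scheme above can be sketched in a few
lines of C++. This is not the actual ``pw_tokenizer`` encoder (the real
implementation lives behind ``pw_tokenizer/encode_args.h``); it only
demonstrates the ZigZag and varint steps:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   // Illustrative only: encode a signed integer as ZigZag + varint.
   // Returns the number of bytes written (1-10 for a 64-bit value).
   size_t EncodeZigZagVarint(int64_t value, uint8_t out[10]) {
     // ZigZag maps values of small magnitude to small unsigned values:
     // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
     uint64_t n = (static_cast<uint64_t>(value) << 1) ^
                  static_cast<uint64_t>(value >> 63);
     size_t count = 0;
     do {
       // Varint: 7 data bits per byte; the top bit flags a continuation.
       const uint8_t low_bits = n & 0x7Fu;
       n >>= 7;
       out[count++] =
           static_cast<uint8_t>(n == 0 ? low_bits : (low_bits | 0x80u));
     } while (n != 0);
     return count;
   }

For example, the argument ``-1`` encodes to the single byte ``0x01``, which
matches the encoded ``%d`` argument in the Base64 example later on this page.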
.. _module-pw_tokenizer-proto:

Tokenized fields in protocol buffers
====================================
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain
a tokenized string. A string that is optionally tokenized is represented with
a single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   import "pw_tokenizer_proto/options.proto";

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

-----------------------
Tokenization in C++ / C
-----------------------
To tokenize a string, include ``pw_tokenizer/tokenize.h`` and invoke one of
the ``PW_TOKENIZE_*`` macros.

Tokenize string literals outside of expressions
===============================================
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments:

* :c:macro:`PW_TOKENIZE_STRING`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN`
* :c:macro:`PW_TOKENIZE_STRING_MASK`

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
   :class: checkmark

   .. code-block:: cpp

      constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

      void Function() {
        constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
      }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
   :class: error

   .. code-block:: cpp

      void BadExample() {
        ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
      }

   Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

Tokenize inside expressions
===========================
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to ``constexpr`` variables or be used with
special function variables like ``__func__``.

* :c:macro:`PW_TOKENIZE_STRING_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`

.. admonition:: When to use these macros

   Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
   literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
   :class: checkmark

   .. code-block:: cpp

      void GoodExample() {
        ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
      }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
   :class: error

   .. code-block:: cpp

      constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr``
   variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
   :class: error

   .. code-block:: cpp

      void BadExample() {
        // This compiles, but __func__ will not be the outer function's name,
        // and there may be compiler warnings.
        constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
      }

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or
   similar macros.

Tokenize a message with arguments to a buffer
=============================================
* :c:macro:`PW_TOKENIZE_TO_BUFFER`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_DOMAIN`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`
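For example, a message can be encoded to a stack buffer and handed to a
transport. A minimal sketch of typical usage; ``TransmitPacket`` is a
hypothetical function standing in for whatever consumes the encoded bytes:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   #include "pw_tokenizer/tokenize.h"

   void TransmitPacket(const uint8_t* data, size_t size);  // Hypothetical.

   void SendBatteryStatus(int battery_mv) {
     uint8_t buffer[32];
     size_t size_bytes = sizeof(buffer);
     // size_bytes is passed in as the buffer size and is updated to the
     // number of bytes actually encoded.
     PW_TOKENIZE_TO_BUFFER(
         buffer, &size_bytes, "Battery voltage: %d mV", battery_mv);
     TransmitPacket(buffer, size_bytes);
   }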
.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.

.. _module-pw_tokenizer-nested-arguments:

Tokenize nested arguments
=========================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. Tokens can therefore be used to replace
string arguments to tokenized format strings.

* :c:macro:`PW_TOKEN_FMT`

.. admonition:: Logging nested tokens

   Users will typically interact with nested token arguments during logging.
   In this case there is a slightly different interface, described in
   :ref:`module-pw_log-tokenized-args`, which does not generally invoke
   ``PW_TOKEN_FMT`` directly.

The format specifier for a token is given by the PRI-style macro
``PW_TOKEN_FMT()``, which is concatenated to the rest of the format string by
the C preprocessor.

.. code-block:: cpp

   PW_TOKENIZE_FORMAT_STRING("margarine_domain",
                             UINT32_MAX,
                             "I can't believe it's not " PW_TOKEN_FMT() "!",
                             PW_TOKENIZE_STRING_EXPR("butter"));

This feature is currently only supported by the Python detokenizer.

Nested token format
-------------------
Nested tokens have the following format within strings:

.. code-block:: text

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

The optional ``DOMAIN`` specifies the token domain. If this option is omitted,
the default (empty) domain is assumed.

The optional ``BASE`` defines the numeric base encoding of the token. Accepted
values are 8, 10, 16, and 64. If the hash symbol ``#`` is used without
specifying a number, the base is assumed to be 16. If the base option is
omitted entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case insensitive. Support for other bases may be
added in the future.

Non-Base64 tokens are encoded strictly as 32-bit integers with padding. Base64
data may additionally encode string arguments for the detokenized token, and
therefore does not have a maximum width.

The meaning of ``TOKEN`` depends on the current phase of transformation for
the current tokenized format string. Within the format string's entry in the
token database, when the actual value of the token argument is not known,
``TOKEN`` is a printf argument specifier (e.g. ``%08x`` for a base-16 token
with correct padding). The actual tokens that will be used as arguments have
separate entries in the token database.

After the top-level format string has been detokenized and formatted,
``TOKEN`` should be the value of the token argument in the specified base,
with any necessary padding. This is the final format of a nested token if it
cannot be detokenized.

.. list-table:: Example tokens
   :widths: 10 25 25
   :header-rows: 1

   * - Base
     - | Token database
       | (within format string entry)
     - Partially detokenized
   * - 10
     - ``$10#%010d``
     - ``$10#0086025943``
   * - 16
     - ``$#%08x``
     - ``$#0000001A``
   * - 64
     - ``%s``
     - ``$QA19pfEQ``
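To make these phases concrete, consider the base-16 row of the table above
embedded in a larger format string. The format string and the final value are
hypothetical; the last line assumes that nested token ``0x1A`` maps to ``OK``
in the token database:

.. code-block:: text

   Token database entry:   Self test result: $#%08x
   Partially detokenized:  Self test result: $#0000001A
   Fully detokenized:      Self test result: OK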
Tokenizing enums
================
Logging enums is one common special case where tokenization is particularly
appropriate: enum values are conceptually already tokens mapping to their
names, assuming no duplicate values.

:c:macro:`PW_TOKENIZE_ENUM` takes a fully qualified enum name along with all
of the associated enum values. This macro creates database entries that
include the domain name (the fully qualified enum name), the enum value, and
a tokenized form of the enum value.

The macro also supports returning the string version of the enum value when
there is a non-tokenizing backend, using
:cpp:func:`pw::tokenizer::EnumToString`.

All enum values in the enum declaration must be present in the macro, and the
macro must be in the same namespace as the enum, to be able to use the
:cpp:func:`pw::tokenizer::EnumToString` function and avoid compiler errors.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum]
   :end-before: [pw_tokenizer-examples-enum]

:c:macro:`PW_TOKENIZE_ENUM_CUSTOM` is an alternative version of
:c:macro:`PW_TOKENIZE_ENUM` that tokenizes a custom string instead of a
stringified form of the enum value name. It takes a fully qualified enum name
along with all of the associated enum values and a custom string for each
value. This macro creates database entries that include the domain name (the
fully qualified enum name), the enum value, and a tokenized form of the custom
string for the enum value.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum-custom]
   :end-before: [pw_tokenizer-examples-enum-custom]

.. _module-pw_tokenizer-custom-macro:

Tokenize a message with arguments in a custom macro
===================================================
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass
tokenized data to a global handler function. A project's custom tokenization
macro can handle tokenized data in a function of its choosing. The function
may accept any arguments, but its final arguments must be:

* The 32-bit token (:cpp:type:`pw_tokenizer_Token`)
* The argument types (:cpp:type:`pw_tokenizer_ArgTypes`)
* Variadic arguments, if any

``pw_tokenizer`` provides two low-level macros to help projects create custom
tokenization macros:

* :c:macro:`PW_TOKENIZE_FORMAT_STRING`
* :c:macro:`PW_TOKENIZER_REPLACE_FORMAT_STRING`

.. caution::

   Note the spelling difference! The first macro begins with ``PW_TOKENIZE_``
   (no ``R``) whereas the second begins with ``PW_TOKENIZER_``.

Use these macros to invoke an encoding function with the token, argument
types, and variadic arguments. The function can then encode the tokenized
message to a buffer using helpers in ``pw_tokenizer/encode_args.h``:

.. Note: pw_tokenizer_EncodeArgs is a C function so you would expect to
.. reference it as :c:func:`pw_tokenizer_EncodeArgs`. That doesn't work because
.. it's defined in a header file that mixes C and C++.

* :cpp:func:`pw::tokenizer::EncodeArgs`
* :cpp:class:`pw::tokenizer::EncodedMessage`
* :cpp:func:`pw_tokenizer_EncodeArgs`

Example
-------
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.
.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)           \
     do {                                                                   \
       PW_TOKENIZE_FORMAT_STRING("logs", UINT32_MAX, format, __VA_ARGS__);  \
       EncodeTokenizedMessage(                                              \
           metadata, PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__));      \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or the
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<kLogBufferSize> encoded_message(
         token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: Why use a custom macro

   - Optimal code size. Invoking a free function with the tokenized data
     results in the smallest possible call site.
   - Pass additional arguments, such as metadata, with the tokenized message.
   - Integrate ``pw_tokenizer`` with other systems.

Tokenizing function names
=========================
The string literal tokenization functions support tokenizing string literals
or constexpr character arrays (``constexpr const char[]``). In GCC and Clang,
the special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are
declared as ``static constexpr char[]`` in C++ instead of the standard
``static const char[]``. This means that ``__func__`` and
``__PRETTY_FUNCTION__`` can be tokenized while compiling C++ with GCC or
Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Calculate minimum required buffer size
======================================
See :cpp:func:`pw::tokenizer::MinEncodingBufferSizeBytes`.
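For example, an encoding buffer can be sized at compile time from the argument
types. A minimal sketch, assuming ``MinEncodingBufferSizeBytes`` is a function
template parameterized on the argument types:

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   #include "pw_tokenizer/encode_args.h"

   // Fits the 4-byte token plus an int and a string argument (assumes the
   // template-over-argument-types form of the function).
   constexpr size_t kBufferSize =
       pw::tokenizer::MinEncodingBufferSizeBytes<int, const char*>();

   uint8_t encoding_buffer[kBufferSize];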
.. _module-pw_tokenizer-base64-format:

Encoding Base64
===============
The tokenizer encodes messages to a compact binary representation.
Applications may desire a textual representation of tokenized strings. This
makes it easy to use tokenized messages alongside plain text messages, but
comes at a small efficiency cost: encoded Base64 messages occupy about 4/3
(133%) as much memory as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or
``pw_tokenizer_PrefixedBase64Encode`` in the tokenizer handler function. For
example:

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }

.. _module-pw_tokenizer-masks:

Reduce token size with masking
==============================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of
the token directly affects memory usage. In those cases, every bit counts, and
it may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token is the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the
macros:

* :c:macro:`PW_TOKENIZE_STRING_MASK`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token.
See :ref:`module-pw_tokenizer-collisions`.

Masked tokens without arguments may be encoded in fewer bytes. For example,
the 16-bit token ``0x1234`` may be encoded as two little-endian bytes
(``34 12``) rather than four (``34 12 00 00``). The detokenizer tools zero-pad
data smaller than four bytes. Tokens with arguments must always be encoded as
four bytes.

.. _module-pw_tokenizer-domains:

Keep tokens from different sources separate with domains
========================================================
``pw_tokenizer`` supports having multiple tokenization domains. Domains are
string labels associated with each tokenized string. This allows projects to
keep tokens from different sources separate. Potential use cases include the
following (a sketch combining both follows the list):

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.
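For example, a small set of metrics strings could use 16-bit masked tokens in
a dedicated domain so they can be extracted into their own database. The
``metrics`` domain name and the string are illustrative:

.. code-block:: cpp

   #include <cstdint>

   #include "pw_tokenizer/tokenize.h"

   // A 16-bit token in a dedicated "metrics" domain, kept separate from the
   // project's other (default-domain) tokens.
   constexpr uint32_t kMetricToken =
       PW_TOKENIZE_STRING_MASK("metrics", 0xFFFF, "battery_voltage_mv");

A database containing only this domain can then be created by selecting the
domain when reading the ELF file, as described below.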
When a domain is specified, any whitespace in the domain name is ignored and
removed from the database.

If no domain is specified, the domain is empty (``""``). For many projects,
this default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to loading tokens
from all domains. The domain may be specified for ELF files by appending
``#DOMAIN_NAME_REGEX`` to the file path. Use ``#`` to only read from the
default domain. For example, the following reads strings in ``some_domain``
from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv "path/to/my_image.elf#some_domain"

See :ref:`module-pw_tokenizer-managing-token-databases` for information about
the ``database.py`` command line tool.

Limitations, bugs, and future work
==================================

.. _module-pw_tokenizer-gcc-template-bug:

GCC bug: tokenization in template functions
-------------------------------------------
GCC releases prior to 14 incorrectly ignore the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and
`variables <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. The bug
causes tokenized strings in template functions to be emitted into ``.rodata``
instead of the tokenized string section, so they cannot be extracted for
detokenization.

Fortunately, this is simple to work around in the linker script.
``pw_tokenizer_linker_sections.ld`` includes a statement that pulls tokenized
string entries from ``.rodata`` into the tokenized string section. See
`b/321306079 <https://issues.pigweed.dev/issues/321306079>`_ for details.

If tokenization is working, but strings in templates are not appearing in
token databases, check the following:

- The full contents of the latest version of
  ``pw_tokenizer_linker_sections.ld`` are included with the linker script. The
  linker script was updated in `pwrev.dev/188424 <http://pwrev.dev/188424>`_.
- The ``-fdata-sections`` compilation option is in use. This places each
  variable in its own section, which is necessary for pulling tokenized string
  entries from ``.rodata`` into the proper section.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be
verified by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions.
That 601is because tokenization requires declaring a character array for each tokenized 602string. If the tokenized string includes macros that change value, the size of 603this character array changes, which means the same static variable is defined 604with different sizes. It should be safe to suppress these warnings, but, when 605possible, code that tokenizes strings with macros that can change value should 606be moved to source files rather than headers. 607 608---------------------- 609Tokenization in Python 610---------------------- 611The Python ``pw_tokenizer.encode`` module has limited support for encoding 612tokenized messages with the :func:`pw_tokenizer.encode.encode_token_and_args` 613function. This function requires a string's token is already calculated. 614Typically these tokens are provided by a database, but they can be manually 615created using the tokenizer hash. 616 617:func:`pw_tokenizer.tokens.pw_tokenizer_65599_hash` is particularly useful 618for offline token database generation in cases where tokenized strings in a 619binary cannot be embedded as parsable pw_tokenizer entries. 620 621.. note:: 622 In C, the hash length of a string has a fixed limit controlled by 623 ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed 624 to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching 625 hash length limit. When creating an offline database, it's a good idea to 626 generate tokens for both, and merge the databases. 627 628.. _module-pw_tokenizer-cli-encoding: 629 630----------------- 631Encoding CLI tool 632----------------- 633The ``pw_tokenizer.encode`` command line tool can be used to encode 634format strings and optional arguments. 635 636.. code-block:: bash 637 638 python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...] 639 640Example: 641 642.. code-block:: text 643 644 $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them 645 Raw input: "There's... %d many of %s!" % (2, 'them') 646 Formatted input: There's... 2 many of them! 647 Token: 0xb6ef8b2d 648 Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes] 649 Prefixed Base64: $LYvvtgQEdGhlbQ== 650 651See ``--help`` for full usage details. 652 653-------- 654Appendix 655-------- 656 657Case study 658========== 659.. note:: This section discusses the implementation, results, and lessons 660 learned from a real-world deployment of ``pw_tokenizer``. 661 662The tokenizer module was developed to bring tokenized logging to an 663in-development product. The product already had an established text-based 664logging system. Deploying tokenization was straightforward and had substantial 665benefits. 666 667Results 668------- 669* Log contents shrunk by over 50%, even with Base64 encoding. 670 671 * Significant size savings for encoded logs, even using the less-efficient 672 Base64 encoding required for compatibility with the existing log system. 673 * Freed valuable communication bandwidth. 674 * Allowed storing many more logs in crash dumps. 675 676* Substantial flash savings. 677 678 * Reduced the size firmware images by up to 18%. 679 680* Simpler logging code. 681 682 * Removed CPU-heavy ``snprintf`` calls. 683 * Removed complex code for forwarding log arguments to a low-priority task. 684 685This section describes the tokenizer deployment process and highlights key 686insights. 
Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function
  were replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function, the log messages were encoded in the $-prefixed
  :ref:`module-pw_tokenizer-base64-format`, then dispatched as normal log
  messages.
* Asserts were tokenized using a callback-based API that has since been
  removed (a :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a
  better alternative).

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
   numbers are encoded in the log metadata. Line numbers may also be included
   by adding ``"%d"`` to the format string and passing ``__LINE__``.

.. _module-pw_tokenizer-database-management:

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their
  code changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV
   database. Use the :ref:`module-pw_tokenizer-directory-database-format`
   instead.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * A standalone script for decoding prefixed Base64 tokens in files or live
    output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.