:tocdepth: 3

.. _module-pw_tokenizer-tokenization:

============
Tokenization
============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization can
be sent off device or stored in place of a full string.

--------
Concepts
--------
See :ref:`module-pw_tokenizer-get-started-overview` for a high-level
explanation of how ``pw_tokenizer`` works.

Token generation: fixed length hashing at compile time
======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

The C++ macros hash with a ``constexpr`` function instead of a preprocessor
macro. This function works with any length of string and has lower compilation
time impact than the C macros. For consistency, C++ tokenization uses the same
hash algorithm, but the calculated values will differ between C and C++ for
strings longer than ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.
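
The effect of the length limit can be seen in a minimal sketch of an
x65599-style hash. ``HashString`` and ``kHashConstant`` are illustrative names,
and seeding the hash with the string length is an assumption about what
"modified" means here. The authoritative implementation lives in the
``pw_tokenizer`` sources.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <string_view>

   // Multiplier of the x65599 (SDBM-style) hash.
   inline constexpr uint32_t kHashConstant = 65599u;

   // Hypothetical helper: hashes at most `max_length` characters of `text`.
   // Assumption: the hash starts from the full string length, so strings that
   // share a hashed prefix but differ in length still hash differently.
   constexpr uint32_t HashString(std::string_view text, size_t max_length) {
     uint32_t hash = static_cast<uint32_t>(text.size());
     uint32_t coefficient = kHashConstant;
     for (char ch : text.substr(0, max_length)) {
       hash += coefficient * static_cast<uint8_t>(ch);
       coefficient *= kHashConstant;  // Unsigned overflow wraps by design.
     }
     return hash;
   }

   // No length limit, mirroring the C++ behavior described above.
   constexpr uint32_t HashString(std::string_view text) {
     return HashString(text, text.size());
   }

With a limit of 1, ``HashString("abc", 1)`` only mixes in ``'a'``, so it
differs from the unlimited ``HashString("abc")``. This is the same kind of
mismatch that appears between C and C++ for strings longer than the limit.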

Token encoding
==============
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.
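
As a small illustration of the little-endian layout, the sketch below
serializes a token least-significant byte first. ``EncodeToken`` is a
hypothetical helper, not part of the ``pw_tokenizer`` API.

.. code-block:: cpp

   #include <array>
   #include <cstdint>

   // Hypothetical helper: splits a 32-bit token into little-endian bytes.
   constexpr std::array<uint8_t, 4> EncodeToken(uint32_t token) {
     return {static_cast<uint8_t>(token),
             static_cast<uint8_t>(token >> 8),
             static_cast<uint8_t>(token >> 16),
             static_cast<uint8_t>(token >> 24)};
   }

Applied to the token from the example above, ``EncodeToken(0xdac9a244)`` yields
the bytes ``44 a2 c9 da``.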

Arguments are encoded as follows:

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-ints>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.
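
The integer encoding can be sketched in a few lines. This illustrates ZigZag
plus varint encoding in the style of Protocol Buffers and is not taken from the
``pw_tokenizer`` sources. ``ZigZagEncode``, ``EncodedInt``, and
``EncodeIntArgument`` are hypothetical names.

.. code-block:: cpp

   #include <array>
   #include <cstddef>
   #include <cstdint>

   // Maps signed values to unsigned ones so that small magnitudes stay small:
   // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   constexpr uint64_t ZigZagEncode(int64_t value) {
     const uint64_t shifted = static_cast<uint64_t>(value) << 1;
     return value < 0 ? ~shifted : shifted;
   }

   struct EncodedInt {
     std::array<uint8_t, 10> bytes{};  // A 64-bit varint needs at most 10 bytes.
     size_t size = 0;
   };

   // Emits 7 bits per byte, least-significant group first. The top bit of each
   // byte is set when more bytes follow.
   constexpr EncodedInt EncodeIntArgument(int64_t value) {
     EncodedInt out;
     uint64_t remaining = ZigZagEncode(value);
     do {
       uint8_t byte = static_cast<uint8_t>(remaining & 0x7Fu);
       remaining >>= 7;
       if (remaining != 0) {
         byte |= 0x80u;  // Continuation bit.
       }
       out.bytes[out.size++] = byte;
     } while (remaining != 0);
     return out;
   }

For instance, ``-1`` maps to ``1`` and takes one byte, while ``150`` maps to
``300`` and takes two bytes (``ac 02``).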

.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as an
   integer instead of a string). See also
   :ref:`module-pw_tokenizer-nested-arguments`.
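
To see why ``%s`` arguments are costly, the string layout described above (a
status byte followed by the characters) can be sketched as follows.
``EncodeStringArgument`` and its ``space`` parameter are hypothetical, and
truncating to the remaining buffer space is an assumption about how a nearly
full buffer is handled.

.. code-block:: cpp

   #include <array>
   #include <cstddef>
   #include <cstdint>
   #include <string_view>

   struct EncodedString {
     std::array<uint8_t, 128> bytes{};  // Status byte plus up to 127 characters.
     size_t size = 0;
   };

   // Hypothetical helper. `space` is the number of bytes left in the output
   // buffer, including the status byte.
   constexpr EncodedString EncodeStringArgument(std::string_view text,
                                                size_t space) {
     EncodedString out;
     if (space == 0) {
       return out;  // No room even for the status byte.
     }

     constexpr size_t kMaxLength = 0x7F;  // 7 bits of length.
     size_t length = text.size();
     if (length > kMaxLength) {
       length = kMaxLength;
     }
     if (length > space - 1) {
       length = space - 1;
     }

     // Top bit: set if any characters were dropped. Low 7 bits: the length.
     const bool truncated = length < text.size();
     out.bytes[0] = static_cast<uint8_t>(length | (truncated ? 0x80u : 0u));
     for (size_t i = 0; i < length; ++i) {
       out.bytes[1 + i] = static_cast<uint8_t>(text[i]);
     }
     out.size = length + 1;
     return out;
   }

Every character costs one byte on top of the status byte, which is why a short
integer or a nested token is usually a better fit than a long ``%s`` argument.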

.. _module-pw_tokenizer-proto:

Tokenized fields in protocol buffers
====================================
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain a
tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   import "pw_tokenizer_proto/options.proto";

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

-----------------------
Tokenization in C++ / C
-----------------------
To tokenize a string, include ``pw_tokenizer/tokenize.h`` and invoke one of the
``PW_TOKENIZE_*`` macros.

Tokenize string literals outside of expressions
===============================================
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments:

* :c:macro:`PW_TOKENIZE_STRING`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN`
* :c:macro:`PW_TOKENIZE_STRING_MASK`

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
  :class: checkmark

  .. code-block:: cpp

     constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

     void Function() {
       constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
     }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
  :class: error

  .. code-block:: cpp

     void BadExample() {
       ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
     }

  Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

Tokenize inside expressions
===========================
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to ``constexpr`` variables or be used with
special function variables like ``__func__``.

* :c:macro:`PW_TOKENIZE_STRING_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`

.. admonition:: When to use these macros

  Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
  literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
  :class: checkmark

  .. code-block:: cpp

     void GoodExample() {
       ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
     }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
  :class: error

  .. code-block:: cpp

     constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr`` variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
  :class: error

  .. code-block:: cpp

     void BadExample() {
       // This compiles, but __func__ will not be the outer function's name, and
       // there may be compiler warnings.
       uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
     }

  Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or similar macros.

Tokenize a message with arguments to a buffer
=============================================
* :c:macro:`PW_TOKENIZE_TO_BUFFER`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_DOMAIN`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.

.. _module-pw_tokenizer-nested-arguments:

Tokenize nested arguments
=========================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. Tokens can therefore be used to replace
string arguments to tokenized format strings.

* :c:macro:`PW_TOKEN_FMT`

.. admonition:: Logging nested tokens

  Users will typically interact with nested token arguments during logging.
  In this case there is a slightly different interface described by
  :ref:`module-pw_log-tokenized-args` that does not generally invoke
  ``PW_TOKEN_FMT`` directly.

The format specifier for a token is given by the PRI-style macro
``PW_TOKEN_FMT()``, which is concatenated to the rest of the format string by
the C preprocessor.

.. code-block:: cpp

   PW_TOKENIZE_FORMAT_STRING("margarine_domain",
                             UINT32_MAX,
                             "I can't believe it's not " PW_TOKEN_FMT() "!",
                             PW_TOKENIZE_STRING_EXPR("butter"));

This feature is currently only supported by the Python detokenizer.

Nested token format
-------------------
Nested tokens have the following format within strings:

.. code-block::

   $[{DOMAIN}][BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

The optional ``DOMAIN`` specifies the token domain. If this option is omitted,
the default (empty) domain is assumed.

The optional ``BASE`` defines the numeric base encoding of the token. Accepted
values are 8, 10, 16, and 64. If the hash symbol ``#`` is used without
specifying a number, the base is assumed to be 16. If the base option is
omitted entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case insensitive. This may be expanded to support
other bases in the future.

Non-Base64 tokens are encoded strictly as 32-bit integers with padding.
Base64 data may additionally encode string arguments for the detokenized token,
and therefore does not have a maximum width.

The meaning of ``TOKEN`` depends on the current phase of transformation for the
current tokenized format string. Within the format string's entry in the token
database, when the actual value of the token argument is not known, ``TOKEN`` is
a printf argument specifier (e.g. ``%08x`` for a base-16 token with correct
padding). The actual tokens that will be used as arguments have separate
entries in the token database.

After the top-level format string has been detokenized and formatted, ``TOKEN``
should be the value of the token argument in the specified base, with any
necessary padding. This is the final format of a nested token if it cannot be
detokenized.

.. list-table:: Example tokens
   :widths: 10 25 25

   * - Base
     - | Token database
       | (within format string entry)
     - Partially detokenized
   * - 10
     - ``$10#%010d``
     - ``$10#0086025943``
   * - 16
     - ``$#%08x``
     - ``$#0000001A``
   * - 64
     - ``%s``
     - ``$QA19pfEQ``
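
A decoder that encounters a nested token first has to split it into its parts.
The sketch below parses a string that holds exactly one nested token, with
nothing after it. ``NestedToken`` and ``ParseNestedToken`` are hypothetical
names, the braces around ``DOMAIN`` are read as literal characters, and the
default ``$`` prefix is assumed. This is not the detokenizer's actual code.

.. code-block:: cpp

   #include <cstddef>
   #include <string_view>

   struct NestedToken {
     bool valid = false;
     std::string_view domain;  // Empty means the default domain.
     int base = 64;            // Base64 when no BASE# part is present.
     std::string_view token;   // Digits or Base64 data after the prefix.
   };

   // Hypothetical parser for $[{DOMAIN}][BASE#]TOKEN.
   constexpr NestedToken ParseNestedToken(std::string_view text) {
     NestedToken result;
     if (text.empty() || text.front() != '$') {
       return result;
     }
     text.remove_prefix(1);

     // Optional {DOMAIN}.
     if (!text.empty() && text.front() == '{') {
       const size_t close = text.find('}');
       if (close == std::string_view::npos) {
         return result;
       }
       result.domain = text.substr(1, close - 1);
       text.remove_prefix(close + 1);
     }

     // Optional BASE#. A bare '#' selects base 16.
     const size_t hash = text.find('#');
     if (hash != std::string_view::npos) {
       const std::string_view base = text.substr(0, hash);
       if (base.empty() || base == "16") {
         result.base = 16;
       } else if (base == "8") {
         result.base = 8;
       } else if (base == "10") {
         result.base = 10;
       } else if (base == "64") {
         result.base = 64;
       } else {
         return result;  // Unsupported base.
       }
       text.remove_prefix(hash + 1);
     }

     result.token = text;
     result.valid = !text.empty();
     return result;
   }

Run on the "Partially detokenized" column of the table above, this reports base
10 for ``$10#0086025943``, base 16 for ``$#0000001A``, and base 64 for
``$QA19pfEQ``.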

.. _module-pw_tokenizer-custom-macro:

Tokenizing enums
================
Logging enums is one common special case where tokenization is particularly
appropriate: enum values are conceptually already tokens mapping to their
names, assuming no duplicate values.

:c:macro:`PW_TOKENIZE_ENUM` takes a fully qualified enum name along with all
of the associated enum values. This macro creates database entries that
include the domain name (fully qualified enum name), enum value, and a tokenized
form of the enum value.

The macro also supports returning the string version of the enum value in the
case that there is a non-tokenizing backend, using
:cpp:func:`pw::tokenizer::EnumToString`.

All enum values in the enum declaration must be present in the macro, and the
macro must be in the same namespace as the enum to be able to use the
:cpp:func:`pw::tokenizer::EnumToString` function and avoid compiler errors.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum]
   :end-before: [pw_tokenizer-examples-enum]

:c:macro:`PW_TOKENIZE_ENUM_CUSTOM` is an alternative version of
:c:macro:`PW_TOKENIZE_ENUM` that tokenizes custom strings instead of a
stringified form of the enum value name. It takes a fully qualified enum
name along with all the associated enum values and a custom string for each
value. This macro creates database entries that include the domain name
(fully qualified enum name), enum value, and a tokenized form of the custom
string for the enum value.

.. literalinclude:: enum_test.cc
   :language: cpp
   :start-after: [pw_tokenizer-examples-enum-custom]
   :end-before: [pw_tokenizer-examples-enum-custom]

Tokenize a message with arguments in a custom macro
===================================================
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass tokenized
data to a global handler function. A project's custom tokenization macro can
handle tokenized data in a function of its choosing. The function may accept
any arguments, but its final arguments must be:

* The 32-bit token (:cpp:type:`pw_tokenizer_Token`)
* The argument types (:cpp:type:`pw_tokenizer_ArgTypes`)
* Variadic arguments, if any

``pw_tokenizer`` provides two low-level macros to help projects create custom
tokenization macros:

* :c:macro:`PW_TOKENIZE_FORMAT_STRING`
* :c:macro:`PW_TOKENIZER_REPLACE_FORMAT_STRING`

.. caution::

   Note the spelling difference! The first macro begins with ``PW_TOKENIZE_``
   (no ``R``) whereas the second begins with ``PW_TOKENIZER_``.

Use these macros to invoke an encoding function with the token, argument types,
and variadic arguments. The function can then encode the tokenized message to a
buffer using helpers in ``pw_tokenizer/encode_args.h``:

.. Note: pw_tokenizer_EncodeArgs is a C function so you would expect to
.. reference it as :c:func:`pw_tokenizer_EncodeArgs`. That doesn't work because
.. it's defined in a header file that mixes C and C++.

* :cpp:func:`pw::tokenizer::EncodeArgs`
* :cpp:class:`pw::tokenizer::EncodedMessage`
* :cpp:func:`pw_tokenizer_EncodeArgs`

Example
-------
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   // Give the handler C linkage when this header is compiled as C++.
   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)          \
     do {                                                                  \
       PW_TOKENIZE_FORMAT_STRING("logs", UINT32_MAX, format, __VA_ARGS__); \
       EncodeTokenizedMessage(                                             \
           metadata, PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__));     \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include <cstdarg>

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     // Collect the variadic arguments and encode them after the token.
     // kLogBufferSize is a project-defined buffer size.
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<kLogBufferSize> encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }
407*61c4878aSAndroid Build Coastguard Worker
408*61c4878aSAndroid Build Coastguard Worker.. admonition:: Why use a custom macro
409*61c4878aSAndroid Build Coastguard Worker
410*61c4878aSAndroid Build Coastguard Worker   - Optimal code size. Invoking a free function with the tokenized data results
411*61c4878aSAndroid Build Coastguard Worker     in the smallest possible call site.
412*61c4878aSAndroid Build Coastguard Worker   - Pass additional arguments, such as metadata, with the tokenized message.
413*61c4878aSAndroid Build Coastguard Worker   - Integrate ``pw_tokenizer`` with other systems.
414*61c4878aSAndroid Build Coastguard Worker
415*61c4878aSAndroid Build Coastguard WorkerTokenizing function names
416*61c4878aSAndroid Build Coastguard Worker=========================
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are declared
as ``static constexpr char[]`` in C++ instead of the standard ``static const
char[]``. This means that ``__func__`` and ``__PRETTY_FUNCTION__`` can be
tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Calculate minimum required buffer size
======================================
See :cpp:func:`pw::tokenizer::MinEncodingBufferSizeBytes`.

.. _module-pw_tokenizer-base64-format:

Encoding Base64
===============
The tokenizer encodes messages to a compact binary representation. Some
applications need a textual representation of tokenized strings instead. This
makes it easy to use tokenized messages alongside plain text messages, but comes
at a small efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as
much memory as binary messages.

The Base64 format consists of a ``$`` character followed by the
Base64-encoded contents of the tokenized message. For example, consider
tokenizing the string ``This is an example: %d!`` with the argument -1. The
string's token is 0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

    Plain text: This is an example: -1! [23 bytes]

        Binary: 66 6e 01 4b 01          [ 5 bytes]

        Base64: $Zm4BSwE=               [ 9 bytes]

To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example:

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }
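
The ``$``-prefixed format is simple enough to reproduce on a host machine, which
can be handy when checking device output by hand. The following Python sketch
uses only the standard library rather than the ``pw_tokenizer`` package, and the
helper names ``prefixed_base64`` and ``from_prefixed_base64`` are hypothetical.
It rebuilds the ``$Zm4BSwE=`` example from its five binary bytes. Base64 emits
four characters for every three input bytes (rounded up), which is where the
roughly 4/3 size ratio comes from.

.. code-block:: python

   import base64

   PREFIX = "$"


   def prefixed_base64(binary_message: bytes) -> str:
       """Returns the $-prefixed Base64 form of an encoded message."""
       return PREFIX + base64.b64encode(binary_message).decode("ascii")


   def from_prefixed_base64(text: str) -> bytes:
       """Recovers the binary message from its $-prefixed Base64 form."""
       if not text.startswith(PREFIX):
           raise ValueError("not a prefixed Base64 message")
       return base64.b64decode(text[len(PREFIX):])


   # Binary form of the example: token 0x4b016e66 (little-endian) followed
   # by the single byte that encodes the argument -1.
   binary = bytes.fromhex("666e014b01")
   text = prefixed_base64(binary)

   print(text)                                  # $Zm4BSwE=
   print(len(binary), len(text))                # 5 9
   print(from_prefixed_base64(text) == binary)  # True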

.. _module-pw_tokenizer-masks:

Reduce token size with masking
==============================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of the
token directly affects memory usage. In those cases, every bit counts, and it
may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token is the token.
This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the macros:

* :c:macro:`PW_TOKENIZE_STRING_MASK`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
:ref:`module-pw_tokenizer-collisions`.
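
To get a feel for how quickly that risk grows as tokens shrink, the birthday
approximation gives a rough estimate. The Python sketch below assumes tokens are
uniformly distributed, which a real hash only approximates, and
``collision_probability`` is a hypothetical helper rather than part of
``pw_tokenizer``.

.. code-block:: python

   import math


   def collision_probability(num_strings: int, token_bits: int) -> float:
       """Approximate chance that at least two strings share a token.

       Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2^bits)),
       assuming uniformly distributed tokens.
       """
       pairs = num_strings * (num_strings - 1) / 2
       return 1.0 - math.exp(-pairs / 2**token_bits)


   # Compare full-width tokens with two masked widths for the same
   # (illustrative) number of strings.
   for bits in (32, 16, 10):
       probability = collision_probability(100, bits)
       print(f"{bits:2d}-bit tokens, 100 strings: {probability:.4%}")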

Masked tokens without arguments may be encoded in fewer bytes. For example, the
16-bit token ``0x1234`` may be encoded as two little-endian bytes (``34 12``)
rather than four (``34 12 00 00``). The detokenizer tools zero-pad data smaller
than four bytes. Tokens with arguments must always be encoded as four bytes.
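
The short encoding and the zero-padding on the decoding side can be sketched in
Python as follows. This only illustrates the byte layout described above using
the standard library; ``encode_masked_token`` and ``decode_token`` are
hypothetical names, not ``pw_tokenizer`` APIs.

.. code-block:: python

   def encode_masked_token(token: int, token_bits: int) -> bytes:
       """Encodes an argument-free masked token in as few bytes as it needs."""
       if token >> token_bits:
           raise ValueError("token does not fit in the masked width")
       num_bytes = (token_bits + 7) // 8
       return token.to_bytes(num_bytes, "little")


   def decode_token(data: bytes) -> int:
       """Zero-pads short data to four bytes and reads the 32-bit token."""
       padded = data[:4].ljust(4, b"\x00")
       return int.from_bytes(padded, "little")


   short = encode_masked_token(0x1234, 16)
   print(short.hex(" "))            # 34 12
   print(hex(decode_token(short)))  # 0x1234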

.. _module-pw_tokenizer-domains:

Keep tokens from different sources separate with domains
========================================================
``pw_tokenizer`` supports multiple tokenization domains. A domain is a string
label associated with each tokenized string. This allows projects to keep
tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

When a domain is specified, any whitespace in its name is ignored and removed
from the database.

If no domain is specified, the domain is empty (``""``). For many projects, this
default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to loading tokens
from all domains. The domain may be specified for ELF files by appending
``#DOMAIN_NAME_REGEX`` to the file path. Use ``#`` to only read from the default
domain. For example, the following reads strings in ``some_domain`` from
``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv "path/to/my_image.elf#some_domain"

See :ref:`module-pw_tokenizer-managing-token-databases` for information about
the ``database.py`` command line tool.

Limitations, bugs, and future work
==================================

.. _module-pw_tokenizer-gcc-template-bug:

GCC bug: tokenization in template functions
-------------------------------------------
GCC releases prior to 14 incorrectly ignore the section attribute for template
`functions <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. The bug causes tokenized
strings in template functions to be emitted into ``.rodata`` instead of the
tokenized string section, so they cannot be extracted for detokenization.

Fortunately, this is simple to work around in the linker script.
``pw_tokenizer_linker_sections.ld`` includes a statement that pulls tokenized
string entries from ``.rodata`` into the tokenized string section. See
`b/321306079 <https://issues.pigweed.dev/issues/321306079>`_ for details.

If tokenization is working, but strings in templates are not appearing in token
databases, check the following:

- The full contents of the latest version of ``pw_tokenizer_linker_sections.ld``
  are included with the linker script. The linker script was updated in
  `pwrev.dev/188424 <http://pwrev.dev/188424>`_.
- The ``-fdata-sections`` compilation option is in use. This places each
  variable in its own section, which is necessary for pulling tokenized string
  entries from ``.rodata`` into the proper section.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That
is because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

----------------------
Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the :func:`pw_tokenizer.encode.encode_token_and_args`
function. This function requires that a string's token has already been
calculated. Typically these tokens are provided by a database, but they can be
manually created using the tokenizer hash.

:func:`pw_tokenizer.tokens.pw_tokenizer_65599_hash` is particularly useful
for offline token database generation in cases where tokenized strings in a
binary cannot be embedded as parsable ``pw_tokenizer`` entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both C and C++, and merge the databases.
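
To show what the hash length limit means in practice, here is a minimal
standalone Python sketch of a 65599-style hash with an optional length limit. It
assumes the hash is seeded with the string length and then adds each byte
multiplied by successive powers of 65599, modulo 2^32. Treat that as an
assumption about the algorithm rather than a copy of the library code. The name
``hash_65599`` and the limit of 128 are illustrative only; for real databases,
use :func:`pw_tokenizer.tokens.pw_tokenizer_65599_hash`.

.. code-block:: python

   from typing import Optional

   HASH_CONSTANT = 65599


   def hash_65599(string: str, hash_length: Optional[int] = None) -> int:
       """Illustrative 65599-style hash over the first hash_length bytes."""
       data = string.encode()
       hash_value = len(data)  # The full length is mixed in, even if truncated.
       coefficient = HASH_CONSTANT
       for byte in data[:hash_length]:
           hash_value = (hash_value + coefficient * byte) % 2**32
           coefficient = (coefficient * HASH_CONSTANT) % 2**32
       return hash_value


   message = "x" * 200 + " tail that a 128-character limit never sees"
   unlimited = hash_65599(message)     # C++-style: every character counts.
   limited = hash_65599(message, 128)  # C-style: hypothetical limit of 128.

   # Long strings hash differently under the two rules, so an offline
   # database needs both tokens to cover C and C++ call sites.
   print(f"{unlimited:08x} {limited:08x}")

   # Strings shorter than the limit produce the same token either way.
   print(hash_65599("short") == hash_65599("short", 128))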

.. _module-pw_tokenizer-cli-encoding:

-----------------
Encoding CLI tool
-----------------
The ``pw_tokenizer.encode`` command line tool can be used to encode
format strings and optional arguments.

.. code-block:: bash

   python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...]

Example:

.. code-block:: text

   $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them
         Raw input: "There's... %d many of %s!" % (2, 'them')
   Formatted input: There's... 2 many of them!
             Token: 0xb6ef8b2d
           Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes]
   Prefixed Base64: $LYvvtgQEdGhlbQ==

See ``--help`` for full usage details.
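
The encoded bytes in the example can be reproduced by hand. The Python sketch
below is a simplified, standard-library-only illustration, not the tool's
implementation. It takes the token ``0xb6ef8b2d`` from the output above as
given, and it assumes signed integers are encoded as zigzag varints and strings
as a one-byte length followed by their bytes, which matches the bytes shown. The
function names are hypothetical, and other argument types and long strings are
left out.

.. code-block:: python

   import base64
   import struct


   def zigzag_varint(value: int) -> bytes:
       """Zigzag-encodes a signed integer, then emits it as a varint."""
       zigzag = (value << 1) ^ (value >> 63)  # Assumes a 64-bit range.
       out = bytearray()
       while True:
           byte = zigzag & 0x7F
           zigzag >>= 7
           if zigzag:
               out.append(byte | 0x80)  # More bytes follow.
           else:
               out.append(byte)
               return bytes(out)


   def encode_string(text: str) -> bytes:
       """Length-prefixes a short string (up to 127 bytes in this sketch)."""
       data = text.encode()
       if len(data) > 127:
           raise ValueError("long strings are outside this sketch")
       return bytes([len(data)]) + data


   token = 0xB6EF8B2D  # Taken from the example output above.
   encoded = struct.pack("<I", token) + zigzag_varint(2) + encode_string("them")

   print(encoded.hex(" "))  # 2d 8b ef b6 04 04 74 68 65 6d
   print("$" + base64.b64encode(encoded).decode("ascii"))  # $LYvvtgQEdGhlbQ==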

--------
Appendix
--------

Case study
==========
.. note:: This section discusses the implementation, results, and lessons
   learned from a real-world deployment of ``pw_tokenizer``.

The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
* Log contents shrank by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function were
  replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function the log messages were encoded in the ``$``-prefixed
  :ref:`module-pw_tokenizer-base64-format`, then dispatched as normal log
  messages.
* Asserts were tokenized using a callback-based API that has been removed (a
  :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a better
  alternative).

.. attention::
  Do not embed line numbers in tokenized strings. Doing so adds a huge number
  of entries to the database, since new strings are tokenized every time code
  moves. If :ref:`module-pw_log_tokenized` is used, line numbers are encoded in
  the log metadata. Line numbers may also be included by adding ``"%d"`` to the
  format string and passing ``__LINE__`` as an argument.

.. _module-pw_tokenizer-database-management:

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their code
  changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV database.
   Use the :ref:`module-pw_tokenizer-directory-database-format` instead.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in APKs
     or binaries.