# String Support for Emboss

GitHub Issue [#28](https://github.com/google/emboss/issues/28)

## Background

It is somewhat common to embed short strings into binary structures; examples
include serial numbers and firmware revisions, although in some cases even
things like IP addresses are encoded as ASCII text embedded in a larger binary
message.

Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
is, arrays of 8-bit uints.  This is more-or-less functional, but can be awkward
for things like text format output, and provides no way to add assertions to
string fields.

String support is complicated by the fact that there are several common ways of
delimiting strings:

1.  Length determined by another field -- that is, the size of the string is
    explicit.
2.  The string is *terminated* by a specific byte value, usually `'\0'`.  In
    this case, there may be additional "garbage" bytes after the terminator,
    which should not be considered to be part of the string.
3.  The string is *padded* by a specific byte value, usually 32 (`' '`).  In
    this case, the "padding" character can usually occur inside the string,
    and only trailing padding characters should be trimmed off.

For both terminated and padded strings, some formats allow the string to run to
the very end of its field, with no terminator/padding, and some require the
terminator/padding.  In general, it seems that terminated strings are more
likely to require the terminator, while padded strings can usually be entered
with no padding.

There are, no doubt, other ways of delimiting strings.  These seem to be rare
and sui generis, and can often be handled by modeling them as length-determined
strings, then applying the necessary logic in code.

There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
("Latin-1"), UTF-8, UTF-16, etc.  UTF-16 seems to be rare outside of
Windows-based software and Java.  Hardware almost always appears to use ASCII
(encoded as one character per byte, with the high bit always clear), although
Java ME-based systems may use UTF-16.
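To make the difference between the terminated and padded conventions concrete,
here is a minimal sketch of how the logical value could be recovered from a raw
field in each case.  The function names `ReadTerminated` and `ReadPadded` are
hypothetical and are not part of Emboss; the sketch assumes C++17 for
`std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Terminated: the value is every byte before the first terminator.  Bytes
// after the terminator are "garbage" and are ignored.  If no terminator is
// present, the value runs to the end of the field.
std::string_view ReadTerminated(std::string_view field,
                                char terminator = '\0') {
  const std::size_t end = field.find(terminator);
  const std::string_view value =
      end == std::string_view::npos ? field : field.substr(0, end);
  // The logical value never contains the terminator byte itself.
  assert(value.find(terminator) == std::string_view::npos);
  return value;
}

// Padded: only *trailing* padding bytes are trimmed, so the padding byte may
// still appear inside the value.  A field of nothing but padding is empty.
std::string_view ReadPadded(std::string_view field, char padding = ' ') {
  const std::size_t last = field.find_last_not_of(padding);
  return last == std::string_view::npos ? std::string_view()
                                        : field.substr(0, last + 1);
}
```

Note that the same bytes can yield different values under the two conventions:
a field holding `a b` followed by spaces reads as `a b` when padded with
spaces, but would read as `a` if space were treated as a terminator.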


## Proposal

### Bytestrings Only

All strings in Emboss should be considered to be opaque blobs of bytes;
interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.

UTF-16 strings are explicitly not handled by this proposal.  In principle, one
could add a "byte width" parameter to the string types, or use a prefix like `W`
to indicate "wide string" types, but it does not seem important for now.  This
decision can be revisited later.


### New Built-In Types

Add three new types to the Prelude (names subject to change):

1.  `FixString`, a string whose contents should be the entire field containing
    the `FixString`.  When writing to a `FixString`, the value must be exactly
    the same length as the field.

    `CouldWriteValue()` should return `true` for all strings that are exactly
    the correct length.

    `FixString` is very close to a notional `Blob` type or the current
    `UInt:8[]` type, except for differences in text format.

2.  `ZString`, a terminated string.  A `ZString` with no arguments uses a null
    byte (`'\0'`) as the terminator.  An optional argument can be used to
    specify the terminator -- a `ZString(36)`, for example, would be terminated
    by `$`.  When reading, the value returned is all bytes up to, but not
    including, the first terminator byte.  When writing, for compatibility, the
    entire field should be written, using the terminator value for padding if
    there is extra space.  A second optional parameter can be used to specify
    that the terminator is not required: `ZString(0, false)` can fill the
    underlying field with no terminator.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *contain* any instances of the terminator
    byte.

3.  `PaddedString`, a padded string.  A `PaddedString` with no arguments uses
    space (`' '`, 32) as the padding value.  An optional argument can be used to
    specify the padding -- a `PaddedString(0)`, for example, would be padded
    with null bytes.  When reading, the end of the string is discovered by
    walking *backwards* from the end until a non-padding byte is found, then
    returning all bytes from the start of the field through that byte.  When
    writing, any excess bytes will be filled with the padding value.

    Although, technically, "at least one byte of padding" could be enforced by
    making the `PaddedString` one byte shorter and following it with a one-byte
    field whose value *must* be the padding byte, for convenience `PaddedString`
    should take a second optional parameter to specify that padding *is*
    required: `PaddedString(32, true)` must have at least one space at the end.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *end with* the padding byte.
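The `CouldWriteValue()` and write rules above could be sketched roughly as
follows.  All names here (`CouldWriteTerminated`, `CouldWritePadded`,
`WriteWithFill`) are hypothetical stand-ins, not Emboss APIs.  Reserving one
byte when the terminator or padding is required is an assumption drawn from the
description of `MaxSize()`; the `CouldWriteValue()` text itself does not spell
it out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// ZString rule: the value must fit and must not contain the terminator.
// Assumption: a required terminator reserves one byte of the field.
bool CouldWriteTerminated(std::string_view value, std::size_t field_size,
                          char terminator = '\0',
                          bool terminator_required = true) {
  const std::size_t reserved = terminator_required ? 1 : 0;
  if (field_size < reserved) return false;
  return value.size() <= field_size - reserved &&
         value.find(terminator) == std::string_view::npos;
}

// PaddedString rule: the value must fit and must not end with the padding
// byte (it could not be read back).  Padding inside the value is allowed.
bool CouldWritePadded(std::string_view value, std::size_t field_size,
                      char padding = ' ', bool padding_required = false) {
  const std::size_t reserved = padding_required ? 1 : 0;
  if (field_size < reserved) return false;
  return value.size() <= field_size - reserved &&
         (value.empty() || value.back() != padding);
}

// Both types write the whole field: the value first, then the terminator or
// padding byte repeated to fill any remaining space.
void WriteWithFill(std::string_view value, char fill, std::string& field) {
  assert(value.size() <= field.size());
  std::copy(value.begin(), value.end(), field.begin());
  std::fill(field.begin() + static_cast<std::ptrdiff_t>(value.size()),
            field.end(), fill);
}
```

A caller would check the relevant predicate first and only then call
`WriteWithFill`, mirroring the usual `CouldWriteValue()` / `Write()` pairing.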


### String Constants

String constants (used in constructs such as `[requires: this == "abcd"]`) may
take two forms:

1.  `"A quoted string using C-style escapes like \n"`

    The following standard C89 escapes (as interpreted by an ASCII Unix
    compiler) should be allowed:

    *   `\0` => 0
    *   `\a` => 7
    *   `\b` => 8
    *   `\t` => 9
    *   `\n` => 10
    *   `\v` => 11
    *   `\f` => 12
    *   `\r` => 13
    *   `\"` => 34
    *   `\'` => 39
    *   `\?` => 63 (part of the C standard, but rarely used)
    *   `\\` => 92
    *   <code>\x*hh*</code> => 0x*hh*

    The following non-C-standard escapes should also be allowed:

    *   `\e` => 27 (not actually standard, but common)
    *   <code>\d*nnn*</code> => *nnn*
    *   <code>\x{*hh*}</code> => 0x*hh*
    *   <code>\d{*nnn*}</code> => *nnn*

    Note that the standard C escape <code>\\*nnn*</code> is explicitly not
    supported.  C treats *nnn* as octal, which is often surprising, and modern
    languages (the cutoff date appears to be about 1993 -- right between Python
    2 and Java) have largely dropped support for the octal escapes.

    Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
    (nearly) universal among popular programming languages.  <code>\x*hh*</code>
    is very common, though not universal.  <code>\u*nnnn*</code>, where *nnnn*
    is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
    common, but only for text strings.

    To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
    required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
    should be required to have exactly 3 decimal digits.  The braced versions --
    <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
    digits, but should be required to evaluate to a value in the range 0 to 255:
    that is, `\d{000000100}` should be allowed, but `\d{256}` should not.

    `\` characters should not be allowed outside of the escape sequences
    specified here.

    For now, only 7-bit ASCII printable characters (byte values 32 through 126)
    should be allowed in `"quoted strings"`, even though `.emb` files generally
    allow UTF-8.  This requirement may be relaxed in the future.

2.  A list of bytes in `{}`, where each byte is either a single-quoted character
    (`'a'`) or a numeric constant (e.g., `0x20` or `32`).

    For ease of transition from existing `UInt:8[]` fields, explicit index
    markers (`[8]:`) in the list should be allowed if the index exactly matches
    the current cursor index; this matches output from the current Emboss text
    format for `UInt:8[]`.

The existing parameter system will need to be extended to allow default values,
and to allow `external` types to accept parameters if they do not already.
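As an illustration of the escape rules for quoted string constants, the
following sketch decodes the text between the quotes into raw bytes.  The
function names are hypothetical, and two details are assumptions rather than
statements from this proposal: an un-braced `\dnnn` whose value exceeds 255 is
rejected, and an unescaped `"` inside the body is rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Parses digits in base 10 or 16.  Fails on an empty or malformed digit
// string, or on any value above 255.  Leading zeros are accepted.
std::optional<unsigned char> ParseByte(std::string_view digits, unsigned base) {
  assert(base == 10 || base == 16);
  if (digits.empty()) return std::nullopt;
  unsigned value = 0;
  for (const char c : digits) {
    unsigned digit;
    if (c >= '0' && c <= '9') {
      digit = static_cast<unsigned>(c - '0');
    } else if (base == 16 && c >= 'a' && c <= 'f') {
      digit = static_cast<unsigned>(c - 'a') + 10;
    } else if (base == 16 && c >= 'A' && c <= 'F') {
      digit = static_cast<unsigned>(c - 'A') + 10;
    } else {
      return std::nullopt;
    }
    value = value * base + digit;
    if (value > 255) return std::nullopt;
  }
  return static_cast<unsigned char>(value);
}

// Decodes the body of a "quoted string" (without the surrounding quotes).
// Returns std::nullopt if the body violates any of the rules.
std::optional<std::string> DecodeQuotedBody(std::string_view body) {
  static constexpr std::string_view kSimple = "0abtnvfre\"'?\\";
  static constexpr char kSimpleValues[] = {0,  7,  8,  9,  10, 11, 12,
                                           13, 27, 34, 39, 63, 92};
  std::string out;
  std::size_t i = 0;
  while (i < body.size()) {
    const char c = body[i++];
    // Only printable 7-bit ASCII may appear literally.
    if (c < 32 || c > 126 || c == '"') return std::nullopt;
    if (c != '\\') {
      out.push_back(c);
      continue;
    }
    if (i == body.size()) return std::nullopt;  // dangling backslash
    const char kind = body[i++];
    const std::size_t simple = kSimple.find(kind);
    if (simple != std::string_view::npos) {
      out.push_back(kSimpleValues[simple]);
      continue;
    }
    // Anything else, including C-style octal escapes, is an error.
    if (kind != 'x' && kind != 'd') return std::nullopt;
    std::string_view digits;
    if (i < body.size() && body[i] == '{') {
      const std::size_t close = body.find('}', i);
      if (close == std::string_view::npos) return std::nullopt;
      digits = body.substr(i + 1, close - i - 1);  // any number of digits
      i = close + 1;
    } else {
      const std::size_t count = kind == 'x' ? 2 : 3;  // exact digit counts
      if (body.size() - i < count) return std::nullopt;
      digits = body.substr(i, count);
      i += count;
    }
    const auto byte = ParseByte(digits, kind == 'x' ? 16 : 10);
    if (!byte) return std::nullopt;
    out.push_back(static_cast<char>(*byte));
  }
  return out;
}
```

Because octal escapes are not supported, `\0` followed by a digit is
unambiguous here: it decodes as a null byte followed by that literal digit.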


### String Field Methods (C++)

#### C++ String Type Parameterization

All methods that accept or return a string value should be templated on the C++
type to use (`std::string`, `std::string_view`, `char *`, etc.).

For methods that accept a string parameter (`Write`, etc.), the template
argument should be inferred, and they can be called without specifying the type.

For methods that only return a string value (`Read`, etc.), the template
argument would need to be specified: `Read<std::string_view>()`.

`char *` should not be accepted as a return type, due to problems with ensuring
that there is actually a null byte at the end of the string.

As an input type, `char *` is likely to need explicit specialization.

In many (most? all?) cases, methods should have no problem with some types that
are not really "string" types, such as `std::vector<char>`.

String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
`std::basic_string<unsigned char>`) should be explicitly supported.

If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
that it might be easy to hit undefined behavior with something like
`Read<std::string_view>()`, since the iterator type returned by `begin()` and
`end()` would not correctly model `std::contiguous_iterator`.  The cautious
approach would be to disable `Read()` and `UncheckedRead()` if the backing
storage is not `ContiguousBuffer`; readout to something like `std::string` could
still be explicitly performed using the `begin()`/`end()` iterators.
Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
explicitly limited to a small set of known-good types, such as `std::string` and
`std::vector<char>`.
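A rough sketch of what the type parameterization might look like for a
fixed-length string over contiguous storage is shown below.  The class
`FixStringView` is hypothetical and heavily simplified; real generated views
would be built on Emboss's own storage types, and `char *` input is left out
here because, as noted above, it would likely need its own specialization.  The
sketch assumes C++17 (`if constexpr`, `std::string_view`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical view over a contiguous, fixed-length string field.
class FixStringView {
 public:
  FixStringView(char* data, std::size_t size) : data_(data), size_(size) {
    assert(data != nullptr || size == 0);
  }

  // The return type cannot be deduced, so callers name it explicitly:
  // view.Read<std::string_view>().
  template <typename T>
  T Read() const {
    static_assert(!std::is_pointer<T>::value,
                  "pointer results are rejected: the field may not end in a "
                  "null byte");
    if constexpr (std::is_constructible<T, const char*, std::size_t>::value) {
      return T(data_, size_);          // e.g. std::string, std::string_view
    } else {
      return T(data_, data_ + size_);  // e.g. std::vector<unsigned char>
    }
  }

  // The argument type is deduced, so callers just write view.Write(value).
  // A FixString only accepts values of exactly the field's length.
  template <typename T>
  bool Write(const T& value) {
    if (static_cast<std::size_t>(value.size()) != size_) return false;
    std::copy(value.begin(), value.end(), data_);
    return true;
  }

 private:
  char* data_;
  std::size_t size_;
};
```

Returning `std::string_view` is only safe here because the storage is
contiguous; for other backing storage the iterator-range branch, restricted to
owning types, would be the cautious choice.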


#### Methods

`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
as one would expect.

`ToString()` should be an alias for `Read()`, to ease conversion from
`UInt:8[]`.

`CouldWriteValue()` should be defined as specified in the previous section.

`Ok()` should return `true` if the string has storage (though it could be
zero-length storage) and the bytes match the requirements (e.g., if a terminator
or padding byte is required, `Ok()` should only return `true` if such a byte is
present).

`Size()` should return the (logical) length of the string in bytes.

`MaxSize()` should return `BackingStorage().SizeInBytes()` or
`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
terminator byte.

`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
C++ container type.

`operator[]` should return the value of a single byte at the specified offset.
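For a terminated string, `Ok()`, `Size()`, and `MaxSize()` might behave as in
the sketch below.  `ZStringView` is a hypothetical, read-only stand-in; it does
not model the "has storage" part of `Ok()`, and treating `operator[]` as an
offset into the whole field (rather than the logical value) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical read-only view over a terminated string field.
struct ZStringView {
  std::string_view field;  // the entire underlying field
  char terminator = '\0';
  bool terminator_required = true;

  // True if the bytes satisfy the type's requirements.
  bool Ok() const {
    return !terminator_required ||
           field.find(terminator) != std::string_view::npos;
  }

  // Logical length: bytes before the first terminator.
  std::size_t Size() const {
    const std::size_t end = field.find(terminator);
    return end == std::string_view::npos ? field.size() : end;
  }

  // Longest writable value; one byte is reserved for a required terminator.
  std::size_t MaxSize() const {
    if (!terminator_required) return field.size();
    return field.empty() ? 0 : field.size() - 1;
  }

  char operator[](std::size_t index) const {
    assert(index < field.size());
    return field[index];
  }
};
```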


#### `emboss::String` Type

(This section should not be considered particularly authoritative; the actual
implementation could differ greatly if another strategy turns out to be
easier or less complex in practice.)

Because values retrieved from the different string types can be used
interchangeably at the expression layer (e.g., `let s = condition ? z_string :
fix_string`), there must be a way for all views over strings to return a common
type.  This is complicated by two requirements:

1.  `emboss::String` should not allocate memory.
2.  `emboss::String` needs to handle backing storage that is not
    `ContiguousBuffer`.  It also needs to handle constant strings (`let x =
    "string"`), and be able to assign `Storage`-based strings to constant
    strings and vice versa.

To satisfy the first requirement, `emboss::String` will need to hold a reference
to the underlying storage, not actually copy bytes.

One way to satisfy the second requirement would be to simply copy the string's
bytes out to a new buffer, but that conflicts with the first requirement.
Instead, it should be a sum type over a `Storage` type parameter and a constant
string, like:

```c++
template <typename Storage>
class String {
 public:
  String();
  String(const char *data, int size);
  String(Storage);
  // ... operator= ...
  constexpr int size() const;
  constexpr char operator[](int index) const {
    // Alternative 0 is a constant string; alternative 1 is backing storage.
    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
                                 : backports::Get<1>(storage_).data()[index];
  }
  // ... begin(), end(), etc. ...

 private:
  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
  // requires C++17.
  backports::Variant<const char *, Storage> storage_;
};
```

At least for now, `emboss::String` does not need to be exposed as a documented,
supported API -- user code can use `Read<std::string_view>()` and similar
operations as needed, with full knowledge of the underlying storage type.

Comparisons and assignments between `emboss::String`s with different `Storage`
type parameters do not need to be supported, since they cannot be generated by
the code generator -- C++ codegen would only need those operations for
`emboss::String`s that are derived from the same parent structure.


### Handling in Other Languages

C++ is unusual in that it does not differentiate at a language level between
text strings and byte strings.  Most other languages have different types for
byte strings and text strings.

For all languages that differentiate, Emboss strings should be treated as byte
strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).

Other than this caveat, Emboss string support should be straightforward in other
languages.


### Text Format

Text format output should use the `"quoted string"` style.  Byte values outside
the range 32 through 126 should be emitted as escapes.  Values with standard
shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
should be emitted.  It may be desirable to allow some `[text_format]` control
over the output in the future.

Text format input should allow both `"quoted string"` and list-of-bytes styles,
with exactly the same rules as string constants in an `.emb` file, except that
byte values above 126 might be allowed in a `"quoted string"`.
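The output rules could be sketched as a small encoder like the one below.  The
function name `ToQuotedString` is hypothetical.  Several choices are
assumptions, not part of this proposal: `"` and `\` are escaped even though
they fall in the printable range (otherwise the output could not be read
back), lowercase hex digits are used, and byte 27 is emitted as `\x1b` rather
than the non-standard `\e`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <string_view>

// Renders raw bytes as a "quoted string" for text format output.
std::string ToQuotedString(std::string_view bytes) {
  static constexpr char kHex[] = "0123456789abcdef";
  std::string out = "\"";
  for (const char c : bytes) {
    const unsigned char byte = static_cast<unsigned char>(c);
    switch (byte) {
      // Standard shorthand escapes are preferred where they exist.
      case 0: out += "\\0"; break;
      case 7: out += "\\a"; break;
      case 8: out += "\\b"; break;
      case 9: out += "\\t"; break;
      case 10: out += "\\n"; break;
      case 11: out += "\\v"; break;
      case 12: out += "\\f"; break;
      case 13: out += "\\r"; break;
      case '"': out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      default:
        if (byte >= 32 && byte <= 126) {
          out.push_back(c);  // printable ASCII is emitted as-is
        } else {
          // Everything else becomes a hex escape with exactly two digits.
          out += "\\x";
          out.push_back(kHex[byte >> 4]);
          out.push_back(kHex[byte & 0x0f]);
        }
    }
  }
  out.push_back('"');
  // Postcondition: the rendered text is printable 7-bit ASCII only.
  assert(std::all_of(out.begin(), out.end(),
                     [](char ch) { return ch >= 32 && ch <= 126; }));
  return out;
}
```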


### Expressions

#### Type System Changes

In order to facilitate `[requires]` on string types, the new types should have a
new 'string' expression type.


#### Runtime Representation

In this proposal, no string manipulations are allowed, so temporary strings
(which might require memory allocation) will not be necessary.


#### String Attribute Representation

Attribute values are currently represented by a special `AttributeValue` type
which can hold either an `Expression` or a `String`.  With a string expression
type, `AttributeValue` can be replaced by a plain `Expression`.  This will
require changes to everything that touches `AttributeValue`.

Alternately, `AttributeValue` could be left in the IR with only `Expression`,
in which case only code that touches string attributes (`[byte_order]` and
`[(cpp) namespace]`) needs to change.


#### String Comparisons

Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
since these can be handled by passing references to existing memory.

Equality and inequality (`==` and `!=`) should be defined in the expected way:
two strings are equal iff they are the same length and the corresponding bytes
in each string have the same value, and they are unequal if they are not equal.

For ordering, strings should be compared lexicographically, using the binary
value of each byte, with no regard for semantic collation.  That is,
`"Z" < "a"`, since `'Z'` is 90 and `'a'` is 97.

When one string is a strict prefix of another string, the shorter string should
be "less than" the longer; e.g., `"abc" < "abcdef"`.  This is the same as the
natural ordering for zero-terminated strings.
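A minimal sketch of these comparison rules follows.  The names `CompareBytes`,
`Equal`, `Less`, and `CheckTextExamples` are hypothetical.  The comparison goes
through `std::memcmp`, which compares bytes as `unsigned char`, so byte values
128 through 255 sort after all ASCII values regardless of whether `char` is
signed on the target.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Three-way comparison: negative if a < b, zero if equal, positive if a > b.
// No allocation is needed; both arguments refer to existing memory.
int CompareBytes(std::string_view a, std::string_view b) {
  const std::size_t common = std::min(a.size(), b.size());
  const int order = common == 0 ? 0 : std::memcmp(a.data(), b.data(), common);
  if (order != 0) return order;
  // All common bytes match, so a strict prefix sorts before the longer string.
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}

bool Equal(std::string_view a, std::string_view b) {
  return CompareBytes(a, b) == 0;
}

bool Less(std::string_view a, std::string_view b) {
  return CompareBytes(a, b) < 0;
}

// The examples from the text, written as checks.
inline void CheckTextExamples() {
  assert(Less("Z", "a"));         // 'Z' is 90, 'a' is 97
  assert(Less("abc", "abcdef"));  // strict prefix sorts first
  assert(!Equal("abc", "abcd"));  // different lengths are never equal
}
```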


#### Future String Operations

It may be desirable, at some future point, to allow various string
manipulations, such as concatenation or repetition, at least for compile-time
strings.

A substring operation should be possible without requiring memory allocation.

Indexing into a string (`str[offset]`) should be allowed if/when indexing into
an array is finally supported.


### Arrays of Strings

In some cases, it may be desirable to have an array of strings, like:

```
struct Foo:
  0 [+100]  ZString[10]  list
```

Although somewhat awkward, the existing explicit-length syntax should work:

```
struct Foo:
  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
```

391