# String Support for Emboss

GitHub Issue [#28](https://github.com/google/emboss/issues/28)

## Background

It is somewhat common to embed short strings into binary structures; examples
include serial numbers and firmware revisions, although in some cases even
things like IP addresses are encoded as ASCII text embedded in a larger binary
message.

Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
is, arrays of 8-bit uints. This is more-or-less functional, but can be awkward
for things like text format output, and provides no way to add assertions to
string fields.

String support is complicated by the fact that there are several common ways of
delimiting strings:

1. Length determined by another field -- that is, the size of the string is
   explicit.
2. The string is *terminated* by a specific byte value, usually `'\0'`. In
   this case, there may be additional "garbage" bytes after the terminator,
   which should not be considered to be part of the string.
3. The string is *padded* by a specific byte value, usually 32 (`' '`). In
   this case, the "padding" character can usually occur inside the string,
   and only trailing padding characters should be trimmed off.

For both terminated and padded strings, some formats allow the string to run to
the very end of its field, with no terminator/padding, and some require the
terminator/padding. In general, it seems that terminated strings are more
likely to require the terminator, while padded strings can usually be entered
with no padding.

There are, no doubt, other ways of delimiting strings. These seem to be rare
and sui generis, and can often be handled by modeling them as length-determined
strings, then applying the necessary logic in code.

There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
("Latin-1"), UTF-8, UTF-16, etc. UTF-16 seems to be rare outside of
Windows-based software and Java.
Hardware almost always appears to use ASCII
(encoded as one character per byte, with the high bit always clear), although
Java ME-based systems may use UTF-16.


## Proposal

### Bytestrings Only

All strings in Emboss should be considered to be opaque blobs of bytes;
interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.

UTF-16 strings are explicitly not handled by this proposal. In principle, one
could add a "byte width" parameter to the string types, or use a prefix like `W`
to indicate "wide string" types, but it does not seem important for now. This
decision can be revisited later.


### New Built-In Types

Add three new types to the Prelude (names subject to change):

1. `FixString`, a string whose contents should be the entire field containing
   the `FixString`. When writing to a `FixString`, the value must be exactly
   the same length as the field.

   `CouldWriteValue()` should return `true` for all strings that are exactly
   the correct length.

   `FixString` is very close to a notional `Blob` type or the current
   `UInt:8[]` type, except for differences in text format.

2. `ZString`, a terminated string. A `ZString` with no arguments uses a null
   byte (`'\0'`) as the terminator. An optional argument can be used to
   specify the terminator -- a `ZString(36)`, for example, would be terminated
   by `$`. When reading, the value returned is all bytes up to, but not
   including, the first terminator byte. When writing, for compatibility, the
   entire field should be written, using the terminator value for padding if
   there is extra space. A second optional parameter can be used to specify
   that the terminator is not required: `ZString(0, false)` can fill the
   underlying field with no terminator.

   `CouldWriteValue()` should return `true` if the value is no longer than the
   field and the value does not *contain* any instances of the terminator
   byte.

3.
   `PaddedString`, a padded string. A `PaddedString` with no arguments uses
   space (`' '`, 32) as the padding value. An optional argument can be used to
   specify the padding -- a `PaddedString(0)`, for example, would be padded
   with null bytes. When reading, the end of the string is discovered by
   walking *backwards* from the end of the field until a non-padding byte is
   found, then returning all bytes from the start of the field through that
   byte. When writing, any excess bytes will be filled with the padding value.

   Although, technically, "at least one byte of padding" could be enforced by
   making the `PaddedString` one byte shorter and following it with a one-byte
   field whose value *must* be the padding byte, for convenience `PaddedString`
   should take a second optional parameter to specify that padding *is*
   required: `PaddedString(32, true)` must have at least one space at the end.

   `CouldWriteValue()` should return `true` if the value is no longer than the
   field and the value does not *end with* the padding byte.


### String Constants

String constants (used in constructs such as `[requires: this == "abcd"]`) may
take two forms:

1. `"A quoted string using C-style escapes like \n"`

   The following standard C89 escapes (as interpreted by an ASCII Unix
   compiler) should be allowed:

   * `\0` => 0
   * `\a` => 7
   * `\b` => 8
   * `\t` => 9
   * `\n` => 10
   * `\v` => 11
   * `\f` => 12
   * `\r` => 13
   * `\"` => 34
   * `\'` => 39
   * `\?` => 63 (part of the C standard, but rarely used)
   * `\\` => 92
   * <code>\x*hh*</code> => 0x*hh*

   In addition, the following non-C-standard escapes should be allowed:

   * `\e` => 27 (not actually standard, but common)
   * <code>\d*nnn*</code> => *nnn*
   * <code>\x{*hh*}</code> => 0x*hh*
   * <code>\d{*nnn*}</code> => *nnn*

   Note that the standard C escape <code>\\*nnn*</code> is explicitly not
   supported.
   C treats *nnn* as octal, which is often surprising, and modern
   languages (the cutoff date appears to be about 1993 -- right between Python
   2 and Java) have largely dropped support for the octal escapes.

   Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
   (nearly) universal among popular programming languages. <code>\x*hh*</code>
   is very common, though not universal. <code>\u*nnnn*</code>, where *nnnn*
   is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
   common, but only for text strings.

   To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
   required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
   should be required to have exactly 3 decimal digits. The braced versions --
   <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
   digits, but should be required to evaluate to a value in the range 0 to 255:
   that is, `\d{000000100}` should be allowed, but `\d{256}` should not.

   `\` characters should not be allowed outside of the escape sequences
   specified here.

   For now, only 7-bit ASCII printable characters (byte values 32 through 126)
   should be allowed in `"quoted strings"`, even though `.emb` files generally
   allow UTF-8. This requirement may be relaxed in the future.

2. A list of bytes in `{}`, where each byte is either a single-quoted character
   (`'a'`) or a numeric constant (e.g., `0x20` or `32`).

   For ease of transition from existing `UInt:8[]` fields, explicit index
   markers (`[8]:`) in the list should be allowed if the index exactly matches
   the current cursor index; this matches output from the current Emboss text
   format for `UInt:8[]`.

The existing parameter system will need to be extended to allow default values,
and to allow `external` types to accept parameters if they do not already.
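
The escape rules for `"quoted string"` constants could be sketched as a small
decoder. This is only an illustration of the rules listed in this section, not
the Emboss compiler's implementation: the names `DecodeQuotedString`,
`ParseNumericEscape`, and `DigitValue` are hypothetical, and two details are
assumptions (an empty braced escape such as `\x{}` is rejected, and an
unescaped `"` inside the text is an error).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {  // All names here are hypothetical, not Emboss APIs.

// Returns the numeric value of a hex digit, or -1 if `c` is not one.
inline int DigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses the digits of a \x or \d escape starting at text[*pos].  The
// un-braced form needs exactly `fixed_digits` digits; the braced form takes
// one or more (assumption: an empty `{}` is rejected).  Values above 255 fail.
inline bool ParseNumericEscape(const std::string &text, std::size_t *pos,
                               int base, int fixed_digits, int *value) {
  const bool braced = *pos < text.size() && text[*pos] == '{';
  if (braced) ++*pos;
  int digits = 0;
  *value = 0;
  while (*pos < text.size() && (braced || digits < fixed_digits)) {
    const int digit = DigitValue(text[*pos]);
    if (digit < 0 || digit >= base) break;
    // Stop accumulating once the value is already out of range, so that very
    // long digit runs cannot overflow `int`.
    if (*value <= 255) *value = *value * base + digit;
    ++digits;
    ++*pos;
  }
  if (braced) {
    if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return false;
    ++*pos;
  } else if (digits != fixed_digits) {
    return false;
  }
  return *value <= 255;
}

// Decodes the text between the quotes of a "quoted string" constant into raw
// bytes.  Returns false for anything the rules above disallow.
inline bool DecodeQuotedString(const std::string &text, std::string *out) {
  out->clear();
  std::size_t pos = 0;
  while (pos < text.size()) {
    const char c = text[pos++];
    // Only printable ASCII (32..126) is allowed; an unescaped quote is
    // treated as an error here (assumption).
    if (c < 32 || c > 126 || c == '"') return false;
    if (c != '\\') {
      out->push_back(c);
      continue;
    }
    if (pos >= text.size()) return false;  // Dangling backslash.
    int value = 0;
    switch (text[pos++]) {
      case '0': value = 0; break;
      case 'a': value = 7; break;
      case 'b': value = 8; break;
      case 't': value = 9; break;
      case 'n': value = 10; break;
      case 'v': value = 11; break;
      case 'f': value = 12; break;
      case 'r': value = 13; break;
      case 'e': value = 27; break;
      case '"': value = 34; break;
      case '\'': value = 39; break;
      case '?': value = 63; break;
      case '\\': value = 92; break;
      case 'x':
        if (!ParseNumericEscape(text, &pos, 16, 2, &value)) return false;
        break;
      case 'd':
        if (!ParseNumericEscape(text, &pos, 10, 3, &value)) return false;
        break;
      default:
        return false;  // Includes the unsupported octal \nnn form.
    }
    out->push_back(static_cast<char>(value));
  }
  return true;
}

}  // namespace sketch
```

With this sketch, `\d{000000100}` decodes to the single byte 100, while
`\d{256}`, `\x4`, and the C-style octal `\101` are all rejected.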


### String Field Methods (C++)

#### C++ String Type Parameterization

All methods that accept or return a string value should be templated on the C++
type to use (`std::string`, `std::string_view`, `char *`, etc.).

For methods that accept a string parameter (`Write`, etc.), the template
argument should be inferred, and they can be called without specifying the type.

For methods that only return a string value (`Read`, etc.), the template
argument would need to be specified: `Read<std::string_view>()`.

`char *` should not be accepted as a return type, due to problems with ensuring
that there is actually a null byte at the end of the string.

As an input type, `char *` is likely to need explicit specialization.

In many (most? all?) cases, methods should have no problem with some types that
are not really "string" types, such as `std::vector<char>`.

String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
`std::basic_string<unsigned char>`) should be explicitly supported.

If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
that it might be easy to hit undefined behavior with something like
`Read<std::string_view>()`, since the iterator type returned by `begin()` and
`end()` would not correctly model `std::contiguous_iterator`. The cautious
approach would be to disable `Read()` and `UncheckedRead()` if the backing
storage is not `ContiguousBuffer`; readout to something like `std::string` could
still be explicitly performed using the `begin()`/`end()` iterators.
Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
explicitly limited to a small set of known-good types, such as `std::string` and
`std::vector<char>`.


#### Methods

`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
as one would expect.
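
As one reading of the `ZString` and `PaddedString` descriptions under New
Built-In Types, the logical length that `Read()` would return might be computed
as in the sketch below. The function names are hypothetical, and a
`std::string` holding the raw field bytes stands in for the real backing
storage; this is not the generated Emboss code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {  // All names here are hypothetical, not Emboss APIs.

// Logical length of a ZString: the number of bytes before the first
// terminator.  Returns false if a required terminator is missing.
inline bool ZStringSize(const std::string &field, unsigned char terminator,
                        bool terminator_required, std::size_t *size) {
  const std::size_t found = field.find(static_cast<char>(terminator));
  if (found == std::string::npos) {
    if (terminator_required) return false;
    *size = field.size();  // The string runs to the very end of the field.
  } else {
    *size = found;  // Bytes after the terminator are ignored.
  }
  return true;
}

// Logical length of a PaddedString: walk backwards past trailing padding.
// Returns false if padding is required but the field has none.
inline bool PaddedStringSize(const std::string &field, unsigned char padding,
                             bool padding_required, std::size_t *size) {
  const std::size_t last = field.find_last_not_of(static_cast<char>(padding));
  *size = (last == std::string::npos) ? 0 : last + 1;
  return !padding_required || *size < field.size();
}

}  // namespace sketch
```

Note that interior padding bytes are kept: for the six-byte field `"a b   "`
with space padding, the logical value is the three bytes `"a b"`.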

`ToString()` should be an alias for `Read()`, to ease conversion from
`UInt:8[]`.

`CouldWriteValue()` should be defined as specified in the previous section.

`Ok()` should return `true` if the string has storage (though it could be
zero-length storage) and the bytes match the requirements (e.g., if a terminator
or padding byte is required, `Ok()` should only return `true` if such a byte is
present).

`Size()` should return the (logical) length of the string in bytes.

`MaxSize()` should return `BackingStorage().SizeInBytes()`, or
`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
terminator byte.

`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
C++ container type.

`operator[]` should return the value of a single byte at the specified offset.


#### `emboss::String` Type

(This section should not be considered particularly authoritative; the actual
implementation could differ greatly if another strategy turns out to be
easier or less complex in practice.)

Because values retrieved from the different string types can be used
interchangeably at the expression layer (e.g., `let s = condition ? z_string :
fix_string`), there must be a way for all views over strings to return a common
type. This is complicated by two requirements:

1. `emboss::String` should not allocate memory.
2. `emboss::String` needs to handle backing storage that is not
   `ContiguousBuffer`. It also needs to handle constant strings (`let x =
   "string"`), and be able to assign `Storage`-based strings to constant
   strings and vice versa.

To satisfy the first requirement, `emboss::String` will need to hold a reference
to the underlying storage, not actually copy bytes.

One way to satisfy the second requirement would be to simply copy the string's
bytes out to a new buffer, but that conflicts with the first requirement.
Instead, it should be a sum type over a `Storage` type parameter and a constant
string, like:

```c++
template <typename Storage>
class String {
 public:
  String();
  String(const char *data, int size);  // Constant string.
  String(Storage);                     // String backed by a field's storage.
  // ... operator= ...
  constexpr int size() const;
  // Dispatches on whichever alternative is currently held.
  constexpr char operator[](int index) const {
    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
                                 : backports::Get<1>(storage_).data()[index];
  }
  // ... begin(), end(), etc. ...

 private:
  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
  // requires C++17.
  backports::Variant<const char *, Storage> storage_;
};
```

At least for now, `emboss::String` does not need to be exposed as a documented,
supported API -- user code can use `Read<std::string_view>()` and similar
operations as needed, with full knowledge of the underlying storage type.

Comparisons and assignments between `emboss::String`s with different `Storage`
type parameters do not need to be supported, since they cannot be generated by
the code generator -- C++ codegen would only need those operations for
`emboss::String`s that are derived from the same parent structure.


### Handling in Other Languages

C++ is unusual in that it does not differentiate at a language level between
text strings and byte strings. Most other languages have different types for
byte strings and text strings.

For all languages that differentiate, Emboss strings should be treated as byte
strings or byte arrays (Python3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).

Other than this caveat, Emboss string support should be straightforward in other
languages.


### Text Format

Text format output should use the `"quoted string"` style. Byte values outside
the range 32 through 126 should be emitted as escapes. Values with standard
shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
should be emitted. It may be desirable to allow some `[text_format]` control
over the output in the future.

Text format input should allow both `"quoted string"` and list-of-bytes styles,
with exactly the same rules as string constants in an `.emb` file, except that
bytes > 126 might be allowed in a `"quoted string"`.


### Expressions

#### Type System Changes

In order to facilitate `[requires]` on string types, the new types should have a
new 'string' expression type.


#### Runtime Representation

In this proposal, no string manipulations are allowed, so temporary strings
(which might require memory allocation) will not be necessary.


#### String Attribute Representation

Attribute values are currently represented by a special `AttributeValue` type
which can hold either an `Expression` or a `String`. With a string expression
type, `AttributeValue` can be replaced by a plain `Expression`. This will
require changes to everything that touches `AttributeValue`.

Alternately, `AttributeValue` could be left in the IR with only `Expression`,
in which case only code that touches string attributes (`[byte_order]` and
`[(cpp) namespace]`) needs to change.


#### String Comparisons

Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
since these can be handled by passing references to existing memory.

Equality and inequality (`==` and `!=`) should be defined in the expected way:
two strings are equal iff they are the same length and the corresponding bytes
in each string have the same value, and they are unequal if they are not equal.

For ordering, strings should be compared lexically, using the binary value of
each byte, with no regard for semantic collation. That is, `"Z" < "a"`, since
`'Z'` is 90 and `'a'` is 97.

When one string is a strict prefix of another string, the shorter string should
be "less than" the longer; e.g., `"abc" < "abcdef"`. This is the same as the
natural ordering for zero-terminated strings.


#### Future String Operations

It may be desirable, at some future point, to allow various string
manipulations, such as concatenation or repetition, at least for compile-time
strings.

A substring operation should be possible without requiring memory allocation.

Indexing into a string (`str[offset]`) should be allowed if/when indexing into
an array is finally supported.


### Arrays of Strings

In some cases, it may be desirable to have an array of strings, like:

```
struct Foo:
  0 [+100]  ZString[10]  list
```

Although somewhat awkward, the existing explicit-length syntax should work:

```
struct Foo:
  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
```
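
Returning to the rules under String Comparisons, the ordering could be sketched
as a three-way byte comparison. `CompareBytes` is a hypothetical name, and
`std::string` stands in for whatever string representation the generated code
would actually use; bytes are compared as unsigned values so that values above
127 sort after ASCII.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {  // All names here are hypothetical, not Emboss APIs.

// Returns a negative value if a < b, zero if a == b, and a positive value if
// a > b, comparing byte by byte with no regard for collation.
inline int CompareBytes(const std::string &a, const std::string &b) {
  const std::size_t common = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < common; ++i) {
    const unsigned char x = static_cast<unsigned char>(a[i]);
    const unsigned char y = static_cast<unsigned char>(b[i]);
    if (x != y) return x < y ? -1 : 1;
  }
  // All shared bytes match: a strict prefix sorts before the longer string.
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}

}  // namespace sketch
```

All six comparison operators can be derived from the sign of this result; for
example, `==` corresponds to `CompareBytes(a, b) == 0`.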