
# String Support for Emboss

GitHub Issue [#28](https://github.com/google/emboss/issues/28)

## Background

It is somewhat common to embed short strings into binary structures; examples
include serial numbers and firmware revisions, although in some cases even
things like IP addresses are encoded as ASCII text embedded in a larger binary
message.

Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
is, arrays of 8-bit uints.  This is more-or-less functional, but can be awkward
for things like text format output, and provides no way to add assertions to
string fields.

String support is complicated by the fact that there are several common ways of
delimiting strings:

1.  Length determined by another field -- that is, the size of the string is
    explicit.
2.  The string is *terminated* by a specific byte value, usually `'\0'`.  In
    this case, there may be additional "garbage" bytes after the terminator,
    which should not be considered to be part of the string.
3.  The string is *padded* by a specific byte value, usually 32 (`' '`).  In
    this case, the "padding" character can usually occur inside the string,
    and only trailing padding characters should be trimmed off.

For both terminated and padded strings, some formats allow the string to run to
the very end of its field, with no terminator/padding, and some require the
terminator/padding.  In general, it seems that terminated strings are more
likely to require the terminator, while padded strings can usually be entered
with no padding.

There are, no doubt, other ways of delimiting strings.  These seem to be rare
and sui generis, and can often be handled by modeling them as length-determined
strings, then applying the necessary logic in code.

There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
("Latin-1"), UTF-8, UTF-16, etc.  UTF-16 seems to be rare outside of
Windows-based software and Java.  Hardware almost always appears to use ASCII
(encoded as one character per byte, with the high bit always clear), although
Java ME-based systems may use UTF-16.


## Proposal

### Bytestrings Only

All strings in Emboss should be considered to be opaque blobs of bytes;
interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.

UTF-16 strings are explicitly not handled by this proposal.  In principle, one
could add a "byte width" parameter to the string types, or use a prefix like `W`
to indicate "wide string" types, but it does not seem important for now.  This
decision can be revisited later.


### New Built-In Types

Add three new types to the Prelude (names subject to change):

1.  `FixString`, a string whose contents should be the entire field containing
    the `FixString`.  When writing to a `FixString`, the value must be exactly
    the same length as the field.

    `CouldWriteValue()` should return `true` for all strings that are exactly
    the correct length.

    `FixString` is very close to a notional `Blob` type or the current
    `UInt:8[]` type, except for differences in text format.

2.  `ZString`, a terminated string.  A `ZString` with no arguments uses a null
    byte (`'\0'`) as the terminator.  An optional argument can be used to
    specify the terminator -- a `ZString(36)`, for example, would be terminated
    by `$`.  When reading, the value returned is all bytes up to, but not
    including, the first terminator byte.  When writing, for compatibility, the
    entire field should be written, using the terminator value for padding if
    there is extra space.  A second optional parameter can be used to specify
    that the terminator is not required: `ZString(0, false)` can fill the
    underlying field with no terminator.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *contain* any instances of the terminator
    byte.

3.  `PaddedString`, a padded string.  A `PaddedString` with no arguments uses
    space (`' '`, 32) as the padding value.  An optional argument can be used to
    specify the padding -- a `PaddedString(0)`, for example, would be padded
    with null bytes.  When reading, the end of the string is discovered by
    walking *backwards* from the end of the field until a non-padding byte is
    found, then returning all bytes from the start of the field through that
    byte.  When writing, any excess bytes will be filled with the padding value.

    Although, technically, "at least one byte of padding" could be enforced by
    making the `PaddedString` one byte shorter and following it with a one-byte
    field whose value *must* be the padding byte, for convenience `PaddedString`
    should take a second optional parameter to specify that padding *is*
    required: `PaddedString(32, true)` must have at least one space at the end.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *end with* the padding byte.
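
The read and `CouldWriteValue()` rules for `ZString` and `PaddedString` can be
sketched in C++17 as free functions over `std::string_view`.  The function names
here are hypothetical and are not part of the Emboss runtime.  The sketch
assumes contiguous storage and follows the rules exactly as listed above, so it
does not add the extra length check that a *required* terminator or padding
byte would imply.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration only.  `field` is the raw contents of
// the entire field.

// ZString read: all bytes up to, but not including, the first terminator.  If
// no terminator is present, the whole field is returned.
inline std::string_view ReadTerminated(std::string_view field,
                                       char terminator = '\0') {
  return field.substr(0, field.find(terminator));
}

// PaddedString read: walk backwards past trailing padding bytes.  Padding
// bytes in the interior of the string are preserved.
inline std::string_view ReadPadded(std::string_view field,
                                   char padding = ' ') {
  std::size_t last = field.find_last_not_of(padding);
  return last == std::string_view::npos ? std::string_view()
                                        : field.substr(0, last + 1);
}

// ZString: the value must fit and must not contain the terminator.
inline bool CouldWriteTerminated(std::string_view value,
                                 std::size_t field_size,
                                 char terminator = '\0') {
  return value.size() <= field_size &&
         value.find(terminator) == std::string_view::npos;
}

// PaddedString: the value must fit and must not end with the padding byte.
inline bool CouldWritePadded(std::string_view value, std::size_t field_size,
                             char padding = ' ') {
  return value.size() <= field_size &&
         (value.empty() || value.back() != padding);
}
```

For example, `ReadPadded("a b   ")` yields `a b`: the interior space survives
and only the trailing spaces are trimmed.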


### String Constants

String constants (used in constructs such as `[requires: this == "abcd"]`) may
take two forms:

1.  `"A quoted string using C-style escapes like \n"`

    The standard C89 escapes (as interpreted by an ASCII Unix compiler) should
    be supported:

    *   `\0` => 0
    *   `\a` => 7
    *   `\b` => 8
    *   `\t` => 9
    *   `\n` => 10
    *   `\v` => 11
    *   `\f` => 12
    *   `\r` => 13
    *   `\"` => 34
    *   `\'` => 39
    *   `\?` => 63 (part of the C standard, but rarely used)
    *   `\\` => 92
    *   <code>\x*hh*</code> => 0x*hh*

    The following non-C-standard escapes should be allowed:

    *   `\e` => 27 (not actually standard, but common)
    *   <code>\d*nnn*</code> => *nnn*
    *   <code>\x{*hh*}</code> => 0x*hh*
    *   <code>\d{*nnn*}</code> => *nnn*

    Note that the standard C escape <code>\\*nnn*</code> is explicitly not
    supported.  C treats *nnn* as octal, which is often surprising, and modern
    languages (the cutoff date appears to be about 1993 -- right between Python
    2 and Java) have largely dropped support for the octal escapes.

    Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
    (nearly) universal among popular programming languages.  <code>\x*hh*</code>
    is very common, though not universal.  <code>\u*nnnn*</code>, where *nnnn*
    is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
    common, but only for text strings.

    To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
    required to have 2 hex digits, and the <code>\d*nnn*</code> escape should be
    required to have exactly 3 decimal digits.  The braced versions --
    <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
    digits, but should be required to evaluate to a value in the range 0 to 255:
    that is, `\d{000000100}` should be allowed, but `\d{256}` should not.

    `\` characters should not be allowed outside of the escape sequences
    specified here.

    For now, only 7-bit ASCII printable characters (byte values 32 through 126)
    should be allowed in `"quoted strings"`, even though `.emb` files generally
    allow UTF-8.  This requirement may be relaxed in the future.

2.  A list of bytes in `{}`, where each byte is either a single-quoted character
    (`'a'`) or a numeric constant (e.g., `0x20` or `32`).

    For ease of transition from existing `UInt:8[]` fields, explicit index
    markers (`[8]:`) in the list should be allowed if the index exactly matches
    the current cursor index; this matches output from the current Emboss text
    format for `UInt:8[]`.
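
The escape rules for `"quoted strings"` could be implemented along the
following lines.  This is a minimal C++17 sketch, not the actual Emboss front
end: `DecodeQuotedString` and its helpers are hypothetical names, and the input
is assumed to be the text between the quotes.  Three details are assumptions
that the rules above do not spell out: an un-braced <code>\d*nnn*</code> above
255 is rejected, an unescaped `"` inside the body is rejected, and `\0` is
always a single null byte even when followed by digits.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

namespace {

// Returns the value of `c` as a digit in `base` (10 or 16), or -1.
int DigitValue(char c, int base) {
  if (c >= '0' && c <= '9') return c - '0';
  if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses the digits of a \x or \d escape starting at text[*pos].  Un-braced
// escapes take exactly `fixed_digits` digits; braced escapes take one or more.
// Returns the byte value, or -1 if malformed or outside 0..255.
int ParseNumericEscape(std::string_view text, std::size_t *pos, int base,
                       int fixed_digits) {
  bool braced = *pos < text.size() && text[*pos] == '{';
  if (braced) ++*pos;
  int value = 0;
  int digits = 0;
  while (*pos < text.size() && (braced || digits < fixed_digits)) {
    int digit = DigitValue(text[*pos], base);
    if (digit < 0) break;
    value = value * base + digit;
    if (value > 255) return -1;
    ++digits;
    ++*pos;
  }
  if (braced) {
    if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return -1;
    ++*pos;
  } else if (digits != fixed_digits) {
    return -1;
  }
  return value;
}

}  // namespace

// Decodes the body of a quoted string into raw bytes, or returns nullopt if
// the body breaks any of the rules.
std::optional<std::string> DecodeQuotedString(std::string_view body) {
  static constexpr std::string_view kNames = "0abtnvfr\"'?\\e";
  static constexpr char kValues[] = {0,  7,  8,  9,  10, 11, 12,
                                     13, 34, 39, 63, 92, 27};
  std::string result;
  std::size_t pos = 0;
  while (pos < body.size()) {
    char c = body[pos++];
    if (c != '\\') {
      // Only printable 7-bit ASCII is allowed unescaped.
      if (c < 32 || c > 126 || c == '"') return std::nullopt;
      result.push_back(c);
      continue;
    }
    if (pos >= body.size()) return std::nullopt;  // Trailing backslash.
    char kind = body[pos++];
    if (kind == 'x' || kind == 'd') {
      int value = ParseNumericEscape(body, &pos, kind == 'x' ? 16 : 10,
                                     kind == 'x' ? 2 : 3);
      if (value < 0) return std::nullopt;
      result.push_back(static_cast<char>(value));
    } else {
      std::size_t index = kNames.find(kind);
      if (index == std::string_view::npos) return std::nullopt;
      result.push_back(kValues[index]);
    }
  }
  return result;
}
```

Because octal escapes are not in the table of single-character escapes, an
input such as `\101` is rejected rather than silently read as `A`.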

The existing parameter system will need to be extended to allow default values,
and to allow `external` types to accept parameters if they do not already.


### String Field Methods (C++)

#### C++ String Type Parameterization

All methods that accept or return a string value should be templated on the C++
type to use (`std::string`, `std::string_view`, `char *`, etc.).

For methods that accept a string parameter (`Write`, etc.), the template
argument should be inferred, and they can be called without specifying the type.

For methods that only return a string value (`Read`, etc.), the template
argument would need to be specified: `Read<std::string_view>()`.

`char *` should not be accepted as a return type, due to problems with ensuring
that there is actually a null byte at the end of the string.

As an input type, `char *` is likely to need explicit specialization.

In many (most? all?) cases, methods should have no problem with some types that
are not really "string" types, such as `std::vector<char>`.

String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
`std::basic_string<unsigned char>`) should be explicitly supported.

If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
that it might be easy to hit undefined behavior with something like
`Read<std::string_view>()`, since the iterator type returned by `begin()` and
`end()` would not correctly model `std::contiguous_iterator`.  The cautious
approach would be to disable `Read()` and `UncheckedRead()` if the backing
storage is not `ContiguousBuffer`; readout to something like `std::string` could
still be explicitly performed using the `begin()`/`end()` iterators.
Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
explicitly limited to a small set of known-good types, such as `std::string` and
`std::vector<char>`.
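
To make the calling convention concrete, here is a small C++17 sketch of a view
whose `Read()` is templated on the result type and whose `Write()` infers its
argument type.  `FixStringViewSketch` is a hypothetical stand-in for a generated
`FixString` view over contiguous storage; the real generated code would be
structured differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a generated FixString view over a contiguous
// buffer.  Not the actual Emboss API.
class FixStringViewSketch {
 public:
  FixStringViewSketch(char *data, std::size_t size)
      : data_(data), size_(size) {}

  // The result type cannot be deduced from the arguments, so the caller must
  // name it: view.Read<std::string>().
  template <typename T>
  T Read() const {
    if constexpr (std::is_same_v<T, std::string_view>) {
      // Non-owning result: only valid because the storage is contiguous.
      return std::string_view(data_, size_);
    } else {
      // Owning containers copy the bytes out through an iterator pair.
      return T(data_, data_ + size_);
    }
  }

  // A FixString accepts only values that exactly fill the field.
  template <typename T>
  bool CouldWriteValue(const T &value) const {
    return std::size(value) == size_;
  }

  // T is deduced from the argument: view.Write(std::string("abcd")).
  template <typename T>
  bool Write(const T &value) {
    if (!CouldWriteValue(value)) return false;
    std::copy(std::begin(value), std::end(value), data_);
    return true;
  }

 private:
  char *data_;
  std::size_t size_;
};
```

With this sketch, `view.Read<std::vector<char>>()` works without any special
handling, while a string literal passed to `Write()` would be deduced as a
`char` array that includes the trailing null byte, which is one reason `char *`
inputs are likely to need their own overload.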


#### Methods

`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
as one would expect.

`ToString()` should be an alias for `Read()`, to ease conversion from
`UInt:8[]`.

`CouldWriteValue()` should be defined as specified in the previous section.

`Ok()` should return `true` if the string has storage (though it could be
zero-length storage) and the bytes match the requirements (e.g., if a terminator
or padding byte is required, `Ok()` should only return `true` if such a byte is
present).

`Size()` should return the (logical) length of the string in bytes.

`MaxSize()` should return `BackingStorage().SizeInBytes()` or
`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
terminator byte.

`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
C++ container type.

`operator[]` should return the value of a single byte at the specified offset.
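
The relationship between `Ok()`, `Size()`, and `MaxSize()` can be illustrated
with a small C++17 model of a `ZString` view.  `ZStringViewSketch` is a
hypothetical class, not generated Emboss code; it uses a `std::string_view` in
place of the backing storage and treats a default-constructed (null)
`std::string_view` as "no storage".

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical model of a ZString view, for illustration only.
class ZStringViewSketch {
 public:
  ZStringViewSketch(std::string_view storage, char terminator,
                    bool terminator_required)
      : storage_(storage),
        terminator_(terminator),
        terminator_required_(terminator_required) {}

  // True if there is storage and, when a terminator is required, one is
  // actually present somewhere in the field.
  bool Ok() const {
    if (storage_.data() == nullptr) return false;
    return !terminator_required_ ||
           storage_.find(terminator_) != std::string_view::npos;
  }

  // Logical length: bytes before the first terminator, or the whole field if
  // there is no terminator.
  std::size_t Size() const {
    std::size_t end = storage_.find(terminator_);
    return end == std::string_view::npos ? storage_.size() : end;
  }

  // Longest value that can be written; one byte is reserved when the
  // terminator is required.
  std::size_t MaxSize() const {
    if (!terminator_required_) return storage_.size();
    return storage_.empty() ? 0 : storage_.size() - 1;
  }

 private:
  std::string_view storage_;
  char terminator_;
  bool terminator_required_;
};
```

For a 6-byte field holding `abc\0zz` with a required terminator, this model
reports `Ok() == true`, `Size() == 3`, and `MaxSize() == 5`.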


#### `emboss::String` Type

(This section should not be considered particularly authoritative; the actual
implementation could differ greatly if another strategy turns out to be
easier or less complex in practice.)

Because values retrieved from the different string types can be used
interchangeably at the expression layer (e.g., `let s = condition ? z_string :
fix_string`), there must be a way for all views over strings to return a common
type.  This is complicated by two requirements:

1.  `emboss::String` should not allocate memory.
2.  `emboss::String` needs to handle backing storage that is not
    `ContiguousBuffer`.  It also needs to handle constant strings (`let x =
    "string"`), and be able to assign `Storage`-based strings to constant
    strings and vice versa.

To satisfy the first requirement, `emboss::String` will need to hold a reference
to the underlying storage, not actually copy bytes.

One way to satisfy the second requirement would be to simply copy the string's
bytes out to a new buffer, but that conflicts with the first requirement.
Instead, it should be a sum type over a `Storage` type parameter and a constant
string, like:

```c++
template <typename Storage>
class String {
 public:
  String();
  String(const char *data, int size);
  String(Storage);
  // ... operator= ...
  constexpr int size() const;
  constexpr char operator[](int index) const {
    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
                                 : backports::Get<1>(storage_).data()[index];
  }
  // ... begin(), end(), etc. ...

 private:
  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
  // requires C++17.
  backports::Variant<const char *, Storage> storage_;
};
```

At least for now, `emboss::String` does not need to be exposed as a documented,
supported API -- user code can use `Read<std::string_view>()` and similar
operations as needed, with full knowledge of the underlying storage type.

Comparisons and assignments between `emboss::String`s with different `Storage`
type parameters do not need to be supported, since they cannot be generated by
the code generator -- C++ codegen would only need those operations for
`emboss::String`s that are derived from the same parent structure.


### Handling in Other Languages

C++ is unusual in that it does not differentiate at a language level between
text strings and byte strings.  Most other languages have different types for
byte strings and text strings.

For all languages that differentiate, Emboss strings should be treated as byte
strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).

Other than this caveat, Emboss string support should be straightforward in other
languages.


### Text Format

Text format output should use the `"quoted string"` style.  Byte values outside
the range 32 through 126 should be emitted as escapes.  Values with standard
shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
should be emitted.  It may be desirable to allow some `[text_format]` control
over the output in the future.

Text format input should allow both `"quoted string"` and list-of-bytes styles,
with exactly the same rules as string constants in an `.emb` file, except that
bytes > 126 might be allowed in a `"quoted string"`.
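
The output rules can be sketched as a single C++17 function.  `ToQuotedText` is
a hypothetical name, and two choices here are assumptions rather than part of
the proposal: hex digits are emitted in lowercase, and byte 27 is emitted as
`\x1b` rather than the non-standard `\e`.

```cpp
#include <string>
#include <string_view>

// Renders raw bytes in the "quoted string" text format style.  Hypothetical
// helper for illustration; not the actual Emboss text format writer.
std::string ToQuotedText(std::string_view bytes) {
  static constexpr char kHex[] = "0123456789abcdef";
  std::string out = "\"";
  for (char c : bytes) {
    unsigned char byte = static_cast<unsigned char>(c);
    switch (byte) {
      // Standard shorthand escapes.
      case 0:  out += "\\0"; break;
      case 7:  out += "\\a"; break;
      case 8:  out += "\\b"; break;
      case 9:  out += "\\t"; break;
      case 10: out += "\\n"; break;
      case 11: out += "\\v"; break;
      case 12: out += "\\f"; break;
      case 13: out += "\\r"; break;
      // Printable characters that would otherwise end or corrupt the string.
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      default:
        if (byte >= 32 && byte <= 126) {
          out.push_back(c);
        } else {
          // Always exactly two hex digits, e.g. \x06 rather than \x6.
          out += "\\x";
          out.push_back(kHex[byte >> 4]);
          out.push_back(kHex[byte & 0x0f]);
        }
    }
  }
  out.push_back('"');
  return out;
}
```

Emitting exactly two hex digits keeps the output unambiguous when the next
byte happens to be a hex digit character: the bytes 6, `'a'` come out as
`"\x06a"`.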


### Expressions

#### Type System Changes

In order to facilitate `[requires]` on string types, the new types should have a
new 'string' expression type.


#### Runtime Representation

In this proposal, no string manipulations are allowed, so temporary strings
(which might require memory allocation) will not be necessary.


#### String Attribute Representation

Attribute values are currently represented by a special `AttributeValue` type
which can hold either an `Expression` or a `String`.  With a string expression
type, `AttributeValue` can be replaced by a plain `Expression`.  This will
require changes to everything that touches `AttributeValue`.

Alternately, `AttributeValue` could be left in the IR with only `Expression`,
in which case only code that touches string attributes (`[byte_order]` and
`[(cpp) namespace]`) needs to change.


#### String Comparisons

Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
since these can be handled by passing references to existing memory.

Equality and inequality (`==` and `!=`) should be defined in the expected way:
two strings are equal iff they are the same length and the corresponding bytes
in each string have the same value, and they are unequal if they are not equal.

For ordering, strings should be compared lexicographically, using the binary value of
each byte, with no regard for semantic collation.  That is, `"Z" < "a"`, since
`'Z'` is 90 and `'a'` is 97.

When one string is a strict prefix of another string, the shorter string should
be "less than" the longer; e.g., `"abc" < "abcdef"`.  This is the same as the
natural ordering for zero-terminated strings.
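
These comparison rules amount to a byte-wise three-way comparison, which can be
sketched in C++17 as follows.  `CompareBytes` is a hypothetical helper name; the
sketch takes `std::string_view` arguments for brevity, whereas the real runtime
would compare through whatever string representation it ends up using.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Returns a negative value if a < b, zero if a == b, and a positive value if
// a > b.  Bytes are compared as unsigned values, so 0x80 sorts after 0x7f
// regardless of whether `char` is signed on the target platform.
inline int CompareBytes(std::string_view a, std::string_view b) {
  std::size_t common = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < common; ++i) {
    unsigned char x = static_cast<unsigned char>(a[i]);
    unsigned char y = static_cast<unsigned char>(b[i]);
    if (x != y) return x < y ? -1 : 1;
  }
  // All shared bytes match: a strict prefix is "less than" the longer string.
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}
```

All six operators can then be defined in terms of this one function; for
example, `a <= b` is `CompareBytes(a, b) <= 0`.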


#### Future String Operations

It may be desirable, at some future point, to allow various string
manipulations, such as concatenation or repetition, at least for compile-time
strings.

A substring operation should be possible without requiring memory allocation.

Indexing into a string (`str[offset]`) should be allowed if/when indexing into
an array is finally supported.


### Arrays of Strings

In some cases, it may be desirable to have an array of strings, like:

```
struct Foo:
  0 [+100]  ZString[10]  list
```

Although somewhat awkward, the existing explicit-length syntax should work:

```
struct Foo:
  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
```