# Design Sketch: Protocol Buffers <=> Emboss Translation

## Overview

There are many tools that operate on Protocol Buffer objects ("Protos").
Providing a way to translate between Protos and Emboss structures would allow
those tools to be used without writing a tedious translation layer.


## Defining an Equivalent Proto `message`

For each Emboss `struct`, `bits`, `enum`, and primitive type, there would need
to be some equivalent Proto encoding -- likely a `message` for each `struct` or
`bits`, a Proto `enum` inside a `message` for each `enum` (see below), and a
Proto primitive type for each Emboss primitive type.

There are two basic ways that the Proto definition could be generated:

1.  Human-Authored `.proto` Definitions:

    This requires more human effort when trying to use Emboss structures as
    Protos, likely approaching the level of effort to just hand-write a
    translation layer. It *might* make it easier to use an existing Proto
    definition.

    It would also require significantly more flexibility, and therefore more
    complexity, in the Emboss compiler.

2.  Emboss Generates a `.proto` File:

    This option is likely to create slightly "unnatural" Proto definitions (see
    below for more details), but requires very little human effort to create a
    translation to a Proto.

    Escape hatches for "partially hand-coded" translations should be
    considered, even if they are not implemented in the first pass at Emboss
    <=> Proto translation.

Because a human always has the option to hand code their own translation, this
document will assume option 2: the Emboss compiler generates a Proto
definition.


### Proto2 vs Proto3

The current state of Google Protocol Buffers is a bit messy, with both "version
2" ("Proto2") and "version 3" ("Proto3") Protocol Buffers.
Proto2 and Proto3 can (mostly) freely interoperate -- Proto2 files can import
and use messages from Proto3 files and vice versa -- and both have long-term
support guarantees from Google. Differences between Proto2 and Proto3 are
highlighted below. It is not clear whether Emboss should generate Proto2,
Proto3, or both (via a flag or file-level property).


### Primitive Types

#### `Int`, `UInt`

`Int` and `UInt` can map to Proto's `int32`, `int64`, `uint32`, and `uint64`.
Smaller integers can be extended to the next-largest Proto integer size.


#### `Float`

`Float` maps to Proto's `float` and `double`.


#### `Flag`

`Flag` maps to Proto's `bool`.


#### (Future) Emboss String/Blob Type

A future Emboss string or blob type would translate to Proto's `string` or
`bytes`. It is likely that an Emboss "string" will be `bytes` in Proto, since
Emboss is unlikely to enforce UTF-8 compliance.

Note that Proto (version 2 only?) C++ does not enforce UTF-8 compliance on
`string`, which can lead to crashes when the message is decoded in Python,
Java, or another language that properly enforces string encoding.


### Arrays

Unidimensional arrays map neatly to `repeated` Proto fields.

Multidimensional arrays must be handled with a wrapper `message` at each
dimension after the first.

Because of the way that Proto wire format works (see [Translation Between
Emboss View and Proto Wire Format](#between-emboss-view-and-proto-wire-format),
below), there is a slight technical advantage to wrapping the outermost array
in its own message.
This does make the (Proto) API a bit awkward, but not too bad:

```c++
auto element = structure.array_field().v(2);
auto nested_element = structure.array_2d_field().v(2).v(1);
```

vs

```c++
auto element = structure.array_field(2);
auto nested_element = structure.array_2d_field(2).v(1);
```


### Conditional Fields

In Proto2, conditional fields map fairly well to the concept of "presence" for
fields. Proto2 allows non-present fields to be read -- returning the default
value for that field -- but this is not an issue for Emboss, which can easily
generate the appropriate <code>has_*field*()</code> calls.

Proto3 does not track existence for primitive types the way that Proto2 does.
The "recommended" workaround is to use standardized wrapper types
(`google.protobuf.FloatValue`, `google.protobuf.Int32Value`, etc.), which
introduce an extra layer. There is a second workaround, related to the slightly
weird way that Proto handles `oneof`: if the primitive field is inside a
`oneof`, then it is *not* always present. A `oneof` may contain a single
member, so primitive-typed fields could be generated as something like:

```
message Foo {
  oneof field_1_oneof {
    int32 field_1 = 1;
  }
}
```

Note that in Emboss, changing a field from unconditionally present to
conditionally present is (usually) a backwards-compatible change.


### (Future) Emboss Union Construct

An Emboss union construct would be necessary to take advantage of runtime space
savings from using a Proto `oneof`.


### `struct` and `bits`

`struct` and `bits` map neatly to `message`, with few issues.


#### Anonymous `bits`

Anonymous `bits` get "flattened" so that their fields appear to be part of their
enclosing structure.
This should be handled reasonably well by treating read-write virtual fields as
members of the `message`, and by suppressing the "private" fields, such as
anonymous `bits`.


#### Proto Field IDs

Proto requires each field to have a unique tag ID. We propose that, for fields
with a fixed start location, the start location + 1 is used for a default tag
ID: since a change to a field's start location would be a breaking change to the
Emboss definition, it should be reasonably stable. For fields with a variable
start location, virtual fields, or where the programmer wants a specific tag,
the attribute `[(proto) id]` can be used to specify the ID.

The "+ 1" is required since `0` is not a valid Proto tag ID.


### `enum`

The Emboss `enum` construct does not map cleanly to the Proto `enum` construct,
with different issues in Proto2 vs Proto3.

Common to both, the names of Proto `enum` values are hoisted into the same
namespace as the `enum` itself (consistent with C's handling of `enum`), which
means that multiple `enum`s in the same context cannot hold the same value
name. This can be handled -- somewhat awkwardly -- by wrapping the `enum` in a
"namespace" `message`, like:

```
message SomeEnum {
  enum SomeEnum {
    VALUE1 = 1;
    VALUE2 = 2;
  }
}
```

Additionally, Proto `enum` values must fit in an `int32`, whereas Emboss `enum`
values may require up to a `uint64`.

Proto2: In Proto2, `enum`s are closed: unknown values are ignored on message
parse, so `enum` fields can never have an unknown value at runtime. Emboss
`enum`s, much like C `enum`s, can hold unknown values.

Proto3: In Proto3, `enum`s are open, like Emboss `enum`s, but every Proto3
`enum` must have a first entry whose value is `0`. In order to avoid
compatibility issues, Emboss should emit a well-known name for the `0` value in
every case.
There is a second issue in Proto3: there is no "has" bit for enum fields, so
conditional enum fields have to be wrapped in a `message`.
(TODO(bolms): are Proto3 `enum`s signed, unsigned, or either?)

Thus, for Proto2, `enum`s would produce something like:

```
message SomeEnum {
  enum SomeEnum {
    VALUE1 = 1;
    VALUE2 = 2;
  }
  // A oneof must be named; `value_oneof` is a placeholder name.
  oneof value_oneof {
    SomeEnum value = 1;
    int64 integer_value = 2;
  }
}
```

which would be included in structures as:

```
message SomeStruct {
  optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum
}
```

For Proto3, the situation ends up similar:

```
message SomeEnum {
  enum SomeEnum {
    DEFAULT = 0;
    VALUE1 = 1;
    VALUE2 = 2;
  }
  SomeEnum value = 1;
}

message SomeStruct {
  optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum
}
```


#### `enum` Name Restrictions

Proto enforces a (very slightly) stricter rule for the names of values within
an `enum` than Emboss does: they must not collide *even when translated to
CamelCase*.

For example, Emboss allows:

```
enum Foo:
  BAR_1_1 = 2
  BAR_11 = 11
```

When translated to CamelCase, `BAR_1_1` and `BAR_11` both become `Bar11`, and
thus are not allowed to be part of the same `enum` in Proto.

It may be sufficient to require `.emb` authors to update their `enum`s when
attempting to compile to Proto.


### Bookkeeping Fields

Emboss structures often have "bookkeeping" fields that are either irrelevant to
typical Proto consumers, or place unusual restrictions.

For example, fields which are used to calculate the offset of other fields are
generally not useful to Proto consumers:

```
struct Foo:
  0 [+4]  UInt  header_length (h)
  h [+4]  UInt  first_body_message
```

**These fields would still need to be set correctly when translating *from*
Proto to Emboss.**

Some of the pain could likely be mitigated via a [default
values](#default_values.md) feature, when implemented.

Field-length fields are somewhat trickier:

```
struct Foo:
  0 [+4]  UInt      message_length (m)
  4 [+m]  UInt:8[]  message_bytes
```

In Proto, `message_length` becomes an implicit part of `message_bytes`, since
`message_bytes` knows its own length. For simple cases, as above, we can likely
have the Emboss compiler "just figure it out" and fold `message_length` into
`message_bytes`. For more complex cases, we will probably need to have explicit
annotations (`[(proto) set_length_by: x = some_expression]`), or just require
applications using the Proto side to set length fields correctly.

A similar problem happens with "message type" fields:

```
struct Foo:
  0 [+4]  MessageType  message_type (mt)
  if mt == MessageType.BAR:
    4 [+8]  Bar  bar
  if mt == MessageType.BAZ:
    4 [+16]  Baz  baz
  # ...
```

This will probably be easier to handle with a `union` construct in Emboss.
Again, "complex" cases will probably have to be handled by application code.


## Translation

### Between Emboss View and Proto In-Memory Format

Translation should be relatively straightforward; when going from Emboss to
Proto, the problem is roughly equivalent to serializing a View to text, and for
Proto to Emboss it is roughly equivalent to deserializing a View from text.

One minor difference is that the *deserialization* from Proto must occur in
dependency order, while serialization can happen in any order.
In Emboss text format, *serialization* happens in dependency order, and
deserialization happens in whatever order is specified in the text.

As with deserialization from text, it is possible for the Proto message to
include untranslatable entries (e.g., an Emboss `Int:16` would be stored in a
Proto `int32`; a too-large value in the Proto `message` should be rejected).


### Between Emboss View and Proto Wire Format

Since the Proto wire format is extremely stable and well documented, it would
be possible for Emboss to emit code to directly translate between Emboss
structs and Proto wire format.

*Serialization* is relatively straightforward; except for arrays, the code
structure is almost identical to the text serialization code structure.

*Deserialization* is problematic. First and foremost, Proto does not specify an
order in which the fields of a structure will be serialized, so it is entirely
possible for the Emboss view to see a dependent field before its prerequisite
(e.g., have a variable-offset field before the offset specifier field).
Secondly, Proto repeated fields aren't really "arrays"; on the wire, other
fields can appear *in between* elements of repeated fields. For Emboss, this
means that every array in the structure would have to maintain a cursor during
deserialization.

It *may* still be desirable to support serialization without trying to support
deserialization, or to support deserialization for a subset of structures, so
that we can send protos to/from microcontrollers: this would be an alternative
to Nanopb for some cases.


### Between Emboss View and [Nanopb](https://github.com/nanopb/nanopb)

In order to translate between Emboss views and Protos on microcontrollers and
other limited-memory devices, it may make sense to generate Emboss <=> Nanopb
code.
On top of the standard Proto generator, we would have to implement a 363Nanopb options file generator, and translation code. 364 365 366## Miscellaneous Notes 367 368### Overlays 369 370Emboss was designed with the notion that some backends would need their own 371attributes -- for example, the `[(cpp) namespace]` attribute, and here there 372are a number of `[(proto)]` attributes. 373 374However, adding back-end-specific attributes still requires changes to be made 375directly to the `.emb` file, which may be inconvenient for `.emb`s from third 376parties. 377 378Ideally, one could write an "overlay file," like: 379 380``` 381message Foo 382 [(proto) attr = value] 383 384 field 385 [(proto) field_attr = value] 386``` 387 388This is not needed for a first pass at a Proto back end, but should be 389considered. 390 391 392### Generating an `.emb` From a `.proto` 393 394There are cases where it would be useful to generate a microcontroller-friendly 395representation of an existing Proto, rather than the other way around. 396 397For most `message`s, it would be relatively straightforward to generate a 398`struct`, like: 399 400``` 401message Foo { 402 optional int32 bar = 1; 403 optional bool baz = 2; 404 optional string qux = 3; 405} 406``` 407 408to: 409 410``` 411struct Foo: 412 0 [+4] bits: 413 0 [+1] Flag has_bar 414 1 [+1] Flag has_baz 415 if has_baz: 416 2 [+1] Flag baz 417 2 [+1] Flag has_qux 418 419 if has_bar: 420 4 [+4] Int:32 bar 421 422 if has_qux: 423 8 [+4] UInt:32 qux_offset 424 12 [+4] UInt:32 qux_length 425 qux_offset [+qux_length] UInt:8[] qux 426``` 427 428The main issue is that it would be difficult to maintain equivalent 429backwards-compatibility guarantees to the ones that Proto provides as messages 430evolve. 431 432Also note that this format is fairly close to the [Cap'n 433Proto](https://capnproto.org/) format. 434