# Design Sketch: Protocol Buffers <=> Emboss Translation

## Overview

There are many tools that operate on Protocol Buffer objects ("Protos").
Providing a way to translate between Protos and Emboss structures would allow
those tools to be used without writing a tedious translation layer.


## Defining an Equivalent Proto `message`

For each Emboss `struct`, `bits`, `enum`, and primitive type, there would need
to be some equivalent Proto encoding -- likely a `message` for each `struct` or
`bits`, a Proto `enum` inside a `message` for each `enum` (see below), and a
Proto primitive type for each Emboss primitive type.

There are two basic ways that the Proto definition could be generated:

1.  Human-Authored `.proto` Definitions:

    This requires more human effort when trying to use Emboss structures as
    Protos, likely approaching the level of effort to just hand-write a
    translation layer.  It *might* make it easier to use an existing Proto
    definition.

    It would also require significantly more flexibility, and therefore more
    complexity, in the Emboss compiler.

2.  Emboss Generates a `.proto` File:

    This option is likely to create slightly "unnatural" Proto definitions (see
    below for more details), but requires very little human effort to create a
    translation to a Proto.

    Escape hatches for "partially hand-coded" translations should be
    considered, even if they are not implemented in the first pass at Emboss
    <=> Proto translation.

Because a human always has the option to hand-code their own translation, this
document will assume option 2: the Emboss compiler generates a Proto
definition.


### Proto2 vs Proto3

The current state of Google Protocol Buffers is a bit messy, with both "version
2" ("Proto2") and "version 3" ("Proto3") Protocol Buffers.  Proto2 and Proto3
can (mostly) freely interoperate -- Proto2 files can import and use messages
from Proto3 files and vice versa -- and both have long-term support guarantees
from Google.  Differences between Proto2 and Proto3 are highlighted below.  It
is not yet clear whether Emboss should generate Proto2, Proto3, or both (via a
flag or file-level property).


### Primitive Types

#### `Int`, `UInt`

`Int` and `UInt` can map to Proto's `int32`, `int64`, `uint32`, and `uint64`.
Integers narrower than a Proto type can be widened to the next-larger Proto
integer size; for example, an `Int:16` would be stored in an `int32`.

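A compiler pass could pick the Proto type from the Emboss type and its width.
The sketch below is one way to express the widening rule; `proto_integer_type`
is a hypothetical helper name, not part of the Emboss compiler, and it assumes
Emboss integers are between 1 and 64 bits wide.

```python
def proto_integer_type(emboss_type: str, bit_width: int) -> str:
    """Return the narrowest Proto integer type that holds the Emboss integer.

    Hypothetical helper: `emboss_type` is "Int" or "UInt", `bit_width` is the
    size of the Emboss field in bits.
    """
    if emboss_type not in ("Int", "UInt"):
        raise ValueError(f"not an Emboss integer type: {emboss_type}")
    if not 1 <= bit_width <= 64:
        raise ValueError(f"unsupported integer width: {bit_width}")
    prefix = "int" if emboss_type == "Int" else "uint"
    # Proto only has 32- and 64-bit integers, so anything up to 32 bits is
    # widened to the 32-bit type and everything else to the 64-bit type.
    return f"{prefix}32" if bit_width <= 32 else f"{prefix}64"
```

Under this rule, `proto_integer_type("Int", 16)` returns `"int32"`, and a
48-bit `UInt` maps to `uint64`.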

#### `Float`

`Float` maps to Proto's `float` (32-bit) and `double` (64-bit).


#### `Flag`

`Flag` maps to Proto's `bool`.


#### (Future) Emboss String/Blob Type

A future Emboss string or blob type would translate to Proto's `string` or
`bytes`.  It is likely that an Emboss "string" will be `bytes` in Proto, since
Emboss is unlikely to enforce UTF-8 compliance.

Note that Proto (version 2 only?) C++ does not enforce UTF-8 compliance on
`string`, which can lead to crashes when the message is decoded in Python,
Java, or another language that properly enforces string encoding.


### Arrays

Unidimensional arrays map neatly to `repeated` Proto fields.

Multidimensional arrays must be handled with a wrapper `message` at each
dimension after the first.

Because of the way that Proto wire format works (see [Translation Between
Emboss View and Proto Wire Format](#between-emboss-view-and-proto-wire-format),
below), there is a slight technical advantage to wrapping the outermost array
in its own message.  This does make the (Proto) API a bit awkward, but not too
bad:

```c++
auto element = structure.array_field().v(2);
auto nested_element = structure.array_2d_field().v(2).v(1);
```

vs

```c++
auto element = structure.array_field(2);
auto nested_element = structure.array_2d_field(2).v(1);
```

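As a rough illustration of the "wrapper `message` at each dimension after the
first" rule, the sketch below emits `.proto` text for an array field without
the optional outermost wrapper.  The function name, the `Dim<N>` wrapper naming
scheme, and the choice of tag `1` for the inner `v` field are all assumptions
made for this example; only the `v` accessor name comes from the snippets
above.

```python
def array_field_proto(element_type: str, field_name: str, tag: int,
                      dimensions: int) -> str:
    """Return .proto text for an Emboss array with `dimensions` dimensions.

    The outermost dimension is a plain `repeated` field.  Each further
    dimension gets a (hypothetically named) wrapper message holding a single
    repeated field called `v`.
    """
    if dimensions < 1:
        raise ValueError("an array needs at least one dimension")
    camel = "".join(part.capitalize() for part in field_name.split("_"))
    lines = []
    inner_type = element_type
    # Build wrappers from the innermost dimension outwards.
    for depth in range(dimensions - 1, 0, -1):
        wrapper = f"{camel}Dim{depth}"
        lines.append(f"message {wrapper} {{ repeated {inner_type} v = 1; }}")
        inner_type = wrapper
    lines.append(f"repeated {inner_type} {field_name} = {tag};")
    return "\n".join(lines)
```

For a two-dimensional `array_2d_field`, this produces one wrapper message and a
`repeated` field of that wrapper type, which matches the
`array_2d_field(2).v(1)` access pattern.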

### Conditional Fields

In Proto2, conditional fields map fairly well to the concept of "presence" for
fields.  Proto2 allows non-present fields to be read -- returning the default
value for that field -- but this is not an issue for Emboss, which can easily
generate the appropriate <code>has_*field*()</code> calls.

Proto3 does not track existence for primitive types the way that Proto2 does.
The "recommended" workaround is to use standardized wrapper types
(`google.protobuf.FloatValue`, `google.protobuf.Int32Value`, etc.), which
introduce an extra layer.  There is a second workaround, related to the slightly
weird way that Proto handles `oneof`: if the primitive field is inside a
`oneof`, then it is *not* always present.  A `oneof` may contain a single
member, so primitive-typed fields could be generated as something like:

```
message Foo {
  oneof field_1_oneof {
    int32 field_1 = 1;
  }
}
```

Note that in Emboss, changing a field from unconditionally present to
conditionally present is (usually) a backwards-compatible change.


### (Future) Emboss Union Construct

An Emboss union construct would be necessary to take advantage of runtime space
savings from using a Proto `oneof`.


### `struct` and `bits`

`struct` and `bits` map neatly to `message`, with few issues.


#### Anonymous `bits`

Anonymous `bits` get "flattened" so that their fields appear to be part of their
enclosing structure.  This should be handled reasonably well via treating
read-write virtual fields as members of the `message`, and by suppressing the
"private" fields, such as anonymous `bits`.


#### Proto Field IDs

Proto requires each field to have a unique tag ID.  We propose that, for fields
with a fixed start location, the start location + 1 be used as the default tag
ID: since a change to a field's start location would be a breaking change to the
Emboss definition, it should be reasonably stable.  For fields with a variable
start location, virtual fields, or where the programmer wants a specific tag,
the attribute `[(proto) id]` can be used to specify the ID.

The "+ 1" is required since `0` is not a valid Proto tag ID.

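The proposed defaulting rule could be sketched as follows.  `proto_tag_id` and
its parameters are hypothetical names for this example.  The range checks
reflect Proto's own field-number rules (valid numbers run from 1 to 2^29 - 1,
with 19000 through 19999 reserved), which a real generator would presumably
also have to respect.

```python
from typing import Optional

MAX_TAG = (1 << 29) - 1         # largest field number Proto accepts
RESERVED = range(19000, 20000)  # reserved for the Proto implementation


def proto_tag_id(fixed_start: Optional[int],
                 explicit_id: Optional[int] = None) -> int:
    """Choose a Proto tag ID for an Emboss field (hypothetical helper).

    `fixed_start` is the field's constant start location, or None for
    variable-offset and virtual fields.  `explicit_id` models the value of a
    `[(proto) id]` attribute, which always wins when present.
    """
    if explicit_id is not None:
        tag = explicit_id
    elif fixed_start is not None:
        tag = fixed_start + 1  # "+ 1" because 0 is not a valid tag ID
    else:
        raise ValueError("field needs an explicit [(proto) id]")
    if not 1 <= tag <= MAX_TAG or tag in RESERVED:
        raise ValueError(f"invalid Proto tag ID: {tag}")
    return tag
```

One consequence worth noting: a field that starts at offset 18999 would default
to the reserved tag 19000, so such a field would need an explicit ID under this
sketch.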

### `enum`

The Emboss `enum` construct does not map cleanly to the Proto `enum` construct,
with different issues in Proto2 vs Proto3.

Common to both, the names of Proto `enum` values are hoisted into the same
namespace as the `enum` itself (consistent with C's handling of `enum`),
which means that multiple `enum`s in the same context cannot hold the same value
name.  This can be handled -- somewhat awkwardly -- by wrapping the `enum` in a
"namespace" `message`, like:

```
message SomeEnum {
  enum SomeEnum {
    VALUE1 = 1;
    VALUE2 = 2;
  }
}
```

Additionally, Proto `enum` values must fit in an `int32`, whereas Emboss `enum`
values may require up to a `uint64`.

Proto2: In Proto2, `enum`s are closed: unknown values are ignored on message
parse, so `enum` fields can never have an unknown value at runtime.  Emboss
`enum`s, much like C `enum`s, can hold unknown values.

Proto3: In Proto3, `enum`s are open, like Emboss `enum`s, but every Proto3
`enum` must have a first entry whose value is `0`.  In order to avoid
compatibility issues, Emboss should emit a well-known name for the `0` value in
every case.  There is a second issue in Proto3: there is no "has" bit for enum
fields, so conditional enum fields have to be wrapped in a `message`.
(TODO(bolms): are Proto3 `enum`s signed, unsigned, or either?)

Thus, for Proto2, `enum`s would produce something like:

```
message SomeEnum {
  enum SomeEnum {
    VALUE1 = 1;
    VALUE2 = 2;
  }
  oneof value_oneof {
    SomeEnum value = 1;
    int64 integer_value = 2;
  }
}
```

which would be included in structures as:

```
message SomeStruct {
  optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum
}
```

For Proto3, the situation ends up similar:

```
message SomeEnum {
  enum SomeEnum {
    DEFAULT = 0;
    VALUE1 = 1;
    VALUE2 = 2;
  }
  SomeEnum value = 1;
}

message SomeStruct {
  optional SomeEnum some_enum = 1;  // NOT SomeEnum.SomeEnum
}
```

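For the Proto2 wrapper above, translation code would have to decide which
member of the `oneof` to populate.  The sketch below models the wrapper message
as a plain dictionary rather than a real Proto object; `enum_to_wrapper` is a
hypothetical name, and falling back to `integer_value` for named values that do
not fit in an `int32` is an assumption, not something this design specifies.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def enum_to_wrapper(raw_value: int, known_values: dict) -> dict:
    """Translate an Emboss enum value into the Proto2 wrapper shape.

    `known_values` maps numeric values to their Emboss names.  Known values
    that fit in an int32 use the `value` member; everything else falls back to
    `integer_value`, so unknown values survive the closed Proto2 enum.
    """
    name = known_values.get(raw_value)
    if name is not None and INT32_MIN <= raw_value <= INT32_MAX:
        return {"value": name}
    if not INT64_MIN <= raw_value <= INT64_MAX:
        # The wrapper's integer_value is an int64, so the upper half of the
        # uint64 range is not representable in this sketch.
        raise ValueError(f"value does not fit in int64: {raw_value}")
    return {"integer_value": raw_value}
```

With `known_values = {1: "VALUE1", 2: "VALUE2"}`, the raw value `1` maps to
`{"value": "VALUE1"}`, while the unknown value `3` maps to
`{"integer_value": 3}`.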

#### `enum` Name Restrictions

Proto enforces a (very slightly) stricter rule for the names of values within
an `enum` than Emboss does: they must not collide *even when translated to
CamelCase*.

For example, Emboss allows:

```
enum Foo:
  BAR_1_1 = 2
  BAR_11 = 11
```

When translated to CamelCase, `BAR_1_1` and `BAR_11` both become `Bar11`, and
thus are not allowed to be part of the same `enum` in Proto.

It may be sufficient to require `.emb` authors to update their `enum`s when
attempting to compile to Proto.

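A Proto back end could report such collisions up front instead of leaving them
to the Proto compiler.  The following is a simplified sketch: the function
names are made up for this example, and the CamelCase conversion here (split on
underscores, capitalize each part) may not match the Proto compiler's exact
algorithm in every corner case.

```python
from collections import defaultdict


def to_camel_case(name: str) -> str:
    """Convert SHOUTY_SNAKE_CASE to CamelCase, e.g. BAR_1_1 -> Bar11."""
    return "".join(part.capitalize() for part in name.split("_"))


def camel_case_collisions(value_names):
    """Group enum value names that become identical in CamelCase.

    Returns a dict from the shared CamelCase spelling to the list of original
    names, containing only spellings shared by two or more names.
    """
    groups = defaultdict(list)
    for name in value_names:
        groups[to_camel_case(name)].append(name)
    return {camel: names for camel, names in groups.items() if len(names) > 1}
```

For the `Foo` example above, `camel_case_collisions(["BAR_1_1", "BAR_11"])`
returns `{"Bar11": ["BAR_1_1", "BAR_11"]}`, which gives the `.emb` author the
exact names to change.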

### Bookkeeping Fields

Emboss structures often have "bookkeeping" fields that are either irrelevant to
typical Proto consumers, or impose unusual restrictions on them.

For example, fields which are used to calculate the offset of other fields are
generally not useful to Proto consumers:

```
struct Foo:
  0 [+4]  UInt  header_length (h)
  h [+4]  UInt  first_body_message
```

**These fields would still need to be set correctly when translating *from*
Proto to Emboss.**

Some of the pain could likely be mitigated via a [default
values](#default_values.md) feature, when implemented.

Field-length fields are somewhat trickier:

```
struct Foo:
  0 [+4]  UInt      message_length (m)
  4 [+m]  UInt:8[]  message_bytes
```

In Proto, `message_length` becomes an implicit part of `message_bytes`, since
`message_bytes` knows its own length.  For simple cases, as above, we
can likely have the Emboss compiler "just figure it out" and fold
`message_length` into `message_bytes`.  For more complex cases, we will
probably need to have explicit annotations (`[(proto) set_length_by: x =
some_expression]`), or just require applications using the Proto side to set
length fields correctly.

A similar problem happens with "message type" fields:

```
struct Foo:
  0 [+4]  MessageType  message_type (mt)
  if mt == MessageType.BAR:
    4 [+8]  Bar  bar
  if mt == MessageType.BAZ:
    4 [+16]  Baz  baz
  # ...
```

This will probably be easier to handle with a `union` construct in Emboss.
Again, "complex" cases will probably have to be handled by application code.

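For the simple field-length case earlier in this section (`message_length`
followed by `message_bytes`), "folding" the length into the byte array amounts
to deriving one from the other in each direction.  The sketch below operates on
raw bytes rather than on generated Emboss views, uses hypothetical function
names, and assumes a little-endian byte order for `message_length`, which the
`.emb` snippet does not state.

```python
import struct


def encode_foo(message_bytes: bytes) -> bytes:
    """Proto -> Emboss: derive message_length from the payload itself."""
    if len(message_bytes) > 0xFFFFFFFF:
        raise ValueError("payload too long for a 4-byte length field")
    # 0 [+4] message_length (assumed little-endian), then 4 [+m] message_bytes.
    return struct.pack("<I", len(message_bytes)) + message_bytes


def decode_foo(buffer: bytes) -> bytes:
    """Emboss -> Proto: return only message_bytes; the length is implicit."""
    (message_length,) = struct.unpack_from("<I", buffer, 0)
    if len(buffer) < 4 + message_length:
        raise ValueError("buffer is shorter than message_length claims")
    return buffer[4:4 + message_length]
```

The Proto side never sees `message_length` here: `decode_foo` drops it, and
`encode_foo` recomputes it, so the two cannot disagree.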

## Translation

### Between Emboss View and Proto In-Memory Format

Translation should be relatively straightforward; when going from Emboss to
Proto, the problem is roughly equivalent to serializing a View to text, and for
Proto to Emboss it is roughly equivalent to deserializing a View from text.

One minor difference is that the *deserialization* from Proto must occur in
dependency order, while serialization can happen in any order.  In Emboss text
format, *serialization* happens in dependency order, and deserialization happens
in whatever order is specified in the text.

As with deserialization from text, it is possible for the Proto message to
include untranslatable entries (e.g., an Emboss `Int:16` would be stored in a
Proto `int32`; a too-large value in the Proto `message` should be rejected).

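The rejection of untranslatable values comes down to a range check against the
narrower Emboss type.  A minimal sketch, with a hypothetical function name and
assuming two's-complement `Int` and plain binary `UInt`:

```python
def check_fits(value: int, emboss_type: str, bit_width: int) -> int:
    """Return `value` if it fits the Emboss integer type, else raise.

    Hypothetical helper for the Proto -> Emboss direction, where the Proto
    field may be wider than the Emboss field it is copied into.
    """
    if emboss_type == "Int":
        low = -(1 << (bit_width - 1))
        high = (1 << (bit_width - 1)) - 1
    elif emboss_type == "UInt":
        low, high = 0, (1 << bit_width) - 1
    else:
        raise ValueError(f"not an Emboss integer type: {emboss_type}")
    if not low <= value <= high:
        raise ValueError(
            f"{value} does not fit in {emboss_type}:{bit_width} "
            f"(valid range {low}..{high})")
    return value
```

For an `Int:16`, `32767` passes while `32768` is rejected, even though both are
valid `int32` values on the Proto side.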

### Between Emboss View and Proto Wire Format

Since the Proto wire format is extremely stable and well documented, it would be
possible for Emboss to emit code to directly translate between Emboss structs
and Proto wire format.

*Serialization* is relatively straightforward; except for arrays, the code
structure is almost identical to the text serialization code structure.

*Deserialization* is problematic.  First and foremost, Proto does not specify an
order in which the fields of a structure will be serialized, so it is entirely
possible for the Emboss view to see a dependent field before its prerequisite
(e.g., have a variable-offset field before the offset specifier field).
Secondly, Proto repeated fields aren't really "arrays"; on the wire, other
fields can appear *in between* elements of repeated fields.  For Emboss, this
means that every array in the structure would have to maintain a cursor during
deserialization.

It *may* still be desirable to support serialization without trying to support
deserialization, or to support deserialization for a subset of structures, so
that we can send protos to/from microcontrollers: this would be an alternative
to Nanopb for some cases.

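To make the interleaving problem concrete, here is a deliberately small decoder
sketch.  It handles only varint-encoded fields (wire type 0) and non-packed
repeated fields, and the function names are hypothetical.  Each repeated field
accumulates into its own list, whose length plays the role of the per-array
cursor described above.

```python
def read_varint(data: bytes, pos: int):
    """Decode one base-128 varint at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def decode_varint_fields(data: bytes, repeated_tags):
    """Split a message into scalar fields and repeated-field element lists."""
    scalars = {}
    arrays = {tag: [] for tag in repeated_tags}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type != 0:
            raise ValueError(f"unsupported wire type {wire_type}")
        value, pos = read_varint(data, pos)
        if tag in arrays:
            # Elements may arrive with other fields in between; the list
            # length is the cursor for this particular array.
            arrays[tag].append(value)
        else:
            scalars[tag] = value  # for repeated scalars, the last one wins
    return scalars, arrays
```

Given the bytes `08 0A 10 07 08 14` and tag 1 marked as repeated, the two
elements of field 1 are collected in order even though field 2 sits between
them on the wire.  A full implementation would also have to deal with packed
repeated fields, the other wire types, and fields whose Emboss location depends
on a value that has not been seen yet.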

### Between Emboss View and [Nanopb](https://github.com/nanopb/nanopb)

In order to translate between Emboss views and Protos on microcontrollers and
other limited-memory devices, it may make sense to generate Emboss <=> Nanopb
code.  On top of the standard Proto generator, we would have to implement a
Nanopb options file generator, and translation code.


## Miscellaneous Notes

### Overlays

Emboss was designed with the notion that some backends would need their own
attributes -- for example, the `[(cpp) namespace]` attribute, and here there
are a number of `[(proto)]` attributes.

However, adding back-end-specific attributes still requires changes to be made
directly to the `.emb` file, which may be inconvenient for `.emb`s from third
parties.

Ideally, one could write an "overlay file," like:

```
message Foo
  [(proto) attr = value]

  field
    [(proto) field_attr = value]
```

This is not needed for a first pass at a Proto back end, but should be
considered.


### Generating an `.emb` From a `.proto`

There are cases where it would be useful to generate a microcontroller-friendly
representation of an existing Proto, rather than the other way around.

For most `message`s, it would be relatively straightforward to generate a
`struct`, like:

```
message Foo {
  optional int32 bar = 1;
  optional bool baz = 2;
  optional string qux = 3;
}
```

to:

```
struct Foo:
  0          [+4]             bits:
    0 [+1]    Flag  has_bar
    1 [+1]    Flag  has_baz
    if has_baz:
      2 [+1]  Flag  baz
    3 [+1]    Flag  has_qux

  if has_bar:
    4          [+4]           Int:32    bar

  if has_qux:
    8          [+4]           UInt:32   qux_offset
    12         [+4]           UInt:32   qux_length
    qux_offset [+qux_length]  UInt:8[]  qux
```

The main issue is that it would be difficult to maintain equivalent
backwards-compatibility guarantees to the ones that Proto provides as messages
evolve.

Also note that this format is fairly close to the [Cap'n
Proto](https://capnproto.org/) format.