1
2<!---
3This document contains embedded graphviz diagrams inside ```dot blocks.
4
5To convert it to rendered form using render.py:
6  $ ./render.py wrapping-upb.in.md
7
8You can also live-preview this document with all diagrams using Markdown Preview Enhanced
9in Visual Studio Code:
10  https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
11--->
12
13# Building a protobuf library on upb
14
15This is a guide for creating a new protobuf implementation based on upb.  It
16starts from the beginning and walks you through the process, highlighting
17some important design choices you will need to make.
18
19## Overview
20
21A protobuf implementation consists of two main pieces:
22
231. a code generator, run at compile time, to turn `.proto` files into source
24   files in your language (we will call this "zlang", assuming an extension of ".z").
252. a runtime component, which implements the wire format and provides the data
26   structures for representing protobuf data and metadata.
27
28<br/>
29
30```dot {align="center"}
31digraph {
32    rankdir=LR;
33    newrank=true;
34    node [style="rounded,filled" shape=box]
35    "foo.proto" -> protoc;
36    "foo.proto" [shape=folder];
37    protoc [fillcolor=lightgrey];
38    protoc -> "protoc-gen-zlang";
39    "protoc-gen-zlang" -> "foo.z";
40    "protoc-gen-zlang" [fillcolor=palegreen3];
41    "foo.z" [shape=folder];
42    labelloc="b";
43    label="Compile Time";
44}
45```
46
47<br/>
48
49```dot {align="center"}
50digraph {
51    newrank=true;
52    node [style="rounded,filled" shape=box fillcolor=lightgrey]
53    "foo.z" -> "zlang/upb glue (FFI)";
54    "zlang/upb glue (FFI)" -> "upb (C)";
55    "zlang/upb glue (FFI)" [fillcolor=palegreen3];
56    labelloc="b";
57    label="Runtime";
58}
59```
60
61The parts in green are what you will need to implement.
62
63Note that your code generator (`protoc-gen-zlang`) does *not* need to generate
64any C code (eg. `foo.c`). While upb itself is written in C, upb's parsers and
65serializers are fully table-driven, which means there is never any need or even
66benefit to generating C code for each proto. upb is capable of full-speed
67parsing even when schema data is loaded at runtime from strings embedded into
68`foo.z`. This is a key benefit of upb compared with C++ protos, which have
69traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
70parsing speed, and suffered a ~10x speed penalty in the parser when the schema
71data was loaded at runtime.
72
73## Prerequisites
74
75There are a few things that the language runtime must provide in order to wrap
76upb.
77
781.  **FFI**: To wrap upb, your language must be able to call into a C API
79    through a Foreign Function Interface (FFI). Most languages support FFI in
80    some form, either through "native extensions" (in which you write some C
81    code to implement new methods in your language) or through a direct FFI (in
82    which you can call into regular C functions directly from your language
83    using a special library).
842.  **Finalizers, Destructors, or Cleaners**: The runtime must provide
85    finalizers or destructors of some sort. There must be a way of triggering a
86    call to a C function when the language garbage collects or otherwise
87    destroys an object. We don't care much whether it is a finalizer, a
88    destructor, or a cleaner, as long as it gets called eventually when the
89    object is destroyed. upb allocates memory in C space, and a finalizer is our
90    only way of making sure that memory is freed and does not leak.
913.  **HashMap with weak values**: (optional) This is not a strong requirement,
92    but it is sometimes helpful to have a global hashmap with weak values to act
93    as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
94    the keys). There is some question about whether we want to continue to use
95    this pattern going forward.
96
97## Reflection vs. MiniTables
98
99The first key design decision you will need to make is whether your generated
100code will access message data via reflection or minitables. Generally more
101dynamic languages will want to use reflection and more static languages will
102want to use minitables.
103
104### Reflection
105
106Reflection-based data access makes the most sense in highly dynamic language
107interpreters, where method dispatch is generally resolved via strings and hash
108table lookups.
109
110In such languages, you can often implement a special method like `__getattr__`
111(Python) or `method_missing` (Ruby) that receives the method name as a string.
112Using upb's reflection, you can look up a field name using the method name,
113thereby using a hash table belonging to upb instead of one provided by the
114language.
115
116```python
117class FooMessage:
118  # Written in Python for illustration, but in practice we will want to
119  # implement this in C for speed.
120  def __getattr__(self, name):
121    field = FooMessage.descriptor.fields_by_name[name]
122    return field.get_value(self)
123```
124
125Using this design, we only need to attach a single `__getattr__` method to each
126message class, instead of defining a getter/setter for each field. In this way
127we can avoid duplicating hash tables between upb and the language interpreter,
128reducing memory usage.
129
130Reflection-based access requires loading full reflection at runtime. Your
131generated code will need to embed serialized descriptors (ie. a serialized
132message of `descriptor.proto`), which has some amount of size overhead and
133exposes all message/field names to the binary. It also forces a hash table
134lookup in the critical path of field access. If method calls in your language
135already have this overhead, then this is no added burden, but for statically
136dispatched languages it would cause extra overhead.
137
138If we take this path to its logical conclusion, all class creation can be
139performed fully dynamically, using only a binary descriptor as input. The
140"generated code" becomes little more than an embedded descriptor plus a
141library call to load it. Python has recently gone down this path. Generated
142code now looks something like this:
143
144```python
145# main_pb2.py
146from google3.net.proto2.python.internal import builder as _builder
147from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
148
149DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
150_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
151_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
152```
153
154This is all the runtime needs to create all of the classes for messages defined
155in that serialized descriptor.  This code has no pretense of readability, but
156a separate `.pyi` stub file provides a fully expanded and readable list of all
157methods a user can expect to be available:
158
159```python
160# main_pb2.pyi
161from google3.net.proto2.python.public import descriptor as _descriptor
162from google3.net.proto2.python.public import message as _message
163from typing import ClassVar as _ClassVar, Optional as _Optional
164
165DESCRIPTOR: _descriptor.FileDescriptor
166
167class MyMessage(_message.Message):
168    __slots__ = ["my_field"]
169    MY_FIELD_FIELD_NUMBER: _ClassVar[int]
170    my_field: str
171    def __init__(self, my_field: _Optional[str] = ...) -> None: ...
172```
173
174To use reflection-based access:
175
1761. Load and access descriptor data using the interfaces in upb/def.h.
1772. Access message data using the interfaces in upb/reflection.h.
178
179### MiniTables
180
181MiniTables are a "lite" schema representation that are much smaller than
182reflection. MiniTables omit names, options, and almost everything else from the
183`.proto` file, retaining only enough information to parse and serialize binary
184format.
185
186MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are
187a byte-oriented format that can be embedded into your generated code and passed
188to upb to construct MiniTables. MiniDescriptors only use printable characters,
189and therefore do not require escaping when embedding them into generated code
190strings. Overall the size savings of MiniDescriptors are ~60x compared with
191regular descriptors.
192
193MiniTables and MiniDescriptors are a natural choice for compiled languages that
194resolve method calls at compile time. For languages that are sometimes compiled
195and sometimes interpreted, there might not be an obvious choice. When a method
196call is statically bound, we want to remove as much overhead as possible,
197especially from accessors. In the extreme case, we can use unsafe APIs to read
198raw memory at a known offset:
199
200```java
201// Example of a maximally-optimized generated accessor.
202class FooMessage {
203    public long getBarField() {
204        // Using Unsafe should give us performance that is comparable to a
205        // native member access.
206        //
207        // The constant "24" is obtained from upb at compile time.
208        sun.misc.Unsafe.getLong(this.ptr, 24);
209    }
210}
211```
212
213This design is very low-level, and tightly couples the generated code to one
214specific version of the schema and compiler.  A slower but safer version would
215look up a field by field number:
216
217```java
218// Example of a more loosely-coupled accessor.
219class FooMessage {
220    public long getBarField() {
221        // The constant "2" is the field number.  Internally this will look
222        // up the number "2" in the MiniTable and use that to read the value
223        // from the message.
224        upb.glue.getLong(this.ptr, 2);
225    }
226}
227```
228
229One downside of MiniTables is that they cannot support parsing or serializing
230to JSON or TextFormat, because they do not know the field names.  It should be
231possible to generate reflection data "on the side", into separate generated
232code files, so that reflection is only pulled in if it is being used.  However
233APIs to do this do not exist yet.
234
235To use MiniTable-based access:
236
2371. Load and access MiniDescriptors data using the interfaces in upb/mini_table.h.
2382. Access message data using the interfaces in upb/msg_accessors.h.
239
240## Memory Management
241
242One of the core design challenges when wrapping upb is memory management.  Every
243language runtime will have some memory management system, whether it is
244garbage collection, reference counting, manual memory management, or some hybrid
245of these.  upb is written in C and uses arenas for memory management, but upb is
246designed to integrate with a wide variety of memory management schemes, and it
247provides a number of tools for making this integration as smooth as possible.
248
249### Arenas
250
251upb defines data structures in C to represent messages, arrays (repeated
252fields), and maps.  A protobuf message is a hierarchical tree of these objects.
253For example, a relatively simple protobuf tree might look something like this:
254
255```dot {align="center"}
256digraph G {
257  rankdir=LR;
258  newrank=true;
259  node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
260  upb_msg -> upb_msg2;
261  upb_msg -> upb_array;
262  upb_msg [label="upb Message" fillcolor=1]
263  upb_msg2 [label="upb Message"];
264  upb_array [label="upb Array"]
265}
266```
267
268All upb objects are allocated from an arena.  An arena lets you allocate objects
269individually, but you cannot free individual objects; you can only free the arena
270as a whole.  When the arena is freed, all of the individual objects allocated
271from that arena are freed together.
272
273```dot {align="center"}
274digraph G {
275  rankdir=LR;
276  newrank=true;
277  subgraph cluster_0 {
278    label = "upb Arena"
279    graph[style="rounded,filled" fillcolor=gray]
280    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
281    upb_msg -> upb_array;
282    upb_msg -> upb_msg2;
283    upb_msg [label="upb Message" fillcolor=1]
284    upb_msg2 [label="upb Message"];
285    upb_array [label="upb Array"];
286  }
287}
288```
289
290In simple cases, the entire tree of objects will all live in a single arena.
291This has the nice property that there cannot be any dangling pointers between
292objects, since all objects are freed at the same time.
293
294However upb allows you to create links between any two objects, whether or
295not they are in the same arena.  The library does not know or care what arenas
296the objects are in when you create links between them.
297
298```dot {align="center"}
299digraph G {
300  rankdir=LR;
301  newrank=true;
302  subgraph cluster_0 {
303    label = "upb Arena 1"
304    graph[style="rounded,filled" fillcolor=gray]
305    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
306    upb_msg -> upb_array;
307    upb_msg -> upb_msg2;
308    upb_msg [label="upb Message 1" fillcolor=1]
309    upb_msg2 [label="upb Message 2"];
310    upb_array [label="upb Array"];
311  }
312  subgraph cluster_1 {
313    label = "upb Arena 2"
314    graph[style="rounded,filled" fillcolor=gray]
315    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1]
316    upb_msg3;
317  }
318  upb_msg2 -> upb_msg3;
319  upb_msg3 [label="upb Message 3"];
320}
321```
322
323When objects are on separate arenas, it is the user's responsibility to ensure
324that there are no dangling pointers.  In the example above, this means Arena 2
325must outlive Message 1 and Message 2.
326
327### Integrating GC with upb
328
329In languages with automatic memory management, the goal is to handle all of the
330arenas behind the scenes, so that the user does not have to manage them manually
331or even know that they exist.
332
333We can achieve this goal if we set up the object graph in a particular way.  The
334general strategy is to create wrapper objects around all of the C objects,
335including the arena.  Our key goal is to make sure the arena wrapper is not
336GC'd until all of the C objects in that arena have become unreachable.
337
338For this example, we will assume we are wrapping upb in Python:
339
340```dot {align="center"}
341digraph G {
342  rankdir=LR;
343  newrank=true;
344  compound=true;
345
346  subgraph cluster_1 {
347    label = "upb Arena"
348    graph[style="rounded,filled" fillcolor=gray]
349    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
350    upb_msg -> upb_array [style=dashed];
351    upb_msg -> upb_msg2 [style=dashed];
352    upb_msg [label="upb Message" fillcolor=1]
353    upb_msg2 [label="upb Message"];
354    upb_array [label="upb Array"]
355    dummy [style=invis]
356  }
357  subgraph cluster_python {
358    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2]
359    peripheries=0
360    py_upb_msg [label="Python Message"];
361    py_upb_msg2 [label="Python Message"];
362    py_upb_arena [label="Python Arena"];
363  }
364  py_upb_msg -> upb_msg [style=dashed];
365  py_upb_msg2->upb_msg2 [style=dashed];
366  py_upb_msg2 -> py_upb_arena [color=springgreen4];
367  py_upb_msg -> py_upb_arena [color=springgreen4];
368  py_upb_arena -> dummy [lhead=cluster_1, color=red];
369  {
370     rank=same;
371     upb_msg;
372     py_upb_msg;
373  }
374  {
375     rank=same;
376     upb_array;
377     upb_msg2;
378     py_upb_msg2;
379  }
380  {  rank=same;
381     dummy;
382     py_upb_arena;
383  }
384  dummy->upb_array [style=invis];
385  dummy->upb_msg2 [style=invis];
386
387  subgraph cluster_01 {
388    node [shape=plaintext]
389    peripheries=0
390    key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
391      <tr><td align="right" port="i1">raw ptr</td></tr>
392      <tr><td align="right" port="i2">unique ptr</td></tr>
393      <tr><td align="right" port="i3">shared (GC) ptr</td></tr>
394      </table>>]
395    key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
396      <tr><td port="i1">&nbsp;</td></tr>
397      <tr><td port="i2">&nbsp;</td></tr>
398      <tr><td port="i3">&nbsp;</td></tr>
399      </table>>]
400    key:i1:e -> key2:i1:w [style=dashed]
401    key:i2:e -> key2:i2:w [color=red]
402    key:i3:e -> key2:i3:w [color=springgreen4]
403  }
404    key2:i1:w -> upb_msg [style=invis];
405  {
406    rank=same;
407    key;
408    upb_msg;
409  }
410}
411```
412
413In this example we have three different kinds of pointers:
414
415* **raw ptr**: This is a pointer that carries no ownership.
416* **unique ptr**: This is a pointer has *unique ownership* of the target.  The owner
417  will free the target in its destructor (or finalizer, or cleaner).  There can
418  only be a single unique pointer to a given object.
419* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the
420  target.  Many objects can point to the target, and the target will be deleted
421  only when all such references are gone.  In a runtime with automatic memory
422  management (GC), this is a reference that participates in GC.  In Python such
423  references use reference counting, but in other VMs they may use mark and
424  sweep or some other form of GC instead.
425
426The Python Message wrappers have only raw pointers to the underlying message,
427but they contain a shared pointer to the arena that will ensure that the raw
428pointer remains valid.  Only when all message wrapper objects are destroyed
429will the Python Arena become unreachable, and the upb arena ultimately freed.
430
431### Links between arenas with "Fuse"
432
433The design given above works well for objects that live in a single arena. But
434what if a user wants to create a link between two objects in different arenas?
435
436TODO
437
438## UTF-8 vs. UTF-16
439
440TODO
441
442## Object Cache
443
444TODO
445